How much should we trust 2020 player statistics?

Small sample size fun, or meaningful trends?

Wild Card Round - Houston Astros v Minnesota Twins - Game Two Photo by Brace Hemmelgarn/Minnesota Twins/Getty Images

In 2020, everything has been out of whack: wacky sports seasons, altered day-to-day life, Cleveland refusing to pay its good players (oh wait, that one’s normal). But much like our news sources these days, it’s hard to know how much we should trust the player statistics recorded in the 2020 season. Sixty games, or 37% of a normal 162-game season, is hardly a large enough sample to draw meaningful conclusions from (hear that, White Sox fans?).

Further complicating the question are the 2019 offensive breakouts from Max Kepler and Jorge Polanco. Both had reasonably expected breakout seasons in 2019, followed by steeper-than-expected regression in 2020. In these two cases in particular, though, other factors may have contributed to the drop-offs. Polanco underwent ankle surgery immediately after the season, and it was revealed that he had played hurt for most of it, which may help explain the cratering of his power numbers. Kepler spoke out about the mental health challenges of playing under the unique off-field circumstances of 2020, which may have contributed to his struggles.

Case 1: Jorge Polanco

Using Polanco as an example, let’s examine how (un)stable a 55-game sample (the number of games he played in 2020) really is.

Polanco 55-Game Samples

Year     Games     Batting AVG  OPS
2020     55        0.258        0.658
2019     Last 55   0.272        0.778
2019     First 55  0.338        0.989
2018     First 55  0.273        0.743
2017     Last 55   0.314        0.927
2017     First 55  0.250        0.662
2016     First 55  0.289        0.759
Average  55        0.285        0.788

As this table illustrates, there’s a lot of variance in a sample this small. Polanco almost certainly isn’t a .338 hitter, as he was through the first 55 games of 2019, and he almost certainly isn’t a .258 hitter, as he was this season. His 2020 was the worst (non-overlapping) 55-game sample I examined by OPS, and the second-worst by batting average. Coming off a career year, some regression was expected. However, for a player just entering his prime, regressing all the way back to the level of his age-23 season would be very unusual. I think it’s reasonable to pin 2020 on a bad small sample, aided and abetted by a balky ankle.
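For anyone who wants to check the math, here is a minimal Python sketch that recomputes the table’s average row and ranks the 2020 sample against the others. The numbers are copied straight from the table; the names `polanco_samples` and `rank_from_bottom` are made up for illustration and don’t come from any stats package.

```python
from statistics import mean

# (label, batting average, OPS) for each 55-game sample, copied from the table.
polanco_samples = [
    ("2020", 0.258, 0.658),
    ("2019 Last 55", 0.272, 0.778),
    ("2019 First 55", 0.338, 0.989),
    ("2018 First 55", 0.273, 0.743),
    ("2017 Last 55", 0.314, 0.927),
    ("2017 First 55", 0.250, 0.662),
    ("2016 First 55", 0.289, 0.759),
]


def rank_from_bottom(samples, label, column):
    """Return 1 if `label` has the lowest value in `column`, 2 if second-lowest, etc."""
    ordered = sorted(samples, key=lambda row: row[column])
    return [row[0] for row in ordered].index(label) + 1


avg_values = [row[1] for row in polanco_samples]
ops_values = [row[2] for row in polanco_samples]

# Simple unweighted means across the seven samples (2020 included).
mean_avg = round(mean(avg_values), 3)
mean_ops = round(mean(ops_values), 3)

# Gap between the best and worst 55-game OPS.
ops_spread = round(max(ops_values) - min(ops_values), 3)

rank_2020_avg = rank_from_bottom(polanco_samples, "2020", 1)
rank_2020_ops = rank_from_bottom(polanco_samples, "2020", 2)

print(f"mean AVG {mean_avg:.3f}, mean OPS {mean_ops:.3f}")
print(f"OPS spread between best and worst sample: {ops_spread:.3f}")
print(f"2020 rank from the bottom: AVG #{rank_2020_avg}, OPS #{rank_2020_ops}")
```

Run as-is, it reproduces the .285/.788 average row, shows a .331 OPS gap between his best and worst samples, and confirms that 2020 ranks last by OPS and second-to-last by batting average.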

Official sources stated that Jorge Polanco’s 2020 is false and misleading

Case 2: Max Kepler

Max Kepler has never really been a high-average hitter, but what he lacks in average, he makes up for in extra-base hits. His 48-game sample in 2020 signaled some stark regression from his 2019 numbers, but not necessarily from his career norms.

Kepler 48-Game Samples

Year     Games     Batting AVG  OPS
2020     48        0.228        0.760
2019     Last 48   0.236        0.855
2019     First 48  0.280        0.894
2018     Last 48   0.185        0.641
2018     First 48  0.254        0.814
2017     Last 48   0.220        0.730
2017     First 48  0.260        0.779
2016     Last 48   0.203        0.539
2016     First 48  0.229        0.770
Average  48        0.233        0.754

As we can see, Kepler is nearly as volatile in 48-game samples as Polanco is in 55-game samples. However, his 2020 really isn’t out of the ordinary, falling near the median of his samples rather than at the bottom end (as Polanco’s did). On the other hand, Kepler’s batting numbers show an upward (albeit weak) trend over these samples, which is pretty typical of hitters as they progress toward their primes, and that trend did not continue in 2020. In 2021, I think we should expect Kepler to wind up closer to his 2019 numbers than his 2020 numbers, but his 2020 is more telling than Polanco’s.
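Here is a rough Python sketch of the median and trend claims, using the OPS column from the table. One loud assumption: the trend line treats the nine samples as evenly spaced in time (positions 0 through 8), which isn’t literally true, so take the slope as a back-of-the-envelope number. The helper `ols_slope` is a hand-rolled least-squares slope written for this example, not a library function.

```python
from statistics import median

# (label, OPS) in chronological order, copied from the table.
kepler_ops = [
    ("2016 First 48", 0.770),
    ("2016 Last 48", 0.539),
    ("2017 First 48", 0.779),
    ("2017 Last 48", 0.730),
    ("2018 First 48", 0.814),
    ("2018 Last 48", 0.641),
    ("2019 First 48", 0.894),
    ("2019 Last 48", 0.855),
    ("2020", 0.760),
]


def ols_slope(values):
    """Least-squares slope of `values` against their positions 0, 1, 2, ...

    Assumes the samples are evenly spaced, which is a simplification.
    """
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    return numerator / denominator


ops = [value for _, value in kepler_ops]

median_ops = median(ops)
rank_2020 = sorted(ops).index(0.760) + 1  # 1 = lowest OPS of the nine samples

# Trend in OPS per sample, with and without the 2020 sample on the end.
slope_through_2019 = ols_slope(ops[:-1])
slope_with_2020 = ols_slope(ops)

print(f"median OPS: {median_ops:.3f}")
print(f"2020 rank from the bottom: #{rank_2020} of {len(ops)}")
print(f"OPS trend per sample through 2019: {slope_through_2019:+.3f}")
print(f"OPS trend per sample including 2020: {slope_with_2020:+.3f}")
```

Under that even-spacing assumption, the median OPS is .770, the 2020 sample ranks fourth from the bottom of nine, and the trend is positive either way but flattens once 2020 is tacked on (roughly .024 per sample through 2019 versus .017 with 2020 included).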

Official sources stated that Max Kepler’s 2020 is not really false and misleading

Case 3: Byron Buxton

LOL, just kidding. There are no conclusions to be drawn from any sample size in Buxton’s career.

Statistically speaking, Nelson Cruz had the most telling (and simultaneously useless) season of any Twin in 2020. Cruz had another great season, his 13th (out of 16) with an OPS over .800. Seasons like his, which affirm what we have already concluded (that Cruz is a great hitter), are more trustworthy than seasons that buck trends or are outliers (Kepler and Polanco, respectively). However, we already knew that Cruz was a great hitter, and we’re still waiting to find out what Polanco and Kepler will be in their final forms, so even though his stats were trustworthy, they’re just as useless as anyone else’s were in 2020. There are no conclusions to be drawn from 2020. Had Kepler or Polanco produced closer to their 2019 levels, we would all be pointing to their seasons and saying, “See! They really are that good!” Their stats would have been just as meaningless had they been good, though.

The Twins also have a pitcher who clearly illustrates the small sample size problem. Taylor Rogers disappointed Twins fans in 2020, posting the worst ERA of his career (4.05). He pitched only 20.0 innings, about a third of his normal workload. Had he given up one earned run over his next 11 innings (pretty reasonable for a guy who finished 2018 with 26 consecutive scoreless innings), his ERA would have come in at 2.90, and we would all feel very differently about his season.
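The Rogers what-if is easy to check with the standard ERA formula (nine times earned runs, divided by innings pitched). The sketch below assumes he allowed 9 earned runs in his 20.0 innings. That count isn’t stated in this article, but it is the only whole number that makes the 2.90 figure work, and it puts his 20.0-inning ERA at about 4.05. The function name `era` is just for illustration.

```python
def era(earned_runs, innings):
    """Earned run average: earned runs allowed per nine innings.

    `innings` is a plain decimal here (20.0 means exactly 20 innings),
    not baseball's .1/.2 notation for thirds of an inning.
    """
    return 9 * earned_runs / innings


# Assumption: 9 earned runs in 20.0 innings, back-solved from the 2.90 figure.
earned_runs_2020 = 9
innings_2020 = 20.0

actual_era = era(earned_runs_2020, innings_2020)

# The what-if: one more earned run over his next 11 innings.
hypothetical_era = era(earned_runs_2020 + 1, innings_2020 + 11)

print(f"ERA over {innings_2020:.1f} innings: {actual_era:.2f}")
print(f"ERA with 1 more earned run over 11 more innings: {hypothetical_era:.2f}")
```

Eleven decent innings swing the number by more than a full run, which is the whole problem with judging a reliever on 20 innings.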

As far as player statistics go, there just isn’t much to be gleaned from 2020. The extenuating circumstances, paired with the small sample sizes, render 2020’s stats untrustworthy. Luckily for the Twins, they can mostly expect positive regression in 2021, while the White Sox (er, some other teams) will likely experience the opposite.