Is it really too early in the season to make judgements about how the Twins are doing? Nah. Is it too early to totally give up on players? Yeah, it's still a little early to get really upset or pleased with players. But, it's as good a time as any to see how well the Twins are doing compared to our expectations for them.
What follows, if you are brave enough to read further, is a fairly extensive analysis of how well the Twins have done this season compared to their PECOTA projections. I got a little carried away, but I think it's pretty good stuff.
In addition to being a fun diversion into looking at predictions, it provides a good example of why R squared values are not always a good way to analyze a model.
If you're not much interested in the stats details and are looking mostly for what the baseball conclusions are, I suggest skipping ahead to the three sections with graphs, and then skipping to the conclusions. In order to try to get you to read further, I'll leave you with a graph that shows the Twins' actual hits compared with their predicted hits.
(Yeah, I'm intentionally avoiding talking about this weekend's series with the White Sox. Can we all just agree that didn't happen?)
At some point during spring training, I put up my offensive predictions for the Twins. Rate-wise these were simply the Twins' weighted mean PECOTA projections for the season. Needless to say, I don't think my playing time projections are very accurate right now, as a big part of those projections had to do with expectations of injuries as the season goes on. So since the playing time projections weren't even meant to be accurate at this point during the season, I'll ignore them for the moment and just look at how accurate the rate projections have been given each player's actual playing time thus far.
How should we judge the predictions?
I'm going to take a very stats-oriented approach for evaluating these predictions. The three stats examined will be batting average (AVG), on-base percentage (OBP), and slugging average (SLG). The first two statistics are strictly binomial statistics. In a nutshell, this means that there are two outcomes for each trial in our experiment. In each at-bat or plate appearance, there are two possible outcomes--a success or a failure. This makes it very easy to model these statistics with weighted coin flips. Slugging average isn't a binomial statistic, but for the purposes of this exercise, we'll treat it like one, as that will get us close to the answer we want without doing a whole lot of extra work.
The method: an example
The best way to illustrate my method is probably with an example. Let's take Lew Ford, as it doesn't seem like many people are talking about him these days. PECOTA had Lew pegged for a 0.278 batting average this season. Our null hypothesis will then be that Ford is a 0.278 hitter. If we let Lew take one million at-bats, we'd expect him to get about 278,000 hits. In the same way, if we have a coin that lands heads 27.8% of the time, we would expect it to get heads 278,000 times if we flipped it one million times. Of course, from experience with coins, we know that if we flip it a limited number of times, sometimes it will land heads more than expected and sometimes it will land heads less than expected.
So far this season (through April 22nd) Lew has had 31 at-bats and hit safely 8 times. If we flipped our 0.278 coin 31 times, we would expect 8.63 heads on average. Lew's performance has been as close to predicted as we can practically get. But how do we quantify how close he is to the predicted number of hits? We'll use a combination of 1) the simple difference between actual and predicted hits and 2) the standard deviation of the coin's statistical distribution.
The standard deviation, in a very non-technical sense, is a way of quantifying the reliability of a prediction. If we predict something will happen 5 times, on average, and it has a standard deviation of 1, then we expect that thing will happen between 4 (=5-1) and 6 (=5+1) times about two-thirds of the time (thanks to the central limit theorem, which makes the distribution roughly bell-shaped once there are enough trials). If it has a standard deviation of 3, then we expect that thing will happen between 2 and 8 times about two-thirds of the time. So, our prediction with a lower standard deviation will be closer to the observed behavior more often than our prediction with a higher standard deviation.
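For anyone who wants to check the "two-thirds" rule of thumb, here is a minimal simulation sketch. It uses the Lew Ford numbers from this post (a 0.278 coin flipped 31 times) and the standard binomial formula for the standard deviation, sqrt(n*p*(1-p)). The seed and the number of simulated trials are arbitrary choices made for illustration.

```python
import math
import random

def simulate_hits(p, at_bats, rng):
    """Flip a weighted coin once per at-bat and count the heads (hits)."""
    return sum(1 for _ in range(at_bats) if rng.random() < p)

p = 0.278       # projected batting average quoted in the post
at_bats = 31    # at-bats quoted in the post
mean = at_bats * p
sd = math.sqrt(at_bats * p * (1 - p))   # standard binomial formula

rng = random.Random(42)   # arbitrary seed, only for reproducibility
trials = 20000            # arbitrary number of simulated 31-at-bat stretches
within = 0
for _ in range(trials):
    hits = simulate_hits(p, at_bats, rng)
    if abs(hits - mean) <= sd:
        within += 1

fraction_within = within / trials
print(f"mean={mean:.2f}  sd={sd:.2f}  within one sd: {fraction_within:.1%}")
```

Because hits come in whole numbers, the fraction will not be exactly two-thirds, but it should land in that neighborhood.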
If you're interested in how you calculate the standard deviation for a binomial distribution, you can find an explanation in one of the above links, so I won't go into it here; it will suffice to understand what the standard deviation tells us. For our Lew Ford example, the standard deviation of the expected 8.63 heads is 2.5. We'll take the difference between his actual and predicted hits (8 - 8.63) and divide it by that standard deviation (2.5) to give us what I'm calling the z-score (here it turns out to be -0.252). (Note: I'm not sure if the z-score terminology is standard or not, but I think that's what it's generally called.) Also, note that if the standard deviation had been very small, say 0.01, then Ford's z-score would be much larger in magnitude, indicating that his actual performance wasn't very close to expected because we expected him to be very close (within 0.01) most of the time.
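The Lew Ford calculation can be sketched in a few lines. This assumes the usual binomial standard deviation, sqrt(n*p*(1-p)); the function names are made up for illustration. Using the rounded 0.278 projection, the result comes out at roughly -0.25, a hair different from the figure above, which was presumably computed from an unrounded projection.

```python
import math

def binomial_sd(trials, p):
    """Standard deviation of the number of successes in `trials` weighted coin flips."""
    return math.sqrt(trials * p * (1 - p))

def z_score(actual, trials, p):
    """(actual - expected) successes, measured in units of the standard deviation."""
    expected = trials * p
    return (actual - expected) / binomial_sd(trials, p)

# Lew Ford through April 22nd: 8 hits in 31 at-bats against a 0.278 projection.
ford_z = z_score(actual=8, trials=31, p=0.278)
print(f"expected hits = {31 * 0.278:.2f}")
print(f"standard deviation = {binomial_sd(31, 0.278):.2f}")
print(f"z-score = {ford_z:.3f}")
```

A negative z-score means fewer hits than projected; a value this close to zero means the difference is small compared to the noise we'd expect from 31 coin flips.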
The z-score will be the key in evaluating how good the predictions are overall.
Back to looking at the whole team
Once we have a z-score for each player on the team, we can calculate the standard deviation of the z-scores and call that the goodness-of-fit. (Again, I'm not sure if this is standard terminology, but in this case it is descriptive.) If the goodness-of-fit is equal to 1, then the deviations of actual performance from the predictions are consistent with the statistical uncertainties. If the goodness-of-fit is less than 1, then the deviations are smaller than the statistical uncertainties would predict. If the goodness-of-fit is greater than 1, then the deviations are greater than the statistical uncertainties would predict. (Note: a value of the goodness-of-fit well below 1 is just as bad as a value well above 1 when evaluating how likely it is that your predictions were correct.)
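As a rough sketch of that calculation: the roster below mixes the two players whose numbers appear in this post (Ford and White) with three made-up players, so the printed value says nothing about the real Twins. Whether to use the population or sample standard deviation isn't specified above; `statistics.pstdev` (population) is an assumption here.

```python
import math
import statistics

def z_score(actual, trials, p):
    """(actual - expected) successes in units of the binomial standard deviation."""
    expected = trials * p
    sd = math.sqrt(trials * p * (1 - p))
    return (actual - expected) / sd

def goodness_of_fit(z_scores):
    """Standard deviation of the z-scores (population version, an assumption)."""
    return statistics.pstdev(z_scores)

# (name, actual hits, at-bats, projected average)
players = [
    ("Lew Ford", 8, 31, 0.278),          # numbers from the post
    ("Rondell White", 8, 63, 0.292),     # numbers from the post
    ("Hypothetical A", 20, 60, 0.290),   # made up for illustration
    ("Hypothetical B", 12, 50, 0.265),   # made up for illustration
    ("Hypothetical C", 15, 55, 0.270),   # made up for illustration
]

z_scores = [z_score(hits, ab, p) for _, hits, ab, p in players]
fit = goodness_of_fit(z_scores)
for (name, *_), z in zip(players, z_scores):
    print(f"{name:15s} z = {z:+.2f}")
print(f"goodness-of-fit = {fit:.2f}")
```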
When I calculate this for the Twins' AVG, OBP, and SLG, I get the following results:
From this table, we see that the batting average predictions have been the most accurate, followed by the OBP predictions and the SLG predictions. Values for the goodness-of-fit between 0.67 and 1.5 are pretty good for our purposes. The slugging predictions lie outside this range, but as mentioned at the beginning, I haven't taken the time to carefully compute the standard deviation for slugging average, so this goodness-of-fit is only approximate.
A look at each set of predictions separately
1. Batting Average
What follows is a graph of the predicted number of hits plotted against the actual number of hits. Each point represents a different player on the team. The red line represents what a "perfect" prediction would look like, with each predicted number of hits exactly equalling the actual number of hits. The error bars on each point illustrate the standard deviation of each measurement. Larger error bars indicate, in general, that a player has taken fewer at-bats, so we are more uncertain about how good he actually is than a player with more at-bats and smaller error bars.
What about this graph makes me think the predictions are good? The first thing is to look at how close to the red line each of the data points is, as judged by the error bars. If the fit is a good one, about 2/3 of the points should see their error bars overlap with the red line and about 1/3 shouldn't. I count 5 of 15 points that don't overlap with the red line, which is right in line with expectations.
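The overlap check can also be done without squinting at the graph. This sketch assumes each error bar is one binomial standard deviation computed from the projected rate, and it only uses the two players whose numbers are quoted in this post; the function name is hypothetical.

```python
import math

def overlaps_line(actual, trials, p):
    """True if the actual total is within one standard deviation of the prediction."""
    expected = trials * p
    sd = math.sqrt(trials * p * (1 - p))
    return abs(actual - expected) <= sd

# Lew Ford: 8 hits in 31 at-bats against a 0.278 projection.
ford_overlaps = overlaps_line(8, 31, 0.278)
# Rondell White: 8 hits in 63 at-bats against a 0.292 projection.
white_overlaps = overlaps_line(8, 63, 0.292)

print(f"Ford overlaps the line:  {ford_overlaps}")
print(f"White overlaps the line: {white_overlaps}")
```

Running the same check over all 15 players and counting the `False` results is the numerical version of the count described above.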
Somewhat important technical note: It might seem tempting to say that the fit would be better if those 5 points also overlapped the line, but this is not the case. Say someone told you they were flipping a fair coin (50% chance for heads) 100 times, and that the coin landed heads exactly 50 times. Then say you asked them to do that again, and they told you 50 times again. And so on and so forth, with them telling you that the coin landed heads exactly 50 times. One of two things is happening here: either they are lying to you (faking their data) or they aren't actually flipping a coin. Whatever they are doing is very predictable, happening 50% of the time every time, but if they were really flipping a coin then sometimes it would be more than 50 and sometimes it would be less than 50. In the same way, if all of our error bars overlapped the red line, then our predictions would not be described very well by binomial statistics. I suppose this would be good in the sense that we're closer to the actual predictions than binomial statistics would expect, but it's bad in that we wouldn't have any model that tells us how close we should actually be.
Another way to judge whether or not the fit is good is by looking at how many points fall above the red line and how many fall below the red line. I see 3 points essentially right on the line, 6 above, and 6 below. Having an equal number of over-performers and under-performers is a good sign that the predictions are neither over-rating nor under-rating the team as a whole.
The one baseball note I'll make about the fit is where Rondell White is on this graph. He's the underperformer you'll see in the lower right corner of the graph, with about 18 expected hits and only 8 actual hits. This is by far the biggest difference between actual and expected performance. The probability that a true .292 hitter would get 8 or fewer hits in 63 at-bats is 0.1%. Thus, it is highly unlikely that bad luck has been the true culprit behind Rondell White's start. It's been pretty clear to observers that White's approach of late has been poor. If he can get things fixed, he might turn out to be the .292 hitter PECOTA expected him to be. If nothing changes, then there's not much reason to expect White to be anywhere near .292 just by hoping for better luck.
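A probability like this one can be computed by summing the binomial distribution directly. The sketch below assumes exactly 63 at-bats and a rounded .292 projection, so the result may differ a bit from the figure quoted above depending on rounding and method, but it should come out as a small fraction of one percent.

```python
import math

def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Chance that a true .292 hitter gets 8 or fewer hits in 63 at-bats.
white_prob = binomial_cdf(8, 63, 0.292)
print(f"P(8 or fewer hits in 63 at-bats) = {white_prob:.2%}")
```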
2. On-base Percentage
Here is the graph for the on-base percentage predictions. In this graph times-on-base is plotted on each axis instead of hits. Everything else is as in the batting average graph.
As the goodness-of-fit number for OBP indicates, this fit isn't quite as good as the AVG fit. We can see this by looking at the graph and noticing 9 points above the red line and only 6 below it. This isn't all that far from the 7-and-8 split we would expect, so it's still a pretty reasonable fit, just not quite as nice as the one for batting average. OBP is a big part of running a good offense, and this would seem to suggest that, if anything, the Twins are overachieving in that category right now, if only slightly. The big outlier in the bottom right is again Mr. Rondell White.
3. Slugging Average
Here is the graph for the slugging average predictions, with total bases plotted along each axis.
This is basically more of the same. What makes this fit worse than the others isn't so much that the points are farther from the fit line, it's that the error bars on each point are smaller relative to the points' distance from the fit line. As mentioned twice above already, the error bars here are only estimates, so I won't go into a whole lot of detail analyzing this plot, just noting that at a first approximation, it seems like the predictions were reasonably good.
How NOT to evaluate the predictions
Most baseball statistical studies begin and end with looking at a coefficient of determination (the square of the correlation coefficient), commonly referred to as an "R squared" value. We are told that an R squared value close to one represents a really good fit and that an R squared value near zero means the fit was a bad one. For each of my above fits, I quickly calculated the R^2 value and got this table:
In a typical analysis, these values would lead to a strange statement by the researcher to the effect of "this means the model explains about 60% of the data" or something like that. This would lead us to believe that maybe we had some more work to do when predicting average, when in fact our model already explains the data almost as well as we could possibly hope, as I found with the goodness-of-fit analysis. Or maybe it would lead us to looking into some other way to predict batting averages, when we really can't do significantly better than we already have.
Where did the R^2 analysis go wrong? I mean, lots of people use it, so there must be some merit to it, right? Sure. If each of our points above had a very small error bar, then the goodness-of-fit analysis gives you practically the same result as the R^2 analysis for most data sets. So in studies where the researchers are looking at season-ending totals, it's a pretty reasonable tool, and it's much quicker than doing the goodness-of-fit analysis. But I think this makes a good cautionary tale that one should not blindly apply an R^2 analysis and expect the results to be particularly meaningful.
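To illustrate the point, here is a simulation sketch in which the predictions are correct by construction: every simulated player's hits are drawn from exactly the projected rate. Everything about the setup other than the 15-player team size is an assumption made for illustration (the range of projected averages, the at-bat ranges, the seed, and the number of simulated teams). The idea is to compare early-season sample sizes with roughly ten times as many at-bats.

```python
import math
import random
import statistics

def r_squared(xs, ys):
    """Square of the Pearson correlation between two lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

def simulate_team(min_ab, max_ab, rng, n_players=15):
    """Simulate one team whose projections are exactly right; return (R^2, goodness-of-fit)."""
    predicted, actual, z_scores = [], [], []
    for _ in range(n_players):
        p = rng.uniform(0.240, 0.300)       # assumed range of projected averages
        ab = rng.randint(min_ab, max_ab)    # assumed range of at-bats
        hits = sum(1 for _ in range(ab) if rng.random() < p)
        expected = ab * p
        sd = math.sqrt(ab * p * (1 - p))
        predicted.append(expected)
        actual.append(hits)
        z_scores.append((hits - expected) / sd)
    return r_squared(predicted, actual), statistics.pstdev(z_scores)

def average_results(min_ab, max_ab, rng, n_teams=200):
    """Average R^2 and goodness-of-fit over many simulated teams."""
    results = [simulate_team(min_ab, max_ab, rng) for _ in range(n_teams)]
    return (statistics.fmean(r for r, _ in results),
            statistics.fmean(g for _, g in results))

rng = random.Random(7)   # arbitrary seed, only for reproducibility
small_r2, small_gof = average_results(20, 65, rng)     # about three weeks of at-bats
large_r2, large_gof = average_results(200, 650, rng)   # closer to season totals

print(f"early season: R^2 = {small_r2:.2f}, goodness-of-fit = {small_gof:.2f}")
print(f"full season:  R^2 = {large_r2:.2f}, goodness-of-fit = {large_gof:.2f}")
```

In both cases the model is as good as it can possibly be, and the goodness-of-fit should stay near 1. The R^2 value, on the other hand, should be noticeably lower for the early-season sample simply because each player's error bar is larger relative to the spread between players.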
Conclusions
So far, with the notable exception of Rondell White, the Twins have performed within PECOTA's expectations. Only White has given us any reason over the first three weeks of the season to change our performance expectation from our predictions at the beginning of the season. Whether this is good news or bad news depends on how many runs you thought the Twins would score if they performed according to their PECOTA projections.