I've been playing with a dataset I pulled from baseball-reference last weekend (should have been doing stats homework... ah well).
My idea stemmed from Montanatwinsfan's post and the dialog that followed in the community on the validity of "new" vs. "old" statistics. The ultimate goal is to compile a dataset of team statistics by year and run a series of multiple regressions on the old and new stats to see which ones hold up under statistical scrutiny. The research question is "Which statistics contribute the most to predicting wins, playoff appearances and ultimately World Series appearances?"
I currently have all data from 2000 through 2007, but unfortunately haven't yet bought SPSS to run the regressions. In the meantime, I've been playing with the dataset in pivot tables in Excel 2007 (which, for you data miners out there, is pretty slick compared to the 2003 version).
I realize that most of this is common sense stuff, but thought it might be interesting to have the numbers to back up the common sense. In what I hope will be the first of several posts based on the data, here is a look at average salary by year, broken into three groups: Teams that did not make the playoffs, teams that made the playoffs but not the WS, and teams that played in the WS.
I did not break it out any further because the variance becomes too large to find anything meaningful. This means that each WS data point is an average of two teams, and each playoff data point is an average of eight teams (including the WS teams).
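For anyone who would rather script the grouping than build a pivot table, here is a rough sketch of the same idea in Python. The team codes, payroll numbers, and the `average_salary` helper are all made up for illustration and are not from my dataset; the `ws_counts_as_playoff` flag just reflects the note above that the playoff points include the WS teams.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (year, team, payroll in $ millions, season result).
# These numbers are invented for illustration only.
rows = [
    (2006, "AAA", 60.0, "no_playoffs"),
    (2006, "BBB", 80.0, "no_playoffs"),
    (2006, "CCC", 90.0, "playoffs"),
    (2006, "DDD", 110.0, "playoffs"),
    (2006, "EEE", 85.0, "world_series"),
    (2007, "AAA", 64.0, "no_playoffs"),
    (2007, "CCC", 100.0, "playoffs"),
    (2007, "DDD", 70.0, "world_series"),
]

def average_salary(rows, ws_counts_as_playoff=True):
    """Average payroll per (year, group), like a pivot table would."""
    buckets = defaultdict(list)
    for year, _team, payroll, result in rows:
        buckets[(year, result)].append(payroll)
        # Assumption: WS teams are also counted in the playoff average.
        if result == "world_series" and ws_counts_as_playoff:
            buckets[(year, "playoffs")].append(payroll)
    return {key: mean(values) for key, values in buckets.items()}

averages = average_salary(rows)
for (year, group), avg in sorted(averages.items()):
    print(year, group, round(avg, 1))
```

Setting `ws_counts_as_playoff=False` would give the "made the playoffs but not the WS" version of the playoff group instead.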
The first observation I took from the data is that teams that did not make the playoffs have steadily increased their salary, but fall well short of the kind of money teams that make the playoffs are spending. The second observation is that the average salary of WS teams fluctuates so wildly that it's almost impossible to draw conclusions from the data. Four of the WS data points fall below the playoff average (2002, 2005, 2006, 2007) and the other four fall above the playoff average (2000, 2001, 2003, 2004). We'll approach this anomaly in a moment.
Next, it was time to take a step further into the data. I created linear regression lines and removed the trend lines from the graph. You will also note that I provided the regression equation and R-squared calculation, if you're into that sort of thing.
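If you want to see what is behind the regression equation and R-squared that Excel reports, here is a minimal sketch of an ordinary least-squares line fit in plain Python. The `fit_line` name and the payroll numbers are hypothetical; this is the textbook formula, not necessarily exactly what Excel does internally.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x, plus R-squared."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # R-squared: share of the variance in y explained by the line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot if ss_tot else float("nan")
    return slope, intercept, r_squared

# Made-up average payrolls ($ millions) for 2000 through 2007.
years = list(range(2000, 2008))
avg_payroll = [55.0, 58.5, 59.0, 63.5, 64.0, 68.0, 70.5, 72.0]

slope, intercept, r2 = fit_line(years, avg_payroll)
print(f"slope per year: {slope:.2f}, R-squared: {r2:.3f}")
```

The slope is the "increase per year" number, and an R-squared close to 1 means the yearly averages sit close to the line.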
The first thing to note about this graph is that the R-squared value for WS is extremely low, again validating that there is just too little data and too much variance to make an accurate inference about the regression line. The other two lines, however, seem to fit the data quite well and tell an interesting story. Simply put, non-playoff teams are averaging a 2.4 million increase per year, while playoff teams are averaging a 6.8 million increase per year. To put it in context, playoff teams in 2007 spent an average of 24 million more than non-playoff teams. If this regression holds true, by 2012, playoff teams will spend an average of 46 million more than non-playoff teams!
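The 2012 figure is just the 2007 gap plus five more years of the difference between the two slopes. A quick sketch of that arithmetic, using the rounded numbers quoted above (the variable and function names are mine):

```python
# Rounded figures from the regression lines, in $ millions.
gap_2007 = 24.0            # playoff average minus non-playoff average in 2007
playoff_slope = 6.8        # yearly increase for playoff teams
non_playoff_slope = 2.4    # yearly increase for non-playoff teams

def projected_gap(year, base_year=2007, base_gap=gap_2007):
    """Extend the gap linearly: it widens by the slope difference each year."""
    return base_gap + (playoff_slope - non_playoff_slope) * (year - base_year)

# 24 + (6.8 - 2.4) * 5 comes out to roughly 46.
print(round(projected_gap(2012), 1))
```

This only holds if both trends stay linear, which is a big "if" with eight data points.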
While these results may not be earth-shattering to most people who have been lamenting the Yankees (and now Boston) "buying" the World Series, I wanted to slice this data one more time. This time, I removed both Boston and New York (Yankees) from the data to see how our regression lines would change:
Wow! The yearly increase in playoff team salaries is now only 1 million more than that of non-playoff teams. Additionally, the difference between playoff teams and non-playoff teams is holding fairly constant at about 8 million.
While again, we can't make any actual inferences about the World Series regression line because the R-squared is too small, it's really crazy to see the sheer randomness (is that a word?) of the World Series teams' payrolls. With New York and Boston out of the picture, the Marlins (2003) and Rockies (2007) look almost comical when compared to the rest of the data.
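Dropping the two teams is a simple filter applied before the averages are recomputed. A small sketch, with hypothetical team codes and invented payroll numbers:

```python
from statistics import mean

# Hypothetical team codes for the two excluded clubs.
EXCLUDED = {"NYY", "BOS"}

# Invented playoff-team rows for one season: (year, team, payroll in $ millions).
playoff_rows = [
    (2004, "NYY", 200.0),
    (2004, "BOS", 140.0),
    (2004, "CCC", 90.0),
    (2004, "DDD", 70.0),
]

def drop_teams(rows, excluded=EXCLUDED):
    """Return only the rows whose team code is not in the excluded set."""
    return [row for row in rows if row[1] not in excluded]

avg_all = mean(payroll for _, _, payroll in playoff_rows)
avg_rest = mean(payroll for _, _, payroll in drop_teams(playoff_rows))
print(avg_all, avg_rest)
```

The regression lines are then refit on the filtered averages, the same way as before.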
The conclusions drawn from the data above are as follows:
1. The ability to "buy" a World Series is still inconclusive because there is just too little data to accurately predict World Series appearances based on salary. In fact, several of the past WS teams had significantly lower salaries than even the non-playoff teams.
2. The ability to "buy" a playoff berth seems to be a fairly established trend. The rate of increase for playoff teams is greater than for non-playoff teams, so we can expect the margin to grow over time, regardless of whether you include Boston and New York or not.
3. New York and Boston's combined salary will soon be larger than the sum of the other 28 teams. (Note: Intentional hyperbole thrown in for comic relief... it's late and I've just spent an hour writing about statistics, give me a break)