Buying a World Series
I've been playing with a dataset I pulled from baseball-reference last weekend (should have been doing stats homework... ah well).
My idea stemmed from Montanatwinsfan's post and the respective dialog from the community on the validity of "new" vs. "old" statistics. The ultimate goal is to compile a dataset of team statistics by year and run a series of multiple regressions on the old and new stats to see which ones hold up under statistical scrutiny. The research question is "Which statistics contribute the most to predicting wins, playoff appearances and ultimately World Series appearances?"
I currently have all data from 2000 through 2007, but unfortunately haven't yet bought SPSS to run the regressions. In the meantime, I've been playing with the dataset in pivot tables a la Excel 2007 (which, for you data miners out there, is pretty slick compared to the 2003 version).
I realize that most of this is common sense stuff, but thought it might be interesting to have the numbers to back up the common sense. In what I hope will be the first of several posts based on the data, here is a look at average salary by year, broken into three groups: Teams that did not make the playoffs, teams that made the playoffs but not the WS, and teams that played in the WS.
I did not break it out any further because the variance becomes too large to find any meaningful data. This means that for each WS data point, there are two teams, and for every playoff data point there are eight teams (includes the WS teams).
The first observation I took from the data is that teams that did not make the playoffs have steadily increased their salary, but fall well short of the kind of money teams that make the playoffs are spending. The second observation is that the average salary of WS teams fluctuates so wildly that it's almost impossible to draw conclusions from the data. Four of the WS data points fall below the playoff average (2002, 2005, 2006, 2007) and the other four fall above the playoff average (2000, 2001, 2003, 2004). We'll approach this anomaly in a moment.
Next, it was time to take a step further into the data. I created linear regression lines and removed the trend lines from the graph. You will also note that I provided the regression equation and R-squared calculation, if you're into that sort of thing.
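For anyone who wants to reproduce a trend line and its R-squared outside of Excel, here is a minimal sketch of an ordinary least squares fit. The function name `fit_line` is my own, and the numbers at the bottom are placeholders made up for illustration, not the actual payroll data.

```python
def fit_line(xs, ys):
    """Least squares fit of y = slope * x + intercept, plus R-squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squares of x around its mean, and cross-products of x and y
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared = 1 - (residual variation / total variation)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1.0 - ss_res / ss_tot if ss_tot else 1.0
    return slope, intercept, r_squared


# Placeholder values only (NOT the real averages from the graphs)
years = [0, 1, 2, 3]                    # years since the first season
avg_salary = [50.0, 53.0, 55.0, 58.0]   # hypothetical average payroll, in millions
slope, intercept, r2 = fit_line(years, avg_salary)
print(slope, intercept, r2)
```

The slope is the average increase per year, which is the number quoted from each regression equation.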
The first thing to note about this graph is that the R-squared value for WS is extremely low, again validating that there is just too little data and too much variance to make an accurate inference about the regression line. The other two lines, however, seem to fit the data quite well and tell an interesting story. Simply put, non-playoff teams are averaging a 2.4 million increase per year, while playoff teams are averaging a 6.8 million increase per year. To put it in context, playoff teams in 2007 spent an average of 24 million more than non-playoff teams. If this regression holds true, by 2012, playoff teams will spend an average of 46 million more than non-playoff teams!
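The 2012 figure is just the 2007 gap plus five more years of the difference between the two slopes. A quick sketch of that arithmetic, assuming both trend lines stay linear (the function name `project_gap` is made up for this example):

```python
def project_gap(gap_now, slope_playoff, slope_non_playoff, years_ahead):
    """Extend the current gap by the difference in yearly increases."""
    return gap_now + (slope_playoff - slope_non_playoff) * years_ahead


# Figures from the post, in millions of dollars
gap_2012 = project_gap(24.0, 6.8, 2.4, 2012 - 2007)
print(round(gap_2012, 1))  # 46.0
```

The gap widens by 6.8 - 2.4 = 4.4 million per year, so five years adds 22 million to the 24 million gap in 2007.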
While these results may not be earth-shattering to most people who have been lamenting about the Yankees (and now Boston) "buying" the World Series, I wanted to slice this data one more time. This time, I removed both Boston and New York (Yankees) from the data to see how our regression lines would change:
Wow! The increase in playoff team salaries is now only 1 million more than non-playoff teams. Additionally, the difference between playoff teams and non-playoff teams is holding fairly constant at about 8 million.
While again, we can't make any actual inferences about the World Series regression line because the R-squared is too small, it's really crazy to see the sheer randomness (is that a word?) of the World Series teams' payrolls. With New York and Boston out of the picture, the Marlins (2003) and Rockies (2007) look almost comical when compared to the rest of the data.
The conclusions drawn from the data above are as follows:
1. The ability to "buy" a World Series is still inconclusive because there is just too little data to accurately predict World Series appearances based on salary. In fact, several of the past WS teams had significantly lower salaries than even the non-playoff teams.
2. The ability to "buy" a playoff berth seems to be a fairly established trend. The rate of increase for playoff teams is greater than for non-playoff teams, so we can expect the margin to grow over time, regardless of whether you include Boston and New York or not.
3. New York and Boston's combined salary will soon be larger than the sum of the other 28 teams. (Note: Intentional hyperbole thrown in for comic relief... it's late and I've just spent an hour writing about statistics, give me a break)
Comments
Love the research.
I like where you're going with this, and I enjoyed this article. Well done.
My perspective: I'm not sure you can support your conclusions based on the data above. Your conclusions are based on linear regressions, which are themselves based on just eight data points. I'm not sure that we can generalize from this set.
Second, you haven't supported your causation. Your inference - that money buys the playoffs, that y is caused by x - is based on correlation. I'm not suggesting that there is no relationship between the two, only that there are other factors in play.
I like the statistical posts, though - I encourage you to keep working on this type of thing.
Causation vs. correlation
Thanks for bringing this up Jon. It's a good point to make when looking at statistics and I completely agree with your concerns.
First, while I currently only have data 2000 - 2007, I hope to eventually build the 90's era into the dataset when I get more time. With more data, and not breaking the results out by year, there should be enough data in the "playoff - no ws" category to accurately make these assumptions.
Second, as mentioned in my original post, I am hoping to get a statistical software package like SPSS to help with running regressions instead of correlations. Not only will this allow me to add multiple variables to the model, but also help to isolate the effect that team salary (or any other variable) has on playoff appearances and ultimately World Series appearances.
What would my life be like without the '91 World Series?
If anyone has the spare time
Jon - well put. These are GREAT posts for starting discussions. The causation/correlation question, to me, is that the teams that make the playoffs tend to get more revenue, allowing them to spend more, but they were already good, so they keep going to the playoffs (think Yankees, Red Sox, Braves). Obviously this is a chicken/egg, dog chasing its tail kind of a question.
The statistic I'd really like to see, if anyone is interested in putting it together, is the year over year wins change relative to the year over year payroll change. I think this helps isolate the impact of spending MORE money. It won't account for the fact that many good teams have to start increasing their payroll to keep their young stars, so the results may not be intuitive.
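A rough sketch of how that year-over-year comparison could be put together, assuming the data is keyed by team and season. The layout of `records`, the function name, and the sample team are all hypothetical.

```python
def year_over_year(records):
    """records maps (team, year) -> (wins, payroll).

    Returns a list of (team, year, payroll_change, wins_change) for every
    season that has a prior season available for the same team.
    """
    changes = []
    for (team, year), (wins, payroll) in sorted(records.items()):
        prev = records.get((team, year - 1))
        if prev is None:
            continue  # first season in the dataset for this team
        prev_wins, prev_payroll = prev
        changes.append((team, year, payroll - prev_payroll, wins - prev_wins))
    return changes


# Made-up example team and numbers, for illustration only
sample = {("AAA", 2000): (80, 50.0), ("AAA", 2001): (85, 60.0)}
print(year_over_year(sample))
```

The resulting pairs of payroll change and wins change could then be plotted or correlated in place of the raw payroll figures.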
This is great
Nice work on this. I agree with you, too, that the World Series data has so few datapoints as to be pretty much useless - it's especially pronounced in your chart without Boston and New York, because they represent over half the AL representatives in the World Series over the length of the study (meaning that most of the "averages" are a single datapoint), and in 2003 and 2007 they faced extremely low-payroll teams, throwing the trend data way out of whack.
I think another useful way to plot the data would be to compare payrolls to league average, rather than overall dollars - it would be interesting to see, for example, what percentage of league payroll is spent by the playoff teams vs. non-playoff teams, and how those payrolls are increasing/decreasing in relative terms, which probably more accurately represents how teams are participating in the market. It's also another situation in which you may want to consider excluding the Yankees and possibly the Red Sox for at least some of the comparisons.
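One way to sketch the percentage-of-league-payroll idea for a single season, assuming a mapping of team to payroll and a set of the teams that made the playoffs (all names and numbers here are hypothetical):

```python
def payroll_share_by_group(payrolls, playoff_teams):
    """Return (playoff share, non-playoff share) of total league payroll."""
    total = sum(payrolls.values())
    playoff_total = sum(p for team, p in payrolls.items() if team in playoff_teams)
    return playoff_total / total, (total - playoff_total) / total


# Made-up three-team league, payroll in millions
season = {"A": 100.0, "B": 50.0, "C": 50.0}
print(payroll_share_by_group(season, {"A"}))  # (0.5, 0.5)
```

Running this once per season would give a share-over-time series that is not distorted by league-wide salary growth.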
"There are only two things that are infinite, the universe and human stupidity, and I'm not sure about the former." - Albert Einstein
Like it.
"I think another useful way to plot the data would be to compare payrolls to league average, rather than overall dollars."
I like this. This is one of the things I was thinking about last night - payrolls throughout the league have been going up, not just those of the top teams, and so it would be good to remove (or at least slightly remove) the effect of "inflation" from this.
You may want to correct for outliers when calculating the league average, too.
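A small sketch of what that might look like: divide each payroll by the league average, with the option of using the median so that one or two huge payrolls don't drag the baseline upward. Using the median is only one possible outlier correction, and the names below are made up for the example.

```python
from statistics import mean, median


def relative_payrolls(payrolls, robust=False):
    """Express each team's payroll as a multiple of the league's center.

    robust=False uses the plain average; robust=True uses the median,
    which is much less sensitive to a couple of very large payrolls.
    """
    values = list(payrolls.values())
    center = median(values) if robust else mean(values)
    return {team: p / center for team, p in payrolls.items()}


# Made-up league with one big spender, payroll in millions
league = {"A": 200.0, "B": 50.0, "C": 50.0}
print(relative_payrolls(league)["A"])               # 2.0
print(relative_payrolls(league, robust=True)["A"])  # 4.0
```

In the example the big spender is 2 times the average but 4 times the median, which shows how much a single outlier can move the average.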
by Jon Marthaler on Apr 9, 2008 12:43 PM EDT
One thing that might escape these models
is the whole complex of factors of how baseball is changing.
The '90s were an era of big boppers and steroids. Now, we seem to be moving toward an appreciation for OBP, OPS, pitching, defense and overall team speed.
While such changes might easily be captured by salary analysis, in that teams all spend trying to capture the best pool of talent, what might end up happening is that some desirable qualities, like pitching, defense and team speed, may be more likely possessed by young and fairly cheap players.
Given that there is a huge salary premium in MLB for players with ML experience, we may soon enter an era where young, inexpensive talent can make high quality contributions to a team's overall success. And thus, more and more money will be spent on scouting and minor league development as a near term competitive strategy.
Baseball, like any industry, is never static, and what you might find in such an analysis is that there is a premium that goes to a team that finds the best young talent on the planet.
Moneyball
While such changes might easily be captured by salary analysis, in that teams all spend trying to capture the best pool of talent, what might end up happening is that some desirable qualities, like pitching, defense and team speed, may be more likely possessed by young and fairly cheap players.
That's an interesting hypothesis, and it also made me think of an interesting avenue of research - has anyone looked into correlations between statistics and player salaries? I'd guess there's a fairly strong correlation with obvious stuff like homeruns (and likely, by extension, slugging average), but I'd also wonder whether it would be possible to use the data to spot trends in what teams are looking for in players - for example, I'd guess the last decade has seen an increase in the money paid for higher walk rates, possibly along with a decreased emphasis on batting average.
That thought actually wouldn't go along with your idea at all, though, since younger players' salaries aren't determined at all by the market. Even arbitration-eligible players' salaries would have to be taken with a grain of salt in that data, since they're determined largely by seniority in addition to performance.
"There are only two things that are infinite, the universe and human stupidity, and I'm not sure about the former." - Albert Einstein
Great post
When I have more time, I'll do my best to digest it.
"You're thinking too much. Just have fun." -- Bennie "The Jet" Rodriguez in Sandlot
One thought
I wonder what the graphs would look like if you take Boston and New York out of the pool of teams.
"You're thinking too much. Just have fun." -- Bennie "The Jet" Rodriguez in Sandlot
Already done
It's the third graph.
"There are only two things that are infinite, the universe and human stupidity, and I'm not sure about the former." - Albert Einstein
It seems a bit silly to me...
...to take them out of the equation completely. They've won 5 of the last 10 WS and have 16 playoff appearances in the last 20 years. They are exhibits 1A and 1B in the case for a huge payroll being a huge competitive advantage.
Considering this issue without considering those two is like considering how good the '48 Braves were without Warren Spahn and Johnny Sain.
The point
The point of removing Boston and New York from the third graph was more an exercise in showing that hypothesis 2 from my original post still holds true even when the extreme outliers that could significantly affect the "Playoff - Non WS" trend line were removed.
What would my life be like without the '91 World Series?
I've added this to the front page
because it's some good work; at the very least it's a great discussion.
I'll continue to do this for excellent posts. Keep up the good work!
You might not be able to buy a WS...
...but by having a huge payroll you never have to rebuild and you get yourself to the playoffs more often than a team with similar know-how and a lower budget. More chances to win the WS will lead to more WS victories in the long run. Baseball has a ways to go if it truly wants parity in the league.
Exactly
This is exactly the point I'm trying to make. I think that baseball has been quite "lucky" the past couple years that extremely low payroll teams have made it to the Series.
Assuming the trend lines hold reasonably true (which may be a bit of a leap of logic as Jon mentioned above), these low payroll teams will have an increasingly hard time even MAKING the playoffs, let alone going to the Series. This is why I attempted to focus on the differences and the increasing "gap" between Playoff and Non-Playoff teams and tried to downplay the WS numbers.
What would my life be like without the '91 World Series?
Well, as a graduate student in Economics about to take my econometrics comprehensive...
...I have to comment. I haven't read it too carefully because I am in a hurry, but I will say the following things.
What is your regression? From those graphs, it looks like you are regressing salary on year. This would mean you are assuming that "year" is an exogenous (non-random) variable, which is obviously reasonable, but you are also assuming that it EXPLAINS salary. This might be true (salaries go up every year), but it also looks like that data might be deflated. Also, by not including anything else in the regression you assume that ONLY year explains salary. Also, the meaning of R^2 in this context is the percentage of the variation in salary that is explained by year, which, again, isn't really that interesting of a question. I think a more interesting regression would be something like wins = a + b*salary + c*(other performance or city characteristics), but that depends on what it is that you are trying to explain.
Also, I didn't realize baseball-reference had salary data, I'll have to check it out. I have a couple of cool baseball data sets and if you want them I can send them to you. I pulled a couple from baseball-reference (like all hitting and pitching data for the last 100 years), I got salary data from USA Today and I have some attendance data from somewhere else (I might even have ticket prices).
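As a rough illustration of a regression of the form wins = a + b*salary + c*(something else), here is a bare-bones least squares solver using the normal equations. It is a sketch only: the function name `ols` is made up, it returns coefficients without standard errors or significance tests, and the sample numbers are invented rather than taken from any real team data.

```python
def ols(rows, ys):
    """Least squares coefficients [a, b, c, ...] for y = a + b*x1 + c*x2 + ..."""
    X = [[1.0] + list(r) for r in rows]  # leading 1.0 gives the intercept a
    k = len(X[0])
    # Augmented normal equations: [X'X | X'y]
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * y for x, y in zip(X, ys))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; cannot solve")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Invented data that follows y = 1 + 2*x1 + 3*x2 exactly
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
ys = [1, 3, 4, 6, 8]
print(ols(rows, ys))
```

With real data, each row would hold one team-season (salary plus whatever other variables are chosen) and `ys` would hold that season's wins. A statistics package would still be the better tool for judging whether the coefficients are meaningful.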
http://noblingblings.blogspot.com/
Data
You're right in regards to the regression, it's more like a simple correlation. I used the regression line charts more to show general trends in graphical form. I'm trying to get my hands on SPSS, but for now, I'm forced to use the limited linear regression provided in the Excel 2007 chart wizard. If you have any ideas on how to run multiple regression using Excel, I'm all ears!
As for the data, it took some work to find (and more work to compile the dataset), but it's all there.
Here's a sample of what I pulled for 2007 - AL: http://www.baseball-reference.com/leagues/AL_2006.shtml
You can find team payroll under the League Miscellaneous Stats section. I'm looking forward to playing with some of the other information to see what pops. I started work on some of the defensive stats, but since there are several variables, I'd rather run multiple regression so that I can isolate the effects. I also have some ideas about looking at All Star players and win percentages. We'll see.
Unfortunately the data is all rolled up to the team level, so you're limited in that regard, but for what I'm ultimately trying to accomplish, it should work fine. If you have any thoughts for future posts, let me know.
What would my life be like without the '91 World Series?
Great post
even though I didn't understand a word of it. :) What can I say, I'm a lawyer and I can barely do simple arithmetic even with a calculator.
Keep it coming, I enjoy learning, but also please remember to put some of this in English for me because my eyes start to glaze over when you guys throw out mathematical equations and superscript numbers...
by montanatwinsfan on Apr 9, 2008 10:14 PM EDT
