Statistical Projections and you, Part 1

Projections, the systems that create them, and tips for using them

Spring Training also brings projection season!
Photo by Joe Robbins/Getty Images

It’s projection season! Every year, from the late off-season through the beginning of Spring Training, the internet is flooded with predictions and data-backed projections about what might happen in the upcoming season. Today I want to offer a quick primer on what projections are, look at a few of the most popular projection systems and how they work, and offer a few tips and rules of thumb for using and consuming projections. In part two, we’ll dig into the interesting things the projection systems have to say about the 2020 Minnesota Twins.

Projections 101:

For readers interested in deeper, more detailed information on this and related baseball analytics topics, I highly recommend perusing Fangraphs’ indispensable library.

Let’s begin with some simple questions. What are projections and why should we care? First and foremost, we need to understand that most baseball statistics, both traditional (like batting average and counting stats such as RBI) and advanced, describe the past – what has already happened. The challenge is that what has happened in the past may or may not be representative or a good predictor of what might happen in the future. If we want to try to understand or predict what might happen in the future, we need a projection.

And projections come in many shapes and sizes. For example, any good self-respecting baseball fan has probably made their share of subjective projections about their team and players, using their opinions and own individual knowledge. This has likely been true since the beginning of the game. These kinds of “projections” are common conversation topics among fans debating whether their team should trade for player X, or which player you want at the plate with the bases loaded. I might want to make a projection that Max Kepler will hit a home run in his next plate appearance versus Reds’ starter Trevor Bauer, because MAX KEPLER OWNS TREVOR BAUER.

While the historical data used to make this projection might seem to support it on the surface, we shouldn’t be so certain in its ability to accurately predict what might happen in the future. It might be predictive, or it might be small sample size randomness.
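To get a feel for how easily a short hot streak can come from chance alone, here is a minimal sketch in Python. The home run rate, the number of plate appearances, and the function name are hypothetical values picked for illustration; they are not Kepler's actual numbers against Bauer.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """Chance of k or more successes in n independent tries, each with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical hitter who homers in 5% of his plate appearances against everyone.
hr_rate = 0.05
# Hypothetical matchup sample: 20 plate appearances against one pitcher.
plate_appearances = 20

chance = prob_at_least(3, plate_appearances, hr_rate)
print(f"Chance of 3+ HR in {plate_appearances} PA by luck alone: {chance:.1%}")
```

Under these assumptions the chance works out to roughly 7.5%. That is small for any single matchup, but with hundreds of hitter and pitcher pairings across the league, a handful of streaks like this should be expected every season even if no hitter truly "owns" any pitcher.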

Similarly, baseball fans often consume numerous projections about minor league prospects. It is very common to hear or read that Prospect X is projected to become a #3 starter or a league average center fielder. These kinds of projections are particularly challenging because there is limited performance data about prospects, and the performance data we do have (say, the results of a season in High Class A Fort Myers) is not very representative of playing a season in the Major Leagues. It’s very hard to support a prospect projection with useful data. Because of that difficulty, we often rely heavily on expert opinion, in the form of hard-to-quantify scouting reports that assess a player’s current and projected tools. For example, many scouts assess that Twins’ prospect Trevor Larnach has significantly above average power. These opinions and assessments are then used to project how a player with these particular tools will fare in the Major Leagues. This is another form of projection.

The point is, a projection is simply an attempt to understand and predict what might happen in the future, and most projections are a subjective mix of incomplete historical data, small sample sizes, expert opinion, and domain knowledge. These kinds of subjective projections have their place and value in the context of baseball. But they may or may not predict the future accurately.

The Sabermetric Revolution has given us better data collection, access to data, new analysis, and better tools to attempt to understand what has happened, why it happened, and whether it was random, only part of the story, or a true representation of a player’s abilities. One way to look at it is that analytics is about separating the data points that are useful from those that are not or that are incomplete. While there are still many areas to improve and learn more about when it comes to analytics, we have clearly made important strides towards more objectively and completely understanding players and performance. Batting average versus on-base percentage is the classic example. Batting average is an incomplete measurement of batting performance because it does not account for a batter reaching base via a walk (a positive outcome) or for the fact that a double is more valuable than a single. Just because we can measure something does not mean it is useful for answering the questions we might have. And we have lots of questions about the game, especially when it comes to predicting the future. So, if we are better able to sort out which measurements and data points about baseball are useful for answering the questions we have (like predicting the future), we can make more objective and complete projections about what might happen for teams and players in the future. And objective, more complete projections have been shown many times to be more accurate than subjective ones.
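The batting average versus on-base percentage point can be made concrete with a little arithmetic. The sketch below uses the standard formulas, AVG = H / AB and OBP = (H + BB + HBP) / (AB + BB + HBP + SF), on two made-up stat lines. Both players and all of their numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StatLine:
    at_bats: int
    hits: int
    walks: int
    hit_by_pitch: int = 0
    sac_flies: int = 0

    def batting_average(self) -> float:
        # Hits per at-bat; walks never show up here.
        return self.hits / self.at_bats

    def on_base_percentage(self) -> float:
        # Times on base divided by the plate appearances that count toward OBP.
        times_on_base = self.hits + self.walks + self.hit_by_pitch
        chances = self.at_bats + self.walks + self.hit_by_pitch + self.sac_flies
        return times_on_base / chances


# Two hypothetical hitters with identical batting averages.
free_swinger = StatLine(at_bats=550, hits=154, walks=20)
patient_hitter = StatLine(at_bats=500, hits=140, walks=80)

for name, line in [("Free swinger", free_swinger), ("Patient hitter", patient_hitter)]:
    print(f"{name}: AVG {line.batting_average():.3f}, OBP {line.on_base_percentage():.3f}")
```

Both hitters bat .280, yet the patient hitter reaches base far more often. A fan looking only at batting average would see two identical players.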

As fans, we should want to be able to more accurately predict the future. We want to know if our team is getting better or worse with the transactions and player development it is doing. We want to know if the Kenta Maeda trade is worth losing a prospect like Brusdar Graterol and if signing Josh Donaldson to a big contract is likely to pay off. We want to know if our team’s opponents are getting better or worse relative to our team. Is the White Sox’s 2020 spending spree going to make them good enough to contend? Can they threaten the Twins’ chances in the AL Central? As fans, we want to know if it’s worth buying those tickets, or that team merchandise, or if it’s worth spending the hours to watch our team play. Projections give us tools to try to answer these kinds of questions, and more accurate and complete projections help us make more informed decisions, which in turn affects how we experience being a fan of our team.

Fans want to know, will Josh Donaldson’s big contract be worth it?
Photo by Hannah Foslien/Getty Images

Fortunately for us, there are a number of different systems designed to more objectively and completely project the future. They apply methods and variables in different ways, but the basic premise is the same – given all that we know about past performance of a player or team, what we know about which statistics are valuable for predicting and telling more of the story, and the assumptions we hold about the future – what can we expect from the player or team in the future?

Overview of Popular Projection Systems

Note: Below I do not go into significant detail on the methodologies of these projection systems. There are some nice detailed overviews available online, including here and linked within each section below.

Let’s take a look at some of the most popular projection systems. There are many others, and the nearly ubiquitous availability of data and computing power means almost anyone can develop data-driven projections. I’m highlighting four of the most well-known and widely used systems available today.

Marcel

It is appropriate that we begin with Marcel, even though its projections are no longer widely published. Marcel is perhaps the first and the simplest major projection system. It can be considered a Founding Father of sorts, and its simplicity makes it the baseline against which all other projection systems can be compared. Developed by Tom Tango in 2004, it simply uses the past 3 years of MLB data (read: no minor league or college data), with the most recent data weighted more heavily. It regresses towards the mean, a statistical concept that accounts for the fact that a sample data point may not be in line with the underlying average and that a future data point should be expected to be closer to that average. And Marcel has an age factor. According to Tango, Marcel is short for Marcel the Monkey Forecasting System, a name that alludes to its simplicity: it’s so simple a monkey could do it. Despite that simplicity, Marcel projections have proven to be on par with more complex systems over the years.
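The three ingredients described above (weighted recent seasons, regression to the mean, and an age factor) can be sketched in a few lines of Python. This is a rough illustration, not Tango's published code. The 5/4/3 season weights, the 1,200 plate appearances of league average, and the age adjustments are values commonly attributed to Marcel for hitters and are treated as assumptions here. The function name and the sample stat line are hypothetical.

```python
def marcel_style_rate(seasons, league_rate, age,
                      weights=(5, 4, 3), regression_pa=1200, peak_age=29):
    """Project a rate stat (for example HR per PA) for next season.

    seasons: list of (events, plate_appearances), most recent season first.
    All default parameter values are assumptions, see the text above.
    """
    # Step 1: weight recent seasons more heavily.
    weighted_events = sum(w * events for w, (events, _) in zip(weights, seasons))
    weighted_pa = sum(w * pa for w, (_, pa) in zip(weights, seasons))

    # Step 2: regress toward the mean by blending in league-average plate appearances.
    regressed = (weighted_events + league_rate * regression_pa) / (weighted_pa + regression_pa)

    # Step 3: age factor, a small boost before the peak age and a small penalty after it.
    if age < peak_age:
        age_multiplier = 1 + 0.006 * (peak_age - age)
    else:
        age_multiplier = 1 - 0.003 * (age - peak_age)
    return regressed * age_multiplier


# Hypothetical 27-year-old: (home runs, plate appearances) for the last three seasons.
history = [(36, 600), (20, 550), (19, 580)]
projected = marcel_style_rate(history, league_rate=0.035, age=27)
print(f"Last season: {36 / 600:.3f} HR/PA, projected: {projected:.3f} HR/PA")
```

The projection lands between the league rate and last season's career-best rate, which is the whole idea: a breakout year counts, but it does not get taken entirely at face value.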

PECOTA

PECOTA, which stands for Player Empirical Comparison and Optimization Test Algorithm, is Baseball Prospectus’ proprietary system that projects player and team performance. It was created by Nate Silver and debuted in the 2003 BP Annual. PECOTA is a system that takes a player’s past performance and tries to project the most likely outcome for the following season. It looks at all the numbers, and all the numbers that make up the numbers, to see which players are more likely to repeat their success and which ones benefited from good fortune.

PECOTA, like Marcel, begins by calculating a baseline for each player using their past performances, with more recent years weighted more heavily. PECOTA then uses that baseline and other factors, like the player’s body type, position, and age, to identify comparison players. The careers of those comparison players lead to the forecasts. Because every player has numerous comparable players to choose from, PECOTA can calculate not just the median performance level of those comps (the performance level where 50% of comps were worse), but also the performance level where 90% of comps were worse, or 10%, or 40%. PECOTA therefore offers a range of possibilities, and its best guess at the downside and upside cases for a given player.
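The percentile idea can be illustrated with a short Python sketch. It assumes the comparison players have already been chosen, which is the hard and proprietary part of PECOTA and is not shown here. The list of next-season OPS figures for the comps is entirely hypothetical.

```python
from statistics import quantiles

# Hypothetical next-season OPS figures posted by one player's comparison players.
comp_outcomes = [0.655, 0.690, 0.702, 0.718, 0.731, 0.745,
                 0.760, 0.778, 0.795, 0.820, 0.861]

# quantiles(..., n=10) returns the nine cut points at 10%, 20%, ..., 90%.
deciles = quantiles(comp_outcomes, n=10, method="inclusive")

downside = deciles[0]   # 10% of comps were worse than this
midpoint = deciles[4]   # 50% of comps were worse than this (the median)
upside = deciles[8]     # 90% of comps were worse than this

print(f"10th percentile: {downside:.3f}")
print(f"50th percentile: {midpoint:.3f}")
print(f"90th percentile: {upside:.3f}")
```

Reading the three numbers together gives a sense of both the most typical outcome and how wide the plausible range is, which a single point estimate cannot convey.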

It is worth noting PECOTA’s projections are not freely available for download and are only available to Baseball Prospectus subscribers.

Steamer

Steamer is a projection system developed by Jared Cross, a high school science teacher, and two of his former students, Dash Davidson and Peter Rosenbloom. It is unique in that it was not developed by people who were recognized sabermetricians at the time, but instead is the result of a high school project. It first put forward projections for the 2009 season.

Steamer, while popularly regarded as a more complex forecasting system, uses a simpler methodology that is more like Marcel than PECOTA. It uses a weighted average of past performance regressed toward league average, though how much each year is weighted and to what degree each year is regressed varies between statistics and is set using regression analysis of past players. Despite its relative simplicity, Steamer forecasts have proven to be consistently among the most accurate. They are also updated daily and available for download at Fangraphs.
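What sets this approach apart from a one-size-fits-all weighting is that each statistic gets its own settings. The Python sketch below shows the shape of that idea only. The year weights and regression amounts are invented for illustration and are not Steamer's actual values, which are fitted from historical player data. All names are hypothetical.

```python
# Hypothetical per-statistic settings: year weights (most recent first) and how many
# league-average plate appearances to blend in. Noisier stats get more regression.
SETTINGS = {
    "strikeout_rate": {"weights": (1.0, 0.5, 0.3), "regression_pa": 150},
    "babip": {"weights": (1.0, 0.8, 0.6), "regression_pa": 1000},
}


def project(stat, yearly_rates, yearly_pa, league_rate):
    """Weighted average of past rates, regressed toward the league rate."""
    cfg = SETTINGS[stat]
    weighted_pa = [w * pa for w, pa in zip(cfg["weights"], yearly_pa)]
    observed = sum(rate * wpa for rate, wpa in zip(yearly_rates, weighted_pa))
    total_pa = sum(weighted_pa)
    return (observed + league_rate * cfg["regression_pa"]) / (total_pa + cfg["regression_pa"])


pa = (600, 600, 600)
k_proj = project("strikeout_rate", (0.30, 0.30, 0.30), pa, league_rate=0.22)
babip_proj = project("babip", (0.340, 0.340, 0.340), pa, league_rate=0.300)

# Share of the gap between the player and the league that survives regression.
k_kept = (k_proj - 0.22) / (0.30 - 0.22)
babip_kept = (babip_proj - 0.300) / (0.340 - 0.300)
print(f"Strikeout rate keeps {k_kept:.0%} of the gap, BABIP keeps {babip_kept:.0%}")
```

With these made-up settings, three identical seasons of a high strikeout rate are mostly believed, while three identical seasons of a high BABIP are pulled much closer to league average.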

ZiPS

The last system that we’ll spend time on today is ZiPS, short for the Szymborski Projection System. It was created by current Fangraphs writer Dan Szymborski and officially released in 2004. ZiPS uses a methodology similar to PECOTA’s, developing a baseline for each player and using historical comparison players to determine probable forecasts. The ZiPS database includes every player since the end of the deadball era.

The ZiPS projections are updated daily and available for download at Fangraphs. One of the newer features for projection nerds is that ZiPS now offers 3 years of projections for public consumption at Fangraphs. Compared with single-year projections, this update should prove very valuable for trade and transaction analysis. Also useful is that ZiPS projections are used to develop team standings and playoff odds projections.

Tips and Cautions

Lastly, I think it’s worth laying out some tips and rules of thumb for using and consuming projections. Keeping these in mind will help you better interpret and understand what a projection is telling you and will help you use it properly.

o There is an old saying in statistics that “All models are wrong, but some are useful”. Said more simply, there is no such thing as a “right” projection or a “wrong” projection. What we should focus on is whether a projection is useful and accurate relative to the question we are trying to answer.

o The methods used to develop objective projections cover the full spectrum from simple to highly complex. We frequently hear broadcasters and fans make comments like “Player X is on pace for Y# of home runs.” This is a very simple projection that uses objective data, but it is an incomplete extrapolation of a player’s production to date over a 162-game season. These are often entertaining but are not usually very accurate due to small sample sizes, randomness, and systematic error.

o More data generally leads to more useful projections. On the more complex side of projection systems, a system might include multiple seasons of historical data; multiple algorithms to account for variables like aging, injury risk, luck, and player comparisons; and approaches to develop a range of possible outcomes and a probabilistic estimate of which are most likely to occur. These tend to be more complete and accurate projections. Projections based on small sample sizes, focused on single or non-predictive statistics, or using only recent data points are less complete and contain more error.

o Projections generally have difficulty accurately accounting for unforeseen changes or very low probability events, such as injuries, coach-driven changes to a player’s role, or a player learning a new pitch or making a dramatic swing change. Most complex projection systems attempt to include assumptions about these kinds of variables, but they are very difficult to project accurately and can have dramatic impacts on results.

o It is important to understand that the bottom-line projections published or publicly available for download most often represent the mid-points of a range of possible outcomes. For example, the projections available on Fangraphs are medians, implying approximately 50% of the actual outcomes will be worse than projected and approximately 50% will be better. To illustrate, the 2020 ZiPS projections have only three batters with a mid-point projected batting average over .300 (including Twins second baseman Luis Arraez!), yet the system expects 24 hitters to actually hit above .300 in 2020. For simplicity in communication, though, it’s easier to present a single point estimate.

o Because of the above point, it is important to remember very few projections will be exactly correct. If we project all the players in Major League Baseball, some players will perform better than our projection and some will perform worse.
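The point above about published projections being mid-points can seem paradoxical: how can a system project only a few hitters over .300 while expecting many more to actually get there? The Python sketch below shows the arithmetic under loud assumptions. The league of 150 hitters, their median projections, and the spread of outcomes around each median (a normal distribution with a standard deviation of 25 points of batting average) are all hypothetical and are not taken from ZiPS.

```python
from statistics import NormalDist

# Hypothetical league: median projected batting averages for 150 regular hitters,
# spread evenly from .2300 to .3045 in steps of half a point.
medians = [(2300 + 5 * i) / 10000 for i in range(150)]

# Assumed spread of actual outcomes around each player's median projection.
outcome_sd = 0.025
threshold = 0.300

# Count 1: hitters whose mid-point projection is above .300.
projected_above = sum(1 for m in medians if m > threshold)

# Count 2: expected number of hitters who finish above .300, adding up each
# player's individual chance of beating the threshold.
expected_above = sum(1 - NormalDist(m, outcome_sd).cdf(threshold) for m in medians)

print(f"Hitters projected above .300: {projected_above}")
print(f"Hitters expected to finish above .300: {expected_above:.1f}")
```

The second count is much larger than the first because the many hitters projected in the .270s and .280s each carry a modest chance of a .300 season, and those chances add up. The projection cannot say which of them will do it, only that some will.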

In part 2, we’ll dive into the ZiPS and Steamer projections for 2020 and explore what they have to say about the 2020 Minnesota Twins, including projected standings, playoff odds, highlights on individual players (among them the Twins’ off-season additions), and how some of the top prospects might fare.

A note on sources: The following links and articles were referenced and drawn from frequently throughout this post.

http://m.mlb.com/glossary/projection-systems

https://library.fangraphs.com/getting-started/

https://library.fangraphs.com/principles/projections/

https://www.beyondtheboxscore.com/2016/2/22/11079186/projections-marcel-pecota-zips-steamer-explained-guide-math-is-fun