

Dupps

Can Minor League Performance Predict Major League Success in Baseball?
Jacob Dupps
ECO 461

I. Abstract
Baseball is America’s oldest pastime, and since the early 2000s it has also become America’s most
numbers-driven sport. Because there is no limit on how much each team can spend, wealthy
large-market teams hold a large competitive advantage over small-market teams with limited
payrolls. Aiming to level the playing field, this project attempts to estimate hitters’
major league performance from their statistics in the minor leagues, using data
from the 2022 MLB season. Using weighted On Base Average as the measure of success, it
finds that only a small portion of success in the MLB is explained by past minor
league success.

II. Introduction
In Major League Baseball (MLB), there is no salary cap, leading to a large variation in the
payroll each team pays its players. Along with players receiving more money than ever
before, this variation creates a competitive imbalance in baseball. How can a team that can
only afford to pay its players $50 million compete with teams that are paying their players
upwards of $300 million? There are a few ways, but the primary way is through developing and
acquiring players while they are still advancing through the minor league system. Players are at
their cheapest in the minor leagues, since it is difficult to predict what the future has in store for
a young player. What if a team were able to predict a player’s Major League success before they
reach the MLB? It would be able to trade for players who are expected to have success in the
MLB while they are at their cheapest, building a successful team around inexpensive players. This
would allow teams with fewer resources to defeat teams with far more resources.
Why does winning matter to anyone besides baseball fans? Money. According to a study
done by Nate Silver of Baseball Prospectus in 2006, an additional MLB win generates $740,000
in marginal revenue for the team, while making the playoffs adds an additional $28.9 million in
revenue on top of that. This comes from ticket sales, jersey sales, increased television exposure,
etc. Acquiring better players cheaply by predicting their expected success while they are in the minor
leagues would allow a team to achieve more wins and therefore earn more revenue.
If it is possible to predict Major League success based off minor league performance, it would
change the game.
In baseball, there are multiple levels of Minor Leagues. A recently drafted player begins their
career at the Rookie level, then advances from Single-A (A) to Double-A (AA) to Triple-A
(AAA), then finally to the Major Leagues. This project will look at players’ statistics at AAA
and see if they can predict players’ performance in the MLB. AAA statistics are used in the
project because AAA is the level directly below the MLB, so theory suggests it is best suited to
predict Major League performance.
Success in the MLB can be defined in many ways. Pitchers define success by not allowing
runs, but just like econometrics, there are multiple factors that impact how many runs they allow
that are out of their control and need to be accounted for. Hitters not only hit the ball, but also
play defense, and run the bases. There are hundreds of different statistics to define success in
baseball. To make the definition of “success” simpler, this research will only look at success for
Major League hitters. The main statistic used here to define success for hitters is weighted On Base
Average (wOBA). wOBA is a variation of On-Base Percentage (number of times a player
reaches base / number of plate appearances) that accounts for how a player reaches base.
Obviously, a home run is a better outcome than a walk. On-Base Percentage does not account for
this, but wOBA does. wOBA is also adjusted for the specific season. In some seasons, hitters
perform better relative to pitchers (for a variety of reasons), and in other seasons,
pitchers perform better. wOBA accounts for this variation by using season-specific factors in
its equation. The wOBA equation is:

wOBA = (Walks + Hit by Pitch + Singles Factor x Singles + Doubles Factor x Doubles
+ Triples Factor x Triples + Home Run Factor x Home Runs) / (At Bats + Walks
+ Sacrifice Flies + Hit by Pitch)
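As a rough illustration of the wOBA equation above, the calculation can be sketched in Python. The weights below are hypothetical placeholders, not the actual 2022 factors, and the function and class names are invented for this example. Following the equation as written, walks and hit by pitch enter with a weight of 1; published versions of wOBA also assign season weights to those two events, so they are left adjustable here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeasonFactors:
    """Season-specific factors. All values are hypothetical, for illustration only."""
    single: float = 0.89
    double: float = 1.27
    triple: float = 1.62
    home_run: float = 2.10
    walk: float = 1.0          # unweighted, as in the equation in the text
    hit_by_pitch: float = 1.0  # unweighted, as in the equation in the text


def woba(walks, hbp, singles, doubles, triples, home_runs,
         at_bats, sac_flies, factors=None):
    """Weighted On Base Average following the equation given in the text."""
    f = factors or SeasonFactors()
    numerator = (f.walk * walks + f.hit_by_pitch * hbp
                 + f.single * singles + f.double * doubles
                 + f.triple * triples + f.home_run * home_runs)
    denominator = at_bats + walks + sac_flies + hbp
    if denominator == 0:
        raise ValueError("no qualifying plate appearances")
    return numerator / denominator


# A made-up stat line: 100 at bats, 10 walks, 2 HBP, 1 sacrifice fly.
example = woba(walks=10, hbp=2, singles=20, doubles=5, triples=1,
               home_runs=4, at_bats=100, sac_flies=1)
```

Unlike On-Base Percentage, a plate appearance ending in a home run contributes more to the numerator than one ending in a walk, which is the point of the weighting.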
The 2022 wOBA for hitters will be the dependent variable in the regression. To qualify for
the data set, hitters must have taken at least 100 MLB plate appearances in 2022 and at least 100 plate
appearances at some point in their career in AAA, so that the distribution of players’ wOBA will
be approximately normal. This reduces the small-sample problem within the data. If a player had
five MLB plate appearances and got four hits, their MLBwOBA would be incredibly high, which
would distort the results. The most recent qualifying AAA statistics were used in the
regression (if a player had 100 plate appearances in AAA in 2019, then 100 more AAA plate
appearances in 2021, the 2021 statistics were used). Overall, 400 players qualified and were
therefore used in the study. The average MLB wOBA in 2022 was .30129. For improved readability,
MLBwOBA was multiplied by 1000, as is the standard in baseball, making the average 301.29
points.
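The qualification rule can be sketched as a small filter. This is only an illustration under stated assumptions: the record layout (dictionaries with "year" and "pa" keys) and the function names are hypothetical, and the original data handling may have worked differently.

```python
def most_recent_qualifying_aaa(aaa_seasons, min_pa=100):
    """Return the latest AAA season with at least min_pa plate appearances, or None."""
    qualifying = [s for s in aaa_seasons if s["pa"] >= min_pa]
    return max(qualifying, key=lambda s: s["year"], default=None)


def qualifies(mlb_pa_2022, aaa_seasons, min_pa=100):
    """A hitter needs min_pa MLB plate appearances in 2022 and a qualifying AAA season."""
    has_aaa = most_recent_qualifying_aaa(aaa_seasons, min_pa) is not None
    return mlb_pa_2022 >= min_pa and has_aaa


def to_points(rate):
    """Rescale a rate such as .30129 to 'points' by multiplying by 1000."""
    return rate * 1000


# The example from the text: 100 AAA plate appearances in 2019 and 100 more in 2021.
seasons = [{"year": 2019, "pa": 100}, {"year": 2021, "pa": 100}]
chosen = most_recent_qualifying_aaa(seasons)  # the 2021 season is used
```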
There were five independent variables used in the regression. The first independent variable
was On Base Percentage (OBP). As mentioned before, OBP accounts for a player’s ability to get
on base. Baseball theory suggests OBP is one of the most important statistics for players to focus
on. Since the goal of hitting is to get on base, OBP calculates how successful players are at
accomplishing that goal. OBP is one of the top statistics used to evaluate players, so its inclusion
in this regression is justified. The average OBP in the study was .36778. For the sake of
readability, this number was multiplied by 1000, making the average 367.78 OBP points.
Baseball theory suggests that a player who gets on base at AAA is likely to have success in the
MLB, so the coefficient on this variable is expected to be positive.
The next independent variable is Walk percentage (BBPC). Walk percentage is the number
of times a player walks divided by their plate appearances. It measures a player’s ability to get on
base without having to hit the ball. It is taken as a percentage because if player A walks 20 times
in 100 plate appearances (20%), he is more inclined to walk than player B, who walks 40 times in
500 plate appearances (8%). With raw walk totals as the independent variable, the regression would view
player B as the player who walks more frequently, which is clearly not true. So, walk percentage
is used instead of total walks. The average walk percentage in the study was 9.91%. Baseball
theory suggests that a player who consistently walks in the minor leagues will also consistently
walk in the MLB, so this coefficient is expected to be positive. Increased attention to walk
percentage in evaluating baseball statistics was the centerpiece of the theory shown in the movie
“Moneyball” starring Brad Pitt, which featured the resurgence of the underfunded Oakland
Athletics in the early 2000s due to increased reliance on mathematical formulas in their decision
making.
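The player A and player B comparison can be checked with a few lines of Python. The helper name is hypothetical; the numbers are the ones used in the example above.

```python
def rate_pct(events, plate_appearances):
    """Express a count (walks, strikeouts, home runs) as a percentage of plate appearances."""
    return 100.0 * events / plate_appearances


player_a_bbpc = rate_pct(20, 100)   # 20 walks in 100 plate appearances
player_b_bbpc = rate_pct(40, 500)   # 40 walks in 500 plate appearances

# Raw totals rank player B first; rates rank player A first.
b_leads_on_totals = 40 > 20
a_leads_on_rate = player_a_bbpc > player_b_bbpc
```

The same helper applies to strikeout percentage and home run percentage, which are rates for the same reason.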
The third independent variable is strikeout percentage (KPC). Strikeout percentage is the
number of times a player strikes out divided by their number of plate appearances. The average
strikeout percentage in the data was 19.72%. It is taken as a percentage for the same reason as
walk percentage. Baseball theory suggests that a player who strikes out at a high rate in AAA
will struggle in the MLB because the quality of pitching is higher in the MLB than in AAA, so the
expected coefficient for this variable is negative.
The fourth independent variable is age, which is measured as the age a player was during
their last AAA season. The age of a AAA player is very important. Baseball theory suggests that the
older a player is in AAA, the less success they will have in the MLB. Many of the better players
reach AAA at a younger age because they performed well at the lower levels of the minor leagues,
so the coefficient for this variable is expected to be negative. The average age of a AAA
player in the study was 24.54 years. Once again, the AAA statistics used were from the
players’ most recent AAA season, so if a player played in AAA at age 23 in 2019 and age 25 in
2021, their age was recorded as 25.
The last independent variable is home run percentage (HRPC). This is measured as the percentage of
plate appearances in which a player hits a home run during their most recent AAA season. It is calculated by
dividing home runs by plate appearances. As with walk percentage and strikeout percentage, a rate is
used because a player who hit 10 home runs in 400 plate appearances (2.5%) is not a superior home run hitter
when compared to a player with 5 home runs in 100 plate appearances (5%), so the regression
should be adjusted for that. Baseball theory suggests that a player who hits a lot of home runs in
AAA would find success in the major leagues due to the player having great power and the
increased emphasis put on home run hitting in the MLB. Because of this theory, I expect the
home run percentage coefficient to be positive. The average home run percentage was 3.71%.
There were many possible independent variables to be used in this regression, but theory
strongly suggested the use of these five independent variables.

III. Review of Previous Literature


“Using MiLB Stats to Predict MLB wOBA” was published on RPubs (hosted by RStudio) and written
by Alex DaSilva, who was finishing his PhD at Dartmouth College at the time of this article.
In the article, DaSilva attempts to predict Major League success using players’ AAA stats. He
measures success by using weighted On Base Average (wOBA). The dependent variable DaSilva
uses is the difference between a player’s AAA wOBA and their MLB wOBA (AAAwOBA -
MLBwOBA). He expects the dependent variable to be positive because players are more successful at
the less competitive AAA level than in the more competitive MLB.
To make the regression more accurate, DaSilva also limits the sample to hitters who have
100 plate appearances in both AAA and the MLB from the 2015-2021 seasons. After applying
these restrictions, 574 players qualify for his data set.
Early in the article, DaSilva identifies 17 potential independent variables, then measures the
correlations between them. He finds quite a bit of correlation between some of the variables, so
he narrows it down to four independent variables for the regression. The four variables are
On-Base Percentage (OBP), Isolated Power (ISO), weighted Runs Created Plus (wRC+), and Home
Runs per Fly Ball (HR/FB). OBP, as I mentioned earlier, is the number of times a player reaches
base divided by total plate appearances. OBP is included in my final regression because baseball
theory strongly suggests that it affects how a player performs at the MLB level. HR/FB is very
similar to the home run percentage used in my regression. It is calculated by dividing the number
of home runs a player hit by the number of fly balls the player hit. I chose
to use home run percentage because it is easier to compute and understand while measuring
virtually the same thing. I did not use ISO because it caused some of the coefficients in my
regression to have the opposite signs from expected. It also correlated too strongly with home
run percentage, since home runs are a large part of ISO. I also didn’t use wRC+ because it is a
relatively new statistic, so it was not always calculated for some of the older players in the
regression.
DaSilva gathered the data and ran the regression. The equation reads:

AAAwOBA - MLBwOBA = 0.0189 - (0.1609*OBP) - (0.0873*ISO) + (0.0569*wRC+) + (0.0262*HR/FB)

DaSilva expected the wOBA difference to be a positive number, since it is subtracted from
players’ AAA wOBA to determine their predicted MLB wOBA. One would expect a player to
play worse in the MLB than in AAA due to an increase in competition. The coefficients for OBP
and ISO are negative, which is surprising at first. But this is because OBP and ISO cannot
change without a change in wRC+ and HR/FB, which is another reason why I used different
independent variables. The OBP coefficient means that as OBP goes up by .001 (a realistic increase), the
difference between AAA and MLB wOBA would decrease by .0001609, keeping the other
independent variables constant. The HR/FB coefficient means that as HR/FB goes up by .001, the
difference between AAA and MLB wOBA would increase by .0000262, keeping the other
independent variables constant.
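The marginal effects in this paragraph are simply each coefficient multiplied by the size of the change. A minimal sketch, using the coefficients from DaSilva’s equation with the OBP and ISO coefficients negative as described in the text (the dictionary and function names are hypothetical):

```python
# Coefficients from DaSilva's equation; OBP and ISO are negative per the text.
DASILVA_COEFS = {"OBP": -0.1609, "ISO": -0.0873, "wRC+": 0.0569, "HR/FB": 0.0262}


def change_in_gap(variable, delta):
    """Predicted change in (AAAwOBA - MLBwOBA) when one variable moves by delta,
    holding the other independent variables constant."""
    return DASILVA_COEFS[variable] * delta


obp_effect = change_in_gap("OBP", 0.001)     # 0.001 * -0.1609
hrfb_effect = change_in_gap("HR/FB", 0.001)  # 0.001 * 0.0262
```

A .001 increase in OBP therefore shrinks the predicted gap by .0001609, while a .001 increase in HR/FB widens it by .0000262.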
This regression is relevant because it uses wOBA as its dependent variable, something I
modeled my regression after. Rather than predict the difference between AAA wOBA and MLB
wOBA, I will attempt to predict a player’s MLB wOBA using their AAA statistics.
Another related article was written by Chris Mitchell, a writer for FanGraphs, a large sports
media company focused on producing analytical articles for sports fans. Mitchell is an expert
in R coding, attended Harvard, and currently works for the MLB’s Minnesota Twins. In the
article, Mitchell attempts to predict the likelihood that a AAA player reaches the MLB (defined as
playing in at least one MLB game), not necessarily whether the player has success there. Mitchell
wrote this article in 2014, so he used data from the 1995-2011 AAA seasons. Players must have at
least 400 minor league plate appearances (at any level) to qualify for the sample.
The dependent variable is the percent chance a AAA player makes it to the MLB. As independent
variables, Mitchell used ISO, Strikeout Rate, Walk Rate, BABIP, Age, and whether a player is ranked as a top 100
prospect by Baseball America. From this equation, I used strikeout rate, walk rate, and age,
which were all explained earlier (ISO was also explained earlier). BABIP is the percentage of
balls put into play that result in a hit. BABIP is (Hits-Home Runs)/(At Bats-Strikeouts-Home
Runs). I did not include BABIP because it was too strongly correlated with On Base Percentage,
since they both measure similar things. The last independent variable is whether a player is
ranked as a top 100 prospect by Baseball America, a baseball magazine. This independent
variable was added because it accounts for the human element of how players are evaluated,
and it is the only one that a player cannot control. I did not include Baseball America top 100
prospect ranking because it changes multiple times per year, making it difficult to track. Also,
players cannot control it, and I am more focused on the variables players can control.
The regression equation is:

% chance a AAA player makes the MLB = 8.68 - (3.64*Strikeout Rate) + (2.78*Walk Rate)
+ (7.78*BABIP) - (8.1*ISO) - (0.62*Age) + (0.96*BA Top 100 Prospect) + (0.008*I[Age^2])
+ (0.5*ISO:Age)

During his research, Mitchell realized that the effect of ISO depended on age, meaning
that it is more important for old players to hit for power than young players. To account for that,
he added the ISO:Age interaction variable. I did not include ISO, so this was not relevant to my
regression. The strikeout rate coefficient means that as strikeout rate increases by .01, the percent
chance of a player making it to the MLB decreases by .0364, keeping all other independent
variables constant. The walk rate coefficient means that as walk rate increases by .01, the percent
chance of a player making it to the MLB increases by .0278, keeping all other independent
variables constant. The age coefficient means that as a player’s age increases by one year, the percent chance
of a player making it to the MLB decreases by .62, keeping all other independent variables
constant.
This study is useful because teams are always trying to predict if a player will make the
MLB. If you get a player that never makes it to the MLB, then they have no value to your team.
I’m using wOBA as my dependent variable because whether a player makes the MLB is a
dummy variable, and we shouldn’t be predicting dummy variables for our project. The
independent variables, however, are useful because the players with the higher chances of
making the MLB are usually the players who have success in the MLB once they reach it, so this
regression and my project are estimating similar things.

IV. Regression Results


Although four of the coefficients are significant, there are a few problems with the
regression. One of the weaknesses of this regression is that it does not account for defense or
speed, which are both large aspects of a player’s performance. Also, some great players don’t
qualify for the regression because they may not spend 100 plate appearances at AAA. If teams
think the player is good enough to help them in the MLB immediately, they won’t keep them at
AAA for very long. A few players skipped AAA altogether. These players aren’t accounted for
in this regression. Again, for the sake of readability, the MLBwOBA and OBP variables were
multiplied by 1000, and the BBPC, HRPC and KPC variables were multiplied by 100. For the
regression, all the independent variables were kept as linear because the scatterplot and line of
best fit for each variable suggested a linear relationship between the variable and MLBwOBA.

Dependent Variable = MLBwOBA

Variable                   Coefficient   T-Statistic   Probability
Constant                   351.6***      11.7          0.0000
On Base Percentage         0.02          0.4           0.7
Strikeout Percentage       -0.86**       -2.2          0.0254
Walk Percentage            1.28**        1.89          0.0599
Age                        -2.92***      -3.55         0.0004
Home Run Percentage        4.61***       4.19          0.0000

R-Squared                  0.1018
Adjusted R-Squared         0.0904
F-statistic                8.93
Probability(F-Statistic)   0.0000***

** = significant at 5% rejection level   *** = significant at 1% rejection level

Equation:
MLBwOBA=351.6+(On Base Percentage*0.02)-(Strikeout Percentage*0.86)+(Walk
percentage*1.28)-(Age*2.92)+(Home Run Percentage*4.61)
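
To show how the fitted equation is used, here is a small sketch that plugs AAA statistics into it. The function name is hypothetical, and the coefficients are the rounded values from the equation above, so results are approximate. Inputs follow the scaling used in the regression: OBP in points (multiplied by 1000) and the three rates in percent.

```python
# Rounded coefficients from the fitted equation in the text.
COEFS = {"const": 351.6, "obp": 0.02, "kpc": -0.86,
         "bbpc": 1.28, "age": -2.92, "hrpc": 4.61}


def predict_mlb_woba(obp_points, kpc, bbpc, age, hrpc):
    """Predicted MLB wOBA (in points) from a hitter's most recent AAA statistics."""
    return (COEFS["const"]
            + COEFS["obp"] * obp_points
            + COEFS["kpc"] * kpc
            + COEFS["bbpc"] * bbpc
            + COEFS["age"] * age
            + COEFS["hrpc"] * hrpc)


# Evaluated at the sample averages reported in the text.
at_means = predict_mlb_woba(obp_points=367.78, kpc=19.72, bbpc=9.91,
                            age=24.54, hrpc=3.71)
```

A least-squares line passes through the sample means, so the prediction at the average AAA profile should land close to the average MLBwOBA of roughly 301 points; the small gap comes from rounding in the reported coefficients.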

Four of the independent variables were significant at the 5% level of significance. For On
Base Percentage (OBP), the null hypothesis is that the OBP coefficient is less than or equal to
zero, with the alternative being it is greater than zero. The coefficient of OBP was .023, with a
p-value of .35 (.70/2). Since all these coefficients have expected signs, the tests are one-tailed,
meaning the two-tailed p-value is divided by 2. Since our p-value is above .05, we fail to reject the null
hypothesis at the 5% significance level. This means that there is not
enough evidence to conclude that an increase in AAA OBP would increase MLBwOBA, holding all
other independent variables constant.
The next independent variable is strikeout percentage (KPC). The null hypothesis is that the
strikeout percentage coefficient is greater than or equal to zero, while the alternative hypothesis
is that it is less than zero. The strikeout percentage coefficient is -.86, with a p-value of .0127
(.0254/2). This p-value is below our critical p-value of .05, so we reject the null hypothesis. This
means that, with 95% confidence, an increase in AAA strikeout percentage decreases MLBwOBA,
holding all other independent variables constant. The real meaning of the coefficient
is that as a player’s AAA strikeout percentage increases by one percentage point, MLBwOBA
decreases by .86 points, holding all other independent variables constant.
The third independent variable is walk percentage (BBPC). The null hypothesis is that the
walk percentage coefficient is less than or equal to zero, with the alternative being it is greater
than zero. The coefficient for walk percentage is 1.28, with a p-value of .03 (.0599/2), so we
reject the null. This means that, with 95% confidence, an increase in a player’s walk
percentage increases their MLBwOBA. The real meaning of the coefficient is that as walk
percentage increases by 1 percentage point, MLBwOBA increases by 1.28 points, holding all
other independent variables constant.
The next independent variable is the player’s age during their last season in AAA. Since the
coefficient was expected to be negative, the null hypothesis is that the age coefficient is greater
than or equal to zero, with the alternative being it is less than zero. The coefficient for this
variable is -2.92, with a p-value of .0002 (.0004/2), so I reject the null hypothesis. This means
that, with 95% confidence, I conclude that as the age of a player in AAA increases, MLBwOBA
decreases. Since the p-value is less than .01, we can also reject the null hypothesis with 99%
confidence. The real meaning of the coefficient is that as a player’s age in AAA increases by one year,
their expected MLBwOBA decreases by 2.92 points, holding all other independent variables
constant.
The final independent variable is home run percentage in AAA. Since we expect this
coefficient to be positive, the null hypothesis is that the home run percentage coefficient is less
than or equal to zero, with the alternative being it is greater than zero. The coefficient for home
run percentage is 4.61, with a p-value of 0.000, so I reject the null hypothesis. This means that,
with 95% confidence, I conclude that as a player’s AAA home run percentage rises, their
MLBwOBA also rises. Since this p-value is also less than .01, we can reject the null hypothesis
with 99% confidence as well. The real meaning of the coefficient is as AAA home run
percentage increases by one percentage point, MLBwOBA increases by 4.61 points, holding all
other independent variables constant.
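The one-tailed p-values used in these tests can be reproduced from the two-tailed probabilities in the results table. This is a minimal sketch with a hypothetical function name. It also covers the case, which does not arise here, where an estimate has the opposite sign from the alternative hypothesis.

```python
def one_tailed_p(two_tailed_p, coefficient, expected_sign):
    """Convert a two-tailed p-value to a one-tailed p-value.

    expected_sign is +1 if the alternative hypothesis is 'greater than zero'
    and -1 if it is 'less than zero'.
    """
    half = two_tailed_p / 2
    matches_alternative = (coefficient > 0) == (expected_sign > 0)
    return half if matches_alternative else 1 - half


# (two-tailed p, coefficient, expected sign) taken from the results table.
tests = {
    "OBP":  (0.7,    0.02,  +1),
    "KPC":  (0.0254, -0.86, -1),
    "BBPC": (0.0599, 1.28,  +1),
    "Age":  (0.0004, -2.92, -1),
}
one_tailed = {name: one_tailed_p(*args) for name, args in tests.items()}
significant_at_5pct = {name for name, p in one_tailed.items() if p < 0.05}
```

With these inputs, strikeout percentage, walk percentage, and age clear the 5% threshold, while On Base Percentage does not.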
The R-squared for the regression is .1018, which is very low. This means that 10.18% of the
variation in MLBwOBA is explained by a player’s On Base Percentage, Walk Percentage,
Strikeout Percentage, Age and Home Run Percentage. The adjusted R-squared is 0.0904, which
means that 9.04% of the variation in MLBwOBA is explained by variations in a player’s OBP,
BBPC, KPC, Age, and HRPC after adjusting for the number of independent variables in the
model. The R-squared and adjusted R-squared are so low because we are dealing with human
beings and their performance. Every human is unique and performs differently depending on
many factors, so it is impossible to quantify all the factors that impact human performance,
which leaves the R-squared and adjusted R-squared very low.
The F-statistic for the regression is 8.93 with a p-value of approximately 0 (reported as 0.0000). This
means that, with 99% confidence, the independent variables are jointly significant in explaining our
dependent variable. In other words, On Base Percentage, Strikeout Percentage, Walk Percentage, Age, and Home Run
Percentage combined have a statistically significant impact on MLBwOBA.
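The adjusted R-squared and the F-statistic can both be recovered from the R-squared, the sample size, and the number of independent variables using the standard textbook formulas. The sketch below assumes all 400 qualifying players entered the regression (n = 400) with k = 5 independent variables; the function names are hypothetical.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R-squared for n observations and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def overall_f_statistic(r2, n, k):
    """F-statistic for the joint significance of all k independent variables."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))


R2, N, K = 0.1018, 400, 5  # N = 400 is an assumption based on the sample described in the text

adj_r2 = adjusted_r_squared(R2, N, K)
f_stat = overall_f_statistic(R2, N, K)
```

Under that assumption, both values round to the figures reported in the results table (0.0904 and 8.93).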
Another part of this project is determining the most important of the independent variables in
explaining the fluctuations in MLBwOBA. To do that, I analyzed the standardized coefficients to
see which one is the largest in absolute value. The standardized coefficient for home run percentage is the highest
(0.23), which means that it is the most important independent variable. The real meaning of the
0.23 is that a one standard deviation change in home run percentage changes MLBwOBA by 0.23
standard deviations. It is surprising that home run percentage is the most important variable
because theory suggests that either age or on base percentage would be the most important.
Hitting a home run is only one part of a player’s game, but clearly it is very important. After
seeing these numbers, one might suggest that a AAA player focus on hitting more home runs,
since it is the most important factor (out of the variables in the model) in determining major
league success.

Dependent variable: MLBwOBA

Variable               Standardized Coefficient
On Base Percentage     0.02
Strikeout Percentage   -0.13
Walk Percentage        0.12
Age                    -0.17
Home Run Percentage    0.23
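
A standardized coefficient is the ordinary coefficient rescaled by the ratio of the independent variable’s standard deviation to the dependent variable’s standard deviation. The sketch below applies that standard formula to the rounded coefficients from the regression and the standard deviations from the summary data in Section VII; the variable and function names are hypothetical.

```python
SD_MLBWOBA = 40.22  # standard deviation of the dependent variable

# variable: (unstandardized coefficient, standard deviation of the variable)
INPUTS = {
    "OBP":  (0.02,  42.80),
    "KPC":  (-0.86, 6.00),
    "BBPC": (1.28,  3.64),
    "AGE":  (-2.92, 2.35),
    "HRPC": (4.61,  1.98),
}


def standardized(coef, sd_x, sd_y=SD_MLBWOBA):
    """Standardized coefficient: effect in SDs of y of a one-SD change in x."""
    return coef * sd_x / sd_y


betas = {name: standardized(b, sd) for name, (b, sd) in INPUTS.items()}
most_important = max(betas, key=lambda name: abs(betas[name]))
```

Rounded to two decimals, these reproduce the table above, with home run percentage having the largest absolute value.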

V. Conclusion
The model does not predict a large percentage of MLBwOBA. The results indicate that increases
in AAA home run percentage and walk percentage increase MLBwOBA, while a player being
older in AAA and an increase in strikeout percentage decrease their MLBwOBA for 2022. The
independent variables explain only a small percentage of the variation in MLBwOBA. None
of this is surprising. All four statistically significant variables were significant in their
expected direction. The R-squared is so low because it is very difficult to predict future baseball
performance. Teams pay whole analytics departments to come up with formulas to predict
players’ future performance, but none of that is foolproof. At the end of the day, the future is still
unpredictable.
What was surprising, however, was that On Base Percentage was not statistically significant.
OBP is considered one of the most important statistics in determining a player’s future, but this
study would suggest that it is not very significant. Its inclusion in the regression was necessary
based on theory, yet it did not significantly impact MLBwOBA. I was also surprised by the
magnitude of the HRPC coefficient. I was not expecting it to have a larger impact than both
strikeout percentage and the walk percentage.
Overall, the regression results are a mixed bag. Four of the five independent variables are
significant in the expected direction, which indicates a valid regression. Also, the F-statistic is
highly significant, which also indicates a valid regression (when the individual variables are
significant, the F-statistic is usually significant as well). The R-squared and adjusted R-squared, however,
are very low, which indicates the regression does not account for many variables that cause
fluctuation in a player’s Major League performance. This regression is useful because it
identifies a few major drivers of MLB performance, but it leaves out quite a few important
variables. It is possible to expand upon this equation (by using more independent variables) in
hopes of accounting for more of the variation in MLBwOBA. As it stands, the regression explains
only a small share of Major League performance using minor league statistics.

VI. Bibliography
Silver, Nate. “Is Alex Rodriguez Overpaid?” Baseball Between the Numbers, Baseball
Prospectus, New York City, NY, 2006.

DaSilva, Alex. “Using MiLB Stats to Predict MLB WOBA.” Rpubs, 9 Nov. 2021,
https://rpubs.com/alexdasilva/PredictMLBwOBA.

Mitchell, Chris. “Using Triple-A Stats to Predict Future Performance.” Fangraphs Community,
Fangraphs, 27 July 2014,
https://community.fangraphs.com/using-triple-a-stats-to-predict-future-performance/.

VII. Data
Here is some summary data for both the dependent variable and the independent variables.

                     MLBwOBA   OBP      KPC     BBPC    AGE     HRPC
Mean                 301.27    368.78   19.72   9.91    24.54   3.71
Median               303       366      19.56   9.55    24      3.56
Maximum              472       525      37.69   21.05   34      10
Minimum              183       253      3.62    2.31    19      0
Standard Deviation   40.22     42.80    6       3.64    2.35    1.98