
Abstract:

For offenses in baseball, the object of the game is to score runs. Understanding factors
which may help or hurt a team in its pursuit of runs is key to maximizing run output. This
analysis uses a variety of regression methods in order to better understand these factors. The
conclusion is that hits and walks positively affect run output, whereas stolen bases,
grounding into double plays, and leaving runners on base negatively affect run output. Extra
bases and strikeouts are found to be nonfactors.
Introduction
Baseball is America's original pastime and first professional team sport. The game
consists of nine innings in which each team bats until it accrues three outs. The away team bats
in the first or top half of the inning, and the home team bats in the second or bottom half of
the inning. Each team attempts to score as many runs as possible before accruing three outs.
The team with the most runs at the end of nine innings is the winner.
Because runs are at the heart of baseball, understanding factors which contribute either
positively or negatively to scoring runs is an important part of the game. Seven factors are
considered in this project: Hits, Stolen Bases (SB), Bases on Balls or Walks (BB), Strikeouts
(SO), Extra Bases (XBase), Grounded into Double Plays (GDP), and Runners Left on Base
(LOB). Hits are batted balls which land safely in the field of play resulting in the batter reaching
at least first base. A runner is credited with a stolen base when they advance to the next base
independent of a hit or an error. When the pitcher throws the ball to the catcher, the ball crosses
home plate either in the strike zone or outside of the strike zone. If the ball is in the strike zone,
it is a strike. If the ball is outside of the strike zone, it is a ball. In each plate appearance, a
maximum of four balls and three strikes can be accrued. If four balls are accrued, the plate
appearance results in a walk or base on balls, and the batter advances to first base. If three
strikes are accrued, the plate appearance results in a strikeout, and the batter is out. Each hit
results in the batter reaching at least first base, and each additional base the batter reaches as a
result of their hit is called an extra base. Doubles, triples, and home runs result in one, two, and
three extra bases, respectively. If a batter puts a ball into play and it results in two outs, it is
referred to as a double play. If that double play is the result of a ball hit on the ground, the
batter is said to have grounded into a double play. Runners who are still on base when their team accrues its third
out are said to be runners left on base.
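The extra-base definition above can be written as a short calculation. The following is a minimal sketch; the function name and the sample counts are hypothetical and are not taken from the 2015 data.

```python
def extra_bases(doubles: int, triples: int, home_runs: int) -> int:
    """Bases gained beyond first base: 1 per double, 2 per triple, 3 per home run."""
    return 1 * doubles + 2 * triples + 3 * home_runs

# Hypothetical counts, made up for illustration only.
print(extra_bases(doubles=30, triples=5, home_runs=20))  # 30 + 10 + 60 = 100
```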
This project endeavors to find out if a relationship exists between runs and the variables
previously stated. If a relationship does exist, the nature of such a relationship will be explored.
In order to explore this potential relationship, data from the 2015 Major League Baseball season
will be analyzed.
Methods
Methods employed in this analysis include forward selection, backward elimination,
stepwise selection, the adjusted $R^2$ criterion, Mallows' $C_p$ criterion, and the $AIC_p$ criterion.
Forward selection involves starting with one or a few variables in a model and evaluating the
benefit of adding a new variable. Backward elimination involves starting with a model with
several variables and eliminating variables which appear to be insignificant. Stepwise selection
is a combination of the previous two methods. It involves starting with what appears to be the
most significant variable, adding the next most significant variable to the model, then testing
whether any variable already in the model may be removed. The process then begins again
with the next most significant variable, and it continues until no variable can be added or removed.
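The forward step of these procedures can be sketched as follows. This is an illustrative sketch only: the partial F-to-enter threshold of 4.0, the function names, and the synthetic data are assumptions, not the settings or data used in this analysis.

```python
import numpy as np

def sse(X, y, cols):
    """Error sum of squares for a least-squares fit of y on the given columns plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def forward_select(X, y, f_to_enter=4.0):
    """Greedy forward selection using a partial F test (threshold is an assumed value)."""
    n, k = X.shape
    chosen = []
    current = sse(X, y, chosen)              # intercept-only model
    while len(chosen) < k:
        trial = {j: sse(X, y, chosen + [j]) for j in range(k) if j not in chosen}
        best = min(trial, key=trial.get)     # candidate giving the smallest SSE
        df_resid = n - (len(chosen) + 2)     # intercept + chosen + candidate
        f_stat = (current - trial[best]) / (trial[best] / df_resid)
        if f_stat < f_to_enter:
            break                            # no remaining candidate is worth adding
        chosen.append(best)
        current = trial[best]
    return chosen

# Synthetic data: only columns 0 and 2 truly drive the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=30)
selected = forward_select(X, y)
print(selected)
```

Stepwise selection would add a removal check after each addition; backward elimination would start from the full column list and drop the weakest variable instead.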
Adjusted $R^2$ measures the proportion of variation in the response explained by the regression
function, with a penalty for the number of variables in the model. The adjusted $R^2$ criterion
involves considering models containing all combinations of variables. Models with the highest
adjusted $R^2$ are considered to be best. However, if a model containing more variables has only a
marginally higher adjusted $R^2$ value than a model with fewer variables, the model with fewer
variables should be selected in the interest of efficiency. Mallows' $C_p$ criterion considers the
total mean squared error of the regression model, and by this criterion the smallest value of $C_p$
indicates the best model. The $AIC_p$ criterion considers the error sum of squares of the model
while including a penalty term related to the number of predictors in the model. As with Mallows'
$C_p$ criterion, the smallest $AIC_p$ value indicates the best model.
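For reference, the three criteria can be computed from a model's error sum of squares. The sketch below assumes the common textbook forms, with p counting the intercept as a parameter; the paper does not state which exact forms its software used, and the numbers in the example are hypothetical.

```python
import math

def adjusted_r2(sse_p, ssto, n, p):
    """Adjusted R^2 = 1 - ((n - 1) / (n - p)) * SSE_p / SSTO."""
    return 1.0 - (n - 1) / (n - p) * sse_p / ssto

def mallows_cp(sse_p, mse_full, n, p):
    """C_p = SSE_p / MSE(full model) - (n - 2p)."""
    return sse_p / mse_full - (n - 2 * p)

def aic_p(sse_p, n, p):
    """AIC_p = n ln(SSE_p) - n ln(n) + 2p."""
    return n * math.log(sse_p) - n * math.log(n) + 2 * p

# Hypothetical values for a 5-predictor model (p = 6) fit to n = 29 teams.
n, p = 29, 6
print(adjusted_r2(sse_p=9000.0, ssto=90000.0, n=n, p=p))
print(mallows_cp(sse_p=9000.0, mse_full=380.0, n=n, p=p))
print(aic_p(sse_p=9000.0, n=n, p=p))
```

Under these forms, the full model always has $C_p$ equal to its own parameter count, which gives a quick sanity check on an implementation.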
Results
The preliminary analysis included an examination of plots of the response variable
(Runs) against each of the possible predictor variables. In some of the plots, it appears that
Toronto is an outlier with a value of approximately 900 for runs, for example in this plot of GDP
against Runs:

[Figure: GDP plotted against Runs]

However, in other plots, it is less clear that Toronto is an outlier, for example in this plot of
XBase against Runs:

[Figure: XBase plotted against Runs]

While Toronto appears to be an extreme case, its influence is unclear. As a result,
influence diagnostics were analyzed to determine if Toronto should be dropped from the analysis. When
considering deleted studentized residuals, hat diagonal values, Cook's distance, and DFFITS, the
answer is clear (see Appendix): Toronto is both an outlier and influential. Consequently, Toronto
was dropped from the analysis.
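The four diagnostics named above can all be computed from a single least-squares fit. The following sketch uses the standard closed-form expressions on synthetic data with one planted extreme case; the function name and the data are assumptions, and it does not reproduce the Appendix values.

```python
import numpy as np

def influence_measures(X, y):
    """Return hat diagonals, deleted studentized residuals, DFFITS, and Cook's distance."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])     # design matrix with intercept
    p = A.shape[1]
    H = A @ np.linalg.pinv(A)                # hat matrix A (A'A)^-1 A'
    h = np.diag(H)
    e = y - H @ y                            # ordinary residuals
    sse = float(e @ e)
    mse = sse / (n - p)
    t = e * np.sqrt((n - p - 1) / (sse * (1 - h) - e**2))  # deleted studentized residuals
    dffits = t * np.sqrt(h / (1 - h))
    cooks_d = e**2 * h / (p * mse * (1 - h) ** 2)
    return h, t, dffits, cooks_d

# Synthetic data; observation 0 is deliberately made extreme in both X and y.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
y = 1.0 + 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.3, size=30)
X[0] = [4.0, 4.0]
y[0] = 13.0
h, t, dffits, cooks_d = influence_measures(X, y)
print(int(np.argmax(cooks_d)))               # index of the most influential case
```

A case like Toronto would be flagged when it stands out on several of these measures at once, not on any single one.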
To check for multicollinearity, the correlation matrix of the predictors was analyzed. The largest
correlation coefficient is 0.59158, which is between LOB and BB. All other correlation coefficients
are below 0.5, so it is unlikely that multicollinearity will be a problem.
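A check of this kind can be sketched as a scan of the correlation matrix for its largest off-diagonal entry. The column names and the synthetic data below are placeholders, not the 2015 team statistics.

```python
import numpy as np

def largest_correlation(X, names):
    """Return the pair of columns with the largest absolute pairwise correlation."""
    R = np.corrcoef(X, rowvar=False)         # columns are variables
    off = np.abs(R)
    np.fill_diagonal(off, 0.0)               # ignore each variable's correlation with itself
    i, j = np.unravel_index(np.argmax(off), off.shape)
    return names[i], names[j], float(R[i, j])

# Synthetic predictors: A and B are built to be correlated, C is independent.
rng = np.random.default_rng(2)
a = rng.normal(size=200)
b = 0.8 * a + 0.6 * rng.normal(size=200)
c = rng.normal(size=200)
pair = largest_correlation(np.column_stack([a, b, c]), ["A", "B", "C"])
print(pair)
```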

The next step in the analysis was to select the variables which should be present in the
model. First, the adjusted $R^2$, Mallows' $C_p$, and $AIC_p$ criteria were considered. The
adjusted $R^2$ criterion indicates that the best model is a reduced model which includes Hits,
SB, BB, SO, GDP, and LOB. However, it is only marginally better than a reduced model which
excludes SO. The same is true for the $AIC_p$ criterion. However, Mallows' $C_p$ criterion
indicates that the model with 6 variables may be quite a bit better than the reduced model.

Because the measures for the adjusted $R^2$ and $AIC_p$ criteria did not entirely line
up with the results for Mallows' $C_p$, it seemed reasonable to further explore whether the model
with SO should be used or the reduced model with 5 predictors should be used. In order to decide,
stepwise selection was employed, and it supported the reduced model with 5 predictors.

Having concluded that the model should include the variables Hits, SB, BB, GDP, and
LOB, interaction and quadratic terms were considered. To evaluate these terms, t-tests were
used to assess the benefit of adding each potential term to the model, given the presence of
the other five predictors, at a significance level of 0.05.
No interaction or quadratic terms were significant (see Appendix). This is not altogether
surprising because of the scale of the predictor variables as compared with the scale of the
modeled Runs variable. With the exception of SB and GDP, the mean value for all predictor
variables is considerably greater than the mean value for runs. This means that the coefficient
on the product of any two of these variables would necessarily be a very small fraction and
therefore close to 0.

In the end, the final model was:

$$\text{Runs}_i = 25.1574 + 1.02822\,\text{Hits}_i - 0.38672\,\text{SB}_i + 0.97303\,\text{BB}_i - 1.69875\,\text{GDP}_i - 0.86286\,\text{LOB}_i$$
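As an illustration, the fitted equation can be evaluated for a single team. In this sketch the negative signs on SB, GDP, and LOB are taken from the relationships described in the Conclusion, and the input season totals are hypothetical, not an actual team's line.

```python
def predict_runs(hits, sb, bb, gdp, lob):
    """Evaluate the final model; coefficients are those reported for the 2015 fit."""
    return (25.1574
            + 1.02822 * hits
            - 0.38672 * sb      # sign inferred from the stated negative relationship
            + 0.97303 * bb
            - 1.69875 * gdp     # sign inferred from the stated negative relationship
            - 0.86286 * lob)    # sign inferred from the stated negative relationship

# Hypothetical season totals, made up for illustration only.
print(round(predict_runs(hits=1400, sb=80, bb=500, gdp=120, lob=1100), 2))
```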

To test the model, it was applied to data from the 2014 MLB
season. When modeling runs using the 2014 data, SSE = 8661.13 and SSTO = 91726.3.
In other words, the model fit the new data with $R^2$ = 0.9056.
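The reported fit follows directly from the two sums of squares, using the usual definition of the coefficient of determination:

```latex
R^2 = 1 - \frac{SSE}{SSTO} = 1 - \frac{8661.13}{91726.3} \approx 1 - 0.0944 = 0.9056
```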

Conclusion
The analysis shows a positive relationship between runs and the predictors hits and
walks. On the other hand, the analysis shows a negative relationship between runs and the
predictors stolen bases, grounded into double plays, and runners left on base. The only
surprising result here is the negative relationship between runs and stolen bases. There has
been a shift in the past few years with regard to stolen bases. Whereas old-school baseball
minds view the stolen base in a positive light, many newer minds in baseball see the stolen
base as too risky to be beneficial in the long run. This analysis corroborates that view. Another
interesting result is the lack of a relationship between runs and extra bases. This indicates that
hitting doubles, triples, and home runs is not significantly more important than simply reaching
first base, which is interesting and may fly in the face of conventional wisdom.
Future analyses and research should include more variables including some traditional
statistics: batting average, on-base percentage, and slugging percentage. In addition to traditional
statistics, some new-school statistics should be considered, such as wins above replacement and park factor. It
may also be advantageous to include data from multiple seasons to build a more accurate
model.

Appendix
Original Data

Tests For Adding Interaction Terms

Tests For Adding Quadratic Terms

Studentized Residuals

2014 Results
