
## A Hidden Markov Model of Run Scoring in Baseball

### Abstract

The goal of a baseball team is to score more runs (points) than its opponent. Accurately modeling and predicting the run scoring process for a given team can help that team’s manager decide whether he is going to achieve this goal. We design a Hidden Markov Model (HMM) where the state observations are runs scored, and we simulate it to show that it is close to a complete model of the actual run scoring process. We compare its resultant accuracy to mathematical formulas (e.g. linear regression) that estimate a team’s runs scored (dependent variable) based on relevant statistical data about the team’s actual performance (independent variables). Finally, we propose improvements to the HMM based on the incorporation of minor (yet important) factors that impact the run scoring process. We also discuss some theoretical and practical applications of this HMM, including player evaluation and prediction of future performance.

### Introduction

Baseball, like any game, is full of uncertainty. Will the batter hit a homerun or a single (or something else)? How many runs will a certain team score? I will attempt to answer the second question, which depends in part on the first (and more fundamental) question, with a Markov model of the run scoring process. Formally, we can describe this model as a Hidden Markov Model (HMM) where the observation in a given state is the number of runs scored in that state. For those who know little (or nothing) about the game of baseball, I will soon give a descriptive example to illustrate the run scoring process. However, I would first like to present the basic rules and gameplay in baseball:

(Picture from http://en.wikipedia.org/wiki/Baseball)

Baseball is played on a diamond-shaped surface (pictured above) with 4 “bases”, or plates, on its corners. The base “sequence” is Home-1st-2nd-3rd-Home; when a player called the “batter” completes this sequence (by returning Home), his team scores one “run”, or point. The “batter” (or “hitter”) stands in one of the two “batter’s boxes”, both of which are adjacent to Home plate. The pitcher (a player on the opposing team) throws the baseball towards Home plate with the hopes of preventing the batter from getting a “hit”, which is either a single (1B), double (2B), triple (3B), or homerun (HR). The batter advances to 1st base on a 1B, 2nd base on a 2B, 3rd base on a 3B, and Home on a HR. A HR automatically scores one run for the batter’s team. In addition, a “walk” (BB) or “hit-by-pitch” (HBP) is similar to a 1B in that it also advances the batter to 1st base. A player who advances to a certain base except Home is called a “baserunner”, or a runner on 1st/2nd/3rd base. For further details on the gameplay in baseball, let us turn to an illustration of the run scoring process:

A baseball team’s “lineup” (like the following example) contains nine batters, ordered from 1 to 9. This sequence of batters is repeated as the game moves along. So after batter “x” has completed his “plate appearance”, or opportunity to hit, the next batter is “(x + 1)”, where x is some number from 1 to 8. After the 9th batter, we return to the start of the sequence with the 1st batter.

NY Mets Team Lineup:

1. Edgardo Alfonzo
2. Derek Bell
3. Mike Piazza
4. Todd Zeile
5. Benny Agbayani
6. Robin Ventura
7. Jay Payton
8. Mike Bordick
9. Mike Hampton

The following sequence, called a “game log”, documents the progression of a sample game with the example lineup above. There are usually nine “innings” in a game (here I have displayed the first seven), each beginning with no baserunners, no “outs”, and the first batter due to hit in the batter’s box. In the 1st inning (start of the game), Edgardo Alfonzo, who is 1st in the lineup order, is the first batter. Each batter makes a “play”, which maps to one of seven possible outcomes: BB/HBP, 1B, 2B, 3B, HR, Out, DP. When a batter makes an Out, he does not advance to any base. A DP stands for “Double Play”, or two Outs, which “erases” a baserunner in addition to retiring the batter. A DP can only occur when there is at least one baserunner. After explaining parts of this game log, I will give an example of a DP for clarification.

| Play | Baserunners | Runs |
|---|---|---|
| Todd Zeile made an out. | 1 0 1 | 4 |
| Benny Agbayani hit a single. | 1 1 0 | 5 |
| Robin Ventura hit a triple. | 0 0 1 | 7 |
| Jay Payton made an out. | 0 0 1 | 7 |

In the 1st inning, Alfonzo draws a BB and advances to 1st base, which is represented by “1 0 0” in the Baserunners column. This means that there is a runner on 1st base, and no runner on 2nd or 3rd base after the play (BB). When Derek Bell makes an out, Alfonzo might advance to 2nd base (with a certain probability), but here the out leads to no change. Piazza hits a 1B and Alfonzo advances to 2nd base. The next two batters make outs, and the inning is over. An inning ends when the team makes 3 outs, and we move on to the next inning, which begins with 0 outs. The first batter (Ventura) in the next inning is the one right after the batter who made the last out in the previous inning (Agbayani).

The NY Mets lineup scores 4 runs in the first 6 innings. In the 7th inning, Alfonzo is the first batter again as Hampton (the #9 hitter) makes an out to end the previous inning. Alfonzo hits a 2B and advances to 2nd base. Bell makes an out. Piazza hits a 1B and Alfonzo advances to 3rd base. Usually, a runner on 2nd base will score (advance to Home) on a 1B, but sometimes he is only able to advance to 3rd base. So now there are runners on 1st and 3rd base. Zeile makes an out, and the runners do not advance. There is about a 50/50 chance that a runner on 3rd base will score on an out. In this case, he does not score. Agbayani hits a single and advances to 1st base, and Piazza advances to 2nd base. Alfonzo advances to Home, scoring a run, and is no longer a baserunner. So the Mets have now scored a total of 5 runs. Ventura hits a triple, advances to 3rd base, and both baserunners score to give the Mets 7 runs. Payton makes an out, and the inning is over.

The key point to notice is that baserunners will always advance on a hit by the batter and sometimes advance on an out by the batter. In addition, the runners will advance at least the same number of bases as the batter does. For example, if a batter hits a 2B, a runner on 1st base advances to at least 3rd base (a runner on 2nd or 3rd base scores automatically).

The following is an example of a DP from a different game:

| Inning 3 | Baserunners | Runs |
|---|---|---|
| Mike Hampton hit a single. | 1 0 0 | 0 |
| Edgardo Alfonzo hit into a double play. | 0 0 0 | 0 |
| Derek Bell made an out. | 0 0 0 | 0 |

Alfonzo’s DP erases Hampton from 1st base. Bell makes the third out to end the inning. If a DP occurs when there is more than one baserunner, the runner on 1st base is usually the one who is erased. A DP rarely occurs when there is no runner on 1st base.

Also, note that each play occurs within the context of a certain inning. A DP cannot occur in an inning that has already recorded two outs, because one more out automatically ends the current inning and leads us to the next inning, which begins with no outs.

Now that I have explained the run scoring process, I will describe the probability distribution over the random action “Play”. This probability distribution is specific to a certain player; for example, some players are more likely to hit HRs than others. Recall that plate appearances (PA) are the number of opportunities a batter gets to hit during a season. A season is 162 games; each year there is a new season. As an example:

| Player | Team | Year | PA | BB/HBP | 1B | 2B | 3B | HR | Out | DP |
|---|---|---|---|---|---|---|---|---|---|---|
| Edgardo Alfonzo | NY Mets | 2000 | 650 | 100 | 109 | 40 | 2 | 25 | 362 | 12 |

Since Alfonzo hit a total of 109 singles in 650 PA during the 2000 season, his probability of hitting a single was simply 109/650 = 16.8 %. So Pr(Play = 1B) = 16.8% for Alfonzo in the 2000 season. The open-source database at www.baseball-databank.org provided me with the necessary data to generate probability distributions for each player. The Markov model’s transition probabilities are based in part on these probability distributions. The game logs above are sample results from simulations of this model.
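As a minimal sketch of the calculation just described, the snippet below converts Alfonzo's 2000 play counts (taken from the table above) into a “Play” probability distribution. The names `ALFONZO_2000` and `play_distribution` are illustrative and do not come from the original simulation code.

```python
# Alfonzo's 2000 play counts, copied from the table above.
ALFONZO_2000 = {
    "BB/HBP": 100, "1B": 109, "2B": 40, "3B": 2,
    "HR": 25, "Out": 362, "DP": 12,
}


def play_distribution(counts):
    """Turn raw play counts into Pr(Play) by dividing by total plate appearances."""
    total_pa = sum(counts.values())
    return {play: n / total_pa for play, n in counts.items()}


dist = play_distribution(ALFONZO_2000)
print(sum(ALFONZO_2000.values()))  # 650 plate appearances
print(round(dist["1B"], 3))        # 0.168
```

Because the seven outcomes partition every plate appearance, the resulting probabilities sum to 1.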

### Description of the Hidden Markov Model

Each state is composed of three elements: the situation before the play occurred, the play, and the situation after the play occurred. A situation is described as the batter currently in the batter’s box (referenced by his lineup # between 1-9), the baserunners (e.g. 1 0 1), the current inning (1-9), and the current number of outs recorded in the inning (0, 1, or 2).

Mathematical Description of the Before/After Situation in a State:

Before the play: Lineup #, X baserunners, Inning, N outs

After the play: (Lineup # + 1), X’ baserunners, Inning’, N’ outs

X (or X’) = (n1, n2, n3) where n1 = 1 if runner on 1st (otherwise 0), n2 = 1 if runner on 2nd, etc.

In the After the Play specification, “Lineup # + 1” refers to the next batter in the lineup. If Lineup # = 9, however, the next batter is actually “(Lineup # + 1) modulo 9” because the lineup repeats.
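The wrap-around rule can be sketched in a few lines. The function name `next_lineup_slot` is hypothetical; it uses `slot % 9 + 1`, which handles slots 1 through 8 and the wrap from 9 back to 1 in a single expression.

```python
def next_lineup_slot(slot: int) -> int:
    """Return the lineup number (1-9) of the batter who follows `slot`."""
    if not 1 <= slot <= 9:
        raise ValueError("lineup slot must be between 1 and 9")
    # Slots 1-8 simply increment; slot 9 wraps around to slot 1.
    return slot % 9 + 1


print(next_lineup_slot(1))  # 2
print(next_lineup_slot(9))  # 1
```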

A play (as described earlier) maps to one of these outcomes: BB/HBP, 1B, 2B, 3B, HR, Out, or DP.

As an example, let Before the Play equal “Alfonzo (#1 in lineup), (0, 1, 0), 8th inning, 0 outs”, Play equal “1B”, and After the Play equal “Bell (#2 in lineup), (1, 0, 0), 8th inning, 0 outs”. This state means that Alfonzo was the batter when there was a runner on 2nd base and 0 outs in the 8th inning. Alfonzo hit a single, the runner scored, and Bell was the next hitter with Alfonzo on 1st base and 0 outs in the 8th inning. This state description directly determines how many runs score on one play (in this case, 1 run scored).

Possible Observations:

In a certain state we will observe 0, 1, 2, 3, or 4 runs scored. The probability = 100% for one of these 5 possibilities because each state implies exactly one possibility. 4 runs is the maximum number that can score on one play – a HR with a runner on every base (except Home).

Examples

Observation: 0 runs scored

State: Before equals “Bell (#2 in lineup), (0, 1, 0), 3rd inning, 2 outs”, Play equals “Out”, and After equals “Piazza (#3 in lineup), (0, 0, 0), 4th inning, 0 outs”.

Bell makes an out to end the 3rd inning and so the situation after the play represents the beginning of the 4th inning. No more runs can score in an inning after the third out has been recorded.

---------------------------------------------------------------------------------------------------------------------------------

Observation: 1 run scored

State: Before equals “Bordick (#8 in lineup), (1, 1, 0), 2nd inning, 0 outs”, Play equals “1B”, and After equals “Hampton (#9 in lineup), (1, 1, 0), 2nd inning, 0 outs”.

Bordick hits a single, the runner on 2nd base scores, the runner on 1st base advances to 2nd base, and Bordick advances to 1st base. Thus, we still have runners on 1st and 2nd base after the play.

---------------------------------------------------------------------------------------------------------------------------------

Observation: 2 runs scored

State: Before equals “Ventura (#6 in lineup), (1, 1, 0), 7th inning, 2 outs”, Play equals “3B”, and After equals “Payton (#7 in lineup), (0, 0, 1), 7th inning, 2 outs”.

Ventura hits a triple, the two baserunners before the play score, and Ventura advances to 3rd base.

---------------------------------------------------------------------------------------------------------------------------------

Observation: 3 runs scored

State: Before equals “Agbayani (#5 in lineup), (1, 1, 1), 9th inning, 1 out”, Play equals “2B”, and After equals “Ventura (#6 in lineup), (0, 1, 0), 9th inning, 1 out”.

Agbayani hits a double, all three baserunners before the play are able to score, and Agbayani advances to 2nd base.

---------------------------------------------------------------------------------------------------------------------------------

Observation: 4 runs scored

State: Before equals “Piazza (#3 in lineup), (1, 1, 1), 5th inning, 1 out”, Play equals “HR”, and After equals “Zeile (#4 in lineup), (0, 0, 0), 5th inning, 1 out”.

Piazza hits a homerun, all three baserunners before the play score, and Piazza also scores automatically, thus totaling 4 runs on the play.

Transition Probabilities:

T(S, S’) represents the probability that state S transitions to state S’. T(S, S’) > 0 only if the After situation (After the Play) in S equals the Before situation (Before the Play) in S’. Formally, T(S, S’) = the probability that the Play value in S’ (e.g. 2B) will occur, times the probability that the After situation in S (or the Before situation in S’) will lead to the After situation in S’ given that the Play value in S’ has occurred.

As an example, let S be: Before the Play equals “Hampton (#9 in lineup), (0, 0, 0), 8th inning, 0 outs”, Play equals “2B”, and After the Play equals “Alfonzo (#1 in lineup), (0, 1, 0), 8th inning, 0 outs”. Let S’ be: Before the Play equals “Alfonzo (#1 in lineup), (0, 1, 0), 8th inning, 0 outs”, Play equals “1B”, and After the Play equals “Bell (#2 in lineup), (1, 0, 0), 8th inning, 0 outs”. Notice that the After situation in state S equals the Before situation in state S’. In state S, Hampton hits a double and Alfonzo is now the batter. Next, in state S’, Alfonzo hits a single, Hampton scores, and Bell is the next batter.

The transition probability T(S, S’) = Pr(Play = 1B) x Pr(X’ = (1, 0, 0) | X = (0, 1, 0), Play = 1B), i.e. the probability that Alfonzo hits a single times the probability that the runner on 2nd base (Hampton) advances to Home given that Alfonzo has hit a single. The question remains, how do we calculate this transition probability? This depends on the particular batter involved (in this case, Edgardo Alfonzo). Recall Alfonzo’s 2000 stats, now given below in the form of proportions or probabilities:

| Pr(BB/HBP) | Pr(1B) | Pr(2B) | Pr(3B) | Pr(HR) | Pr(Out) | Pr(DP) |
|---|---|---|---|---|---|---|
| .154 | .168 | .062 | .003 | .038 | .557 | .018 |

The only problem with these probabilities is that a DP (double play) can only occur with fewer than 2 outs (there is a limit of 3 outs per inning) and at least one runner on base (usually 1st base). The other plays can occur in any situation, but their probabilities will change (though their relative proportions will stay the same) when Pr(DP) = 0. Based on batting stats (called “splits”) at www.retrosheet.org, I found that about 60% of any batter’s total PA occur when there are 2 outs or no baserunners (or both). What this means, for example, is that about 60% of Alfonzo’s 650 PA (390) in 2000 occurred when there were either 2 outs or no baserunners, i.e. when Pr(DP) = 0.

For any batter (in this case, Alfonzo), the equation becomes Pr(DP) = (.60 x 0) + (.33 x Y) + (.07 x Z) = .018. ‘Y’ represents the Pr(DP) when there is at least one baserunner, one of whom is on 1st base, and fewer than 2 outs (about 33% of the PA). ‘Z’ represents the Pr(DP) when there is at least one baserunner, none of whom are on 1st base, and fewer than 2 outs (about 7% of the PA). I have set ‘Z’ equal to .03 because this rarely occurs (for any batter, including Alfonzo). We can then solve for ‘Y’: since .07 x .03 = .0021, Y = (.018 - .0021) / .33 = .048. ‘Z’ is a constant value for all batters whereas ‘Y’ varies with the particular batter.
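The algebra for ‘Y’ can be checked with a short sketch. The function name and parameter names below are illustrative; the default shares (.33 and .07) and the constant Z = .03 are the values stated in the text.

```python
def solve_dp_rate_runner_on_first(overall_dp, share_first=0.33,
                                  share_other=0.07, z=0.03):
    """Solve overall_dp = .60*0 + share_first*Y + share_other*Z for Y."""
    return (overall_dp - share_other * z) / share_first


y = solve_dp_rate_runner_on_first(0.018)  # Alfonzo's overall Pr(DP) in 2000
print(round(y, 3))  # 0.048
```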

The following table presents my results for Alfonzo:

| Situation | Pr(BB/HBP) | Pr(1B) | Pr(2B) | Pr(3B) | Pr(HR) | Pr(Out) | Pr(DP) |
|---|---|---|---|---|---|---|---|
| No baserunners or 2 outs | .157 | .171 | .063 | .003 | .039 | .567 | 0 |
| Runner on 1st base and fewer than 2 outs | .149 | .163 | .060 | .003 | .037 | .540 | .048 (Y) |
| Runner on 2nd or 3rd base (not on 1st base) and fewer than 2 outs | .152 | .166 | .061 | .003 | .038 | .550 | .03 (Z) |
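One way to reproduce the rows of this table, assuming (as stated earlier) that the non-DP plays keep their relative proportions, is to fix Pr(DP) for the situation and rescale everything else so the distribution still sums to 1. This is a sketch of that reading, not necessarily the exact procedure used in the original simulations; `BASE` and `situational_distribution` are illustrative names.

```python
# Alfonzo's overall 2000 proportions (rounded values from the text).
BASE = {"BB/HBP": .154, "1B": .168, "2B": .062, "3B": .003,
        "HR": .038, "Out": .557, "DP": .018}


def situational_distribution(base, dp_rate):
    """Set Pr(DP) to dp_rate and rescale the other plays proportionally."""
    non_dp_total = sum(p for play, p in base.items() if play != "DP")
    scale = (1.0 - dp_rate) / non_dp_total
    dist = {play: p * scale for play, p in base.items() if play != "DP"}
    dist["DP"] = dp_rate
    return dist


for label, dp in [("no runners or 2 outs", 0.0),
                  ("runner on 1st, <2 outs", 0.048),
                  ("runner on 2nd/3rd only, <2 outs", 0.03)]:
    d = situational_distribution(BASE, dp)
    print(label, round(d["1B"], 3), round(d["Out"], 3))
```

For the three situations this gives Pr(1B) of about .171, .163, and .166, matching the table.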

Returning to the question of how we calculate T(S, S’), we now know that Pr(Play = 1B) = .166 because Alfonzo is batting with a runner on 2nd base and 0 outs (fewer than 2 outs). But what about the Pr(X’ = (1, 0, 0) | X = (0, 1, 0), Play = 1B)? The following discussion will explain how to calculate the probability that some After situation in S will lead to some After situation in S’, given that a certain play has occurred.

Given a certain play (e.g. 1B) and an After situation in S (e.g. Alfonzo, X = (0, 1, 0), 8th inning, 0 outs), we can directly determine the number of outs in the After situation in S’. The non-outs (BB/HBP, 1B, 2B, 3B, HR) keep the number of outs unchanged. A single out results in 1 more out, and a DP results in 2 more outs. Once we reach 3 outs from one of these two plays, we automatically “reset” the number of outs to zero, the X’ component (in the After situation in S’) to (0, 0, 0), and the current inning to the next inning. X’, however, can often equal one of several possibilities, and so we need to use probabilities. When the play is a BB/HBP, 3B, or HR, however, there is only one possibility:

Let X = (n1, n2, n3) in the After situation in S (or the Before situation in S’)

BB/HBP:

If n1 = 1 and n2 = 1 then X’ = (1, 1, 1) and runs scored = n3

If n1 = 1 and n2 = 0 then X’ = (1, 1, n3) and runs scored = 0

Else X’ = (1, n2, n3) and runs scored = 0

3B:

X’ = (0, 0, 1) and runs scored = n1 + n2 + n3

HR:

X’ = (0, 0, 0) and runs scored = 1 + n1 + n2 + n3

Explanation: If a batter’s play is a BB/HBP, then the batter advances to 1st base. If there is a runner on 1st base before the play, then this runner moves to 2nd base. This continues like a chain reaction as long as there are runners who are adjacent in the base sequence. For example, if there are runners on every base before the BB/HBP, then the runner on 3rd base scores, the runner on 2nd base advances to 3rd base, the runner on 1st base advances to 2nd base, and the batter advances to 1st base. If a batter’s play is a 3B, he advances to 3rd base and any baserunners existing before the play score. A HR is the same as a 3B except that the batter advances to Home and also scores. Triples usually occur far less often than any other play.
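The three deterministic rules (BB/HBP, 3B, HR) translate directly into code. The sketch below is one possible encoding; the function name `advance_deterministic` is hypothetical, and baserunners are represented as the tuple (n1, n2, n3) defined above.

```python
def advance_deterministic(play, bases):
    """Return (new_bases, runs) for plays that have exactly one outcome.

    `bases` is (n1, n2, n3), where each entry is 1 if that base is occupied.
    """
    n1, n2, n3 = bases
    if play == "BB/HBP":
        if n1 == 1 and n2 == 1:
            # Runners on 1st and 2nd are forced; a runner on 3rd is forced home.
            return (1, 1, 1), n3
        if n1 == 1:
            # Only the runner on 1st is forced (to 2nd); 3rd is unaffected.
            return (1, 1, n3), 0
        # Nobody on 1st, so no runner is forced to move.
        return (1, n2, n3), 0
    if play == "3B":
        return (0, 0, 1), n1 + n2 + n3
    if play == "HR":
        return (0, 0, 0), 1 + n1 + n2 + n3
    raise ValueError(f"{play!r} does not have a single deterministic outcome")


print(advance_deterministic("BB/HBP", (1, 1, 1)))  # ((1, 1, 1), 1)
print(advance_deterministic("HR", (1, 1, 1)))      # ((0, 0, 0), 4)
```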

When Play = 1B or 2B, there are often several possibilities for X’, given an After situation in S (X):

| X | X' | Pr(X' \| X, Play = 1B) | Runs |
|---|---|---|---|
| (0, 0, 0) | (1, 0, 0) | 1 | 0 |
| (1, 0, 0) | (1, 1, 0) | 0.85 | 0 |
| (1, 0, 0) | (1, 0, 1) | 0.15 | 0 |
| (0, 1, 0) | (1, 0, 0) | 0.8 | 1 |
| (0, 1, 0) | (1, 0, 1) | 0.2 | 0 |
| (0, 0, 1) | (1, 0, 0) | 1 | 1 |
| (1, 1, 0) | (1, 1, 0) | 0.68 | 1 |
| (1, 1, 0) | (1, 0, 1) | 0.12 | 1 |
| (1, 1, 0) | (1, 1, 1) | 0.2 | 0 |
| (0, 1, 1) | (1, 0, 0) | 0.8 | 2 |
| (0, 1, 1) | (1, 0, 1) | 0.2 | 1 |
| (1, 0, 1) | (1, 1, 0) | 0.85 | 1 |
| (1, 0, 1) | (1, 0, 1) | 0.15 | 1 |
| (1, 1, 1) | (1, 1, 1) | 0.2 | 1 |
| (1, 1, 1) | (1, 1, 0) | 0.68 | 2 |
| (1, 1, 1) | (1, 0, 1) | 0.12 | 2 |

| X | X' | Pr(X' \| X, Play = 2B) | Runs |
|---|---|---|---|
| (0, 0, 0) | (0, 1, 0) | 1 | 0 |
| (1, 0, 0) | (0, 1, 1) | 0.7 | 0 |
| (1, 0, 0) | (0, 1, 0) | 0.3 | 1 |
| (0, 1, 0) | (0, 1, 0) | 1 | 1 |
| (0, 0, 1) | (0, 1, 0) | 1 | 1 |
| (1, 1, 0) | (0, 1, 1) | 0.7 | 1 |
| (1, 1, 0) | (0, 1, 0) | 0.3 | 2 |
| (0, 1, 1) | (0, 1, 0) | 1 | 2 |
| (1, 0, 1) | (0, 1, 1) | 0.7 | 1 |
| (1, 0, 1) | (0, 1, 0) | 0.3 | 2 |
| (1, 1, 1) | (0, 1, 1) | 0.7 | 2 |
| (1, 1, 1) | (0, 1, 0) | 0.3 | 3 |

When Play = Out or DP, there are often several possibilities for X’, given an After situation in S (X). (If a DP occurs with 1 out or an Out occurs with 2 outs, then X' = (0, 0, 0) since now there are 3 outs.)

| X | X' | Pr(X' \| X, Play = Out, 0 or 1 outs) | Runs |
|---|---|---|---|
| (0, 0, 0) | (0, 0, 0) | 1 | 0 |
| (1, 0, 0) | (1, 0, 0) | 0.95 | 0 |
| (1, 0, 0) | (0, 1, 0) | 0.05 | 0 |
| (0, 1, 0) | (0, 1, 0) | 0.9 | 0 |
| (0, 1, 0) | (0, 0, 1) | 0.1 | 0 |
| (0, 0, 1) | (0, 0, 0) | 0.5 | 1 |
| (0, 0, 1) | (0, 0, 1) | 0.5 | 0 |
| (1, 1, 0) | (1, 0, 1) | 0.6 | 0 |
| (1, 1, 0) | (0, 1, 1) | 0.1 | 0 |
| (1, 1, 0) | (1, 1, 0) | 0.3 | 0 |
| (0, 1, 1) | (0, 1, 1) | 0.5 | 0 |
| (0, 1, 1) | (0, 1, 0) | 0.5 | 1 |
| (1, 0, 1) | (1, 0, 0) | 0.5 | 1 |
| (1, 0, 1) | (1, 0, 1) | 0.5 | 0 |
| (1, 1, 1) | (1, 1, 1) | 0.4 | 0 |
| (1, 1, 1) | (1, 1, 0) | 0.3 | 1 |
| (1, 1, 1) | (1, 0, 1) | 0.3 | 1 |

| X | X' | Pr(X' \| X, Play = DP, 0 outs) | Runs |
|---|---|---|---|
| (1, 0, 0) | (0, 0, 0) | 1 | 0 |
| (0, 1, 0) | (0, 0, 0) | 1 | 0 |
| (0, 0, 1) | (0, 0, 0) | 1 | 0 |
| (1, 1, 0) | (0, 0, 1) | 0.7 | 0 |
| (1, 1, 0) | (0, 1, 0) | 0.2 | 0 |
| (1, 1, 0) | (1, 0, 0) | 0.1 | 0 |
| (0, 1, 1) | (0, 0, 1) | 0.5 | 0 |
| (0, 1, 1) | (0, 1, 0) | 0.5 | 0 |
| (1, 0, 1) | (0, 0, 0) | 1 | 1 |
| (1, 1, 1) | (0, 1, 1) | 0.45 | 0 |
| (1, 1, 1) | (0, 0, 1) | 0.5 | 1 |
| (1, 1, 1) | (1, 1, 0) | 0.05 | 0 |

Going back to our example, the above tables tell us that there are only two possibilities for X’ given that X = (0, 1, 0) and Play = 1B; either X’ = (1, 0, 0) or (1, 0, 1), meaning that Hampton either scores or only advances to 3rd base. The former is more probable, i.e. Pr(X’ = (1, 0, 0) | X, 1B) = 0.8. Thus, T(S, S’) = .166 x .8 = .1328.
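The worked example can be written as a small lookup plus a product. In this sketch, `ADVANCE_1B` holds only the X = (0, 1, 0) rows of the 1B table above, and the names `ADVANCE_1B` and `transition_probability` are illustrative.

```python
# Pr(X' | X, Play = 1B) for X = (0, 1, 0), copied from the 1B table above.
ADVANCE_1B = {
    (0, 1, 0): {(1, 0, 0): 0.8, (1, 0, 1): 0.2},
}


def transition_probability(play_prob, before_bases, after_bases, table):
    """T(S, S') = Pr(Play) x Pr(X' | X, Play); 0 if X' is unreachable from X."""
    return play_prob * table[before_bases].get(after_bases, 0.0)


# Alfonzo singles with a runner on 2nd and fewer than 2 outs: Pr(1B) = .166.
t_scores = transition_probability(0.166, (0, 1, 0), (1, 0, 0), ADVANCE_1B)
t_holds = transition_probability(0.166, (0, 1, 0), (1, 0, 1), ADVANCE_1B)
print(round(t_scores, 4))  # 0.1328
print(round(t_holds, 4))   # 0.0332
```

The two values add up to .166, the probability that Alfonzo singles in this situation, as they should.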

Rationale for Modeling the Run Scoring Process with a HMM:

The result of any baseball game (in the news) is typically described as Home Team 5, Visiting Team 3 (for example); this means that the home team won the game because it scored 5 runs, which was more than what the visiting team scored (3 runs). Baseball fans often look at a “scoreboard”, which is a listing of different games and their current results, and see (for example) that the score is 6-2 in the fifth inning of one game. The scoreboard also often tells them how many runs were scored by each team in each inning.

However, if they did not see this particular game (and the game log is unavailable), they can only guess what sequence of plays led to the resulting score in each inning. Similarly, the observable parameter in our HMM is a sequence of runs scored, and one challenge is to determine the most likely sequence of states (including plays) that could account for a certain observed sequence. The states in our HMM are the hidden parameters because different situations and plays can account for the same observed sequence.

Another interesting question that has mostly theoretical value (no practical application) is: Given a certain lineup of hitters, what is the probability of a certain observation sequence of runs scored (during a single game)? We will discuss a similar question that does have practical value; namely, what will be the sum total of runs scored (on average) per game, given a certain lineup? The HMM is designed so that we can predict (or estimate) the answer to this latter question, both in theory and through actual simulations. The following sections discuss this key issue in detail, and reach a conclusion on the HMM’s accuracy.

### Practical Applications: Predicting Before the Fact (Future) and Estimating After the Fact (Past)

You may ask, what is the practical value of modeling the run scoring process? First of all, one must know that the objective of a baseball team is to score more runs than its opponent (each team usually receives 9 innings to try and score runs in a game). Thus, the more runs a team scores (on average) each game, the more games it is likely to win. Accurately predicting runs scored can help a team’s manager decide whether or not he needs to improve his lineup’s ability to score runs.

Specifically, we wish to predict, given a team’s (projected) lineup for each game, how many runs they will score in a certain season. For example, given the NY Mets lineup for each game of the 2001 season, we could generate the “Play” probability distributions for each player in the lineup from each player’s cumulative performance between 1998-2000. Thus, we would be using data from previous seasons to predict each player’s performance in 2001 as well as the team’s runs scored in 2001 (before the fact).

Inaccurately predicting future performance, however, will probably tell us more about the inaccuracy of our prediction algorithm than about the inaccuracy of our model. For example, if our predictions for each player’s “Play” probability distribution for 2001 are very inaccurate, then a simulation of our model will likely give inaccurate predictions (for the team’s runs scored), regardless of whether or not the model is accurate.

Thus, before we can predict future performance, we need to test whether the model is an accurate simulator of the run scoring process. To accomplish this, we want to estimate a team’s runs scored after the fact, i.e. we want to simulate the 2000 season, for example, based on the probability distributions generated from the actual 2000 stats. As a specific example, if we did this for the 2000 NY Mets, we would expect Edgardo Alfonzo (or any other player on the team) to have (on average) the same proportion of singles, outs, etc. that he had in actuality during the 2000 season. Similarly, if the model is accurate, we would expect the runs scored in different simulations to be (on average) equal to the actual runs scored by the 2000 NY Mets.

This idea of estimating after the fact is also the basis for the design of mathematical formulas that accurately estimate runs scored from actual team stats. There are linear and non-linear formulas. The following two formulas are examples of each type, respectively:

XR (Extrapolated Runs) = (.50 x 1B) + (.72 x 2B) + (1.04 x 3B) + (1.44 x HR) + (.34 x BB/HBP) + (-.096 x Outs)

BsR (Base Runs) = [A * B/(B + Outs)] + HR

(A and B represent baserunners and advancement, respectively)

A = 1B + 2B + 3B + BB/HBP

B = [2*HR + 3.6*3B + 2.2*2B + .8*1B + .1*BB/HBP] * 1.02

The XR formula takes the total number of 1B, 2B, 3B, etc. hit by a team during a certain season and estimates the team’s runs scored using linear weights for each stat. As an example, the 2000 Mets had 945 singles, 282 doubles, 20 triples, 198 HRs, 720 BB/HBP, and 4330 Outs. XR estimates that the team scored .5*945 + .72*282 +… = 811 runs (they actually scored 807 runs). Note: the XR formula is fitted to data before the 2000 season.
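The XR arithmetic for the 2000 Mets can be reproduced with a short sketch that uses the weights and team totals quoted above. The names `XR_WEIGHTS`, `METS_2000`, and `extrapolated_runs` are illustrative.

```python
# Linear weights from the XR formula as given in this paper.
XR_WEIGHTS = {"1B": 0.50, "2B": 0.72, "3B": 1.04, "HR": 1.44,
              "BB/HBP": 0.34, "Outs": -0.096}

# 2000 Mets team totals quoted in the text.
METS_2000 = {"1B": 945, "2B": 282, "3B": 20, "HR": 198,
             "BB/HBP": 720, "Outs": 4330}


def extrapolated_runs(stats):
    """Weighted sum of team stats using the XR linear weights."""
    return sum(weight * stats[stat] for stat, weight in XR_WEIGHTS.items())


print(round(extrapolated_runs(METS_2000)))  # 811
```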

B/(B+Outs) in the BsR formula represents the percentage of baserunners (A) who advance to Home plate (i.e. score). After hitting a HR, a batter automatically scores and does not become a baserunner. Thus, we separately add the total number of HRs.

David Smyth, the creator of BsR, stated (6/6/05): “Runs equals baserunners (A) times the proportion of baserunners who score (B/B+Outs), plus home runs. This statement is so obviously true that some people have called it an 'identity' instead of an equation or theory.”

This identity is certainly correct, but the problem is that we usually cannot determine the exact proportion of baserunners who scored based on recorded stats like BB, HR, 2B, etc. Unless we use a “game log”, or an ordered record of each play, we cannot tell the value of this proportion. For example, the sequence “1B, DP, 2B, 3B” yields one run but a different sequence like “3B, 2B, 1B, DP” may yield two runs, even though each sequence records the same plays and the same number of baserunners (3). A Markov model becomes useful here because it accounts for the variability in the run scoring process, as it can produce different plays and orderings of plays from simulation to simulation.

BsR estimates (on average) how each of the recorded team stats affects the proportion of baserunners who score. Sometimes (but not always), it estimates this proportion very accurately. As an example, it calculates that about 31% (612) of the 1,967 baserunners for the 2000 Mets scored runs. So it estimates that the team scored a total of 612 + 198 = 810 runs, only 3 runs above the actual total. Similarly, XR (like all linear formulas) estimates the average run value of each stat. Since a HR automatically scores at least one run, and may score up to 3 more baserunners, its “expected” run value is about 1.44.
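The BsR figures for the 2000 Mets follow from the A and B definitions given earlier and the team totals quoted in the XR example. The sketch below recomputes them; `base_runs` and `METS_2000` are illustrative names.

```python
# 2000 Mets team totals quoted in the text.
METS_2000 = {"1B": 945, "2B": 282, "3B": 20, "HR": 198,
             "BB/HBP": 720, "Outs": 4330}


def base_runs(s):
    """Return (A, proportion of baserunners who score, estimated runs)."""
    a = s["1B"] + s["2B"] + s["3B"] + s["BB/HBP"]            # baserunners
    b = 1.02 * (2 * s["HR"] + 3.6 * s["3B"] + 2.2 * s["2B"]
                + 0.8 * s["1B"] + 0.1 * s["BB/HBP"])         # advancement
    score_rate = b / (b + s["Outs"])
    return a, score_rate, a * score_rate + s["HR"]


a, rate, runs = base_runs(METS_2000)
print(a)               # 1967
print(round(rate, 2))  # 0.31
print(round(runs))     # 810
```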

If we let R = actual runs scored by a certain team, then a linear regression formula may give us a good estimate of R based on the actual team stats (R is the dependent variable and BB/HBP, 1B, 2B, etc. are the independent variables). If I simulated my model (for a certain team after the fact) about 100 times, the sample mean runs scored (which is an unbiased estimator of the HMM’s expected value) would hopefully estimate R more precisely. R is often difficult for mathematical formulas to accurately estimate because they do not account for the randomness in play orderings as well as the context of innings and games, both of which are captured by the HMM.

### Results & Conclusion

Let the random variable H = the number of runs scored by a certain team after simulating the HMM based on the probability distributions generated from the team’s players’ season stats (either before or after the fact). I believe that E(H) is difficult to formally calculate (mathematically) because of several key factors in real baseball games:

(1) Each game is independent of any other game, i.e. a team could have a different lineup for each game. One lineup could score 2 runs in a game and another lineup could score 10 runs in another game, but these run values do not depend on each other. In other words, what happens in one game (theoretically) has no effect on what happens in any other game. Thus, we have to calculate E(H) for each distinct game, given a certain lineup for that game.

(2) All runs score within the context of 3 outs in each inning and innings are dependent on each other as they determine which batters hit in other innings. Because an inning could continue indefinitely as long as the batters do not make 3 outs, an infinite number of runs scored in an inning is theoretically possible.

(3) Lineups often change dynamically during each game.

(4) Games can extend beyond nine innings to break tie scores between two teams, i.e. each team bats for a minimum of eight innings (and usually nine) but there is no definite maximum.

I will attempt to give a general idea of how one might try to mathematically calculate E(H), but the above factors make it difficult to formally implement this function. The initial state for any random game in the 1st inning is one of the following:

S1: Before equals “(#1 in lineup), (0, 0, 0), 1st inning, 0 outs”, Play equals “BB/HBP”, and After equals “(#2 in lineup), (1, 0, 0), 1st inning, 0 outs”.

S2: Before equals “(#1 in lineup), (0, 0, 0), 1st inning, 0 outs”, Play equals “1B”, and After equals “(#2 in lineup), (1, 0, 0), 1st inning, 0 outs”.

S3: Before equals “(#1 in lineup), (0, 0, 0), 1st inning, 0 outs”, Play equals “2B”, and After equals “(#2 in lineup), (0, 1, 0), 1st inning, 0 outs”.

S4: Before equals “(#1 in lineup), (0, 0, 0), 1st inning, 0 outs”, Play equals “3B”, and After equals “(#2 in lineup), (0, 0, 1), 1st inning, 0 outs”.

S5: Before equals “(#1 in lineup), (0, 0, 0), 1st inning, 0 outs”, Play equals “HR”, and After equals “(#2 in lineup), (0, 0, 0), 1st inning, 0 outs”.

S6: Before equals “(#1 in lineup), (0, 0, 0), 1st inning, 0 outs”, Play equals “Out”, and After equals “(#2 in lineup), (0, 0, 0), 1st inning, 1 out”.

Before the first play of the game, there are no baserunners (0, 0, 0) and 0 outs. There are 6 possible plays in this situation and 5 possible situations after the play. The initial state probability distribution equals the first batter’s “Play” probability distribution when there are no baserunners. For example, Pr(Initial State = S1) = Pr(Play = BB/HBP). The “reward” or “observation” in each of these possible initial states is 0 runs, except S5. Because a HR with no baserunners scores 1 run, S5 has a reward of 1 run. Let R(S) = reward in state S. Then the following formula calculates E(H):

$$E(H) = \sum_{i=1}^{6} \Pr(S_i)\, U(S_i)$$

$$U(S) = R(S) + \sum_{S'} T(S, S')\, U(S')$$

The second sum runs over all possible states S’.

Si stands for some initial state (1 ≤ i ≤ 6) and U(S) stands for the “utility” of S, i.e. the expected number of runs scored from S through the end of the game. Once we transition to a state in which there are 3 outs recorded in the 9th inning after a certain play (either an Out or DP), we can no longer transition to another state (i.e. the game is over). The interested reader is encouraged to try and output results from this formula, but beware: an infinite (or very long) sequence of state transitions is improbable but possible. The average sequence probably contains about 40-45 transitions.
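To illustrate how the recursion for U copes with arbitrarily long sequences, here is a deliberately tiny toy model. It is an assumption made for illustration only, not the full HMM: a single half-inning in which every plate appearance is either a HR (1 run, outs unchanged) or an Out. A situation is just the number of outs, and its value is the expected runs from that point to the end of the inning. Because a HR leads back to the same situation, the recursion is solved by fixed-point iteration rather than by enumerating sequences. The function name `expected_runs_toy` and the parameter `p_hr` are hypothetical.

```python
def expected_runs_toy(p_hr, tol=1e-12, max_iter=10_000):
    """Expected runs in one half-inning when each play is a HR or an Out.

    u[k] is the expected number of runs still to come with k outs recorded.
    The recursion u[k] = p_hr * (1 + u[k]) + p_out * u[k + 1] mirrors
    U(S) = R(S) + sum over S' of T(S, S') * U(S'), with u[3] = 0.
    """
    p_out = 1.0 - p_hr
    u = [0.0, 0.0, 0.0, 0.0]  # u[3] stays 0: three outs end the inning
    for _ in range(max_iter):
        new = u[:]
        for outs in range(3):
            new[outs] = p_hr * (1.0 + u[outs]) + p_out * u[outs + 1]
        if max(abs(a - b) for a, b in zip(new, u)) < tol:
            return new[0]
        u = new
    return u[0]


# For this toy model the closed form is 3 * p_hr / (1 - p_hr).
print(round(expected_runs_toy(0.25), 6))  # 1.0
```

The same idea extends to the full model in principle, but the state space then includes the lineup slot, baserunners, inning, and outs, which is why simulation is the more practical route.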

I, however, have decided not to do this (for now) due to time constraints and the previously mentioned factors. In the actual simulations of this HMM, I have accounted for those four factors, so the simulated results reflect them just as a full mathematical calculation of E(H) would. Instead, I will use the sample mean runs scored from my HMM, which is an unbiased estimator of E(H), to test the accuracy of my HMM. For “after the fact” estimates, I hoped that the sample mean would be as good as, or perhaps better than, any mathematical formula (e.g. XR) that estimates runs scored. So I did 100 “after the fact” simulations for each of the 16 teams in the National League (a league is a group of teams) and for each season between 2000 and 2004, and compared each team’s mean runs scored in simulation to their actual runs scored. Thus, I did this for a total of 16 x 5 = 80 teams (the 2001 Mets and 2000 Mets are different teams). I calculated the difference (non-absolute, i.e. signed, and absolute) between each team’s average runs scored and their actual runs scored. This difference can also be called error. If E(H) gives a good estimate of R, meaning that the HMM is an accurate model of the actual run scoring process, then this error (on average) should be very small. The table below gives the mean non-absolute error and the mean absolute error for the HMM as well as the two formulas, XR and BsR:

| | HMM | XR | BsR |
|---|---|---|---|
| Mean Non-Abs Error (runs) | -5.2 | +17.0 | +21.6 |
| Mean Abs Error (runs) | 16.5 | 21.2 | 24.5 |

As an example of how to calculate these errors, the 2000 Mets scored an average of 829 runs in my 100 simulations, and in actuality scored 807 runs. So non-absolute error = 829 - 807 = +22 runs. The absolute error is the absolute value of +22, which is 22. Ideally the expected value of the non-absolute error would be 0. The HMM’s mean non-absolute error (-5.2) is closer to 0 than that of XR (+17.0) or BsR (+21.6), and its mean absolute error (16.5) is also the smallest of the three, so the HMM seems to do a job that is comparable to (if not better than) both estimation formulas. But since this is a sample of only 80 teams, this data does not necessarily prove my HMM’s superiority to these formulas. In addition, its non-absolute error is still not 0, and so it still needs improvement (not much, though).
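The two error measures can be written out in a few lines of Python. This is only a sketch: the function name is hypothetical, and the second team in the two-team example is invented purely to show how signed errors can cancel while absolute errors cannot.

```python
def mean_errors(pairs):
    """pairs: list of (simulated_mean_runs, actual_runs) per team.

    Returns (mean non-absolute error, mean absolute error).
    """
    signed = [sim - actual for sim, actual in pairs]
    n = len(signed)
    return sum(signed) / n, sum(abs(e) for e in signed) / n

# The 2000 Mets example from the text: 829 simulated vs. 807 actual.
mets_only = mean_errors([(829, 807)])

# Hypothetical second team (700 simulated vs. 710 actual), for illustration.
two_teams = mean_errors([(829, 807), (700, 710)])
print(mets_only, two_teams)
```

In the two-team case the mean non-absolute error is +6 while the mean absolute error is 16, which is why both measures are reported: the first reveals bias, the second reveals typical miss size.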

To make future improvements, I would need to expand the HMM to account for minor (yet important) factors such as baserunning speed, stolen bases, caught stealing, errors, wild pitches, passed balls, runners thrown out, triple plays, etc. to see whether its performance improves significantly. In order to appreciate how incorporating these factors in the HMM would likely improve its accuracy, the reader must first understand what these factors are and how they influence the gameplay in baseball. The following link is a good beginner’s reference on baseball: http://en.wikipedia.org/wiki/Baseball.

Other interesting questions that my HMM could answer (further research):

If we replace one batter (e.g. an average one) in some lineup with another one (e.g. Mike Piazza), how many more (or fewer) runs will the team score? (This is important in Player Evaluation.)

Does the way in which you order the batters in your lineup make a difference in run production?

### Appendix: Example Simulations of the HMM

The example below shows us the results of two different “after the fact” simulations for the 2000 NY Mets, along with the stats of certain key players (RBI, OBP, and SLG are not necessary for the reader to understand but I calculated them for those who are more familiar with baseball and may be interested):

TEAM: NY Mets, YEAR: 2000, SEASON 1, Runs Scored: 844

| Player | PA | BB/HBP | Hits | 2B | 3B | HR | RBI | OBP | SLG |
|---|---|---|---|---|---|---|---|---|---|
| Todd Zeile | 626 | 73 | 165 | 34 | 3 | 29 | 106 | .359 | .528 |
| Robin Ventura | 554 | 88 | 111 | 21 | 1 | 30 | 102 | .338 | .481 |
| Mike Piazza | 548 | 76 | 153 | 29 | 0 | 34 | 109 | .387 | .602 |
| Derek Bell | 629 | 60 | 171 | 34 | 2 | 20 | 81 | .350 | .473 |
| Edgardo Alfonzo | 655 | 95 | 179 | 53 | 2 | 19 | 76 | .402 | .523 |

TEAM: NY Mets, YEAR: 2000, SEASON 2, Runs Scored: 783

| Player | PA | BB/HBP | Hits | 2B | 3B | HR | RBI | OBP | SLG |
|---|---|---|---|---|---|---|---|---|---|
| Todd Zeile | 626 | 77 | 142 | 33 | 3 | 29 | 110 | .335 | .488 |
| Robin Ventura | 555 | 67 | 107 | 25 | 1 | 24 | 77 | .294 | .422 |
| Mike Piazza | 549 | 58 | 149 | 27 | 0 | 29 | 91 | .353 | .536 |
| Derek Bell | 630 | 77 | 138 | 23 | 0 | 8 | 48 | .313 | .335 |
| Edgardo Alfonzo | 652 | 120 | 196 | 49 | 1 | 29 | 85 | .480 | .628 |

Notice that Alfonzo averages 24 HR and 108 BB/HBP in the two simulations; these are comparable to his actual totals in the 2000 season (25 HR, 100 BB/HBP). If we had simulated the HMM for the 2000 season based on the probability distributions generated from Alfonzo and other players’ cumulative stats from 1997-1999 (“before the fact”), then we would generally get different results. For example, Alfonzo had only hit more than 20 HRs in a season once before 2000, and thus our simulator would likely output a result of less than 20 HRs for him. Here is an example of such a simulation:

TEAM: NY Mets, YEAR: 2000, SEASON 1, Runs Scored: 750

| Player | PA | BB/HBP | Hits | 2B | 3B | HR | RBI | OBP | SLG |
|---|---|---|---|---|---|---|---|---|---|
| Edgardo Alfonzo | 671 | 68 | 181 | 29 | 6 | 15 | 82 | .358 | .443 |
| Derek Bell | 628 | 53 | 171 | 35 | 3 | 14 | 83 | .336 | .442 |
| Todd Zeile | 667 | 89 | 164 | 24 | 4 | 33 | 113 | .357 | .510 |
| Mike Piazza | 588 | 64 | 161 | 23 | 0 | 31 | 88 | .349 | .529 |
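The phrase “probability distributions generated from cumulative stats” can be illustrated with a small Python sketch that turns a stat line into per-plate-appearance outcome probabilities. The function name and the outcome labels are my own, and double plays are lumped in with ordinary outs because the stat lines shown here do not separate them. As sample input I use Alfonzo’s line from the first “after the fact” season above (655 PA, 95 BB/HBP, 179 Hits, 53 2B, 2 3B, 19 HR); a real “before the fact” run would use his actual 1997-1999 totals instead, which are not reproduced here.

```python
def outcome_probabilities(pa, bb_hbp, hits, doubles, triples, hr):
    """Estimate per-plate-appearance outcome probabilities from totals.

    Assumption: every plate appearance ends in a walk/HBP, a hit, or an
    out (double plays are not distinguished from single outs).
    """
    singles = hits - doubles - triples - hr
    outs = pa - bb_hbp - hits
    counts = {
        "BB/HBP": bb_hbp,
        "1B": singles,
        "2B": doubles,
        "3B": triples,
        "HR": hr,
        "Out": outs,
    }
    return {outcome: count / pa for outcome, count in counts.items()}

probs = outcome_probabilities(pa=655, bb_hbp=95, hits=179,
                              doubles=53, triples=2, hr=19)
for outcome, p in probs.items():
    print(f"{outcome}: {p:.3f}")
```

Swapping in a different set of totals (e.g. a player’s previous three seasons) changes only the inputs, which is why “before the fact” and “after the fact” simulations of the same team can differ so much.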