
Brendon Good

Math 241
4/17/15
Baseball Data Project
This statistical project is based on baseball statistics. Its purpose is to determine whether
there is any correlation between the salary a professional baseball player earns and several
different variables: RBIs, walks, hits, runs, and at bats. The population is the entirety of the
MLB, and the sample used for this project is 244 MLB players from 11 different teams.
The first part of my project was to determine the descriptive statistics of the baseball data
set: the mean, median, mode, range, standard deviation, and variance of the salary earned by the
players in the sample. The following table shows these statistics.
Variable  Mean     StDev    Variance     Median   Range     Mode    N for Mode
Salary    3577408  4305248  1.85352E+13  1500000  24500000  500000  21

The mean salary of this data set is $3,577,408. The mean is the most useful measure of central
tendency for this data. While the other statistics are interesting, such as the median salary
($1,500,000) and the most common salary ($500,000, earned by 21 players), the mean shows the
average salary earned by the sample and gives us a baseline for the comparisons that follow. The
most useful measure of variability in this data set is the standard deviation. It allows us to
see whether a salary is typical, small, or large compared to the mean salary, depending on how
many standard deviations from the mean it is.
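As a small illustration of measuring salaries in standard deviations from the mean, the sketch below uses only the mean and standard deviation reported in the table. The function name `z_score` is introduced here for illustration, and the median and mode salaries are used simply as example inputs.

```python
# Summary statistics reported in the descriptive statistics table (dollars).
MEAN_SALARY = 3_577_408
STDEV_SALARY = 4_305_248


def z_score(salary: float) -> float:
    """Standard deviations a salary lies above (+) or below (-) the sample mean."""
    return (salary - MEAN_SALARY) / STDEV_SALARY


# Example inputs: the median and mode salaries from the same table.
for label, salary in [("median", 1_500_000), ("mode", 500_000)]:
    print(f"{label}: ${salary:,} -> z = {z_score(salary):.2f}")
```

Both example salaries come out less than one standard deviation below the mean.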
The histogram that follows gives us a visual representation of how salary is distributed over
the sample. It shows how many players fall into each salary range and gives a rough picture of
the shape of the salary distribution.
[Figure: Histogram of Salary. x-axis: Salary (4,000,000 to 28,000,000 in steps of 4,000,000); y-axis: Frequency (0 to 180).]

This histogram shows that the salary earned by most players is within the $0 to $4,000,000
range. The second most common salary range is $4,000,000 to $8,000,000. The salaries drop off
dramatically after these first two ranges, down to only one player making between $24,000,000
and $28,000,000. The data in this histogram is substantially skewed to the right, with a long
tail of high salaries, showing that the distribution is not symmetric.
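The asymmetry of the distribution can also be checked numerically from the reported summary statistics. The sketch below uses Pearson's second skewness coefficient, 3 × (mean − median) / standard deviation. This measure is not part of the original analysis and is brought in here only as a rough cross-check. A positive value points to a longer upper tail, a negative value to a longer lower tail.

```python
# Summary statistics reported in the descriptive statistics table (dollars).
mean_salary = 3_577_408
median_salary = 1_500_000
stdev_salary = 4_305_248

# Pearson's second skewness coefficient (an added cross-check, not from the report).
pearson_skew = 3 * (mean_salary - median_salary) / stdev_salary

direction = "upper (high-salary)" if pearson_skew > 0 else "lower (low-salary)"
print(f"Pearson skewness = {pearson_skew:.2f}; longer tail on the {direction} side")
```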
The following boxplot shows much of the same information as the above histogram. It makes
some other information easier to see, though, such as outliers.

[Figure: Boxplot of Salary. y-axis: Salary (5,000,000 to 25,000,000).]

The box plot shows that we have many outliers above the third quartile. The outliers start at
approximately $11,000,000 and cluster between there and roughly $16,000,000. This cluster is
followed by two more outliers even further above the others. The data is also clustered around
the first and second quartiles, showing that much of the salary data falls below the third
quartile, which sits near the 5 million dollar mark. Above this cluster, the data is more spread
out between the second quartile and the third quartile.
A confidence interval gives a range of values that, at a stated level of confidence, contains
the true population mean. The following interval does this with a 95% level of confidence.
N    Mean     SE Mean  95% CI
244  3577408  275615   (3037212, 4117604)
This information shows that we can be 95% confident that the true mean salary lies between
$3,037,212 and $4,117,604. The sample mean of $3,577,408 is the center of this interval.
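The interval above can be reproduced from the summary statistics alone. The sketch below assumes a normal (z) critical value, since the report does not say which critical value was used. The variable names are chosen here for illustration.

```python
import math
from statistics import NormalDist

# Values taken from the descriptive statistics and confidence interval tables.
n = 244
mean_salary = 3_577_408
stdev_salary = 4_305_248

# Assumption: a z critical value (about 1.96) rather than a t critical value.
z_crit = NormalDist().inv_cdf(0.975)

se_mean = stdev_salary / math.sqrt(n)   # standard error of the mean
margin = z_crit * se_mean               # half-width of the 95% interval
lower, upper = mean_salary - margin, mean_salary + margin

print(f"SE Mean = {se_mean:.0f}")
print(f"95% CI = ({lower:.0f}, {upper:.0f})")
```

Under this assumption the bounds agree with the tabulated interval to within a few dollars.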

Following the confidence interval I conducted a hypothesis test to determine whether the sample
data provides sufficient evidence to conclude that the population mean salary is less than 3
million. The test gave a p-value of .981. Comparing that with a significance level of .05, I
determined that there is not sufficient evidence to reject the null hypothesis that the true
mean salary equals 3 million. By not rejecting the null hypothesis, I conclude that we do not
have sufficient evidence that the true mean salary of the population of players is less than 3
million.
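A minimal sketch of this left-tailed test, using only the reported summary statistics, follows. It assumes a normal approximation to the test statistic's distribution. The report does not say whether a z-test or a t-test was run, so the p-value here should only be expected to land close to the reported .981.

```python
import math
from statistics import NormalDist

# Summary statistics from the report.
n = 244
mean_salary = 3_577_408
stdev_salary = 4_305_248
mu0 = 3_000_000  # hypothesized mean under the null hypothesis

se_mean = stdev_salary / math.sqrt(n)
test_stat = (mean_salary - mu0) / se_mean

# Left-tailed test (alternative: true mean < 3 million).
# Assumption: normal approximation instead of a t distribution with 243 df.
p_value = NormalDist().cdf(test_stat)

alpha = 0.05
reject_null = p_value < alpha
print(f"test statistic = {test_stat:.3f}, p-value = {p_value:.3f}, reject H0: {reject_null}")
```

Because the sample mean is above 3 million, the test statistic is positive and the left-tail p-value is large, which is why the null hypothesis is not rejected.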
After the hypothesis test I conducted a regression analysis of Salary versus Runs, and
determined the regression equation relating salary to the number of runs to be
Salary = 2158258 + 60844 (Runs). Using this regression equation I carried out two calculations
predicting a player's salary if they scored 18 runs, and then 33 runs. The calculations
follow.
Salary = 2158258 + 60844(18) = 3253450
Salary = 2158258 + 60844(33) = 4166110
Using this regression equation it does appear that by scoring more runs, a player would earn a
higher salary. After these calculations, however, I constructed a scatterplot of earned salary
versus the number of runs (coefficient of determination = 14.6%).
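The two predictions from the regression equation can be reproduced with a short function. This sketch uses the intercept that appears in the two worked calculations (2158258). The function name `predicted_salary` is introduced here for illustration.

```python
# Coefficients of the fitted line, as used in the two worked calculations.
INTERCEPT = 2_158_258
SLOPE = 60_844  # predicted change in salary per additional run


def predicted_salary(runs: int) -> int:
    """Salary predicted by the fitted line Salary = INTERCEPT + SLOPE * Runs."""
    return INTERCEPT + SLOPE * runs


for runs in (18, 33):
    print(f"{runs} runs -> predicted salary {predicted_salary(runs):,}")
```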

[Figure: Scatterplot of Salary vs Runs. x-axis: Runs (0 to 120); y-axis: Salary (0 to 25,000,000).]

Even though the regression equation does appear to show a higher salary for more runs scored,
the scatterplot suggests that a player's earned salary has no clear correlation with the number
of runs. The scatterplot shows no distinct line relating earned salary to number of runs over
the range from 0 runs to 120 runs. This shows that regression is not an appropriate way of
predicting salary in this specific scenario.
I continued my analysis of the data set by conducting multiple hypothesis tests to determine
whether there was a correlation between earned salary and the different variables. The first
hypothesis test I conducted was a continuation of the above calculations, testing the
correlation between salary and runs. Performing this hypothesis test, I calculated a p-value of
0.0 with a correlation coefficient of .383. With this p-value and a significance level of .05, I
determined that I have sufficient evidence to reject the null hypothesis. This means that there
is sufficient evidence that a correlation between the number of runs and earned salary is
present. A caveat, though, is that with the correlation coefficient being so low (.383), the
linear relation is slight but still positive.
I concluded my analysis by performing hypothesis tests on the remaining variables, trying to
determine whether there was a correlation between each of them and earned salary. The following
table shows the p-value for each hypothesis test as well as the correlation coefficients.

Hypothesis Tests
Comparison         Correlation Coefficient  P-value
Salary vs RBIs     .406                     0.0
Salary vs HITS     .356                     0.0
Salary vs RUNS     .383                     0.0
Salary vs WALKS    .416                     0.0
Salary vs AT BATS  .359                     0.0
The table above shows the correlation coefficient for each of the independent variables when
compared to the salary variable. Each hypothesis test was determined to have a p-value of 0.0,
and when paired with a significance level of .05 the null hypothesis for each test can be
rejected. These rejections indicate that there is a correlation between each independent
variable and the earned salary of baseball players. But as with runs there is still a caveat.
The correlation coefficients for each test, while still positive, were fairly low, meaning that
the linear relation is only slight.
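To see why every p-value rounds to 0.0 even though the correlations are modest, the test statistic for a correlation coefficient, t = r √(n − 2) / √(1 − r²), can be computed from the reported values. The sketch below assumes all 244 players have a value for every variable and uses a normal approximation to the t distribution with 242 degrees of freedom. Neither assumption is stated in the report, and the function name `correlation_test` is introduced here for illustration.

```python
import math
from statistics import NormalDist

N_PLAYERS = 244  # assumption: every player has a value for every variable

# Correlation coefficients with Salary, as reported in the table.
correlations = {
    "RBIs": 0.406,
    "Hits": 0.356,
    "Runs": 0.383,
    "Walks": 0.416,
    "At bats": 0.359,
}


def correlation_test(r: float, n: int = N_PLAYERS) -> tuple[float, float, float]:
    """Return (t statistic, two-sided p-value, r squared) for H0: no correlation."""
    t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
    # Assumption: normal approximation to the t distribution (n - 2 = 242 df).
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value, r * r


results = {name: correlation_test(r) for name, r in correlations.items()}
for name, (t_stat, p_value, r_sq) in results.items():
    print(f"Salary vs {name}: t = {t_stat:.2f}, p = {p_value:.1e}, r^2 = {r_sq:.3f}")
```

With a sample this large, even a correlation near .36 gives a test statistic of about 6, so the p-values are far below .05. The r² column shows how little of the variation in salary each variable explains on its own, which is the caveat described above.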
After performing multiple types of tests on this data set, I can conclude that while there is a
correlation between the salary earned by players and each independent variable, the correlation
is only slight. Carrying out calculations with a regression equation and then reviewing a
scatterplot gave mixed results, but by performing hypothesis tests on each independent variable
versus salary I was able to determine a level of correlation. This implies that players who
achieve high levels in each of the independent variables should tend to have a higher salary
than others. To test this correlation further, a new data set could be built from players at
both ends of the spectrum. Players with low stats and players with high stats could both be
sampled, and their salaries then compared against their stats for each independent variable.