


Data Analysis and Interpretation Final Project Cody Cooper University of Central Oklahoma

Final Project


Major League Baseball is a professional baseball league that includes the American League and the National League. There are 30 teams, including one in Canada. Baseball is one of the most popular sports in the United States. We will be analyzing data gathered from a random sample of 254 players. We will compare the data across players, paying particular attention to any linear relationships between performance and pay, and determine their statistical significance.

In the data of the baseball players, there are several variables to consider. Table 1.1A illustrates their division into variable type, data type, and the scale they are measured on. The variables are salary, games played, hits, home runs, runs batted in, and batting average. There are two main types of variables, qualitative and quantitative. The data set contains only quantitative variables. Quantitative variables take meaningful numerical values, and they can be classified as either continuous or discrete. Discrete variables take on a countable number of distinct values, while continuous variables can take on any value within an interval. For instance, the games played data are quantitative and discrete, because they assume meaningful numeric values and a countable number of distinct values. By comparison, the batting average data can be classified as continuous, because they can take on any value within an interval, so the possible values are uncountable.

In order to choose the correct method for summarizing the data, it is important to understand the measurement scale for each variable. All of the variables in the baseball data set are measured on a ratio scale. The ratio scale represents the strongest level of measurement. Ratio-scaled data have a true zero point as the origin, and valid ratios can be calculated from the data.

TABLE 1.1A
Data Description
Variable           Variable Type   Data Type    Measurement Scale
Salary             quantitative    continuous   ratio
Games Played       quantitative    discrete     ratio
Hits               quantitative    discrete     ratio
Homeruns           quantitative    discrete     ratio
Runs Batted In     quantitative    discrete     ratio
Batting Average    quantitative    continuous   ratio



To further analyze the data, we will summarize the variables classified as discrete into frequency distributions, bar charts, and pie charts.

TABLE 1.1A Frequency for games played data
Classes            Frequency
0 up to 250        27
250 up to 500      59
500 up to 750      42
750 up to 1000     26
1000 up to 1250    33
1250 up to 1500    27
1500 up to 1750    13
1750 up to 2000    14
2000 up to 2250    5
2250 up to 2500    5
2500 up to 2750    1
2750 up to 3000    2
Total = 254
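As an illustration of how a frequency table of this kind can be tallied, here is a minimal Python sketch. The `games` values are made-up placeholders, not the 254-player sample, and the helper name `frequency_table` is hypothetical. The only thing taken from the report is the class convention, where "0 up to 250" is read as including 0 and excluding 250.

```python
def frequency_table(values, width, n_classes):
    """Count values into classes [0, width), [width, 2*width), and so on."""
    counts = [0] * n_classes
    for v in values:
        index = int(v // width)  # index of the class that contains v
        if 0 <= index < n_classes:
            counts[index] += 1
    return counts


# Hypothetical games-played values (NOT the report's data).
games = [12, 249, 250, 251, 499, 730, 1001, 2999]
counts = frequency_table(games, width=250, n_classes=12)

for i, count in enumerate(counts):
    print(f"{i * 250} up to {(i + 1) * 250}: {count}")
print("Total =", sum(counts))
```

Because each class excludes its upper limit, the value 250 is counted in "250 up to 500" rather than "0 up to 250", so no observation is counted twice.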

FIGURE 1.2A Pie chart of games played

FIGURE 1.2B Bar chart of games played


Table 1.1B Frequency for hits data
Classes            Frequency
0 up to 250        39
250 up to 500      54
500 up to 750      44
750 up to 1000     29
1000 up to 1250    28
1250 up to 1500    26
1500 up to 1750    13
1750 up to 2000    11
2000 up to 2250    8
2250 up to 2500    6
2500 up to 2750    4
2750 up to 3000    1
3000 up to 3250    1
Total = 254

Figure 1.2C Pie chart for hits data


Figure 1.2D Bar chart for hits data

Table 1.1C Frequency for home run data
Classes          Frequency
0 up to 100      1
100 up to 200    152
200 up to 300    51
300 up to 400    30
400 up to 500    10
500 up to 600    4
600 up to 700    4
700 up to 800    2
Total = 254


Figure 1.2E Pie chart for home run data

Figure 1.2F Bar chart of home run data


Table 1.1C Frequency for RBI (runs batted in)
Classes            Frequency
0 up to 250        97
250 up to 500      68
500 up to 750      36
750 up to 1000     28
1000 up to 1250    11
1250 up to 1500    7
1500 up to 1750    6
1750 up to 2000    1
Total = 254

Figure 1.2G Pie chart for RBI


The frequency distributions group data into categories and record the number of observations that fall into each category. Pie charts are segmented circles whose segments portray the frequencies of the categories of a variable. Lastly, the bar charts depict the frequency for each category of the data as a bar rising vertically from the horizontal axis. For the continuous variables, we will show the data in relative frequency histograms and relative frequency polygons. The relative frequency of each category equals the proportion of observations in that category. A category’s relative frequency is calculated by dividing its frequency by the total number of observations. The relative frequencies should sum to one, or close to one because of rounding.
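The relative frequency calculation described above can be sketched in a few lines of Python. The frequencies are copied from the games played frequency table, and the variable names are illustrative only. The salary and batting average class counts are not listed in this report, so the games played counts are used here purely to show the arithmetic.

```python
# Frequencies from the games played table (12 classes, each 250 games wide).
frequencies = [27, 59, 42, 26, 33, 27, 13, 14, 5, 5, 1, 2]
total = sum(frequencies)

# Relative frequency = class frequency / total number of observations.
relative = [f / total for f in frequencies]

for i, r in enumerate(relative):
    print(f"{i * 250} up to {(i + 1) * 250}: {r:.3f}")
print("Sum of relative frequencies:", round(sum(relative), 3))
```

When the individual values are rounded for display, their printed sum can differ slightly from one, which is the rounding effect mentioned above.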

Figure 1.2H Relative frequency histogram for salary data


Figure 1.2I Relative frequency polygon for salary data

Figure 1.2J Relative frequency histogram for batting average


Figure 1.2K Relative frequency polygon for batting average



The histograms are a series of rectangles where the width and height of each rectangle represent the class width and relative frequency of the respective class. The polygons are a connected series of neighboring points where each point represents the midpoint of a particular class and its associated relative frequency. Next, we will look at scatterplots of salaries vs home runs and salaries vs batting average. A scatterplot is a graphical tool that helps in determining whether or not two variables are related in some systematic way. Each point in the diagram represents a pair of observed values of the two variables.

Figure 1.2L Scatterplot of salaries vs home runs

Figure 1.2M Scatterplot of batting averages and salary
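One common way to quantify the linear relationship that a scatterplot suggests is to fit a least-squares line through the points. The sketch below is an illustration under stated assumptions, not the calculation used in this report: the `home_runs` and `salaries` values are invented placeholders, and `least_squares` is a hypothetical helper name.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical (home runs, salary in dollars) pairs, NOT the report's data.
home_runs = [50, 100, 150, 200, 250]
salaries = [1_000_000, 2_500_000, 3_000_000, 5_500_000, 6_000_000]

intercept, slope = least_squares(home_runs, salaries)
print(f"Estimated salary change per additional home run: ${slope:,.2f}")
```

A positive slope corresponds to an upward-trending cloud of points in the scatterplot. How tightly the points cluster around the line is a separate question that the slope alone does not answer.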


To further analyze the data, we will look at the salary data for the MLB players. The mean salary is $4,689,717.22, the median is $3,500,000, the mode is $380,000, and the standard deviation is $4,808,745. Figure 1.2H illustrates the distribution of the players’ salaries. From the histogram we can see that the data have a positively skewed distribution. This reflects the presence of a small number of relatively large salaries, which pull the mean above the median. The mean of the batting averages is 0.2754 and the standard deviation is 0.0225.

The highest paid player is on team NYY and has a salary of $23,428,571. This player’s batting average is 0.289. Although this player is the highest paid, Figure 1.2M illustrates that batting average doesn’t necessarily determine a player’s salary. The scatterplot shows many players who have similar averages but make about one quarter of the highest earner’s salary or less. The probability that a player’s batting average will be less than .300 is 0.889. Based on this probability, about 226 of the 254 players will average less than .300.

At the 95% confidence level, the regression coefficient of salary is $27,943.37, while the regression coefficient of home runs is 1.9. This implies that for every two percentage points, a player’s salary will increase by $27,943.37. Figure 1.2L illustrates the positive linear relationship between the salary data and the home run data. Similarly, if we examine the regression coefficient of salary and batting average, we see that for every increase of 1.82% in the batting average, the salary will increase as well. Figure 1.2M illustrates the positive linear relationship between the data.

Hypothesis testing is used to resolve conflicts between two competing opinions on a particular population parameter of interest. For salary and home runs, the null hypothesis is H₀: µ ≥ 1.9 and the alternative hypothesis is Hₐ: µ < 1.9. For salary and batting average, the null hypothesis is H₀: µ ≥ 1.82 and the alternative hypothesis is Hₐ: µ < 1.82.

The data in this report can be useful to team owners for player retention, recruiting, and making decisions based on player performance. Of particular note is the relationship between players’ batting averages and their salaries. The data show that although there is a positive linear relationship between the two variables, a player’s batting average doesn’t necessarily influence their salary. Many players have batting averages similar to those of the top earners yet are paid significantly less. Also, in the home run data, 152 of the 254 players (about 60%) had between 100 and 200 home runs. Overall, the data illustrate that there is a positive linear relationship between salary and performance; though it is slight, it is present.
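Two of the figures quoted above can be sanity-checked with simple arithmetic. The sketch below uses only numbers stated in this report, and the variable names are illustrative. Comparing the mean with the median is a rule-of-thumb check for skew, not a formal test.

```python
n_players = 254

# Salary summary statistics quoted in the report.
mean_salary = 4_689_717.22
median_salary = 3_500_000

# A mean above the median is a common sign of positive (right) skew.
positively_skewed = mean_salary > median_salary

# Expected number of players batting below .300, given the stated probability.
p_below_300 = 0.889
expected_below_300 = p_below_300 * n_players

print("Mean above median:", positively_skewed)
print("Expected players below .300:", round(expected_below_300))
```

The product 0.889 × 254 is an expected count, so it is best read as an approximate figure rather than a guaranteed minimum.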