© All Rights Reserved


Math 241

4/17/15

Baseball Data Project

This statistics project is based on baseball data. Its purpose is to determine whether there is any correlation between the salary a professional baseball player earns and several performance variables: RBIs, walks, hits, runs, and at bats. The population is the entirety of MLB, and the sample consists of 244 MLB players on 11 different teams.

The first part of my project was to determine the descriptive statistics of the baseball data set: the mean, median, mode, range, standard deviation, and variance of the salaries earned by the players in the sample. The following table shows these statistics.

Variable  Mean     StDev  Variance  Median   Range  Mode    N for Mode
Salary    3577408                   1500000         500000  21

The mean salary of this data set is $3,577,408. I consider the mean the most useful measure of central tendency here. The other measures are interesting, showing the median salary ($1,500,000) and the most common salary ($500,000, earned by 21 players), but the mean gives the average salary earned by the sample and a baseline for the comparisons that follow. The most useful measure of variability in this data set is the standard deviation. It lets us judge whether a salary is typical, small, or large by how many standard deviations it lies from the mean.
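
As a rough illustration of how these descriptive statistics can be computed, the sketch below uses Python's standard `statistics` module. The salary list is a small hypothetical sample, not the project's 244-player data set, and the helper `deviations_from_mean` is an illustrative name.

```python
import statistics

# Hypothetical salaries in dollars (NOT the project's 244-player data set).
salaries = [500_000, 500_000, 800_000, 1_000_000, 1_500_000, 2_000_000,
            3_000_000, 4_500_000, 6_000_000, 12_000_000, 25_000_000]

mean = statistics.mean(salaries)
median = statistics.median(salaries)
mode = statistics.mode(salaries)
stdev = statistics.stdev(salaries)        # sample standard deviation
variance = statistics.variance(salaries)  # sample variance
value_range = max(salaries) - min(salaries)


def deviations_from_mean(salary):
    """Return how many standard deviations a salary lies from the mean."""
    return (salary - mean) / stdev


print(f"mean={mean:,.0f} median={median:,.0f} mode={mode:,.0f}")
print(f"stdev={stdev:,.0f} range={value_range:,.0f}")
print(f"25,000,000 is {deviations_from_mean(25_000_000):.2f} deviations above the mean")
```

A salary several standard deviations above the mean would be considered unusually large for the sample.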

The histogram that follows gives a visual representation of how salary is distributed over the sample. It shows how many players fall into each salary range and gives a rough picture of the overall shape of the distribution.

[Figure: Histogram of Salary. x-axis: Salary; y-axis: Frequency]

This histogram shows that most players earn a salary in the $0 to $4,000,000 range. The second most common range is $4,000,000 to $8,000,000. The frequencies drop off dramatically after these first two ranges, down to only one player making between $24,000,000 and $28,000,000. The data in this histogram is substantially skewed to the right: the long tail extends toward the high salaries, which is consistent with the mean ($3,577,408) being well above the median ($1,500,000). The distribution is not symmetric.

The following boxplot shows much of the same information as the histogram above. It makes some information easier to see, though, such as outliers.

[Figure: Boxplot of Salary. Axis: Salary, 5,000,000 to 25,000,000]

The box plot shows many outliers above the third quartile. The outliers start at approximately $11,000,000 and cluster between there and roughly $16,000,000. This cluster is followed by two more outliers even further above the others. The data is also clustered around the first and second quartiles, and most of the salaries fall below the 5 million dollar mark, near the third quartile. Above this cluster, the data is more spread out between the second quartile and the third quartile.
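
Boxplots conventionally flag a point as an outlier when it lies more than 1.5 times the interquartile range beyond a quartile. The sketch below applies that common rule to a small hypothetical salary list (not the project's data); whether the software used for this project applies exactly this rule is an assumption.

```python
import statistics

# Hypothetical salaries in dollars (NOT the project's 244-player data set).
salaries = [500_000, 500_000, 800_000, 1_000_000, 1_500_000, 2_000_000,
            3_000_000, 4_500_000, 6_000_000, 12_000_000, 25_000_000]

# quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(salaries, n=4)
iqr = q3 - q1

# Common 1.5 * IQR fences used to flag boxplot outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [s for s in salaries if s < lower_fence or s > upper_fence]
print(f"Q1={q1:,.0f} Q3={q3:,.0f} upper fence={upper_fence:,.0f}")
print("outliers:", outliers)
```

With right-skewed salary data, the lower fence usually falls below zero, so all flagged points sit on the high side, as in the boxplot described above.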

A confidence interval gives a range of values that, at a stated level of confidence, contains the true population mean. The following interval does this with a 95% level of confidence.

N    Mean     SE Mean  95% CI
244  3577408  275615   (3037212, 4117604)

This means we can be 95% confident that the interval from $3,037,212 to $4,117,604 contains the true mean salary. The sample mean of $3,577,408 is the center of that interval, not the true mean itself.
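
The interval can be checked from the reported mean and standard error. This is a minimal sketch that assumes a normal (z) critical value; with that assumption it reproduces the reported bounds to within about a dollar of rounding.

```python
from statistics import NormalDist

# Values reported in the confidence interval output.
sample_mean = 3_577_408
se_mean = 275_615
confidence = 0.95

# Assumption: normal (z) critical value for a two-sided interval.
z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

margin = z_crit * se_mean
lower = sample_mean - margin
upper = sample_mean + margin

print(f"{confidence:.0%} CI: ({lower:,.0f}, {upper:,.0f})")
```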

Next, I conducted a hypothesis test to determine whether the sample data provides sufficient evidence to conclude that the population mean salary is less than 3 million. The null hypothesis is that the true mean salary equals 3 million, and the alternative is that it is less than 3 million. The test gave a p-value of .981. Comparing that with a significance level of .05, there is not sufficient evidence to reject the null hypothesis. By not rejecting the null hypothesis, I conclude that we do not have sufficient evidence that the true mean salary of the population of players is less than 3 million.
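
The test can be sketched from the reported sample mean and standard error. This version assumes a normal approximation for the test statistic, so its p-value lands close to, but not exactly on, the reported .981.

```python
from statistics import NormalDist

# Values reported earlier in the project.
sample_mean = 3_577_408
se_mean = 275_615

mu_0 = 3_000_000   # hypothesized mean under the null hypothesis
alpha = 0.05       # significance level

# Test statistic: how many standard errors the sample mean is from mu_0.
z = (sample_mean - mu_0) / se_mean

# Left-tailed test (alternative: true mean < 3 million).
# Assumption: normal approximation to the sampling distribution.
p_value = NormalDist().cdf(z)

reject_null = p_value < alpha
print(f"z = {z:.3f}, p-value = {p_value:.3f}, reject null: {reject_null}")
```

Because the sample mean is above 3 million, the left-tailed p-value is large and the null hypothesis is not rejected.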

After the hypothesis test I conducted a regression analysis of Salary versus Runs, and determined the regression equation relating salary to the number of runs to be Salary = 2158258 + 60844(Runs). Using this regression equation I calculated a player's predicted salary if they scored 18 runs, and then 33 runs. The calculations follow.

Salary = 2158258 + 60844(18) = 3253450

Salary = 2158258 + 60844(33) = 4166110
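
The same two predictions can be reproduced in a few lines. The function name `predict_salary` is illustrative, and the intercept is the one used in the worked calculations above (2158258).

```python
# Coefficients as used in the worked calculations.
INTERCEPT = 2_158_258   # predicted salary at 0 runs
SLOPE = 60_844          # predicted salary increase per additional run


def predict_salary(runs):
    """Predicted salary in dollars from the fitted line Salary = a + b * Runs."""
    return INTERCEPT + SLOPE * runs


for runs in (18, 33):
    print(f"{runs} runs -> predicted salary {predict_salary(runs):,}")
```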

According to this regression equation, a player who scores more runs would earn a higher salary. After these calculations, however, I constructed a scatterplot of earned salary versus the number of runs (coefficient of determination = 14.6%).

[Figure: Scatterplot of Salary vs Runs. x-axis: Runs (0 to 120); y-axis: Salary (0 to 25,000,000)]

Even though the regression equation predicts a higher salary for more runs scored, the scatterplot suggests that a player's earned salary has no strong linear relationship with the number of runs. The scatterplot shows no distinct line relating earned salary to number of runs over the range from 0 to 120 runs, and the coefficient of determination of 14.6% means that runs account for only a small share of the variation in salary. This suggests that this regression is not an appropriate way of predicting salary in this scenario.

I continued my analysis of the data set by conducting multiple hypothesis tests to determine whether there was a correlation between earned salary and the different variables. The first hypothesis test was a continuation of the calculations above, testing the correlation between salary and runs. Performing this hypothesis test I obtained a correlation coefficient of .383 and a p-value reported as 0.0, meaning it is smaller than the displayed precision. With this p-value and a significance level of .05, I have sufficient evidence to reject the null hypothesis of no correlation. This means there is sufficient evidence that a correlation between the number of runs and earned salary is present. The caveat is that with a correlation coefficient this low (.383), the linear relationship is positive but weak.
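
A weak correlation can still be statistically significant when the sample is large. The sketch below computes the usual t statistic for a correlation coefficient, t = r * sqrt(n - 2) / sqrt(1 - r^2), from the reported r and sample size. The p-value uses a normal approximation to the t distribution, which is an assumption made here to stay within the standard library.

```python
import math
from statistics import NormalDist

n = 244     # players in the sample
r = 0.383   # reported correlation between salary and runs

# t statistic for testing the null hypothesis of zero correlation.
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Assumption: with n - 2 = 242 degrees of freedom, the t distribution is
# close to the standard normal, so a normal tail approximates the p-value.
p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))

print(f"t = {t_stat:.2f}, approximate two-sided p-value = {p_value:.2e}")
```

A p-value this small would display as 0.0 at three decimal places, which matches the output described above.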

I concluded my analysis by performing hypothesis tests on the remaining variables to determine whether there was a correlation between each of them and earned salary. The following table shows the correlation coefficient and p-value for each hypothesis test.

Hypothesis Tests

Comparison         Correlation Coefficient  P-value
Salary vs RBIs     .406                     0.0
Salary vs HITS     .356                     0.0
Salary vs RUNS     .383                     0.0
Salary vs WALKS    .416                     0.0
Salary vs AT BATS  .359                     0.0

The table above shows the correlation coefficient for each of the independent variables when compared to the salary variable. Each hypothesis test had a p-value reported as 0, and at a significance level of .05 the null hypothesis for each test can be rejected. These rejections indicate that there is a correlation between each independent variable and the earned salary of baseball players. But as with runs, there is still a caveat. The correlation coefficients for each test, while positive, were fairly low, meaning that each linear relationship is only slight.
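
One way to see how slight these relationships are is to square each correlation coefficient, which gives the share of the variation in salary that a single-variable linear fit would account for. The sketch below uses the coefficients reported in the table; the dictionary and variable names are illustrative.

```python
# Correlation coefficients with salary, as reported in the table.
correlations = {
    "RBIs": 0.406,
    "Hits": 0.356,
    "Runs": 0.383,
    "Walks": 0.416,
    "At bats": 0.359,
}

# r squared: fraction of salary variation explained by each variable alone.
r_squared = {name: r ** 2 for name, r in correlations.items()}

strongest = max(correlations, key=correlations.get)

for name, value in r_squared.items():
    print(f"{name:8s} r = {correlations[name]:.3f}  r^2 = {value:.1%}")
print("strongest single correlation:", strongest)
```

Every variable explains less than a fifth of the variation in salary on its own.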

After performing multiple types of tests on this data set, I can conclude that while there is a correlation between the salary earned by players and each independent variable, the correlation is only slight. Calculating predictions from a regression equation and then reviewing a scatterplot gave mixed results, but by performing hypothesis tests on each independent variable versus salary I was able to determine a level of correlation. This implies that players who achieve high levels in each of the independent variables tend to have a higher salary than others. To test this correlation further, a new data set could be built from players at both ends of the spectrum. Players with low stats and players with high stats could both be sampled, and their salaries compared against each of the independent variables.
