Uploaded by lucasvancleef

A report on life expectancy for Ohio citizens using regression

12/2/13

Applied Regression

Dr. Burnell

**Data Analysis of the Life Expectancy for Citizens of Ohio**

Though death may occur at any time for any number of unforeseeable reasons, a multitude of studies have been conducted regarding factors that may affect human life expectancy. The purpose of this paper is to use regression to determine whether there is statistical evidence that these factors have a significant impact on the age at which an individual dies, and to analyze what that impact is. Focusing exclusively on Ohio, I have selected a number of these factors for which I have access to a sufficient amount of data, and suggest how they may impact the age at which a citizen of the state dies. This data comes from the 1998 US Census, further stratified so as to only represent mortality statistics within the state of Ohio during that year. This data was gathered from www.dataferret.census.gov. The units of observation for my model are people who died in 1998 in the state of Ohio, and there are 105,891 observations in my sample.

Model Specification

The dependent variable in my model is “Age”, which refers to the age at which a particular observation died. Though humans are able to die at any age, both previously performed statistical analysis and theory suggest that a variety of factors affect the estimated value of this variable. Having access to data for only a limited number of these proposed factors, I have selectively chosen variables from a variety of different vectors under which the proposed factors of life expectancy can be categorized. The vectors that I made an effort to represent include educational, environmental, genetic, lifestyle, and health.

The sole independent variable of the education vector in my model is “Educ”. Educ measures the number of years of schooling a particular observation has completed, from no formal education up to five years of college education. A 2012 study on the impact of education and race on life expectancy, using census data from 1990-2008, concluded that there is a positive correlation between additional years of education and longevity of life. The underlying cause of this relationship proposed by the researchers involved is that “Education is an important variable known to influence health inequalities, and it is also a principal component of socioeconomic status. Educational attainment is only one of several indicator variables used to measure socioeconomic status, all of which can influence health and longevity.” (Olshansky, Antonucci, et al. 2012) Considering this analysis, I expect the coefficient of my “Educ” variable will be positive.

For the vector of environmental factors, I have included the variable of “Metro”, which describes whether or not an observation was a citizen of a city classified as metropolitan, as determined by the Office of Management and Budget. This is a dummy variable created from the variable in my original dataset, “Cityrs”, which describes the city of residence of an observation, based off a recode that assigns each city in Ohio to a different numerical value. The dummy variable “Metro” takes the value of 1 for each observation whose “Cityrs” variable has the value of a city defined as metropolitan. While metropolitan citizens have a greater risk of being killed by violent crime and sexually transmitted disease, an annual report by County Health Ratings published in 2011 suggests that they have a greater life expectancy than their rural counterparts. (Beck, 2011) This is attributed to the greater mean socioeconomic status of metropolitan dwellers, as well as reduced risk for more common causes of death such as traffic accidents. I expect the value of the coefficient on “Metro” to be positive.

The first variable included in my regression from the genetic vector is that of race, as measured by the recode “Racer3”. This particular recode separates race into only three sectors of white, black, and non-white races other than black. For the purposes of my regression, I have created the dummy variable of “White”, which has the value of 1 for each observation described as white in the original race variable. I found this appropriate, as theory widely suggests that being white has a positive effect on life expectancy compared to every other race. Consequently, I expect the coefficient of this variable to be positive.

A second genetic variable that I included in my model is the variable of “Sex”. This variable has been further changed into the dummy variable “Male”. In the aforementioned study on the impact of race and education on life expectancy, it was observed that amongst all races and levels of education women had longer life expectancy than men. (Olshansky, Antonucci, et al. 2012) Because of this, I believe that the coefficient on my variable “Male”, which has the value of 1 if the observation was a male, will be negative.

In regards to lifestyle, the first variable I included was that of “Marstat”, which refers to whether the observation was married, single, divorced or widowed. Since studies on the topic have concluded that being married leads to a longer life, I have turned this variable into the dummy variable “Married”, which takes the value of 1 if the observation is married. My theoretical basis for this expected positive coefficient comes from a study done on 4,802 individuals known as the North Carolina Alumni Heart Study. This study concludes that there is evidence to suggest that “Being single, or losing a partner without replacement, increased the risk of early death during middle age and reduced the likelihood that one would survive to be elderly” (Brummet, Helms, Martin, and Siegler, 2013, p. 345).

The other lifestyle variable accounted for in my data is that of usual occupation, expressed in the variable “occup”. This variable assigns a unique numeric value to a large spectrum of occupations, and associates one of these values to each observation based on what that person’s usual occupation was at the time of their death. In regards to the effects of occupation types on life expectancy, a study done by Columbia University using information from the U.S. National Health Interview Survey (NHIS), Medical Expenditure Panel Survey (MEPS) and National Death Index concludes that there is evidence that individuals in blue collar jobs are more likely to have medical problems, and to have to continue working in spite of these medical issues. (Berger, 2011) In light of this information, I have created a dummy variable labeled “Bluecollar”, which takes the value of 1 in cases where the observation has an occupation that is considered blue collar. The criterion used for blue collar in this variable is that the observation’s occupation must be both labor-oriented and not require advanced education in order to enter.

The final vector of factors that I included in my analysis consists of variables related to the health status of the observation. As US Census data has repeatedly shown, cancer and heart disease are the two leading causes of death among all people. With this in mind, I used the variable “ucr52”, a recode of the cause of death for each observation in my study, to create the dummy variable “Cancerorheartdisease”, which takes on the value of 1 if the cause of the observation’s death is heart disease or cancer. Since cancer and heart disease are the leading causes of premature death, I expect that the coefficient on this dummy variable will be negative.

While cancer is one of the leading causes of death within the US, a 2013 study by the American Society of Clinical Oncology reports evidence that cancer patients who are married survive longer than those who are not. The study concludes that married cancer patients are likely to live longer than their unmarried counterparts due to the role that the emotional support of a spouse plays in coping with cancer, as well as the reduced chance of undertreatment for a patient with a spouse. (Aizer, Chen, et al., 2013) To account for this, I multiplied my “Married” variable with my variable “Cancer”, a dummy variable that takes the value of 1 for each observation whose cause of death was cancer, to create the variable “Marriedcancer”, which I expect will be positive. The purpose of this variable is to account for the effect that theory supposes being married has on the life expectancy of a cancer patient.

Having determined these variables to be theoretically important in determining the life expectancy of a human being, the functional form of my model is such that:

$$Y = \beta_0 + \beta_1\,\text{Educ} + \beta_2\,\text{Metro} + \beta_3\,\text{White} + \beta_4\,\text{Male} + \beta_5\,\text{Married} + \beta_6\,\text{Bluecollar} + \beta_7\,\text{Cancerorheartdisease} + \beta_8\,\text{Marriedcancer} + \varepsilon$$
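The recoding of raw census fields into the dummy variables and the interaction term described above can be sketched in Python as follows. This is only an illustration: the paper does not list the actual code values of “Cityrs”, “Racer3”, “Sex”, “Marstat”, “occup”, or “ucr52”, so every numeric code and the `build_regressors` helper below are hypothetical placeholders.

```python
# All raw code values below are hypothetical placeholders; the paper does not
# list the actual codings used in the census extract.
METRO_CITIES = {101, 102}             # hypothetical Cityrs codes for metropolitan cities
WHITE = 1                             # hypothetical Racer3 code for white
MALE = 1                              # hypothetical Sex code for male
MARRIED = 2                           # hypothetical Marstat code for married
BLUE_COLLAR_OCCUPATIONS = {500, 510}  # hypothetical occup codes for blue collar work
CANCER = 23                           # hypothetical ucr52 code for cancer
HEART_DISEASE = 31                    # hypothetical ucr52 code for heart disease


def build_regressors(record):
    """Turn one raw death record (a dict) into the eight regressors of the model."""
    married = int(record["marstat"] == MARRIED)
    cancer = int(record["ucr52"] == CANCER)
    heart = int(record["ucr52"] == HEART_DISEASE)
    return {
        "Educ": record["educ"],  # years of schooling, used as-is
        "Metro": int(record["cityrs"] in METRO_CITIES),
        "White": int(record["racer3"] == WHITE),
        "Male": int(record["sex"] == MALE),
        "Married": married,
        "Bluecollar": int(record["occup"] in BLUE_COLLAR_OCCUPATIONS),
        "Cancerorheartdisease": int(cancer or heart),
        # Interaction term: 1 only when the person was married AND died of cancer.
        "Marriedcancer": married * cancer,
    }


# A made-up record, not taken from the paper's data.
example = {"educ": 12, "cityrs": 101, "racer3": 1, "sex": 1,
           "marstat": 2, "occup": 500, "ucr52": 23}
row = build_regressors(example)
print(row)
```

Because “Marriedcancer” is the product of two dummies, it can only equal 1 for observations where “Married” is also 1, which is why some correlation between those two regressors is built into the design.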

In this specification, β0 represents the constant term, which would be the estimated value of Y if every independent variable had a value of 0, and ɛ is my error term, the nonsystematic component of Y that cannot be determined from my independent variables. The functional form of this model can be described as the linear form, as the slope between my dependent variable and each independent variable will remain constant.

Regression Analysis

After running my proposed model in Stata, the following results were generated.

The first variable, “Educ”, is statistically significant at the 5% level of significance, for which the critical t-value is 1.96, with a calculated t-score of -15.43. The coefficient of -.1119774 should be interpreted as saying that for each additional year of education, holding the other variables constant, the life expectancy of an individual decreases by .1119774 years. For the independent variable “Metro”, the calculated t-score of .78 is not high enough to reject the null that it is insignificant at the 5% level of significance. If “Metro” is kept in the model due to theoretical significance, its coefficient is interpreted as saying that for an observation whose city of residence was metropolitan, life expectancy increases by .6817049 years.

The race dummy variable “White” has a t-value above 3, and so is significant at the 5% level. The coefficient of 1.633431 on White means that if an individual observation is white, their life expectancy is raised by a constant of 1.633431 years. The gender dummy variable “Male” is significant at the 5% level with a t-score above 3, and has a coefficient of 1.101422. This coefficient means that if an observation happens to be male, their life expectancy is increased by 1.101422 years compared to if they were a woman.

The t-score on the dummy variable “Married”, which is below -34, indicates that the variable is significant at the 5% level. Since “Married” has a coefficient of -10.68347, it should be interpreted that if an observation is married, their life expectancy decreases by 10.68347 years. The dummy variable “Bluecollar”, whose t-score is below -22, is also significant at the 5% level. The coefficient of this variable, which is -7.343748, means that if an observation happened to work in a field that is considered blue collar, their life expectancy is 7.343748 years less than it would have been otherwise.

The health dummy variable “Cancerorheartdisease”, which has a t-score below -23, is also significant at the 5% level. The coefficient of this variable, -6.031417, means that if an observation became afflicted with one of the illnesses that make up the two leading causes of death, their life expectancy dropped by 6.031417 years. The final included variable, “Marriedcancer”, is insignificant at the 5% level of significance with a t-score of less than 1, and has a coefficient of .138705, meaning that if kept in the model, observations who are both married and have cancer have their life expectancy increased by .138705 years.

With these coefficients being known and interpreted, my model for life expectancy is now:

$$\hat{Y} = 86.7932 - 0.1119774\,\text{Educ} + 0.6817049\,\text{Metro} + 1.633431\,\text{White} + 1.101422\,\text{Male} - 10.68347\,\text{Married} - 7.343748\,\text{Bluecollar} - 6.031417\,\text{Cancerorheartdisease} + 0.138705\,\text{Marriedcancer}$$

Though several of the variables are highly significant, my R-squared value of .0316 means that only 3.16% of the variation in “Age” can be explained by the independent variables in my model.
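The R-squared above also determines the overall F-statistic of the regression. As a rough check, the sketch below applies the standard textbook relationship F = (R²/k) / ((1 − R²)/(n − k − 1)), assuming n = 105,891 observations and k = 8 explanatory variables as described in this paper. The function name is my own, and because R-squared is only reported to four decimal places, the result is approximate.

```python
def overall_f_statistic(r_squared, n_obs, n_regressors):
    """Overall F-statistic implied by R-squared, plus the residual degrees of freedom."""
    df_resid = n_obs - n_regressors - 1
    f_stat = (r_squared / n_regressors) / ((1.0 - r_squared) / df_resid)
    return f_stat, df_resid


# Figures taken from the paper: R-squared, sample size, number of regressors.
f_stat, df_resid = overall_f_statistic(0.0316, 105_891, 8)
print(f"residual degrees of freedom: {df_resid}")
print(f"implied F-statistic: {f_stat:.2f}")
```

With these inputs the implied F-statistic comes out at roughly 432, which illustrates how a very low R-squared can still accompany a very large F-statistic when the sample has more than a hundred thousand observations.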

With 8 explanatory variables and 105,882 degrees of freedom, the critical F-value for my regression is 1.94. Since the calculated F-value of my regression is 432.36, I reject the null that my model is not statistically significant.

Econometric Problems

Upon running my regression, one of the first things noticeable is that several of my variables are highly significant with the opposite sign than was expected. This is an indicator of potential omitted variable bias. In the particular case of my model, I believe that omitted variable bias is playing a role in my unexpected results due to the theoretical underpinnings of my model. While the variables in my model may be some of the factors that potentially have an impact on life expectancy, there is a multitude of theoretically relevant factors that I did not have the data to include in my analysis. These factors include things such as statistics on cigarette smoking, dietary habits, environmental conditions, and many more. It is entirely possible that, with the large number of theoretically relevant variables left out of my model, the coefficients of the variables included in my model have captured some of their effect on “Age”. Unfortunately, I neither have a way to determine what these potential omitted variables are, nor do I have access to enough data to add these variables to my dataset should I discover what they are.

Since a certain degree of multicollinearity exists in all equations, I have undergone the steps to identify whether or not its presence is particularly impactful on my particular equation. The first step I took in this measure was to look at the simple correlation coefficients between the variables in my equation.

Upon determining the correlation coefficients, I have observed that no two variables have a troublingly high level of correlation. The highest correlations in my equation are between “Bluecollar” and “Male”, with a value of 0.4473, and between “Marriedcancer” and “Married”. These are not particularly damning realizations, as the correlation between “Bluecollar” and “Male” may be due to blue collar jobs being predominantly performed by men, but both variables still remain theoretically significant enough that this coefficient is trivial. The high correlation between “Married” and “Marriedcancer” is to be expected, as the two are inherently linked by the fact that they both only apply to married individuals. Ultimately, my correlation coefficients do not provide evidence that suggests my equation is being afflicted by multicollinearity. Further supplementing my conclusion are the low VIF scores on my variables, which are derived by regressing each of my independent variables, as the dependent variable, on all of the other independent variables.
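The two diagnostics used above can be sketched as follows. The first function computes a simple (Pearson) correlation coefficient; the second computes the VIF for the special case where the auxiliary regression contains only one other regressor, so that its R-squared equals the squared correlation. The toy dummy data and both function names are my own illustrations, not the paper's data or Stata's implementation.

```python
import math


def pearson(xs, ys):
    """Simple correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def vif_single_regressor(r):
    """VIF = 1 / (1 - R_j^2), with R_j^2 = r^2 when only one other regressor is used."""
    return 1.0 / (1.0 - r * r)


# Made-up dummy variables for eight people (not from the paper's sample).
male = [1, 1, 1, 1, 0, 0, 0, 0]
bluecollar = [1, 1, 1, 0, 1, 0, 0, 0]
r_toy = pearson(male, bluecollar)

# The correlation of 0.4473 reported in the paper, fed through the same formula.
vif_from_paper_r = vif_single_regressor(0.4473)
print(f"toy correlation: {r_toy:.4f}")
print(f"VIF implied by r = 0.4473 alone: {vif_from_paper_r:.3f}")
```

A full VIF regresses each variable on all of the others, and adding regressors can only raise the auxiliary R-squared, so the single-regressor value here is a lower bound on the true VIF rather than a replacement for it.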

While there are differing opinions on what constitutes a high enough VIF score to warrant suspicion of multicollinearity, the VIF score for each of my variables is between 1 and 2, which are all relatively low by standard analysis.

Due to the large absolute values on a number of my equation’s t-scores, I have decided to go ahead and test for heteroskedasticity. This was done by employing the Breusch-Pagan test in Stata. Since the Prob > chi2 value for my Breusch-Pagan test is below .05 (more specifically, .0000), there is evidence of heteroskedasticity in my model. The presence of heteroskedasticity in my model means that the error terms of my observations are not drawn from a distribution that has a common variance. (Studenmund, 2011, p. 337) This heteroskedasticity leads the standard errors of my coefficients to be underestimated, causing my t-scores to be inflated and unreliable. Consequently, I can no longer have faith in the determinations I made about which of my variables are statistically significant and which ones are not. To remedy this error, I have rerun my regression, but this time with robust standard errors.
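The idea behind the Breusch-Pagan test can be sketched for a regression with a single explanatory variable: fit the model, regress the squared residuals on the regressor, and compare n·R² of that auxiliary regression with a chi-squared distribution. The version below is the studentized (n·R²) variant with one regressor, which is an assumption on my part and may differ from the variant the Stata run used; the data are constructed by hand rather than taken from the paper.

```python
import math


def ols_simple(xs, ys):
    """Intercept and slope of a one-regressor least squares fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared_simple(xs, ys):
    """R-squared of a one-regressor fit (the squared correlation)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)


def breusch_pagan(xs, ys):
    """LM statistic (n * auxiliary R-squared) and its chi-squared(1) p-value."""
    intercept, slope = ols_simple(xs, ys)
    squared_residuals = [(y - intercept - slope * x) ** 2 for x, y in zip(xs, ys)]
    lm = len(xs) * r_squared_simple(xs, squared_residuals)
    p_value = math.erfc(math.sqrt(lm / 2.0))  # upper tail of chi-squared with 1 df
    return lm, p_value


xs = list(range(1, 51))
signs = [1 if i % 2 == 0 else -1 for i in range(50)]
# Error size grows with x: heteroskedastic by construction.
y_hetero = [2 + 3 * x + s * x for x, s in zip(xs, signs)]
# Error size is constant: homoskedastic by construction.
y_homo = [2 + 3 * x + s for x, s in zip(xs, signs)]

lm_hetero, p_hetero = breusch_pagan(xs, y_hetero)
lm_homo, p_homo = breusch_pagan(xs, y_homo)
print(f"growing error variance:  LM = {lm_hetero:.2f}, p = {p_hetero:.4g}")
print(f"constant error variance: LM = {lm_homo:.2f}, p = {p_homo:.4g}")
```

A p-value below .05 in the first case and above .05 in the second mirrors the decision rule applied in the text.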

The results are extremely similar to my original regression, though the standard errors are larger than before. In light of this second regression, I still have the same statistically significant variables and coefficients for my model.

Assessment of Model

Ultimately, I do not believe the results of my model are very valid in describing how the different variables play into determining an individual’s life expectancy. This is in part due to my low R-squared, which only confirms my theory-based suspicion that there are many other variables relevant in determining the life expectancy of a person. Furthermore, this suspicion of extreme omitted variable bias is supplemented by the presence of descriptive variables in my sample being highly significant with the opposite sign than theory would dictate. This includes the variables “Male”, “Educ”, and “Married”. It is possible that this is due to my sample being only from one state, and possibly being outdated as it is 15 years old, but as mentioned throughout this project, a slew of conditions suggest to me that omitted variables are greatly affecting my model.

I believe the greatest limitation in my process was lack of access to data about a number of health, genetic, and environmental characteristics of the observations in my sample. Working with only mortality statistics from the year 1998, I did not have access to data about a large number of variables that theory would suggest contribute to an individual’s life expectancy. There are a great number of factors under all these categories that lead to human mortality, which causes me to believe that information must be known about them in order to make a valid estimate of an individual’s life expectancy.

Works Cited

Aizer, A., Chen, M., McCarthy, E., Mendu, M., Hu, J., Nguyen, P., et al. (2013). Marital Status and Survival in Patients With Cancer. Journal of Clinical Oncology, 31. Retrieved December 3, 2013, from http://jco.ascopubs.org/content/early/2013/09/18/JCO.2013.49.6489.abstract

Berger, S. (n.d.). Blue Collar Workers Can Look Forward to Working Longer and in Worse Health than their White Collar Bosses. Columbia University. Retrieved December 2013, from http://www.mailman.columbia.edu/news/blue-collar-workers-can-look-forward-working-longer-and-worse-health-their-white-collar-bosses

City vs. Country: Who Is Healthier? (n.d.). Wall Street Journal. Retrieved December 2013, from http://online.wsj.com/news/articles/SB10001424052702304793504576434442652581806

Marriage Linked to Better Survival in Middle Age; Study Highlights Importance of Social Ties During Midlife. (2013, January 10). ScienceDaily. Retrieved December 2013, from http://www.sciencedaily.com/releases/2013/01/130110102342.htm

Olshansky, S., Antonucci, T., Jackson, J., Kohli, M., Rother, J., Zheng, Y., et al. (2012). Differences In Life Expectancy Due To Race And Educational Differences Are Widening, And Many May Not Catch Up. Health Affairs, 31(8). Retrieved December 2013, from http://content.healthaffairs.org/content/31/8/1803.full

Parker-Pope, T. (2013, September 24). Married Cancer Patients Live Longer. New York Times. Retrieved December 2013, from http://well.blogs.nytimes.com/2013/09/24/married-cancer-patients-live-longer/?_r=0

Studenmund, A. H. (2011). Using Econometrics: A Practical Guide. Boston, Mass.: Addison Wesley.
