Lucas Van Cleef

Applied Regression
Dr. Burnell

Data Analysis of the Life Expectancy for Citizens of Ohio
Though death may occur at any time for any number of unforeseeable reasons, a
multitude of studies have been conducted regarding factors that may affect human life
expectancy. The purpose of this paper is to use regression to determine whether there is
statistical evidence that these factors have a significant impact on the age at which an
individual dies, and to analyze what that impact is. Focusing exclusively on Ohio, I have
selected a number of these factors for which I have access to a sufficient amount of data to
analyze, and I suggest how they may impact the age at which a citizen of the state dies. This
data comes from the 1998 US Census, further stratified so
as to only represent mortality statistics within the state of Ohio during that year. This data was
gathered from
The units of observation for my model are people who died in 1998 in the state of Ohio, and
there are 105,891 observations in my sample.

Model Specification
The dependent variable in my model is “Age”, which refers to the age at which a
particular observation died. Though a person can die at any age, both previously performed
statistical analysis and theory suggest that a variety of factors affect the estimated value
of this variable. Having access to data for only a limited number of these proposed factors,
I have chosen variables from several of the vectors under which the proposed factors of life
expectancy can be categorized. The vectors that I made an effort to represent are
educational, environmental, genetic, lifestyle, and health.
The sole independent variable of the education vector in my model is “Educ”. Educ measures the number of years of schooling a particular observation has completed, from no formal education up to five years of college education. A 2012 study on the impact of education and race on life expectancy, using census data from 1990-2008, concluded that there is a positive correlation between additional years of education and longevity of life. The underlying cause of this relationship proposed by the researchers involved is that “Education is an important variable known to influence health inequalities, and it is also a principal component of socioeconomic status. Educational attainment is only one of several indicator variables used to measure socioeconomic status, all of which can influence health and longevity” (Olshansky, Antonucci, et al., 2012). Considering this analysis, I expect the coefficient of my “Educ” variable to be positive.
For the vector of environmental factors, I have included the variable “Metro”, which describes whether or not an observation was a resident of a city classified as metropolitan, as determined by the Office of Management and Budget. This is a dummy variable created from the variable in my original dataset, “Cityrs”, which describes the city of residence of an observation, based off a recode that assigns each city in Ohio to a different numerical value. The dummy variable “Metro” takes the value of 1 for each observation whose “Cityrs” variable has the value of a city defined as metropolitan. While metropolitan citizens have a greater risk of being killed by violent crime and sexually transmitted disease, an annual report published by County Health Ratings in 2011 suggests that they have a greater life expectancy than their rural counterparts (Beck, et al., 2011). This is attributed to the greater mean socioeconomic status of metropolitan dwellers, as well as reduced risk for more common causes of death such as traffic accidents. I expect the value of the coefficient on “Metro” to be positive.
The first variable included in my regression from the genetic vector is that of race, as measured by the recode “Racer3”. This particular recode separates race into only three categories: white, black, and non-white races other than black. For the purposes of my regression, I have created the dummy variable “White”, which has the value of 1 for each observation described as white in the original race variable.

I found this coding of race appropriate, as theory widely suggests that being white has a positive effect on life expectancy compared to every other race.
A second genetic variable that I included in my model is “Sex”. This variable has been further changed into the dummy variable “Male”. In the aforementioned study on the impact of race and education on life expectancy, it was observed that amongst all races and levels of education women had longer life expectancy than men (Olshansky, Antonucci, et al., 2012). Because of this, I believe that the coefficient on my variable “Male”, which has the value of 1 if the observation was a male, will be negative.
In regards to lifestyle, the first variable I included was “Marstat”, which refers to whether the observation was married, single, divorced or widowed. Since studies on the topic have concluded that being married leads to a longer life, I have turned this variable into the dummy variable “Married”, which takes the value of 1 if the observation is married. Consequently, I expect the coefficient of this variable to be positive. My theoretical basis for this expected positive coefficient comes from a study done on 4,802 individuals known as the North Carolina Alumni Heart Study. This study concludes that there is evidence to suggest that “Being single, or losing a partner without replacement, increased the risk of early death during middle age and reduced the likelihood that one would survive to be elderly” (Brummet, Helms, Martin, and Siegler, 2013, p. 345).
The other lifestyle variable accounted for in my data is that of usual occupation, expressed in the variable “occup”. This variable assigns a unique numeric value to a large spectrum of occupations, and associates one of these values to each observation based on what that person’s usual occupation was at the time of their death. In regards to the effects of occupation types on life expectancy, a study done by Columbia University using information from the U.S. National Health Interview Survey (NHIS), Medical Expenditure Panel Survey (MEPS) and National Death Index concludes that there is evidence that individuals in blue collar jobs are more likely to have medical problems, and to have to continue working in spite of these medical issues.

In light of this information (Berger, et al., 2011), I have created a dummy variable labeled “Bluecollar”, which takes the value of 1 in cases where the observation has an occupation that is considered blue collar. The criteria used for blue collar in this variable are that the observation’s occupation must be both labor-oriented and not require advanced education in order to enter.
The final vector of factors that I included in my analysis consists of variables related to the health status of the observation. As US Census data has repeatedly shown, cancer and heart disease are the two leading causes of death among all people. With this in mind, I used the variable “ucr52”, a recode of the cause of death for each observation in my study, to create the dummy variable “Cancerorheartdisease”, which takes on the value of 1 if the cause of the observation’s death is heart disease or cancer. Since cancer and heart disease are the leading causes of premature death, I expect that the coefficient on this dummy variable will be negative.
While cancer is one of the leading causes of death within the US, a 2013 study by the American Society of Clinical Oncology reports evidence that cancer patients who are married survive longer than those who are not. The study concludes that married cancer patients are likely to live longer than their unmarried counterparts due to the role that the emotional support of a spouse plays in coping with cancer, as well as the reduced chance of undertreatment for a patient with a spouse (Aizer, Chen, et al., 2013). To account for this, I multiplied my “Married” variable with my variable “Cancer”, a dummy variable that takes the value of 1 for each observation whose cause of death was cancer, to create the variable “Marriedcancer”, whose coefficient I expect will be positive. The purpose of this variable is to account for the effect that theory supposes being married has on the life expectancy of a cancer patient.
Having determined these variables to be theoretically important in determining the life expectancy of a human being, the functional form of my model is such that:
Ŷ = β0 + β1Educ + β2Metro + β3White + β4Male + β5Married + β6Bluecollar + β7Cancerorheartdisease + β8Marriedcancer + ɛ
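As a rough illustration of how the dummy and interaction variables in the model specification could be derived from the coded fields, here is a minimal Python sketch. The record layout, the function name build_regressors, and every code value below (which “Cityrs”, “Racer3”, “Sex”, “Marstat”, “occup”, and “ucr52” codes count as metropolitan, white, male, married, blue collar, cancer, or heart disease) are placeholders assumed for the example; the actual codebook values are not given in this paper.

```python
# Hypothetical code sets: the real codebook values are not stated in the paper.
METRO_CITY_CODES = {1, 2, 3}
WHITE_CODE = 1
MALE_CODE = 1
MARRIED_CODE = 2
BLUE_COLLAR_OCCUP_CODES = {500, 600}
CANCER_UCR52_CODES = {20, 21}
HEART_UCR52_CODES = {30, 31}


def build_regressors(record):
    """Turn one raw mortality record (a dict of coded fields) into model regressors."""
    married = int(record["Marstat"] == MARRIED_CODE)
    cancer = int(record["ucr52"] in CANCER_UCR52_CODES)
    heart = int(record["ucr52"] in HEART_UCR52_CODES)
    return {
        "Educ": record["Educ"],
        "Metro": int(record["Cityrs"] in METRO_CITY_CODES),
        "White": int(record["Racer3"] == WHITE_CODE),
        "Male": int(record["Sex"] == MALE_CODE),
        "Married": married,
        "Bluecollar": int(record["occup"] in BLUE_COLLAR_OCCUP_CODES),
        "Cancerorheartdisease": int(cancer or heart),
        # Interaction term: 1 only when the person was married AND died of cancer.
        "Marriedcancer": married * cancer,
    }


# Made-up record, not taken from the dataset.
example = {"Educ": 12, "Cityrs": 2, "Racer3": 1, "Sex": 1,
           "Marstat": 2, "occup": 500, "ucr52": 20}
row = build_regressors(example)
```

Because “Marriedcancer” is the product of two 0/1 variables, it can only equal 1 for observations where both “Married” and “Cancer” equal 1.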

In this model, β0 represents the constant term, which would be the estimated value of Y if every independent variable had a value of 0, and ɛ represents my error term, the nonsystematic component of my expected Y that cannot be determined from my independent variables. The functional form of this model can be described as the linear form, as the slope between my dependent and independent variables will remain constant.

Regression Analysis
After running my proposed model in Stata, the following results were generated:
The first variable “Educ” is statistically significant at the 5% level of significance, for which the critical t-value is 1.96, with a calculated t-score of -15.43. The coefficient -.1119774 should be interpreted as saying that for each additional year of education, the life expectancy of an individual decreases by .1119774 years. For the independent variable “Metro”, the calculated t-score of .78 is not high enough to reject the null hypothesis that its coefficient is zero at the 5% level of significance. If “Metro” is kept in the model due to theoretical significance, its coefficient is interpreted as saying that for an observation whose city of residence was metropolitan, their life expectancy increases by .6817049 years.
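The decision rule applied to each coefficient can be written out as a short Python sketch. It uses only the critical value of 1.96 and the two t-scores quoted in this part of the analysis; the function name is_significant is introduced here purely for illustration.

```python
CRITICAL_T = 1.96  # two-sided 5% critical value quoted in the text


def is_significant(t_score, critical=CRITICAL_T):
    """Reject the null of a zero coefficient when |t| exceeds the critical value."""
    return abs(t_score) > critical


# t-scores reported in the text for the first two variables
reported = {"Educ": -15.43, "Metro": 0.78}
decisions = {name: is_significant(t) for name, t in reported.items()}
print(decisions)  # {'Educ': True, 'Metro': False}
```

The absolute value matters here: “Educ” is significant even though its t-score is negative, because the test is two-sided.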

The race dummy variable “White” has a t-value above 3, and so is significant at the 5% level. The coefficient of 1.633431 on White means that if an individual observation is white, their life expectancy is increased by 1.633431 years. The gender dummy variable “Male” is significant at the 5% level with a t-score above 3, and has a coefficient of 1.101422. This coefficient means that if an observation happens to be male, their life expectancy is 1.101422 years higher than if they were a woman. The t-score on the dummy variable “Married” indicates that the variable is significant at the 5% level. Since “Married” has a coefficient of -10.68347, it should be interpreted that if an observation is married, their life expectancy decreases by 10.68347 years. The dummy variable “Bluecollar”, whose t-score is below -22, is also significant at the 5% level. The coefficient of this variable, which is -7.343748, means that if an observation happened to work in a field that is considered blue collar, their life expectancy is 7.343748 years less than it would have been otherwise. The health dummy variable “Cancerorheartdisease”, which has a t-score below -22, is also significant at the 5% level. The coefficient of this variable, -6.031417, means that if an observation became afflicted with one of the illnesses that make up the two leading causes of death, their life expectancy dropped by 6.031417 years. The final included variable, “Marriedcancer”, is insignificant at the 5% level of significance, with a t-score below 1, and has a coefficient of .138705, meaning that if kept in the model, observations who are both married and have cancer have their life expectancy increased by .138705 years.
With these coefficients being known and interpreted, my model for life expectancy is now:
Ŷ = 86.7932 - (.1119774)Educ + (.6817049)Metro + (1.633431)White + (1.101422)Male - (10.68347)Married - (7.343748)Bluecollar + (-6.031417)Cancerorheartdisease + (.138705)Marriedcancer + ɛ
Though several of the variables are highly significant, my R-square value of .0316 means that only 3.16% of the variation in “Age” can be explained by the independent variables in my model.
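To make the interpretation of the estimated equation concrete, the following Python sketch evaluates it for one person. The coefficients are the ones reported in this section; the function name predicted_age and the example person are hypothetical and chosen only for illustration.

```python
# Estimated coefficients as reported in the fitted equation of this section.
COEFFICIENTS = {
    "Educ": -0.1119774,
    "Metro": 0.6817049,
    "White": 1.633431,
    "Male": 1.101422,
    "Married": -10.68347,
    "Bluecollar": -7.343748,
    "Cancerorheartdisease": -6.031417,
    "Marriedcancer": 0.138705,
}
INTERCEPT = 86.7932


def predicted_age(person):
    """Return the fitted age at death for a dict of regressor values (missing = 0)."""
    return INTERCEPT + sum(
        COEFFICIENTS[name] * person.get(name, 0) for name in COEFFICIENTS
    )


# Hypothetical person: 12 years of schooling, metropolitan, white, male,
# unmarried, not blue collar, cause of death other than cancer or heart disease.
person = {"Educ": 12, "Metro": 1, "White": 1, "Male": 1}
print(round(predicted_age(person), 2))  # 88.87
```

Setting “Married” to 1 for the same person lowers the fitted value by exactly the “Married” coefficient, which is the sense in which each dummy coefficient is read as a constant shift.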

With 8 explanatory variables and 105,882 degrees of freedom, the critical F-value for my regression is 1.94. Since the calculated F-value of my regression is 432.36, I reject the null that my model is not statistically significant.

Econometric Problems
Upon running my regression, one of the first things noticeable is that several of my variables are highly significant with the opposite sign than was expected. This is an indicator of potential omitted variable bias. While the variables in my model may be some of the factors that potentially have an impact on life expectancy, there is a large multitude of theoretically relevant factors that I did not have the data to include in my analysis. These factors include things such as statistics on cigarette smoking, dietary model, environmental conditions, and many more. It is entirely possible that, with the large number of theoretically relevant variables left out of my model, the coefficients of the variables included in my model have captured some of their effect on “Age”. In the particular case of my model, I believe that omitted variable bias is playing a role in my unexpected results due to the theoretical underpinnings in my model. Unfortunately, I neither have a way to determine what these potential omitted variables are, nor do I have access to enough data to add these variables to my dataset should I discover what they are.
Since a certain degree of multicollinearity exists in all equations, I have undergone the steps to identify whether or not its presence is particularly impactful on my particular equation. The first step I took in this measure was to look at the simple correlation coefficients between the variables in my equation.
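For readers unfamiliar with the measure, a simple correlation coefficient between two dummy variables can be computed as in the Python sketch below. The eight observations are made up for the example and are not taken from the dataset; the function name pearson_r is likewise an assumption of this sketch.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Simple (Pearson) correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up values for eight observations, not taken from the dataset.
male = [1, 1, 1, 1, 0, 0, 0, 0]
bluecollar = [1, 1, 1, 0, 1, 0, 0, 0]
r = pearson_r(male, bluecollar)
print(r)  # 0.5
```

A value near 0 indicates little linear association between the two variables, while values near 1 or -1 would be a warning sign of multicollinearity.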

Upon determining the correlation coefficients, I have observed that no two variables have a troublingly high level of correlation. The highest correlations in my equation are between “Bluecollar” and “Male”, with a value of 0.4473, and between “Marriedcancer” and “Married”. These are not particularly damning realizations, as the correlation between “Bluecollar” and “Male” may be due to blue collar jobs being predominantly performed by men, but both variables still remain theoretically significant enough that this coefficient is trivial. The high correlation between “Married” and “Marriedcancer” is to be expected, as the two are inherently linked by the fact that they both only apply to married individuals. Ultimately my correlation coefficients do not provide evidence that suggests my equation is being afflicted by multicollinearity. Further supplementing my conclusion are the low VIF scores on my variables, which are derived by regressing each of my independent variables as the dependent variable, with each of the others as its independent variables.
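The VIF for a variable is 1 / (1 - R²), where R² comes from the auxiliary regression of that variable on the other independent variables. The Python sketch below applies this formula to the simplified case of only two regressors, where the auxiliary R² equals the squared simple correlation. That two-variable simplification is an assumption of the sketch: in the full model the auxiliary R² can only be as large or larger, so the result is a lower bound rather than the actual VIF.

```python
def vif_from_r_squared(r_squared):
    """Variance inflation factor for an auxiliary regression with the given R-squared."""
    return 1.0 / (1.0 - r_squared)


# Two-regressor simplification: auxiliary R-squared = squared simple correlation.
# 0.4473 is the correlation value reported in the text.
r = 0.4473
vif_two_variable_case = vif_from_r_squared(r ** 2)
print(round(vif_two_variable_case, 2))  # 1.25
```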

While there are differing opinions on what constitutes a high enough VIF score to warrant suspicion of multicollinearity, the VIF score for each of my variables is between 1 and 2, which are all relatively low by standard analysis.
Due to the large absolute values on a number of my equation’s t-scores, I have decided to go ahead and test for heteroskedasticity. This was done by employing the Breusch-Pagan test in Stata. Since the P>Chi(2) for my Breusch-Pagan test is <.05, more specifically .0000, there is evidence of heteroskedasticity in my model. The presence of heteroskedasticity in my model means that the error terms of my observations are not drawn from a distribution that has common variance (Studenmund, 2011, p. 337). This heteroskedasticity causes the standard errors of my coefficients to be underestimated, causing my t-scores to be inflated and unreliable. Consequently, I can no longer have faith in the determinations I made about which of my variables are statistically significant, and which ones are not. To remedy this error, I have rerun my regression, but this time with robust standard errors. The results are extremely similar to my original regression.
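To show what changes when robust standard errors are requested, here is a minimal Python sketch for a regression with a single explanatory variable. It computes the conventional standard error of the slope alongside the basic White (HC0) heteroskedasticity-robust version. The data are made up, the function name is hypothetical, and statistical packages often apply a small-sample correction on top of this basic form, so the sketch illustrates the idea rather than reproducing the Stata output.

```python
from math import sqrt


def slope_with_standard_errors(xs, ys):
    """OLS slope of y on x with conventional and HC0 (White) robust standard errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    sxx = sum(d * d for d in dx)
    slope = sum(d * (y - mean_y) for d, y in zip(dx, ys)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    # Conventional SE assumes every error term has the same variance.
    s2 = sum(e * e for e in residuals) / (n - 2)
    conventional_se = sqrt(s2 / sxx)
    # HC0 SE lets each observation contribute its own squared residual.
    robust_se = sqrt(sum(d * d * e * e for d, e in zip(dx, residuals))) / sxx
    return slope, conventional_se, robust_se


# Made-up data, not from the mortality dataset.
x = [1, 2, 3, 4, 5]
y = [1.0, 2.5, 2.5, 5.0, 4.0]
slope, conventional_se, robust_se = slope_with_standard_errors(x, y)
```

The coefficient itself is unchanged by the choice of standard error; only the standard error, and therefore the t-score, differs. The robust value can be larger or smaller than the conventional one depending on the data.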

Though the standard errors are larger than before, in light of this second regression I still have the same statistically significant variables and coefficients for my model.

Assessment of Model
Ultimately I do not believe the results of my model are very valid in describing how the different variables play into determining an individual’s life expectancy. This is in part due to my low R-squared, which only confirms my theory-based suspicion that there are many other variables relevant in determining the life expectancy of a person. Furthermore, this suspicion of extreme omitted variable bias is supplemented by the presence of descriptive variables in my sample having unexpected signs and being largely significant with the opposite sign than theory would dictate. This includes the variables “Male”, “Educ”, and “Married”. It is possible that this is due to my sample being only from one state, and possibly being outdated as it is 15 years old, but as mentioned throughout this project, a slew of conditions suggest to me that omitted variables are greatly affecting my model.

I believe the greatest limitation in my process was lack of access to data about a number of health, genetic, and environmental characteristics of the observations in my sample. Working with only mortality statistics from the year 1998, I did not have access to data about a large number of variables that theory would suggest contribute to an individual’s life expectancy. There are a great number of factors under all these categories that lead to human mortality, which causes me to believe that information must be known about them in order to make a valid estimate of an individual’s life expectancy.

Works Cited
Aizer, A., Chen, M., McCarthy, E., Mendu, M., Hu, J., Nguyen, P., et al. (2013). Marital Status and Survival in Patients With Cancer. Journal of clinical oncology, 31. Retrieved December 3, 2013.

Berger, et al. (n.d.). Blue Collar Workers Can Look Forward to Working Longer and in Worse Health than their White Collar Bosses. Columbia University. Retrieved December 2013.
City vs. Country: Who Is Healthier?. (n.d.). Wall Street Journal. Retrieved December 2013.
Marriage Linked to Better Survival in Middle Age; Study Highlights Importance of Social Ties During Midlife. (2013, January 10). ScienceDaily. Retrieved December 2013.
Married Cancer Patients Live Longer. (2013, September 24). New York Times. Retrieved December 2013.
Olshansky, S. J., Antonucci, T., Jackson, J., Kohli, M., Rother, J., Zheng, Y., et al. (2012). Differences In Life Expectancy Due To Race And Educational Differences Are Widening, And Many May Not Catch Up. Health Affairs, 31(8). Retrieved December 2013.
Studenmund, A. H. (2011). Using Econometrics: A Practical Guide. Boston: Addison Wesley.