Cancer Mortality Prediction
INTRODUCTION
In the United States, the age-standardized cancer death rate began declining in the early
1990s, largely because of declines in deaths from lung and prostate cancer in men, breast
cancer in women, and colorectal cancer in both sexes. The age-standardized death rate
approximates the population’s risk of dying from cancer and is used to compare risk of death
between populations or over time within a population. A decline in the death rate means that
the overall risk of dying from cancer in the population has decreased. However, age-
standardized rates do not convey the full extent of the cancer burden, as they effectively
remove the influence of demographic changes in the population. During this time, the
observed number of cancer deaths has continued to increase.
The number of cancer deaths is a function of the population’s risk of dying from cancer and
the population’s age structure and size. The observed increase in the number of cancer deaths
reflects the increased risk of dying from cancer with age, and during the past several decades,
the US population has grown, particularly in the older age groups. These demographic trends
and increasing cancer burden are forecast to continue as the cohort born following World
War II enters the age groups most at risk of dying from cancer.
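The distinction between the age-standardized rate and the observed number of deaths can be illustrated with a small calculation. The following is a minimal sketch, assuming three hypothetical age groups, hypothetical age-specific rates, and a hypothetical standard population; none of these numbers come from the data used in this report.

```python
def crude_deaths(rates_per_100k, population):
    """Expected number of deaths: age-specific rate times population, summed over age groups."""
    return sum(r * p / 100_000 for r, p in zip(rates_per_100k, population))

def age_standardized_rate(rates_per_100k, standard_weights):
    """Direct standardization: weight each age-specific rate by a fixed standard population share."""
    return sum(r * w for r, w in zip(rates_per_100k, standard_weights))

# Hypothetical values for three age groups (young, middle, old).
standard_weights = [0.5, 0.375, 0.125]          # fixed standard population shares, sum to 1
rates_then = [20, 200, 1200]                    # deaths per 100,000, earlier year
rates_now = [18, 180, 1080]                     # 10% lower risk in every age group
pop_then = [2_000_000, 1_500_000, 500_000]
pop_now = [2_100_000, 1_700_000, 900_000]       # larger and older population

asr_then = age_standardized_rate(rates_then, standard_weights)
asr_now = age_standardized_rate(rates_now, standard_weights)
deaths_then = crude_deaths(rates_then, pop_then)
deaths_now = crude_deaths(rates_now, pop_now)

print("age-standardized rate:", asr_then, "->", asr_now)
print("number of deaths:", deaths_then, "->", deaths_now)
```

In this sketch the risk falls in every age group, so the age-standardized rate declines, yet the number of deaths rises because the population is larger and older.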
DATA VISUALIZATION
• The average number of cancer cases diagnosed annually is highly right-skewed, with a median count of 171 (avgAnnCount).
• The average number of reported deaths due to cancer is also right-skewed, with a median of 61 deaths per year (avgDeathsPerYear).
• The mean cancer death rate per 100,000 people per year, by county, is 178.1 (TARGET_deathRate).
• The cancer incidence rate per 100,000 people, by county, has a median value of 453.5 (incidenceRate).
• The median household income by county is $45,207 (medIncome).
• The median estimated county population is 26,643, and this variable is highly right-skewed (popEst2015).
• The percent of residents in poverty, by county, has an estimated mean of 15.90 percent (povertyPercent).
• The mean number of cancer-related clinical trials per capita (per county) is 155.4, and this variable is highly right-skewed (studyPerCap).
• The mean of the county median ages is 45.27 years (within a county, half of the residents are younger than the median age and half are older), and this variable is normally distributed (MedianAge).
• The mean median age of male county residents is 39.57 years, and this variable is normally distributed (MedianAgeMale).
• The mean median age of female county residents is 42.15 years, and this variable is normally distributed (MedianAgeFemale).
• The average number of persons per household has a mean of 2.5 (AvgHouseholdSize).
• 51.77 percent of county residents are married, and this variable is roughly normally distributed (PercentMarried).
• 51.24 percent of households are married households (PctMarriedHouseholds).
• The mean number of live births relative to the number of women in the county is 5.64 (BirthRate).
• 17.10 percent of county residents ages 18-24 have less than a high school education as their highest attainment (PctNoHS18_24).
• 34.7 percent of county residents ages 18-24 have a high school diploma as their highest attainment (PctHS18_24).
• 5.4 percent of county residents ages 18-24 have a bachelor's degree as their highest attainment (PctBachDeg18_24).
• 35.30 percent of county residents ages 25 and above have a high school diploma as their highest attainment, and this variable is roughly normally distributed (PctHS25_Over).
• 12.3 percent of county residents ages 25 and above have a bachelor's degree as their highest attainment (PctBachDeg25_Over).
• 54.15 percent of county residents ages 16 and above are employed, and this variable is normally distributed (PctEmployed16_Over).
• 7.6 percent of county residents ages 16 and over are unemployed, and this variable is normally distributed (PctUnemployed16_Over).
• 65.1 percent of county residents have private health coverage (PctPrivateCoverage).
• 48.45 percent of county residents have private health coverage alone, with no public assistance (PctPrivateCoverageAlone).
• 41.1 percent of county residents have employee-provided private health coverage (PctEmpPrivCoverage).
• 36.3 percent of county residents have government-provided health coverage (PctPublicCoverage).
• 19.24 percent of county residents have government-provided health coverage alone (PctPublicCoverageAlone).
• The median percent of county residents who identify as white is 90 (PctWhite).
• The median percent of county residents who identify as black is 2.24 (PctBlack).
• 0.54 percent of county residents identify as Asian (PctAsian).
• 0.82 percent of county residents identify in a category that is not white, black, or Asian (PctOtherRace).
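The summaries above rest on three quantities per variable: the mean, the median, and the direction of skew. A minimal sketch of how they could be computed is shown here; the values in `avg_ann_count` are hypothetical stand-ins, not rows from the data set, and the helper names are our own.

```python
import statistics

def skewness(values):
    """Skewness as the mean of cubed z-scores (population form); > 0 means right-skewed."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

def describe(values):
    """Mean, median and skewness of one numeric column."""
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "skewness": skewness(values),
    }

# Hypothetical county-level counts, for illustration only.
avg_ann_count = [15, 40, 95, 150, 180, 260, 520, 1900, 8300]
summary = describe(avg_ann_count)
print(summary)
```

For a right-skewed variable such as this one, the mean sits well above the median, which is why the median is the more representative summary for avgAnnCount, avgDeathsPerYear and popEst2015.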
MULTIPLE LINEAR REGRESSION
Multiple linear regression is a statistical technique that predicts the outcome of one variable
from the values of two or more other variables. It is sometimes known simply as multiple
regression, and it is an extension of simple linear regression. The variable that we want to
predict is known as the dependent variable (in our case, Y is TARGET_deathRate), while the
variables we use to predict its value are known as independent or explanatory variables.
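In standard notation, which is an assumption here since the report does not write the equation out, the model with p explanatory variables can be stated as:

```latex
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon
```

Here Y is TARGET_deathRate, each X_i is one explanatory variable, \beta_0 is the intercept, each \beta_i is the coefficient estimated for X_i, and \varepsilon is the error term.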
For this model, we started with all the variables except the categorical ones, because the
model that included categorical variables showed a very high VIF (variance inflation factor)
for all of the categorical columns.
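The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below is a minimal pure-Python illustration of that definition, not the routine used to produce our results; the helper names and the small matrix `X` are hypothetical, and the third column is deliberately built as roughly the sum of the first two.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def r_squared(X, y):
    """R-squared of a least-squares fit of y on the columns of X plus an intercept."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    tss = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - rss / tss

def vif(X):
    """VIF_j = 1 / (1 - R_j^2); fails with a division by zero under exact collinearity."""
    result = []
    for j in range(len(X[0])):
        target = [row[j] for row in X]
        others = [row[:j] + row[j + 1:] for row in X]
        result.append(1.0 / (1.0 - r_squared(others, target)))
    return result

# Hypothetical predictors: column 3 is approximately column 1 + column 2.
X = [
    [1.0, 2.0, 3.1], [2.0, 1.0, 2.9], [3.0, 4.0, 7.2], [4.0, 3.0, 6.8],
    [5.0, 6.0, 11.1], [6.0, 5.0, 10.9], [7.0, 8.0, 15.0], [8.0, 7.0, 15.2],
]
vifs = vif(X)
print([round(v, 1) for v in vifs])
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic collinearity, which is the pattern the categorical columns showed.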
• The p-values for all variables except medIncome, studyPerCap, binnedInc, MedianAge,
MedianAgeFemale, AvgHouseholdSize, PctBachDeg18_24, PctUnemployed16_Over,
PctPrivateCoverageAlone, PctPublicCoverage, PctPublicCoverageAlone, PctBlack, and PctAsian
are greater than 0.05, which means that at the 5% significance level (95% confidence level)
those variables are not statistically significant.
• The summary also provides the R-squared and adjusted R-squared values. R-squared indicates
the proportion of the variation in the dependent variable (Y) that is explained by the
independent variables (X) in a linear regression model; here it is 0.5185, or 51.85%.
R-squared never decreases when additional predictors are added to the model, so adjusted
R-squared is used instead of R-squared when comparing models with different numbers of
predictor variables.
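The two checks above, the 0.05 cut-off on p-values and the adjustment of R-squared, can be sketched as follows. The formula is the standard one, adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1); the sample size, the predictor counts, and the p-values in this example are hypothetical, since only the R-squared of 0.5185 is reported in the text.

```python
def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_obs - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

r2 = 0.5185      # R-squared reported in the text
n_obs = 2000     # hypothetical number of training rows
for p in (5, 15, 30):  # hypothetical numbers of predictors
    print(p, round(adjusted_r_squared(r2, n_obs, p), 4))

# Hypothetical p-values, for illustration only (not our fitted values).
p_values = {"x1": 0.0004, "x2": 0.031, "x3": 0.47}
alpha = 0.05
significant = [name for name, p in p_values.items() if p < alpha]
print(significant)
```

For a fixed R-squared, the adjusted value drops as the number of predictors grows, which is the penalty that makes it suitable for comparing models of different sizes.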
REGRESSION TREE
First, we split our data into training and test sets (80% for training and 20% for testing).
After splitting the data, we build our model on the training data to predict
TARGET_deathRate based on all Xi.
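A minimal sketch of an 80/20 split is given below. The function name, the fixed seed, and the stand-in list of rows are assumptions made for illustration; they are not taken from the analysis itself.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and cut off the first test_fraction as the test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)   # seeded so the split is reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))  # stand-in for the county records
train, test = train_test_split(rows)
print(len(train), len(test))
```

Shuffling before cutting avoids a split that depends on the order of the rows in the file, and the fixed seed keeps the training and test sets the same between runs.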
This figure shows a regression tree fit to the training data. It consists of a series of
splitting rules, starting at the top of the tree. For example, the top split assigns
observations having PctBachDeg25_Over >= 10.35 to the left branch; that group is then further
subdivided by incidenceRate, and the resulting groups are in turn subdivided by
PctBachDeg25_Over and PctPrivateCoverage.
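How a single splitting rule is chosen can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the tree-fitting routine behind the figure: it searches one predictor only, tries observed values as thresholds (tree libraries typically use midpoints), and uses hypothetical data. As in the figure, observations with a value at or above the threshold go to the left branch.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, total_sse) for the rule x >= threshold that minimises the SSE."""
    best_threshold, best_total = None, float("inf")
    for t in sorted(set(x))[1:]:  # skip the smallest value, which would leave one side empty
        left = [yi for xi, yi in zip(x, y) if xi >= t]
        right = [yi for xi, yi in zip(x, y) if xi < t]
        total = sse(left) + sse(right)
        if total < best_total:
            best_threshold, best_total = t, total
    return best_threshold, best_total

def predict(x_value, threshold, left_mean, right_mean):
    """A one-split tree predicts the mean response of the branch the observation falls into."""
    return left_mean if x_value >= threshold else right_mean

# Hypothetical values of one predictor and the response.
pct_bach = [5.0, 7.5, 9.0, 10.0, 11.0, 14.0, 18.0, 25.0]
death_rate = [205, 198, 201, 195, 172, 168, 160, 155]

threshold, total = best_split(pct_bach, death_rate)
left = [d for p, d in zip(pct_bach, death_rate) if p >= threshold]
right = [d for p, d in zip(pct_bach, death_rate) if p < threshold]
left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
print(threshold, left_mean, right_mean)
```

A full regression tree repeats this search over every predictor at every node and then applies it again within each branch, which is how the successive subdivisions described above arise.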
The RSE (residual standard error) estimate gives a measure of prediction error. The lower the
RSE, the more accurate the model (on the data at hand).
R-squared: for the training data, the model's R-squared can be interpreted as 41.23%, and
for the test data as 36.41%.
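Both measures can be computed from the observed and predicted values. The sketch below uses the usual definitions, R² = 1 - RSS/TSS and RSE = sqrt(RSS / (n - k)), where k is the number of fitted parameters; the choice of k for a tree and all the numbers in the example are assumptions made for illustration.

```python
import math

def r_squared(y_true, y_pred):
    """1 - RSS/TSS: share of the variation in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    rss = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    tss = sum((a - mean_y) ** 2 for a in y_true)
    return 1.0 - rss / tss

def rse(y_true, y_pred, n_params):
    """Residual standard error: sqrt(RSS / (n - n_params))."""
    rss = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    return math.sqrt(rss / (len(y_true) - n_params))

# Hypothetical observed and predicted death rates.
y_train, pred_train = [180, 170, 190, 160, 200], [178, 172, 186, 165, 195]
y_test, pred_test = [175, 185, 165], [180, 178, 172]

r2_train = r_squared(y_train, pred_train)
r2_test = r_squared(y_test, pred_test)
rse_train = rse(y_train, pred_train, n_params=2)  # assumed: two fitted parameters
print(round(r2_train, 4), round(r2_test, 4), round(rse_train, 2))
```

A lower R-squared on the test data than on the training data, as in our results, is expected, because the model was fit to the training data only.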