
Cancer Mortality Prediction

INTRODUCTION
In the United States, the age-standardized cancer death rate began declining in the early
1990s, largely because of declines in deaths from lung and prostate cancer in men, breast
cancer in women, and colorectal cancer in both sexes. The age-standardized death rate
approximates the population’s risk of dying from cancer and is used to compare risk of death
between populations or over time within a population. A decline in the death rate means that
the overall risk of dying from cancer in the population has decreased. However, age-
standardized rates do not convey the full extent of the cancer burden, as they effectively
remove the influence of demographic changes in the population. During this time, the
observed number of cancer deaths has continued to increase.

The number of cancer deaths is a function of the population’s risk of dying from cancer and
the population’s age structure and size. The observed increase in the number of cancer deaths
reflects the increased risk of dying from cancer with age, and during the past several decades,
the US population has grown, particularly in the older age groups. These demographic trends
and increasing cancer burden are forecast to continue as the cohort born following World
War II enters the age groups most at risk of dying from cancer.
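The distinction between a falling age-standardized rate and a rising death count can be illustrated with a small sketch. All numbers below are hypothetical and chosen only for illustration; the three age groups and the standard-population weights are assumptions, not values from this report.

```python
def crude_deaths(rates_per_100k, population):
    """Expected deaths: age-specific rate times the people in each age group."""
    return sum(r * n / 100_000 for r, n in zip(rates_per_100k, population))


def age_standardized_rate(rates_per_100k, standard_weights):
    """Weighted mean of age-specific rates using a fixed standard population."""
    total = sum(standard_weights)
    return sum(r * w for r, w in zip(rates_per_100k, standard_weights)) / total


# Hypothetical values for three age groups (young, middle, old).
rates_then = [20.0, 200.0, 1000.0]            # deaths per 100,000, earlier period
rates_now = [18.0, 180.0, 900.0]              # every age-specific rate is 10% lower
pop_then = [5_000_000, 3_000_000, 1_000_000]  # earlier population by age group
pop_now = [5_500_000, 3_500_000, 1_600_000]   # larger and older population
standard = [0.55, 0.33, 0.12]                 # assumed standard-population weights

asr_then = age_standardized_rate(rates_then, standard)
asr_now = age_standardized_rate(rates_now, standard)
deaths_then = crude_deaths(rates_then, pop_then)
deaths_now = crude_deaths(rates_now, pop_now)

print("age-standardized rate:", asr_then, "->", asr_now)
print("number of deaths:", deaths_then, "->", deaths_now)
```

In this toy setup the risk in every age group falls, so the standardized rate falls, yet the death count rises because the population is larger and older.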

DATA VISUALIZATION

• The average number of cancer cases diagnosed annually is highly right-skewed, with a median count of 171 (avgAnnCount).
• The average number of reported deaths due to cancer is also right-skewed, with a median of 61 deaths per year (avgDeathsPerYear).
• The mean cancer death rate per 100,000 people per year by county is 178.1 (TARGET_deathRate).
• The cancer incidence rate per 100,000 people by county has a median of 453.5 (incidenceRate).
• The median household income by county is $45,207 (medIncome).
• The median estimated county population is 26,643, and this variable is highly right-skewed (popEst2015).
• The mean poverty rate by county is 15.90 percent (povertyPercent).
• The mean number of cancer-related clinical trials per capita by county is 155.4, and this variable is highly right-skewed (studyPerCap).
• The mean of the county median ages is 45.27 years, and this variable is normally distributed (MedianAge). A county's median age is the age that half of its residents are younger than and half are older than.
• The mean of the median ages of male county residents is 39.57 years, and this variable is normally distributed (MedianAgeMale).
• The mean of the median ages of female county residents is 42.15 years, and this variable is normally distributed (MedianAgeFemale).
• The mean household size is 2.5 people (AvgHouseholdSize).
• 51.77 percent of county residents are married, and this variable is approximately normally distributed (PercentMarried).
• 51.24 percent of households are married households (PctMarriedHouseholds).
• The mean number of live births relative to the number of women in the county is 5.64 (BirthRate).
• 17.10 percent of county residents ages 18-24 have less than a high school education as their highest attainment (PctNoHS18_24).
• 34.7 percent of county residents ages 18-24 have a high school diploma as their highest attainment (PctHS18_24).
• 5.4 percent of county residents ages 18-24 have a bachelor's degree as their highest attainment (PctBachDeg18_24).
• 35.30 percent of county residents ages 25 and above have a high school diploma as their highest attainment, and this variable is approximately normally distributed (PctHS25_Over).
• 12.3 percent of county residents ages 25 and above have a bachelor's degree as their highest attainment (PctBachDeg25_Over).
• 54.15 percent of county residents ages 16 and above are employed, and this variable is normally distributed (PctEmployed16_Over).
• 7.6 percent of county residents ages 16 and over are unemployed, and this variable is normally distributed (PctUnemployed16_Over).
• 65.1 percent of county residents have private health coverage (PctPrivateCoverage).
• 48.45 percent of county residents have private health coverage alone, with no public assistance (PctPrivateCoverageAlone).
• 41.1 percent of county residents have employee-provided private health coverage (PctEmpPrivCoverage).
• 36.3 percent of county residents have government-provided health coverage (PctPublicCoverage).
• 19.24 percent of county residents have government-provided health coverage alone (PctPublicCoverageAlone).
• The median percent of county residents who identify as white is 90 (PctWhite).
• The median percent of county residents who identify as black is 2.24 (PctBlack).
• 0.54 percent of county residents identify as Asian (PctAsian).
• 0.82 percent of county residents identify in a category that is not white, black, or Asian (PctOtherRace).
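Several of the variables above are described as right-skewed. One way to check this numerically, sketched here in plain Python, is to compare the mean with the median and to compute the moment coefficient of skewness. The function name `sample_skewness` and the sample values are hypothetical and not taken from the dataset.

```python
import statistics


def sample_skewness(values):
    """Moment coefficient of skewness: third central moment / variance^1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical annual case counts for five counties, one of them very large.
counts = [40, 90, 171, 300, 5200]

mean_count = statistics.mean(counts)
median_count = statistics.median(counts)
skew = sample_skewness(counts)

# A mean well above the median and a positive skewness both point to a long right tail.
print(mean_count, median_count, skew)
```

For a right-skewed variable the median is usually the more representative summary, which is consistent with the list reporting medians for the count and population variables.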

BIVARIATE AND MULTIVARIATE ANALYSIS

TARGET_deathRate shows moderate to strong correlations with incidenceRate, povertyPercent, and PctPublicCoverageAlone. There are also strong correlations between PctPublicCoverage and both PctPublicCoverageAlone and PctEmpPrivCoverage, between PercentMarried and PctMarriedHouseholds, and between avgDeathsPerYear and both avgAnnCount and popEst2015.
Let's look at some of the multivariate graphs. The relationship between state, median income per state, and cancer death rate suggests that states with a higher median income have lower cancer death rates, while states with a low median income have high cancer death rates.
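The pairwise correlations discussed in this section are typically Pearson correlation coefficients. A minimal sketch of that computation follows; the two short lists are hypothetical stand-ins for povertyPercent and TARGET_deathRate, not rows from the actual data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical county values, for illustration only.
poverty_percent = [10.0, 12.5, 15.9, 18.0, 22.0]
death_rate = [160.0, 170.0, 178.1, 185.0, 200.0]

r = pearson(poverty_percent, death_rate)
print(r)
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate little linear relationship.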

BUILDING OUR MODELS


Model 1: Multiple Linear Regression

Multiple linear regression is a statistical technique that predicts the outcome of one variable from the values of two or more other variables. It is sometimes known simply as multiple regression, and it is an extension of linear regression. The variable we want to predict is the dependent variable (in our case, Y is TARGET_deathRate), while the variables used to predict it are known as independent or explanatory variables.
For this model, we started with all variables except the categorical ones, because the model that included the categorical variables showed very high VIF for all the categorical columns.
• The p-values for medIncome, studyPerCap, binnedInc, MedianAge, MedianAgeFemale, AvgHouseholdSize, PctBachDeg18_24, PctUnemployed16_Over, PctPrivateCoverageAlone, PctPublicCoverage, PctPublicCoverageAlone, PctBlack, and PctAsian are greater than 0.05, which means that these variables are not significant at the 5% significance level (95% confidence level).
• The summary also provides the R-squared and adjusted R-squared values. R-squared is the proportion of the variation in the dependent variable (Y) explained by the independent variables (X) in a linear regression model; here it is 0.5185, or 51.85%. R-squared never decreases when additional predictors are added to the model, so adjusted R-squared is used instead when comparing models with different numbers of predictors.
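The adjustment penalizes R-squared for the number of predictors: adjusted R-squared = 1 - (1 - R-squared) * (n - 1) / (n - p - 1), where n is the number of observations and p the number of predictors. The sketch below applies this formula; the R-squared of 0.5185 comes from the text, while the values of n and p are hypothetical placeholders because the report does not state them.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared for a linear model with an intercept."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# R-squared from the model summary; n_obs and n_predictors are assumed values.
r2 = 0.5185
adj = adjusted_r_squared(r2, n_obs=3000, n_predictors=30)
print(adj)
```

With many observations and few predictors the adjustment is small, which is why the two values tend to be close in a dataset of this kind.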

Now let’s build our model using just significant variables.


Our adjusted R2 value has increased slightly to 0.5189, which suggests that removing all non-significant variables (at alpha = 0.05) improved the model. Because R2 increases with the number of independent variables, the adjusted R2 is the better measure when comparing models. The adjusted R2 indicates that 51.89% of the variation in TARGET_deathRate can be explained by the model containing avgAnnCount, avgDeathsPerYear, incidenceRate, popEst2015…, which is quite high, so predictions from the regression equation are fairly reliable.

Residual Standard Error (RSE), or sigma


• Linearity (top left plot) is good for our model.
• Homogeneity of variance (top right plot) is respected.
• Multicollinearity (middle left plot) is an issue for only two variables: using the usual VIF threshold of 10, only avgDeathsPerYear and popEst2015 exceed it, so we should remove these variables from our model.
• There are no influential points (middle right plot).
• Normality of the residuals (two bottom plots) is also acceptable. In any case, the number of observations is large relative to the number of parameters and the deviation from normality is small, so tests on the coefficients are (approximately) valid whether or not the errors follow a normal distribution.
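The VIF of a predictor equals 1 / (1 - R2), where R2 comes from regressing that predictor on all the others; equivalently, it is the corresponding diagonal entry of the inverse of the predictors' correlation matrix. The sketch below uses the second form in plain Python. The column values are hypothetical, and the helper names (`pearson`, `invert`, `vif`) are illustrative, not functions from the original analysis.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF of each predictor: diagonal of the inverse correlation matrix."""
    k = len(columns)
    corr = [[pearson(columns[i], columns[j]) for j in range(k)] for i in range(k)]
    inverse = invert(corr)
    return [inverse[j][j] for j in range(k)]


# Hypothetical values: the first two columns are nearly proportional to each other.
avg_deaths_per_year = [10, 20, 30, 40, 50, 60]
pop_est_2015 = [101, 198, 305, 397, 502, 601]
incidence_rate = [5, 3, 2, 6, 7, 4]

vifs = vif([avg_deaths_per_year, pop_est_2015, incidence_rate])
print(vifs)
```

In this toy data the two nearly proportional columns get VIF values far above 10, while the third stays low, mirroring the pattern described for avgDeathsPerYear and popEst2015.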

Model 2: Decision Trees

First, we split our data into training and test sets (80% for training and 20% for testing).
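A minimal sketch of such an 80/20 split in plain Python, assuming the rows are held in a list. The function name `split_train_test`, the seed, and the placeholder rows are hypothetical and not taken from the original analysis.

```python
import random


def split_train_test(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows and cut it into training and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder rows: integers standing in for county records.
rows = list(range(100))
train_rows, test_rows = split_train_test(rows)
print(len(train_rows), len(test_rows))
```

Shuffling before cutting avoids any ordering in the file (for example by state) leaking into the split.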

After splitting our data, we build our model on the training data to predict TARGET_deathRate based on all predictors Xi.
This figure shows a regression tree fit to the training data. It consists of a series of splitting rules, starting at the top of the tree. For example, the top split assigns observations having PctBachDeg25_Over >= 10.35 to the left branch; that group is then further subdivided by incidenceRate, which in turn is subdivided by PctBachDeg25_Over and PctPrivateCoverage.
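Each splitting rule in a regression tree is usually chosen to minimize the total squared error of the two resulting groups around their own means. The sketch below searches for such a threshold on a single predictor; the data values are hypothetical, and this is a simplified illustration rather than the procedure of any particular tree library.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(feature, target):
    """Return (threshold, cost) minimizing SSE(left) + SSE(right)."""
    pairs = sorted(zip(feature, target))
    best_threshold, best_cost = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold separates equal feature values
        threshold = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [t for f, t in pairs if f < threshold]
        right = [t for f, t in pairs if f >= threshold]
        cost = sse(left) + sse(right)
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost


# Hypothetical values standing in for PctBachDeg25_Over and TARGET_deathRate.
pct_bach = [6, 8, 9, 12, 14, 18]
death_rate = [210, 205, 200, 170, 165, 160]

threshold, cost = best_split(pct_bach, death_rate)
print(threshold, cost)
```

A full tree applies the same search recursively to each branch and across all predictors, stopping according to a complexity or minimum-size rule.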
R-squared

The RSE estimate gives a measure of prediction error. The lower the RSE, the more accurate the model (on the data at hand).
The value reported for the model is 41.23% on the training data and 36.41% on the test data.
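Fit measures of this kind can be computed separately on the training and test sets from the observed and predicted values. The sketch below shows R-squared and root mean squared error in plain Python; the observed and predicted values are hypothetical, and the function names are illustrative.

```python
import math


def r_squared(observed, predicted):
    """1 - SS_res / SS_tot, computed from observed and predicted values."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root mean squared prediction error."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


# Hypothetical death rates and tree predictions for a held-out set.
observed = [150.0, 170.0, 180.0, 200.0, 210.0]
predicted = [160.0, 165.0, 185.0, 190.0, 205.0]

print(r_squared(observed, predicted), rmse(observed, predicted))
```

A noticeably lower R-squared on the test set than on the training set is a common sign that the tree fits the training data more closely than it generalizes.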
