Heart Disease Risk Factor Data Analysis Midterm Data 2 - Jupyter Notebook

Heart Disease Risk Factor Data Analysis
Heart disease is the leading cause of death for men, women, and people of most racial and
ethnic groups in the United States. One person dies every 33 seconds in the United States
from cardiovascular disease. Several health conditions can increase the risk of heart disease.
These are called risk factors.
The Behavioral Risk Factor Surveillance System (BRFSS) is a health-related telephone
survey that is conducted annually by the CDC. Each year, the survey collects responses from
over 400,000 Americans on health-related risk behaviors, chronic health conditions, and the
use of preventative services. This dataset contains 253,680 survey responses from the
BRFSS in 2015. It contains 22 features relating to heart disease and its risk factors. I chose
this dataset because I love medicine!
Explanation of Important Fields

HeartDiseaseorAttack : Indicates if the person has heart disease (0 = No; 1 = Yes).
HighBP : Indicates if the person has been told by a health professional that they have High
Blood Pressure (0 = No; 1 = Yes).
HighChol : Indicates if the person has been told by a health professional that they have High
Blood Cholesterol (0 = No; 1 = Yes).
CholCheck : Indicates if the person has had their cholesterol levels checked within the
last 5 years (0 = No; 1 = Yes).
BMI: Body Mass Index, calculated by dividing the person's weight (kilograms) by the square
of their height (meters).
Smoker: Indicates if the person has smoked at least 100 cigarettes (0 = No; 1 = Yes).
Stroke : Indicates if the person has a history of stroke (0 = No; 1 = Yes).
Diabetes : Indicates if the person has no diabetes (0), has pre-diabetes (1),
or has diabetes (2).
PhysActivity : Indicates if the person has some form of physical activity in their day-to-day
routine (0 = No; 1 = Yes).
Fruits : Indicates if the person consumes 1 or more fruit(s) daily (0 = No; 1 = Yes).
Veggies : Indicates if the person consumes 1 or more vegetable(s) daily (0 = No; 1 = Yes).
HvyAlcoholConsump: Indicates if the person has more than 14 drinks per week (0 = No; 1 =
Yes).
AnyHealthcare : Indicates if the person has any form of health insurance (0 = No; 1 = Yes).
NoDocbcCost : Indicates if the person wanted to visit a doctor within the past 1 year but
couldn’t, due to cost (0 = No; 1 = Yes).
GenHlth : Indicates the person's rating of their general health, ranging from 1
(excellent) to 5 (poor).
MentHlth : Indicates the number of days within the past 30 days that the person had bad
mental health.
PhysHlth : Indicates the number of days within the past 30 days that the person had bad
physical health.
DiffWalk : Indicates if the person has difficulty while walking or climbing stairs (0 = No; 1 =
Yes).
Sex : Indicates the gender of the person, where 0 is female and 1 is male.
Age : Indicates the age class of the person, where 1 is 18 to 24 years and 13 is
80 years or older; each class in between spans 5 years.
Education : Indicates the highest year of school completed, with 0 being never attended or
kindergarten only and 6 being 4 years of college or more.
Income : Indicates the total household income, ranging from 1 (at least $10,000) to 6
($75,000+).
#Added Data: The following columns were added to replace the numerical categories given in
the original data with words or age ranges for graphing purposes: Age_range,
Blood_Pressure, Cholesterol, Smoke_Habits, Stroke_History, Fruit_Diet, Physical_Activity,
Alc_habits, Veggie_Diet
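The added label columns can also be produced with a lookup table instead of a chain of conditions. The following is a minimal sketch on a small made-up DataFrame, not the notebook's own code; the name AGE_LABELS and the '80+' label are assumptions chosen for illustration.

```python
import pandas as pd

# Toy stand-in for the real data; only the 'Age' column is modeled here.
toy = pd.DataFrame({"Age": [1.0, 9.0, 13.0, 10.0]})

# Hypothetical lookup table mirroring the age classes described above.
AGE_LABELS = {
    1: "18-24", 2: "25-29", 3: "30-34", 4: "35-39", 5: "40-44",
    6: "45-49", 7: "50-54", 8: "55-59", 9: "60-64", 10: "65-69",
    11: "70-74", 12: "75-79", 13: "80+",
}

# Cast the float codes to int so they match the dictionary keys, then map.
toy["Age_range"] = toy["Age"].astype(int).map(AGE_LABELS)
print(toy)
```

Codes that are missing from the dictionary come back as NaN, which makes unexpected values easy to spot.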
Import Packages
In [ ]: import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
import numpy as np
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv('/content/drive/MyDrive/Dataset/heart_disease_health_i
In [ ]: plt.style.use('ggplot')
Sample Data
This dataset has 22 columns and 253,680 rows.
In [ ]: df.head()
Out[195]: HeartDiseaseorAttack HighBP HighChol CholCheck BMI Smoker Stroke Diabetes PhysA
0 0.0 1.0 1.0 1.0 40.0 1.0 0.0 0.0
1 0.0 0.0 0.0 0.0 25.0 1.0 0.0 0.0
2 0.0 1.0 1.0 1.0 28.0 0.0 0.0 0.0
3 0.0 1.0 0.0 1.0 27.0 0.0 0.0 0.0
4 0.0 1.0 1.0 1.0 24.0 0.0 0.0 0.0
5 rows × 23 columns
Data Analysis & Outliers

In [ ]: df.describe()
Out[196]: HeartDiseaseorAttack HighBP HighChol CholCheck BMI
count 253680.000000 253680.000000 253680.000000 253680.000000 253680.000000 2536
mean 0.094186 0.429001 0.424121 0.962670 28.382364
std 0.292087 0.494934 0.494210 0.189571 6.608694
min 0.000000 0.000000 0.000000 0.000000 12.000000
25% 0.000000 0.000000 0.000000 1.000000 24.000000
50% 0.000000 0.000000 0.000000 1.000000 27.000000
75% 0.000000 1.000000 1.000000 1.000000 31.000000
max 1.000000 1.000000 1.000000 1.000000 98.000000
8 rows × 22 columns
The BMI data has outliers. The maximum BMI is 98.0 kg/m^2, while the 50th percentile (median)
BMI is 27 kg/m^2. This outlier was not excluded because it is still relevant to the analysis.
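One common way to make the word "outlier" concrete is the 1.5 x IQR rule. The sketch below applies it to the BMI quartiles reported by df.describe() above; the rule itself is an assumption added here for illustration and is not a test the notebook ran.

```python
# Quartiles and maximum copied from the df.describe() output for BMI.
q1, q3 = 24.0, 31.0
bmi_max = 98.0

# Interquartile range and the conventional 1.5 * IQR fences.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# A value beyond either fence is flagged as a potential outlier.
is_outlier = bmi_max > upper_fence
print(lower_fence, upper_fence, is_outlier)
```

Under this rule the maximum of 98.0 lies well beyond the upper fence, and the minimum of 12.0 also falls just below the lower fence.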
Analysis of Age, BMI, and Income of Individuals with Heart Disease
In [ ]: def age_to_range(Age):
    # Map the numeric age class to a readable age range.
    if Age == 1.0:
        return '18-24'
    elif Age == 2.0:
        return '25-29'
    elif Age == 3.0:
        return '30-34'
    elif Age == 4.0:
        return '35-39'
    elif Age == 5.0:
        return '40-44'
    elif Age == 6.0:
        return '45-49'
    elif Age == 7.0:
        return '50-54'
    elif Age == 8.0:
        return '55-59'
    elif Age == 9.0:
        return '60-64'
    elif Age == 10.0:
        return '65-69'
    elif Age == 11.0:
        return '70-74'
    elif Age == 12.0:
        return '75-79'
    elif Age == 13.0:
        return '>80'
    else:
        return 'test'

df["Age_range"] = df['Age'].apply(age_to_range)

# Keep only the rows for individuals with heart disease.
pos_heart_disease = df['HeartDiseaseorAttack'] == 1
df1 = df[pos_heart_disease]
In [ ]: df1["Age_range"].value_counts().plot(kind='bar', color = 'maroon', figs
plt.xlabel("Age Group")
plt.ylabel("Number of Individuals with Heart Disease")
plt.title("Number of Individuals with Heart Disease in Each Age Group")

Out[220]: Text(0.5, 1.0, 'Number of Individuals with Heart Disease in Each Age
Group')
The bar graph shows that the number of individuals with heart disease rises with age group,
which is consistent with the fact that the risk for heart disease increases after the age of 65.
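A count per age group depends on how many respondents fall in each group, so a rate per group is a closer measure of risk. The sketch below shows one way to compute it with a groupby mean on a small made-up DataFrame; the rows are invented for illustration and are not taken from the BRFSS data.

```python
import pandas as pd

# Made-up rows: age class codes and a 0/1 heart disease flag.
toy = pd.DataFrame({
    "Age": [1, 1, 9, 9, 9, 13, 13, 13, 13],
    "HeartDiseaseorAttack": [0, 0, 0, 0, 1, 1, 1, 0, 0],
})

# The mean of a 0/1 column is the share of rows equal to 1,
# so this gives the heart disease rate within each age class.
rate_by_age = toy.groupby("Age")["HeartDiseaseorAttack"].mean()
print(rate_by_age)
```

Because groupby sorts by the age code, the result is already in age order and can be plotted directly as a bar chart.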
In [ ]: df1["BMI"].plot(kind='hist', color = 'maroon',bins = 15, figsize=(15,10
plt.xlabel("BMI")
plt.ylabel("Number of Individuals with Heart Disease")
plt.title("BMI of Individuals with Heart Disease")
plt.xticks(np.arange(0, 100, 5))

Out[198]: [Figure: histogram "BMI of Individuals with Heart Disease"; x-axis: BMI, with ticks every 5 from 0 to 95; y-axis: number of individuals with heart disease.]
The histogram shows a peak between 25 and 30 kg/m^2, which falls within the overweight
range (a BMI of 25 up to 30).
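To label each BMI value with a weight category rather than reading it off the histogram, the values can be binned. This is a minimal sketch using pandas.cut with the standard adult BMI cut-offs of 18.5, 25, and 30; the sample values and label names are assumptions for illustration.

```python
import pandas as pd

# Sample BMI values (the last four match the quartiles and maximum above).
bmi = pd.Series([17.0, 24.0, 27.0, 31.0, 98.0])

# Left-closed bins: [0, 18.5), [18.5, 25), [25, 30), [30, inf).
bins = [0, 18.5, 25, 30, float("inf")]
labels = ["Underweight", "Healthy weight", "Overweight", "Obesity"]

bmi_class = pd.cut(bmi, bins=bins, labels=labels, right=False)
print(bmi_class.tolist())
```

Setting right=False puts a BMI of exactly 25 in the overweight bin and exactly 30 in the obesity bin.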
Analysis of Relationship Between Heart Disease and Blood Pressure, Cholesterol,
Stroke History, Cigarettes Smoked
In [ ]:

fig, axarr = plt.subplots(2, 2, figsize=(15, 10))

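# Label a 0/1 flag as 'Normal' or 'High' (used for blood pressure and cholesterol).
def high_normal(HighBP):
    if HighBP == 1.0:
        return 'High'
    elif HighBP == 0.0:
        return 'Normal'

# Label the lifetime cigarette count category.
def smoke_status(smoke):
    if smoke == 1.0:
        return '> 100'
    elif smoke == 0.0:
        return '< 100'

# Label the weekly alcoholic drink category.
def alc(drink):
    if drink == 1.0:
        return '> 14'
    elif drink == 0.0:
        return '< 14'

# Label whether the person has a history of stroke.
def stroke_hist(stroke):
    if stroke == 1.0:
        return 'History'
    elif stroke == 0.0:
        return 'No History'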

df["Blood_Pressure"] = df['HighBP'].apply(high_normal)
df["Cholesterol"] = df['HighChol'].apply(high_normal)
df["Smoke_Habits"] = df['Smoker'].apply(smoke_status)
df["Stroke_History"] = df['Stroke'].apply(stroke_hist)

# Re-select the heart disease rows so df1 includes the columns added above.
df1 = df[df['HeartDiseaseorAttack'] == 1]

df1['Blood_Pressure'].value_counts().plot.bar(
    ax=axarr[0][0], fontsize=12, color=['maroon', 'navy']
)
axarr[0][0].set_title("Blood Pressure of Individuals with Heart Disease
axarr[0][0].set_ylabel('Number of Individuals')
axarr[0][0].set_xlabel('Blood Pressure')

df1['Smoke_Habits'].value_counts().plot.bar(
    ax=axarr[1][1]
)
axarr[1][1].set_title("Smoking Habits of Individuals with Heart Disease
axarr[1][1].set_xlabel('Number of Cigarettes Smoked')

df1['Stroke_History'].value_counts().plot.bar(
    ax=axarr[1][0]
)
axarr[1][0].set_title("History of Stroke in Individuals with Heart Dise
axarr[1][0].set_xlabel('Existence of Stroke History')

df1['Cholesterol'].value_counts().plot.bar(
    ax=axarr[0][1]
)
axarr[0][1].set_title("Cholesterol of Individuals with Heart Disease",
axarr[0][1].set_xlabel('Cholesterol Level')

plt.subplots_adjust(hspace=.6, wspace=.8)
Blood Pressure of Individuals with Heart Disease

The bar graph shows that in this dataset, the majority of individuals with Heart Disease also
have High Blood Pressure.
In [ ]: Hdisease_BP=df[['HeartDiseaseorAttack', "Blood_Pressure"]].groupby('Hea
Hdisease_BP
Out[200]: HeartDiseaseorAttack Blood_Pressure

0.0 Normal 138886
High 90901
1.0 High 17928
Normal 5965
dtype: int64
In this dataset, 75.0% of individuals with heart disease also had high blood pressure
(17,928 of 23,893). By comparison, 39.6% of individuals without heart disease had high
blood pressure (90,901 of 229,787). This makes sense since high blood pressure can damage
arteries by making them less elastic. This decreases the flow of blood and oxygen, leading to
heart disease. This process is pictured below:
Cholesterol Levels of Individuals with Heart Disease

The bar graph shows that in this dataset, the majority of individuals with Heart Disease also
have High Cholesterol.
In [ ]: Hdisease_Chol=df[['HeartDiseaseorAttack', "Cholesterol"]].groupby('Hear
Hdisease_Chol
Out[201]: HeartDiseaseorAttack Cholesterol

0.0 Normal 138949
High 90838
1.0 High 16753
Normal 7140
dtype: int64
In this dataset, 70.1% of individuals with heart disease also had high cholesterol, compared
with 39.5% of individuals without heart disease. This makes sense since with high cholesterol,
you can develop fatty deposits in your blood vessels. Eventually, these deposits grow, making
it difficult for enough blood to flow through your arteries. Those deposits can break suddenly
and form a clot that causes a heart attack. This process is pictured below:
Stroke History of Individuals with Heart Disease

The bar graph shows that in this dataset, the majority of individuals with Heart Disease did
not have a history of stroke.
In [ ]: Hdisease_stroke=df[['HeartDiseaseorAttack', "Stroke_History"]].groupby(
Hdisease_stroke

Out[202]: HeartDiseaseorAttack Stroke_History

0.0 No History 223432
History 6355
1.0 No History 19956
History 3937
dtype: int64
In this dataset, 2.8% of individuals without heart disease had a history of stroke. Additionally,
16.5% of individuals with heart disease had a history of stroke. It makes sense that the
percentage of individuals with a history of stroke and Heart Disease is higher than the
percentage of individuals with a history of stroke but no Heart Disease because strokes and
heart disease are closely related. With heart disease, plaque build-up and blood clots in
arteries supplying blood to the brain can cause a stroke.
This process is pictured below:

Smoking Habits of Individuals with Heart Disease
The bar graph shows that in this dataset, the majority of individuals with Heart Disease
smoked more than 100 cigarettes within their lifetime.
In [ ]: Hdisease_Smoke=df[['HeartDiseaseorAttack', "Smoke_Habits"]].groupby('He
Hdisease_Smoke
Out[203]: HeartDiseaseorAttack Smoke_Habits

0.0 < 100 132165
> 100 97622
1.0 > 100 14801
< 100 9092
dtype: int64
In this dataset, 61.9% of individuals with heart disease had also smoked more than 100
cigarettes within their lifetime. By comparison, 42.5% of individuals without heart disease
had smoked more than 100 cigarettes within their lifetime. This makes sense because smoking
increases the formation of plaque in blood vessels, and chemicals in cigarette smoke cause the
blood to thicken and form clots inside veins and arteries.
In [ ]: #Percentage Calculations
# Counts are copied from the groupby outputs above.

#BP
not_bp_percent = 90901/(90901 + 138886)
bp_percent = 17928/(17928 + 5965)
print(f"{bp_percent:.1%}")
print(f"{not_bp_percent:.1%}")

#Chol
not_chol_percent = 90838/(90838 + 138949)
chol_percent = 16753/(16753 + 7140)
print(f"{chol_percent:.1%}")
print(f"{not_chol_percent:.1%}")

#Stroke
stroke_percent = 3937/(19956 + 3937)
print(f"{stroke_percent:.1%}")
notstroke_percent = 6355/(223432 + 6355)
print(f"{notstroke_percent:.1%}")

#smoke
smoke_percent = 14801/(14801 + 9092)
print(f"{smoke_percent:.1%}")
notsmoke_percent = 97622/(132165 + 97622)
print(f"{notsmoke_percent:.1%}")

75.0%
39.6%
70.1%
39.5%
16.5%
2.8%
61.9%
42.5%
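The same percentages can be derived in one step by placing the group counts in a table, which avoids retyping each number into a separate formula. The sketch below is one possible layout, using the counts from the four groupby outputs above; the column and index names are assumptions chosen for illustration.

```python
import pandas as pd

# Counts copied from the groupby outputs above. Rows alternate between
# people with and without heart disease for each risk factor.
counts = pd.DataFrame(
    {
        "factor_present": [17928, 90901, 16753, 90838, 3937, 6355, 14801, 97622],
        "factor_absent": [5965, 138886, 7140, 138949, 19956, 223432, 9092, 132165],
    },
    index=pd.MultiIndex.from_product(
        [["HighBP", "HighChol", "Stroke", "Smoker"],
         ["heart disease", "no heart disease"]],
        names=["risk_factor", "group"],
    ),
)

# Share of each group that has the risk factor, as a percentage.
total = counts["factor_present"] + counts["factor_absent"]
counts["percent_with_factor"] = (100 * counts["factor_present"] / total).round(1)
print(counts["percent_with_factor"])
```

Each pair of rows then gives a direct comparison between the heart disease group and the group without heart disease for the same risk factor.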
Analysis of the Relationship between Heart Disease and Recent Lifestyle Choices
In [ ]:
df["Smoke_Habits"] = df['Smoker'].apply(smoke_status)
fig, axarr = plt.subplots(2, 2, figsize=(15, 10))

# Label daily fruit or vegetable consumption.
def foodnum_to_words(food):
    if food == 1.0:
        return 'At Least 1'
    elif food == 0.0:
        return 'none'

# Label the weekly alcoholic drink category.
def alc(drink):
    if drink == 1.0:
        return '> 14'
    elif drink == 0.0:
        return '< 14'

# Label participation in physical activity.
def physical_activity_word(active):
    if active == 1.0:
        return 'Daily'
    elif active == 0.0:
        return 'Not Daily'

df["Fruit_Diet"] = df['Fruits'].apply(foodnum_to_words)
df["Veggie_Diet"] = df['Veggies'].apply(foodnum_to_words)
df["Physical_Activity"] = df['PhysActivity'].apply(physical_activity_wo
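df["Alc_habits"] = df['HvyAlcoholConsump'].apply(alc)

# Re-select the heart disease rows so df1 includes the columns added above.
df1 = df[df['HeartDiseaseorAttack'] == 1]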

df1['Fruit_Diet'].value_counts().plot.bar(
    ax=axarr[0][0], fontsize=12, color=['green', 'maroon']
)
axarr[0][0].set_title("Daily Consumption of Fruits of Individuals with
axarr[0][0].set_xlabel('Number of Fruits Eaten Daily')

df1['Physical_Activity'].value_counts().plot.bar(
    ax=axarr[1][1]
)
axarr[1][1].set_title("Physical Activity of Individuals with Heart Dise
axarr[1][1].set_xlabel('Participation in Daily Exercise')

df1['Alc_habits'].value_counts().plot.bar(
    ax=axarr[1][0]
)
axarr[1][0].set_title("Alcohol Habits of Individuals with Heart Disease
axarr[1][0].set_xlabel('Number of Drinks Per Week')

df1['Veggie_Diet'].value_counts().plot.bar(
    ax=axarr[0][1]
)
axarr[0][1].set_title("Daily Consumption of Vegetables of Individuals w
axarr[0][1].set_xlabel('Number of Vegetables Eaten Daily')

plt.subplots_adjust(hspace=.6, wspace=1.7)

According to the bar graphs, for individuals diagnosed with heart disease, the majority of
recent lifestyle habits reflect healthy choices. These recent lifestyle habits include the number
of fruits and vegetables eaten daily, the number of alcoholic drinks per week, and participation
in daily physical activity. This could indicate lifestyle changes being made as recommended by
a health professional.
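Whether a habit is more or less common among people with heart disease only becomes clear when both groups are compared. A minimal sketch of that comparison, using pandas.crosstab with row normalization on a small made-up DataFrame (the rows are invented for illustration and are not BRFSS data):

```python
import pandas as pd

# Made-up rows: a 0/1 heart disease flag and a 0/1 physical activity flag.
toy = pd.DataFrame({
    "HeartDiseaseorAttack": [0, 0, 0, 0, 1, 1, 1, 1],
    "PhysActivity":         [1, 1, 1, 0, 1, 1, 0, 0],
})

# normalize="index" turns each row of counts into shares that sum to 1,
# so the rows can be compared even when the groups differ in size.
shares = pd.crosstab(
    toy["HeartDiseaseorAttack"], toy["PhysActivity"], normalize="index"
)
print(shares)
```

Applied to the real columns, the same call would show the share of physically active respondents in each group side by side.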
Pearson Statistical Analysis

The Pearson correlation is often used to measure linear correlation, including between binary
variables. This dataset is primarily binary, so this analysis was used. It revealed that the two
variables with the highest correlation were General Health and Physical Health (r = 0.524364).
In [ ]: df.corr(method='pearson', min_periods=1, numeric_only=True)
Out[221]: HeartDiseaseorAttack HighBP HighChol CholCheck BMI Sm
HeartDiseaseorAttack 1.000000 0.209361 0.180765 0.044206 0.052904 0.1
HighBP 0.209361 1.000000 0.298199 0.098508 0.213748 0.0
HighChol 0.180765 0.298199 1.000000 0.085642 0.106722 0.0
CholCheck 0.044206 0.098508 0.085642 1.000000 0.034495 -0.0
BMI 0.052904 0.213748 0.106722 0.034495 1.000000 0.0
Smoker 0.114441 0.096991 0.091299 -0.009929 0.013804 1.0
Stroke 0.203002 0.129575 0.092620 0.024158 0.020153 0.0
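For two binary variables, the Pearson correlation equals the phi coefficient, which can be computed directly from a 2x2 table of counts. As a rough cross-check under that identity, the sketch below recomputes the correlation between heart disease and high blood pressure from the counts in the blood pressure output earlier in the notebook; the variable names a, b, c, and d are labels chosen for this illustration.

```python
import math

# 2x2 counts copied from the blood pressure groupby output.
a = 17928   # heart disease, high blood pressure
b = 5965    # heart disease, normal blood pressure
c = 90901   # no heart disease, high blood pressure
d = 138886  # no heart disease, normal blood pressure

# Phi coefficient: (ad - bc) divided by the square root of the
# product of the four marginal totals.
phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
print(round(phi, 6))
```

The result should agree closely with the HeartDiseaseorAttack and HighBP entry of 0.209361 in the correlation table.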
Conclusion
This data analysis reinforces that blood pressure, cholesterol levels, history of stroke, and
smoking habits are all related to heart disease. It shows that heart disease becomes more
common as age increases and that many people with heart disease are overweight. It also
shows that the majority of the individuals diagnosed with Heart Disease in this data were
making healthy choices.
Words From the Analyst

What gave me the most trouble was making all the functions to replace the numerical
categories with words or phrases. The dataset had a lot of data, but it was mostly binary.
Because of this, I couldn't really make any scatterplots. I wouldn't recommend using any
filters before analyzing the data. I enjoyed separating the parameters into subplots. I think that
