
Heart Disease Risk Factor Data Analysis

Heart disease is the leading cause of death for men, women, and people of most racial and
ethnic groups in the United States. One person dies every 33 seconds in the United States
from cardiovascular disease. Several health conditions can increase the risk of heart disease.
These are called risk factors.

The Behavioral Risk Factor Surveillance System (BRFSS) is a health-related telephone
survey that is collected annually by the CDC. Each year, the survey collects responses from
over 400,000 Americans on health-related risk behaviors, chronic health conditions, and the
use of preventative services. This dataset contains 253,680 survey responses from the
BRFSS in 2015. It contains 22 features relating to heart disease and its risk factors. I chose
this dataset because I love medicine!

Explanation of Important Fields


HeartDiseaseorAttack : Indicates if the person has heart disease (0 = No; 1 = Yes).

HighBP : Indicates if the person has been told by a health professional that they have High
Blood Pressure (0 = No; 1 = Yes).

HighChol : Indicates if the person has been told by a health professional that they have High
Blood Cholesterol (0 = No; 1 = Yes).

CholCheck : Cholesterol Check, indicates if the person has had their cholesterol levels checked
within the last 5 years (0 = No; 1 = Yes).

BMI : Body Mass Index, calculated by dividing the person's weight (kilograms) by the square
of their height (meters).

Smoker : Indicates if the person has smoked at least 100 cigarettes in their lifetime (0 = No;
1 = Yes).

Stroke : Indicates if the person has a history of stroke (0 = No; 1 = Yes).

Diabetes : Indicates if the person has no history of diabetes (0), is currently pre-diabetic (1),
or has either type of diabetes (2).

PhysActivity : Indicates if the person has some form of physical activity in their day-to-day
routine (0 = No; 1 = Yes).

Fruits : Indicates if the person consumes 1 or more fruit(s) daily (0 = No; 1 = Yes).

Veggies : Indicates if the person consumes 1 or more vegetable(s) daily (0 = No; 1 = Yes).

HvyAlcoholConsump: Indicates if the person has more than 14 drinks per week (0 = No; 1 =
Yes).

AnyHealthcare : Indicates if the person has any form of health insurance (0 = No; 1 = Yes).

NoDocbcCost : Indicates if the person wanted to visit a doctor within the past 1 year but
couldn’t, due to cost (0 = No; 1 = Yes).

GenHlth : Indicates the person's rating of their general health, ranging from 1
(excellent) to 5 (poor).

MentHlth : Indicates the number of days within the past 30 days that the person had bad
mental health.

PhysHlth : Indicates the number of days within the past 30 days that the person had bad
physical health.

DiffWalk : Indicates if the person has difficulty while walking or climbing stairs (0 = No; 1 =
Yes).

Sex : Indicates the gender of the person, where 0 is female and 1 is male.

Age : Indicates the age class of the person, from 1 (18 to 24 years) up to 13 (80 years
or older). Each class in between covers a 5-year interval.

Education : Indicates the highest year of school completed, with 0 being never attended or
kindergarten only and 6 being 4 years of college or more.

Income : Indicates the total household income, ranging from 1 (at least $10,000) to 6
($75,000+).

Added Data: The following columns were added to replace the numerical categories given in
the original data with words or age ranges for graphing purposes: Age_range,
Blood_Pressure, Cholesterol, Smoke_Habits, Stroke_History, Fruit_Diet, Physical_Activity,
Alc_habits, Veggie_Diet.

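As a rough illustration of how a coded column can be turned into a label column, here is a minimal sketch that uses a dictionary with `Series.map`. The sample values are hypothetical, and this is one possible approach rather than the method used in the cells of this notebook, which define small helper functions instead.

```python
import pandas as pd

# Hypothetical sample of the binary HighBP column (not the real survey data).
codes = pd.Series([1.0, 0.0, 1.0, 0.0], name="HighBP")

# Look up each numeric code in a dictionary of labels.
labels = codes.map({1.0: "High", 0.0: "Normal"})
print(labels.tolist())
```

Codes that are missing from the dictionary become NaN, which makes unexpected values easy to spot.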
Import Packages
In [ ]: import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
import numpy as np
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv('/content/drive/MyDrive/Dataset/heart_disease_health_i

In [ ]: plt.style.use('ggplot')

Sample Data
This dataset has 22 columns and 253,680 rows.

In [ ]: df.head()

Out[195]: HeartDiseaseorAttack  HighBP  HighChol  CholCheck   BMI  Smoker  Stroke  Diabetes  PhysA
0                          0.0     1.0       1.0        1.0  40.0     1.0     0.0       0.0
1                          0.0     0.0       0.0        0.0  25.0     1.0     0.0       0.0
2                          0.0     1.0       1.0        1.0  28.0     0.0     0.0       0.0
3                          0.0     1.0       0.0        1.0  27.0     0.0     0.0       0.0
4                          0.0     1.0       1.0        1.0  24.0     0.0     0.0       0.0

5 rows × 23 columns

Data Analysis & Outliers


In [ ]: df.describe()

Out[196]: HeartDiseaseorAttack         HighBP       HighChol      CholCheck            BMI
count            253680.000000  253680.000000  253680.000000  253680.000000  253680.000000  2536
mean                  0.094186       0.429001       0.424121       0.962670      28.382364
std                   0.292087       0.494934       0.494210       0.189571       6.608694
min                   0.000000       0.000000       0.000000       0.000000      12.000000
25%                   0.000000       0.000000       0.000000       1.000000      24.000000
50%                   0.000000       0.000000       0.000000       1.000000      27.000000
75%                   0.000000       1.000000       1.000000       1.000000      31.000000
max                   1.000000       1.000000       1.000000       1.000000      98.000000

8 rows × 22 columns

The BMI data has outliers. The maximum BMI is 98.0 kg/m^2, but the 50th percentile (median)
BMI is 27 kg/m^2. This outlier was not excluded because it is still relevant.
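One common rule of thumb for flagging outliers is the 1.5 × IQR fence. The sketch below applies it to the BMI quartiles reported by `df.describe()` above. The helper name `iqr_fences` and the multiplier of 1.5 are assumptions made for illustration; the notebook itself does not apply this rule.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# 25% and 75% BMI values taken from the describe() output above.
lower_fence, upper_fence = iqr_fences(24.0, 31.0)

# Maximum BMI taken from the describe() output above.
bmi_max = 98.0
max_is_outlier = bmi_max > upper_fence
print(lower_fence, upper_fence, max_is_outlier)
```

Under this rule, any BMI above the upper fence would be flagged, so the maximum of 98.0 counts as an outlier.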

Analysis of Age, BMI, and Income of Individuals with Heart Disease

In [ ]: # Map the BRFSS age codes (1-13) to readable age ranges.
AGE_LABELS = {
    1.0: '18-24', 2.0: '25-29', 3.0: '30-34', 4.0: '35-39',
    5.0: '40-44', 6.0: '45-49', 7.0: '50-54', 8.0: '55-59',
    9.0: '60-64', 10.0: '65-69', 11.0: '70-74', 12.0: '75-79',
    13.0: '>80',
}

def age_to_range(Age):
    # Unknown codes fall back to 'test', as in the original if/elif chain.
    return AGE_LABELS.get(Age, 'test')

df["Age_range"] = df['Age'].apply(age_to_range)
# Boolean mask selecting respondents who reported heart disease or a heart attack.
pos_heart_disease = df['HeartDiseaseorAttack'] == 1
df1 = df[pos_heart_disease]

In [ ]: df1["Age_range"].value_counts().plot(kind='bar', color = 'maroon', figs
plt.xlabel("Age Group")
plt.ylabel("Number of Individuals with Heart Disease")
plt.title("Number of Individuals with Heart Disease in Each Age Group")

Out[220]: Text(0.5, 1.0, 'Number of Individuals with Heart Disease in Each Age
Group')

The bar graph shows that the number of individuals with heart disease is generally higher in
the older age groups, which is consistent with the fact that the risk for heart disease
increases after the age of 65.
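Because the bar graph shows counts, the size of each age group also affects the bar heights. A minimal sketch of how the share of respondents with heart disease could be computed per age group is shown below. The tiny `toy` frame is hypothetical; in the notebook the same `groupby` would be applied to `df`.

```python
import pandas as pd

# Hypothetical mini-sample standing in for df (not real survey rows).
toy = pd.DataFrame({
    "Age_range": ["18-24", "18-24", "65-69", "65-69", "65-69", "65-69"],
    "HeartDiseaseorAttack": [0.0, 0.0, 1.0, 0.0, 1.0, 0.0],
})

# The mean of a 0/1 column is the fraction of rows equal to 1.
rate_by_age = toy.groupby("Age_range")["HeartDiseaseorAttack"].mean()
print(rate_by_age)
```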
In [ ]: df1["BMI"].plot(kind='hist', color = 'maroon',bins = 15, figsize=(15,10
plt.xlabel("BMI")
plt.ylabel("Number of Individuals with Heart Disease")
plt.title("BMI of Individuals with Heart Disease")
plt.xticks(np.arange(0, 100, 5))

The histogram shows a peak between 25 and 30 kg/m^2. This falls within the overweight
range.
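To check which weight category each BMI value falls into, the values can be binned with `pd.cut`. The sketch below uses the standard adult BMI cut-offs (18.5, 25, and 30 kg/m^2) on a few hypothetical values; in the notebook the input would be `df1["BMI"]`, and the category labels are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical BMI values (not real survey rows).
bmi = pd.Series([17.0, 22.0, 27.0, 40.0, 98.0])

# Bins: below 18.5, 18.5 to below 25, 25 to below 30, and 30 or more.
bmi_category = pd.cut(
    bmi,
    bins=[0, 18.5, 25, 30, np.inf],
    labels=["Underweight", "Healthy weight", "Overweight", "Obese"],
    right=False,  # intervals are closed on the left, e.g. [25, 30)
)
print(bmi_category.value_counts(sort=False))
```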

Analysis of Relationship Between Heart Disease and Blood Pressure, Cholesterol, Stroke History, and Cigarettes Smoked
In [ ]: fig, axarr = plt.subplots(2, 2, figsize=(15, 10))

# Helpers that turn the 0/1 survey codes into readable labels for plotting.
def high_normal(HighBP):
    if HighBP == 1.0:
        return 'High'
    elif HighBP == 0.0:
        return 'Normal'

def smoke_status(smoke):
    if smoke == 1.0:
        return '> 100'
    elif smoke == 0.0:
        return '< 100'

def alc(drink):
    if drink == 1.0:
        return '> 14'
    elif drink == 0.0:
        return '< 14'

def stroke_hist(stroke):
    if stroke == 1.0:
        return 'History'
    elif stroke == 0.0:
        return 'No History'

# Add the label columns used by the bar charts below.
df["Blood_Pressure"] = df['HighBP'].apply(high_normal)
df["Cholesterol"] = df['HighChol'].apply(high_normal)
df["Smoke_Habits"] = df['Smoker'].apply(smoke_status)
df["Stroke_History"] = df['Stroke'].apply(stroke_hist)
df1 = df[pos_heart_disease]


df1['Blood_Pressure'].value_counts().plot.bar(
ax=axarr[0][0], fontsize=12, color=['maroon', 'navy']
)
axarr[0][0].set_title("Blood Pressure of Individuals with Heart Disease
axarr[0][0].set_ylabel('Number of Individuals')
axarr[0][0].set_xlabel('Blood Pressure')

df1['Smoke_Habits'].value_counts().plot.bar(
ax=axarr[1][1], fontsize=12, color=['maroon', 'navy']
)
axarr[1][1].set_title("Smoking Habits of Individuals with Heart Disease
axarr[1][1].set_ylabel('Number of Individuals')
axarr[1][1].set_xlabel('Number of Cigarettes Smoked')

df1['Stroke_History'].value_counts().plot.bar(
ax=axarr[1][0], fontsize=12, color=['maroon', 'navy']
)
axarr[1][0].set_title("History of Stroke in Individuals with Heart Dise
axarr[1][0].set_ylabel('Number of Individuals')
axarr[1][0].set_xlabel('Existence of Stroke History')

df1['Cholesterol'].value_counts().plot.bar(
ax=axarr[0][1], fontsize=12, color=['maroon', 'navy']
)
axarr[0][1].set_title("Cholesterol of Individuals with Heart Disease",
axarr[0][1].set_ylabel('Number of Individuals')
axarr[0][1].set_xlabel('Cholesterol Level')

plt.subplots_adjust(hspace=.6, wspace=.8)

Blood Pressure of Individuals with Heart Disease


The bar graph shows that in this dataset, the majority of individuals with Heart Disease also
have High Blood Pressure.

In [ ]: Hdisease_BP=df[['HeartDiseaseorAttack', "Blood_Pressure"]].groupby('Hea
Hdisease_BP

Out[200]: HeartDiseaseorAttack Blood_Pressure


0.0 Normal 138886
High 90901
1.0 High 17928
Normal 5965
dtype: int64

In this dataset, 75.0% of individuals with heart disease also had high blood pressure.
By comparison, 39.6% of individuals without heart disease had high blood pressure. This
makes sense since high blood pressure can damage arteries by making them less elastic.
This decreases the flow of blood and oxygen, leading to heart disease. This process is
pictured below:

Cholesterol Levels of Individuals with Heart Disease


The bar graph shows that in this dataset, the majority of individuals with Heart Disease also
have High Cholesterol.

In [ ]: Hdisease_Chol=df[['HeartDiseaseorAttack', "Cholesterol"]].groupby('Hear
Hdisease_Chol

Out[201]: HeartDiseaseorAttack Cholesterol


0.0 Normal 138949
High 90838
1.0 High 16753
Normal 7140
dtype: int64

In this dataset, 70.1% of individuals with heart disease also had high cholesterol. This
makes sense since with high cholesterol, you can develop fatty deposits in your blood
vessels. Eventually, these deposits grow, making it difficult for enough blood to flow through
your arteries. Those deposits can break suddenly and form a clot that causes a heart attack.
This process is pictured below:

Stroke History of Individuals with Heart Disease


The bar graph shows that in this dataset, the majority of individuals with Heart Disease did
not have a history of stroke.

In [ ]: Hdisease_stroke=df[['HeartDiseaseorAttack', "Stroke_History"]].groupby(
Hdisease_stroke

Out[202]: HeartDiseaseorAttack Stroke_History


0.0 No History 223432
History 6355
1.0 No History 19956
History 3937
dtype: int64

In this dataset, 2.8% of individuals without heart disease had a history of stroke. Additionally,
16.5% of individuals with heart disease had a history of stroke. It makes sense that the
percentage of individuals with a history of stroke and Heart Disease is higher than the
percentage of individuals with a history of stroke but no Heart Disease because strokes and
heart disease are closely related. With heart disease, plaque build-up and blood clots in
arteries supplying blood to the brain can cause a stroke.

This process is pictured below:


Smoking Habits of Individuals with Heart Disease

The bar graph shows that in this dataset, the majority of individuals with Heart Disease
smoked more than 100 cigarettes within their lifetime.

In [ ]: Hdisease_Smoke=df[['HeartDiseaseorAttack', "Smoke_Habits"]].groupby('He
Hdisease_Smoke

Out[203]: HeartDiseaseorAttack Smoke_Habits


0.0 < 100 132165
> 100 97622
1.0 > 100 14801
< 100 9092
dtype: int64

In this dataset, 61.9% of individuals with heart disease had also smoked more than 100
cigarettes within their lifetime. By comparison, 42.5% of individuals without heart disease
had smoked more than 100 cigarettes within their lifetime. This makes sense because smoking
increases the formation of plaque in blood vessels, and chemicals in cigarette smoke cause the
blood to thicken and form clots inside veins and arteries.

In [ ]: # Percentage calculations, using the counts from the groupby outputs above

# Blood pressure: share with high blood pressure, with and without heart disease
bp_percent = 17928/(17928 + 5965)
not_bp_percent = 90901/(90901 + 138886)
print(f"{bp_percent:.1%}")
print(f"{not_bp_percent:.1%}")

# Cholesterol: share with high cholesterol, with and without heart disease
chol_percent = 16753/(16753 + 7140)
not_chol_percent = 90838/(90838 + 138949)
print(f"{chol_percent:.1%}")
print(f"{not_chol_percent:.1%}")

# Stroke: share with a history of stroke, with and without heart disease
stroke_percent = 3937/(19956 + 3937)
notstroke_percent = 6355/(223432 + 6355)
print(f"{stroke_percent:.1%}")
print(f"{notstroke_percent:.1%}")

# Smoking: share who smoked more than 100 cigarettes, with and without heart disease
smoke_percent = 14801/(14801 + 9092)
notsmoke_percent = 97622/(132165 + 97622)
print(f"{smoke_percent:.1%}")
print(f"{notsmoke_percent:.1%}")

75.0%
39.6%
70.1%
39.5%
16.5%
2.8%
61.9%
42.5%
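The percentages can also be derived from a table of counts instead of typing each number by hand. The sketch below is one possible way to do this, not the approach used in this notebook. It rebuilds the Blood_Pressure counts from Out[200] as a small DataFrame (the `counts` and `row_share` names are illustrative) and divides each row by its total. On the full data, `pd.crosstab(df['HeartDiseaseorAttack'], df['Blood_Pressure'], normalize='index')` should produce the same kind of table.

```python
import pandas as pd

# Counts copied from the Blood_Pressure groupby output (Out[200]).
counts = pd.DataFrame(
    {"High": [90901, 17928], "Normal": [138886, 5965]},
    index=pd.Index([0.0, 1.0], name="HeartDiseaseorAttack"),
)

# Divide each row by its row total to get the share within each group.
row_share = counts.div(counts.sum(axis=1), axis=0)
print(row_share.round(3))
```

Each row of `row_share` sums to 1, so the two groups can be compared directly even though they differ greatly in size.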

Analysis of the Relationship between Heart Disease and Recent Lifestyle Choices
In [ ]: df["Smoke_Habits"] = df['Smoker'].apply(smoke_status)
fig, axarr = plt.subplots(2, 2, figsize=(15, 10))

# Helpers that turn the 0/1 lifestyle codes into readable labels for plotting.
def foodnum_to_words(food):
    if food == 1.0:
        return 'At Least 1'
    elif food == 0.0:
        return 'none'

def alc(drink):
    if drink == 1.0:
        return '> 14'
    elif drink == 0.0:
        return '< 14'

def physical_activity_word(active):
    if active == 1.0:
        return 'Daily'
    elif active == 0.0:
        return 'Not Daily'

df["Fruit_Diet"] = df['Fruits'].apply(foodnum_to_words)
df["Veggie_Diet"] = df['Veggies'].apply(foodnum_to_words)
df["Physical_Activity"] = df['PhysActivity'].apply(physical_activity_wo
df["Alc_habits"] = df['HvyAlcoholConsump'].apply(alc)
# Re-select the heart disease rows so df1 includes the new label columns.
df1 = df[pos_heart_disease]


df1['Fruit_Diet'].value_counts().plot.bar(
ax=axarr[0][0], fontsize=12, color=['green', 'maroon']

)
axarr[0][0].set_title("Daily Consumption of Fruits of Individuals with
axarr[0][0].set_ylabel('Number of Individuals')
axarr[0][0].set_xlabel('Number of Fruits Eaten Daily')

df1['Physical_Activity'].value_counts().plot.bar(
ax=axarr[1][1], fontsize=12, color=['green', 'maroon']
)
axarr[1][1].set_title("Physical Activity of Individuals with Heart Dise
axarr[1][1].set_ylabel('Number of Individuals')
axarr[1][1].set_xlabel('Participation in Daily Exercise')

df1['Alc_habits'].value_counts().plot.bar(
ax=axarr[1][0], fontsize=12, color=['green', 'maroon']
)
axarr[1][0].set_title("Alcohol Habits of Individuals with Heart Disease
axarr[1][0].set_ylabel('Number of Individuals')
axarr[1][0].set_xlabel('Number of Drinks Per Week')

df1['Veggie_Diet'].value_counts().plot.bar(
ax=axarr[0][1], fontsize=12, color=['green', 'maroon']
)
axarr[0][1].set_title("Daily Consumption of Vegetables of Individuals w
axarr[0][1].set_ylabel('Number of Individuals')
axarr[0][1].set_xlabel('Number of Vegetables Eaten Daily')

plt.subplots_adjust(hspace=.6, wspace=1.7)

According to the bar graphs, most individuals diagnosed with heart disease reported healthy
recent lifestyle habits. These habits include the number of fruits and vegetables eaten daily,
the number of alcoholic drinks per week, and participation in daily physical activity. This
could indicate lifestyle changes being made as recommended by a health professional.

Pearson Statistical Analysis

The Pearson correlation coefficient measures the strength of the linear relationship between
two variables and can also be applied to binary data. This dataset is primarily binary, so this
analysis was used. It revealed that the two variables with the highest correlation were
General Health and Physical Health (r = 0.524364).

In [ ]: # Restrict to numeric columns; the added label columns (e.g. Age_range) hold strings.
df.corr(method='pearson', min_periods=1, numeric_only=True)

Out[221]:             HeartDiseaseorAttack    HighBP  HighChol  CholCheck       BMI  Sm
HeartDiseaseorAttack              1.000000  0.209361  0.180765   0.044206  0.052904  0.1
HighBP                            0.209361  1.000000  0.298199   0.098508  0.213748  0.0
HighChol                          0.180765  0.298199  1.000000   0.085642  0.106722  0.0
CholCheck                         0.044206  0.098508  0.085642   1.000000  0.034495  -0.0
BMI                               0.052904  0.213748  0.106722   0.034495  1.000000  0.0
Smoker                            0.114441  0.096991  0.091299  -0.009929  0.013804  1.0
Stroke                            0.203002  0.129575  0.092620   0.024158  0.020153  0.0
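
A correlation matrix of this size is hard to scan by eye. The sketch below shows one possible way to find the most strongly correlated pair programmatically. It runs on a small hypothetical frame (`toy`) whose values are made up for illustration; in the notebook the same steps would be applied to the result of `df.corr(...)`.

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for three survey columns.
toy = pd.DataFrame({
    "GenHlth":  [1, 2, 3, 4, 5, 3, 2, 4],
    "PhysHlth": [0, 2, 5, 20, 30, 4, 1, 15],
    "Smoker":   [0, 1, 0, 1, 0, 1, 0, 1],
})

corr = toy.corr(method="pearson")

# Keep only the strict upper triangle so each pair appears once
# and the diagonal (always 1.0) is ignored.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Pair with the largest absolute correlation and its coefficient.
strongest_pair = pairs.abs().idxmax()
strongest_r = pairs[strongest_pair]
print(strongest_pair, round(strongest_r, 3))
```

Taking the absolute value first means a strong negative correlation would also be found.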

Conclusion
This data analysis reinforces that high blood pressure, high cholesterol, a history of stroke,
and smoking are all associated with heart disease. It shows that heart disease becomes more
common as age increases and that many people with heart disease are overweight. It also
shows that the majority of the individuals diagnosed with Heart Disease in this data were
making healthy lifestyle choices.

Words From the Analyst


What gave me the most trouble was making all the functions to replace the numerical
categories with words or phrases. The dataset had a lot of data, but it was mostly binary.
Because of this, I couldn't really make any scatterplots. I wouldn't recommend using any
filters before analyzing the data. I enjoyed separating the parameters into subplots. I think that
