
Week 14 – Case Study

(CLO4)
Learning Outcomes

Writing Objectives.

Apply Data Analysis Techniques using Pandas

Apply Data Visualization Techniques

Case Study – Cause of Deaths in US
This case study aims to provide you with a full example of performing data analytics using Python and pandas. By completing this case
study, you will see that pandas provides rich functionality that makes data analytics simple and
straightforward.

The dataset used in this case study comes from the US government's open data portal, which can be accessed at https://data.gov

Data Source.

The data set is a CSV file that contains records of deaths in the United States of America from 1999 to 2015. It has 10,868
rows and 6 columns in total.

Sample Data:

Objectives:

We want to analyze the data to answer the following questions.

Q1: What is the total number of deaths in the US from 1999 to 2016?
Q2: How many people died due to cancer in the state of Vermont?
Q3: Which states have more than 50,000 deaths due to heart disease?
Q4: How many people died due to each cause?
Q5: Show the cause that has the highest number of deaths.
Q6: What is the main cause of death in 2002?
Q7: Which year had the highest number of deaths due to suicide?
Q8: What is the trend of deaths from the year 1999 to 2016?
Q9: Compare the deaths by different causes in the year 2016.
Q10: What are the top 5 states with the highest number of deaths due to heart disease?

Create a new Python Notebook project and upload the data.

1. Open Google Colab (https://colab.research.google.com/)
2. Create a new notebook
3. Rename the notebook
4. Connect to the server
5. Upload the data file

Reading the Data
We read the data from the CSV file and store it in a data frame named USdata.
Use head() to display a sample of the data.
Use shape to verify the total number of rows and columns.
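
The reading step could look roughly like the sketch below. Because the uploaded file is not available here, a small in-memory CSV stands in for it. The three rows are invented, and the column names are assumptions about the file layout rather than something this case study states.

```python
import io
import pandas as pd

# Stand-in for the uploaded CSV file. The rows are invented and the
# column names are assumed; check them against your own file.
csv_text = """Year,113 Cause Name,Cause Name,State,Deaths,Age-adjusted Death Rate
2016,Diseases of heart,Heart disease,California,100,140.1
2016,Malignant neoplasms,Cancer,Vermont,20,150.2
2015,Intentional self-harm,Suicide,Texas,30,12.3
"""

# In Colab you would pass the uploaded file's name instead of the StringIO object.
USdata = pd.read_csv(io.StringIO(csv_text))

print(USdata.head())   # first rows of the data frame (up to five)
print(USdata.shape)    # tuple of (number of rows, number of columns)
```

With the real file, shape should report the row and column counts given in the Data Source section.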

Clean Data
Use dropna() to remove the rows and columns in which all values are null (NaN).

Use fillna() to replace any null values in the Deaths column with 0.
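
A minimal sketch of these two cleaning steps, run on an invented data frame. The "Notes" column is hypothetical and exists only to show an all-null column being dropped; the other column names are assumptions as well.

```python
import numpy as np
import pandas as pd

# Invented sample: the last row and the "Notes" column are entirely NaN.
USdata = pd.DataFrame({
    "Year": [2016, 2016, np.nan],
    "Cause Name": ["Cancer", "Suicide", np.nan],
    "State": ["Vermont", "Texas", np.nan],
    "Deaths": [20, np.nan, np.nan],
    "Notes": [np.nan, np.nan, np.nan],  # hypothetical all-empty column
})

USdata = USdata.dropna(how="all")           # drop rows where every value is NaN
USdata = USdata.dropna(axis=1, how="all")   # drop columns where every value is NaN
USdata["Deaths"] = USdata["Deaths"].fillna(0)  # remaining missing counts become 0

print(USdata)
```

Passing how="all" matters here: the default, how="any", would drop every row that has even one missing value.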

Clean Data
Remove the columns that are not required for our data analysis.

Remove the rows where the Cause Name value is 'All causes', as this is not a specific cause.

Remove the rows where the State value is 'United States', as this is a country, not a state.
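
These three steps might be written as follows. The sample rows are invented, and the choice of which two columns to drop is an assumption; the case study does not name the columns it removes.

```python
import pandas as pd

# Invented sample with the assumed six-column layout.
USdata = pd.DataFrame({
    "Year": [2016, 2016, 2016],
    "113 Cause Name": ["All Causes", "Diseases of heart", "Diseases of heart"],
    "Cause Name": ["All causes", "Heart disease", "Heart disease"],
    "State": ["Texas", "United States", "Texas"],
    "Deaths": [300, 900, 100],
    "Age-adjusted Death Rate": [700.0, 165.5, 170.1],
})

# Drop columns not used in the analysis (assumed choice of columns).
USdata = USdata.drop(columns=["113 Cause Name", "Age-adjusted Death Rate"])

# Keep only rows for a specific cause and a specific state.
USdata = USdata[USdata["Cause Name"] != "All causes"]
USdata = USdata[USdata["State"] != "United States"]

print(USdata)
```

Removing these aggregate rows matters for the later questions, because otherwise the same deaths would be counted more than once in any sum.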

Question 1: What is the total number of deaths in the US from 1999 to 2016?

To find the total number of deaths in the US, we use the statistical command sum() to total the Deaths column.

From the result, we know there were more than thirty-six million deaths from 1999 to 2016.
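
As a rough illustration, the total can be computed as below. The data frame is a tiny invented sample, so the number it prints is not the case study's result.

```python
import pandas as pd

# Invented sample; column names are assumptions about the cleaned data.
USdata = pd.DataFrame({
    "Year": [2015, 2016, 2016],
    "Cause Name": ["Suicide", "Heart disease", "Cancer"],
    "State": ["Texas", "California", "Vermont"],
    "Deaths": [30, 100, 20],
})

total_deaths = USdata["Deaths"].sum()  # add up the whole Deaths column
print(total_deaths)
```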

Question 2: How many people died due to cancer in the state of Vermont?

To find the total number of deaths in Vermont due to cancer, we use a condition filter to get the rows where Cause Name is Cancer and
State is Vermont. We then use the statistical command sum() to get the total number of deaths.

From the result, we know there were 24,635 deaths in Vermont due to cancer.
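
A sketch of the condition filter, using invented rows and assumed column names. The two conditions are combined with &, and each one needs its own parentheses.

```python
import pandas as pd

# Invented sample; column names are assumptions about the cleaned data.
USdata = pd.DataFrame({
    "Year": [2015, 2016, 2016, 2016],
    "Cause Name": ["Cancer", "Cancer", "Heart disease", "Cancer"],
    "State": ["Vermont", "Vermont", "Vermont", "Texas"],
    "Deaths": [12, 13, 40, 900],
})

# Rows where both conditions hold.
mask = (USdata["Cause Name"] == "Cancer") & (USdata["State"] == "Vermont")
vermont_cancer = USdata.loc[mask, "Deaths"].sum()
print(vermont_cancer)
```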

Question 3: Which states have more than 50,000 deaths due to heart disease?

First we filter the rows that have more than 50,000 deaths and where Cause Name is Heart disease, then display the unique values in the column
"State".

From the result, we know that three states are affected: California, New York and Florida.
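
One way to express this filter is sketched below with invented rows and assumed column names. unique() removes the repeats that occur when a state passes the threshold in more than one year.

```python
import pandas as pd

# Invented sample; column names are assumptions about the cleaned data.
USdata = pd.DataFrame({
    "Year": [2015, 2016, 2016, 2016],
    "Cause Name": ["Heart disease", "Heart disease", "Heart disease", "Cancer"],
    "State": ["California", "California", "Vermont", "California"],
    "Deaths": [61000, 62000, 1300, 59000],
})

mask = (USdata["Cause Name"] == "Heart disease") & (USdata["Deaths"] > 50000)
states = USdata.loc[mask, "State"].unique()  # each qualifying state listed once
print(states)
```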

Question 4: How many people died due to each Cause?

We need to group by Cause Name and find the sum of Deaths.

From the result, we can see that heart disease is the major cause of death in the US from 1999 to 2015.
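
The grouping step might look like this sketch, which uses invented rows and assumed column names.

```python
import pandas as pd

# Invented sample; column names are assumptions about the cleaned data.
USdata = pd.DataFrame({
    "Year": [2015, 2016, 2015, 2016, 2016],
    "Cause Name": ["Heart disease", "Heart disease", "Cancer", "Cancer", "Suicide"],
    "State": ["Texas", "Texas", "Texas", "Vermont", "Texas"],
    "Deaths": [100, 50, 20, 25, 30],
})

# One total per cause; the result is a Series indexed by cause name.
deaths_by_cause = USdata.groupby("Cause Name")["Deaths"].sum()
print(deaths_by_cause)
```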

Question 5: Show the cause that has the highest number of deaths.

We sort the rows by Deaths in descending order and take the first row.

From the result, we can see that heart disease has the highest number of deaths, recorded in California in 1999.

Note: We can also solve this question using conditional and statistical commands.
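
Both approaches are sketched here on invented rows with assumed column names: the first sorts and takes the top row, and the second filters for the row whose Deaths value equals the column maximum.

```python
import pandas as pd

# Invented sample; column names are assumptions about the cleaned data.
USdata = pd.DataFrame({
    "Year": [1999, 2016, 2016],
    "Cause Name": ["Heart disease", "Cancer", "Suicide"],
    "State": ["California", "Texas", "Vermont"],
    "Deaths": [700, 400, 10],
})

# Approach 1: sort in descending order and keep the first row.
top_row = USdata.sort_values("Deaths", ascending=False).head(1)

# Approach 2: condition filter combined with the statistical command max().
top_row_alt = USdata[USdata["Deaths"] == USdata["Deaths"].max()]

print(top_row)
print(top_row_alt)
```

The two differ when several rows tie for the maximum: head(1) returns one row, while the condition filter returns every tied row.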

Question 6: What is the main cause of death in 2002?

We first filter the rows where Year is 2002, then use group by to get the sum of deaths for each cause, sort the results and take the topmost row.

From the result, we can see that heart disease had the highest number of deaths in 2002.
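
The filter, group, sort and take-first sequence could be chained as in this sketch, again on invented rows with assumed column names.

```python
import pandas as pd

# Invented sample; column names are assumptions about the cleaned data.
USdata = pd.DataFrame({
    "Year": [2002, 2002, 2002, 2003],
    "Cause Name": ["Heart disease", "Heart disease", "Cancer", "Cancer"],
    "State": ["Texas", "Vermont", "Texas", "Texas"],
    "Deaths": [100, 10, 90, 500],
})

deaths_2002 = USdata[USdata["Year"] == 2002]            # keep only 2002
by_cause = (
    deaths_2002.groupby("Cause Name")["Deaths"].sum()   # total per cause
    .sort_values(ascending=False)                       # largest first
)
main_cause = by_cause.head(1)
print(main_cause)
```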

Question 7: Which year had the highest number of deaths due to suicide?

We first filter the rows where Cause Name is Suicide, then sort the result in descending order by the column 'Deaths'. We display only the columns Year and
Deaths.

From the result, we can see that the highest number of suicide deaths occurred in the year 2017.
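
A sketch of the steps as described, using invented rows and assumed column names. The printed values come from the sample, not from the real data.

```python
import pandas as pd

# Invented sample; column names are assumptions about the cleaned data.
USdata = pd.DataFrame({
    "Year": [2015, 2016, 2017, 2017],
    "Cause Name": ["Suicide", "Suicide", "Suicide", "Cancer"],
    "State": ["Texas", "Texas", "Texas", "Texas"],
    "Deaths": [30, 35, 40, 900],
})

suicide = USdata[USdata["Cause Name"] == "Suicide"]
# Sort largest first, then keep only the two columns of interest.
ranked = suicide.sort_values("Deaths", ascending=False)[["Year", "Deaths"]]
print(ranked.head(1))
```

If the data holds one row per state and year, this returns the single largest state-level row; grouping by Year and summing first would rank national totals instead.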

Question 8: What is the trend of deaths from the year 2000 to 2016?

We need to filter the rows for years between 2000 and 2016 and calculate the total number of deaths for each year. We display the result in a line chart to view the trend over the years,
and enhance the chart with a title, colour, x label and y label.

From the result, we can see that the number of deaths rose and fell between 2000 and 2009,
and that after 2009 it increased rapidly.
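
The trend chart might be produced along these lines. The rows are invented, the column names are assumed, and the title, colour and label text are illustrative choices only.

```python
import pandas as pd
import matplotlib.pyplot as plt  # pandas uses matplotlib for plotting

# Invented sample; column names are assumptions about the cleaned data.
USdata = pd.DataFrame({
    "Year": [1999, 2000, 2000, 2008, 2016, 2017],
    "Cause Name": ["Cancer"] * 6,
    "State": ["Texas", "Texas", "Vermont", "Texas", "Texas", "Texas"],
    "Deaths": [5, 10, 20, 40, 50, 60],
})

# between() includes both end points by default.
in_range = USdata[USdata["Year"].between(2000, 2016)]
deaths_by_year = in_range.groupby("Year")["Deaths"].sum()

# Line chart with title, colour and axis labels.
ax = deaths_by_year.plot(kind="line", color="red", title="Deaths per year, 2000 to 2016")
ax.set_xlabel("Year")
ax.set_ylabel("Number of deaths")
# In a notebook the figure renders when the cell finishes; in a script, call plt.show().
```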

Question 9: Compare the deaths by different causes in the year 2016.

We need to filter the rows for the year 2016 and calculate the total number of deaths for each cause. We display the result in a bar chart to view the total deaths by cause, and enhance
the chart with a title, colour, x label and y label.

From the result, we can see that most deaths in 2016 were due to heart
disease and cancer.
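
A sketch of the bar chart on invented rows with assumed column names. The styling values are illustrative.

```python
import pandas as pd
import matplotlib.pyplot as plt  # pandas uses matplotlib for plotting

# Invented sample; column names are assumptions about the cleaned data.
USdata = pd.DataFrame({
    "Year": [2016, 2016, 2016, 2016, 2015],
    "Cause Name": ["Heart disease", "Heart disease", "Cancer", "Suicide", "Cancer"],
    "State": ["Texas", "Vermont", "Texas", "Texas", "Texas"],
    "Deaths": [100, 10, 90, 30, 500],
})

deaths_2016 = USdata[USdata["Year"] == 2016]
deaths_by_cause = deaths_2016.groupby("Cause Name")["Deaths"].sum()

# Bar chart with title, colour and axis labels.
ax = deaths_by_cause.plot(kind="bar", color="green", title="Deaths by cause, 2016")
ax.set_xlabel("Cause")
ax.set_ylabel("Number of deaths")
# In a notebook the figure renders when the cell finishes; in a script, call plt.show().
```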

Question 10: What are the top 5 states with the highest number of deaths due to heart disease?

We need to filter the rows where Cause Name is Heart disease, then calculate the total number of deaths for each state. We sort the result in descending order of Deaths, keep the first five states, and display
the result using a pie chart. We enhance the chart with a title.

From the result, we can see the top 5 states in the US with the highest
number of deaths due to heart disease.
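
The top-five pie chart could be sketched as follows. The numbers are invented and do not reflect the real ranking; column names and the chart title are assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt  # pandas uses matplotlib for plotting

# Invented sample; column names are assumptions about the cleaned data.
USdata = pd.DataFrame({
    "Year": [2016] * 7,
    "Cause Name": ["Heart disease"] * 6 + ["Cancer"],
    "State": ["California", "Florida", "Texas", "New York",
              "Pennsylvania", "Vermont", "California"],
    "Deaths": [60, 45, 40, 44, 33, 1, 59],
})

heart = USdata[USdata["Cause Name"] == "Heart disease"]
top5 = (
    heart.groupby("State")["Deaths"].sum()   # total per state
    .sort_values(ascending=False)            # largest first
    .head(5)                                 # keep the top five
)

ax = top5.plot(kind="pie", title="Top 5 states by heart disease deaths", autopct="%1.1f%%")
ax.set_ylabel("")  # hide the default "Deaths" label beside the pie
# In a notebook the figure renders when the cell finishes; in a script, call plt.show().
```

Note that the slices show each state's share of the top-five total, not its share of all heart disease deaths.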

Conclusion:

This case study reveals that the number of deaths in the United States increased from 1999 to 2016, a clear upward trend. The
top 3 causes of death in the United States during this period are heart disease, cancer, and stroke. California, Florida, Texas, New York,
and Pennsylvania had the greatest numbers of deaths.

Recommendation: Since the main cause of death is heart disease, the government should raise public awareness of the
healthy lifestyle and diet that help prevent heart disease. There should also be more medical care for heart disease.
