
Worksheet 2.5
DS100-1
CLEANING DATA IN PYTHON
APPLIED DATA SCIENCE
Name: Enrico, Dionne Marc L.

Write code in a Jupyter notebook as required by each problem. Copy both the code and its output as a screenshot and paste
them here.

1 Import literacy_birth_rate.csv and assign it to a DataFrame named data_1. Write code that explores this
DataFrame. List at least 4 problems associated with this DataFrame.
Code and Output

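import pandas as pd
# The problem asks for the DataFrame to be named data_1
data_1 = pd.read_csv("literacy_birth_rate.csv")
# info() prints its summary itself and returns None, so it is not wrapped in print()
data_1.info()
print(data_1.head())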
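
Beyond info(), a few other checks commonly help when looking for problems in a DataFrame. The following is a minimal sketch on a made-up frame, not on literacy_birth_rate.csv itself; the column names and values are hypothetical and the issues are planted on purpose.

```python
import pandas as pd

# Made-up frame with deliberately planted issues (hypothetical columns).
df = pd.DataFrame({
    "country": ["A", "B", "B", None],
    "rate": ["1.5", "2.0", "2.0", "n/a"],
})

# Column types: numbers stored as text will not show up as a numeric dtype.
print(df.dtypes)

# Missing values per column.
missing = df.isna().sum()
print(missing)

# Number of rows that exactly repeat an earlier row.
dupes = df.duplicated().sum()
print(dupes)
```

Running the same three checks on data_1 would be one way to gather evidence for the list of problems the question asks for.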

2 Import the following files: uber_apr.csv, uber_may.csv, uber_jun.csv. Concatenate these files into a single DataFrame,
uber. Print the first 6 rows of the resulting DataFrame. Ensure that the index labels are in order.
Code and Output

import pandas as pd
# Read in the csv files using the read_csv function
apr_df = pd.read_csv("uber_apr.csv")
may_df = pd.read_csv("uber_may.csv")
jun_df = pd.read_csv("uber_jun.csv")

# ignore_index=True renumbers the rows 0..n-1 so the index labels are in order
uber = pd.concat([apr_df, may_df, jun_df], ignore_index=True)

print(uber.head(6))
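
By default, pd.concat keeps each input's own row labels, so the combined index restarts at 0 for every file. A minimal sketch with two made-up stand-in frames (the column name pickups is hypothetical, not taken from the Uber files) shows the difference that ignore_index=True makes:

```python
import pandas as pd

# Made-up stand-ins for two monthly files; "pickups" is a hypothetical column.
apr = pd.DataFrame({"pickups": [10, 20]})
may = pd.DataFrame({"pickups": [30, 40]})

# Default behaviour: each frame keeps its own labels, so 0 and 1 repeat.
stacked = pd.concat([apr, may])
print(stacked.index.tolist())     # [0, 1, 0, 1]

# ignore_index=True discards the old labels and numbers the rows 0..n-1.
renumbered = pd.concat([apr, may], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```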

3 Import tuberculosis.csv. Print the first five lines. Melt the DataFrame, keeping the country and year columns fixed.
Print the last five lines of the melted DataFrame.
Code and Output

import pandas as pd
tb = pd.read_csv("tuberculosis.csv")
print(tb.head())
tb_melt = pd.melt(tb, id_vars=['country', 'year'])
print(tb_melt.tail())
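
To see what melting does to the shape of a table, here is a minimal sketch on a made-up wide frame. Only the id columns country and year come from the problem; the value columns a and b and all values are hypothetical.

```python
import pandas as pd

# Made-up wide table: one row per country/year, one column per measured variable.
wide = pd.DataFrame({
    "country": ["X", "Y"],
    "year": [2000, 2000],
    "a": [1, 2],
    "b": [3, 4],
})

# Every non-id column becomes (variable, value) pairs stacked into rows.
long = pd.melt(wide, id_vars=["country", "year"])
print(long)
print(long.shape)  # (4, 4): 2 rows x 2 melted columns, plus variable and value
```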

4 Use the melted DataFrame from the previous problem. Create (and populate) a gender column and an age column from the variable
column. Print the first five rows of the resulting DataFrame. Convert the age column to a numeric data type. Hint: use
pd.to_numeric with the errors parameter set to 'coerce'. Show evidence that this column has indeed been
converted to a numeric type.
Code and Output

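import pandas as pd
tb = pd.read_csv("tuberculosis.csv")
print(tb.head())
tb_melt = pd.melt(tb, id_vars=['country', 'year'])
print(tb_melt.tail())
# First character of variable is the gender; the remaining characters are the age
tb_melt['gender'] = tb_melt.variable.str[0]
tb_melt['age'] = tb_melt.variable.str[1:]
print(tb_melt.head())
# errors='coerce' turns any value that cannot be parsed as a number into NaN
tb_melt['age'] = pd.to_numeric(tb_melt['age'], errors='coerce')
# Evidence of the conversion: the dtype of the age column as pandas reports it
print(tb_melt['age'].dtype)
tb_melt.info()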
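
The effect of errors='coerce' is easiest to see on a tiny example. This is a minimal sketch with made-up codes (one leading letter followed by digits, plus one code with no digits); the real values in tuberculosis.csv may differ.

```python
import pandas as pd

# Made-up codes; the last one has no digits after the leading letter.
codes = pd.Series(["m12", "f345", "mx"])

gender = codes.str[0]   # first character
age = codes.str[1:]     # everything after the first character

# Values that cannot be parsed as numbers become NaN instead of raising an error.
age_num = pd.to_numeric(age, errors="coerce")
print(age_num)

# One way to show evidence of the conversion without hard-coding a dtype name.
print(pd.api.types.is_numeric_dtype(age_num))  # True
```

Checking the dtype programmatically, rather than printing a fixed message, keeps the evidence tied to what pandas actually produced.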

5 Merge the files site.csv and visited.csv into a single DataFrame. Join on the name column of site and the site
column of visited. Make sure that the index labels are in order. Print the resulting DataFrame.
Code and Output

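import pandas as pd
# Read the two csv files
site = pd.read_csv('site.csv')
visited = pd.read_csv('visited.csv')

# Merge on the name column of site and the site column of visited;
# a merge on columns returns a fresh 0..n-1 index, so the labels are in order
df = pd.merge(site, visited, left_on='name', right_on='site')
print(df)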
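
Merging matches rows on a key, which is different from stacking two files on top of each other with concat. A minimal sketch with made-up stand-in frames follows; only the key columns name and site come from the problem statement, while the other columns (lat, ident) and all values are hypothetical.

```python
import pandas as pd

# Made-up stand-ins for site.csv and visited.csv.
site = pd.DataFrame({"name": ["A", "B"], "lat": [1.0, 2.0]})
visited = pd.DataFrame({"ident": [10, 11, 12], "site": ["A", "A", "B"]})

# Match each visited row to the site row whose name equals its site value.
merged = pd.merge(site, visited, left_on="name", right_on="site")
print(merged)
print(merged.index.tolist())  # [0, 1, 2]
```

Because site A appears twice in visited, its row from site is repeated in the result, once per matching visit.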
