Lab Assignment 3

8/31/2021 Lab 4 Data Cleaning & Seaborn - Jupyter Notebook
Lab 4 Data Cleaning & Seaborn

Data Cleaning With Pandas and NumPy
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
In [2]:
df = pd.read_csv(r'https://raw.githubusercontent.com/realpython/python-data-cleaning/master
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8287 entries, 0 to 8286
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Identifier 8287 non-null int64
1 Edition Statement 773 non-null object
2 Place of Publication 8287 non-null object
3 Date of Publication 8106 non-null object
4 Publisher 4092 non-null object
5 Title 8287 non-null object
6 Author 6509 non-null object
7 Contributors 8287 non-null object
8 Corporate Author 0 non-null float64
9 Corporate Contributors 0 non-null float64
10 Former owner 1 non-null object
11 Engraver 0 non-null float64
12 Issuance type 8287 non-null object
13 Flickr URL 8287 non-null object
14 Shelfmarks 8287 non-null object
dtypes: float64(3), int64(1), object(11)
memory usage: 971.3+ KB
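The non-null counts above already hint at which columns carry little or no data. A minimal sketch of turning that into numbers, assuming a made-up four-row miniature of the books table (the `books` frame and its values are hypothetical, not the lab's data):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the books table; values are illustrative only.
books = pd.DataFrame({
    'Identifier': [206, 216, 218, 472],
    'Edition Statement': [np.nan, np.nan, np.nan, 'A new edition'],
    'Corporate Author': [np.nan, np.nan, np.nan, np.nan],
    'Title': ['Walter Forbes', 'All for Greed', 'Love the Avenger', 'Welsh Sketches'],
})

# Fraction of missing values per column, highest first.
missing_share = books.isna().mean().sort_values(ascending=False)
print(missing_share)

# Columns with no data at all are the safest candidates for dropping.
empty_columns = books.columns[books.isna().all()].tolist()
print(empty_columns)
```

Columns that are entirely empty can be dropped without losing information; partly filled ones are a judgment call.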
In [3]:
df.describe()
Out[3]:
Identifier Corporate Author Corporate Contributors Engraver
count 8.287000e+03 0.0 0.0 0.0
mean 2.017344e+06 NaN NaN NaN
std 1.190379e+06 NaN NaN NaN
min 2.060000e+02 NaN NaN NaN
25% 9.157875e+05 NaN NaN NaN
50% 2.043707e+06 NaN NaN NaN
75% 3.047430e+06 NaN NaN NaN
max 4.160339e+06 NaN NaN NaN
localhost:8888/notebooks/CO327(ML) Lab/Lab 4/Lab 4 Data Cleaning %26 Seaborn.ipynb 1/19

In [4]:
df.head()
Out[4]:
   Identifier  Edition Statement             Place of Publication      Date of Publication  Publisher              Title                                               Author     Contributors
0  206         NaN                           London                    1879 [1878]          S. Tinsley & Co.       Walter Forbes. [A novel.] By A. A                   A. A.      FORBES, Walter.
1  216         NaN                           London; Virtue & Yorston  1868                 Virtue & Co.           All for Greed. [A novel. The dedication signed...   A., A. A.  BLAZE DE BURY, Marie Pauline Rose - Baroness
2  218         NaN                           London                    1869                 Bradbury, Evans & Co.  Love the Avenger. By the author of “All for Gr...   A., A. A.  BLAZE DE BURY, Marie Pauline Rose - Baroness
3  472         NaN                           London                    1851                 James Darling          Welsh Sketches, chiefly ecclesiastical, to the...   A., E. S.  Appleyard, Ernest Silvanus.
4  480         A new edition, revised, etc.  London                    1857                 Wertheim & Macintosh   [The World in which I live, and my place in it...   A., E. S.  BROOME, John Henry.


In [5]:
df.tail()
Out[5]:
      Identifier  Edition Statement  Place of Publication  Date of Publication  Publisher                Title                                               Author                                              ...
8282  4158088     NaN                London                1838                 NaN                      The Parochial History of Cornwall, founded on,...   GIDDY, afterwards GILBERT, Davies.                  ...
8283  4158128     NaN                Derby                 1831, 32             M. Mozley & Son          The History and Gazetteer of the County of Der...   GLOVER, Stephen - of Derby                          ...
8284  4159563     NaN                London                [1806]-22            T. Cadell and W. Davies  Magna Britannia; being a concise topographical...   LYSONS, Daniel - M.A., F.R.S., and LYSONS (Sam...   ...
8285  4159587     NaN                Newcastle upon Tyne   1834                 Mackenzie & Dent         An historical, topographical and descriptive v...   Mackenzie, E. (Eneas)                               ...
8286  4160339     NaN                London                1834-43              NaN                      Collectanea Topographica et Genealogica. [Firs...   NaN                                                 ...
1. Dropping Columns in a DataFrame
In [6]:
columns_with_nan = ['Edition Statement', 'Corporate Author', 'Corporate Contributors', 'For

'Contributors', 'Issuance type', 'Shelfmarks']
# Method 1
df.drop(columns_with_nan, inplace = True, axis = 1) # axis = 1 (or axis = 'columns') is ver
# when inplace = True => the data is modified in place, it will return nothing and the dataf
# when inplace = False => new dataframe is created
# Method 2
# df.drop(columns = columns_with_nan, inplace = True)
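The difference between the two `inplace` settings is easy to check on a toy frame. A minimal sketch, assuming a hypothetical two-row `frame` with made-up values:

```python
import pandas as pd

# Hypothetical toy frame; only the column names echo the lab's dataset.
frame = pd.DataFrame({'Title': ['a', 'b'],
                      'Engraver': [None, None],
                      'Shelfmarks': ['x', 'y']})
to_drop = ['Engraver', 'Shelfmarks']

# Without inplace, drop() returns a new DataFrame and leaves the original untouched.
trimmed = frame.drop(columns=to_drop)

# With inplace=True the frame itself is modified and the call returns None.
result = frame.copy()
returned = result.drop(to_drop, axis=1, inplace=True)

print(list(trimmed.columns), list(frame.columns), returned)
```

Assigning the result of an `inplace = True` call back to a variable is a common slip, because the variable then holds `None`.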
2. Changing the Index of a DataFrame
In [7]:
# check if values in the object are unique

df['Identifier'].is_unique
Out[7]:
True

In [8]:
# change index to the values in Identifier

df = df.set_index('Identifier')
df
Out[8]:
            Place of Publication      Date of Publication  Publisher                Title                                               Author                                              Flickr URL
Identifier
206         London                    1879 [1878]          S. Tinsley & Co.         Walter Forbes. [A novel.] By A. A                   A. A.                                               http://www.flickr.com/photo
216         London; Virtue & Yorston  1868                 Virtue & Co.             All for Greed. [A novel. The dedication signed...   A., A. A.                                           http://www.flickr.com/photo
218         London                    1869                 Bradbury, Evans & Co.    Love the Avenger. By the author of “All for Gr...   A., A. A.                                           http://www.flickr.com/photo
472         London                    1851                 James Darling            Welsh Sketches, chiefly ecclesiastical, to the...   A., E. S.                                           http://www.flickr.com/photo
480         London                    1857                 Wertheim & Macintosh     [The World in which I live, and my place in it...   A., E. S.                                           http://www.flickr.com/photo
...         ...                       ...                  ...                      ...                                                 ...                                                 ...
4158088     London                    1838                 NaN                      The Parochial History of Cornwall, founded on,...   GIDDY, afterwards GILBERT, Davies.                  http://www.flickr.com/photo
4158128     Derby                     1831, 32             M. Mozley & Son          The History and Gazetteer of the County of Der...   GLOVER, Stephen - of Derby                          http://www.flickr.com/photo
4159563     London                    [1806]-22            T. Cadell and W. Davies  Magna Britannia; being a concise topographical...   LYSONS, Daniel - M.A., F.R.S., and LYSONS (Sam...   http://www.flickr.com/photo
4159587     Newcastle upon Tyne       1834                 Mackenzie & Dent         An historical, topographical and descriptive v...   Mackenzie, E. (Eneas)                               http://www.flickr.com/photo
4160339     London                    1834-43              NaN                      Collectanea Topographica et Genealogica. [Firs...   NaN                                                 http://www.flickr.com/photo
8287 rows × 6 columns
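Once `Identifier` is the index, rows can be looked up by that label instead of by position. A minimal sketch on a three-row excerpt (the `books` frame here is a cut-down stand-in, not the full dataset):

```python
import pandas as pd

# Cut-down stand-in for the books table, using values shown in df.head().
books = pd.DataFrame({
    'Identifier': [206, 216, 218],
    'Place of Publication': ['London', 'London; Virtue & Yorston', 'London'],
})

# Only promote a column to the index when its values are unique.
assert books['Identifier'].is_unique
indexed = books.set_index('Identifier')

# .loc looks rows up by label (the Identifier), .iloc by integer position.
by_label = indexed.loc[216, 'Place of Publication']
by_position = indexed.iloc[1, 0]
print(by_label, '|', by_position)
```

Both lookups return the same cell here; `.loc` keeps working if the rows are later reordered, while `.iloc` does not.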
3. Tidying up Fields in the Data

In [9]:
# we want only numerical or categorical data to perform calculations

df.dtypes.value_counts() # return counts of unique dtypes in this object
Out[9]:
object 6
dtype: int64
In [10]:
df.loc[1879:, 'Date of Publication']

Out[10]:
Identifier
1905 1888
1929 1839, 38-54
2836 1897
2854 1865
2956 1860-63
...
4158088 1838
4158128 1831, 32
4159563 [1806]-22
4159587 1834
4160339 1834-43
Name: Date of Publication, Length: 8275, dtype: object
In [11]:
# since, a particular book can have only one date of publication

# so, remove the extra dates, convert date ranges to their 'start date', remove/replace dat
# & convert the string nan to NumPy’s NaN value
# r'^(\d{4})' => regex finds any four digits at the beginning of a string
extr = df['Date of Publication'].str.extract(r'^(\d{4})', expand = False)
extr
Out[11]:
Identifier
206 1879
216 1868
218 1869
472 1851
480 1857
...
4158088 1838
4158128 1831
4159563 NaN
4159587 1834
4160339 1834
Name: Date of Publication, Length: 8287, dtype: object
In [12]:
# convert the extracted year strings to numbers; missing values become NumPy's NaN, so the dtype is float64
df['Date of Publication'] = pd.to_numeric(extr)
df['Date of Publication'].dtype
Out[12]:
dtype('float64')
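The extract-then-convert step can be tried in isolation. A minimal sketch using four date strings that appear in the outputs above, plus one `None` added here as a stand-in for a missing date:

```python
import pandas as pd

# Four values taken from the column above; the trailing None is an added stand-in.
raw_dates = pd.Series(['1879 [1878]', '1839, 38-54', '1860-63', '[1806]-22', None])

# Capture four digits only when they start the string; anything else becomes NaN.
years = raw_dates.str.extract(r'^(\d{4})', expand=False)

# Strings -> numbers; the NaN entries force a floating-point dtype.
years_numeric = pd.to_numeric(years)
print(years_numeric)
```

Note that `[1806]-22` is lost because the pattern is anchored with `^` and the string starts with a bracket; a looser pattern would keep it, at the cost of also matching digits in less reliable positions.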

4. Combining str Methods with NumPy to Clean Columns
In [13]:
# we will clean Place of Publication since this column has string objects
df['Place of Publication']
Out[13]:
Identifier
206 London
216 London; Virtue & Yorston
218 London
472 London
480 London
...
4158088 London
4158128 Derby
4159563 London
4159587 Newcastle upon Tyne
4160339 London
Name: Place of Publication, Length: 8287, dtype: object
In [14]:
df.loc[4157862]
Out[14]:
Place of Publication Newcastle-upon-Tyne
Date of Publication 1867
Publisher T. Fordyce
Title Local Records; or, Historical Register of rema...
Author FORDYCE, T. - Printer, of Newcastle-upon-Tyne
Flickr URL http://www.flickr.com/photos/britishlibrary/ta...
Name: 4157862, dtype: object
In [15]:
df.loc[4159587]
# note that the place in record 4157862 has hyphens in its name, while the same place is written without them here
Out[15]:
Place of Publication Newcastle upon Tyne
Date of Publication 1834
Publisher Mackenzie & Dent
Title An historical, topographical and descriptive v...
Author Mackenzie, E. (Eneas)
Flickr URL http://www.flickr.com/photos/britishlibrary/ta...
Name: 4159587, dtype: object
In [16]:
# str.contains() => test if pattern or regex is contained within a string of a Series or In
pub = df['Place of Publication']

london = pub.str.contains('London')
oxford = pub.str.contains('Oxford')

In [17]:
# numpy.where(condition[, x, y]) => when True, yield x, otherwise yield y

# we also replace hyphens with a space with str.replace() and reassign to the column in our
df['Place of Publication'] = np.where(london, 'London', np.where(oxford, 'Oxford', pub.str.

df['Place of Publication']
Out[17]:
Identifier
206 London
216 London
218 London
472 London
480 London
...
4158088 London
4158128 Derby
4159563 London
4159587 Newcastle upon Tyne
4160339 London
Name: Place of Publication, Length: 8287, dtype: object
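Nested `np.where` calls behave like an if / elif / else applied to the whole column. A minimal sketch on four place names; `'Oxford, printed'` is a made-up value added for illustration, the others appear in the outputs above:

```python
import numpy as np
import pandas as pd

# 'Oxford, printed' is hypothetical; the other three values come from the column above.
places = pd.Series(['London; Virtue & Yorston', 'Oxford, printed',
                    'Newcastle-upon-Tyne', 'Derby'])

is_london = places.str.contains('London')
is_oxford = places.str.contains('Oxford')

# if London -> 'London'; elif Oxford -> 'Oxford'; else -> original with hyphens replaced
cleaned = np.where(is_london, 'London',
                   np.where(is_oxford, 'Oxford',
                            places.str.replace('-', ' ', regex=False)))

# np.where returns a NumPy array, so wrap it to keep the original index.
cleaned = pd.Series(cleaned, index=places.index)
print(cleaned.tolist())
```

If the column contained missing values, `str.contains` would return NaN for them, so passing `na=False` to it would be a sensible addition in that case.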
5. Cleaning the Entire Dataset Using the applymap Function
In [18]:
with open('university_towns.txt') as file:
    lines = file.readlines()
lines
# note that each state name carries the '[edit]' substring, which can be taken advantage of
Out[18]:
['Alabama[edit]\n',
'Auburn (Auburn University)[1]\n',
'Florence (University of North Alabama)\n',
'Jacksonville (Jacksonville State University)[2]\n',
'Livingston (University of West Alabama)[2]\n',
'Montevallo (University of Montevallo)[2]\n',
'Troy (Troy University)[2]\n',
'Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]\n',
'Tuskegee (Tuskegee University)[5]\n',
'Alaska[edit]\n',
'Fairbanks (University of Alaska Fairbanks)[2]\n',
'Arizona[edit]\n',
'Flagstaff (Northern Arizona University)[6]\n',
'Tempe (Arizona State University)\n',
'Tucson (University of Arizona)\n',
'Arkansas[edit]\n',
'Arkadelphia (Henderson State University, Ouachita Baptist University)[2]

In [19]:
university_towns = [] # creating a list of (state, city) tuples

with open('university_towns.txt') as file:
    for line in file:
        if '[edit]' in line:
            # remember this `state` until the next is found
            state = line
        else:
            # otherwise, we have a city; keep `state` as last-seen
            university_towns.append((state, line))
university_towns[1:10]
Out[19]:
[('Alabama[edit]\n', 'Florence (University of North Alabama)\n'),
('Alabama[edit]\n', 'Jacksonville (Jacksonville State University)[2]\n'),
('Alabama[edit]\n', 'Livingston (University of West Alabama)[2]\n'),
('Alabama[edit]\n', 'Montevallo (University of Montevallo)[2]\n'),
('Alabama[edit]\n', 'Troy (Troy University)[2]\n'),
('Alabama[edit]\n',
'Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]\n'),
('Alabama[edit]\n', 'Tuskegee (Tuskegee University)[5]\n'),
('Alaska[edit]\n', 'Fairbanks (University of Alaska Fairbanks)[2]\n'),
('Arizona[edit]\n', 'Flagstaff (Northern Arizona University)[6]\n')]
In [20]:
# wrapping that list in a DataFrame and setting column names
towns_df = pd.DataFrame(university_towns, columns = ['State', 'Region Name'])

towns_df.head()
Out[20]:
State Region Name
0 Alabama[edit]\n Auburn (Auburn University)[1]\n
1 Alabama[edit]\n Florence (University of North Alabama)\n
2 Alabama[edit]\n Jacksonville (Jacksonville State University)[2]\n
3 Alabama[edit]\n Livingston (University of West Alabama)[2]\n
4 Alabama[edit]\n Montevallo (University of Montevallo)[2]\n
In [21]:
# function that takes an element from the DataFrame as its parameter

# inside the function, checks are performed to determine whether there’s a ( or [ in the el
def get_citystate(item):
    if '(' in item:
        return item[:item.find('(')]
    elif '[' in item:
        return item[:item.find('[')]
    else:
        return item

In [22]:
# df.applymap => apply a function to a Dataframe elementwise
towns_df = towns_df.applymap(get_citystate)
towns_df.head()
Out[22]:
State Region Name
0 Alabama Auburn
1 Alabama Florence
2 Alabama Jacksonville
3 Alabama Livingston
4 Alabama Montevallo
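Two details of the elementwise step may be worth a sketch. First, slicing at `(` leaves a trailing space (`'Auburn '`), which `str.strip()` removes. Second, `DataFrame.applymap` is deprecated from pandas 2.1 in favour of `DataFrame.map`. The variant below is an adaptation under those assumptions, not the notebook's own function, and the two-row `towns` frame is a cut-down stand-in:

```python
import pandas as pd

def get_citystate(item):
    """Keep only the text before the first '(' or '[' and trim whitespace."""
    for marker in ('(', '['):
        if marker in item:
            return item[:item.find(marker)].strip()
    return item.strip()

# Two rows in the same shape as towns_df before cleaning.
towns = pd.DataFrame({
    'State': ['Alabama[edit]\n', 'Alabama[edit]\n'],
    'Region Name': ['Auburn (Auburn University)[1]\n',
                    'Florence (University of North Alabama)\n'],
})

# Prefer DataFrame.map where it exists; fall back to applymap on older pandas.
elementwise = getattr(towns, 'map', None) or towns.applymap
cleaned = elementwise(get_citystate)
print(cleaned)
```

Checking `(` before `[` preserves the precedence of the original if / elif chain.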
6. Renaming Columns and Skipping Rows
In [23]:
olympics_df = pd.read_csv(r'https://raw.githubusercontent.com/realpython/python-data-cleani
olympics_df
Out[23]:
     0                                             1         2     3     4     5      6         7     8     9     10     11       12    ...
0    NaN                                           ? Summer  01 !  02 !  03 !  Total  ? Winter  01 !  02 !  03 !  Total  ? Games  01 !  ...
1    Afghanistan (AFG)                             13        0     0     2     2      0         0     0     0     0      13       0     ...
2    Algeria (ALG)                                 12        5     2     8     15     3         0     0     0     0      15       5     ...
3    Argentina (ARG)                               23        18    24    28    70     18        0     0     0     0      41       18    ...
4    Armenia (ARM)                                 5         1     2     9     12     6         0     0     0     0      11       1     ...
...  ...                                           ...       ...   ...   ...   ...    ...       ...   ...   ...   ...    ...      ...   ...
143  Independent Olympic Participants (IOP) [IOP]  1         0     1     2     3      0         0     0     0     0      1        0     ...
144  Zambia (ZAM) [ZAM]                            12        0     1     1     2      0         0     0     0     0      12       0     ...
145  Zimbabwe (ZIM) [ZIM]                          12        3     4     1     8      1         0     0     0     0      13       3     ...
146  Mixed team (ZZX) [ZZX]                        3         8     5     4     17     0         0     0     0     0      3        8     ...
147  Totals                                        27        4809  4775  5130  14714  22        959   958   948   2865   49       5768  ...


In [24]:
# if we were to go to the source of this dataset, we’d see that NaN above should really be
# like “Country”, ? Summer is supposed to represent “Summer Games”, 01 ! should be “Gold”,
# skipping one row and setting the header as the first (0-indexed) row
olympics_df = pd.read_csv(r'https://raw.githubusercontent.com/realpython/python-data-cleani
header = 1)
olympics_df
Out[24]:
     Unnamed: 0                                    ? Summer  01 !  02 !  03 !  Total  ? Winter  01 !.1  02 !.1  03 !.1  Total.1  ? Games  ...
0    Afghanistan (AFG)                             13        0     0     2     2      0         0       0       0       0        13       ...
1    Algeria (ALG)                                 12        5     2     8     15     3         0       0       0       0        15       ...
2    Argentina (ARG)                               23        18    24    28    70     18        0       0       0       0        41       ...
3    Armenia (ARM)                                 5         1     2     9     12     6         0       0       0       0        11       ...
4    Australasia (ANZ) [ANZ]                       2         3     4     5     12     0         0       0       0       0        2        ...
...  ...                                           ...       ...   ...   ...   ...    ...       ...     ...     ...     ...      ...      ...
142  Independent Olympic Participants (IOP) [IOP]  1         0     1     2     3      0         0       0       0       0        1        ...
143  Zambia (ZAM) [ZAM]                            12        0     1     1     2      0         0       0       0       0        12       ...
144  Zimbabwe (ZIM) [ZIM]                          12        3     4     1     8      1         0       0       0       0        13       ...
145  Mixed team (ZZX) [ZZX]                        3         8     5     4     17     0         0       0       0       0        3        ...
146  Totals                                        27        4809  4775  5130  14714  22        959     958     948     2865     49       ...


In [25]:
# dictionary that maps current column names (as keys) to more usable ones (the dictionary’s
new_names = {'Unnamed: 0': 'Country',
'? Summer': 'Summer Olympics',
'01 !': 'Gold',
'02 !': 'Silver',
'03 !': 'Bronze',
'? Winter': 'Winter Olympics',
'01 !.1': 'Gold.1',
'02 !.1': 'Silver.1',
'03 !.1': 'Bronze.1',
'? Games': '# Games',
'01 !.2': 'Gold.2',
'02 !.2': 'Silver.2',
'03 !.2': 'Bronze.2'}
In [26]:
# renaming the columns

olympics_df.rename(columns = new_names, inplace = True)
olympics_df.head()
Out[26]:
   Country                  Summer Olympics  Gold  Silver  Bronze  Total  Winter Olympics  Gold.1  Silver.1  Bronze.1  ...
0  Afghanistan (AFG)        13               0     0       2       2      0                0       0         0         ...
1  Algeria (ALG)            12               5     2       8       15     3                0       0         0         ...
2  Argentina (ARG)          23               18    24      28      70     18               0       0         0         ...
3  Armenia (ARM)            5                1     2       9       12     6                0       0         0         ...
4  Australasia (ANZ) [ANZ]  2                3     4       5       12     0                0       0         0         ...
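The header-skipping and renaming steps can be reproduced without the remote file. A minimal sketch, assuming a made-up three-column CSV held in memory that mimics the shape of the medals file (a junk numeric header line, then the real header with an empty first cell):

```python
import io
import pandas as pd

# Made-up CSV in the same shape as the medals file; not the real data source.
raw_csv = io.StringIO(
    "0,1,2\n"
    ",? Summer,01 !\n"
    "Afghanistan (AFG),13,0\n"
    "Algeria (ALG),12,5\n"
)

# header=1 takes the second line (0-indexed) as column names and discards the line above it.
medals = pd.read_csv(raw_csv, header=1)
print(list(medals.columns))

# The empty first header cell is named 'Unnamed: 0' by pandas, hence that key.
medals = medals.rename(columns={'Unnamed: 0': 'Country',
                                '? Summer': 'Summer Olympics',
                                '01 !': 'Gold'})
print(medals)
```

Keys in the mapping that match no column are silently ignored by `rename`, so a typo in a key leaves the old name in place without raising an error.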
Seaborn Practice
In [27]:
# python data visualization library based on matplotlib

# it provides a high-level interface for drawing attractive and informative statistical gra
import seaborn as sns

In [28]:
# using the dataset of Google Playstore

pstore = pd.read_csv('googleplaystore.csv')
pstore.head(5)
Out[28]:
   App                                                 Category        Rating  Reviews  Size  Installs     Type  Price  Content Rating  ...
0  Photo Editor & Candy Camera & Grid & ScrapBook      ART_AND_DESIGN  4.1     159      19M   10,000+      Free  0      Everyone        ...
1  Coloring book moana                                 ART_AND_DESIGN  3.9     967      14M   500,000+     Free  0      Everyone        ...
2  U Launcher Lite – FREE Live Cool Themes, Hide ...   ART_AND_DESIGN  4.7     87510    8.7M  5,000,000+   Free  0      Everyone        ...
3  Sketch - Draw & Paint                               ART_AND_DESIGN  4.5     215644   25M   50,000,000+  Free  0      Teen            ...
4  Pixel Draw - Number Art Coloring Book               ART_AND_DESIGN  4.3     967      2.8M  100,000+     Free  0      Everyone        ...
1. Displot

In [29]:
sns.displot(pstore.Rating, bins = 20, kde = True) # A kernel density estimate (KDE) plot is
# the distribution of observations in a d
plt.title('Distribution of app ratings')
plt.show()
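To see what the `kde` curve represents, a Gaussian kernel density estimate can be computed by hand with NumPy. This is a sketch of the idea only: the `gaussian_kde_1d` helper and the bandwidth of 0.3 are arbitrary choices made here, whereas Seaborn selects a bandwidth automatically. The five ratings are those shown in `pstore.head(5)`.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

ratings = np.array([4.1, 3.9, 4.7, 4.5, 4.3])
grid = np.linspace(0.0, 9.0, 901)
density = gaussian_kde_1d(ratings, grid, bandwidth=0.3)

# A histogram counts samples per equal-width bin; the KDE is a smooth analogue.
counts, edges = np.histogram(ratings, bins=4)

# A density should integrate to about 1 over a wide enough grid.
area = float(np.sum(density) * (grid[1] - grid[0]))
print(counts, round(area, 3))
```

A larger bandwidth gives a smoother, flatter curve; a smaller one follows individual samples more closely.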

In [30]:
# sns.set_style('darkgrid')
# sns.set_context('paper')
sns.displot(pstore.Rating, bins = 25, kde = False)

plt.title('Distribution of app ratings')
plt.show()
2. Pie Chart & Bar Chart
In [31]:
# analyzing the Content Rating column

pstore['Content Rating'].value_counts()
Out[31]:
Everyone 8714
Teen 1208
Mature 17+ 499
Everyone 10+ 414
Adults only 18+ 3
Unrated 2
Name: Content Rating, dtype: int64

In [32]:
# remove the rows with values which are less represented

pstore = pstore[~pstore['Content Rating'].isin(['Adults only 18+','Unrated'])]
# resetting the index

pstore.reset_index(inplace = True, drop = True)
pstore['Content Rating'].value_counts()
Out[32]:
Everyone 8714
Teen 1208
Mature 17+ 499
Everyone 10+ 414
Name: Content Rating, dtype: int64
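The filtering step combines a boolean mask with `~` (logical NOT). A minimal sketch on a hypothetical four-row `apps` frame with made-up app names:

```python
import pandas as pd

# Hypothetical frame; only the 'Content Rating' labels come from the dataset.
apps = pd.DataFrame({
    'App': ['a', 'b', 'c', 'd'],
    'Content Rating': ['Everyone', 'Unrated', 'Teen', 'Adults only 18+'],
})
rare = ['Adults only 18+', 'Unrated']

# isin() builds a boolean mask; ~ inverts it so the rare categories are excluded.
kept = apps[~apps['Content Rating'].isin(rare)]

# After filtering the index has gaps (0, 2); drop=True renumbers it
# without keeping the old index as an extra column.
kept = kept.reset_index(drop=True)
print(kept)
```

Without `drop = True`, `reset_index` would add the old index as a new column named `index`.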
In [33]:
# pie chart
pstore['Content Rating'].value_counts().plot.pie()
plt.legend()
plt.show()
In [34]:
# bar chart
# sns.set_style(style = None)
# sns.set_context(context = None)
pstore['Content Rating'].value_counts().plot.barh()
plt.show()

3. Scatter Plot
In [35]:
# draw a plot of two variables with bivariate and univariate graphs

sns.jointplot(data = pstore, x = pstore.Size, y = pstore.Rating)
plt.ylim(1,5)
plt.xlim(0,100000)
plt.show()
4. Pair Plot

In [36]:
# plot pairwise relationships in a dataset

planets = sns.load_dataset('planets')
sns.pairplot(planets, diag_kind = 'auto')
plt.show()

In [37]:
sns.pairplot(planets, vars = ['mass', 'distance'])

plt.show()
5. Heat maps

In [38]:
# load the example flights dataset (long-form) and pivot it to wide-form
# (pandas 2.0+ requires keyword arguments for pivot)

flights_long = sns.load_dataset("flights")
flights = flights_long.pivot(index = "month", columns = "year", values = "passengers")
# draw a heatmap with the numeric values in each cell

f, ax = plt.subplots(figsize = (9, 7))
sns.heatmap(flights, annot = True, fmt = "d", linewidths = .5)
plt.show()
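The heatmap needs a wide table (one row per month, one column per year), which is what `pivot` produces from long-form data. A minimal sketch on a four-row `long_form` frame; the numbers are illustrative and the frame stands in for the flights dataset:

```python
import pandas as pd

# Illustrative long-form data: one row per (year, month) observation.
long_form = pd.DataFrame({
    'year': [1949, 1949, 1950, 1950],
    'month': ['Jan', 'Feb', 'Jan', 'Feb'],
    'passengers': [112, 118, 115, 126],
})

# index -> row labels, columns -> column labels, values -> cell contents
wide = long_form.pivot(index='month', columns='year', values='passengers')
print(wide)
```

Here the months are plain strings, so the rows come out in alphabetical order; in the Seaborn dataset `month` is an ordered categorical, which keeps calendar order.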
