
.Ipynb - Checkpoints: Python-Data-Cleaning

The document is a tutorial on data cleaning using Python, specifically focusing on cleaning a dataset of book information. It includes code examples for dropping unnecessary columns, cleaning specific fields like dates and author names, and renaming columns. The tutorial demonstrates various techniques using the pandas library to prepare data for analysis.


realpython / python-data-cleaning Public

forked from MalayAgr/real_python_data_cleaning_tutorial


python-data-cleaning / .ipynb_checkpoints / Data Cleaning Tutorial - Real Python-checkpoint.ipynb

MalayAgr Synch with article a8d1017 · 6 years ago History



BONUS: Contains two more examples of cleaning specific columns

In [36]: import pandas as pd
         import numpy as np
         from functools import reduce

Dropping unnecessary columns


In [37]: df = pd.read_csv(r'Datasets\BL-Flickr-Images-Book.csv') #raw string so the backslash is not read as an escape
         df.head()

Out[37]: (display cut off on the right, after the column header beginning "Co")

  | Identifier | Edition Statement | Place of Publication | Date of Publication | Publisher | Title | Author | Contributors
0 | 206 | NaN | London | 1879 [1878] | S. Tinsley & Co. | Walter Forbes. [A novel.] By A. A | A. A. | FORBES, Walter.
1 | 216 | NaN | London; Virtue & Yorston | 1868 | Virtue & Co. | All for Greed. [A novel. The dedication signed... | A., A. A. | BLAZE DE BURY, Marie Pauline Rose - Baroness
2 | 218 | NaN | London | 1869 | Bradbury, Evans & Co. | Love the Avenger. By the author of “All for Gr... | A., A. A. | BLAZE DE BURY, Marie Pauline Rose - Baroness
3 | 472 | NaN | London | 1851 | James Darling | Welsh Sketches, chiefly ecclesiastical, to the... | A., E. S. | Appleyard, Ernest Silvanus.
4 | 480 | A new edition, revised, etc. | London | 1857 | Wertheim & Macintosh | [The World in which I live, and my place in it... | A., E. S. | BROOME, John Henry.

In [38]: to_drop = ['Edition Statement',
                    'Corporate Author',
                    'Corporate Contributors',
                    'Former owner',
                    'Engraver',
                    'Contributors',
                    'Issuance type',
                    'Shelfmarks']

         df.drop(to_drop, inplace = True, axis = 1) #or: df.drop(columns = to_drop, inplace = True)

Changing the index


In [39]: df = df.set_index('Identifier') #or: df.set_index('Identifier', inplace = True)
df.head()

Out[39]: (last column cut off on the right)

Identifier | Place of Publication | Date of Publication | Publisher | Title | Author | ...
206 | London | 1879 [1878] | S. Tinsley & Co. | Walter Forbes. [A novel.] By A. A | A. A. | http://www.flickr.com/photos/britis
216 | London; Virtue & Yorston | 1868 | Virtue & Co. | All for Greed. [A novel. The dedication signed... | A., A. A. | http://www.flickr.com/photos/britis
218 | London | 1869 | Bradbury, Evans & Co. | Love the Avenger. By the author of “All for Gr... | A., A. A. | http://www.flickr.com/photos/britis
472 | London | 1851 | James Darling | Welsh Sketches, chiefly ecclesiastical, to the... | A., E. S. | http://www.flickr.com/photos/britis
480 | London | 1857 | Wertheim & Macintosh | [The World in which I live, and my place in it... | A., E. S. | http://www.flickr.com/photos/britis
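A minimal sketch of why setting the index is useful, assuming a tiny hand-built frame (the name `sample` is hypothetical; the three values are copied from the output above). It checks that the identifier is unique before using it as the index, then looks a row up by label with `.loc`.

```python
import pandas as pd

# Hypothetical three-row sample; values copied from the output above
sample = pd.DataFrame({
    'Identifier': [206, 216, 218],
    'Place of Publication': ['London', 'London; Virtue & Yorston', 'London'],
})

# An index is most useful when its labels are unique
identifier_is_unique = sample['Identifier'].is_unique

# set_index returns a new frame; the original is left unchanged
indexed = sample.set_index('Identifier')

# Label-based lookup: row label 206, column 'Place of Publication'
place = indexed.loc[206, 'Place of Publication']
```

With the identifier as the index, `.loc[206]` selects by record label rather than by row position, which is what `.iloc[0]` would do.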

Cleaning specific columns


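In [40]: unwanted_characters = ['[', ',', '-']

         def clean_dates(item):
             dop = str(item.loc['Date of Publication'])

             #missing dates and dates that start with a bracket are treated as unknown
             if dop == 'nan' or dop[0] == '[':
                 return np.nan #np.NaN was removed in NumPy 2.0

             #cut the string at the first occurrence of each unwanted character
             for character in unwanted_characters:
                 if character in dop:
                     character_index = dop.find(character)
                     dop = dop[:character_index]

             return dop.strip() #'1879 [1878]' would otherwise keep a trailing space

         df['Date of Publication'] = df.apply(clean_dates, axis = 1)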
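As a point of comparison, the same clean-up can be sketched without a row-wise `apply`, using a regular expression on the whole column. This is a different technique from `clean_dates`, not a drop-in replacement: it keeps only four leading digits, so a value such as `1860?` would become `1860` here. The sample values below are hypothetical, written in the style of the column.

```python
import pandas as pd

# Hypothetical sample values in the style of 'Date of Publication'
dates = pd.Series(['1879 [1878]', '1868', '[1897?]', '1839, 38-54', None])

# Capture four digits at the start of the string;
# rows that do not start with four digits become NaN
years = dates.str.extract(r'^(\d{4})', expand=False)

# Convert to a numeric dtype so the years can be sorted and compared
years_numeric = pd.to_numeric(years)
```

`str.extract` works on the whole column at once, which is usually faster than calling a Python function for every row.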

In [41]: df.head()

Out[41]: (last column cut off on the right)

Identifier | Place of Publication | Date of Publication | Publisher | Title | Author | ...
206 | London | 1879 | S. Tinsley & Co. | Walter Forbes. [A novel.] By A. A | A. A. | http://www.flickr.com/photos/britis
216 | London; Virtue & Yorston | 1868 | Virtue & Co. | All for Greed. [A novel. The dedication signed... | A., A. A. | http://www.flickr.com/photos/britis
218 | London | 1869 | Bradbury, Evans & Co. | Love the Avenger. By the author of “All for Gr... | A., A. A. | http://www.flickr.com/photos/britis
472 | London | 1851 | James Darling | Welsh Sketches, chiefly ecclesiastical, to the... | A., E. S. | http://www.flickr.com/photos/britis
480 | London | 1857 | Wertheim & Macintosh | [The World in which I live, and my place in it... | A., E. S. | http://www.flickr.com/photos/britis

In [42]: def clean_author_names(item):
             author = str(item.loc['Author'])

             if author == 'nan':
                 return np.nan #np.NaN was removed in NumPy 2.0

             author = author.split(',')

             #no comma: keep only the alphabetic characters
             if len(author) == 1:
                 name = filter(lambda x: x.isalpha(), author[0])
                 return reduce(lambda x, y: x + y, name, '') #'' avoids a TypeError when nothing is left

             last_name, first_name = author[0], author[1]

             first_name = first_name[:first_name.find('-')] if '-' in first_name else first_name

             if first_name.endswith(('.', '.|')):
                 parts = first_name.split('.')

                 if len(parts) > 1:
                     first_occurrence = first_name.find('.')
                     final_occurrence = first_name.find('.', first_occurrence + 1)
                     first_name = first_name[:final_occurrence]
                 else:
                     first_name = first_name[:first_name.find('.')]

             last_name = last_name.capitalize()

             return f'{first_name} {last_name}'

         df['Author'] = df.apply(clean_author_names, axis = 1)

In [43]: df.head()

Out[43]: (last column cut off on the right)

Identifier | Place of Publication | Date of Publication | Publisher | Title | Author | ...
206 | London | 1879 | S. Tinsley & Co. | Walter Forbes. [A novel.] By A. A | AA | http://www.flickr.com/photos/britis
216 | London; Virtue & Yorston | 1868 | Virtue & Co. | All for Greed. [A novel. The dedication signed... | A. A A. | http://www.flickr.com/photos/britis
218 | London | 1869 | Bradbury, Evans & Co. | Love the Avenger. By the author of “All for Gr... | A. A A. | http://www.flickr.com/photos/britis
472 | London | 1851 | James Darling | Welsh Sketches, chiefly ecclesiastical, to the... | E. S A. | http://www.flickr.com/photos/britis
480 | London | 1857 | Wertheim & Macintosh | [The World in which I live, and my place in it... | E. S A. | http://www.flickr.com/photos/britis

In [44]: pub = df['Place of Publication']

         df['Place of Publication'] = np.where(pub.str.contains('London'), 'London',
             np.where(pub.str.contains('Oxford'), 'Oxford',
                 np.where(pub.eq('Newcastle upon Tyne'),
                     'Newcastle-upon-Tyne', df['Place of Publication'])))
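The nested `np.where` call reads from the outside in: the first matching condition wins, and the innermost fallback keeps the original value. The sketch below runs the same pattern on a small hypothetical Series (the values are invented for illustration) and passes `na=False` to `str.contains`, an addition not present in the cell above, so that a missing value counts as "no match" instead of propagating into the mask.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, not taken from the real dataset
pub = pd.Series(['London; Virtue & Yorston', 'Oxford [printed]',
                 'Newcastle upon Tyne', 'Paris', None])

# na=False keeps the masks strictly boolean when a value is missing
is_london = pub.str.contains('London', na=False)
is_oxford = pub.str.contains('Oxford', na=False)
is_newcastle = pub.eq('Newcastle upon Tyne')

# Conditions are checked in order; unmatched rows keep their original value
cleaned = pd.Series(
    np.where(is_london, 'London',
             np.where(is_oxford, 'Oxford',
                      np.where(is_newcastle, 'Newcastle-upon-Tyne', pub))),
    index=pub.index,
)
```

Wrapping the result in `pd.Series` with the original index is only needed here because the sketch does not assign back into a DataFrame column.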

In [47]: df.head()

Out[47]: (last column cut off on the right)

Identifier | Place of Publication | Date of Publication | Publisher | Title | Author | ...
206 | London | 1879 | S. Tinsley & Co. | Walter Forbes. [A novel.] By A. A | AA | http://www.flickr.com/photos/britis
216 | London | 1868 | Virtue & Co. | All for Greed. [A novel. The dedication signed... | A. A A. | http://www.flickr.com/photos/britis
218 | London | 1869 | Bradbury, Evans & Co. | Love the Avenger. By the author of “All for Gr... | A. A A. | http://www.flickr.com/photos/britis
472 | London | 1851 | James Darling | Welsh Sketches, chiefly ecclesiastical, to the... | E. S A. | http://www.flickr.com/photos/britis
480 | London | 1857 | Wertheim & Macintosh | [The World in which I live, and my place in it... | E. S A. | http://www.flickr.com/photos/britis

In [29]: def clean_title(item):
             title = str(item['Title'])

             if title == 'nan':
                 return np.nan #np.NaN was removed in NumPy 2.0

             #a title wrapped in brackets: keep only the bracketed text
             if title[0] == '[':
                 title = title[1: title.find(']')]

             #drop the author credit
             if 'by' in title:
                 title = title[:title.find('by')]
             elif 'By' in title:
                 title = title[:title.find('By')]

             #drop any bracketed remark that follows the title
             if '[' in title:
                 title = title[:title.find('[')]

             #drop the last two characters (typically the trailing '. ')
             title = title[:-2]

             title = list(map(str.capitalize, title.split()))
             return ' '.join(title)

         df['Title'] = df.apply(clean_title, axis = 1)

In [30]: df.head()

Out[30]: (last column cut off on the right)

Identifier | Place of Publication | Date of Publication | Publisher | Title | Author | ...
206 | London | 1879 | S. Tinsley & Co. | Walter Forbes | AA | http://www.flickr.com/photos/britis
216 | London | 1868 | Virtue & Co. | All For Greed | A. A A. | http://www.flickr.com/photos/britis
218 | London | 1869 | Bradbury, Evans & Co. | Love The Avenger | A. A A. | http://www.flickr.com/photos/britis
472 | London | 1851 | James Darling | Welsh Sketches, Chiefly Ecclesiastical, To The... | E. S A. | http://www.flickr.com/photos/britis
480 | London | 1857 | Wertheim & Macintosh | The World In Which I Live, And My Place In It | E. S A. | http://www.flickr.com/photos/britis

Cleaning entire dataset


In [31]: university_towns = []

         with open('Datasets\\university_towns.txt', 'r') as file:
             items = file.readlines()

         #state header lines are the ones marked with '[edit]'
         states = list(filter(lambda x: '[edit]' in x, items))

         for index, state in enumerate(states):
             start = items.index(state) + 1
             if index == len(states) - 1: #last state: its towns run to the end of the file
                 end = len(items)
             else:
                 end = items.index(states[index + 1])

             #pair every town line with the state it belongs to
             pairs = map(lambda x: [state, x], items[start:end])
             university_towns.extend(pairs)

         towns_df = pd.DataFrame(university_towns, columns = ['State', 'RegionName'])
         towns_df.head()

Out[31]: State RegionName

0 Alabama[edit]\n Auburn (Auburn University)[1]\n

1 Alabama[edit]\n Florence (University of North Alabama)\n

2 Alabama[edit]\n Jacksonville (Jacksonville State University)[2]\n

3 Alabama[edit]\n Livingston (University of West Alabama)[2]\n

4 Alabama[edit]\n Montevallo (University of Montevallo)[2]\n

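In [32]: def clean_up(item):
             if '(' in item:
                 return item[:item.find('(') - 1] #since space before '('

             if '[' in item:
                 return item[:item.find('[')]

             return item #nothing to strip; without this line the function would return None

         towns_df = towns_df.applymap(clean_up) #pandas 2.1+ deprecates applymap in favor of DataFrame.map
         towns_df.head()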

Out[32]: State RegionName

0 Alabama Auburn

1 Alabama Florence

2 Alabama Jacksonville

3 Alabama Livingston

4 Alabama Montevallo
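The two steps above (pair each town with its state, then strip the markers) can also be sketched as a single pass that remembers the most recent state header. This is an alternative under stated assumptions, not the notebook's method: the `lines` list is a hypothetical excerpt in the same layout as `university_towns.txt`, and the names `rows` and `towns` are illustrative.

```python
import pandas as pd

# Hypothetical excerpt in the same layout as university_towns.txt
lines = [
    'Alabama[edit]\n',
    'Auburn (Auburn University)[1]\n',
    'Florence (University of North Alabama)\n',
    'Alaska[edit]\n',
    'Fairbanks (University of Alaska Fairbanks)[2]\n',
]

rows = []
state = None
for line in lines:
    if '[edit]' in line:
        # A state header: remember it for the town lines that follow
        state = line.split('[edit]')[0].strip()
    else:
        # A town line: keep the text before the first ' (' or '['
        town = line.split(' (')[0].split('[')[0].strip()
        rows.append((state, town))

towns = pd.DataFrame(rows, columns=['State', 'RegionName'])
```

Tracking the current state in a variable avoids the repeated `items.index(...)` lookups and does not depend on how many states the file contains.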

Renaming columns and skipping rows


In [33]: olympics_df = pd.read_csv(r'Datasets\olympics.csv', skiprows=1, header=0) #skip the first line, then use the next one as the header
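The notebook excerpt stops before the renaming step, so the following is only a sketch of how skipping a row and renaming columns fit together. The file contents, the original column labels, and the new names are all hypothetical; the real `olympics.csv` is not shown in this document.

```python
import io
import pandas as pd

# Hypothetical file contents: one junk line, then the real header row
raw = io.StringIO(
    "0,1,2\n"
    ",? Summer,01 !\n"
    "Afghanistan (AFG),13,0\n"
    "Algeria (ALG),12,5\n"
)

# skiprows=1 drops the junk line; header=0 takes the next line as the header
sample_df = pd.read_csv(raw, skiprows=1, header=0)

# Map old labels to readable ones; pandas names an empty header 'Unnamed: 0'
new_names = {
    'Unnamed: 0': 'Country',
    '? Summer': 'Summer Olympics',
    '01 !': 'Gold',
}
sample_df = sample_df.rename(columns=new_names)
```

`rename` only touches the labels listed in the mapping, so columns that already have good names can be left out of the dictionary.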
