realpython / python-data-cleaning Public
forked from MalayAgr/real_python_data_cleaning_tutorial
python-data-cleaning / .ipynb_checkpoints / Data Cleaning Tutorial - Real Python-checkpoint.ipynb
MalayAgr Synch with article a8d1017 · 6 years ago History
BONUS: Contains two more examples of cleaning specific columns
In [36]: import pandas as pd
import numpy as np
from functools import reduce
Dropping unnecessary columns
In [37]: df = pd.read_csv('Datasets/BL-Flickr-Images-Book.csv')
df.head()
Out[37]:
   Identifier  Edition Statement             Place of Publication      Date of Publication  Publisher              Title                                              Author     Contributors                                  Co...
0  206         NaN                           London                    1879 [1878]          S. Tinsley & Co.       Walter Forbes. [A novel.] By A. A                  A. A.      FORBES, Walter.
1  216         NaN                           London; Virtue & Yorston  1868                 Virtue & Co.           All for Greed. [A novel. The dedication signed...  A., A. A.  BLAZE DE BURY, Marie Pauline Rose - Baroness
2  218         NaN                           London                    1869                 Bradbury, Evans & Co.  Love the Avenger. By the author of “All for Gr...  A., A. A.  BLAZE DE BURY, Marie Pauline Rose - Baroness
3  472         NaN                           London                    1851                 James Darling          Welsh Sketches, chiefly ecclesiastical, to the...  A., E. S.  Appleyard, Ernest Silvanus.
4  480         A new edition, revised, etc.  London                    1857                 Wertheim & Macintosh   [The World in which I live, and my place in it...  A., E. S.  BROOME, John Henry.
In [38]: to_drop = ['Edition Statement',
    'Corporate Author',
    'Corporate Contributors',
    'Former owner',
    'Engraver',
    'Contributors',
    'Issuance type',
    'Shelfmarks']
df.drop(to_drop, inplace = True, axis = 1) #or: df.drop(columns = to_drop, inplace = True)
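The cell above drops a fixed list of labels. As a minimal sketch on a made-up two-row frame (the frame `toy` and its values are illustrative, not taken from the dataset), this shows the `columns=` form mentioned in the comment, plus `errors='ignore'` so that a label missing from the frame does not raise a `KeyError`:

```python
import pandas as pd

# Toy frame standing in for the book dataset; the values are illustrative only.
toy = pd.DataFrame({
    'Identifier': [206, 216],
    'Edition Statement': [None, None],
    'Title': ['Walter Forbes', 'All for Greed'],
})

# 'Shelfmarks' is not a column of the toy frame; errors='ignore' skips it.
trimmed = toy.drop(columns=['Edition Statement', 'Shelfmarks'], errors='ignore')
print(list(trimmed.columns))
```

Because `inplace=True` is not used here, `toy` keeps all of its columns and `trimmed` is a new frame.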
Changing the index
In [39]: df = df.set_index('Identifier') #or: df.set_index('Identifier', inplace = True)
df.head()
Out[39]:
            Place of Publication      Date of Publication  Publisher              Title                                              Author
Identifier
206         London                    1879 [1878]          S. Tinsley & Co.       Walter Forbes. [A novel.] By A. A                  A. A.      http://www.flickr.com/photos/britis
216         London; Virtue & Yorston  1868                 Virtue & Co.           All for Greed. [A novel. The dedication signed...  A., A. A.  http://www.flickr.com/photos/britis
218         London                    1869                 Bradbury, Evans & Co.  Love the Avenger. By the author of “All for Gr...  A., A. A.  http://www.flickr.com/photos/britis
472         London                    1851                 James Darling          Welsh Sketches, chiefly ecclesiastical, to the...  A., E. S.  http://www.flickr.com/photos/britis
480         London                    1857                 Wertheim & Macintosh   [The World in which I live, and my place in it...  A., E. S.  http://www.flickr.com/photos/britis
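Once `Identifier` is the index, rows can be looked up by label with `.loc`. A minimal sketch on a toy frame (the frame name `toy` is an assumption; the three identifiers and dates are the ones visible in the output above):

```python
import pandas as pd

toy = pd.DataFrame({
    'Identifier': [206, 216, 218],
    'Date of Publication': ['1879 [1878]', '1868', '1869'],
})

# set_index returns a new frame unless inplace=True is passed.
indexed = toy.set_index('Identifier')

# An index used as a record key should be unique; this is worth checking first.
print(indexed.index.is_unique)

# Label-based lookup: the row whose Identifier is 216.
print(indexed.loc[216, 'Date of Publication'])
```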
Cleaning specific columns
In [40]: unwanted_characters = ['[', ',', '-']
def clean_dates(item):
    dop = str(item.loc['Date of Publication'])
    # Missing dates, and dates that start with '[', are treated as unknown
    if dop == 'nan' or dop[0] == '[':
        return np.nan
    # Cut the string at the first occurrence of each unwanted character
    for character in unwanted_characters:
        if character in dop:
            character_index = dop.find(character)
            dop = dop[:character_index]
    return dop
df['Date of Publication'] = df.apply(clean_dates, axis = 1)
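A row-wise `apply` works, but the same kind of cleanup can often be expressed with vectorized string methods. The sketch below is an alternative technique, not the notebook's method: it keeps only a leading four-digit year using `Series.str.extract`. The sample values other than `'1879 [1878]'` are made up for illustration, and the result differs from `clean_dates` for any value that does not begin with four digits.

```python
import pandas as pd

# Illustrative inputs; only '1879 [1878]' and '1868' appear in the output above.
dates = pd.Series(['1879 [1878]', '1868', '[1897?]', '1839, 38-54', '1860-63', None])

# '^' anchors the match at the start, so '[1897?]' yields NaN.
years = dates.str.extract(r'^(\d{4})', expand=False)

# Convert to numbers; missing values stay NaN.
years = pd.to_numeric(years)
print(years)
```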
In [41]: df.head()
Out[41]:
            Place of Publication      Date of Publication  Publisher              Title                                              Author
Identifier
206         London                    1879                 S. Tinsley & Co.       Walter Forbes. [A novel.] By A. A                  A. A.      http://www.flickr.com/photos/britis
216         London; Virtue & Yorston  1868                 Virtue & Co.           All for Greed. [A novel. The dedication signed...  A., A. A.  http://www.flickr.com/photos/britis
218         London                    1869                 Bradbury, Evans & Co.  Love the Avenger. By the author of “All for Gr...  A., A. A.  http://www.flickr.com/photos/britis
472         London                    1851                 James Darling          Welsh Sketches, chiefly ecclesiastical, to the...  A., E. S.  http://www.flickr.com/photos/britis
480         London                    1857                 Wertheim & Macintosh   [The World in which I live, and my place in it...  A., E. S.  http://www.flickr.com/photos/britis
In [42]: def clean_author_names(item):
    author = str(item.loc['Author'])
    if author == 'nan':
        return np.nan
    author = author.split(',')
    # No comma: keep only the alphabetic characters of the single part
    if len(author) == 1:
        name = filter(lambda x: x.isalpha(), author[0])
        return reduce(lambda x, y: x + y, name)
    last_name, first_name = author[0], author[1]
    # Drop anything after a hyphen in the first name
    first_name = first_name[:first_name.find('-')] if '-' in first_name else first_name
    if first_name.endswith(('.', '.|')):
        parts = first_name.split('.')
        if len(parts) > 1:
            # Keep the text up to (not including) the second period
            first_occurence = first_name.find('.')
            final_occurence = first_name.find('.', first_occurence + 1)
            first_name = first_name[:final_occurence]
        else:
            first_name = first_name[:first_name.find('.')]
    last_name = last_name.capitalize()
    return f'{first_name} {last_name}'
df['Author'] = df.apply(clean_author_names, axis = 1)
In [43]: df.head()
Out[43]:
            Place of Publication      Date of Publication  Publisher              Title                                              Author
Identifier
206         London                    1879                 S. Tinsley & Co.       Walter Forbes. [A novel.] By A. A                  AA       http://www.flickr.com/photos/britis
216         London; Virtue & Yorston  1868                 Virtue & Co.           All for Greed. [A novel. The dedication signed...  A. A A.  http://www.flickr.com/photos/britis
218         London                    1869                 Bradbury, Evans & Co.  Love the Avenger. By the author of “All for Gr...  A. A A.  http://www.flickr.com/photos/britis
472         London                    1851                 James Darling          Welsh Sketches, chiefly ecclesiastical, to the...  E. S A.  http://www.flickr.com/photos/britis
480         London                    1857                 Wertheim & Macintosh   [The World in which I live, and my place in it...  E. S A.  http://www.flickr.com/photos/britis
In [44]: pub = df['Place of Publication']
df['Place of Publication'] = np.where(pub.str.contains('London'), 'London',
    np.where(pub.str.contains('Oxford'), 'Oxford',
        np.where(pub.eq('Newcastle upon Tyne'),
            'Newcastle-upon-Tyne', df['Place of Publication'])))
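Nested `np.where` calls become hard to read as the number of cases grows. As a sketch of an alternative (a swapped-in technique, not what the notebook uses), `np.select` takes a flat list of conditions and a matching list of replacements, and the first matching condition wins, just as in the nested form. The sample series is illustrative; `'Paris'` is a made-up value to show the fall-through case.

```python
import numpy as np
import pandas as pd

pub = pd.Series(['London', 'London; Virtue & Yorston', 'Oxford',
                 'Newcastle upon Tyne', 'Paris'])

conditions = [
    pub.str.contains('London'),
    pub.str.contains('Oxford'),
    pub.eq('Newcastle upon Tyne'),
]
choices = ['London', 'Oxford', 'Newcastle-upon-Tyne']

# Rows that match no condition keep their original value.
cleaned = np.select(conditions, choices, default=pub)
print(list(cleaned))
```

If the column contains missing values, `str.contains` returns missing values for them too, so passing `na=False` to `str.contains` is worth considering before using the result as a condition.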
In [47]: df.head()
Out[47]:
            Place of Publication  Date of Publication  Publisher              Title                                              Author
Identifier
206         London                1879                 S. Tinsley & Co.       Walter Forbes. [A novel.] By A. A                  AA       http://www.flickr.com/photos/britis
216         London                1868                 Virtue & Co.           All for Greed. [A novel. The dedication signed...  A. A A.  http://www.flickr.com/photos/britis
218         London                1869                 Bradbury, Evans & Co.  Love the Avenger. By the author of “All for Gr...  A. A A.  http://www.flickr.com/photos/britis
472         London                1851                 James Darling          Welsh Sketches, chiefly ecclesiastical, to the...  E. S A.  http://www.flickr.com/photos/britis
480         London                1857                 Wertheim & Macintosh   [The World in which I live, and my place in it...  E. S A.  http://www.flickr.com/photos/britis
In [29]: def clean_title(item):
    title = str(item['Title'])
    if title == 'nan':
        return np.nan
    # A title wrapped in brackets: keep the text between '[' and the first ']'
    if title[0] == '[':
        title = title[1: title.find(']')]
    # Drop the attribution that follows 'by' or 'By'
    if 'by' in title:
        title = title[:title.find('by')]
    elif 'By' in title:
        title = title[:title.find('By')]
    # Drop any trailing bracketed note
    if '[' in title:
        title = title[:title.find('[')]
    # Remove the last two characters (trailing punctuation and space)
    title = title[:-2]
    title = list(map(str.capitalize, title.split()))
    return ' '.join(title)
df['Title'] = df.apply(clean_title, axis = 1)
In [30]: df.head()
Out[30]:
            Place of Publication  Date of Publication  Publisher              Title                                              Author
Identifier
206         London                1879                 S. Tinsley & Co.       Walter Forbes                                      AA       http://www.flickr.com/photos/britis
216         London                1868                 Virtue & Co.           All For Greed                                      A. A A.  http://www.flickr.com/photos/britis
218         London                1869                 Bradbury, Evans & Co.  Love The Avenger                                   A. A A.  http://www.flickr.com/photos/britis
472         London                1851                 James Darling          Welsh Sketches, Chiefly Ecclesiastical, To The...  E. S A.  http://www.flickr.com/photos/britis
480         London                1857                 Wertheim & Macintosh   The World In Which I Live, And My Place In It      E. S A.  http://www.flickr.com/photos/britis
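Since the title cleanup only reads one column, it can also be written as a function of a single string and applied with `Series.map`, which makes it easy to try on a few values. The helper below, `tidy_title`, is hypothetical and uses slightly different rules from `clean_title` (it cuts at `' By '`, `' by '`, or a non-leading `'['`, then strips surrounding punctuation), so treat it as a sketch rather than a drop-in replacement.

```python
import pandas as pd

def tidy_title(title):
    """Hypothetical helper: trim attribution and notes, then capitalize each word."""
    if pd.isna(title):
        return title
    text = str(title)
    # Cut at the first marker found after the start of the string.
    for marker in (' By ', ' by ', '['):
        position = text.find(marker)
        if position > 0:
            text = text[:position]
    # Remove leftover brackets, periods, commas, and spaces at either end.
    text = text.strip(' .,[]')
    return ' '.join(word.capitalize() for word in text.split())

titles = pd.Series([
    'Walter Forbes. [A novel.] By A. A',
    'Love the Avenger. By the author of',
    '[The World in which I live, and my place in it',
    None,
])
cleaned = titles.map(tidy_title)
print(cleaned.tolist())
```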
Cleaning entire dataset
In [31]: university_towns = []
with open('Datasets/university_towns.txt', 'r') as file:
    items = file.readlines()
# Lines that contain '[edit]' are state names; the other lines are towns
states = list(filter(lambda x: '[edit]' in x, items))
for index, state in enumerate(states):
    start = items.index(state) + 1
    if index == 49: #since 50 states
        end = len(items)
    else:
        end = items.index(states[index + 1])
    # Pair every town between this state and the next one with this state
    pairs = map(lambda x: [state, x], items[start:end])
    university_towns.extend(pairs)
towns_df = pd.DataFrame(university_towns, columns = ['State', 'RegionName'])
towns_df.head()
Out[31]: State RegionName
0 Alabama[edit]\n Auburn (Auburn University)[1]\n
1 Alabama[edit]\n Florence (University of North Alabama)\n
2 Alabama[edit]\n Jacksonville (Jacksonville State University)[2]\n
3 Alabama[edit]\n Livingston (University of West Alabama)[2]\n
4 Alabama[edit]\n Montevallo (University of Montevallo)[2]\n
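The cell above finds each state's slice of the file with `items.index`. The same pairing can be sketched as a single pass that remembers the most recent state line. The list `lines` below is a small illustrative stand-in for the file contents (the Alaska lines are sample input added for the example, not output shown above), and the names `pairs` and `current_state` are assumptions:

```python
import pandas as pd

# Illustrative stand-in for file.readlines() on the towns file.
lines = [
    'Alabama[edit]\n',
    'Auburn (Auburn University)[1]\n',
    'Florence (University of North Alabama)\n',
    'Alaska[edit]\n',
    'Fairbanks (University of Alaska Fairbanks)[2]\n',
]

pairs = []
current_state = None
for line in lines:
    if '[edit]' in line:
        # A state header: remember it for the town lines that follow.
        current_state = line
    else:
        pairs.append([current_state, line])

towns = pd.DataFrame(pairs, columns=['State', 'RegionName'])
print(towns)
```

This version does not depend on the number of states, so no hard-coded index is needed for the last one.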
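In [32]: def clean_up(item):
    if '(' in item:
        return item[:item.find('(') - 1] #since space before '('
    if '[' in item:
        return item[:item.find('[')]
    # Neither marker present: keep the value, minus surrounding whitespace
    return item.strip()
# Element-wise, column by column (DataFrame.applymap is deprecated since pandas 2.1)
towns_df = towns_df.apply(lambda column: column.map(clean_up))
towns_df.head()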
Out[32]: State RegionName
0 Alabama Auburn
1 Alabama Florence
2 Alabama Jacksonville
3 Alabama Livingston
4 Alabama Montevallo
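The element-wise cleanup can also be sketched with vectorized string methods: capture everything before the first `(` or `[`, then strip whitespace. This is an alternative technique, not the notebook's method. The frame `toy` reuses two rows from the output above, and `'Sample Town\n'` is a made-up value that shows what happens when neither marker is present.

```python
import pandas as pd

toy = pd.DataFrame({
    'State': ['Alabama[edit]\n', 'Alabama[edit]\n', 'Alabama[edit]\n'],
    'RegionName': [
        'Auburn (Auburn University)[1]\n',
        'Florence (University of North Alabama)\n',
        'Sample Town\n',  # made-up value with no '(' or '['
    ],
})

# Capture the run of characters before the first '(' or '['.
pattern = r'^([^(\[]+)'
cleaned = toy.apply(lambda column: column.str.extract(pattern, expand=False).str.strip())
print(cleaned)
```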
Renaming columns and skipping rows
In [33]: olympics_df = pd.read_csv('Datasets/olympics.csv', skiprows=1, header=0)
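As a minimal sketch of what `skiprows=1, header=0` does and how a rename might follow, the example below reads an in-memory CSV instead of the real file. The CSV text, the original column labels, and the replacement names in `new_names` are all invented for illustration and are not taken from `olympics.csv`.

```python
import io
import pandas as pd

# Invented sample: a junk first line, then a header row with unfriendly labels.
raw = io.StringIO(
    "0,1,2,3\n"
    ",? Summer,01 !,02 !\n"
    "Afghanistan (AFG),13,0,0\n"
    "Algeria (ALG),12,5,2\n"
)

# skiprows=1 discards the first line; header=0 then uses the next line as column names.
medals = pd.read_csv(raw, skiprows=1, header=0)

# Hypothetical mapping from the raw labels to readable ones.
new_names = {
    'Unnamed: 0': 'Country',
    '? Summer': 'Summer Olympics',
    '01 !': 'Gold',
    '02 !': 'Silver',
}
renamed = medals.rename(columns=new_names)
print(list(renamed.columns))
```

The empty first header cell is why pandas assigns the label `Unnamed: 0` before the rename.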