.Ipynb - Checkpoints: Python-Data-Cleaning

The document is a tutorial on data cleaning using Python, specifically focusing on cleaning a dataset of book information. It includes code examples for dropping unnecessary columns, cleaning specific fields like dates and author names, and renaming columns. The tutorial demonstrates various techniques using the pandas library to prepare data for analysis.

Uploaded by

artemiss3000

realpython / python-data-cleaning Public

forked from MalayAgr/real_python_data_cleaning_tutorial


python-data-cleaning / .ipynb_checkpoints / Data Cleaning Tutorial - Real Python-checkpoint.ipynb

MalayAgr Synch with article a8d1017 · 6 years ago History


BONUS: Contains two more examples of cleaning specific columns

In [36]: import pandas as pd
         import numpy as np
         from functools import reduce

Dropping unnecessary columns

In [37]: df = pd.read_csv('Datasets/BL-Flickr-Images-Book.csv') #forward slash avoids the invalid '\B' escape
         df.head()

Out[37]:    Identifier  Edition Statement             Place of Publication      Date of Publication  Publisher              Title                                               Author     Contributors                                 Co
         0  206         NaN                           London                    1879 [1878]          S. Tinsley & Co.       Walter Forbes. [A novel.] By A. A                   A. A.      FORBES, Walter.
         1  216         NaN                           London; Virtue & Yorston  1868                 Virtue & Co.           All for Greed. [A novel. The dedication signed...   A., A. A.  BLAZE DE BURY, Marie Pauline Rose - Baroness
         2  218         NaN                           London                    1869                 Bradbury, Evans & Co.  Love the Avenger. By the author of “All for Gr...   A., A. A.  BLAZE DE BURY, Marie Pauline Rose - Baroness
         3  472         NaN                           London                    1851                 James Darling          Welsh Sketches, chiefly ecclesiastical, to the...   A., E. S.  Appleyard, Ernest Silvanus.
         4  480         A new edition, revised, etc.  London                    1857                 Wertheim & Macintosh   [The World in which I live, and my place in it...   A., E. S.  BROOME, John Henry.

In [38]: to_drop = ['Edition Statement',
                    'Corporate Author',
                    'Corporate Contributors',
                    'Former owner',
                    'Engraver',
                    'Contributors',
                    'Issuance type',
                    'Shelfmarks']

         df.drop(to_drop, inplace = True, axis = 1) #or: df.drop(columns = to_drop, inplace = True)
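The two spellings in the comment above differ only in how the axis is named. As a minimal sketch on a tiny, made-up frame (the `books` data below is invented for illustration and is not the book dataset), dropping without `inplace` returns a new frame and leaves the original untouched:

```python
import pandas as pd

# Hypothetical three-column frame standing in for the book dataset.
books = pd.DataFrame({
    'Identifier': [206, 216],
    'Edition Statement': [None, None],
    'Title': ['Walter Forbes', 'All for Greed'],
})

# Without inplace=True, drop() returns a new DataFrame.
trimmed = books.drop(columns=['Edition Statement'])

print(list(trimmed.columns))  # columns of the new frame
print(list(books.columns))    # the original still has all three columns
```

Reassigning the result (`books = books.drop(...)`) is an alternative to `inplace = True` and makes the mutation explicit.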

Changing the index

In [39]: df = df.set_index('Identifier') #or: df.set_index('Identifier', inplace = True)
df.head()

Out[39]:             Place of Publication      Date of Publication  Publisher              Title                                               Author
         Identifier
         206         London                    1879 [1878]          S. Tinsley & Co.       Walter Forbes. [A novel.] By A. A                   A. A.      http://www.flickr.com/photos/britis
         216         London; Virtue & Yorston  1868                 Virtue & Co.           All for Greed. [A novel. The dedication signed...   A., A. A.  http://www.flickr.com/photos/britis
         218         London                    1869                 Bradbury, Evans & Co.  Love the Avenger. By the author of “All for Gr...   A., A. A.  http://www.flickr.com/photos/britis
         472         London                    1851                 James Darling          Welsh Sketches, chiefly ecclesiastical, to the...   A., E. S.  http://www.flickr.com/photos/britis
         480         London                    1857                 Wertheim & Macintosh   [The World in which I live, and my place in it...   A., E. S.  http://www.flickr.com/photos/britis
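Once `Identifier` is the index, rows can be looked up by identifier with `.loc`, while `.iloc` still addresses them by position. A minimal sketch, assuming a small invented frame (`books` and its values are illustrative only):

```python
import pandas as pd

# Hypothetical slice of the dataset with only two columns.
books = pd.DataFrame({
    'Identifier': [206, 216, 218],
    'Date of Publication': ['1879 [1878]', '1868', '1869'],
})

indexed = books.set_index('Identifier')

row = indexed.loc[206]    # label-based: the record whose Identifier is 206
first = indexed.iloc[0]   # position-based: the first record in the frame

print(row['Date of Publication'])
print(indexed.index.is_unique)  # worth checking before using a column as the index
```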

Cleaning specific columns

In [40]: unwanted_characters = ['[', ',', '-']

         def clean_dates(item):
             dop = str(item.loc['Date of Publication'])

             #missing dates and dates that open with '[' are treated as unknown
             if dop == 'nan' or dop[0] == '[':
                 return np.nan #np.NaN was removed in NumPy 2.0

             #cut the string at the first occurrence of each unwanted character
             for character in unwanted_characters:
                 if character in dop:
                     character_index = dop.find(character)
                     dop = dop[:character_index]

             return dop

         df['Date of Publication'] = df.apply(clean_dates, axis = 1)

In [41]: df.head()

Out[41]:             Place of Publication      Date of Publication  Publisher              Title                                               Author
         Identifier
         206         London                    1879                 S. Tinsley & Co.       Walter Forbes. [A novel.] By A. A                   A. A.      http://www.flickr.com/photos/britis
         216         London; Virtue & Yorston  1868                 Virtue & Co.           All for Greed. [A novel. The dedication signed...   A., A. A.  http://www.flickr.com/photos/britis
         218         London                    1869                 Bradbury, Evans & Co.  Love the Avenger. By the author of “All for Gr...   A., A. A.  http://www.flickr.com/photos/britis
         472         London                    1851                 James Darling          Welsh Sketches, chiefly ecclesiastical, to the...   A., E. S.  http://www.flickr.com/photos/britis
         480         London                    1857                 Wertheim & Macintosh   [The World in which I live, and my place in it...   A., E. S.  http://www.flickr.com/photos/britis
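The row-wise `clean_dates` function keeps the leading year and discards dates that open with a bracket. A vectorized alternative is sketched below; it is not the notebook's method, and it assumes every usable date begins with four digits. The sample values in `raw_dates` are invented for illustration.

```python
import pandas as pd

# Hypothetical date strings in the styles seen in the dataset.
raw_dates = pd.Series(['1879 [1878]', '1868', '[1897?]', '1839, 38-54', None])

# Capture four digits only when they sit at the very start of the string;
# anything else (including a leading '[') yields a missing value.
years = raw_dates.str.extract(r'^(\d{4})', expand=False)

# Convert the captured text to numbers; missing entries stay missing.
years = pd.to_numeric(years)

print(years)
```

The regular expression does in one pass what the loop over `unwanted_characters` does per row, at the cost of being stricter about what counts as a year.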

In [42]: def clean_author_names(item):
             author = str(item.loc['Author'])

             if author == 'nan':
                 return np.nan #np.NaN was removed in NumPy 2.0

             author = author.split(',')

             #no comma: keep only the alphabetic characters of the single name
             if len(author) == 1:
                 name = filter(lambda x: x.isalpha(), author[0])
                 return reduce(lambda x, y: x + y, name, '') #'' guards against an empty name

             last_name, first_name = author[0], author[1]

             first_name = first_name[:first_name.find('-')] if '-' in first_name else first_name

             if first_name.endswith(('.', '.|')):
                 parts = first_name.split('.')

                 if len(parts) > 1:
                     first_occurrence = first_name.find('.')
                     final_occurrence = first_name.find('.', first_occurrence + 1)
                     first_name = first_name[:final_occurrence]
                 else:
                     first_name = first_name[:first_name.find('.')]

             last_name = last_name.capitalize()

             return f'{first_name} {last_name}'

         df['Author'] = df.apply(clean_author_names, axis = 1)

In [43]: df.head()

Out[43]:             Place of Publication      Date of Publication  Publisher              Title                                               Author
         Identifier
         206         London                    1879                 S. Tinsley & Co.       Walter Forbes. [A novel.] By A. A                   AA         http://www.flickr.com/photos/britis
         216         London; Virtue & Yorston  1868                 Virtue & Co.           All for Greed. [A novel. The dedication signed...   A. A A.    http://www.flickr.com/photos/britis
         218         London                    1869                 Bradbury, Evans & Co.  Love the Avenger. By the author of “All for Gr...   A. A A.    http://www.flickr.com/photos/britis
         472         London                    1851                 James Darling          Welsh Sketches, chiefly ecclesiastical, to the...   E. S A.    http://www.flickr.com/photos/britis
         480         London                    1857                 Wertheim & Macintosh   [The World in which I live, and my place in it...   E. S A.    http://www.flickr.com/photos/britis

In [44]: pub = df['Place of Publication']

         df['Place of Publication'] = np.where(pub.str.contains('London'), 'London',
             np.where(pub.str.contains('Oxford'), 'Oxford',
                 np.where(pub.eq('Newcastle upon Tyne'),
                     'Newcastle-upon-Tyne', df['Place of Publication'])))

In [47]: df.head()

Out[47]:             Place of Publication      Date of Publication  Publisher              Title                                               Author
         Identifier
         206         London                    1879                 S. Tinsley & Co.       Walter Forbes. [A novel.] By A. A                   AA         http://www.flickr.com/photos/britis
         216         London                    1868                 Virtue & Co.           All for Greed. [A novel. The dedication signed...   A. A A.    http://www.flickr.com/photos/britis
         218         London                    1869                 Bradbury, Evans & Co.  Love the Avenger. By the author of “All for Gr...   A. A A.    http://www.flickr.com/photos/britis
         472         London                    1851                 James Darling          Welsh Sketches, chiefly ecclesiastical, to the...   E. S A.    http://www.flickr.com/photos/britis
         480         London                    1857                 Wertheim & Macintosh   [The World in which I live, and my place in it...   E. S A.    http://www.flickr.com/photos/britis
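Nested `np.where` calls get harder to read with every extra city. One possible alternative, sketched here on invented sample values, is `np.select`, which takes a list of conditions and a matching list of replacements and applies the first condition that holds. The `places` series is hypothetical; only the three rules come from the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical place strings; only the first is taken from the output above.
places = pd.Series([
    'London; Virtue & Yorston',
    'Oxford [printed]',
    'Newcastle upon Tyne',
    'Edinburgh',
])

# One condition per rule, in priority order.
conditions = [
    places.str.contains('London'),
    places.str.contains('Oxford'),
    places.eq('Newcastle upon Tyne'),
]
choices = ['London', 'Oxford', 'Newcastle-upon-Tyne']

# Rows matching no condition keep their original value.
cleaned = pd.Series(
    np.select(conditions, choices, default=places.to_numpy(dtype=object)),
    index=places.index,
)

print(cleaned.tolist())
```

Adding a city then means appending one condition and one choice instead of nesting another call.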

In [29]: def clean_title(item):
             title = str(item['Title'])

             if title == 'nan':
                 return np.nan #np.NaN was removed in NumPy 2.0

             #a title that opens with '[' keeps only the bracketed part
             if title[0] == '[':
                 title = title[1: title.find(']')]

             #drop the "by <author>" tail
             if 'by' in title:
                 title = title[:title.find('by')]
             elif 'By' in title:
                 title = title[:title.find('By')]

             if '[' in title:
                 title = title[:title.find('[')]

             title = title[:-2]

             title = list(map(str.capitalize, title.split()))

             return ' '.join(title)

         df['Title'] = df.apply(clean_title, axis = 1)

In [30]: df.head()

Out[30]:             Place of Publication      Date of Publication  Publisher              Title                                               Author
         Identifier
         206         London                    1879                 S. Tinsley & Co.       Walter Forbes                                       AA         http://www.flickr.com/photos/britis
         216         London                    1868                 Virtue & Co.           All For Greed                                       A. A A.    http://www.flickr.com/photos/britis
         218         London                    1869                 Bradbury, Evans & Co.  Love The Avenger                                    A. A A.    http://www.flickr.com/photos/britis
         472         London                    1851                 James Darling          Welsh Sketches, Chiefly Ecclesiastical, To The...   E. S A.    http://www.flickr.com/photos/britis
         480         London                    1857                 Wertheim & Macintosh   The World In Which I Live, And My Place In It       E. S A.    http://www.flickr.com/photos/britis

Cleaning entire dataset

In [31]: university_towns = []

         with open('Datasets\\university_towns.txt', 'r') as file:
             items = file.readlines()
             states = list(filter(lambda x: '[edit]' in x, items))

             for index, state in enumerate(states):
                 start = items.index(state) + 1
                 if index == len(states) - 1: #last of the 50 states runs to the end of the file
                     end = len(items)
                 else:
                     end = items.index(states[index + 1])

                 pairs = map(lambda x: [state, x], items[start:end])
                 university_towns.extend(pairs)

         towns_df = pd.DataFrame(university_towns, columns = ['State', 'RegionName'])
         towns_df.head()

Out[31]: State RegionName

0 Alabama[edit]\n Auburn (Auburn University)[1]\n

1 Alabama[edit]\n Florence (University of North Alabama)\n

2 Alabama[edit]\n Jacksonville (Jacksonville State University)[2]\n

3 Alabama[edit]\n Livingston (University of West Alabama)[2]\n

4 Alabama[edit]\n Montevallo (University of Montevallo)[2]\n
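The cell above finds each state's block with repeated `items.index(...)` calls. The same state and town pairs can also be built in a single pass by remembering the most recent state line. The sketch below assumes the file layout implied by the output (a `[edit]` line per state followed by its towns); the Alaska lines are invented for illustration.

```python
import pandas as pd

# Stand-in for file.readlines(); the Alabama lines mirror the output above.
sample_lines = [
    'Alabama[edit]\n',
    'Auburn (Auburn University)[1]\n',
    'Florence (University of North Alabama)\n',
    'Alaska[edit]\n',
    'Fairbanks (University of Alaska Fairbanks)[2]\n',
]

pairs = []
current_state = None
for line in sample_lines:
    if '[edit]' in line:
        current_state = line                 # a state header: remember it
    else:
        pairs.append([current_state, line])  # a town listed under that state

towns = pd.DataFrame(pairs, columns=['State', 'RegionName'])
print(towns)
```

This version needs neither the state count nor index arithmetic, so it keeps working if the file gains or loses a state.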

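In [32]: def clean_up(item):
             if '(' in item:
                 return item[:item.find('(') - 1] #since space before '('

             if '[' in item:
                 return item[:item.find('[')]

             return item #nothing to strip; without this the function would return None

         towns_df = towns_df.applymap(clean_up) #or, in pandas 2.1+: towns_df.map(clean_up)
         towns_df.head()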

Out[32]: State RegionName

0 Alabama Auburn

1 Alabama Florence

2 Alabama Jacksonville

3 Alabama Livingston

4 Alabama Montevallo
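The element-wise `clean_up` function can also be expressed with pandas string methods applied one column at a time. This is a sketch of an alternative, not the notebook's approach; `strip_annotations` is a hypothetical helper name, and the two sample rows are copied from the raw output shown earlier.

```python
import pandas as pd

towns = pd.DataFrame({
    'State': ['Alabama[edit]\n', 'Alabama[edit]\n'],
    'RegionName': [
        'Auburn (Auburn University)[1]\n',
        'Florence (University of North Alabama)\n',
    ],
})

def strip_annotations(column):
    # Remove everything from the first '(' or '[' (and any space before it)
    # to the end of the line, then trim the leftover newline.
    return column.str.replace(r'\s*[\(\[].*', '', regex=True).str.strip()

# DataFrame.apply passes each column to the helper as a Series.
cleaned = towns.apply(strip_annotations)
print(cleaned)
```

Unlike the slicing in `clean_up`, this also removes the trailing newline from entries that carry no annotation at all.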

Renaming columns and skipping rows

In [33]: olympics_df = pd.read_csv('Datasets/olympics.csv', skiprows=1, header=0)
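The rest of this section is not included here. As a rough sketch of the two ideas in its title, the example below skips a junk first row with `skiprows` and then renames columns with a dictionary. The CSV text, the `medals` name and the `new_names` mapping are all made up for illustration and do not reproduce the olympics file.

```python
import io
import pandas as pd

# Made-up CSV text: the first line is a meaningless numeric header row.
csv_text = (
    "0,1,2\n"
    "Country,01 !,02 !\n"
    "Afghanistan,0,0\n"
    "Algeria,5,2\n"
)

# skiprows=1 discards the junk line; header=0 then reads the next line as column names.
medals = pd.read_csv(io.StringIO(csv_text), skiprows=1, header=0)

# Hypothetical mapping from cryptic names to readable ones.
new_names = {'01 !': 'Gold', '02 !': 'Silver'}
medals = medals.rename(columns=new_names)

print(medals)
```

Columns missing from the mapping (here `Country`) are left unchanged by `rename`.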
