
Quick Guide to Data Cleaning with Examples

Maria Chojnowska
5 April 2023, 10 min read

Data cleaning is a critical part of every data science project. There's no doubt about it. You should know that preparing data for proper analysis is often time-consuming and challenging. Data scientists tend to spend a lot of time at this stage of data exploration, and the step sometimes requires domain knowledge. Here's what usually happens: data scientists often get data from different sources. The data in question might be inconsistent, dirty, or missing. That's why data cleaning is so important. This is where data scientists decide how to deal with such problems and preserve as much valuable data as possible. In this article, I present a few handy data-cleaning techniques every data scientist needs to know.

What's inside

1. Let's start with data cleaning
2. Merging data from two CSV files
3. Making data consistent
4. Deleting columns
5. Missing values
6. Mapping values
7. Outliers
8. Scaling
9. Normalization
10. Summary
11. Contact us

Let’s start with data cleaning


First things first. What is the definition of data cleaning?

It is the process of identifying and correcting or removing errors, inconsistencies, and inaccuracies in data to improve its quality and usefulness. Data cleaning aims to ensure that data is accurate, complete, consistent, and correctly formatted for analysis.

The data I will use in this article comes from a travel company that offers trips from London to New Zealand and
Australia.

We get two tables:

traveler (users)
travel (travel records)

First of all, we need to import Pandas.


import pandas as pd
If you'd like to follow this tutorial, prepare this data:

# preparing data (the inconsistent date formats and destination
# spellings below are intentional - that's what we'll be cleaning)
traveler = pd.DataFrame({
    'user_id': [136, 284, 101, 529, 800, 823],
    'age': [None, 38, 30, 43, 49, 28],
    'name': ["Ann", "Ben", "Tom", "Bianca", "Caroline", "Kate"]
})
travel = pd.DataFrame({
    'user_id': [101, 284, 136, 800, 101, 800, 823, 529, 284],
    'date_of_journey': [
        '2018-01-16',
        '2017-07-13',
        '2019-10-10',
        '2018/03/20',
        '2019-12-24',
        '2017-10-17',
        '2016/11/02',
        '2019/09/14',
        '2019-08-07'
    ],
    'duration': [10, 10, 7, 13, 7, 11, 14, 8, 12],
    'destination': [
        "New Zeland",
        "australia",
        "Australia",
        "New_Zealand",
        "Australia/",
        "Australia",
        "New Zealand",
        "Australia",
        "New_zealand"
    ],
    'cost': [None, 2325.0, 1760.0, 2740.0, 4000.0, 2475.0, 3140.0, 1840.0, 2910.0],
    'currency': [None, 'EUR', 'GBP', 'GBP', 'GBP', 'EUR', 'GBP', 'GBP', 'GBP']
})

traveler.to_csv("traveler.csv", index=False)
travel.to_csv("travel.csv", index=False)

Now that we’ve got our data, it’s time to do some data cleaning.

Merging data from two CSV files


Our first step is merging data from two CSV files. The data we need is separated into two tables, so we need to
combine them using the user_id column in both files. Here’s how you merge this data in Pandas.

traveler = pd.read_csv("traveler.csv")
travel = pd.read_csv("travel.csv")

data = pd.merge(travel, traveler, on='user_id')
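As a side note (not part of the original snippet): when merging real-world tables, it can pay off to assert your assumptions about key uniqueness. The validate argument of pd.merge does exactly that; a minimal sketch:

# Each travel record should match exactly one traveler (many-to-one);
# pandas raises a MergeError if that assumption is violated.
data = pd.merge(travel, traveler, on='user_id', validate='many_to_one')

This turns a silent row-duplication bug into a loud error at merge time.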

Making data consistent

Date
To visualize data over time, we need a single, consistent date format within the column. In our case, the date_of_journey column mixes several formats, so we use infer_datetime_format and let Pandas try to infer the proper format. Note that inference might fail if a user was too creative with the date format.

data.date_of_journey = pd.to_datetime(data.date_of_journey, infer_datetime_format=True)
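One caveat worth flagging: in pandas 2.0 and later, infer_datetime_format is deprecated. If you follow along on a newer version, format='mixed' is the closer equivalent for a column that genuinely mixes formats; a hedged sketch:

# On pandas >= 2.0: parse a mixed-format date column explicitly.
# Each value is parsed independently, which is slower but tolerant.
data.date_of_journey = pd.to_datetime(data.date_of_journey, format='mixed')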

Cost
When we analyze the cost and currency columns, we easily notice that costs are given in two currencies. We want
our data to be consistent, so changing the cost values to (for example) GBP is a good idea.

We can do that by using Pandas slicing. First, we need to choose only the cells where the cost value is given in the
currency 'EUR' and update these cells (multiplying by 0.8). In the second step, we change all the "EUR" to "GBP" in
the currency column.

data.loc[data.currency == 'EUR', 'cost'] = data.cost * 0.8
data['currency'] = data['currency'].replace("EUR", "GBP")
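If the data contained more than two currencies, hard-coding one multiplication per currency would not scale. Here is a small sketch of one way to generalize the conversion (the rates dict and its values are illustrative, not real exchange rates):

# Hypothetical conversion rates to GBP; in practice, look these up
# from a reliable source for the relevant dates.
rates = {'EUR': 0.8, 'GBP': 1.0}
# Rows with a missing currency stay NaN and are handled in the
# missing-values step later on.
data['cost'] = data['cost'] * data['currency'].map(rates)
data['currency'] = 'GBP'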

Destination
As we can see, the destination column contains more unique values than expected. It should include only two values, which is why we need to fix it by mapping every variant to new_zealand or australia, respectively. The first cell below lists all the categories that occur in the destination column.

data.destination.describe()

# We need to get all compromised categories and try to unify them
categories = data.destination.drop_duplicates()
print(categories)

data.loc[data.destination.str.lower().str.startswith('n', na=False), 'destination'] = "new_zealand"
data.loc[data.destination.str.lower().str.startswith('a', na=False), 'destination'] = "australia"

Deleting columns
If there are columns that we believe can be excluded from the analysis, we can drop them at this point. In our example, we will drop currency, since it's now the same for all records. We will also delete the name column, because we don't need this information for our analysis.

data = data.drop(['currency', 'name'], axis=1)

Missing values
We can check whether there are any missing values with data.isna() in the first cell.

If we spot missing values, we can perform one of the two following actions:

1. If we can't figure out how to fill the missing values in a way that isn't going to be misleading, we can delete the rows that include incomplete data by using data.dropna().

2. If we have enough knowledge to fill in the missing values, we should do it. The idea here is to preserve as many rows as possible. We can use many strategies to do that, but domain knowledge will be required for this task. For example, in a column that contains only positive values, we can fill the empty cells with -1 to make them stand out. Another solution is to use an arbitrarily chosen value, or a calculated one such as the mean, max, or min.

data.isna()

In our case, we're going to fill the missing values as follows:

cost with the mean travel cost for 'new_zealand'
age with the mean value calculated from the age column

values = {
    'cost': data.cost.where(data.destination == 'new_zealand').mean(),
    'age': round(data.age.mean())
}
print('values:', values)
data = data.fillna(value=values, limit=1)
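If several destinations had missing costs, you could generalize this with a groupwise fill; a minimal sketch, not part of the original walkthrough:

# Fill each missing cost with the mean cost of that row's destination.
data['cost'] = data.groupby('destination')['cost'].transform(
    lambda s: s.fillna(s.mean())
)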

Mapping values
Now it's time to map the destination column values to {'new_zealand': 0, 'australia': 1}, which is more readable for a machine than strings.

data['destination'] = data['destination'].replace({'new_zealand': 0, 'australia': 1})
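An equivalent shortcut worth knowing (shown as a sketch, not the article's approach) is to let pandas derive the integer codes from a categorical dtype:

# Codes are assigned alphabetically ('australia' -> 0, 'new_zealand' -> 1
# here), so fix the category order explicitly if the mapping matters.
data['destination'] = data['destination'].astype('category').cat.codes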

Outliers
Outliers are observations that differ significantly from the other observations; they are sometimes called abnormal observations or anomalies. The easiest way to spot outliers is to plot the data. To do that, we will use a scatter plot and a box plot.

In both plots, we can see that one observation differs notably from the others (marked with red). Using our
business knowledge, we can decide whether such an observation is an error, or maybe we can explain it somehow
and, ultimately, it’s valuable for us. For example, when analyzing health data, outliers can indicate anomalies that
might be caused by disease.

In our situation, let's say that this anomaly is caused by the dates when the user was traveling (from Christmas to
New Year's Eve), which probably caused the trip cost to skyrocket. We can delete it because such a situation is
abnormal and misleading.

We can also use clustering, standard deviation analysis, and other strategies to detect outliers.
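For completeness, here is a minimal sketch of the standard-deviation approach mentioned above (the three-sigma threshold is a common convention, not something this walkthrough prescribes):

# Flag costs more than 3 standard deviations from the mean; assumes a
# roughly symmetric distribution (an IQR rule is more robust for skew).
mean, std = data.cost.mean(), data.cost.std()
suspected_outliers = data[(data.cost - mean).abs() > 3 * std]
print(suspected_outliers)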

import numpy as np
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [10, 7]

x = data.duration
y = data.cost
z = data.destination

fig, ax = plt.subplots()

scatter = ax.scatter(x, y, c=z, s=np.pi*100, marker='o', alpha=0.5)

# produce a legend with the unique colors from the scatter
legend = ax.legend(*scatter.legend_elements(), loc=1, title="Destination")
ax.add_artist(legend)
legend.get_texts()[0].set_text('new_zealand')
legend.get_texts()[1].set_text('australia')

plt.title('Travel costs vs travel duration for Australia and New Zealand')
plt.xlabel('travel duration')
plt.ylabel('travel costs')
plt.show()

boxplot = data.boxplot(column=['cost'], by=['destination'])
plt.xticks([1, 2], ['new_zealand', 'australia'])

data = data.drop(data.index[data.cost == data.cost.max()], axis=0)

Scaling
Data scaling, also known as feature scaling (and sometimes mistakenly called normalization - more about that below!), is the process of transforming numerical data into a standardized format that can be more easily compared and analyzed.

In data scaling, the original numerical values of features in a dataset are transformed to fall within a specific range
or have a particular distribution. Data scaling aims to ensure that features are on a similar scale, which can improve
the performance of machine learning algorithms, especially those that rely on distance or magnitude calculations.

Data scaling is important in many data analysis and machine learning workflows. By scaling data, you can make it
easier to compare different features, identify patterns, and build more accurate models.

Scaling can be useful or even necessary because many analytics methods are sensitive to the scale of the data. But what does that mean? Scaling changes the range of a feature; for example, it can transform the data to fit a specific range such as [0, 1]. Several scaling methods exist. Our pick is the Min-Max algorithm, which uses the column's minimum and maximum values in its calculation.
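For reference, the Min-Max formula is straightforward (the standard textbook form, shown here for clarity):

x_scaled = (x - x_min) / (x_max - x_min)

It maps the smallest value to 0 and the largest to 1 while preserving the relative distances between points.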

We will scale the cost column.

from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler()
scaler.fit(data[['cost']])
print('Scaled cost value: \n', scaler.transform(data[['cost']]))

plt.figure(figsize=(12, 6))
ax = plt.subplot(121)
plt.scatter(data.duration, data.cost, s=np.pi*100, marker='o', alpha=0.5)
plt.title('Cost data')
plt.xlabel('duration', multialignment='center')
plt.ylabel('cost', multialignment='center')
plt.grid(True)

plt.subplot(122)
plt.scatter(data.duration, scaler.transform(data[['cost']]),
            s=np.pi*100, marker='o', alpha=0.5)
plt.title('Scaled cost data')
plt.xlabel('duration', multialignment='center')
plt.ylabel('cost', multialignment='center')
plt.grid(True)

plt.show()

Conclusion: The scatter plots above show that the points form the same pattern, but the cost values now fall within the [0, 1] range.

Normalization
Remember that scaling isn't the same thing as normalization, even though the two terms are often confused. It doesn't help that "normalization" means different things in different contexts.

In the database sense, normalization is the process of organizing data in a database to reduce redundancy and dependency while improving data integrity and consistency. It aims to make the database more efficient, easier to use, and less prone to errors or inconsistencies. This involves breaking a database down into smaller, more manageable tables and establishing relationships between them, typically by dividing the data into smaller, "atomic" values and grouping related data into tables. Each table gets a primary key that uniquely identifies each row, and data repeated across tables is stored only once.

In the statistical sense, normalization is a more radical process than scaling: instead of merely changing the range of the values, it changes the shape of the data distribution, pushing it toward the bell curve (normal distribution).
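As an illustration of the statistical sense (a sketch using scikit-learn, not code from the original walkthrough), PowerTransformer reshapes a feature toward a Gaussian:

from sklearn.preprocessing import PowerTransformer

# Yeo-Johnson works with zero and negative values; Box-Cox requires
# strictly positive data. Both reshape the feature toward normal.
pt = PowerTransformer(method='yeo-johnson')
normalized_cost = pt.fit_transform(data[['cost']])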

Summary
I hope this article helps you take your first steps in data cleaning and build a foundation for further analysis. To sum it up for you:

1. Define your goals: Before you start cleaning your data, define your goals and what you hope to achieve with your
data analysis. This will help you prioritize your cleaning tasks and ensure that your data is fit for purpose.

2. Collect your data: Collect all the relevant data you need to achieve your goals. This may involve gathering data
from multiple sources or systems.

3. Assess your data: Assess the quality of your data by checking for missing values, duplicate records, and
inconsistencies. Identify any patterns or outliers that may affect your analysis.

4. Clean your data: Clean your data by removing or correcting errors, filling in missing values, and standardizing
formats. You may also need to restructure your data or combine datasets to make them more usable.
5. Validate your data: Validate your data to ensure that it is accurate and consistent. This may involve cross-checking your data against external sources or running quality checks.

6. Document your data cleaning process: Include any changes made in case you need to refer back to it later or share it with others.

7. Test your data: Test your data to ensure it is fit for purpose and meets your goals. This may involve running
some initial data analysis or building a prototype model.

8. Iterate: Data cleaning is an iterative process, and you may need to repeat some or all of the steps above as you
refine your data analysis.

In addition to these steps, it's also important to consider data normalization, which involves standardizing data
values and units of measurement to eliminate inconsistencies and make it easier to compare different data points.

Normalization can help ensure that your data is accurate and meaningful, and it can facilitate more accurate analysis and interpretation.

By following these steps, you can ensure that your data is clean, accurate, and fit for purpose and that your data
analysis is based on a solid foundation.

Contact us
Are you looking for more information about data cleaning and other data-related topics? Follow our company blog, where our experts share their knowledge about data science with our community, or contact us through the form or at hello@sunscrapers.com.

Tags

python business intelligence
