
Data Cleansing

Data cleansing (also called data cleaning or data scrubbing) is the process of detecting and
correcting corrupt or inaccurate records in a data set.
It involves exploring raw data, tidying messy data and preparing data for analysis.
In the data preprocessing phase, cleaning data often takes 50-80% of the time before the data
can actually be mined for insights.

Data Quality

Business decisions often revolve around:

identifying prospects
understanding customers to stay connected
knowing about competitors and partners
being current and relevant with marketing campaigns

Data quality is an important factor that affects the outcome of data analysis and, in turn,
the accuracy of decision making. Reliable predictions cannot be made from data of little or
no quality.

Dirty data is nevertheless inevitable in any system, for a variety of reasons. It is therefore
essential to keep cleaning your data; this is an ongoing exercise that organizations have to
follow.

Dirty data

Dirty data refers to data that contains erroneous information. The following are considered
dirty data:

Misleading data
Duplicate data
Inaccurate data
Non-integrated data
Data that violates business rules
Data without generalized formatting
Incorrectly punctuated or spelled data
*source - Techopedia

Why Data Cleansing is needed

While integrating data, quality issues can arise, some of which are listed below.

Inconsistent values. Ex - 'TWO' and '2' for the same field
Additional fields or missing fields
JSON files having different structures
Out-of-range numbers. Ex - 'Age' being negative
Outliers to standard distributions
Variation in date formats

The data cleansing process helps to handle (see the sketch after this list):

Missing values
Inaccurate values
Duplicate values
Outliers such as typographic / measurement errors
Noisy values
Data timeliness (age of data)
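As a quick illustration of how several of these issues can be spotted in R, consider the small, made-up people data frame below (a minimal sketch, not part of the original exercises):

# Hypothetical data frame with a missing value, a duplicate row and an outlier
people <- data.frame(
  name   = c("Asha", "Ravi", "Ravi", "Meena"),
  age    = c(34, 41, 41, NA),
  income = c(52000, 48000, 48000, 990000)   # 990000 looks like an outlier
)

colSums(is.na(people))     # count missing values per column
sum(duplicated(people))    # count duplicate rows
summary(people$income)     # range check for out-of-range values
boxplot(people$income)     # visual check for outliers
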
How to manage Missing Values

Ignoring or removing missing values is not always the right approach, as they may be too
important to ignore. Similarly, filling in missing values manually may be tedious and not
feasible.

Other options to consider for filling missing values (sketched below) could be:

Use a global constant e.g., “NA”
Use the attribute mean
Use the most probable value. Ex: inference-based methods such as regression, the Bayesian
formula or a decision tree

You will see more about managing missing values in the subsequent topics.
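A minimal sketch of the constant and mean strategies in base R, using a made-up numeric vector (the correct_data() exercise later in this material applies the same idea with replace()):

x <- c(19, 13, NA, 17, 5, 16, NA, 20)

# Strategy 1: fill missing values with a global constant
x_const <- x
x_const[is.na(x_const)] <- -1      # any agreed-upon sentinel value

# Strategy 2: fill missing values with the attribute mean
x_mean <- x
x_mean[is.na(x_mean)] <- mean(x, na.rm = TRUE)

print(x_const)
print(x_mean)
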


How to manage Noisy Data

Noisy data is random error or variance in a measured variable; these show up as the outliers
of a dataset. Noise can be managed in the following ways (a small binning sketch follows this
list):

Binning method – first sort the data and partition them into equi-depth bins, then smooth the
data by bin means, bin medians, bin boundaries, etc.
Clustering – group the data into clusters, then identify and remove outliers
Regression – use regression functions to smooth the data
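A minimal sketch of smoothing by bin means, assuming three equi-depth bins of four values each on a made-up vector:

x <- sort(c(4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34))

bins <- rep(1:3, each = 4)            # three equi-depth bins of 4 values each
smoothed <- ave(x, bins, FUN = mean)  # replace each value by its bin mean
print(data.frame(x, bin = bins, smoothed))
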

How to Manage Inconsistent Data?

Inconsistent data can be corrected by the following approaches (a small example is sketched
below):

Manual correction using external references
Semi-automatic approaches using various tools, i.e. to detect violations of known functional
dependencies and data constraints, and to correct redundant data
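As a small semi-automatic example, inconsistent encodings such as 'TWO' and '2' for the same field can be standardized against a lookup table; the values below are made up for illustration:

qty <- c("2", "TWO", "3", "three", "2")

# Map known spelled-out values onto a canonical numeric form
lookup <- c(TWO = "2", three = "3")
fixed  <- ifelse(qty %in% names(lookup), lookup[qty], qty)

print(as.numeric(fixed))   # 2 2 3 3 2
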

Dirty data is everywhere. In fact, most real-world datasets start off dirty in one
way or another, but need to be cleaned and prepared for analysis.

In this video we will learn about the typical steps involved like exploring raw
data, tidying data, and preparing data for analysis.

Hi, I'm Nick. I'm a data scientist at DataCamp and I'll be your instructor for this
course on Cleaning Data in R. Let's kick things off by looking at an example of
dirty data.

You're looking at the top and bottom, or head and tail, of a dataset containing
various weather metrics recorded in the city of Boston over a 12 month period of
time. At first glance these data may not appear very dirty. The information is
already organized into rows and columns, which is not always the case. The rows are
numbered and the columns have names. In other words, it's already in table format,
similar to what you might find in a spreadsheet document. We wouldn't be this lucky
if, for example, we were scraping a webpage, but we have to start somewhere.

Despite the dataset's deceivingly neat appearance, a closer look reveals many
issues that should be dealt with prior to, say, attempting to build a statistical
model to predict weather patterns in the future. For starters, the first column X
(all the way on the left) appears to be meaningless; it's not clear what the columns
X1, X2, and so forth represent (and if they represent days of the month, then we
have time represented in both rows and columns); the different types of
measurements contained in the measure column should probably each have their own
column; there are a bunch of NAs at the bottom of the data; and the list goes on.
Don't worry if these things are not immediately obvious to you -- they will be by
the end of the course. In fact, in the last chapter of this course, you will clean
this exact same dataset from start to finish using all of the amazing new things
you've learned.

Dirty data are everywhere. In fact, most real-world datasets start off dirty in one
way or another, but by the time they make their way into textbooks and courses,
most have already been cleaned and prepared for analysis. This is convenient when
all you want to talk about is how to analyze or model the data, but it can leave
you at a loss when you're faced with cleaning your own data.

With the rise of so-called "big data", data cleaning is more important than ever
before. Every industry - finance, health care, retail, hospitality, and even
education - is now doggy-paddling in a large sea of data. And as the data get
bigger, the number of things that can go wrong do too. Each imperfection becomes
harder to find when you can't simply look at the entire dataset in a spreadsheet on
your computer.

In fact, data cleaning is an essential part of the data science process. In simple
terms, you might break this process down into four steps: collecting or acquiring
your data, cleaning your data, analyzing or modeling your data, and reporting your
results to the appropriate audience. If you try to skip the second step, you'll
often run into problems getting the raw data to work with traditional tools for
analysis in, say, R or Python. This could be true for a variety of reasons. For
example, many common algorithms require variables to be arranged into columns and
for missing values to be either removed or replaced with non-missing values,
neither of which was the case with the weather data you just saw.

Not only is data cleaning an essential part of the data science process - it's also
often the most time-consuming part. As the New York Times reported in a 2014
article called "For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights",
"Data scientists ... spend from 50 percent to 80 percent of their time mired in
this more mundane labor of collecting and preparing unruly digital data, before it
can be explored for useful nuggets." Unfortunately, data cleaning is not as sexy as
training a neural network to identify images of cats on the internet, so it's
generally not talked about in the media nor is it taught in most intro data science
and statistics courses. No worries, we're here to help.

In this course, we'll break data cleaning down into a three step process: exploring
your raw data, tidying your data, and preparing your data for analysis. Each of the
first three chapters of this course will cover one of these steps in depth, then
the fourth chapter will require you to use everything you've learned to take the
weather data from raw to ready for analysis.

Let's jump right in!

Exploring Raw Data

The first step in the data cleaning process is exploring your raw data. We can
think of data exploration itself as a three step process consisting of
understanding the structure of your data, looking at your data, and visualizing
your data.

To understand the structure of your data, you have several tools at your disposal
in R. Here, we read in a simple dataset called lunch, which contains information on
the number of free, reduced price, and full price school lunches served in the US
from 1969 through 2014. First, we check the class of the lunch object to verify
that it's a data frame, or a two-dimensional table consisting of rows and columns,
of which each column is a single data type such as numeric, character, etc.

We then view the dimensions of the dataset with the dim() function. This particular
dataset has 46 rows and 7 columns. dim() always displays the number of rows first,
followed by the number of columns.

Next, we take a look at the column names of lunch with the names() function. Each
of the 7 columns has a name: year, avg_free, avg_reduced, and so on.

Okay, so we're starting to get a feel for things, but let's dig deeper. The str()
(for "structure") function is one of the most versatile and useful functions in the
R language because it can be called on any object and will normally provide a
useful and compact summary of its internal structure. When passed a data frame, as
in this case, str() tells us how many rows and columns we have. Actually, the
function refers to rows as observations and columns as variables, which, strictly
speaking, is true in a tidy dataset, but not always the case as you'll see in the
next chapter. In addition, you see the name of each column, followed by its data
type and a preview of the data contained in it. The lunch dataset happens to be
entirely integers and numerics. We'll have a closer look at these datatypes in
chapter 3.

The dplyr package offers a slightly different flavor of str() called glimpse(),
which offers the same information, but attempts to preview as much of each column
as will fit neatly on your screen. So here, we first load dplyr with the library()
command, then call glimpse() with a single argument, lunch.

Another extremely helpful function is summary(), which, when applied to a data frame,
provides a useful summary of each column. Since the lunch data are entirely
integers and numerics, we see a summary of the distribution of each column
including the minimum and maximum, the mean, and the 25th, 50th, and 75th percent
quartiles (also referred to as the first quartile, median, and third quartile,
respectively.) As you'll soon see, when faced with character or factor variables,
summary() will produce different summaries.

To review, you've seen how we can use the class() function to see the class of a
dataset, the dim() function to view its dimensions, names() to see the column
names, str() to view its structure, glimpse() to do the same in a slightly enhanced
format, and summary() to see a helpful summary of each column.
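
To make this concrete, here is a minimal sketch of the same exploration workflow. The lunch dataset is specific to the course, so the built-in mtcars data frame stands in for it here:

library(dplyr)    # for glimpse()

# mtcars stands in for the course's lunch dataset
class(mtcars)     # "data.frame"
dim(mtcars)       # number of rows, then number of columns
names(mtcars)     # column names
str(mtcars)       # compact overview of each column's type and first values
glimpse(mtcars)   # dplyr's screen-friendly variant of str()
summary(mtcars)   # per-column distribution summaries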

Okay, so we've seen some useful summaries of our data, but there's no substitute
for just looking at it. The head() function shows us the first 6 rows by default.
If you add one additional argument, n, you can control how many rows to display.
For example, head(lunch, n = 15) will display the first 15 rows of the data.

We can also view the bottom of lunch with the tail() function, which displays the
last 6 rows by default, but that behavior can be altered in the same way with the n
argument.

Viewing the top and bottom of your data only gets you so far. Sometimes the easiest
way to identify issues with the data are to plot them. Here, we use hist() to plot
a histogram of the percent free and reduced lunch column, which quickly gives us a
sense of the distribution of this variable. It looks like the value of this
variable falls between 50 and 60 for 20 out of the 46 years contained in the lunch
dataset.

Finally, we can produce a scatter plot with the plot() function to look at the
relationship between two variables. In this case, we clearly see that the percent
of lunches that are either free or reduced price has been steadily rising over the
years, going from roughly 15 to 70 percent between 1969 and 2014.

To review, head() and tail() can be used to view the top and bottom of your data,
respectively. Of course, you can also just print() your data to the console, which
may be okay when working with small datasets like lunch, but is definitely not
recommended when working with larger datasets.

Lastly, hist() will show you a histogram of a single variable and plot() can be
used to produce a scatter plot showing the relationship between two variables.
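
A corresponding sketch for looking at and plotting the data, again using the built-in mtcars data frame in place of lunch:

head(mtcars)                 # first 6 rows
head(mtcars, n = 15)         # first 15 rows
tail(mtcars)                 # last 6 rows

hist(mtcars$mpg)             # histogram of a single variable
plot(mtcars$wt, mtcars$mpg)  # scatter plot of the relationship between two variables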

#
# Complete the 'mtcars_data' function below.
#
# The function is expected to return an INTEGER.
#

mtcars_data <- function() {

  print(dim(mtcars))
  print(str(mtcars))
  print(colnames(mtcars))
  print(summary(mtcars))

  # The scaffold expects an integer return value; the row count is used here as a
  # placeholder - check the exercise statement for the intended value.
  nrow(mtcars)
}

mtcars_data()

What is Tidy Data?

One of the important components of data cleaning is data tidying. Before looking at what a
tidy dataset is, let us first understand how datasets are organized.

Please examine data set 1 in the picture.

Let us understand variables, observations and values in a dataset.

A value belongs to both a variable and an observation.
A variable contains all values corresponding to one attribute across units. In the example,
the 'Population' variable contains values across all units (countries).
An observation contains all values measured on the same unit across attributes. Here, for a
'Country', the values across Year / Cases / Population form an observation.

R follows a set of conventions that makes one layout of tabular data much easier to work with
than others. Any dataset that follows the three rules below is said to be tidy data:

Each variable forms a column
Each observation forms a row
Each type of observational unit forms a table

Messy data is any other arrangement of data; a tiny illustration follows.
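In the sketch below the counts and country names are made up; the first data frame is messy because its year columns are values rather than variables, and gather() from tidyr (covered shortly) rearranges it into the tidy form:

library(tidyr)

# Messy: the column headers Y2019 and Y2020 are values of a 'year' variable
messy <- data.frame(country = c("A", "B"), Y2019 = c(10, 20), Y2020 = c(12, 25))

# Tidy: each variable (country, year, cases) forms its own column,
# and each observation (one country in one year) forms a row
tidy <- gather(messy, key = "year", value = "cases", Y2019, Y2020)
print(tidy)
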

Messy data features

Column headers are values, not variable names
Multiple variables are stored in one column
Variables are stored in both rows and columns
Multiple types of observational units are stored in the same table
A single observational unit is stored in multiple tables

Column headers are values, not variable names
In this example, the column headers <$10k, $10-20k, $20-30k etc. are themselves values, not
variables.

Multiple variables are stored in one column
In this example, each variable stores more than one piece of information:

Name - stores firstname, lastname
Address - stores city, state


Variables are stored in both rows and columns
This dataset shows variables across both rows and columns, i.e.:

Months on columns and
maxtemp, mintemp elements on rows.

Multiple types of observational units are stored in the same table
In this example, the table stores both rank and song information, resulting in redundancy of
data.

A single observational unit is stored in multiple tables
In this example, Maxtemp for the same city is stored in different tables based on Year.

gather() collapses multiple columns into two columns:

A key column that contains the former column names.
A value column that contains the former column cells.

spread() generates multiple columns from two columns:

Each unique value in the key column becomes a column name.
Each value in the value column becomes a cell in the new column.

spread() and gather() help you reshape the layout of your data to place variables in columns
and observations in rows.

separate() and unite() help you split and combine cells to place a single, complete value in
each cell.

Data for gather() and spread()

Let us create a new dataframe called "paverage.df" that stores the average scores
by player across 3 different years. We will use this dataframe to examine gather()
and spread() behaviour.

Steps for new data frame

player <- c("Sachin Tendulkar", "Sourav Ganguly", "VVS Laxman", "Rahul Dravid")
Y2010 <- c(48.8, 40.22, 51.02, 53.34)
Y2011 <- c(53.7, 41.9, 50.8, 59.44)
Y2012 <- c(60.0, 52.39, 61.2, 61.44)
paverage.df <- data.frame(player,Y2010,Y2011,Y2012)
print(paverage.df)

Data for separate() and unite()

Let us create a small dataframe that we can refer to before trying your hand at separate().

fname <- c("Martina", "Monica", "Stan", "Oscar")


lname <- c("Welch", "Sobers", "Griffith", "Williams")
DoB <- c("1-Oct-1980", "2-Nov-1982", "13-Dec-1979", "27-Jan-1988")
first.df <- data.frame(fname,lname,DoB)
print(first.df)

When column headers are values and not variables, use the gather() function to tidy the data.

Now try to recreate this dataset again after tidying the data.

For datasets that are messy because a single column holds multiple variables, the separate()
function can be used.

We have seen this example on separate().

When variables are stored in both rows and columns, use the gather() function followed by the
spread() function.

library(tidyr)
tidyr_operations <- function(){

  #Write your code here

  # Column headers Y2010-Y2012 are values, so gather() them into a 'year' column
  player <- c("Sachin Tendulkar", "Sourav Ganguly", "VVS Laxman", "Rahul Dravid")
  Y2010 <- c(48.8, 40.22, 51.02, 53.34)
  Y2011 <- c(53.7, 41.9, 50.8, 59.44)
  Y2012 <- c(60.0, 52.39, 61.2, 61.44)
  paverage.df <- data.frame(player, Y2010, Y2011, Y2012)
  pavg_gather <- gather(paverage.df, key = "year", value = "pavg", c(2, 3, 4))

  ##print(paverage.df)
  print(pavg_gather)

  # spread() reverses the gather, turning the 'year' values back into columns
  paverage1.df <- spread(pavg_gather, key = "year", value = "pavg")
  print(paverage1.df)

  # separate() splits one column into several; unite() combines columns into one
  fname <- c("Martina", "Monica", "Stan", "Oscar")
  lname <- c("Welch", "Sobers", "Griffith", "Williams")
  DoB <- c("1-Oct-1980", "2-Nov-1982", "13-Dec-1979", "27-Jan-1988")
  first.df <- data.frame(fname, lname, DoB)

  print(first.df)
  print(separate(first.df, col = "DoB", into = c('date', 'month', 'year'), sep = '-'))
  print(unite(first.df, col = "Name", c('fname', 'lname'), sep = ' '))

  # Income-range column headers are values, so gather() them into 'usd_range'
  religion <- c("Agnostic", "Atheist", "Buddhist", "Catholic")
  usd10k <- c(27, 12, 27, 41)
  usd20to30k <- c(60, 37, 30, 732)
  usd30to40k <- c(81, 52, 34, 670)
  mydf1.df <- data.frame(religion, usd10k, usd20to30k, usd30to40k)

  print(gather(mydf1.df, key = "usd_range", value = "usd", c(2:4)))

  # Variables are in both rows (Element) and columns (months):
  # gather() the month columns, then spread() the Element column so that
  # MaxTemp and MinTemp become their own columns
  City <- c("Chennai", "Chennai", "Hyderabad", "Hyderabad")
  Year <- c(2010, 2010, 2010, 2010)
  Element <- c("MaxTemp", "MinTemp", "MaxTemp", "MinTemp")
  Jan <- c(36, 24, 32, 22)
  Feb <- c(37, 25, 34, 23)
  Mar <- c(37.5, 27, 36, 25)
  mydf2.df <- data.frame(City, Year, Element, Jan, Feb, Mar)

  print(mydf2.df)
  print(spread(gather(mydf2.df, key = "month", value = "temp", c(4, 5, 6)),
               key = "Element", value = "temp"))
}

tidyr_operations()

library(dplyr)
dplyr_operations <- function(){

  #Write your code here

  # Keep the first 6 columns and move the car names out of the rownames
  # into a regular 'cars' column
  mtcars1 <- mtcars[, c(1:6)]
  cars1 <- rownames(mtcars)
  rownames(mtcars1) <- NULL
  mtcars1$cars <- cars1
  #print(mtcars1)

  # filter(): keep rows matching a condition
  print(filter(mtcars1, mpg > 20 & cyl == 6))

  # arrange(): sort by cyl ascending, then mpg descending
  print(arrange(mtcars1, cyl, -mpg))

  # select(): keep only the mpg and hp columns
  mt_select <- select(mtcars, mpg, hp)
  rownames(mt_select) <- NULL
  print(mt_select)

  # Add a derived column (disp squared)
  mt_newcols <- mtcars1
  mt_newcols$disp2 <- mtcars1$disp ^ 2
  print(mt_newcols)

  # Simple summary statistics
  print(mean(mtcars$mpg))
  print(max(mtcars$mpg))
  print(quantile(mtcars$mpg, probs = 0.25))
}

dplyr_operations()

library(stringr)
stringr_operations <- function(){

  #Write your code here

  # str_c(): concatenate strings with a separator
  x <- "R"
  print(str_c(x, "Tutorial", sep = " "))

  # str_count(): number of matches of a pattern
  X <- "hop a little, jump a little, eat a little, drive a little"
  print(str_count(X, "little"))

  # str_locate() / str_locate_all(): position(s) of the first / all matches
  Y <- "hop a little, jump a little"
  print(str_locate(Y, "little"))
  print(str_locate_all(Y, "little"))

  # str_detect(): is the pattern present at all?
  print(str_detect(Y, 'z'))

  # str_extract() / str_extract_all(): pull out the first / all matches
  Z <- "TRUE NA TRUE NA NA NA FALSE"
  print(str_extract(Z, "NA"))
  print(str_extract_all(Z, "NA"))

  # str_length(): number of characters
  print(str_length(Z))

  # Case conversion
  print(str_to_lower(Z))
  print(str_to_upper(Z))

  # str_order(): ordering permutation for sorting
  y <- c("alpha", "gama", "duo", "uno", "beta")
  print(y[str_order(y)])

  # str_pad(): pad to a fixed width; str_trim(): strip surrounding whitespace
  print(str_pad("alpha", 13, "both", "%"))
  z <- c(" A", " B", " C")
  print(str_trim(z))
}

stringr_operations()

Your Output (stdout)


[1] "R Tutorial"
[1] 4
start end
[1,] 7 12
[[1]]
start end
[1,] 7 12
[2,] 22 27
[1] FALSE
[1] "NA"
[[1]]
[1] "NA" "NA" "NA" "NA"
[1] 27
[1] "true na true na na na false"
[1] "TRUE NA TRUE NA NA NA FALSE"
[1] "alpha" "beta" "duo" "gama" "uno"
[1] "%%%%alpha%%%%"
[1] "A" "B" "C"
Data type conversion of scalars, vectors (logical, character, numeric), matrices and
dataframes is possible in R. Converting a variable from one type to another is called
coercion.

is.numeric(), is.character(), is.vector(), is.matrix(), is.data.frame() return TRUE or FALSE.
as.numeric(), as.character(), as.vector(), as.matrix(), as.data.frame() convert one data type
to another.
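
A minimal sketch of these checks and conversions:

x <- "42"

is.numeric(x)         # FALSE - it is a character string
y <- as.numeric(x)    # coerce to numeric
is.numeric(y)         # TRUE

as.character(3.14)    # "3.14"
as.numeric("abc")     # NA, with a warning - coercion can fail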


date_operations <- function() {

  #Write your code here

  strDates <- "01/05/1965"

  # Parse the day/month/year string into a Date object
  date1 <- as.Date(strDates, format = "%d/%m/%Y")
  print(date1)

}

date_operations()

[1] "1965-05-01"

Practice - Winsorizing technique

Consider the same set called 'Outlierset' from the earlier example.

Outlierset <- c(19, 13, 29, 17, 5, 16, 18, 20, 55, 22, 33, 14, 25, 10, 29, 56)

Copy Outlierset to a new dataset Outlierset1.
Replace the outliers with 36, which is 3rd quartile + minimum.
Compare boxplots of Outlierset and Outlierset1; you should see no outliers in the new dataset
(see the sketch below).
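
A minimal sketch of this exercise, using 36 as the replacement value stated above:

Outlierset <- c(19, 13, 29, 17, 5, 16, 18, 20, 55, 22, 33, 14, 25, 10, 29, 56)

# Copy the data and cap (winsorize) values above the chosen threshold
Outlierset1 <- Outlierset
Outlierset1[Outlierset1 > 36] <- 36

# Compare the two distributions; the second boxplot should show no outliers
boxplot(Outlierset, Outlierset1, names = c("Original", "Winsorized"))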

Obvious errors
We have so far seen how to handle missing values, special values and outliers.

Sometimes, we might come across obvious errors that cannot be caught by the techniques learnt
previously.

Errors such as an age field having a negative value, or a height field being, say, 0 or an
implausibly small number, would still need manual checks and corrections.

correct_data <- function(){

  #Write your code here

  # Impute missing values with the mean and then with the median
  x <- c(19, 13, NA, 17, 5, 16, NA, 20, 55, 22, 33, 14, 25, NA, 29, 56)
  print(replace(x, is.na(x), mean(x, na.rm = TRUE)))
  print(replace(x, is.na(x), median(x, na.rm = TRUE)))

  # Inspect the distribution, then drop values of 36 or more as outliers
  Outlierset <- c(19, 13, 29, 17, 5, 16, 18, 20, 55, 22, 33, 14, 25, 10, 29, 56)
  print(summary(Outlierset))
  Cleanset <- Outlierset[Outlierset < 36]
  print(Cleanset)
}

correct_data()

Your Output (stdout)


[1] 19.00000 13.00000 24.92308 17.00000 5.00000 16.00000 24.92308 20.00000
[9] 55.00000 22.00000 33.00000 14.00000 25.00000 24.92308 29.00000 56.00000
[1] 19 13 20 17 5 16 20 20 55 22 33 14 25 20 29 56
Min. 1st Qu. Median Mean 3rd Qu. Max.
5.00 15.50 19.50 23.81 29.00 56.00
[1] 19 13 29 17 5 16 18 20 22 33 14 25 10 29
