
Tutorial 3 - Data Cleansing

The software to be used is R Studio, an integrated development environment for R, which is an open source suite of software facilities for data manipulation, calculation and graphical display. R Studio is available within the University via AppsAnywhere or, as R and R Studio are open source, we can download the software for free and install it on our own PC.

• If you are using the software at SHU then open AppsAnywhere and search for R
Studio using the search function. Once the application has been found click launch
which should then open the R Studio application.
• If you are using the software at home then go to the following website:

https://www.rstudio.com/products/rstudio/download/

and follow the online setup instructions for your relevant machine.

3.1 OPENING R STUDIO

When we launch RStudio the following screen will appear and look something like this:

Figure 1.1 - R Studio Opening Screen

The pane on the left is the console. The upper right is the workspace/history (selected by using
the tabs) and the lower right is the Files/Plots/Packages/Help/Viewer area.
3.2 BEFORE YOU BEGIN
3.2.1 SETTING UP FOR THE FIRST TIME

You should create a folder on the root of your homespace called DQ.

In the Advanced Data Management Project Blackboard site for the Data Integration topic
under Data Quality (go to: Learning Materials >> Data Integration >> Dirty Data), find the
relevant data files and download these to your DQ folder

The read program below can be used to create and view the required data sets in R for this session.

• Start R Studio.

• Create a new R project called DQtuts and save this to your DQ folder.

• Copy the read program, which contains the following statements; note that you will need to amend these for your own individual homespace. Ensure you have the readr package installed.

library(readr)
# Tell R where your data is located
patients <- read_csv("F:/DQ/patients.csv")
consfile <- read_csv("F:/DQ/consfile.csv")
flights <- read_csv("F:/DQ/flights.csv")
patfile <- read_csv("F:/DQ/patfile.csv")
# View the data sets that you have created
View(patients)
View(consfile)
View(flights)
View(patfile)

Note that the read statement for each data set should correspond to the exact location of your DQ folder.

• Submit the library(readr) statements (above) and check to ensure that the appropriate data sets have indeed been created.

• Save the Program as ImportData in your DQ folder and save the R project.

• Note that the View() statement used in the above code should have opened each of the
data sets in the top left window in R studio. Now take some time to familiarise yourself with
the four data sets flights, patients, consfile and patfile.
3.3 CHECKING THE VALUES OF CHARACTER VARIABLES

3.3.1 USING FREQUENCY TABLES AND CHARTS

These techniques are appropriate only for variables that can take relatively few values. They
indicate which values are present in the data, including invalid and missing values, but do not
help to locate these values.

count will list all values found, including missing values. Ensure you have installed the dplyr package.

barplot and pie can be used to create horizontal frequency bar charts and pie charts. The horizontal bar charts will show all values found. The useNA option allows us to include missing values in the charts. The any function reports whether any of the values in the given argument are TRUE. Run the following R code and review the output.

Example 1.1 - Frequency Tables for Character Variables

# Load dplyr for the count function
library(dplyr)
# Produce a Simple Frequency Count for Gender Variable
count(patients, GENDER)
# Produce a Simple Frequency Count for AE Variable
count(patients, AE)
# Produce a Simple Frequency Count for DX Variable
count(patients, DX)

QUESTIONS
1 - What are these results telling us about our data?
2 – Do you think this is useful and why?

Write your answers below:


Example 1.2 – Horizontal Bar Plots

# Simple Horizontal Bar Plot with Added Labels - Gender


# Read the missing cells into the counts object along with any missing values
counts <- table(patients$GENDER, useNA ="ifany")

# Assign name "NA" to the missing values within the counts object
names(counts)[is.na(names(counts))] <- "NA"

# Display barplot
barplot(counts, main="Gender Distribution", xlab='Counts',
ylab='Gender', horiz=TRUE)
# Simple Horizontal Bar Plot with Added Labels - AE
# Read the missing cells into the counts object along with any missing values
counts <- table(patients$AE, useNA ="ifany")

# Assign name "NA" to the missing values within the counts object
names(counts)[is.na(names(counts))] <- "NA"

# Display barplot
barplot(counts, main="AE Distribution", xlab='Counts',
ylab='AE', horiz=TRUE)
# Simple Horizontal Bar Plot with Added Labels - DX
# Read the missing cells into the counts object along with any missing values
counts <- table(patients$DX, useNA ="ifany")
# Assign name "NA" to the missing values within the counts object
names(counts)[is.na(names(counts))] <- "NA"
# Display barplot
barplot(counts, main="DX Distribution", xlab='Counts',
ylab='DX', horiz=TRUE)

QUESTIONS:
1 – How many of each value are recorded?
2 – How many missing values are there?
3 – Why do we have to assign the name NA to the missing values?

Write your answers below:


Example 1.3 – Pie Charts

# Simple Pie Chart with Added Labels - Gender


# Read the missing cells into the counts object along with any missing values
counts <- table(patients$GENDER, useNA ="ifany")
#Assign name "NA" to the missing values within the counts object
names(counts)[is.na(names(counts))] <- "NA"
#Display pie chart
pie(counts, main="Gender Distribution")

# Simple Pie Chart with Added Labels - DX


# Read the missing cells into the counts object along with any missing values
counts <- table(patients$DX, useNA ="ifany")
# Assign name "NA" to the missing values within the counts object
names(counts)[is.na(names(counts))] <- "NA"
# Display pie chart
pie(counts, main="DX Distribution")

# Simple Pie Chart with Added Labels - AE


# Read the missing cells into the counts object along with any missing values
counts <- table(patients$AE, useNA ="ifany")
# Assign name "NA" to the missing values within the counts object
names(counts)[is.na(names(counts))] <- "NA"
# Display pie chart
pie(counts, main="AE Distribution")

QUESTIONS:
1 – How useful is this type of chart?
2 – Which of the three methods (frequency tables, bar charts and pie charts) is your
favourite and why?

Write your answers below:


Exercise 1.1
Use count, barplot and pie (as in the Examples above) to produce frequency
counts, horizontal bar charts and pie charts indicating missing and invalid values of the
variables Crew and Dest from the flights data set. Use the space below to write down your
findings in relation to the flights data set.

Write your answers below:

3.3.2 IDENTIFY AND LOCATE INVALID VALUES

is.na can be used to identify and locate missing values; is.numeric and is.character can
be used to detect invalid values of any character or numeric variable, even one capable of
taking many different values. The following programs detect missing values and invalid
values in separate, successive steps. Run the following R code and review the output.

Example 1.4 – Listing Missing Values

#Store indexes of missing values in an integer-valued vector


MissingValues <- which(is.na(patients), arr.ind=TRUE)

#Get rownames of missing values and store in object x


x = rownames(patients)[MissingValues[,1]]

#Get column names of missing values and store in object y


y = colnames(patients)[MissingValues[,2]]

#Merge objects x and y with equal dimensions


LocatedMissingValues = paste(x, y, sep=" ")
LocatedMissingValues

Write your findings below:


Example 1.5 – Check for Non Numeric Values

#Check if the columns contain any non-numeric values


NonNum <- unlist(lapply(patients, is.character))
NonNum
#List all values in non-numeric columns
patients[ , NonNum]

Write your findings below:

Example 1.6 – Check for Non Character Values

#Check if the columns contain any non-character values


NonChar <- unlist(lapply(patients, is.numeric))
NonChar
#List all values in non-character columns
patients[ , NonChar]

Write your findings below:

• Exercise 1.2

Use the techniques above to locate and identify all missing and invalid character variable
values in the flights data set.

Write your findings below:


3.4 CHECKING THE VALUES OF NUMERIC VARIABLES

It is impossible for a numeric variable to be assigned a value that includes, say, an alphabetic
character. In general, then, the techniques for checking the values of numeric variables will
differ considerably from those used to check the values of character variables.

3.4.1 USING FREQUENCY TABLES AND BAR CHARTS

Where a numeric variable can only take relatively few distinct values, frequency tables and
bar charts can be applied in exactly the same way as for character variables that can only
take relatively few distinct values (see section 3.3.1 for details).

Indeed, where such variables simply represent some kind of arbitrary “coding” system, such
as the variables PATNO and DX in the patients data set, they can be defined as character
variables and checked accordingly.

As before, the above methods based on count and barplot with the useNA option will indicate
which values are present in the data, including missing values, but will not help to locate any
of these values.
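As a sketch of this, assuming the patients data set loaded in section 3.2 and treating the coded variable DX exactly as a character variable, the table/barplot pattern from the earlier examples applies unchanged:

```r
# Tabulate a coded variable, including missing values (useNA = "ifany")
counts <- table(patients$DX, useNA = "ifany")
# Label the missing-value category so it shows on the chart
names(counts)[is.na(names(counts))] <- "NA"
# Horizontal bar chart of the code frequencies
barplot(counts, main = "DX Distribution", xlab = "Counts",
        ylab = "DX", horiz = TRUE)
```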

3.4.2 USING SUMMARY AND HIST

Where a numeric variable can take relatively many distinct values, summary can be used to
obtain summary information and numbers of missing values. hist can be used to produce a
histogram-style summary of the data, and is especially useful if there are many observations
in the data set. The resulting charts will indicate whether potential outliers (“rogue” data
values) are present.

Note that these methods will indicate the numbers of missing values and the possible
existence of potential outliers, but will not help to locate any of these values.

Example 2.1 - Produce summary information

#Generate summary statistics - Min, Max, Mean, Median, Quartiles


summary(patients)

Write your findings below:


#Generate Histograms for HR, SBP and DBP.
hist(patients$HR,col="red")
hist(patients$SBP,col="blue")
hist(patients$DBP,col="green")

Write your findings below:

• Exercise 2.1

Use summary and hist (as in Example 2.1) to produce summary statistics and histograms to
identify missing and potential outlying values of all the numeric variables in the flights data set.

Write your findings below:

3.4.3 IDENTIFYING AND LOCATING INVALID VALUES

If it is known in advance into what range the values of a numeric variable should fall, then a
data frame can be used to identify and locate missing and invalid (out of range) values of any
numeric variable, even one capable of taking many different values.

In the following example based on the patients data set, suppose that it is known in advance
that

• HR should lie between 40 and 100


• SBP should lie between 80 and 200
• DBP should lie between 60 and 120

For this example we will need to install the dplyr package, which allows us to efficiently
manipulate data sets
Example 2.2 - Identifying and locating invalid values

#install the dplyr package to use the %>% operator and the 'filter' function
install.packages("dplyr")
library(dplyr)
#Select & display missing HR values
patients %>% filter(is.na(HR))
#Select & display missing SBP values
patients %>% filter(is.na(SBP))
#Select & display missing DBP values
patients %>% filter(is.na(DBP))

Searching for values in the given ranges:

#HR should lie between 40 and 100 (using the subset function)
outliers1 <- subset(patients, HR < 40 | HR > 100)
#display the out-of-range HR values
outliers1
#SBP should lie between 80 and 200 (using the subset function)
outliers2 <- subset(patients, SBP < 80 | SBP > 200)
#display the out-of-range SBP values
outliers2
#DBP should lie between 60 and 120 (using the subset function)
outliers3 <- subset(patients, DBP < 60 | DBP > 120)
#display the out-of-range DBP values
outliers3

Write your findings below:


• Exercise 2.2

In the flights data set, suppose that it is known in advance that:

Boarded should lie between 200 and 500


Freight should lie between 150 and 550
Mail should lie between 100 and 250
Revenue should lie between 25000 and 65000

Use the approach of Example 2.2 to locate and identify all missing and invalid (out of range)
numeric variable values in the flights data set.

Write your findings below:

3.4.4 IDENTIFYING POTENTIAL OUTLIERS AND REASONABLE RANGES USING NORMAL PLOTS

So far, we have assumed that reasonable ranges can be easily identified for each numeric
variable. However, this pre-supposes some considerable familiarity with the data. Where data
is unfamiliar, we can use the values of the numeric variables themselves to help find outliers
– abnormally small or large “rogue” observations. Having identified these, a “reasonable”
range can then easily be identified for each numeric variable – it is the range within which only
the non-outlying observations lie.

Previously we have seen in section 3.4.2 that histograms can be used to indicate the
possible existence of potential outliers (see Example 2.1).

A more reliable method involves the use of normal plots. These are usually used to determine
whether the values of a numeric variable form a credible sample from a normal distribution.
This is a statistical distribution whose histogram resembles a symmetric bell. Such data
generate a normal probability plot that approximates to a straight line.

However, in the current application, interest focuses rather on using the plot (whether or not
an approximate straight line) to identify potential outliers. These manifest themselves on the
plot as abnormally extreme (very small or very large) observations.

To find the outliers, just look at the observations at each end of the plot:
• If any of those at the lower end lie noticeably below the approximate line or curve to which
the other observations conform, they are potential outliers.

• If any of those at the upper end lie noticeably above the approximate line or curve to which
the other observations conform, they are potential outliers.
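The code for these normal plots is not given explicitly in this tutorial; a minimal sketch using base R's qqnorm and qqline functions (assuming the patients data set is loaded as in section 3.2) is:

```r
# Normal probability plot for HR: points that depart noticeably from the
# reference line at either end of the plot are potential outliers
qqnorm(patients$HR, main = "Normal Q-Q Plot - HR")
qqline(patients$HR)
```

Repeat with SBP and DBP in place of HR to produce the plots discussed in Example 2.3.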

Example 2.3 – Identifying outliers and reasonable ranges

HR The three largest observations are clearly outliers. An initial estimate of a reasonable
range for these data may be 0 –180. However, the largest observation is so enormous
that it is distorting the plot. We re-plot the data with this observation removed (see
Example 2.4 below).

SBP The three smallest observations are clearly outliers. The three largest may be, but the
plot is curving upwards at the upper end, so they do not stand out so distinctly. It’s
safest to check them out, however. A reasonable range is thus about 90 - 220.

DBP It’s very clear that the two smallest and two largest observations are outliers. A
reasonable range for this variable is thus about 60 – 130.

Compare these results with the output from hist in Example 2.1 (re-run this part of the
program if necessary). It's clear from these charts that the HR value(s) near 900 are
abnormally high, as are DBP values around 200. However, no obvious low outliers are
discernible from these plots, and no outliers at all can be identified on the SBP plot. Thus the
normal plots are far more informative!

We now re-run the normal plot for HR, but without the massive highest observation.

Example 2.4 – Removing highest observation for HR

HR It is now clear that excluding the largest observation reveals that the next two highest
observations are also clear outliers, as are probably the smallest three observations.
A reasonable range for this variable is thus about 40 –100.

Execute the following R code. Does this look better?

#HR should lie between 40 and 100 (using the subset function)
NewRange <- subset(patients, HR >= 40 & HR <= 100)
NewRange
#Generate Histograms for HR, SBP and DBP.
hist(NewRange$HR,col="red")

• Exercise 2.3

Use normal plots to identify potential outliers and to suggest reasonable ranges for each of
the numeric variables in the flights data set. Compare your findings with the histograms
for these variables from Exercise 2.1.

If necessary, re-plot without any very extreme outliers to identify further outliers and to refine
your estimate of a reasonable range.
Of course, once reasonable ranges have been established for each numeric variable using
these plots, the techniques of section 3.4.3 above can be employed to identify and
locate the outliers – these are simply the invalid (out of range) observations.

Write your findings below:

3.5 WORKING WITH DATES


In R dates are numeric variables, and are stored as the number of days from a fixed date,
namely 1 January 1970. The as.numeric function can be used to convert a Date object to its
internal form. However, dates are usually read in and printed out using one of a number of
specialised date formats.
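For example, the day after the origin converts to the number 1:

```r
# Dates are stored internally as days since 1 January 1970
as.numeric(as.Date("1970-01-02"))
# [1] 1
```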

R provides several options for dealing with date and date/time data. The as.Date function
handles dates (without times), there are other options that can be used to deal with different
time zones. However, the general rule for date/time data in R is to use the simplest technique
possible. Thus, for date only data, as.Date will usually be the best choice. The as.Date
function allows a variety of input formats through the format= argument. The default format is
a four digit year, followed by a month, then a day, separated by either dashes or slashes. The
code and output below shows some examples of different date formats.

as.Date('1915-6-16')
[1] "1915-06-16"
as.Date('1990/02/17')
[1] "1990-02-17"

If input dates aren’t in the standard format, a format string can be composed using a number
of different elements, including: %d (day of the month, decimal number); %m (month,
decimal number); %b (month, abbreviated name); %B (month, full name); %y (year, two digits);
and %Y (year, four digits). Below is an example of how we use these:

as.Date('1/15/2001',format='%m/%d/%Y')
[1] "2001-01-15"
as.Date('April 26, 2001',format='%B %d, %Y')
[1] "2001-04-26"
as.Date('22JUN01',format='%d%b%y')
# %y is system-specific; use with caution
[1] "2001-06-22"

To extract the components of dates, the weekdays, months and quarters functions
can be used. The Sys.Date and Sys.time functions can be used to return the current date and
time. Note that many of the statistical summary functions, like mean, min, max, etc. are able to
transparently handle date objects. However, unlike some other numeric variables, valid
ranges for dates are usually relatively easy to specify. We investigate some simple methods
of identifying and locating missing and invalid dates.
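As a brief sketch of these extraction functions (note that weekdays and months return locale-dependent names; English names are shown here):

```r
d <- as.Date("2001-04-26")
weekdays(d)    # day-of-week name, e.g. "Thursday" in an English locale
months(d)      # month name, e.g. "April" in an English locale
quarters(d)    # "Q2"
Sys.Date()     # today's date
```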

3.5.1 IDENTIFYING AND LOCATING INVALID DATES

As usual, we address missing and invalid dates separately. In the examples below based on
the patients data set, suppose it is known that all patient visits should have taken place
between 1 June 1998 and 15 October 1999.

Note that to reference a particular date inside a program, say in an is.date or as.Date function,
the date must take the form of a date constant.

A date constant is written as a two-digit day, a three-character month name, and a two- or
four-digit year, all without embedded spaces, enclosed in single quotes. For example:

'07Jan02' '1998-10-14'

For this example we will need to install the lubridate package, which allows us to easily
manipulate date values

Example 3.1 – Checking for predefined data types

#Create a helper function to test whether an object is a Date

is.date <- function(x) inherits(x, 'Date')

#check for predefined data types including date


sapply(list(as.Date('2000-01-01'), 123, 'ABC'), is.date)

#Use lubridate package for easy manipulation of date values


install.packages("lubridate")
library(lubridate)

#Check that the VISIT variable is a “Date”


class(patients$VISIT)

#format VISIT variable as Date (if not already).


#We use the 'mdy' function as the VISIT variable is in the m-d-y format
patients$VISIT <- mdy(patients$VISIT)

#Select & display invalid visits outside specified dates


patients %>% filter(!(VISIT >= "1998-06-01" & VISIT <= "1999-10-15"))

#Select & display missing values for VISIT


patients %>% filter(is.na(VISIT))

Write your findings below:


Note also that the default format for printing dates is YYYY-MM-DD.

• Exercise 3.1

Suppose that all the individual flights in the flights data set should have taken place between
the 4th and 17th of March 2002 inclusive.

Use the techniques outlined above in Example 3.1 to locate and identify all missing and invalid
(out of range) dates in the flights data set.

Write your findings below:

3.6 DEALING WITH DUPLICATE OBSERVATIONS


Many data sets contain an ID variable – a variable that takes a different value for every
observation, so can be used to identify each individual observation. In database parlance, an
ID variable is simply a key field.

Thus, the data set patients contains the ID variable PATNO, whilst the ID variable for flights
is Id.

It is thus extremely undesirable for there to be duplicate observations with the same value of
the ID variable – it should be different for every observation.

The following program creates a frequency table of PATNO values that appear more than
once in the patients data set.
Example 4.1 – Identifying duplicate values of ID (PATNO)

#create data frame to hold duplicate values from defined column


duplicate <- data.frame(table(patients$PATNO))

#list the ID values that occur more than once

duplicate[duplicate$Freq > 1,]

#show the observations with duplicated ID values


patients[patients$PATNO %in% duplicate$Var1[duplicate$Freq > 1],]
Write your findings below:

• Exercise 4.1

On the flights data set, the variable Id is an ID variable.

Using the techniques of Example 4.1, identify all observations in this data set with duplicate
values of Id.

The above ideas can be extended to data sets without a single ID variable, but where
combinations of certain variables must be unique. For example, suppose that on the patients
data set, any patient could be present on several occasions, but could never appear more
than once on the same visit date. Then each PATNO by VISIT combination should be unique
and unduplicated.
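A sketch of that modification, pasting PATNO and VISIT together into a single composite key and then tabulating it exactly as in Example 4.1 (assuming the patients data set):

```r
# Build a composite PATNO-by-VISIT key
combo <- paste(patients$PATNO, patients$VISIT, sep = "|")
# Tabulate the combinations, as in Example 4.1
duplicate <- data.frame(table(combo))
# Combinations that occur more than once are duplicates
duplicate[duplicate$Freq > 1, ]
```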

Observations with duplicated PATNO by VISIT combinations could thus be detected by
modifying Example 4.1 accordingly. Duplicated observations (those where the values of
patient IDs are duplicated) can be removed as follows:

DuplicatePatientsByID <- patients[patients$PATNO %in%
  duplicate$Var1[duplicate$Freq > 1],]
UniqueDuplicatePatientsByID <-
  DuplicatePatientsByID[!duplicated(DuplicatePatientsByID),]
UniqueDuplicatePatientsByID

The above short program would remove all completely duplicated observations by patient ID
(but leave one copy of each) from the patients data set.

3.7 INTEGRITY CHECKS BETWEEN DATA SETS


Every collection of data will have its own associated rules. In particular, certain variables will
often appear in more than one data set.

We will use the following two data sets:

• consfile
• patfile

You can browse each of these data sets through the R console or by finding them in your DQ
folder located on your homespace.

The data set consfile is a record of patient consultations. Each observation comprises a
consultation date (VISIT) and an identifying code (PATNO, a character variable) for the
individual patient who received the particular consultation. Of course, in practice, such a file
would probably contain other relevant information relating to the outcome of the consultation,
such as diagnosis, recommended treatment, and so on.

The data set patfile is a reference file listing all possible patients. Each observation comprises
the identifying code variable PATNO for the individual patient, together with the patient’s
surname and forename (respectively, the character variables Surname and Forename).
Again, in practice, such a file would probably also contain other contact details, such as the
patient’s address, telephone number and e-mail address.

Every patient who attends for a consultation should be a valid patient, and therefore listed in
the patient file. In other words, every value of PATNO that appears in the data set consfile
should also be present in the data set patfile. Checking that this is so constitutes an integrity
check between the two files.

The following program uses the anti_join function in the dplyr package to identify patients
(PATNO values) that appear in the consfile data set but not in the patfile data set.

An anti join returns the rows of the first table where it cannot find a match in the second table.
Anti joins are a type of filtering join, since they return the contents of the first table, but with
their rows filtered depending upon the match conditions.

Note that where every ID in the first file also appears in the second, an empty (zero-row)
result is returned; otherwise the rows for those IDs found in one file but not the other are
returned.

Example 5.1 – Locating illegal identification codes

# Use the 'anti_join' function in dplyr


library(dplyr)

# Identify the patients in consfile but NOT in the patfile by ID


anti_join(consfile, patfile, by='PATNO')

Write your findings below:

Today we have just scratched the surface of how to use R for data manipulation, so please
take some time to review the recommended “reading resources” for manipulating data
with R – these can be found on the Advanced Data Management Project Blackboard site under
“Learning Materials >> Data Integration” and are available from the university library.
