
Alexandru Ioan Cuza University of Iași

Faculty of Economics and Business Administration
Department of Accounting, Information Systems and Statistics

Data Analysis & Data

Science with R

By Marin Fotache

Introduction to Data Processing

Why is Data Processing needed?


Data is gathered from different sources:

Databases (PostgreSQL, Oracle, MongoDB, etc.)

Spreadsheets (MS Excel, Google Spreadsheets)
HTML/XML tables, Web pages
Other statistical packages


It is extremely rare that imported data is ready for analysis:


There are missing values (NAs)

Data was gathered with errors
Variables must be renamed, recoded, removed
Data are spread among many datasets
Aggregation is needed


According to some authors, 80% of data
analysis time is spent on data preparation.
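The problems listed above can be illustrated with a few lines of base R. The data frame and its values are hypothetical, invented only to show typical import issues (missing values, inconsistent coding, a variable that needs renaming):

```r
# Hypothetical data frame illustrating typical import problems
df <- data.frame(
  EmpName = c("Ana", "Dan", NA, "Ioana"),
  sal     = c(2500, NA, 3100, 2800),
  dept    = c("acct", "it", "acct", "IT"),
  stringsAsFactors = FALSE
)

# Count missing values (NAs) per column
colSums(is.na(df))

# Rename a variable
names(df)[names(df) == "sal"] <- "salary"

# Recode inconsistent category labels to a single convention
df$dept <- toupper(df$dept)

# One simple imputation option: replace a missing salary
# with the mean of the observed salaries
df$salary[is.na(df$salary)] <- mean(df$salary, na.rm = TRUE)
```

Each of these one-liners has more robust alternatives (covered later in the course); the point here is only that none of this work disappears on its own after import.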

Topics covered

Basic functions (in base R and in packages)
for dealing with NAs, NULLs, strings, dates, etc.
Options for renaming, recoding and removing
variables in data frames
Data bind and join
Basics of programming in R
Group operations: apply "family", packages "plyr"
and "dplyr"
Case studies
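As a preview of the group-operations topic, here is a minimal base R sketch (toy data, invented for illustration) of computing per-group totals, first with `aggregate` and then with `tapply` from the "apply family"; `dplyr`'s `group_by` + `summarise` expresses the same idea:

```r
# Toy sales data (hypothetical)
sales <- data.frame(
  region = c("N", "S", "N", "S", "N"),
  amount = c(10, 20, 30, 40, 50)
)

# Group totals with aggregate: one row per region
totals <- aggregate(amount ~ region, data = sales, FUN = sum)

# The same computation with tapply, returning a named vector
tapply(sales$amount, sales$region, sum)
```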

Data cleaning best practices
(Leon Polovets)
Sort the data by different attributes. Negative numbers,
strings that start with obscure symbols, and other outliers
will often appear at the top of your dataset after a sort.

Look at summary statistics (mean, standard deviation,

number of missing values, etc.) for each column. These can
be used to quickly zero in on the most common problems.


Keep track of every cleaning operation you do so that you

can modify, repeat, or remove operations as necessary.


For large, messy datasets, reaching 100% cleanliness is next

to impossible. If you have a list of cleanliness issues, sort
them by estimated frequency and attack the most common
problems until your dataset is good enough.
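The first two practices (sorting and summary statistics) map directly onto base R. A small sketch with an invented vector containing a negative value, a huge outlier, and a missing value:

```r
# Hypothetical imported column with suspicious values
x <- c(1200, -5, 980, NA, 1e9, 1100)

# Sorting pushes outliers to the extremes of the vector;
# note that sort() drops NAs unless told otherwise
sort(x)
sort(x, na.last = TRUE)

# summary() shows the negative minimum, the huge maximum,
# and the number of missing values in one glance
summary(x)
```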

Data cleaning best practices
(Leon Polovets) - (cont.)
You can also use sampling to test data quality. For example, the
acceptance test for your dataset could be something like, "If
you pick 250 random rows, no more than 5 of them should have
formatting issues."
Tools like OpenRefine can save you a lot of time by performing a
lot of simple cleanings with very little user input.
Become an expert at using regular expressions.
Create a set of utility functions/scripts/tools to handle common
cleaning tasks. These might include: regex search-and-replace,
remapping values based on a SQL database or CSV file,
blanking out all values that don't match a regex or an allowed
list of values, deleting rows that do/don't match some filter, and
so on. If you find yourself repeating several steps over and over,
write a script to handle all of the steps at once.
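The sampling acceptance test and the regex-based utility function can be sketched together in base R. The phone-number format, the pattern, and the 5-in-1000 error rate are all invented for illustration:

```r
set.seed(42)  # make the random sample reproducible

# Hypothetical column of phone numbers; 5 of 1000 are badly formatted
phones <- c(rep("0232-201000", 995), rep("20100", 5))

# Utility: flag values that do NOT match an expected pattern
bad_format <- function(x, pattern) !grepl(pattern, x)

# Acceptance test: in 250 random rows, count the formatting issues
# (the rule from the text: no more than 5 should fail)
sample_rows <- sample(phones, 250)
n_bad <- sum(bad_format(sample_rows, "^[0-9]{4}-[0-9]{6}$"))
```

The same `bad_format` helper can be reused for blanking out non-matching values (`x[bad_format(x, pat)] <- NA`) or for filtering rows, which is exactly the kind of small reusable tool the practice above recommends.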

Data cleaning best practices
(Alon Nir) - (cont.)


Check for data-type errors, e.g. numerical values which

were read as strings, dates as ranges (for example 1-10
instead of 10/1).
Inconsistencies. For example, say you have a list of
product code numbers and the name of the products
(194392: white socks). You should check that you don't
have any other product names associated with 194392,
and no other product code for white socks.
Also, when adding labels to records based on a certain
value (e.g. Small, Medium, Large for company size, based
on the number of employees), avoid the practice of using
an if statement that uses logic such as "if none of the
previous n-1 labels, then it's the nth label". That would be
a catchall that leads to mistakes.
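Both recommendations above can be sketched in base R. The product table is hypothetical (it deliberately contains the "white socks" inconsistency), and `cut()` is one way to label every record against explicit, bounded intervals instead of an if/else catch-all:

```r
# Hypothetical product table with a deliberate inconsistency:
# code 194392 appears under two different names
products <- data.frame(
  code = c(194392, 194392, 207001),
  name = c("white socks", "white sox", "black hat"),
  stringsAsFactors = FALSE
)

# Codes mapped to more than one distinct name indicate an inconsistency
name_counts <- tapply(products$name, products$code,
                      function(n) length(unique(n)))
dup_codes <- names(name_counts[name_counts > 1])

# Explicit labelling with cut(): every employee count falls into a
# named, bounded interval, so there is no silent catch-all branch
employees <- c(4, 37, 420)
size <- cut(employees,
            breaks = c(0, 10, 250, Inf),
            labels = c("Small", "Medium", "Large"))
```

The symmetric check (one name mapped to several codes) is the same `tapply` with the roles of `name` and `code` swapped.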