
Al.I. Cuza University of Iași
Faculty of Economics and Business Administration
Department of Accounting, Information Systems and Statistics

Data Analysis & Data Science with R

by Marin Fotache

Introduction to Data Processing

Why do we need data processing?

Data is gathered from different sources:
- databases (PostgreSQL, Oracle, MongoDB, etc.)
- spreadsheets (MS Excel, Google Spreadsheets)
- HTML/XML tables, web pages
- other statistical packages
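As a minimal sketch, importing a CSV file with base R looks like this (for the other sources, packages such as readxl, rvest, DBI/RPostgres or haven would be used instead; the file below is created on the fly just so the example is self-contained):

```r
# create a small CSV on the fly so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,price", "1,10.5", "2,NA", "3,7.2"), csv_path)

# import it with base R
df <- read.csv(csv_path, stringsAsFactors = FALSE)

str(df)          # inspect the imported structure
sum(is.na(df))   # imported data often arrives with NAs already
```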

It is extremely rare that imported data is ready for analysis:
- there are missing values (NAs)
- data was gathered with errors
- variables must be renamed, recoded, or removed
- data are spread among many datasets
- aggregation is needed

According to some authors, 80% of data analysis time is spent on data preparation/processing.
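The preparation steps above can be sketched on a hypothetical toy data frame (column names and the imputation rule are illustrative assumptions, not a prescription):

```r
# toy data frame with the typical problems: bad names, raw codes, junk, NAs
df <- data.frame(PRC  = c(10.5, NA, 7.2),
                 cat  = c("a", "b", "a"),
                 junk = 1:3)

names(df)[names(df) == "PRC"] <- "price"         # rename a variable
df$cat <- factor(df$cat, levels = c("a", "b"),
                 labels = c("Audio", "Books"))   # recode a variable
df$junk <- NULL                                  # remove a variable
df$price[is.na(df$price)] <-
  mean(df$price, na.rm = TRUE)                   # one simple way to handle NAs
```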

Topics covered

- Basic functions (in base R and in packages) for dealing with NAs, NULLs, strings, dates, etc.
- Options for renaming, recoding and removing variables in data frames
- Subsetting
- Data bind and join
- Aggregation
- Reshaping
- Basics of programming in R
- Group operations: the "apply" family, packages "plyr" and "dplyr"
- Case studies
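To give a first taste of the group operations covered later, here is a base-R sketch on an invented sales table (the "plyr" and "dplyr" packages offer richer equivalents of the same idea):

```r
# invented toy data: sales amounts per region
sales <- data.frame(region = c("N", "S", "N", "S"),
                    amount = c(10, 20, 30, 40))

tot <- tapply(sales$amount, sales$region, sum)        # per-group totals
avg <- aggregate(amount ~ region, data = sales,
                 FUN = mean)                          # same idea, data frame result
col_means <- sapply(sales["amount"], mean)            # "apply" family on columns
```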

Data cleaning best practices
http://www.quora.com/What-are-some-of-the-best-practices-for-data-cleaning (Leon Polovets)

- Sort the data by different attributes. Negative numbers, strings that start with obscure symbols, and other outliers will often appear at the top of your dataset after a sort.
- Look at summary statistics (mean, standard deviation, number of missing values, etc.) for each column. These can be used to quickly zero in on the most common problems.
- Keep track of every cleaning operation you do so that you can modify, repeat, or remove operations as necessary.
- For large, messy datasets, reaching 100% cleanliness is next to impossible. If you have a list of cleanliness issues, sort them by estimated frequency and attack the most common problems until your dataset is good enough.
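The sorting and summary-statistics advice above can be sketched in a few lines of base R (the data frame is an invented example containing a deliberate outlier and a missing value):

```r
# invented example: one NA and one suspicious outlier (400)
df <- data.frame(x = c(1, 2, NA, 400),
                 y = c("a", "b", "b", "?"))

na_counts <- sapply(df, function(col) sum(is.na(col)))  # missing values per column

summary(df$x)   # min/max quickly expose outliers such as 400
sort(df$x)      # sorting pushes oddities to the ends of the vector
```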

Data cleaning best practices (Leon Polovets, cont.)

- You can also use sampling to test data quality. For example, the acceptance test for your dataset could be something like, "If you pick 250 random rows, no more than 5 of them should have formatting issues."
- Tools like OpenRefine can save you a lot of time by performing a lot of simple cleanings with very little user input.
- Become an expert at using regular expressions.
- Create a set of utility functions/scripts/tools to handle common cleaning tasks. These might include: regex search-and-replace, remapping values based on a SQL database or CSV file, blanking out all values that don't match a regex or an allowed list of values, deleting rows that do/don't match some filter, and so on. If you find yourself repeating several steps over and over, write a script to handle all of the steps at once.
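The sampling acceptance test described above can be sketched in R as follows (the vector of values, the "well-formed integer" regex, and the 250/5 thresholds are the assumptions of this toy example):

```r
set.seed(42)                                     # reproducible sampling
# 1000 invented values, 5 of them deliberately malformed
values <- c(sprintf("%d", 1:995), "12a", "N/A", "", "3,5", "x")

sample_idx <- sample(length(values), 250)        # pick 250 random rows
bad <- !grepl("^-?[0-9]+$", values[sample_idx])  # regex: well-formed integers only

accepted <- sum(bad) <= 5                        # the acceptance criterion
```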

Data cleaning best practices (Alon Nir)

- Search for data-type errors, e.g. numerical values which were read as strings, or dates read as ranges (for example 1-10 instead of 10/1).
- Inconsistencies. For example, say you have a list of product code numbers and the names of the products (194392: white socks). You should check that you don't have any other product names associated with 194392, and no other product code for white socks.
- Also, when adding labels to records based on a certain value (e.g. Small, Medium, Large for company size, based on the number of employees), avoid the practice of using an if statement with logic such as "if none of the previous n-1 labels, then it's the nth label". That is a catch-all that leads to mistakes.
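The code/name consistency check described above can be sketched in base R (the products table is invented; the idea is simply that each code should map to exactly one name and vice versa):

```r
# invented product list; here code and name are consistent
products <- data.frame(code = c(194392, 194392, 200100),
                       name = c("white socks", "white socks", "red hat"),
                       stringsAsFactors = FALSE)

# how many distinct names per code, and distinct codes per name?
names_per_code <- tapply(products$name, products$code,
                         function(x) length(unique(x)))
codes_per_name <- tapply(products$code, products$name,
                         function(x) length(unique(x)))

all(names_per_code == 1)   # no code has two names
all(codes_per_name == 1)   # no name has two codes
```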