
8/9/2021

Data Wrangling
Amit Das
amit.das@krea.edu.in

Fixed width and delimited files


• Fixed width file format
• All rows of equal length
• Each column allocated a fixed width
of digits / characters (the same in every row)
• Unused positions filled with padding
characters (e.g. spaces or NULL)
• Application must “know” the layout
• Delimited file formats
• Columns separated by delimiters
• Spaces, commas, other …
• Rows of unequal length
• Might continue on multiple lines
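
The contrast between the two formats can be sketched in a few lines of Python. This is only an illustration: the layout (name 10 characters, age 3, city 10), the use of spaces as padding, and the sample records are all assumptions made for the example.

```python
import csv
import io

# Hypothetical layout: (field name, width in characters). The reader must
# know this in advance; nothing in the file itself describes it.
LAYOUT = [("name", 10), ("age", 3), ("city", 10)]

def format_fixed_width(record):
    """Pad every field to its allotted width so all rows have equal length."""
    return "".join(str(record[name]).ljust(width) for name, width in LAYOUT)

def parse_fixed_width(line):
    """Slice a row at the known column boundaries and strip the padding."""
    record, start = {}, 0
    for name, width in LAYOUT:
        record[name] = line[start:start + width].strip()
        start += width
    return record

people = [
    {"name": "Asha", "age": "34", "city": "Chennai"},
    {"name": "Ravi", "age": "7", "city": "Sri City"},
]
fixed_lines = [format_fixed_width(p) for p in people]
parsed_fixed = [parse_fixed_width(line) for line in fixed_lines]

# Delimited: commas separate columns, rows differ in length, and a quoted
# field may contain the delimiter or even a line break.
delimited_text = 'name,age,city\nAsha,34,Chennai\nRavi,7,"Sri City,\nAndhra Pradesh"\n'
parsed_delimited = list(csv.DictReader(io.StringIO(delimited_text)))
```

Here every fixed-width row is exactly 23 characters long, while the second delimited record spans two physical lines yet is still read as a single row.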

Missing values
• If the value for an attribute (a column) is missing
• Will show up as a short row in fixed-width format
• May be harder to detect in a delimited file
• Use a missing value indicator such as NULL, NaN, or some other string
• Failure to detect missing values can corrupt the reading
of the entire file
• Missing values must be understood properly (why missing)
• No response
• For survey data, is it “WILL NOT ANSWER” or “NOT APPLICABLE”?
• Zero response – this should NOT be recorded as a missing value
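
A minimal sketch of detecting missing-value indicators while keeping a genuine zero as data. The set of markers and the sample file are assumptions for illustration; a real dataset's codebook decides which strings mean "missing".

```python
import csv
import io

# Assumed missing-value indicators (hypothetical; check the codebook).
MISSING_MARKERS = {"", "NULL", "NaN", "NA"}

def parse_value(text):
    """Return None for a missing value, otherwise the number as a float."""
    cleaned = text.strip()
    if cleaned in MISSING_MARKERS:
        return None
    return float(cleaned)

# Row 2 uses an explicit marker, row 3 is a real zero, row 4 is an empty field.
raw = "id,income\n1,52000\n2,NULL\n3,0\n4,\n"

records = [
    {"id": row["id"], "income": parse_value(row["income"])}
    for row in csv.DictReader(io.StringIO(raw))
]
missing_ids = [r["id"] for r in records if r["income"] is None]
```

With this sample, ids 2 and 4 are reported as missing, while id 3 keeps its income of 0.0.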

Dealing with missing values


• SAFE OPTION: Exclude rows with (any) missing values (“LISTWISE”)
• Downsides
• Loss of sample size
• Bias unless values are missing (completely) at random, i.e. no systematic non-response
• OTHER OPTIONS
• Some analyses can use incomplete data (“PAIRWISE”)
• In a correlation matrix, pairwise correlations can have different sample sizes
• UNSAFE OPTION: IMPUTATION
• Replace missing values with the means of the columns
• “Predict” missing values from other attributes in the same row
• “Predict” missing value by comparison with “similar” rows
• YOU ARE MAKING UP DATA, AT YOUR PERIL!
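
The three options can be compared on a toy table. This is a sketch with made-up heights and weights; the helper names are hypothetical, and mean imputation is shown only as the simplest of the imputation strategies listed above.

```python
from statistics import mean

# Hypothetical sample: None marks a missing value.
rows = [
    {"height": 170.0, "weight": 65.0},
    {"height": 160.0, "weight": None},
    {"height": None, "weight": 80.0},
    {"height": 180.0, "weight": 75.0},
]

def listwise(rows):
    """Keep only rows with no missing values at all."""
    return [r for r in rows if all(v is not None for v in r.values())]

def column_mean(rows, col):
    """Use every available observation for this column (pairwise spirit)."""
    return mean(r[col] for r in rows if r[col] is not None)

def impute_mean(rows):
    """Replace each missing value with its column mean (invented data!)."""
    cols = list(rows[0].keys())
    means = {c: column_mean(rows, c) for c in cols}
    return [
        {c: (r[c] if r[c] is not None else means[c]) for c in cols}
        for r in rows
    ]

complete_rows = listwise(rows)   # sample size drops from 4 to 2
imputed_rows = impute_mean(rows) # sample size stays 4, but 2 values are made up
```

Listwise deletion halves the sample here, per-column statistics use three observations each, and imputation keeps all four rows at the cost of fabricated values.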

Outlier detection
• Single extreme values (univariate)
• Assuming normal distribution
• Unlikely values (on the tails)
may be discarded / capped
• “x% Trimmed Mean”
• Unlikely ≠ Impossible
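
A sketch of the normal-distribution approach: flag values by z-score, cap them, or use a trimmed mean. The data, the 2.5 threshold, and the convention that an "x% trimmed mean" drops x% of observations from each tail are assumptions for illustration (conventions for trimming vary).

```python
from statistics import mean, stdev

def z_scores(values):
    """Distance of each value from the mean, in sample standard deviations."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def flag_outliers(values, threshold=2.5):
    """Values whose |z| exceeds the (assumed) threshold."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]

def cap(values, lower, upper):
    """Pull values outside [lower, upper] back to the nearest bound."""
    return [min(max(v, lower), upper) for v in values]

def trimmed_mean(values, percent):
    """Drop `percent`% of observations from each tail, then average the rest."""
    ordered = sorted(values)
    k = int(len(ordered) * percent / 100)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return mean(kept)

data = [10, 11, 12, 13, 14, 15, 16, 17, 18, 100]  # hypothetical sample

flagged = flag_outliers(data)     # only the value 100 is flagged
capped = cap(data, 10, 18)        # 100 is replaced by 18
plain = mean(data)                # pulled upward by the extreme value
robust = trimmed_mean(data, 10)   # drops 10 and 100 before averaging
```

Note that the extreme value inflates the standard deviation it is judged against: in this sample of ten, the z-score of 100 is below 3, so the popular cutoff of 3 would miss it.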

Outlier detection (2)


• Single extreme values (univariate)
• Without assuming normal distribution
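
The slide does not name a specific method. One common distribution-free choice is the interquartile-range rule used in box plots (Tukey's fences), sketched below; the multiplier 1.5 and the sample data are assumptions.

```python
from statistics import quantiles

def iqr_fences(values, k=1.5):
    """Lower and upper fences: k * IQR beyond the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outliers(values, k=1.5):
    """Values falling outside the fences."""
    low, high = iqr_fences(values, k)
    return [v for v in values if v < low or v > high]

data = [10, 11, 12, 13, 14, 15, 16, 17, 18, 100]  # hypothetical sample
outliers = iqr_outliers(data)
```

Because quartiles are barely affected by a single extreme value, the fences stay close to the bulk of the data, and 100 is flagged without any assumption about the shape of the distribution.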

Density-based clustering
• Density-based spatial clustering
of applications with noise (DBSCAN)
is a data clustering algorithm
proposed by Martin Ester,
Hans-Peter Kriegel, Jörg Sander
and Xiaowei Xu in 1996.
• Applicable to multi-dimensional
data (where outliers are difficult
to spot manually)
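
To show how DBSCAN separates dense clusters from noise, here is a compact, unoptimized pure-Python sketch of the algorithm. The function names, the points, and the parameter values (`eps=1.5`, `min_pts=3`) are assumptions chosen for illustration; it is not a substitute for a library implementation.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within distance eps of point i (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE      # may later be claimed as a border point
            continue
        cluster += 1               # point i is a core point: start a cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point
    return labels

# Two hypothetical dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 20)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

Points left with the label -1 belong to no dense region, which is what makes the algorithm usable as a multi-dimensional outlier detector.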

Syntactic vs semantic data cleaning


• Sometimes the meaning of
the data is clear, though it
does not match exactly.
• ISO 3166-1 alpha-3:
three-letter country codes
IDN Indonesia
IMN Isle of Man
IND India
IOT British Indian Ocean Territory
IRL Ireland
IRN Iran (Islamic Republic of)
IRQ Iraq
Write code to reconcile other
(variant) spellings.
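
One possible way to reconcile variant spellings, sketched with the standard library. The code table is the one listed above; the alias table, the 0.8 similarity cutoff, and the function names are hypothetical choices for this example.

```python
import difflib

ISO3 = {
    "IDN": "Indonesia",
    "IMN": "Isle of Man",
    "IND": "India",
    "IOT": "British Indian Ocean Territory",
    "IRL": "Ireland",
    "IRN": "Iran (Islamic Republic of)",
    "IRQ": "Iraq",
}

# Hypothetical hand-maintained aliases for names that fuzzy matching misses.
ALIASES = {"iran": "IRN", "republic of ireland": "IRL", "bharat": "IND"}

def normalize(text):
    """Lower-case, drop periods, and collapse whitespace."""
    return " ".join(text.lower().replace(".", "").split())

def to_iso3(raw, cutoff=0.8):
    """Map a free-text country name or code to an alpha-3 code, or None."""
    key = normalize(raw)
    if key.upper() in ISO3:          # already a valid code
        return key.upper()
    if key in ALIASES:               # known variant
        return ALIASES[key]
    names = {normalize(name): code for code, name in ISO3.items()}
    if key in names:                 # exact official name
        return names[key]
    close = difflib.get_close_matches(key, list(names), n=1, cutoff=cutoff)
    return names[close[0]] if close else None

codes = [to_iso3(s) for s in ["  india ", "Indonesa", "Iran", "ind", "Atlantis"]]
```

Fuzzy matching is deliberately the last resort and returns None when nothing is close enough: with codes as similar as IDN and IND, a wrong guess is worse than an unresolved value that a person can review.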

Wide and long forms of data (e.g. time series)


• Long data sometimes called
“tidy” data
• Some analyses require
one or the other
• Tidy data is handled better
by machines (normalized)
• Wide data may be easier
for human readers
• Stats software (as well as Python and R) has routines for reshaping data
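
A small sketch of reshaping between the two forms using plain Python dictionaries. The table, its values, and the column names are made up for illustration; in practice, library routines such as `melt` and `pivot` in pandas do the same job.

```python
# Wide form: one row per country, one column per year (hypothetical values).
wide = [
    {"country": "IND", "2019": 1.0, "2020": 2.0},
    {"country": "IDN", "2019": 3.0, "2020": 4.0},
]

def wide_to_long(rows, id_col, var_name="year", value_name="value"):
    """One output row per (id, variable) pair."""
    long_rows = []
    for row in rows:
        for col, val in row.items():
            if col != id_col:
                long_rows.append(
                    {id_col: row[id_col], var_name: col, value_name: val}
                )
    return long_rows

def long_to_wide(rows, id_col, var_name="year", value_name="value"):
    """Collect all rows sharing an id into a single row with one column each."""
    wide_rows = {}
    for row in rows:
        target = wide_rows.setdefault(row[id_col], {id_col: row[id_col]})
        target[row[var_name]] = row[value_name]
    return list(wide_rows.values())

long = wide_to_long(wide, "country")
round_trip = long_to_wide(long, "country")
```

The long form has four rows of three columns each, and converting it back reproduces the original wide table.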

A modern view of data cleaning


• Data cleaning requires domain knowledge
• Number of cylinders: 3, 4, 6, 7, 8, 10, 12, 16 … in automobiles:
which ones are meaningful?
• Orders of magnitude: GHz, ns, microns / nm … in electronics
• The “problem” lies in the “brittleness” of learning algorithms
• Some can run on incomplete data, others cannot
• Some are strongly affected by noise in data, others are more robust
• Mean vs Median
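
The mean vs median contrast in a few lines, using made-up incomes (in thousands) to which a single extreme earner is added.

```python
from statistics import mean, median

incomes = [40, 45, 50, 55, 60]          # hypothetical incomes, in thousands
with_outlier = incomes + [1_000_000]    # one extreme earner joins the sample

mean_before, median_before = mean(incomes), median(incomes)
mean_after, median_after = mean(with_outlier), median(with_outlier)
```

The mean jumps from 50 to roughly 166,708, while the median only moves from 50 to 52.5: the same "noise" that breaks one statistic barely affects the other.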

• “Data Cleaning” is a legitimate part of the modeling process


Bill Gates, Warren Buffett, LeBron James, Lionel Messi: outliers?
