
Cleaning data at scale:

Boosting Performance of Industrial Data Science


Dr. Phil Winder
https://WinderResearch.com

1 / 110
Introduction
Focus: why and how
Requirements: intermediate data science knowledge + Python
Goal: raise the profile of data cleaning, with immediately useful techniques

2 / 110
Introduction
Visualising Data
Availability, Consistency, Leakage
Types of Data
Corrupted Data
Transforming Data and Normality
Working With Scales
Derived Variables
Feature Selection
Series Variables
Related Techniques
Contact Details

3 / 110
How Bad Data Affects Results

4 / 110
How Bad Data Affects Results

5 / 110
Why Bad Data Affects Results
Deduction: Newton
Induction: Sherlock Holmes

6 / 110
Why Bad Data Affects Results

7 / 110
Why Bad Data Affects Results

8 / 110
Group Discussion
Can you share an example where bad data has affected one of your projects?

9 / 110
Visualising Data - Tools of the Trade
Always visualise your data
How?

10 / 110
Visualising Data - Histograms

11 / 110
Visualising Data - Missing data

12 / 110
Visualising Data - Missing data

13 / 110
Visualising Data - Missing data

14 / 110
Visualising Data - Correlation

15 / 110
Visualising Data - Correlation

16 / 110
Visualising Data - Beware of Summary Statistics

Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through
Simulated Annealing, Justin Matejka, George Fitzmaurice

17 / 110
Notebook Exercise 1: Visualising Data

18 / 110
Bilogur (2018). Missingno: a missing data visualization suite. Journal of Open Source Software, 3(22), 547, https://doi.org/10.21105/joss.00547 https://github.com/ResidentMario/missingno

19 / 110
Data Availability
The availability of data defines what you can and can't use (see nullity plots).
Keep as much detail as possible
Preserve versions
CR not CRUD! (create and read; avoid updating or deleting data)

20 / 110
Data Consistency
Consistent data is stable (over time, space, ...)
Can improve the quantity and quality of data, and hence improve model performance.

Use consistent definitions for metrics

21 / 110
Data Leakage
Very easy to accidentally include future data in training data.
Oversampling before splitting into training and test sets
Running dimensionality reduction on the whole dataset
Preprocessing (e.g. scaling) over the whole dataset - see the sketch below
Including a feature that is only populated after the label has been applied
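As a sketch of the preprocessing pitfall (assuming scikit-learn; the data and variable names are illustrative), fit any scaler or reducer on the training split only, never on the full dataset:

# Sketch: avoid leaking test-set statistics through preprocessing (assumes scikit-learn).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.RandomState(0).normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

leaky_scaler = StandardScaler().fit(X)        # leaks: mean/std computed over test rows too
safe_scaler = StandardScaler().fit(X_train)   # safe: statistics come from training data only
X_train_s = safe_scaler.transform(X_train)
X_test_s = safe_scaler.transform(X_test)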

22 / 110
Examples:
https://www.reddit.com/r/MachineLearning/comments/965zgf/d_tell_me_about_how_you_were_a_victim_of_data/
Book: Janert, P.K. Data Analysis with Open Source Tools: A Hands-On Guide for Programmers and Data Scientists. O'Reilly Media, 2010. https://amzn.to/2VFqOYx.

23 / 110
Types of Data
Recap
Different data types have to be cleaned in different ways

24 / 110
Types of Data - Categorical
Nominal: labels, not ordered
male, female
UK, USA, Netherlands
Ordinal: ordered
great, average, rubbish
large, medium, small

25 / 110
Types of Data - Numerical
Measured (continuous) or counted (discrete), rather than labelled
Continuous
Temperature
Distance
Discrete (integers)
Website visitors
Ranking
Aside: interval data and ratio data.

26 / 110
Data Types in Statistics, Niklas Donges - https://towardsdatascience.com/data-types-in-statistics-347e152e8bee

27 / 110
Corrupted data
Pessimist:
Data that is wrong and cannot be used

Optimist:
Something interesting...

28 / 110
Missing data
Missing data doesn't necessarily mean numpy.nan!

>>> print(titanic.count())
pclass 1309
survived 1309
name 1309
sex 1309
age 1046
sibsp 1309
parch 1309
ticket 1309
fare 1308
cabin 295
embarked 1307
boat 486
body 121
home.dest 745
dtype: int64

29 / 110
Fixing missing data
Remove (rows or columns)
Impute Simple
Natural null
Mean
Median
Impute Complex
Regression
Random Sampling
Jitter
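A minimal sketch of the simpler options above, assuming pandas and a toy age column:

# Sketch: simple imputation strategies (assumes pandas; data is illustrative).
import pandas as pd

df = pd.DataFrame({"age": [22.0, None, 30.0, None, 41.0]})

dropped = df.dropna(subset=["age"])                   # remove rows with missing age
mean_filled = df["age"].fillna(df["age"].mean())      # impute with the mean
median_filled = df["age"].fillna(df["age"].median())  # impute with the median

# Random sampling: draw replacements from the observed distribution.
observed = df["age"].dropna()
sampled = df["age"].apply(lambda v: observed.sample(1).iloc[0] if pd.isna(v) else v)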

30 / 110
Notebook Exercise 2: Fix missing data

31 / 110
Noise: What is noise?
Weeds are just flowers that you don't like. Noise is data that you don't like.

32 / 110
Noise: Types of Noise
Class
Feature (column)
Observation (row)

33 / 110
Noise: Improving Noise
Bad news: As much as you keep picking those weeds, they keep reappearing.
Aggregation

Average (stacking/beamforming/radon transform)
Median (popcorn noise)
Simple modelling
Smoothing
Normalisation
Complex modelling
Regression or fitting
Dimensionality Reduction and Restoration
Transformations (FFT, Wavelet)
Encoding/Embedding (Autoencoder, NLP Embeddings)
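A brief sketch of the aggregation and smoothing ideas above, assuming pandas and an illustrative noisy series:

# Sketch: suppressing observation noise by aggregation/smoothing (assumes pandas).
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
s = pd.Series(np.sin(np.linspace(0, 6, 200)) + rng.normal(0, 0.3, 200))

rolling_mean = s.rolling(window=10, center=True).mean()      # average smoothing
rolling_median = s.rolling(window=10, center=True).median()  # robust to popcorn noise
ewm_smoothed = s.ewm(span=10).mean()                         # exponentially weighted smoothing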

34 / 110
Anomalies
a.k.a. outliers, inconsistent data
Data that is not expected (in a statistical sense)

35 / 110
Anomaly Types
Contextual - possibly good
Corrupted - usually not good
Measurement errors or failures
API changes
Regulatory changes
Shift in behaviour
Formatting changes

36 / 110
Detecting Anomalies
A large field in its own right
1. Define what is normal (through a model)
2. Set a threshold to define "not normal"

37 / 110
Detecting Anomalies for Data Cleaning
1. Visualise your data!
2. Everything else
a. Classification task
b. Clustering
c. Regression/fitting + thresholds
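As a sketch of steps 1 and 2 above (model what is normal, then threshold), assuming a simple Gaussian model of a single feature:

# Sketch: flag points that sit far from a fitted "normal" model (assumes NumPy).
import numpy as np

rng = np.random.RandomState(0)
x = np.concatenate([rng.normal(10, 1, 500), [25.0, -3.0]])  # two injected outliers

mu, sigma = x.mean(), x.std()
z_scores = np.abs(x - mu) / sigma
anomalies = x[z_scores > 3]   # threshold: more than three standard deviations from the mean
print(anomalies)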

38 / 110
Example Regression Task - Wine Quality

39 / 110
40 / 110
41 / 110
Winsorizing
Clip (rather than remove) data that lies outside a specified percentile (e.g. the 5th and 95th percentiles) to the boundary value - see the sketch below
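A sketch using SciPy's implementation, which clips values at the chosen percentiles (the limits here are illustrative):

# Sketch: winsorize by clipping the lowest/highest 5% to the percentile values (assumes SciPy).
import numpy as np
from scipy.stats.mstats import winsorize

x = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 100])  # 100 is an outlier
clipped = winsorize(x, limits=[0.05, 0.05])     # clip the lowest and highest 5%
print(clipped)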

42 / 110
Notebook Exercise 3: Detect and remove outliers

43 / 110
Quick intro to handling missing data: https://towardsdatascience.com/the-tale-of-missing-values-in-python-c96beb0e8a9d
Pandas documentation on missing data: https://pandas.pydata.org/pandas-docs/stable/missing_data.html
Bit more information about anomaly detection: https://towardsdatascience.com/a-note-about-finding-anomalies-f9cedee38f0b
Good short free book on anomaly detection: Practical Machine Learning: A New Look at Anomaly Detection, Ted Dunning, Ellen Friedman, O'Reilly Media, Inc., 2014, ISBN 1491914181, 9781491914182
Cool library for benchmarking time series anomaly detection: https://github.com/numenta/NAB
Nice run through of day-to-day problems with data: https://medium.com/@bertil_hatt/what-does-bad-data-look-like-91dc2a7bcb7a
Short section on dealing with corrupted data - Raschka, S. Python Machine Learning. Packt Publishing, 2015. https://books.google.co.uk/books?id=GOVOCwAAQBAJ.

44 / 110
Transforming Data

45 / 110
Recap: Statistical Distributions

46 / 110
Non-normal Data

47 / 110
Normal Q-Q Plots
If the two distributions being compared are equal then the Q-Q plot follows the 45° (y=x) line
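A minimal sketch, assuming SciPy and matplotlib, comparing a skewed sample against a normal distribution:

# Sketch: a normal Q-Q plot (assumes SciPy and matplotlib).
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

sample = np.random.RandomState(0).exponential(size=500)  # clearly non-normal
stats.probplot(sample, dist="norm", plot=plt)             # points bend away from the y=x line
plt.show()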

48 / 110
Distribution Fitting

49 / 110
Others
Box plots

50 / 110
Notebook Exercise 4: Is it Normal?

51 / 110
Group Discussion: Why is normality important?

52 / 110
Why Normality is Important

53 / 110
Look again at the parameters of all these distributions. Note how few of them use "mean".
The vast majority of data cannot be represented by a mean alone, and algorithms that assume normality will not work well.

54 / 110
55 / 110
56 / 110
Fixing: Domain Knowledge

57 / 110
58 / 110
59 / 110
60 / 110
Fixing: Arbitrary Functions
We can use any mathematical function to transform our data*
*so long as it's invertible
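For example, a log transform and its inverse (a sketch assuming NumPy; log1p is defined at zero):

# Sketch: an invertible transform pair (assumes NumPy).
import numpy as np

x = np.array([0.0, 1.0, 10.0, 100.0, 1000.0])  # right-skewed data
y = np.log1p(x)       # transform: log(1 + x)
x_back = np.expm1(y)  # inverse: exp(y) - 1 recovers the original values
assert np.allclose(x, x_back)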

61 / 110
62 / 110
63 / 110
64 / 110
65 / 110
66 / 110
67 / 110
Box-Cox Transform

y = (x**lambda - 1) / lambda, for lambda != 0
y = log(x),                   for lambda = 0

68 / 110
y = (x**lambda - 1) / lambda, for lambda != 0
y = log(x),                   for lambda = 0
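A sketch with SciPy, which also estimates lambda by maximum likelihood (the input must be strictly positive):

# Sketch: Box-Cox transform with lambda estimated from the data (assumes SciPy).
import numpy as np
from scipy import stats

x = np.random.RandomState(0).exponential(scale=2.0, size=1000)  # skewed, strictly positive
x_transformed, lmbda = stats.boxcox(x)  # lambda chosen by maximum likelihood
print(lmbda)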

69 / 110
70 / 110
71 / 110
Notebook Exercise 5: Fix Non-Normal Data

72 / 110
Presentation on Seaborn Styles - https://s3.amazonaws.com/assets.datacamp.com/production/course_6919/slides/chapter2.pdf
Code to fit all distributions: https://stackoverflow.com/questions/6620471/fitting-empirical-distribution-to-theoretical-ones-with-scipy-python
Box-Cox Plot, Python Data Analysis Cookbook, https://learning.oreilly.com/library/view/python-data-analysis/9781785282287/ch04s05.html
More Box-Cox code: http://dataunderthehood.com/2018/01/15/box-cox-transformation-with-python/
Log transforms and Box-Cox - Zheng, A., and A. Casari. Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists. O'Reilly Media, 2018. https://books.google.co.uk/books?id=sthSDwAAQBAJ.

73 / 110
Working with scales

74 / 110
Why scale is important
Many optimisation algorithms expect your data to have equal scales.
The magnitude of an error depends on the scale of the feature
Optimisers spend more "effort" trying to reduce the largest errors

75 / 110
76 / 110
Altering the scale of numerical values
Name | Result | When?
StandardScaler | Zero mean and variance of one | Most of the time
MinMaxScaler, MaxAbsScaler | Rescale feature to lie between zero and one | When data really can be zero, categorical values
RobustScaler | A version of StandardScaler that is less sensitive to outliers | When there are outliers
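A short sketch of the scalers above with scikit-learn (the data is illustrative):

# Sketch: the scalers above applied to a toy feature matrix (assumes scikit-learn).
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0], [100.0, 800.0]])  # outlier in column 0

print(StandardScaler().fit_transform(X))  # zero mean, unit variance
print(MinMaxScaler().fit_transform(X))    # rescaled to [0, 1]
print(RobustScaler().fit_transform(X))    # median/IQR based, less affected by the outlier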

77 / 110
78 / 110
Handling categorical data
Most algorithms can't consume raw categorical data directly.
Create new columns for each possible category (One­Hot Encoding)

X = [
{'sex': 'female', 'location': 'Europe', 'age': 33},
{'sex': 'male', 'location': 'US', 'age': 65},
{'sex': 'female', 'location': 'Asia', 'age': 48},
]

age location=Asia location=Europe location=US sex=female sex=male

33.0 0.0 1.0 0.0 1.0 0.0

65.0 0.0 0.0 1.0 0.0 1.0

48.0 1.0 0.0 0.0 1.0 0.0
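One way to produce an encoding like the table above is scikit-learn's DictVectorizer (a sketch; the exercise may use a different tool):

# Sketch: one-hot encode the dicts above (assumes scikit-learn's DictVectorizer).
from sklearn.feature_extraction import DictVectorizer

X = [
    {'sex': 'female', 'location': 'Europe', 'age': 33},
    {'sex': 'male', 'location': 'US', 'age': 65},
    {'sex': 'female', 'location': 'Asia', 'age': 48},
]

vec = DictVectorizer(sparse=False)
encoded = vec.fit_transform(X)
print(vec.get_feature_names_out())  # e.g. ['age', 'location=Asia', ..., 'sex=male']
print(encoded)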

79 / 110
One-hot encoding is simple, but uses one more column per category than necessary (all zeros could represent one of the states).

So often we use dummy encoding:

X = [
{'sex': 'female', 'location': 'Europe', 'age': 33},
{'sex': 'male', 'location': 'US', 'age': 65},
{'sex': 'female', 'location': 'Asia', 'age': 48},
]

age location_Asia location_Europe sex

33.0 0.0 1.0 0.0

65.0 0.0 0.0 1.0

48.0 1.0 0.0 0.0
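One way to produce a dummy-coded table like the one above is pandas' get_dummies with drop_first (a sketch; which category is dropped may differ):

# Sketch: dummy encoding - drop one redundant column per category (assumes pandas).
import pandas as pd

df = pd.DataFrame([
    {'sex': 'female', 'location': 'Europe', 'age': 33},
    {'sex': 'male', 'location': 'US', 'age': 65},
    {'sex': 'female', 'location': 'Asia', 'age': 48},
])

dummies = pd.get_dummies(df, columns=['sex', 'location'], drop_first=True)
print(dummies)  # one fewer column per category than one-hot encoding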

80 / 110
Other Techniques
Effect coding: Provide a numeric cost/benefit according to the category
Feature hashing: Maps an unbounded number of categories to a limited range - used in large datasets
Bin counting: map common categories (or events) by how often they occur

81 / 110
Notebook Exercise 6: Fixing Scales and
Categorical Data

82 / 110
Good section on categorical variables - Zheng, A., and A. Casari. Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists. O'Reilly Media, 2018. https://books.google.co.uk/books?id=sthSDwAAQBAJ.

83 / 110
Derived variables
New features that are calculated from other features

84 / 110
Why new data can be better than the original
Better features lead to:
Less overfitting
Reduced computational complexity
Simpler model explanations
Greater robustness
Note that:
Accuracy rarely increases (more complex models can compensate for more complex data)
Other scores may increase as better predictions are made (e.g. right class, fewer errors)

85 / 110
Domain specific feature generation
Many domains have:
important combinations of features:
velocity, body mass index, price­earnings ratio, queries per second, etc.
important domain transformations
FFT, Convolutions, Embeddings, etc.

86 / 110
Multiplication or division (ratios)
Change over time, or rate
Subtraction of a baseline
Normalisation: normalising one variable with respect to another. E.g. the raw number of failures is probably not that useful on its own, but a failure rate (number of failures / total requests) could be very useful.
Note: This is effectively the same as the previous sections. But now we're applying transformations to create more
informative features.
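A sketch of these combinations with pandas (the column names are illustrative):

# Sketch: derived variables - ratios, rates and baseline subtraction (assumes pandas).
import pandas as pd

df = pd.DataFrame({
    "failures": [3, 5, 2, 8],
    "requests": [1000, 1200, 900, 1600],
    "latency_ms": [120.0, 130.0, 128.0, 150.0],
})

df["failure_rate"] = df["failures"] / df["requests"]                      # ratio / normalisation
df["latency_change"] = df["latency_ms"].diff()                            # change over time
df["latency_vs_baseline"] = df["latency_ms"] - df["latency_ms"].iloc[0]   # subtract a baseline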

87 / 110
Brute-Force Combinations
Polynomials
Combinations of features
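A sketch with scikit-learn's PolynomialFeatures, which generates powers and pairwise products:

# Sketch: brute-force polynomial and interaction features (assumes scikit-learn).
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0], [3.0, 4.0]])
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)       # x1, x2, x1^2, x1*x2, x2^2
print(poly.get_feature_names_out())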

88 / 110
Arbitrary Transformations
Logarithmic or Power laws
Kernels (a type of embedding)

89 / 110
Group discussion: Do you think that feature
extraction can ever be automatic?

90 / 110
Feature selection

91 / 110
Why select features?
To reduce computational complexity
To increase robustness
To simplify
To prevent overfitting

92 / 110
How to select features
Filtering: "Does this feature represent the label?"
Correlation
Mutual information
Model inference: "Which features are the most informative?"
Trees
Regularisation
Brute force (a.k.a. Wrapper approach): "Which features produce the best model?"
Others: Genetic algorithms, neural networks with dropout
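A sketch of the filtering approach, assuming scikit-learn's mutual information scorer and an illustrative dataset:

# Sketch: filter features by mutual information with the label (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(mutual_info_classif, k=2).fit(X, y)
print(selector.scores_)            # per-feature mutual information estimates
X_reduced = selector.transform(X)  # keep the two most informative features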

93 / 110
Model Inference
Using decision trees
Recap:

Trees establish rules to segment the data using a purity measure.
The feature with the best split (most pure) is deemed the most informative
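A sketch of impurity-based importances from a tree ensemble (assuming scikit-learn and an illustrative dataset):

# Sketch: rank features by impurity-based importance (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(model.feature_importances_)  # higher means more informative splits use that feature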

94 / 110
95 / 110
Brute Force
1. Iteratively add or remove features whilst measuring performance.
2. Remove features that don't improve performance

Pros:
Related to your chosen performance metric
Cons:
Computationally expensive
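A sketch of one wrapper-style selector, recursive feature elimination with cross-validation (one of several possible wrappers, assuming scikit-learn):

# Sketch: wrapper-style selection via recursive feature elimination (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
selector = RFECV(LogisticRegression(max_iter=1000), cv=5, scoring="accuracy").fit(X, y)
print(selector.support_)  # mask of the features worth keeping under the chosen metric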

96 / 110
97 / 110
Notebook Exercise 7: Model Improvement through
Feature Selection

98 / 110
Great review paper of feature selection methods: http://www.jmlr.org/papers/volume3/guyon03a/guyon03a.pdf - Guyon, Isabelle, and André Elisseeff. "An introduction to variable and feature selection." Journal of Machine Learning Research 3, no. Mar (2003): 1157-1182.

99 / 110
Series variables

100 / 110
Group Discussion: How do you think timeseries
data can be corrupted?

101 / 110
How Ordered Data Differs
Difficult to throw away data
Jumps or spikes in the data can seriously upset algorithms
Usually affected by moving biases and noise
Seasonality
Changes in variance
Autocorrelation
The current observation is usually correlated with past and future observations
We can't randomly shuffle data

102 / 110
Applying preprocessing techniques to series data
Removing outliers ­ similar to before
Imputing missing data
Usually reliant on a filtering method. E.g. exponential decay smoothing
Or modelling method. E.g. interpolation
Suppressing noise
Time series filters
Seasonality
Decompose the seasonal elements. For example: trends, cyclic changes, etc.
Several Python libraries: statsmodels and prophet are prominent
Scaling
Similar to before, but be careful about changes in the trend
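A sketch pulling these steps together, assuming pandas and statsmodels on an illustrative daily series:

# Sketch: impute, smooth and decompose a time series (assumes pandas and statsmodels).
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2021-01-01", periods=120, freq="D")
s = pd.Series(np.sin(np.arange(120) * 2 * np.pi / 30) + np.arange(120) * 0.02, index=idx)
s.iloc[[10, 40, 41]] = np.nan                               # some missing observations

filled = s.interpolate()                                    # impute by interpolation
smoothed = filled.rolling(window=7, center=True).median()   # suppress noise
parts = seasonal_decompose(filled, period=30)               # trend / seasonal / residual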

103 / 110
Notebook Exercise 8: Cleaning Timeseries Data

104 / 110
Related Techniques
Things we haven't had time to talk about

105 / 110
Dimensionality Reduction
Aggregation, MDS, ICA, PCA, t-SNE, etc.
All part of the data cleaning process and another way to simplify/reduce data.

106 / 110
Data Integration
ETL
Moving data
Ingesting data
Organising data
Storing data
Exposing data
Software Engineering
Monitoring
Testing

107 / 110
Data governance
Access, authentication and authorisation
Security
Auditing/Monitoring
Privacy

108 / 110
Data integration: https://en.wikipedia.org/wiki/Data_integration
Data governance: https://en.wikipedia.org/wiki/Data_governance

109 / 110
Contact Details
Dr. Phil Winder
Website: https://WinderResearch.com

Email: phil@WinderResearch.com
Twitter: @DrPhilWinder
LinkedIn: https://www.linkedin.com/in/DrPhilWinder/

110 / 110
