1 / 110
Introduction
Focus: why and how
Requirements: intermediate knowledge + python
Goal: raise the profile, immediately useful
2 / 110
Introduction
Visualising Data
Availability, Consistency, Leakage
Types of Data
Corrupted Data
Transforming Data and Normality
Working With Scales
Derived Variables
Feature Selection
Series Variables
Related Techniques
Contact Details
3 / 110
How Bad Data Affects Results
4 / 110
How Bad Data Affects Results
5 / 110
Why Bad Data Affects Results
Deduction: Newton
Induction: Sherlock Holmes
6 / 110
Why Bad Data Affects Results
7 / 110
Why Bad Data Affects Results
8 / 110
Group Discussion
Can you share an example where bad data has affected one of your projects?
9 / 110
Visualising Data - Tools of the Trade
Always visualise your data
How?
10 / 110
Visualising Data - Histograms
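As a sketch of the idea, a histogram of a single column can be produced with matplotlib; the skewed "fare"-like data here is synthetic, generated purely for illustration:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Synthetic, right-skewed data standing in for a real column such as a fare.
rng = np.random.default_rng(42)
fares = rng.lognormal(mean=3.0, sigma=0.8, size=1000)

fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(fares, bins=30)
ax.set_xlabel("fare")
ax.set_ylabel("count")
fig.savefig("fares_hist.png")
plt.close(fig)
```

Skew, multi-modality, and stray values all show up immediately in a plot like this, long before any summary statistic would flag them.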
11 / 110
Visualising Data - Missing data
12 / 110
Visualising Data - Missing data
13 / 110
Visualising Data - Missing data
14 / 110
Visualising Data - Correlation
15 / 110
Visualising Data - Correlation
16 / 110
Visualising Data - Beware of Summary Statistics
Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through
Simulated Annealing, Justin Matejka, George Fitzmaurice
17 / 110
Notebook Exercise 1: Visualising Data
18 / 110
Bilogur, (2018). Missingno: a missing data visualization suite. Journal of Open Source Software, 3(22), 547,
https://doi.org/10.21105/joss.00547 https://github.com/ResidentMario/missingno
19 / 110
Data Availability
The availability of data defines what you can and can't use (see nullity plots).
Keep as much detail as possible
Preserve versions
CR not CRUD! (create and read, but avoid update and delete)
20 / 110
Data Consistency
Consistent data is stable (over time, space, ...)
Can improve the quantity and quality of data, and hence improve model performance.
Use consistent definitions for metrics
21 / 110
Data Leakage
Very easy to accidentally include future data in training data.
Oversampling
Running dimensionality reduction on the whole dataset
Preprocessing over the whole dataset
Including a feature that is only populated after the label has been applied
22 / 110
Examples:
https://www.reddit.com/r/MachineLearning/comments/965zgf/d_tell_me_about_how_you_were_a_victim_of_data/
Book: Janert, P.K. Data Analysis with Open Source Tools: A Hands-On Guide for Programmers and Data Scientists.
O’Reilly Media, 2010. https://amzn.to/2VFqOYx.
23 / 110
Types of Data
Recap
Different data types have to be cleaned in different ways
24 / 110
Types of Data - Categorical
Nominal: labels, not ordered
male, female
UK, USA, Netherlands
Ordinal: ordered
great, average, rubbish
large, medium, small
25 / 110
Types of Data - Numerical
Continuous: can't be counted, but can be measured
Temperature
Distance
Discrete (integers): can be counted
Website visitors
Ranking
Aside: interval data and ratio data.
26 / 110
Data Types in Statistics, Niklas Donges https://towardsdatascience.com/datatypesinstatistics347e152e8bee
27 / 110
Corrupted data
Pessimist:
Data that is wrong and cannot be used
Optimist:
Something interesting...
28 / 110
Missing data
Missing data doesn't necessarily mean numpy.nan!
>>> print(titanic.count())
pclass 1309
survived 1309
name 1309
sex 1309
age 1046
sibsp 1309
parch 1309
ticket 1309
fare 1308
cabin 295
embarked 1307
boat 486
body 121
home.dest 745
dtype: int64
29 / 110
Fixing missing data
Remove (rows or columns)
Impute Simple
Natural null
Mean
Median
Impute Complex
Regression
Random Sampling
Jitter
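The simple options above can be sketched with pandas; the toy frame and its values are invented for illustration:

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in the 'age' column.
df = pd.DataFrame({"age": [22.0, np.nan, 38.0, np.nan, 26.0],
                   "fare": [7.25, 71.28, 8.05, 8.46, 7.92]})

dropped = df.dropna()                                    # remove rows with any missing value
mean_filled = df.fillna({"age": df["age"].mean()})       # simple mean imputation
median_filled = df.fillna({"age": df["age"].median()})   # the median is more robust to outliers
```

Which option is right depends on how much data you can afford to lose and whether the missingness itself is informative.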
30 / 110
Notebook Exercise 2: Fix missing data
31 / 110
Noise: What is noise?
Weeds are just flowers that you don't like. Noise is data that you don't like.
32 / 110
Noise: Types of Noise
Class
Feature (column)
Observation (row)
33 / 110
Noise: Improving Noise
Bad news: As much as you keep picking those weeds, they keep reappearing.
Aggregation
Average (stacking/beamforming/radon transform)
Median (popcorn noise)
Simple modelling
Smoothing
Normalisation
Complex modelling
Regression or fitting
Dimensionality Reduction and Restoration
Transformations (FFT, Wavelet)
Encoding/Embedding (Autoencoder, NLP Embeddings)
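As an illustration of the aggregation options, here is a centred rolling mean and rolling median applied to a synthetic noisy signal; the signal, noise level, and window size are all arbitrary choices:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 200)
signal = np.sin(2 * np.pi * t)
noisy = signal + rng.normal(scale=0.3, size=t.size)
noisy[[50, 120]] += 5.0  # inject popcorn (impulse) spikes

s = pd.Series(noisy)
mean_smoothed = s.rolling(window=9, center=True, min_periods=1).mean()
median_smoothed = s.rolling(window=9, center=True, min_periods=1).median()
```

The rolling median suppresses the impulse spikes almost entirely, whereas the rolling mean smears them across the window, which is why the median is the usual choice for popcorn noise.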
34 / 110
Anomalies
a.k.a. outliers, inconsistencies
Data that is not expected (in a statistical sense)
35 / 110
Anomaly Types
Contextual: possibly good
Corrupted: usually not good
Measurement errors or failures
API changes
Regulatory changes
Shift in behaviour
Formatting changes
36 / 110
Detecting Anomalies
A large field in its own right
1. Define what is normal (through a model)
2. Set a threshold to define "not normal"
37 / 110
Detecting Anomalies for Data Cleaning
1. Visualise your data!
2. Everything else
a. Classification task
b. Clustering
c. Regression/fitting + thresholds
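The fitting-plus-threshold option can be sketched with a robust z-score: model "normal" with the median and MAD, then threshold. The data and the 3.5 cut-off are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=10.0, scale=2.0, size=500)
x[10] = 30.0  # plant an obvious anomaly

# Robust z-score: median and MAD resist the very outliers we are hunting.
median = np.median(x)
mad = np.median(np.abs(x - median))
robust_z = 0.6745 * (x - median) / mad  # 0.6745 scales MAD to a std-dev equivalent
outliers = np.flatnonzero(np.abs(robust_z) > 3.5)
```

Using the median and MAD rather than the mean and standard deviation matters here: the anomalies themselves would otherwise inflate the estimate of "normal".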
38 / 110
Example Regression Task - Wine Quality
39 / 110
40 / 110
41 / 110
Winsorizing
Clip data that lies outside a specified percentile (typically the 5th and 95th) to the percentile value; removing such data entirely instead is called trimming
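scipy provides this directly via `scipy.stats.mstats.winsorize`, which caps the tails at the percentile values rather than dropping them; the data and 10% limits below are illustrative:

```python
import numpy as np
from scipy.stats.mstats import winsorize

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])
# Cap the lowest and highest 10% of values at the nearest remaining value.
clipped = np.asarray(winsorize(data, limits=[0.1, 0.1]))
```

Here the 100.0 becomes 9.0 and the 1.0 becomes 2.0; no rows are lost.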
42 / 110
Notebook Exercise 3: Detect and remove outliers
43 / 110
Quick intro to handling missing data: https://towardsdatascience.com/thetaleofmissingvaluesinpython
c96beb0e8a9d
Pandas documentation on missing data: https://pandas.pydata.org/pandas-docs/stable/missing_data.html
Bit more information about anomaly detection: https://towardsdatascience.com/anoteaboutfindinganomalies
f9cedee38f0b
Good short free book on anomaly detection: Practical Machine Learning: A New Look at Anomaly Detection, Ted
Dunning, Ellen Friedman, O'Reilly Media, Inc., 2014, ISBN 1491914181, 9781491914182
Cool Library for benchmarking time series anomaly detection: https://github.com/numenta/NAB
Nice run through of daytoday problems with data: https://medium.com/@bertil_hatt/whatdoesbaddatalooklike
91dc2a7bcb7a
Short section on dealing with corrupted data: Raschka, S. Python Machine Learning. Packt Publishing, 2015.
https://books.google.co.uk/books?id=GOVOCwAAQBAJ.
44 / 110
Transforming Data
45 / 110
Recap: Statistical Distributions
46 / 110
Non-normal Data
47 / 110
Normal Q-Q Plots
If the two distributions being compared are equal, then the Q-Q plot follows the 45° (y = x) line
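scipy's `probplot` draws exactly this comparison against a normal distribution; the sample here is synthetic:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.normal(size=500)

fig, ax = plt.subplots()
(osm, osr), (slope, intercept, r) = stats.probplot(sample, dist="norm", plot=ax)
fig.savefig("qq.png")
plt.close(fig)
```

For genuinely normal data the correlation r of the Q-Q points is close to 1; heavy tails or skew bend the ends of the line away from y = x.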
48 / 110
Distribution Fitting
49 / 110
Others
Box plots
50 / 110
Notebook Exercise 4: Is it Normal?
51 / 110
Group Discussion: Why is normality important?
52 / 110
Why Normality is Important
53 / 110
Look again at the parameters of all these distributions. Note how few of them use "mean".
The vast majority of data cannot be represented by a mean, and an algorithm that assumes one will not work.
54 / 110
55 / 110
56 / 110
Fixing: Domain Knowledge
57 / 110
58 / 110
59 / 110
60 / 110
Fixing: Arbitrary Functions
We can use any mathematical function to transform our data*
*so long as it's invertible
61 / 110
62 / 110
63 / 110
64 / 110
65 / 110
66 / 110
67 / 110
Box-Cox Transform
68 / 110
y = (x**lambda - 1) / lambda,  for lambda != 0
y = log(x),                    for lambda = 0
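scipy implements this transform and fits lambda by maximum likelihood; the lognormal sample below is synthetic, chosen so the fitted lambda should land near 0 (i.e. close to a plain log transform):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # strictly positive, right-skewed

# boxcox both transforms the data and returns the lambda it fitted.
transformed, fitted_lambda = stats.boxcox(skewed)
```

Note that Box-Cox requires strictly positive input; shift the data first if it contains zeros or negatives.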
69 / 110
70 / 110
71 / 110
Notebook Exercise 5: Fix Non-Normal Data
72 / 110
Presentation on Seaborn Styles
https://s3.amazonaws.com/assets.datacamp.com/production/course_6919/slides/chapter2.pdf
Code to fit all distributions: https://stackoverflow.com/questions/6620471/fittingempiricaldistributiontotheoretical
oneswithscipypython
Box Cox Plot, Python Data Analysis Cookbook, https://learning.oreilly.com/library/view/pythondata
analysis/9781785282287/ch04s05.html
More boxcox code: http://dataunderthehood.com/2018/01/15/boxcoxtransformationwithpython/
Log transforms and BoxCox Zheng, A., and A. Casari. Feature Engineering for Machine Learning: Principles and
Techniques for Data Scientists. O’Reilly Media, 2018. https://books.google.co.uk/books?id=sthSDwAAQBAJ.
73 / 110
Working with scales
74 / 110
Why scale is important
Many optimisation algorithms expect your data to have equal scales.
The amount of error is defined by the scale
Optimisers spend more "effort" trying to improve large errors
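Two common fixes, sketched with scikit-learn on a made-up two-column array whose scales differ by three orders of magnitude:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0]])

standardised = StandardScaler().fit_transform(X)  # zero mean, unit variance per column
minmaxed = MinMaxScaler().fit_transform(X)        # each column mapped onto [0, 1]
```

Fit the scaler on the training split only and then apply it to the test split; fitting on everything is a form of data leakage (see earlier).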
75 / 110
76 / 110
Altering the scale of numerical values
(Table comparing scaling methods: name, resulting range, and when to use each.)
77 / 110
78 / 110
Handling categorical data
We can't pass raw categorical data into the algorithms.
Create new columns for each possible category (One-Hot Encoding)
X = [
{'sex': 'female', 'location': 'Europe', 'age': 33},
{'sex': 'male', 'location': 'US', 'age': 65},
{'sex': 'female', 'location': 'Asia', 'age': 48},
]
79 / 110
One-hot encoding is simple, but uses one more column than strictly necessary (the all-zeros pattern could represent one of the categories).
So often we use dummy encoding:
X = [
{'sex': 'female', 'location': 'Europe', 'age': 33},
{'sex': 'male', 'location': 'US', 'age': 65},
{'sex': 'female', 'location': 'Asia', 'age': 48},
]
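pandas' `get_dummies` with `drop_first=True` produces dummy coding, turning k categories into k - 1 columns; a minimal sketch of the same records:

```python
import pandas as pd

df = pd.DataFrame([
    {'sex': 'female', 'location': 'Europe', 'age': 33},
    {'sex': 'male', 'location': 'US', 'age': 65},
    {'sex': 'female', 'location': 'Asia', 'age': 48},
])

# drop_first=True drops one category per feature: k categories -> k - 1 columns.
dummies = pd.get_dummies(df, columns=['sex', 'location'], drop_first=True)
```

A row of all zeros in the generated columns then unambiguously means the dropped category (here 'female' and 'Asia').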
80 / 110
Other Techniques
Effect coding: provide a numeric cost/benefit according to the category
Feature hashing: maps an unbounded number of categories to a limited range; used for large datasets
Bin counting: maps common categories (or events) to how often they occur
81 / 110
Notebook Exercise 6: Fixing Scales and
Categorical Data
82 / 110
Good section on categorical variables Zheng, A., and A. Casari. Feature Engineering for Machine Learning:
Principles and Techniques for Data Scientists. O’Reilly Media, 2018. https://books.google.co.uk/books?
id=sthSDwAAQBAJ.
83 / 110
Derived variables
New features that are calculated from other features
84 / 110
Why new data can be better than the original
Better features lead to:
Preventing overfitting
Reduced computational complexity
Simpler model explanations
Greater levels of robustness
Note that:
Accuracy rarely increases (more complex models can compensate for more complex data)
Other scores may increase as better predictions are made (e.g. right class, fewer errors)
85 / 110
Domain-specific feature generation
Many domains have:
important combinations of features:
velocity, body mass index, price-earnings ratio, queries per second, etc.
important domain transformations
FFT, Convolutions, Embeddings, etc.
86 / 110
Multiplication or division (ratio)
Change over time, or rate
Subtraction of a baseline
Normalisation: normalising one variable with respect to another. E.g. the raw number of failures is probably not
that useful, but a failure rate as a percentage (i.e. number of failures / total requests) could be very useful.
Note: This is effectively the same as the previous sections. But now we're applying transformations to create more
informative features.
87 / 110
Brute-Force Combinations
Polynomials
Combinations of features
88 / 110
Arbitrary Transformations
Logarithmic or Power laws
Kernels (a type of embedding)
89 / 110
Group discussion: Do you think that feature
extraction can ever be automatic?
90 / 110
Feature selection
91 / 110
Why select features?
To reduce computational complexity
To increase robustness
To simplify
To prevent overfitting
92 / 110
How to select features
Filtering: "Does this feature represent the label?"
Correlation
Mutual information
Model inference: "Which features are the most informative?"
Trees
Regularisation
Brute force (a.k.a. Wrapper approach): "Which features produce the best model?"
Others: Genetic algorithms, neural networks with dropout
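The filtering approach can be sketched with scikit-learn's mutual information scorer on a synthetic problem where, by construction, only the first three features are informative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# shuffle=False keeps the 3 informative features in the first 3 columns.
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

scores = mutual_info_classif(X, y, random_state=0)
```

The informative columns should receive clearly higher scores than the noise columns; a filter then simply keeps the top-k.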
93 / 110
Model Inference
Using decision trees
Recap:
Trees establish rules to segment the data using a purity measure.
The feature with the best split (most pure) is deemed the most informative
94 / 110
95 / 110
Brute Force
1. Iteratively add or remove features whilst measuring performance.
2. Remove features that don't improve performance
Pros:
Related to your chosen performance metric
Cons:
Computationally expensive
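Recursive feature elimination is a common wrapper of this kind: it repeatedly fits a model, drops the weakest feature, and refits. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X, y)
mask = selector.support_  # boolean mask of the surviving features
```

In practice, wrap this in cross-validation (e.g. scikit-learn's RFECV) so the choice of features is scored on held-out data rather than the training set.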
96 / 110
97 / 110
Notebook Exercise 7: Model Improvement through
Feature Selection
98 / 110
Great review paper of feature selection methods: http://www.jmlr.org/papers/volume3/guyon03a/guyon03a.pdf
Guyon, Isabelle, and André Elisseeff. "An introduction to variable and feature selection." Journal of machine
learning research 3, no. Mar (2003): 1157-1182.
99 / 110
Series variables
100 / 110
Group Discussion: How do you think time-series
data can be corrupted?
101 / 110
How Ordered Data Differs
Difficult to throw away data
Jumps or spikes in the data can seriously upset algorithms
Usually affected by moving biases and noise
Seasonality
Changes in variance
Autocorrelation
The current observation is usually correlated with past and future observations
We can't randomly shuffle data
102 / 110
Applying preprocessing techniques to series data
Removing outliers: similar to before
Imputing missing data
Usually reliant on a filtering method. E.g. exponential decay smoothing
Or modelling method. E.g. interpolation
Suppressing noise
Time series filters
Seasonality
Decompose the seasonal elements. For example: trends, cyclic changes, etc.
Several python libraries: statsmodels and prophet are prominent
Scaling
Similar to before, but be careful about changes in the trend
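Imputation and smoothing for an ordered series can be sketched with pandas; the daily series and its gap positions are invented:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2019-01-01", periods=10, freq="D")
s = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0, np.nan, np.nan, 8.0, 9.0, 10.0], index=idx)

filled = s.interpolate(method="time")  # fill gaps from neighbouring observations
smoothed = filled.rolling(window=3, center=True, min_periods=1).median()
```

Interpolation respects the ordering, unlike filling with a global mean, and a short rolling median suppresses isolated spikes without shifting the series.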
103 / 110
Notebook Exercise 8: Cleaning Time-Series Data
104 / 110
Related Techniques
Things we haven't had time to talk about
105 / 110
Dimensionality Reduction
Aggregation, MDS, ICA, PCA, TSNE, etc.
All part of the data cleaning process and another way to simplify/reduce data.
106 / 110
Data Integration
ETL
Moving data
Ingesting data
Organising data
Storing data
Exposing data
Software Engineering
Monitoring
Testing
107 / 110
Data governance
Access, authentication and authorisation
Security
Auditing/Monitoring
Privacy
108 / 110
Data integration: https://en.wikipedia.org/wiki/Data_integration
Data governance: https://en.wikipedia.org/wiki/Data_governance
109 / 110
Contact Details
Dr. Phil Winder
Website: https://WinderResearch.com
Email: phil@WinderResearch.com
Twitter: @DrPhilWinder
LinkedIn: https://www.linkedin.com/in/DrPhilWinder/
110 / 110