Data Wrangling
Amit Das
amit.das@krea.edu.in
8/9/2021
Missing values
• The value for an attribute (a column) may be missing from a record
  • Shows up as a short row in a fixed-width file format
  • May be harder to detect in a delimited file
• Use a missing value indicator such as NULL, NaN, or some other agreed string
• Failure to detect missing values can corrupt the entire file when it is read
• Missing values must be understood properly (why is the value missing?)
  • No response
  • For survey data, is it “WILL NOT ANSWER” or “NOT APPLICABLE”?
  • Zero response: this should NOT be recorded as a missing value
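
The checks above can be sketched in a few lines of Python. This is only an illustration using the standard library: the sample data, the column names, the MISSING_MARKERS set and the read_with_missing helper are all hypothetical, and a real file may use different indicators.

```python
import csv
import io

# Hypothetical survey extract; column names and values are made up.
RAW = """id,age,children,income
1,34,0,52000
2,NULL,2,61000
3,41,,NaN
4,29,1
"""

# Assumed set of strings that mean "missing" in this particular file.
MISSING_MARKERS = {"", "NULL", "NaN", "NA"}

def read_with_missing(text, markers=MISSING_MARKERS):
    """Parse delimited text and turn missing-value markers into None.

    Returns (rows, short_rows); short_rows holds the line numbers of data
    rows that have fewer fields than the header.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows, short_rows = [], []
    for line_no, fields in enumerate(reader, start=2):
        if not fields:
            continue  # skip completely blank lines
        if len(fields) < len(header):
            short_rows.append(line_no)
            # Pad so later columns do not silently shift or vanish.
            fields = fields + [""] * (len(header) - len(fields))
        rows.append({
            name: (None if value.strip() in markers else value.strip())
            for name, value in zip(header, fields)
        })
    return rows, short_rows

rows, short_rows = read_with_missing(RAW)
missing_per_column = {
    name: sum(row[name] is None for row in rows) for name in rows[0]
}

print(short_rows)           # [5]
print(missing_per_column)   # {'id': 0, 'age': 1, 'children': 1, 'income': 2}
print(rows[0]["children"])  # 0 (a genuine zero, kept as a value)
```

Note that the zero in the children column is kept as data, while the empty field and the NULL / NaN markers are counted as missing. Why each value is missing still has to be decided by a person.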
Outlier detection
• Single extreme values (univariate)
• Assuming normal distribution
• Unlikely values (on the tails) may be discarded / capped
• “x% Trimmed Mean”
• Unlikely ≠ Impossible
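
A minimal sketch of these ideas, using only the Python standard library. The readings are made up, the 3-standard-deviation threshold is a common convention rather than a rule, and the trimmed mean here is assumed to drop x% of the observations from each tail (some definitions trim x% in total). The helper names z_score_outliers, trimmed_mean and cap are hypothetical.

```python
import statistics

# Made-up sample: 19 readings near 50 plus one extreme value.
readings = [48, 52, 50, 49, 51, 47, 53, 50, 49, 51,
            50, 48, 52, 50, 49, 51, 50, 50, 50, 120]

def z_score_outliers(values, threshold=3.0):
    """Values more than `threshold` sample standard deviations from the mean.

    Only meaningful if the data are roughly normally distributed.
    """
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / sd > threshold]

def trimmed_mean(values, percent):
    """Mean after dropping `percent`% of the observations from EACH tail."""
    ordered = sorted(values)
    k = int(len(ordered) * percent / 100)
    if 2 * k >= len(ordered):
        raise ValueError("percent trims away every observation")
    return statistics.fmean(ordered[k:len(ordered) - k])

def cap(values, lower, upper):
    """Pull values outside [lower, upper] back to the nearest bound."""
    return [min(max(v, lower), upper) for v in values]

mean = statistics.fmean(readings)
sd = statistics.stdev(readings)
outliers = z_score_outliers(readings)
robust_mean = trimmed_mean(readings, 5)
capped = cap(readings, mean - 3 * sd, mean + 3 * sd)

print(outliers)               # [120]
print(round(mean, 2))         # 53.5
print(round(robust_mean, 2))  # 50.17
```

The single extreme reading pulls the ordinary mean up to 53.5, while the 5% trimmed mean stays close to 50. Whether 120 should be discarded, capped, or kept depends on whether it is merely unlikely or actually impossible for the quantity being measured.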
Density-based clustering
• Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996.
• Applicable to multi-dimensional data, where outliers are difficult to spot manually
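
To show how density-based clustering flags outliers, here is a simplified pure-Python sketch of the DBSCAN idea. It is not the authors' reference implementation: it uses a brute-force neighbour search, and the 3-D points, eps and min_pts values are made up for illustration. Points that end up in no dense region are labelled as noise, which makes them candidate outliers.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1).

    eps     : neighbourhood radius (Euclidean distance)
    min_pts : minimum neighbours, counting the point itself, for a core point
    """
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbours(i):
        # Brute-force search: indices of all points within eps of point i.
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE      # may later be claimed as a border point
            continue
        cluster += 1               # point i is a core point: start a cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: join, but do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is also a core point: keep growing
    return labels

# Made-up 3-D data: two dense groups and one isolated point.
points = [
    (1.0, 1.0, 1.0), (1.2, 0.9, 1.1), (0.9, 1.1, 1.0), (1.1, 1.0, 0.9),
    (8.0, 8.0, 8.0), (8.1, 7.9, 8.2), (7.9, 8.1, 8.0), (8.0, 8.2, 7.9),
    (4.0, 15.0, 0.0),
]
labels = dbscan(points, eps=0.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The last point receives the label -1 because it has no dense neighbourhood. The result is sensitive to eps and min_pts, so in practice both are tuned to the data, and a library implementation with an indexed neighbour search is preferable for large data sets.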