This document discusses several challenges in data wrangling and cleaning: 1) fixed-width and delimited file formats, and how missing or unused values appear in them; 2) options for dealing with missing values, such as excluding rows or imputation; 3) methods for detecting outliers, such as trimmed means and density-based clustering.
8/9/2021

Data Wrangling
Amit Das
amit.das@krea.edu.in

Fixed width and delimited files

• Fixed-width file format
  • All rows are of equal length
  • Each column is allocated a fixed width (in digits / characters)
  • Unused positions are filled with padding characters (e.g. spaces or NUL)
  • The application must “know” the layout
• Delimited file formats
  • Columns are separated by delimiters (spaces, commas, others …)
  • Rows may be of unequal length
  • A row might continue across multiple lines
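
The two formats above can be read roughly as follows. This is a minimal sketch using only the Python standard library; the column layout, field names, and sample rows are made up for illustration and are not from the slides.

```python
import csv
import io

# Hypothetical layout for illustration: (column name, start offset, end offset).
# A fixed-width reader cannot work without this knowledge of the layout.
LAYOUT = [("name", 0, 10), ("age", 10, 13), ("city", 13, 23)]

def parse_fixed_width(line, layout=LAYOUT):
    """Slice one fixed-width row by position and strip space/NUL padding."""
    return {name: line[start:end].strip(" \x00") for name, start, end in layout}

def parse_delimited(text, delimiter=","):
    """Parse delimited text; the csv module copes with quoted fields that
    contain the delimiter or a line break (a row spanning several lines)."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))

# Build a padded row so the widths match LAYOUT exactly (10 + 3 + 10).
fixed_row = "Asha".ljust(10) + "034" + "Chennai".ljust(10)
record = parse_fixed_width(fixed_row)

# The last field is quoted and continues on a second physical line.
delimited_text = 'name,age,note\nAsha,34,"likes tea,\ncoffee"\n'
rows = parse_delimited(delimited_text)

print(record)
print(rows)
```

Splitting a delimited file on line breaks and commas by hand would break the quoted field in the second example, which is why a dedicated parser is used here.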


Missing values
• If the value for an attribute (a column) is missing
  • It will show up as a short row in a fixed-width file
  • It may be harder to detect in a delimited file
  • Use a missing-value indicator such as NULL, NaN, or some other agreed string
• Failure to detect missing values can corrupt the reading of the entire file
• Missing values must be understood properly (why is the value missing?)
  • No response
  • For survey data, is it “WILL NOT ANSWER” or “NOT APPLICABLE”?
  • Zero response: this should NOT be recorded as a missing value
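
One way to make missing values explicit while reading a delimited file is sketched below. The set of markers, the function name `read_with_missing`, and the sample data are assumptions for illustration; a real file may use other indicators.

```python
import csv
import io

# Assumed set of missing-value markers; adjust to the file at hand.
MISSING_MARKERS = {"", "NULL", "NaN", "NA"}

def read_with_missing(text, expected_columns):
    """Return (rows, problems).

    Marker strings become None. Rows with the wrong number of fields are
    reported separately instead of being silently misaligned.
    """
    rows, problems = [], []
    for row_no, fields in enumerate(csv.reader(io.StringIO(text)), start=1):
        if len(fields) != expected_columns:
            problems.append((row_no, fields))
            continue
        cleaned = [None if f.strip() in MISSING_MARKERS else f.strip() for f in fields]
        rows.append(cleaned)
    return rows, problems

# Row 3 has an explicit marker, row 4 is short, and row 2 has a genuine zero.
sample = "id,income,children\n1,52000,0\n2,NULL,2\n3,41000\n"
rows, problems = read_with_missing(sample, expected_columns=3)

print(rows)
print(problems)
```

Note that the value "0" in the `children` column is kept as data: a zero response is not a missing value.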

Dealing with missing values

• SAFE OPTION: Exclude rows with (any) missing values (“LISTWISE” deletion)
  • Downsides
    • Loss of sample size
    • Bias unless values are missing completely at random (systematic non-response biases the remaining sample)
• OTHER OPTIONS
  • Some analyses can use incomplete data (“PAIRWISE” deletion)
  • In a correlation matrix, pairwise correlations can have different sample sizes
• UNSAFE OPTION: IMPUTATION
  • Replace missing values with the means of the columns
  • “Predict” missing values from other attributes in the same row
  • “Predict” missing values by comparison with “similar” rows
  • YOU ARE MAKING UP DATA, AT YOUR PERIL!
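
The three options above can be compared on a toy table. This is a sketch only: the data are invented, `None` stands for a missing value, and the helper names are hypothetical. Only the simplest imputation (column mean) is shown.

```python
from statistics import mean

# Made-up rows for illustration; None marks a missing value.
data = [
    {"x": 1.0, "y": 2.0, "z": 5.0},
    {"x": 2.0, "y": None, "z": 3.0},
    {"x": 3.0, "y": 6.0, "z": None},
    {"x": 4.0, "y": 8.0, "z": 1.0},
]

def listwise(rows):
    """Keep only rows with no missing value in any column."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise(rows, a, b):
    """Keep the (a, b) pairs where both values are present."""
    return [(r[a], r[b]) for r in rows if r[a] is not None and r[b] is not None]

def impute_mean(rows, col):
    """Fill missing values in `col` with the mean of the observed values."""
    fill = mean(r[col] for r in rows if r[col] is not None)
    return [{**r, col: fill if r[col] is None else r[col]} for r in rows]

print(len(listwise(data)))             # rows that survive listwise deletion
print(len(pairwise(data, "x", "y")))   # sample size for the x-y pair
print(len(pairwise(data, "y", "z")))   # sample size for the y-z pair
print(impute_mean(data, "y")[1]["y"])  # an invented value, not an observation
```

Listwise deletion halves this sample, while each pairwise comparison keeps a different subset of rows, so entries of a pairwise correlation matrix would rest on different sample sizes.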


Outlier detection
• Single extreme values (univariate)
  • Assuming a normal distribution
  • Unlikely values (on the tails) may be discarded / capped
  • “x% Trimmed Mean”
  • Unlikely ≠ Impossible
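
A trimmed mean and a simple cap can be sketched as below. Conventions differ on whether "x%" is removed from each tail or in total; this sketch assumes x% from each tail. The function names and the salary figures are made up for illustration.

```python
from statistics import mean

def trimmed_mean(values, percent):
    """Drop `percent`% of the observations from EACH tail, then average.

    Assumption: the trimming percentage applies per tail, not in total.
    """
    if not 0 <= percent < 50:
        raise ValueError("percent must be in [0, 50)")
    ordered = sorted(values)
    k = int(len(ordered) * percent / 100)
    return mean(ordered[k:len(ordered) - k])

def cap(values, low, high):
    """Replace values below `low` or above `high` with the nearest limit."""
    return [min(max(v, low), high) for v in values]

# Invented salaries with one extreme value.
salaries = [30, 32, 35, 36, 38, 40, 41, 43, 45, 900]

print(mean(salaries))              # pulled up by the extreme value
print(trimmed_mean(salaries, 10))  # drops the smallest and largest value
print(cap(salaries, 30, 50))       # extreme value capped, not discarded
```

Discarding or capping is a modeling decision: the extreme value may be an error, or it may be a real but rare observation.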

Outlier detection (2)

• Single extreme values (univariate)
• Without assuming normal distribution
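
The slide does not name a specific method here. One common distribution-free choice, shown purely as an assumption, is the interquartile-range rule (Tukey's fences): flag values that lie more than k times the IQR beyond the quartiles. The data are invented.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR].

    k = 1.5 is the conventional default; it is a rule of thumb, not a law.
    """
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Invented salaries with one extreme value.
salaries = [30, 32, 35, 36, 38, 40, 41, 43, 45, 900]
flagged = iqr_outliers(salaries)
print(flagged)
```

Because quartiles depend on ranks rather than magnitudes, the fences are barely moved by the extreme value they are meant to detect.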


Density-based clustering
• Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996.
• Applicable to multi-dimensional data (where outliers are difficult to spot manually)
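
To show how DBSCAN separates dense regions from noise, here is a small pure-Python sketch of the algorithm. It is a teaching version with a brute-force neighbour search, not an optimized implementation; the points and the parameter values (`eps`, `min_pts`) are made up.

```python
from math import dist

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is also core: keep expanding
                queue.extend(j_neighbors)
    return labels

# Two dense groups and one isolated point (invented coordinates).
points = [
    (1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2),
    (8.0, 8.0), (8.1, 7.9), (7.9, 8.2), (8.2, 8.1),
    (4.5, 20.0),
]
labels = dbscan(points, eps=0.5, min_pts=3)
print(labels)
```

Points labelled -1 belong to no dense region and are the outlier candidates. The same code works unchanged for points with more than two coordinates, which is where manual inspection stops being practical.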

Syntactic vs semantic data cleaning

• Sometimes the meaning of the data is clear, though it does not match exactly.
• ISO 3166-1 alpha-3: three-letter country codes
  IDN Indonesia
  IMN Isle of Man
  IND India
  IOT British Indian Ocean Territory
  IRL Ireland
  IRN Iran (Islamic Republic of)
  IRQ Iraq
• Write code to reconcile other (variant) spellings.
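
A possible starting point for such reconciliation code is sketched below, using only the codes listed above. The alias table, the helper names, and the similarity cutoff are assumptions for illustration; real data would need a much longer alias list.

```python
import difflib
import re

# Codes and official names from the table above.
ISO_ALPHA3 = {
    "IDN": "Indonesia",
    "IMN": "Isle of Man",
    "IND": "India",
    "IOT": "British Indian Ocean Territory",
    "IRL": "Ireland",
    "IRN": "Iran (Islamic Republic of)",
    "IRQ": "Iraq",
}

# Hypothetical variant spellings, keyed in normalized form.
ALIASES = {"iran": "IRN", "republic of ireland": "IRL", "eire": "IRL"}

def normalize(name):
    """Lower-case, drop punctuation, and collapse runs of whitespace."""
    letters_only = re.sub(r"[^a-z\s]", "", name.casefold())
    return re.sub(r"\s+", " ", letters_only).strip()

LOOKUP = {**{normalize(n): c for c, n in ISO_ALPHA3.items()}, **ALIASES}

def to_alpha3(raw, cutoff=0.85):
    """Map a free-text country name to an alpha-3 code, or None if unsure."""
    key = normalize(raw)
    if key.upper() in ISO_ALPHA3:  # the input is already a code
        return key.upper()
    if key in LOOKUP:              # exact match after normalization
        return LOOKUP[key]
    # Fuzzy fallback for typos; a high cutoff avoids confusing near neighbours.
    close = difflib.get_close_matches(key, list(LOOKUP), n=1, cutoff=cutoff)
    return LOOKUP[close[0]] if close else None

for raw in ["india", " Indonesia ", "Iran", "Indonsia", "IND", "Atlantis"]:
    print(repr(raw), "->", to_alpha3(raw))
```

The cutoff is deliberately strict: names such as Iran and Iraq differ by one letter, so a loose fuzzy match could silently assign the wrong country. Returning `None` leaves the unresolved cases for a human to review.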


Wide and long forms of data (e.g. time series)

• Long data is sometimes called “tidy” data
• Some analyses require one or the other
• Tidy data is handled better by machines (normalized)
• Wide data may be easier for human readers
• Stats software (and Python, R) has routines for reshaping data
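
The reshaping itself is simple enough to sketch in plain Python. In practice a library routine would be used (for example pandas `melt` and `pivot`); the hand-written helpers and the data below are made up to show what those routines do.

```python
# Wide form: one row per country, one column per year (invented values).
wide = [
    {"country": "IND", "2019": 10, "2020": 12},
    {"country": "IRL", "2019": 7, "2020": 8},
]

def to_long(rows, id_col, var_name, value_name):
    """Wide -> long: one output row per (id, variable) pair."""
    return [
        {id_col: r[id_col], var_name: key, value_name: value}
        for r in rows
        for key, value in r.items()
        if key != id_col
    ]

def to_wide(rows, id_col, var_name, value_name):
    """Long -> wide: gather all variables of one id back into a single row."""
    out = {}
    for r in rows:
        out.setdefault(r[id_col], {id_col: r[id_col]})[r[var_name]] = r[value_name]
    return list(out.values())

long_rows = to_long(wide, "country", "year", "value")
for row in long_rows:
    print(row)

# Reshaping back recovers the original table.
print(to_wide(long_rows, "country", "year", "value") == wide)
```

In the long form, adding another year means adding rows rather than a new column, which is one reason it is easier for programs to process.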

A modern view of data cleaning

• Data cleaning requires domain knowledge
  • Number of cylinders: 3, 4, 6, 7, 8, 10, 12, 16 … in automobiles: which ones are meaningful?
  • Orders of magnitude: GHz, ns, microns / nm … in electronics
• The “problem” lies in the “brittleness” of learning algorithms
  • Some can run on incomplete data, others cannot
  • Some are strongly affected by noise in data, others are more robust
  • Mean vs Median
• “Data Cleaning” is a legitimate part of the modeling process

Bill Gates, Warren Buffett, LeBron James, Lionel Messi: outliers?
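
The mean-versus-median contrast can be seen with a handful of numbers. The incomes below are invented for illustration (in thousands); the last one stands in for a single very high earner.

```python
from statistics import mean, median

# Made-up annual incomes, in thousands, for illustration only.
incomes = [40, 45, 50, 55, 60]
with_outlier = incomes + [100_000]

print(mean(incomes), median(incomes))
print(mean(with_outlier), median(with_outlier))
```

One extreme but genuine observation moves the mean by orders of magnitude and barely shifts the median. Whether that observation should be removed depends on the question being asked, not on the arithmetic.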
