
Data quality issue

Fix rows and columns:
  Incorrect rows
  Summary rows
  Extra rows
  Missing column names
  Inconsistent column names
  Unnecessary columns
  Columns containing multiple data values
  No unique identifier
  Misaligned columns

Missing values:
  Disguised missing values
  Significant number of missing values in a row/column
  Partial missing values

Standardise numbers:
  Non-standard units
  Values with varying scales
  Over-precision
  Remove outliers

Standardise text:
  Extra characters
  Different cases of same words
  Non-standard formats

Fix invalid values:
  Encoding issues
  Incorrect data types
  Correct values not in list
  Wrong structure
  Correct values beyond range
  Validate internal rules

Filter data:
  Duplicate data
  Extra/unnecessary rows
  Columns not relevant to analysis
  Dispersed data
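
Examples

Fix rows and columns:
  Incorrect rows: header rows, footer rows
  Summary rows: total, subtotal rows
  Extra rows: column numbers, indicators, blank rows
  Missing column names: column names as blanks, NA, XX, etc.
  Inconsistent column names: X1, X2, C4, which give no information about the column
  Unnecessary columns: unidentified columns, irrelevant columns, blank columns
  Columns containing multiple data values: e.g. address columns containing city, state, country
  No unique identifier: e.g. multiple cities with the same name in a column
  Misaligned columns: shifted columns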

Missing values:
  Disguised missing values: blank strings, "NA", "XX", "999", etc.
  Partial missing values: missing time zone, century, etc.

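Standardise numbers:
  Non-standard units: convert lbs to kg, miles/hr to km/hr
  Values with varying scales: a column containing marks in subjects, with some subject marks out of 50 and others out of 100
  Over-precision: 4.5312341 kg, 9.323252 meters
  Remove outliers: abnormally high and low values

Standardise text:
  Extra characters: common prefix/suffix, leading/trailing/multiple spaces
  Different cases of same words: uppercase, lowercase, Title Case, Sentence case, etc.
  Non-standard formats: 23/10/16 to 2016/10/23; "Modi, Narendra" to "Narendra Modi"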

Fix invalid values:
  Encoding issues: CP1252 instead of UTF-8
  Incorrect data types: number stored as a string ("12,300"); date stored as a string ("2013-Aug"); string stored as a number (PIN code "110001" stored as 110001)
  Correct values not in list: non-existent country, PIN code
  Wrong structure: phone number with over 10 digits
  Correct values beyond range: temperature less than -273.15 °C (0 K)
  Validate internal rules: gross sales > net sales; date of delivery > date of ordering; if Title is "Mr" then Gender is "M"

Filter data:
  Duplicate data: identical rows, rows where some columns are identical
  Extra/unnecessary rows: rows that are not required in the analysis, e.g. if only observations before or after a particular date are required for analysis, the other rows become unnecessary
  Columns not relevant to analysis: columns that are not needed for analysis, e.g. personal detail columns such as address, phone column in a dataset for
  Dispersed data: parts of the data required for analysis stored in different files or part of different datasets
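
How to resolve

Fix rows and columns:
  Incorrect rows: delete
  Summary rows: delete
  Extra rows: delete
  Missing column names: add the column names
  Inconsistent column names: add column names that give some information about the data
  Unnecessary columns: delete
  Columns containing multiple data values: split columns into components
  No unique identifier: combine columns to create unique identifiers, e.g. combine City with State
  Misaligned columns: align these columns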
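
The row and column fixes listed above can be sketched in a few lines. This is only an illustration in plain Python (standard library only); the sample rows, the column names, and the helpers `is_junk` and `split_address` are hypothetical and not part of the checklist.

```python
# Hypothetical sample data: each row is a dict, as csv.DictReader would yield it.
rows = [
    {"name": "Asha", "address": "Springfield, Illinois, USA"},
    {"name": "Total", "address": ""},   # summary row
    {"name": "", "address": ""},        # blank row
    {"name": "Ben", "address": "Springfield, Missouri, USA"},
]

def is_junk(row):
    """Return True for blank rows and for summary rows such as Total/Subtotal."""
    if not any(value.strip() for value in row.values()):
        return True
    return row["name"].strip().lower() in {"total", "subtotal"}

def split_address(row):
    """Split 'city, state, country' into components and build a unique city key."""
    city, state, country = [part.strip() for part in row["address"].split(",")]
    return {
        "name": row["name"],
        "city": city,
        "state": state,
        "country": country,
        # The city name alone is ambiguous; city plus state identifies it.
        "city_id": f"{city}|{state}",
    }

cleaned = [split_address(row) for row in rows if not is_junk(row)]
```

The two Springfield rows end up with different `city_id` values, which is the point of combining City with State.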

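Missing values:
  Disguised missing values: set values as missing values
  Significant number of missing values in a row/column: delete rows, columns
  Partial missing values: fill the missing values with the correct value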
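
A minimal sketch of the first two missing-value steps, assuming the disguised markers are the ones given as examples in this checklist. The marker set, the 50% threshold, and the helper names are assumptions to adapt per dataset; a marker such as "999" can be a legitimate value in some columns.

```python
# Assumed markers that disguise a missing value; adjust per dataset and column.
MISSING_MARKERS = {"", "NA", "XX", "999"}

def to_missing(value):
    """Map disguised missing markers (and whitespace-only strings) to None."""
    return None if value is None or value.strip() in MISSING_MARKERS else value

rows = [
    {"age": "34", "fax": "NA", "city": "Pune"},
    {"age": "999", "fax": "", "city": "Delhi"},
    {"age": "41", "fax": "XX", "city": " "},
]
normalised = [{key: to_missing(value) for key, value in row.items()} for row in rows]

def missing_share(table, column):
    """Fraction of rows in which the column is missing."""
    return sum(row[column] is None for row in table) / len(table)

MAX_MISSING = 0.5  # hypothetical threshold for dropping a column
keep = [col for col in normalised[0] if missing_share(normalised, col) <= MAX_MISSING]
trimmed = [{col: row[col] for col in keep} for row in normalised]
```

Here the `fax` column is missing in every row and is dropped, while `age` and `city` are kept with their gaps marked as `None`.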

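Standardise numbers:
  Non-standard units: standardise the observations so all of them have the same consistent units
  Values with varying scales: make the scale common, e.g. a percentage scale
  Over-precision: standardise precision for better presentation of data, e.g. 4.5312341 kg could be presented as 4.53 kg
  Remove outliers: correct the value if it is a mistake, otherwise remove it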
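
The number fixes above could look roughly like this in plain Python. The checklist does not say how to detect outliers, so the 1.5 x IQR rule used here is an assumption, as are the sample values and the helper names.

```python
import statistics

LB_TO_KG = 0.45359237  # one pound in kilograms (exact by definition)

def to_kg(value, unit):
    """Convert a weight to kilograms; only 'kg' and 'lb' are handled in this sketch."""
    if unit == "kg":
        return value
    if unit == "lb":
        return value * LB_TO_KG
    raise ValueError(f"unknown unit: {unit}")

# Non-standard units and over-precision: convert, then round to two decimals.
weights = [(10.0, "lb"), (4.5312341, "kg")]
weights_kg = [round(to_kg(value, unit), 2) for value, unit in weights]

# Varying scales: marks out of 50 and out of 100 brought to a percentage scale.
marks = [(40, 50), (72, 100)]
percentages = [100 * score / out_of for score, out_of in marks]

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (an assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]

readings = [48, 50, 51, 52, 53, 49, 500]
outliers = iqr_outliers(readings)
```

Flagged values still need a human decision: correct them if they are data-entry mistakes, remove them otherwise.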

Standardise text:
  Extra characters: remove the extra characters
  Different cases of same words: standardise the case/bring to a common case
  Non-standard formats: correct the format/standardise the format for better readability in R
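
The text fixes above can be sketched with the standard library as follows. Python is used purely for illustration; the helper names, the choice of Title Case as the common case, and the assumed source date format (day/month/two-digit year) are assumptions.

```python
import re
from datetime import datetime

def clean_text(value):
    """Collapse repeated whitespace, trim the ends, and bring to Title Case."""
    return re.sub(r"\s+", " ", value).strip().title()

def standardise_date(value, source_format="%d/%m/%y"):
    """Rewrite a date such as 23/10/16 as 2016/10/23 (source format is assumed)."""
    return datetime.strptime(value, source_format).strftime("%Y/%m/%d")

def first_name_first(value):
    """Turn 'Last, First' into 'First Last'; leave other values unchanged."""
    if "," not in value:
        return value
    last, first = [part.strip() for part in value.split(",", 1)]
    return f"{first} {last}"

city = clean_text("  new   DELHI ")
day = standardise_date("23/10/16")
name = first_name_first("Modi, Narendra")
```

Removing a common prefix or suffix would follow the same pattern with `str.removeprefix` or a regular expression, depending on the data.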

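Fix invalid values:
  Encoding issues: encode unicode properly
  Incorrect data types: convert to the correct data type
  Correct values not in list, Wrong structure, Correct values beyond range, Validate internal rules: delete the invalid values, treat as missing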
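
A rough sketch of the invalid-value fixes, using the examples from this checklist. The list of valid countries, the record layout, and the helper `invalid_fields` are hypothetical; allowing same-day delivery and treating any digits-only phone number of up to 10 digits as valid are also assumptions.

```python
from datetime import date, datetime

# Encoding: decode bytes with the encoding they were written in, then use UTF-8.
raw = "café".encode("cp1252")   # stand-in for bytes read from a CP1252 file
text = raw.decode("cp1252")
utf8_bytes = text.encode("utf-8")

# Data types: convert each value to the type it really represents.
amount = int("12,300".replace(",", ""))          # number stored as a string
month = datetime.strptime("2013-Aug", "%Y-%b")   # %b depends on the locale
pin_code = f"{110001:06d}"                       # PIN code kept as a string

VALID_COUNTRIES = {"India", "Japan", "Kenya"}    # hypothetical reference list

def invalid_fields(record):
    """Return the names of fields that break a list, structure, range or rule check."""
    bad = []
    if record["country"] not in VALID_COUNTRIES:
        bad.append("country")
    if not record["phone"].isdigit() or len(record["phone"]) > 10:
        bad.append("phone")
    if record["temp_c"] < -273.15:
        bad.append("temp_c")
    if record["delivered"] < record["ordered"]:   # same-day delivery allowed here
        bad.append("delivered")
    if record["title"] == "Mr" and record["gender"] != "M":
        bad.append("gender")
    return bad

record = {
    "country": "Atlantis", "phone": "98765432101", "temp_c": -300.0,
    "ordered": date(2016, 10, 20), "delivered": date(2016, 10, 23),
    "title": "Mr", "gender": "F",
}
bad = invalid_fields(record)
# Treat invalid values as missing rather than guessing a replacement.
cleaned = {key: (None if key in bad else value) for key, value in record.items()}
```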

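Filter data:
  Duplicate data: deduplicate data/remove duplicated data
  Extra/unnecessary rows: filter rows to keep only the relevant data
  Columns not relevant to analysis: filter columns, pick columns relevant to analysis
  Dispersed data: bring the data together, group by required keys, aggregate the rest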
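
The four filtering steps above can be chained as in the sketch below. The two in-memory "files", the column names, the cutoff date, and the `deduplicate` helper are all hypothetical, and only fully identical rows are treated as duplicates here.

```python
from collections import defaultdict
from datetime import date

# Hypothetical order data that arrived in two separate files.
file_a = [
    {"customer": "C1", "day": date(2016, 10, 20), "amount": 120, "phone": "555-0101"},
    {"customer": "C1", "day": date(2016, 10, 20), "amount": 120, "phone": "555-0101"},
    {"customer": "C2", "day": date(2015, 3, 1), "amount": 75, "phone": "555-0102"},
]
file_b = [
    {"customer": "C1", "day": date(2016, 11, 2), "amount": 30, "phone": "555-0101"},
    {"customer": "C3", "day": date(2016, 12, 9), "amount": 200, "phone": "555-0103"},
]

KEEP_COLUMNS = ("customer", "day", "amount")   # phone is not needed for the analysis
CUTOFF = date(2016, 1, 1)                      # only later observations are required

def deduplicate(rows):
    """Keep the first occurrence of each fully identical row."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

combined = deduplicate(file_a + file_b)                         # bring data together
recent = [row for row in combined if row["day"] >= CUTOFF]      # filter rows
slim = [{col: row[col] for col in KEEP_COLUMNS} for row in recent]  # filter columns

totals = defaultdict(int)                                       # group by key, aggregate
for row in slim:
    totals[row["customer"]] += row["amount"]
```

Rows where only some columns are identical need a decision about which columns form the key; the same helper works if `key` is built from that subset.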
