Quality
Sangeeta Shah Bharadwaj
The Need For Data Management
Challenges of Data Management
a. Data Independence
b. Reduced Data Redundancy
c. Data Consistency
d. Data Access
e. Data Administration
f. Managing Concurrency
g. Managing Security
h. Recovery from Crashes
i. Application Development
Data Quality
• Completeness means that no data is missing, e.g. no variable is left without a value
• Correctness means that the data holds correct values (an age of -45 years is not correct)
• Accuracy refers to the closeness of measured values, observations or estimates to the real or true value
• Consistency describes the absence of apparent contradictions and is a measure of internal validity and reliability (an inconsistent example: age recorded as 45 years and 10 months for one employee and as 34 years for another)
• Currency refers to the availability of data when required and in accurate form (e.g. a hospital prescription form showing the age as of the registration date)
• Granularity refers to the required, relevant and appropriate level of detail in the data (Age: 24 years, 10 months, 20 days, 7 hours, 40 minutes)
Data Completeness (Imputation)
Imputation is the process of replacing missing values to complete the data
• Suppose you have collected customer data for profiling, but age is missing for some customers and you cannot go back to them. How will you treat the records where age is missing?
• Suppose you are collecting health-related data on employees, and BMI is calculated from height and weight. If height or weight is missing, how will you complete the data?
• Replace missing values with the mean or median of the set.
• Use linear regression to fill in the blanks. Linear regression fits a simple model (a line) from which missing values can be interpolated or extrapolated. It is only suited to data that is likely to follow a linear relationship, such as height, weight, or income levels.
• Replace each missing value with the value before it. This may work if the values follow a trend (as opposed to values that are all over the place).
• Fill in the blanks with zeros. This is mostly an option when only a few non-critical points are missing.
• Use k-nearest neighbors to generate the missing data points. Nearest-neighbor matching pairs each data point with the most similar other data point.
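The methods listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function names, the use of `None` to mark a missing value, and the choice of a single predictor column are all assumptions made for the example. The k-nearest-neighbor variant here averages the `k` closest complete records, which is one common way to apply the idea.

```python
from statistics import mean, median

def impute_constant(values, fill):
    """Replace every missing value (None) with a fixed value, e.g. zero."""
    return [fill if v is None else v for v in values]

def impute_mean(values):
    """Replace missing values with the mean of the observed values."""
    return impute_constant(values, mean(v for v in values if v is not None))

def impute_median(values):
    """Replace missing values with the median of the observed values."""
    return impute_constant(values, median(v for v in values if v is not None))

def impute_previous(values):
    """Replace each missing value with the last observed value before it.
    A missing value at the very start stays None (nothing precedes it)."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def impute_regression(xs, ys):
    """Fit y = a + b*x by least squares on complete pairs,
    then predict y wherever it is missing."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    mx = mean(x for x, _ in pairs)
    my = mean(y for _, y in pairs)
    b = sum((x - mx) * (y - my) for x, y in pairs) / sum((x - mx) ** 2 for x, _ in pairs)
    a = my - b * mx
    return [a + b * x if y is None else y for x, y in zip(xs, ys)]

def impute_knn(xs, ys, k=2):
    """Fill a missing y with the mean y of the k records whose x is closest."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    filled = []
    for x, y in zip(xs, ys):
        if y is None:
            nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
            y = mean(p[1] for p in nearest)
        filled.append(y)
    return filled

# Hypothetical sample data: one age and one weight are missing.
ages = [20, None, 30, 40]
heights_cm = [150, 160, 170, 180]
weights_kg = [50, None, 70, 80]

print(impute_mean(ages))
print(impute_previous(ages))
print(impute_regression(heights_cm, weights_kg))
print(impute_knn(heights_cm, weights_kg, k=2))
```

Which method is appropriate depends on the data: mean or median filling ignores other columns, while regression and nearest-neighbor filling use a related variable (here height) to estimate the missing one (here weight).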
Data Correctness
• Enter correct data by adding validation rules at the point of entry
• Identify incorrect data and take action to correct it
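Both points can be illustrated with a small rule-based check, shown here as a sketch. The field names (`name`, `age`) and the permitted age range of 18 to 70 are assumptions chosen for the example, not rules taken from the slides. The same function can reject a record at entry time or flag existing records for correction.

```python
# Assumed business rule for illustration only: employee age must be 18-70.
MIN_AGE, MAX_AGE = 18, 70

def validate_employee(record):
    """Return a list of rule violations for one record (empty if valid)."""
    errors = []
    name = record.get("name")
    if not isinstance(name, str) or not name.strip():
        errors.append("name is missing")
    age = record.get("age")
    if age is None:
        errors.append("age is missing")
    elif not isinstance(age, int) or not MIN_AGE <= age <= MAX_AGE:
        errors.append(f"age {age!r} is outside the allowed range {MIN_AGE}-{MAX_AGE}")
    return errors

def split_records(records):
    """Separate existing records into accepted ones and ones flagged for correction."""
    accepted, flagged = [], []
    for record in records:
        problems = validate_employee(record)
        if problems:
            flagged.append((record, problems))
        else:
            accepted.append(record)
    return accepted, flagged

# Hypothetical records, including the age = -45 case.
records = [
    {"name": "Asha", "age": 34},
    {"name": "Ravi", "age": -45},
    {"name": "", "age": 29},
]
accepted, flagged = split_records(records)
for record, problems in flagged:
    print(record, "->", problems)
```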
SNAPSHOT of HR DATA
Views of Data (Queries)
Different views reveal different combinations of data
TYPES OF MIS
Operational Reporting
• Compliance report
• Summary (monthly/annual) reports
From Database to Data Warehouse
Definition of a Data Warehouse
“A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process”.
Data Warehouse Properties
• Subject Oriented
• Integrated
[Figure: an operational database feeds the warehouse database through an initial load followed by periodic refreshes.]
Extraction, Transformation, and Transportation
How Is Big Data Different?
1) Automatically generated by a machine (e.g. a sensor embedded in an engine)
Big Data and Data Analytics