
Data Management and Data Quality
Sangeeta Shah Bharadwaj
The Need For Data Management
Challenges of Data Management
a. Data Independence
b. Reduced Data Redundancy
c. Data Consistency
d. Data Access
e. Data Administration
f. Managing Concurrency
g. Managing Security
h. Recovery from Crashes
i. Application Development
Data Quality
• Completeness means that no required data is missing (e.g. no variable is
left without a value)
• Correctness means that values are valid for the field (e.g. age = -45 years
is not a correct value)
• Accuracy refers to the closeness of measured values, observations, or
estimates to the real or true value
• Consistency describes the absence of apparent contradictions and is a
measure of internal validity and reliability (e.g. age recorded as 45 years
and 10 months for one employee and as 34 years for another)
• Currency means that data is available when required and is up to date
(e.g. on a hospital prescription form, the age shown is the age as on the
registration date)
• Granularity means that data is held at the required, relevant, and
appropriate level of detail (e.g. Age: 24 years 10 months 20 days, 7 hours,
40 minutes)
Data Completeness (Imputation)
Imputation is the process of replacing missing values to complete the
data.
• Suppose you have collected customer data for profiling, but age is missing
for some customers and you cannot go back to them. How will you treat the
records where age is missing?
• Suppose you are collecting health-related data on employees, and BMI is
calculated from height and weight. If height or weight is missing, how will
you complete the data?
Data Completeness (Imputation)
• Replace missing values with the mean or median of the observed values.
• Use linear regression to fill in the blanks. Linear regression fits a
simple model (a line) from which missing values can be interpolated or
extrapolated. It is suited only to data where a roughly linear relationship
is likely, such as height, weight, or income levels.
• Replace a missing value with the value before it. This may work if the
values follow a trend (as opposed to values that are all over the place).
• Fill in the blanks with zeros. This is mostly an option when only a few,
non-critical points are missing.
• Use k-nearest neighbors to generate missing data points. Nearest-neighbor
matching pairs a record that has a missing value with the most similar
complete records and borrows their values.
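The approaches listed above can be sketched in a few lines of Python. This is a minimal illustration, not a production routine: missing values are assumed to be represented as None, the function names and the sample height/weight figures are hypothetical, and the k-nearest-neighbor version uses a single predictor for brevity.

```python
from statistics import mean, median

def impute_constant(values, fill):
    """Replace every missing entry (None) with one fixed value, e.g. zero."""
    return [fill if v is None else v for v in values]

def impute_mean(values):
    """Replace missing entries with the mean of the observed entries."""
    return impute_constant(values, mean([v for v in values if v is not None]))

def impute_median(values):
    """Replace missing entries with the median of the observed entries."""
    return impute_constant(values, median([v for v in values if v is not None]))

def impute_previous(values):
    """Replace each missing entry with the most recent observed value before it."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)  # stays None if nothing has been observed yet
    return filled

def impute_linear_regression(xs, ys):
    """Fit y = a + b*x by least squares on complete pairs, then predict missing y."""
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    x_bar = mean([x for x, _ in pairs])
    y_bar = mean([y for _, y in pairs])
    sxx = sum((x - x_bar) ** 2 for x, _ in pairs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return [a + b * x if y is None else y for x, y in zip(xs, ys)]

def impute_knn(xs, ys, k=2):
    """Fill each missing y with the mean y of the k complete records nearest in x."""
    complete = [(x, y) for x, y in zip(xs, ys) if y is not None]
    filled = []
    for x, y in zip(xs, ys):
        if y is None:
            nearest = sorted(complete, key=lambda p: abs(p[0] - x))[:k]
            y = mean([p[1] for p in nearest])
        filled.append(y)
    return filled

# Hypothetical sample data: ages with gaps, and weight (kg) missing for one height (cm).
ages = [20, None, 30, 40, None]
heights = [150, 160, 170, 180]
weights = [50, 60, None, 80]

ages_mean = impute_mean(ages)
ages_previous = impute_previous(ages)
weights_regression = impute_linear_regression(heights, weights)
weights_knn = impute_knn(heights, weights, k=2)
```

Each method gives a different answer for the same gap, so the choice should follow from how the variable behaves: mean or median for stable quantities, the previous value for trending series, regression or nearest neighbors when another recorded variable is informative.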
Data Correctness
• Enter correct data by adding validation rules at the entry level
• Identify incorrect data and take action to correct it
• Negative values, or other values that the field should not hold
• Multiple labels for the same value
• Any other incorrect data
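The checks above can be illustrated with a small Python sketch. The field names, the 0 to 120 age range, and the label mapping are assumptions chosen for the example, not rules taken from the slides.

```python
def is_valid_age(age):
    """Entry-level rule: age must be a whole number in a plausible range (assumed 0-120)."""
    return isinstance(age, int) and 0 <= age <= 120

# Hypothetical mapping that collapses multiple labels of the same value into one.
CANONICAL_GENDER = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}

def normalize_label(raw, mapping):
    """Return the canonical label, or None when the label is not recognised."""
    return mapping.get(raw.strip().lower())

def find_incorrect(records):
    """Return (row index, field, value) for every value that breaks a rule."""
    problems = []
    for i, rec in enumerate(records):
        if not is_valid_age(rec["age"]):
            problems.append((i, "age", rec["age"]))
        if normalize_label(rec["gender"], CANONICAL_GENDER) is None:
            problems.append((i, "gender", rec["gender"]))
    return problems

# Hypothetical records: one negative age and one unrecognised label.
records = [
    {"age": 45, "gender": "M"},
    {"age": -45, "gender": "female"},
    {"age": 30, "gender": "unknown?"},
]
problems = find_incorrect(records)
cleaned_labels = [normalize_label(r["gender"], CANONICAL_GENDER) for r in records]
```

The same rule function can be reused at the point of entry to reject a value before it is stored, and afterwards to audit data that is already in the system.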
How Is Data Stored?
Commercial DBMS Products in Market
• Oracle
• DB2
• SQL Server
• MySQL

Snapshot of HR Data

Views of Data (Queries)
Different views reveal different combinations of data
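As a rough illustration of this idea, the sketch below defines two views over one table using Python's built-in sqlite3 module. The employee table, its columns, and the sample rows are hypothetical and are not taken from the HR snapshot in the slides.

```python
import sqlite3

# In-memory database with a hypothetical HR table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (
        emp_id     INTEGER PRIMARY KEY,
        name       TEXT,
        department TEXT,
        salary     REAL
    );
    INSERT INTO employee VALUES
        (1, 'Asha',  'Sales', 50000),
        (2, 'Ravi',  'Sales', 60000),
        (3, 'Meena', 'HR',    55000);

    -- View 1: a staff directory that hides salary
    CREATE VIEW employee_directory AS
        SELECT emp_id, name, department FROM employee;

    -- View 2: headcount and average salary per department
    CREATE VIEW department_summary AS
        SELECT department, COUNT(*) AS headcount, AVG(salary) AS avg_salary
        FROM employee
        GROUP BY department;
""")

directory = conn.execute(
    "SELECT * FROM employee_directory ORDER BY emp_id").fetchall()
summary = conn.execute(
    "SELECT * FROM department_summary ORDER BY department").fetchall()
```

Both views read the same stored rows; only the combination of columns and the level of aggregation differ.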

Types of MIS
(Diagram labels: Delivered Customized Queries; Data for Reporting)
• Operational Reporting: compliance reports; summary (monthly/annual) reports
• Advanced Reporting: for benchmarking; decision making
• Predictive Analytics: scenario planning; risk analysis and mitigation

From Database to Data Warehouse

Definition of a Data Warehouse

“A data warehouse is a subject-oriented, integrated, time-variant and
non-volatile collection of data in support of management’s decision making
process”.
Data Warehouse Properties

(Diagram: Data Warehouse at the center, surrounded by its four properties:
Subject Oriented, Integrated, Non Volatile, Time Variant)


• Subject-Oriented: the warehouse is organized around a subject area, e.g.
Marketing, Sales, or Employee.
• Integrated: it combines data from multiple sources, which is cleansed and
integrated so that it is presented in a single form.
• Time-Variant: data keeps being added over time, each record carries a time
stamp, and inputs and outputs depend on time. For example, a transaction
system may hold only the most recent address of a customer, whereas a data
warehouse can hold all addresses associated with that customer.
• Non-Volatile: once data is in the data warehouse, it does not change, so
historical data should never be altered. Data in a data warehouse is used
for analysis, and update and delete operations may disturb that analysis, so
data is generally never updated or deleted. New data is added with a
time/date stamp.
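The customer-address example can be sketched in Python to contrast the two behaviors. This is an illustrative toy under stated assumptions: the AddressHistory class, its method names, and the sample addresses are hypothetical, and a real warehouse would use database tables rather than in-memory lists.

```python
from datetime import date

class AddressHistory:
    """Append-only store: rows are added with a date stamp, never updated or deleted."""

    def __init__(self):
        self._rows = []  # each row is (customer_id, address, valid_from)

    def load(self, customer_id, address, valid_from):
        """Add a new date-stamped row; existing rows are left untouched."""
        self._rows.append((customer_id, address, valid_from))

    def all_addresses(self, customer_id):
        """Every address ever recorded for the customer, with its date stamp."""
        return [(a, d) for c, a, d in self._rows if c == customer_id]

    def address_as_of(self, customer_id, on_date):
        """The address that was current on the given date, or None if none was."""
        candidates = [(d, a) for c, a, d in self._rows
                      if c == customer_id and d <= on_date]
        return max(candidates, key=lambda p: p[0])[1] if candidates else None

operational = {}              # transaction system: one current address per customer
warehouse = AddressHistory()  # warehouse: full, date-stamped history

changes = [
    (101, "12 Park Street", date(2019, 1, 5)),
    (101, "7 Lake Road", date(2022, 6, 1)),
]
for customer_id, address, changed_on in changes:
    operational[customer_id] = address               # update in place, old value lost
    warehouse.load(customer_id, address, changed_on)  # insert only, history kept

current = operational[101]
history = warehouse.all_addresses(101)
earlier = warehouse.address_as_of(101, date(2020, 1, 1))
```

Because rows are only ever added, a question such as "where did this customer live in 2020?" can still be answered after the address changes in the operational system.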
Nonvolatile
Typically data in the data warehouse is not updated or deleted.

(Diagram: the operational database supports Insert, Update, Delete, and Read;
the warehouse supports only Load and Read.)
Time Variant
(Diagram: the operational database feeds the warehouse database through a
first-time load, followed by repeated refreshes.)
Extraction, Transformation, and Transportation

(Diagram: OLTP Databases → Staging File → Warehouse Database)

Purchase specialist tools, or develop programs

• Extraction: select data using different methods
• Transformation: validate, clean, integrate, and time stamp data
• Load: move data into the warehouse
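For the "develop programs" option, the three steps might look like the following Python sketch. The record layout, the rejection rule for negative amounts, and the function names are assumptions made for illustration; the integration of multiple sources is omitted to keep the example short.

```python
from datetime import datetime, timezone

def extract(oltp_rows, since_id=0):
    """Extraction: select source rows, here only those newer than the last loaded id."""
    return [r for r in oltp_rows if r["order_id"] > since_id]

def transform(rows, loaded_at):
    """Transformation: validate, clean, and time stamp; return (clean, rejected)."""
    clean, rejected = [], []
    for r in rows:
        # Validation rule (assumed): an amount must be present and non-negative.
        if r["amount"] is None or r["amount"] < 0:
            rejected.append(r)
            continue
        clean.append({
            "order_id": r["order_id"],
            "customer": r["customer"].strip().title(),  # tidy spacing and case
            "amount": round(float(r["amount"]), 2),
            "loaded_at": loaded_at,                      # time stamp for the warehouse
        })
    return clean, rejected

def load(warehouse, rows):
    """Load: append rows to the warehouse; nothing already stored is modified."""
    warehouse.extend(rows)
    return len(rows)

# Hypothetical OLTP rows; the second one fails validation.
oltp = [
    {"order_id": 1, "customer": "  asha rao ", "amount": 120.5},
    {"order_id": 2, "customer": "RAVI K", "amount": -10},
    {"order_id": 3, "customer": "meena", "amount": 99},
]

warehouse = []
stamp = datetime.now(timezone.utc)
staging = extract(oltp)                      # plays the role of the staging file
clean, rejected = transform(staging, stamp)
loaded = load(warehouse, clean)
```

A later refresh could call extract(oltp, since_id=3) so that only rows added after the previous load are picked up.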
Data Warehouses
• The Need for Data Warehouses
– Consolidating much of the data from various databases into a single
whole that can be understood clearly.
– Consolidated reporting
• Extract-Clean-Transform-Load
• Scrubbing / Data Cleansing
• Staging Area
• Data Marts – domain-specific data warehouses
• Analysis of Data
– Data Mining
– Online Analytical Processing (OLAP)
– Data Visualization
Structured Data to Unstructured Data

How Is Big Data Different?
1) Automatically generated by a machine (e.g. a sensor embedded in an engine)
2) Typically an entirely new source of data (e.g. use of the internet)
3) Not designed to be friendly (e.g. text streams)
4) May not have much value (a lot of the data is redundant)
– Need to focus on the important part

Big Data and Data Analytics
