
Data Warehouse

Bilal Hussain
• Course Outlines:
1. Introduction & Background.
2. De-Normalization.
3. OLAP & Dimensional Modeling.
4. ETL and Data Quality Management (DQM).
5. Database Performance (Parallelism, Partitioning).
6. ETL Implementation using ODI.
7. Data Visualization using OBIEE.
8. Project (Design Data warehouse for any organization using any
ETL and BI Tool).
Course plan (16 weeks):
• Week 3: Assignment #1, Quiz #1
• Week 6: Assignment #2, Quiz #2
• Week 9: Mid-Term
• Week 10: Assignment #3, Quiz #3
• Week 12: Assignment #4, Quiz #4
• Week 16: Final Exam
Recap:
• Dr. E. F. Codd's 12 Rules.
• Normalization.
• Constraints.
• De-Normalization.
• Dimensional Modeling (DM).
ETL/ELT
The process of extracting data from source systems and bringing it into
the data warehouse is commonly called ETL, which stands for Extract,
Transform, and Load.
Why ETL?
• A Data Warehouse provides a common data repository.
• ETL provides a method of moving data from various sources into the data
warehouse.
• As data sources change, a well-designed ETL process keeps the Data Warehouse updated.
• A well-designed and documented ETL system is almost essential to the success of a
Data Warehouse project.
• It allows verification of data transformation, aggregation, and calculation rules.
• It performs complex transformations, and requires extra space to store the data.
• It converts data in various formats and types into one consistent system.
• ETL is a predefined process for accessing and manipulating source data into the
target database.
• It helps improve productivity.
ETL Process:
• Extract: capture data from the source system.
  • Full Extraction.
  • Incremental Extraction (Timestamp, Unique ID, Triggers).
    • Efficient when changes can be identified.
    • Identifying changes can be costly.
    • Very challenging.

• Methods of Data Extraction:
  • Online: live data, read directly from the source system.
  • Offline: batch processing from source extracts.
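The timestamp-based incremental extraction described above can be sketched in a few lines. This is a minimal, hedged example: it assumes a source table carrying a `last_modified` column; the table and column names (`source_orders`, `last_modified`) are hypothetical.

```python
import sqlite3

# Sketch of timestamp-based incremental extraction. Only rows changed
# since the previous ETL run are captured, instead of a full extraction.

def extract_incremental(conn, last_run):
    """Return only the rows modified after the previous ETL run."""
    cur = conn.execute(
        "SELECT id, name, last_modified FROM source_orders "
        "WHERE last_modified > ? ORDER BY id", (last_run,))
    return cur.fetchall()

# Demo against an in-memory "source system".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_orders (id INTEGER, name TEXT, last_modified TEXT)")
conn.executemany("INSERT INTO source_orders VALUES (?, ?, ?)", [
    (1, "widget", "2024-01-01"),
    (2, "gadget", "2024-02-15"),
    (3, "gizmo", "2024-03-10"),
])
changed = extract_incremental(conn, "2024-02-01")
print([row[0] for row in changed])  # only rows 2 and 3 changed since the last run
```

This is why identification matters: the `WHERE` clause touches far less data than a full extraction, but it only works if every source row reliably maintains its timestamp.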
ETL Process:
• Transform: a set of rules or functions is applied to the extracted data to convert it
into a single uniform/standard format.
• Filtering
• Cleaning
• Joining
• Splitting
• Conversion
• Sorting
• Summarization.
• Major Types.
• Decoding of Fields
• Calculated and Derived values.
• Merging of Information.
• Unit of measurement conversion
• Date/time conversion.
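Several of the transformation types listed above (decoding of fields, calculated/derived values, unit conversion, date/time conversion) can be shown in one small sketch. All field names, codes, and conversion rules here are illustrative assumptions, not from the source.

```python
from datetime import datetime

# Decoding table: map inconsistent source codes to one standard value.
GENDER_DECODE = {"M": "Male", "F": "Female", "0": "Male", "1": "Female"}

def transform(record):
    out = dict(record)
    # Decoding of fields.
    out["gender"] = GENDER_DECODE[record["gender"]]
    # Unit of measurement conversion: pounds -> kilograms.
    out["weight_kg"] = round(record["weight_lb"] * 0.45359237, 2)
    # Date/time conversion: DD-MM-YYYY -> ISO YYYY-MM-DD.
    out["order_date"] = datetime.strptime(record["order_date"], "%d-%m-%Y").date().isoformat()
    # Calculated/derived value: total = quantity * unit price.
    out["total"] = record["qty"] * record["unit_price"]
    return out

rec = {"gender": "0", "weight_lb": 154.0, "order_date": "05-03-2024",
       "qty": 3, "unit_price": 2.5}
print(transform(rec))
```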
ETL Process:
• Loading
  • Loading the final (transformed) data into the Data Warehouse.
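The load step can be as simple as a batched insert of the transformed rows into the warehouse's target table. A minimal sketch, where the `fact_sales` table and its columns are hypothetical:

```python
import sqlite3

# Minimal sketch of the load step: write transformed rows into the
# warehouse fact table in one batch.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE fact_sales (order_id INTEGER, amount REAL)")

transformed_rows = [(1, 9.99), (2, 14.50), (3, 7.25)]
dw.executemany("INSERT INTO fact_sales VALUES (?, ?)", transformed_rows)
dw.commit()

count = dw.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
print(count)  # 3 rows loaded
```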
ETL Issues
1. Diversity in source systems and platforms.

   Platform        OS        DBMS
   Exadata         Linux     Oracle
   Mini Computer   Unix      Informix
   Desktop         Windows   Access

2. Inconsistent data representation.
   • Gender: M/F vs. 0/1
   • Date: YYYY-MM-DD vs. DD-MON-YYYY vs. DD-MM-YYYY
3. Multiple sources for same data element.
4. Complexity of required transformation.
5. Volume of Data.
6. Data Quality.
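Issue 2 above (inconsistent date representations) is typically resolved by trying each known source format until one parses, then emitting a single canonical form. A sketch, where the list of format strings is an assumption about what the sources emit:

```python
from datetime import datetime

# Canonicalise dates arriving as YYYY-MM-DD, DD-MON-YYYY, or DD-MM-YYYY
# into ISO YYYY-MM-DD.
SOURCE_FORMATS = ["%Y-%m-%d", "%d-%b-%Y", "%d-%m-%Y"]

def to_iso(date_str):
    for fmt in SOURCE_FORMATS:
        try:
            return datetime.strptime(date_str, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {date_str!r}")

print(to_iso("2024-03-05"))   # already ISO
print(to_iso("05-MAR-2024"))  # DD-MON-YYYY
print(to_iso("05-03-2024"))   # DD-MM-YYYY
```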
ETL Methods:
• Incremental Updates Methods.
• Timestamp Based.
• Triggers
• Partitioning.
Primary Key Problems
1. Primary key but different data.
2. One attribute with different names.
3. Primary key in one system but not in other.
Non Primary Key Problems
1. Different encoding in different sources.
2. Multiple ways to represent the same information.
3. Sources might contain invalid data.
4. Two fields with different data but same name.
5. Required field left blank.
6. Data erroneous or incomplete.
7. Data contain null values.
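Several of the problems above (invalid data, required fields left blank, erroneous values, nulls) are typically caught with row-level validation rules before loading. A minimal sketch with illustrative rules and field names:

```python
# Row-level quality checks: blank/null required fields, invalid
# encodings, and out-of-range (erroneous) values.
VALID_GENDERS = {"M", "F"}

def validate(row):
    errors = []
    if row.get("name") in (None, ""):           # required field blank / null
        errors.append("name missing")
    if row.get("gender") not in VALID_GENDERS:  # invalid encoding
        errors.append("invalid gender code")
    if row.get("age") is not None and not (0 <= row["age"] <= 130):
        errors.append("age out of range")       # erroneous data
    return errors

good = {"name": "Ali", "gender": "M", "age": 30}
bad = {"name": "", "gender": "X", "age": 200}
print(validate(good))  # [] -- passes all checks
print(validate(bad))
```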
Data Quality
• Formal Definition:
  • "Quality is conformance to requirements."
  • P. Crosby, Quality is Free, 1979.
• According to Industry:
  • Quality means meeting customers' needs, not necessarily exceeding them.
  • Quality means improving things customers care about, because that makes
their lives easier and more comfortable.
Orr’s Laws of Data Quality
• Law #1- Data which is not used is not correct.
• Law #2- Data Quality is a function of its use, not its collection.
• Law #3- Data will be no better than its most stringent use.
• Law #4- Data Quality problems increase with the age of the system.
• Law #5- The less likely something is to occur, the more traumatic it
will be when it happens.
Total Quality Control
• Cost of fixing data quality.
Co$t of Data Quality.
• Controllable costs.
• Resultant costs.
• Equipment & Training Costs.
Where data quality is critical?
• Marketing communications.
• Customer Matching.
Characteristics or Dimensions of Data Quality
• Accuracy: a qualitative assessment of freedom from error; high accuracy corresponds to small error.
• Completeness: the degree to which values are present in the attributes that require them.
• Reliability: a piece of information does not contradict another piece of information (e.g., DOB).
• Timeliness: how up-to-date the information is.
• Interpretability: the extent to which data is in appropriate language, symbols, and units, and the definitions are clear.
• Accessibility: the extent to which data is available, or easily and quickly retrievable.
• Objectivity: the extent to which data is unbiased, unprejudiced, and impartial.
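Of these dimensions, Completeness is among the easiest to quantify: the share of required attribute values actually present across a batch of records. A sketch, where the required-field list is an assumption for illustration:

```python
# Completeness metric: fraction of required attribute values that are
# actually populated (not null, not blank) across a batch of records.
REQUIRED = ["customer_id", "name", "email"]

def completeness(records):
    present = sum(
        1 for r in records for f in REQUIRED
        if r.get(f) not in (None, "")
    )
    return present / (len(records) * len(REQUIRED))

batch = [
    {"customer_id": 1, "name": "Sara", "email": "s@example.com"},
    {"customer_id": 2, "name": "", "email": None},
]
print(completeness(batch))  # 4 of 6 required values present
```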


End
