DW - Terminal
1. Characteristics/Properties/Features of Data Warehouse
a. Data Integration
b. Subject Oriented Data
c. Non-Volatile Data
d. Time-Variant Data
2. Varieties of Data Warehouse
a. Conventional DW
b. Data Mart
c. Operational Data Store
3. Q:/ What is a Data Mart?
4. DW vs. DM
5. Types of Data Mart
a. Independent DM
b. Dependent DM
6. Architecture of Data Warehouses
a. Information/Data Sources Component
i. Production Data
ii. Internal Data
iii. External Data
iv. Transactional Data
b. Data Staging Component
i. Extraction
ii. Transformation
iii. Loading
c. Data Storage Component
i. Metadata
ii. Data Mart
iii. Multidimensional Databases
d. Information/Data Delivery Component
i. Ad hoc reports
ii. Complex Queries
iii. Multidimensional Analysis
e. OLAP = Online Analytical Processing
i. ROLAP
ii. MOLAP
iii. DOLAP
iv. HOLAP
7. Slides-02
8. Granularity
9. Benefits of Granularity (5)
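Granularity is the level of detail at which facts are kept. A minimal pandas sketch (hypothetical sales data; table and column names are assumptions, not from the slides) contrasting transaction-level grain with a coarser daily grain:

```python
import pandas as pd

# Hypothetical transaction-level (fine-grained) sales facts.
sales = pd.DataFrame({
    "sale_time": pd.to_datetime(
        ["2024-01-01 09:15", "2024-01-01 17:40", "2024-01-02 11:05"]),
    "product":   ["heater", "heater", "fan"],
    "amount":    [120.0, 95.0, 40.0],
})

# Coarser grain: one row per product per day.  Aggregating loses the
# ability to answer time-of-day questions but shrinks the table.
daily = (sales
         .assign(sale_date=sales["sale_time"].dt.date)
         .groupby(["sale_date", "product"], as_index=False)["amount"]
         .sum())
print(daily)
```

The finer the grain, the more questions the warehouse can answer, at the cost of more rows to store and scan.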
10. Slides-04
11. Q:/ What is Normalization?
12. Q:/ What are the goals of normalization?
a. Eliminate redundant data.
b. Ensure data dependencies make sense.
13. Q:/ What is the result of normalization?
14. Q:/ What are the levels of normalization? (see the sketch after this list)
a. 1NF
b. 2NF
c. 3NF
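A minimal sketch of what normalization (through 3NF) buys, using hypothetical order/customer data in plain Python (names are illustrative only): the denormalized form repeats customer attributes on every order, while the normalized form stores each fact exactly once.

```python
# Denormalized: customer attributes repeated on every order row (redundancy).
orders_flat = [
    {"order_id": 1, "cust_id": 7, "cust_name": "Ali", "cust_city": "Lahore", "total": 250},
    {"order_id": 2, "cust_id": 7, "cust_name": "Ali", "cust_city": "Lahore", "total": 100},
]

# Normalized (3NF-style): customer facts stored once, referenced by key.
customers = {7: {"cust_name": "Ali", "cust_city": "Lahore"}}
orders = [
    {"order_id": 1, "cust_id": 7, "total": 250},
    {"order_id": 2, "cust_id": 7, "total": 100},
]

# A change (e.g. the customer moves) now touches exactly one place.
customers[7]["cust_city"] = "Karachi"
```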
15. Slides-05
16. Q:/ What is De-Normalization?
17. Q:/ Why De-normalization in DSS?
18. Q:/ How does De-normalization improve performance?
19. 4 Guidelines for De-normalization
a. Carefully do a cost-benefit analysis.
b. Do a data requirement and storage analysis.
c. Weigh against the maintenance issue of the redundant data (triggers used).
d. When in doubt, don’t de-normalize.
20. Areas for Applying De-Normalization Techniques
21. Five principal De-normalization techniques (see the sketch after this list)
a. Collapsing Tables.
i. Two entities with a One-to-One relationship.
ii. Two entities with a Many-to-Many relationship.
b. Splitting Tables (Horizontal/Vertical Splitting).
c. Pre-Joining.
d. Adding Redundant Columns (Reference Data).
e. Derived Attributes (Summary, Total, Balance etc).
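A hedged pandas sketch of two of the techniques above, pre-joining/adding redundant columns and derived attributes, on hypothetical order and customer tables (all names are assumptions):

```python
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2], "cust_id": [7, 8],
                       "qty": [3, 2], "unit_price": [50.0, 20.0]})
customers = pd.DataFrame({"cust_id": [7, 8], "cust_city": ["Lahore", "Sibi"]})

# Pre-joining / adding a redundant column: copy cust_city onto the order
# rows so reporting queries avoid the join at read time.
denorm = orders.merge(customers, on="cust_id", how="left")

# Derived attribute: store the computed total instead of recomputing it
# in every query (must be kept in sync when qty/unit_price change).
denorm["order_total"] = denorm["qty"] * denorm["unit_price"]
print(denorm)
```

The read-time join and recomputation disappear, at the price of redundant data that must be kept consistent (the maintenance issue in item 19).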
22. OLAP FASMI Test
a. Fast
b. Analysis
c. Shared
d. Multi-dimensional
e. Information
23. Lecture-09
24. OLAP Implementations
a. MOLAP
b. ROLAP
c. HOLAP
d. DOLAP
25. MOLAP Implementations
26. Aggregations in MOLAP
27. Cube Operations (see the sketch after this list)
a. Roll up (drill-up)
b. Drill down (roll down)
c. Slice and dice (project and select)
d. Pivot (rotate)
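A hedged pandas sketch of the four cube operations on a tiny hypothetical sales set with city, product and month dimensions (names are illustrative):

```python
import pandas as pd

sales = pd.DataFrame({
    "city":    ["Lahore", "Lahore", "Sibi",   "Sibi"],
    "product": ["heater", "fan",    "heater", "fan"],
    "month":   ["Jan",    "Jan",    "Feb",    "Feb"],
    "amount":  [100, 60, 30, 80],
})

# Roll up: aggregate away the product dimension (city x month totals).
rollup = sales.groupby(["city", "month"], as_index=False)["amount"].sum()

# Drill down: return to the finer grain (city x month x product).
drill = sales.groupby(["city", "month", "product"], as_index=False)["amount"].sum()

# Slice and dice: select a sub-cube, e.g. only heaters sold in Sibi.
sub_cube = sales[(sales["product"] == "heater") & (sales["city"] == "Sibi")]

# Pivot (rotate): re-orient the axes, cities as rows, products as columns.
pivoted = sales.pivot_table(index="city", columns="product",
                            values="amount", aggfunc="sum")
print(rollup, drill, sub_cube, pivoted, sep="\n\n")
```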
28. Advantages of MOLAP:
a. Instant response (pre-calculated aggregates).
b. Impossible to ask a question without an answer.
c. Value added functions (ranking, % change).
29. Drawbacks of MOLAP:
a. Long load time (pre-calculating the cube may take days!).
b. Very sparse cube (wastage of space) for high cardinality (sometimes in the small hundreds), e.g., number of heaters sold in Jacobabad or Sibi.
30. MOLAP Implementation Issues
a. Maintenance issue: Every data item received must be aggregated into every cube (assuming “to-date” summaries are maintained). A lot of work.
b. Storage issue: As dimensions get less detailed (e.g., year vs. day), cubes get much smaller, but the storage consequences of building hundreds of cubes can be significant. A lot of space.
31. Partitioned Cubes
32. Virtual Cubes
33. Lecture-10
34. Relational OLAP (ROLAP)
35. Why ROLAP?
36. ROLAP as a “Cube”
37. How to create “Cube” in ROLAP
38. How to create “Cube” in ROLAP using SQL?
39. Problem With Simple Approach
40. CUBE Clause
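Items 37-40 in a nutshell: a ROLAP "cube" is just pre-computed aggregates for every combination of dimensions, so the simple approach needs 2^n GROUP BY queries for n dimensions; the SQL:1999 CUBE clause (available in engines such as Oracle, SQL Server and PostgreSQL) requests all of those groupings in one statement. A hedged sketch with a hypothetical sales table:

```python
from itertools import combinations

dims = ["city", "product", "month"]

# The "simple approach": one GROUP BY query per subset of dimensions,
# i.e. 2^n queries for n dimensions (including the grand total).
simple_queries = []
for r in range(len(dims) + 1):
    for subset in combinations(dims, r):
        sql = "SELECT SUM(amount) FROM sales"
        if subset:
            sql += " GROUP BY " + ", ".join(subset)
        simple_queries.append(sql)
print(len(simple_queries), "queries")   # 8 queries for 3 dimensions

# The CUBE clause expresses all of those groupings in a single statement
# (exact syntax varies slightly by engine):
cube_sql = ("SELECT city, product, month, SUM(amount) "
            "FROM sales GROUP BY CUBE (city, product, month)")
```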
41. ROLAP & Space Requirement
42. EXAMPLE: ROLAP & Space Requirement
43. ROLAP Issues
a. Maintenance
b. Non-standard hierarchy of dimensions
c. Non-standard conventions
d. Explosion of storage space requirement
e. Aggregation pit-falls
44. How to Reduce Summary tables?
45. Performance vs. Space Trade-Off
46. HOLAP
47. DOLAP
48. Lecture-11
49. Dimensional Modeling (DM)
50. The need for ER modeling?
a. Problems with early COBOLian data processing systems.
b. Data redundancies
c. From flat file to Table, each entity ultimately becomes a Table in the physical
schema.
d. Simple O(n²) Join to work with Tables
51. Why ER Modeling has been so successful?
a. Coupled with normalization, it drives all the redundancy out of the database.
b. Change (or add or delete) the data at just one point.
c. Can be used with indexing for very fast access.
d. Resulted in success of OLTP systems.
52. Need for DM:
a. Un-answered Qs
b. Complexity of Representation
c. The Paradox
53. ER vs. DM
54. How to simplify an ER data model?
a. De-Normalization
b. Dimensional Modeling (DM)
55. What is DM?
56. Dimensions have Hierarchies
57. Two Schemas
a. Star
b. Snowflake
58. “Simplified” 3NF (Retail)
59. Vastly Simplified Star Schema
60. The Benefit of Simplicity
61. Features of Star Schema
62. Quantifying space requirement
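A hedged sqlite3 sketch of a minimal retail star schema (table and column names are illustrative, not the course's "Simplified 3NF" example): a central fact table with foreign keys into small de-normalized dimension tables, plus a back-of-the-envelope space estimate, since the fact table dominates storage.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- De-normalized dimension tables (one row per member).
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_key INTEGER PRIMARY KEY, city TEXT, region TEXT);

-- Central fact table: one row per sale at the chosen grain.
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    store_key   INTEGER REFERENCES dim_store(store_key),
    qty         INTEGER,
    amount      REAL
);
""")

# Rough space estimate: fact rows dominate (assumed figures).
fact_rows, bytes_per_row = 100_000_000, 40
print(f"~{fact_rows * bytes_per_row / 1e9:.0f} GB for the fact table")
```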
63. Lecture-12
64. Process of Dimensional Modeling (in detail)
a. Choose the Business Process
b. Choose the Grain
c. Choose the Facts
d. Choose the Dimensions
65. Lecture-13
66. ETL
67. Putting Pieces together
68. ETL Cycle
69. ETL Processing
70. Overview of Data Extraction
71. Types of Data Extraction (see the sketch after this list)
a. Logical Extraction
i. Full Extraction
ii. Incremental Extraction
b. Physical Extraction
i. Online Extraction
ii. Offline Extraction
iii. Legacy vs. OLTP
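A hedged sketch of logical incremental extraction using a "last extracted" watermark column; the source table, column names and sqlite3 connection are assumptions for illustration:

```python
import sqlite3

def extract_incremental(src_con, last_extracted):
    """Pull only rows changed since the previous ETL run (incremental
    extraction); a full extraction would simply omit the WHERE clause."""
    cur = src_con.execute(
        "SELECT order_id, amount, last_updated FROM orders "
        "WHERE last_updated > ?", (last_extracted,))
    return cur.fetchall()

# Demo against an in-memory "source" system.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, last_updated TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 50.0, "2024-01-01"), (2, 75.0, "2024-02-01")])
print(extract_incremental(src, "2024-01-15"))   # only order 2
```

Offline extraction would read the same data from a dump or flat file instead of the live source system.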
72. Basic Tasks of Data Transformation (see the sketch after this list)
a. Selection
b. Splitting/Joining
c. Conversion
d. Summarization
e. Enrichment
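A hedged pandas sketch touching each of the five basic tasks on a hypothetical extract (all names and reference data are assumptions):

```python
import pandas as pd

raw = pd.DataFrame({
    "cust_id":    [7, 8, 8],
    "full_name":  ["Ali Khan", "Sara Shah", "Sara Shah"],
    "amount_str": ["250", "100", "40"],
    "city":       ["LHR", "SIBI", "SIBI"],
})
city_names = {"LHR": "Lahore", "SIBI": "Sibi"}   # reference data

out = raw[raw["cust_id"].notna()].copy()                                  # selection
out[["first", "last"]] = out["full_name"].str.split(" ", n=1, expand=True)  # splitting
out["amount"] = out["amount_str"].astype(float)                           # conversion
out["city"] = out["city"].map(city_names)                                 # enrichment
summary = out.groupby("cust_id", as_index=False)["amount"].sum()          # summarization
print(out, summary, sep="\n\n")
```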
73. Aspects of Data Loading Strategies
a. Need to look at:
i. Data freshness
ii. System performance
iii. Data volatility
b. Data Freshness
i. Very fresh data, low update efficiency
ii. Historical data, high update efficiency
iii. Always trade-offs in the light of goals
c. System performance
i. Availability of staging table space
ii. Impact on query workload
d. Data Volatility
i. Ratio of new to historical data
ii. High percentages of data change (batch update)
74. Three Loading Strategies (see the sketch after this list)
a. Full data refresh with BLOCK INSERT or ‘block slamming’ into an empty table.
b. Incremental data refresh with BLOCK INSERT or ‘block slamming’ into existing
(populated) tables.
c. Trickle/continuous feed with constant data collection and loading using row level
insert and update operations.
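A hedged sqlite3 sketch contrasting a block-style batch load with a trickle-feed row-level upsert (the ON CONFLICT upsert needs SQLite 3.24+; table names are illustrative):

```python
import sqlite3

dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY, amount REAL)")

# Full/incremental refresh: batch the rows and load them in one block.
batch = [(1, 50.0), (2, 75.0), (3, 20.0)]
dw.executemany("INSERT INTO fact_sales VALUES (?, ?)", batch)

# Trickle/continuous feed: row-level insert-or-update as changes arrive.
dw.execute(
    "INSERT INTO fact_sales VALUES (?, ?) "
    "ON CONFLICT(sale_id) DO UPDATE SET amount = excluded.amount",
    (2, 80.0))
dw.commit()
print(dw.execute("SELECT * FROM fact_sales").fetchall())
```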
75. Lecture-14
76. Issues of ETL
77. Why ETL Issues?
78. “Some” Issues
a. Usually, if not always, underestimated
b. Diversity in source systems and platforms
c. Inconsistent data representations
d. Complexity of transformations
e. Rigidity and unavailability of legacy systems
f. Volume of legacy data
g. Web scraping
79. Q:/ Beware of Data Quality?
80. ETL vs. ELT
81. Lecture-15
82. ETL Detail: Data Extraction & Transformation
83. Extracting Changed Data
a. Change Data Capture (CDC) identifies and processes only the data that has
changed, not entire tables, & makes the change data available for further use.
b. Incremental data extraction
c. Efficient when changes can be identified
d. Identification could be costly
e. Very challenging
84. Two CDC sources
a. Modern systems
b. Legacy systems
85. CDC in Modern Systems (see the sketch after this list)
a. Time Stamps
b. Triggers
c. Partitioning
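A hedged sqlite3 sketch of the trigger option above: an AFTER UPDATE trigger copies every change into a change table that the ETL job later drains (table names are assumptions):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL);
CREATE TABLE orders_changes (order_id INTEGER, new_amount REAL,
                             changed_at TEXT DEFAULT CURRENT_TIMESTAMP);

-- Trigger-based CDC: every update is also recorded in the change table.
CREATE TRIGGER trg_orders_cdc AFTER UPDATE ON orders
BEGIN
    INSERT INTO orders_changes (order_id, new_amount)
    VALUES (NEW.order_id, NEW.amount);
END;
""")

con.execute("INSERT INTO orders VALUES (1, 50.0)")
con.execute("UPDATE orders SET amount = 60.0 WHERE order_id = 1")
print(con.execute("SELECT * FROM orders_changes").fetchall())
```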
86. CDC in Legacy Systems
a. Changes recorded in tapes
b. Changes read and removed from tapes
c. Problems with reading a log/journal tape are many
87. CDC Advantages: Modern Systems
a. Immediate.
b. No loss of history
c. Flat files NOT required
88. CDC Advantages: Legacy Systems
a. No incremental on-line I/O required for log tape
b. The log tape captures all update processing
c. Log tape processing can be taken off-line.
d. No haste to make waste.
89. Major Transformation Types (see the sketch after this list)
a. Format revision
b. Decoding of fields
c. Calculated and derived values
d. Splitting of single fields
e. Merging of information
f. Character set conversion
g. Unit of measurement conversion
h. Date/Time conversion
i. Summarization
j. Key restructuring
k. De-duplication
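A hedged pandas sketch of a few of these types, de-duplication, decoding of fields, unit of measurement conversion, date/time conversion and key restructuring, on hypothetical source rows (all names are assumptions):

```python
import pandas as pd

src = pd.DataFrame({
    "cust_key":  ["A-7", "A-7", "B-3"],
    "gender":    ["M", "M", "F"],
    "weight_lb": [150.0, 150.0, 120.0],
    "order_dt":  ["01/02/2024", "01/02/2024", "15/03/2024"],
})

out = src.drop_duplicates().copy()                                       # de-duplication
out["gender"] = out["gender"].map({"M": "Male", "F": "Female"})          # decoding of fields
out["weight_kg"] = out["weight_lb"] * 0.4536                             # unit conversion
out["order_date"] = pd.to_datetime(out["order_dt"], format="%d/%m/%Y")   # date/time conversion
out["cust_id"] = out["cust_key"].str.replace("-", "").str.upper()        # key restructuring
print(out)
```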
90. Data content defects
a. Domain value redundancy
b. Non-standard data formats
c. Non-atomic data values
d. Multipurpose data fields
e. Embedded meanings
f. Inconsistent data values
g. Data quality contamination
91. Lecture-17
92. ETL Detail: Data Cleansing
93. Background / Other names: also called data scrubbing or data cleaning
94. Lighter Side of Dirty Data
95. Serious Side of Dirty Data
96. 3 Classes of Anomalies…
a. Syntactically Dirty Data
i. Lexical Errors
ii. Irregularities
b. Semantically Dirty Data
i. Integrity Constraint Violation
ii. Business rule contradiction
iii. Duplication
c. Coverage Anomalies
i. Missing Attributes
ii. Missing Records
97. Why Coverage Anomalies?
a. Equipment malfunction (bar code reader, keyboard etc.)
b. Inconsistent with other recorded data and thus deleted.
c. Data not entered due to misunderstanding/illegibility.
d. Data not considered important at the time of entry (e.g. Y2K).
98. Handling missing data (see the sketch after this list)
a. Dropping records.
b. “Manually” filling missing values.
c. Using a global constant as filler.
d. Using the attribute mean (or median) as filler.
e. Using the most probable value as filler.
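A hedged pandas sketch of options a, c and d on a hypothetical column with missing values:

```python
import pandas as pd

df = pd.DataFrame({"cust_id": [1, 2, 3, 4],
                   "income": [50_000, None, 80_000, None]})

dropped   = df.dropna(subset=["income"])                   # a. drop records
constant  = df.fillna({"income": 0})                       # c. global constant as filler
mean_fill = df.fillna({"income": df["income"].mean()})     # d. attribute mean as filler
# e. the "most probable value" would typically come from a model
#    trained on the complete records, omitted here.
print(dropped, constant, mean_fill, sep="\n\n")
```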
99. Key Based Classification of Problems
a. Primary key problems
i. Same PK but different data.
ii. Same entity with different keys.
iii. PK in one system but not in other.
iv. Same PK but in different formats.
b. Non-Primary key problems
i. Different encoding in different sources.
ii. Multiple ways to represent the same information.
iii. Sources might contain invalid data.
iv. Two fields with different data but same name.
v. Required fields left blank.
vi. Data erroneous or incomplete.
vii. Data contains null values.
100. Automatic Data Cleansing… (see the sketch after this list)
a. Statistical
b. Pattern Based
c. Clustering
d. Association Rules
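A hedged sketch of the statistical option: flag values far from the column mean (here more than three standard deviations) as likely anomalies; the threshold and data are assumptions, and the other three approaches usually rely on pattern-mining or clustering libraries.

```python
import pandas as pd

readings = pd.DataFrame({"amount": [100, 103, 98, 101, 97, 102,
                                    99, 100, 104, 96, 101, 9_999]})

mean, std = readings["amount"].mean(), readings["amount"].std()
readings["suspect"] = (readings["amount"] - mean).abs() > 3 * std
print(readings[readings["suspect"]])   # flags the 9_999 row as an outlier
```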
