Dzone Refcard160 Datawarehousing PDF
CONTENTS
Data Warehousing:
• DATA
• DATA MODELING
• NORMALIZED DATA
• FACTS
• DIMENSIONS
decision-support data for some or all of an enterprise. Data warehousing is a broad subject that is described point-by-point in
The data warehouse's technical architecture includes data sources, data integration, BI/analytics data stores, and data access.
DZONE.COM/REFCARDZ
DATA WAREHOUSING
Metadata Repository: A software tool that contains data that describes other data. The two kinds of metadata are business metadata and technical metadata.

A software tool that enables the design of data and databases through graphical means. This tool provides a detailed design capability that includes the design of tables, columns, relationships, rules, and business definitions.

Reporting and Query Tools: present it as reports and/or graphical displays. The business or analyst will be able to explore the data-exploration sanction. These tools also help produce reports and outputs that are desired and needed to understand the data.

Data Mining Tools: Software tools that find patterns in stores of data or databases. These tools are useful for predictive analytics and optimization analytics.
that other types of software cannot.
Data architecture is a blueprint for the management of data in an enterprise. The data architect builds a picture of how multiple sub-
ENTITIES
An entity is a core part of any conceptual and logical data model. An
entity is an object of interest to an enterprise: it can be a person,
organization, place, thing, activity, event, abstraction, or idea.
Entities are represented as rectangles in the data model. Think of
entities as singular nouns.
Minimum cardinality is expressed by the symbol farther away from the entity. A circle indicates that an entity is optional, while a bar indicates that an entity is mandatory. At least one is required.
HEADER AND DETAIL ENTITIES
The ADW is organized into non-changing data with logical keys and changeable data that supports tracking of changes and rapid load/insert. Use an integer as the primary surrogate key. Then, add the effective date to track changes.
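The header and detail split can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the Refcard's own design: the table names customer_header and customer_detail, their columns, and the helper current_detail are assumptions made for the example. The header holds the integer surrogate key and the logical key; the detail holds changeable values keyed by surrogate key plus effective date.

```python
import sqlite3

def build_customer_tables(conn):
    # Hypothetical tables: header = non-changing data, detail = changeable data.
    conn.executescript("""
        CREATE TABLE customer_header (
            dw_customer_id  INTEGER PRIMARY KEY,   -- integer surrogate key
            customer_number TEXT NOT NULL UNIQUE   -- logical (business) key
        );
        CREATE TABLE customer_detail (
            dw_customer_id INTEGER NOT NULL
                REFERENCES customer_header(dw_customer_id),
            effective_date TEXT NOT NULL,          -- ISO date the values took effect
            customer_name  TEXT,
            credit_rating  TEXT,
            PRIMARY KEY (dw_customer_id, effective_date)
        );
    """)

def current_detail(conn, customer_number, as_of):
    # Return the detail row in effect on the given ISO date, or None.
    return conn.execute("""
        SELECT d.customer_name, d.credit_rating
        FROM customer_header h
        JOIN customer_detail d ON d.dw_customer_id = h.dw_customer_id
        WHERE h.customer_number = ? AND d.effective_date <= ?
        ORDER BY d.effective_date DESC
        LIMIT 1
    """, (customer_number, as_of)).fetchone()

conn = sqlite3.connect(":memory:")
build_customer_tables(conn)
conn.execute("INSERT INTO customer_header VALUES (1, 'C-100')")
# Each change is a plain insert of a new detail row; nothing is updated.
conn.executemany("INSERT INTO customer_detail VALUES (?, ?, ?, ?)", [
    (1, "2018-01-01", "Acme Ltd", "B"),
    (1, "2018-06-01", "Acme Ltd", "A"),
])
```

Because a change only inserts a detail row, loads stay insert-only, and the effective date is enough to reconstruct the values at any point in time.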
ASSOCIATIVE ENTITIES
Track the history of relationships between entities using an
associative entity with effective dates and expiration dates.
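As a rough sketch of such an associative entity, the example below tracks which department an employee belonged to over time. The employee, department, and employee_department tables, the '9999-12-31' open-ended marker, and the helper department_on are all hypothetical choices for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee   (dw_employee_id   INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE department (dw_department_id INTEGER PRIMARY KEY, name TEXT);

    -- Associative entity: one row per period in which the relationship held.
    CREATE TABLE employee_department (
        dw_employee_id   INTEGER NOT NULL REFERENCES employee(dw_employee_id),
        dw_department_id INTEGER NOT NULL REFERENCES department(dw_department_id),
        effective_date   TEXT NOT NULL,
        expiration_date  TEXT NOT NULL,  -- assumed marker '9999-12-31' = still active
        PRIMARY KEY (dw_employee_id, dw_department_id, effective_date)
    );

    INSERT INTO employee VALUES (1, 'Ana');
    INSERT INTO department VALUES (10, 'Sales'), (20, 'Finance');
    INSERT INTO employee_department VALUES
        (1, 10, '2017-01-01', '2018-03-31'),
        (1, 20, '2018-04-01', '9999-12-31');
""")

def department_on(conn, employee_id, on_date):
    # Return the department name the employee belonged to on on_date, or None.
    row = conn.execute("""
        SELECT d.name
        FROM employee_department ed
        JOIN department d ON d.dw_department_id = ed.dw_department_id
        WHERE ed.dw_employee_id = ?
          AND ? BETWEEN ed.effective_date AND ed.expiration_date
    """, (employee_id, on_date)).fetchone()
    return row[0] if row else None
```

Here both dates are inclusive, so adjacent periods must not share a day; other conventions, such as an exclusive expiration date, work equally well if applied consistently.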
First Normal Form (1NF): Entities contain no repeating groups of attributes.
Second Normal Form (2NF): Entity is in the first normal form and attributes that depend on only part of a composite key are separated into new entities.
The entity is in the second normal form

ATOMIC DW SPECIALIZED ATTRIBUTES
Use specialized attributes to improve ADW efficiency and effectiveness. Identify these attributes using a prefix of ADW_.

ATTRIBUTE NAME    DESCRIPTION
dw_xxx_id         Data warehouse assigned surrogate key. Replace ‘xxx’ with a reference to the table name, such as ‘dw_customer_
SUPPORTING TABLES
Supporting data is required to enable the data warehouse to operate smoothly. Here is some supporting data:
• Code management and translation.
• Data source tracking.
• Error logging.
CODE TRANSLATION
Data warehousing requires that codes, such as gender code and units of measure, be translated to standard values aided by code-translation tables like these:
• Code set: Group of codes, such as "gender code."
DIMENSIONAL DATABASE
A dimensional database is a database that is optimized for query and analysis and is not normalized like the atomic data warehouse. It consists of fact and dimension tables, where each fact is connected to one or more dimensions.
The sales order fact includes the measures order quantity and currency amount. Dimensions of Calendar Date, Product, Customer, Geo Location, and Sales Organization put the sales order fact into context. This star schema supports looking at orders in a cubical way, enabling slicing and dicing by customer, time, and product.
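The sales order star schema described above can be sketched as follows. This is an illustrative reduction, not the Refcard's schema: table and column names are assumptions, the sample rows are made up, and the Geo Location and Sales Organization dimensions are left out to keep the example short. The helper sales_by shows one way to slice the same fact by different dimensions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_calendar_date (date_key INTEGER PRIMARY KEY,
                                    calendar_date TEXT, month TEXT);
    CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, product_name  TEXT);
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT);

    -- Fact table: foreign keys to each dimension plus the two measures.
    CREATE TABLE fact_sales_order (
        date_key        INTEGER REFERENCES dim_calendar_date(date_key),
        product_key     INTEGER REFERENCES dim_product(product_key),
        customer_key    INTEGER REFERENCES dim_customer(customer_key),
        order_quantity  INTEGER,
        currency_amount REAL
    );

    INSERT INTO dim_calendar_date VALUES
        (20180105, '2018-01-05', '2018-01'), (20180210, '2018-02-10', '2018-02');
    INSERT INTO dim_product  VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO dim_customer VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO fact_sales_order VALUES
        (20180105, 1, 1, 10, 100.0),
        (20180105, 2, 2,  5, 250.0),
        (20180210, 1, 2,  3,  30.0);
""")

# Whitelist of grouping columns, so no user text is spliced into the SQL.
GROUPINGS = {
    "customer": "c.customer_name",
    "product": "p.product_name",
    "month": "d.month",
}

def sales_by(conn, dimension):
    # Total quantity and amount grouped by one dimension attribute.
    label = GROUPINGS[dimension]
    sql = f"""
        SELECT {label}, SUM(f.order_quantity), SUM(f.currency_amount)
        FROM fact_sales_order f
        JOIN dim_calendar_date d ON d.date_key = f.date_key
        JOIN dim_product  p ON p.product_key  = f.product_key
        JOIN dim_customer c ON c.customer_key = f.customer_key
        GROUP BY {label}
        ORDER BY {label}
    """
    return conn.execute(sql).fetchall()
```

Calling sales_by(conn, "customer"), sales_by(conn, "product"), or sales_by(conn, "month") reuses the same fact rows and only changes the grouping, which is the slicing and dicing the star layout is meant to make cheap.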
FACT-LESS FACT
The fact-less fact tracks an association between dimensions
rather than quantitative metrics. Examples include miles, event
attendance, and sales promotions.
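A fact-less fact for event attendance might look like the sketch below. The table and column names are hypothetical. The fact table carries only dimension keys, so analysis works by counting rows rather than summing a measure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_event    (event_key    INTEGER PRIMARY KEY, event_name    TEXT);
    CREATE TABLE dim_attendee (attendee_key INTEGER PRIMARY KEY, attendee_name TEXT);

    -- Fact-less fact: dimension keys only, no numeric measures.
    CREATE TABLE fact_event_attendance (
        date_key     INTEGER NOT NULL,
        event_key    INTEGER NOT NULL REFERENCES dim_event(event_key),
        attendee_key INTEGER NOT NULL REFERENCES dim_attendee(attendee_key),
        PRIMARY KEY (date_key, event_key, attendee_key)
    );

    INSERT INTO dim_event VALUES (1, 'Kickoff'), (2, 'Workshop');
    INSERT INTO dim_attendee VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO fact_event_attendance VALUES
        (20180105, 1, 1), (20180105, 1, 2), (20180105, 1, 3),
        (20180106, 2, 1);
""")

def attendance_counts(conn):
    # Each fact row records that an attendee was at an event; COUNT(*) is the metric.
    return conn.execute("""
        SELECT e.event_name, COUNT(*)
        FROM fact_event_attendance f
        JOIN dim_event e ON e.event_key = f.event_key
        GROUP BY e.event_name
        ORDER BY e.event_name
    """).fetchall()
```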
DIMENSIONS
A dimension is a database table that contains properties that
identify and categorize. The attributes serve as labels for reports
and as data points for summarization. In the dimensional model,
dimensions surround and qualify facts.
AGGREGATED FACT
Aggregated facts provide summary information, such as general ledger totals during a period of time or complaints per product per store per month.
DEGENERATE DIMENSION
A degenerate dimension has a dimension key without a dimension table. Examples include transaction numbers, shipment numbers, and order numbers.
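To make the degenerate dimension concrete, the sketch below stores an order number directly on a fact table with no matching dimension table. The names fact_order_line, dim_product, and lines_per_order are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);

    CREATE TABLE fact_order_line (
        order_number   TEXT NOT NULL,     -- degenerate dimension: no dim_order table
        product_key    INTEGER NOT NULL REFERENCES dim_product(product_key),
        order_quantity INTEGER NOT NULL
    );

    INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO fact_order_line VALUES
        ('SO-1001', 1, 2), ('SO-1001', 2, 1), ('SO-1002', 1, 5);
""")

def lines_per_order(conn):
    # The order number still groups fact rows even though it has no attributes.
    return conn.execute("""
        SELECT order_number, COUNT(*), SUM(order_quantity)
        FROM fact_order_line
        GROUP BY order_number
        ORDER BY order_number
    """).fetchall()
```

The order number has no descriptive attributes of its own, so a separate dimension table would add a join without adding information.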
SCD Type 2: Adds a new row. Each change adds a new row where all the values are the same except for the changed fields. This means that one or more new fields are added to mark the rows and state which one is effective.
CHANGE DATA CAPTURE (CDC)
The CDC pattern of data integration is strong in event processing. Database logs that contain a record of database changes are replicated in near real time at staging. This information is then transformed and loaded to the data warehouse.
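One way the two ideas above can meet is to apply captured change records to a dimension as SCD Type 2 rows. The sketch below is an assumption-laden illustration in plain Python: the CustomerRow fields, the '9999-12-31' open-ended marker, the is_current flag, and the shape of the change records are all hypothetical, and a real CDC feed would come from database logs rather than a list.

```python
from dataclasses import dataclass

OPEN_END = "9999-12-31"  # assumed marker for "no expiration yet"

@dataclass
class CustomerRow:
    dw_customer_id: int        # surrogate key, one per row version
    customer_number: str       # logical key shared by all versions
    city: str
    effective_date: str        # inclusive
    expiration_date: str = OPEN_END  # exclusive
    is_current: bool = True

def apply_change(rows, customer_number, city, change_date):
    # Apply one captured change as SCD Type 2: expire the old row, add a new one.
    current = next(
        (r for r in rows if r.customer_number == customer_number and r.is_current),
        None,
    )
    if current is not None:
        if current.city == city:
            return rows  # nothing changed, so no new version is needed
        current.expiration_date = change_date
        current.is_current = False
    next_id = max((r.dw_customer_id for r in rows), default=0) + 1
    rows.append(CustomerRow(next_id, customer_number, city, change_date))
    return rows

# Hypothetical change records, as they might arrive from staging.
change_log = [
    ("C-100", "Cary", "2018-01-01"),
    ("C-100", "Raleigh", "2018-06-01"),
    ("C-100", "Raleigh", "2018-07-01"),  # repeat with no actual change
]

dimension = []
for customer_number, city, change_date in change_log:
    apply_change(dimension, customer_number, city, change_date)
```

After the loop the dimension holds two versions of customer C-100: the expired Cary row and the current Raleigh row. The third record changes nothing and adds no row.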
for batch processing of bulk data.

modeling, where the adaptive schema changes in real time along with the data, and changes are seamless. You only need to upload the data sources; everything else is automated, including the following tasks:
• Data types are automatically discovered, and a schema is generated based on the initial data structure.
• Likely relationships between tables are automatically detected and used to model a relational schema.
• Aggregations are automatically generated.
• Table history, which stores data uploaded from API data
• Re-indexing happens automatically whenever the algorithm detects changes in query patterns.
• Redistributing the data across nodes to improve data locality and join performance is done automatically.

SOLVING CONCURRENCY ISSUES
To remedy concurrency issues, new cloud data warehousing technologies today can separate storage from compute and increase the compute nodes based on the number of connections. Consequently, the number of available clusters scales with the number of users and the intensity of the workload, supporting hundreds of parallel queries that are load-balanced between clusters.

uninterrupted. When the scaling is complete, the old and new clusters are swapped instantly. Data warehouse maintenance itself has been greatly improved as well, by automating the cleaning and compressing of tables to boost database performance.
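The automatic data type discovery mentioned in the task list above could work roughly like the sketch below. This is a simplified illustration, not how any particular product does it: the functions infer_type and generate_schema, the three-type ladder (INTEGER, REAL, TEXT), and the sample CSV are all assumptions.

```python
import csv
import io

def infer_type(values):
    # Return the narrowest of INTEGER, REAL, TEXT that fits every non-empty value.
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "TEXT"  # nothing to learn from, so fall back to the widest type
    for sql_type, caster in (("INTEGER", int), ("REAL", float)):
        try:
            for v in non_empty:
                caster(v)
            return sql_type
        except ValueError:
            continue  # some value did not parse; try the next wider type
    return "TEXT"

def generate_schema(table_name, csv_text):
    # Build a CREATE TABLE statement from the header and the initial rows.
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = [
        f"{name} {infer_type([row[name] for row in rows])}"
        for name in reader.fieldnames
    ]
    return f"CREATE TABLE {table_name} ({', '.join(columns)})"

# Hypothetical uploaded source data.
sample = "order_id,amount,customer\n1,19.99,Acme\n2,5,Globex\n"
ddl = generate_schema("orders", sample)
```

A production system would also need to handle dates, nulls, and later rows that contradict the initial guess, which is where the adaptive part of the schema comes in.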
DZone, Inc.
150 Preston Executive Dr. Cary, NC 27513
888.678.0399 919.678.0300
DZone communities deliver over 6 million pages each month to more than 3.3 million software developers, architects and decision makers. DZone offers something for everyone, including news, tutorials, cheat sheets, research guides, feature articles, source code and more. "DZone is a developer’s dream," says PC Magazine.
Copyright © 2018 DZone, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.