TI
D5784
Data Mining
Summary
The latest and most successful advocate for data warehousing is Bill Inmon, who has earned
the title of ‘father of data warehousing’ due to his active promotion of the concept.
The Great Debates about Data Warehouse
"The data warehouse is nothing more than the union of all the data marts."
(Ralph Kimball)
"You can catch all the minnows in the ocean and stack them together and
they still do not make a whale."
(Bill Inmon)
Independent Data Mart (Ralph Kimball) versus Dependent Data Mart (Bill Inmon)
Dimensional Modeling
Dimensional modeling is the design concept used by many data warehouse
designers to build their data warehouse.
The dimensional data model provides a method for making databases simple and
understandable.
The major purpose of creating a data warehouse from transactional systems is
to create intelligence out of day-to-day activity. It is not intended for extracting
operational reports and should not be treated merely as a report store.
[Figure: OLTP, ETL, Dimensional Modeling, Data Warehouse]
Fact table: A central table in a data warehouse schema that contains numerical measures and
keys relating facts to dimension tables.
Dimension table: Represents a business entity of the source system. On the source
system, a single business entity may be spread across multiple normalized tables.
All fact tables have two or more foreign keys that connect to the dimension tables’ primary
keys.
The fact table itself generally has its own primary key made up of a subset of the foreign keys.
Star schema is a logical structure that has a fact table containing factual data in the
center, surrounded by dimension tables containing reference data (which can be
denormalized).
Sales Fact: ProductKey, CustomerKey, StoreKey, DateKey
Product Dimension: ProductKey, Product Name, StartDate, ProductManufacturer
Store Dimension: StoreKey, StoreName, StartDate, StoreLocation
Customer Dimension: CustomerKey, CustomerName, StartDate, CustomerLocation
Date Dimension: DateKey, CalendarDate, Month, Day
The star schema has a center, represented by a fact table, and the points of the star, represented
by the dimension tables.
From a technical perspective, the advantages of a star schema are simple joins between the dimension
and fact tables, good query performance, the ability to slice the data, and data that is easy to understand.
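As a rough illustration of the single-join pattern described above, the following sketch builds a cut-down star schema in an in-memory SQLite database and totals sales by store location. The SalesAmount measure, the table names ProductDim and StoreDim, and all sample rows are assumptions made for the example; only two of the four dimensions are included to keep it short.

```python
import sqlite3

# In-memory database; all sample rows below are hypothetical.
star_conn = sqlite3.connect(":memory:")
star_conn.executescript("""
CREATE TABLE ProductDim (ProductKey INTEGER PRIMARY KEY, ProductName TEXT);
CREATE TABLE StoreDim (
    StoreKey INTEGER PRIMARY KEY,
    StoreName TEXT,
    StoreLocation TEXT          -- denormalized: location lives in the dimension
);
CREATE TABLE SalesFact (
    ProductKey INTEGER REFERENCES ProductDim(ProductKey),
    StoreKey INTEGER REFERENCES StoreDim(StoreKey),
    SalesAmount REAL,           -- numerical measure (assumed, not in the diagram)
    PRIMARY KEY (ProductKey, StoreKey)  -- key built from the foreign keys
);
INSERT INTO ProductDim VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO StoreDim VALUES (10, 'Downtown', 'North'), (20, 'Airport', 'South');
INSERT INTO SalesFact VALUES (1, 10, 100.0), (2, 10, 50.0), (1, 20, 75.0);
""")

def sales_by_store_location(connection):
    """Total sales per store location: one join from the fact to one dimension."""
    return connection.execute("""
        SELECT s.StoreLocation, SUM(f.SalesAmount)
        FROM SalesFact AS f
        JOIN StoreDim AS s ON s.StoreKey = f.StoreKey
        GROUP BY s.StoreLocation
        ORDER BY s.StoreLocation
    """).fetchall()

print(sales_by_store_location(star_conn))
```

Slicing by a different dimension only changes which dimension table is joined; the fact table stays the same.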
Snowflake schema is a variant of the star schema where dimension tables do not
contain denormalized data.
Sales Fact: ProductKey, CustomerKey, StoreKey, DateKey
Prod Dimension: ProductKey, Product Name, StartDate, ProductMfr
Store Dimension: StoreKey, StoreName, StartDate, LocKey
Loc Dim: LocKey, LocName, StartDate
Cust Dimension: CustomerKey, CustomerName, StartDate, CustomerLocKey
CustLoc Dim: CustomerLocKey, Location, StartDate
Date Dimension: DateKey, CalendarDate, Month, Day
In a snowflake, the dimension tables are normalized. From a performance perspective, the snowflake
may result in slower queries because of the additional joins required.
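To make the extra join concrete, here is a minimal sketch of the same kind of query against a snowflaked store dimension, where the location name has moved into its own table. The SalesAmount measure and the sample rows are assumptions for illustration only.

```python
import sqlite3

# In-memory database; all sample rows below are hypothetical.
snow_conn = sqlite3.connect(":memory:")
snow_conn.executescript("""
CREATE TABLE LocDim (LocKey INTEGER PRIMARY KEY, LocName TEXT);
CREATE TABLE StoreDim (
    StoreKey INTEGER PRIMARY KEY,
    StoreName TEXT,
    LocKey INTEGER REFERENCES LocDim(LocKey)   -- normalized: points to LocDim
);
CREATE TABLE SalesFact (
    StoreKey INTEGER REFERENCES StoreDim(StoreKey),
    SalesAmount REAL                            -- assumed measure
);
INSERT INTO LocDim VALUES (1, 'North'), (2, 'South');
INSERT INTO StoreDim VALUES (10, 'Downtown', 1), (20, 'Airport', 2), (30, 'Mall', 1);
INSERT INTO SalesFact VALUES (10, 100.0), (20, 75.0), (30, 25.0);
""")

def sales_by_location(connection):
    """Total sales per location: fact -> StoreDim -> LocDim (two joins)."""
    return connection.execute("""
        SELECT l.LocName, SUM(f.SalesAmount)
        FROM SalesFact AS f
        JOIN StoreDim AS s ON s.StoreKey = f.StoreKey
        JOIN LocDim AS l ON l.LocKey = s.LocKey
        GROUP BY l.LocName
        ORDER BY l.LocName
    """).fetchall()

print(sales_by_location(snow_conn))
```

In a star schema the location name would sit directly in the store dimension, so the second join would not be needed.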
OLAP key features as described in the OLAP Council White Paper (2001):
1. multi-dimensional views of data
2. support for complex calculations
3. time intelligence
OLAP database servers use multi-dimensional structures to store data and relationships
between data.
Multidimensional structures can be visualized as cubes of data, and cubes within cubes of data.
Each side of the cube is considered a dimension.
Multidimensional Data as 3-field Table versus 2-D Cube
Multidimensional Data as 4-field Table versus 3-D Cube
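The table-versus-cube idea can be sketched in a few lines: a 3-field table (two dimension fields plus one measure) is pivoted into a 2-D structure indexed by the two dimensions. The product and region names and the figures are made up for the example, and the nested dictionary is only a stand-in for the storage structures a real OLAP server would use.

```python
from collections import defaultdict

# Hypothetical 3-field table: (product, region, sales amount).
sales_rows = [
    ("Widget", "North", 100),
    ("Widget", "South", 75),
    ("Gadget", "North", 50),
]

def to_cube(records):
    """Pivot (dim1, dim2, measure) rows into a 2-D cube: cube[dim1][dim2]."""
    cube = defaultdict(dict)
    for product, region, amount in records:
        # Sum measures that fall into the same cell.
        cube[product][region] = cube[product].get(region, 0) + amount
    return {product: dict(cells) for product, cells in cube.items()}

def roll_up(cube):
    """Aggregate away the second dimension, leaving totals per first dimension."""
    return {product: sum(cells.values()) for product, cells in cube.items()}

cube = to_cube(sales_rows)
print(cube["Widget"]["South"])  # one cell of the cube
print(roll_up(cube))            # totals per product
```

Adding a third dimension field such as time would add one more level of nesting, which corresponds to the 4-field table versus 3-D cube case.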
OLAP tools are categorized according to the architecture of the database providing the
data for the purposes of analytical processing.
There are 4 main categories of OLAP Tools:
- Multi-dimensional OLAP (MOLAP)
- Relational OLAP (ROLAP)
- Hybrid OLAP (HOLAP)
- Desktop OLAP (DOLAP)
Data mining is not:
Data warehouse: A data warehouse, whether relational or OLAP, can be used as a
source for the mining process, but data mining itself is not a store for
warehouse objects such as facts and dimensions.
Reporting store: Data mining is not a report store. It provides a method for
analyzing data and making decisions. It does not provide any reports other
than the analyzed data.
OLAP: Online analytical processing stores the data warehouse data in a
multi-dimensional store and aggregates it accordingly. Data mining does
not require the data to be multi-dimensional or aggregated. It cannot be
treated as a replacement for an OLAP store.
OLAP versus Data Mining
OLAP: Typically focuses on historical facts.
Data Mining: Typically focuses on future outcomes or trends.
OLAP: Limited ability to include reliability estimates with predictions.
Data Mining: Data models available for predicting, discovering patterns, estimating and producing accurate results for trend analysis and forecasting.
OLAP: OLAP can be used as a data source for Data Mining models.
Data Mining: Data mining results can also be used in OLAP applications by incorporating new predictive variables or scores as dimensions or attributes in your OLAP tool.
There are 4 main operations associated with data mining techniques:
1. Predictive modeling
2. Database segmentation
3. Link analysis
4. Deviation detection
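As a small illustration of the fourth operation, deviation detection, the sketch below flags values that lie far from the mean of a data set. The z-score rule, the threshold of 2.0 and the sample values are assumptions chosen for the example; they are one simple technique among many and are not prescribed by the list above.

```python
import statistics

# Hypothetical daily transaction amounts; the last one is unusually large.
amounts = [10, 12, 11, 13, 12, 50]

def find_deviations(values, threshold=2.0):
    """Return values whose z-score magnitude exceeds the threshold."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mean) / spread > threshold]

print(find_deviations(amounts))
```

With so few points a single extreme value inflates the standard deviation, so the threshold is deliberately low here; larger samples usually allow a stricter cut-off.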
[Figure: Data analysis cycle: Operational analysis, Trend analysis, Ad hoc analysis, Predictive analysis]
Operational analysis is business transaction reporting (closing bank
balances, who was admitted into the hospital today, how many support calls
were closed today, etc.).
Trend analysis examines the growth of historical data over a period of
time.
Ad hoc analysis is business context analysis (product sales by region). It
can also be used for finding a root cause, such as a sudden decrease in sales
of a product due to floods or another natural calamity.
Predictive analysis is predicting the patterns for the future (also called
forecasting).
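The difference between trend analysis and predictive analysis can be sketched with a straight-line fit: the fitted slope describes the historical trend, and extending the line one period ahead gives a forecast. The monthly figures are invented, and ordinary least squares is only an assumed stand-in for whatever model a real data mining tool would apply.

```python
# Hypothetical monthly sales for four consecutive months.
monthly_sales = [100.0, 110.0, 120.0, 130.0]

def fit_trend(values):
    """Least-squares line through (0, v0), (1, v1), ...; returns (slope, intercept)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def forecast_next(values):
    """Extend the fitted line one period past the last observation."""
    slope, intercept = fit_trend(values)
    return intercept + slope * len(values)

slope, intercept = fit_trend(monthly_sales)
print(slope)                         # trend: growth per month
print(forecast_next(monthly_sales))  # prediction for the next month
```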
Factors that would encourage considering data mining include the following:
Data availability in source systems: Detailed data is available from source systems,
preferably on a near real-time basis. Detailed data makes a good basis for
accurate and predictable results.
Huge data volume: Large data sets that can be difficult to analyze effectively using
other tools lend themselves to data mining solutions. Also, the statistical functions
in data mining require a large sample set in order to produce meaningful results.
Complexity of identifying trends: When multiple factors enter into forecasting or
discovery analysis, the problem lends itself to data mining, particularly when the
appropriate grouping structures are not known in advance.
Automating with minimum user interaction: Because data mining is driven by data
values, the same solution can be implemented at different customer locations,
achieving customized behavior with no changes to the application.