Introduction To Business Intelligence and Data Analysis
IT300
What is Business Intelligence?
Zeng et al. (2006) define BI as the process of
collecting, treating and diffusing information
with the objective of reducing uncertainty
in all strategic decision making.
Decision Making Process
1. Define the problem
2. Obtain data
3. Analyze data
4. Create alternatives
5. Select the best alternative
6. Implement the decision
Data analysis
Information processing is the analysis of a large
quantity of data or other forms of information to
support decision making and to discover
knowledge in data.
This is indeed the biggest challenge posed by big
and often unstructured data: how to analyze it in a
useful way.
Objectives of Data analysis:
• Increase the effectiveness of the manager’s
decision making process,
• Support the manager in the decision making
process but not replace it,
• And improve the direction of decision making.
Evolution of Database Technology
1960s:
• Data collection, database creation, IMS and
network DBMS
1970s:
• Relational data model, relational DBMS
implementation
1980s:
• RDBMS, advanced data models and
application-oriented DBMS
1990s-2000s:
• Data mining, data warehousing, multimedia
databases and web databases
Origins of Data Warehouses
Database developers understood that their
software was required for both transactional and
analytical processing.
However, operational and analytical data are
separate with different requirements and different
user communities.
Once these differences were understood, new
databases were created specifically for analytical use.
Operational processing (transactional processing)
captures, stores and manipulates data to support
daily operations.
Information processing is the analysis of data or
other forms of information to support decision
making.
A data warehouse can consolidate and integrate
information from many internal and external
sources and arrange it in a meaningful format for
making business decisions.
What is a Data Warehouse?
According to Inmon (the father of data warehousing):
It is a collection of integrated, subject-oriented
databases designed to support the DSS function,
where each unit of data is non-volatile and relevant
to some moment in time.
Alternatively, a DW is a subject-oriented, integrated,
time-variant, non-updatable collection of data used in
support of management decision-making processes:
• Subject-oriented: e.g., customers, patients, products
• Integrated: consistent naming conventions, formats and
encoding structures; drawn from multiple data sources
• Time-variant: trends and changes can be studied
• Non-updatable: read-only, periodically refreshed
Need for Data Warehousing
Database, Data Warehouse and Data Set
DB: contains tables, in which rows are records and
columns are fields. Most DBs are relational DBs
(tables are related to one another to reduce redundancy
and improve DB performance via the normalization process).
DW: a type of DB that has been denormalized
and archived.
Denormalization is the process of combining
some tables into a single table. This may
introduce duplicate data, but it reduces the
number of joins a query has to process.
Data set: a subset of a DW or a DB. It is usually
denormalized so that only one table is used.
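The effect of denormalization can be sketched with Python's built-in sqlite3 module. The tables `customer`, `orders` and `orders_wide`, and all the rows in them, are hypothetical and only serve as an illustration; they do not come from the slides.

```python
import sqlite3

# Hypothetical normalized tables: customers and their orders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, 'Ana', 'Lisbon'), (2, 'Ben', 'Oslo');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);

    -- Denormalize: combine both tables into one wide table.
    CREATE TABLE orders_wide AS
        SELECT o.order_id, o.amount, c.name, c.city
        FROM orders AS o JOIN customer AS c ON c.customer_id = o.customer_id;
""")

# The customer's name and city are now duplicated on every order row,
# but this analytical query needs no join.
rows = conn.execute(
    "SELECT city, SUM(amount) FROM orders_wide GROUP BY city ORDER BY city"
).fetchall()
print(rows)  # [('Lisbon', 65.0), ('Oslo', 15.0)]
```

The single table `orders_wide` also matches the description of a data set: one denormalized table that can be handed to an analysis tool.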
How Do Data Warehouses Differ
From Operational Systems?
Goals
Structure
Size
Performance optimization
Technologies used
Need to separate operational and
information systems
Three primary factors:
A data warehouse centralizes data that are
scattered throughout disparate operational
systems and makes them available for DM.
A well-designed data warehouse adds value to
data by improving their quality and consistency.
A separate data warehouse eliminates much of the
contention for resources that results when
information applications are mixed with
operational processing.
Comparison of Database Types
From the Data Warehouse to Data
Marts
A data mart contains only those data that are
specific to a particular group. For example, the
marketing data mart may contain only data
related to items, customers, and sales.
Data marts are confined to subjects.
Data marts are small in size.
Data marts are customized by department.
How Data Warehousing works
Extraction, Transformation, Loading (ETL) tools
[Figure: ETL flow. Sources -> Extract -> DSA (Transform & Clean) -> Load -> DW]
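The three ETL steps can be sketched as three small Python functions. This is a minimal illustration, assuming that DSA stands for a data staging area; the function names, the `sales` table and the sample records are hypothetical and are not taken from any particular ETL tool.

```python
import sqlite3

def extract(source_rows):
    """Extract: copy raw records from a source into the staging area unchanged."""
    return list(source_rows)

def transform(staged_rows):
    """Transform & clean: drop incomplete rows, normalize text, cast types."""
    cleaned = []
    for row in staged_rows:
        if not row.get("product") or row.get("amount") in (None, ""):
            continue  # discard records that miss a required field
        cleaned.append((row["product"].strip().lower(), float(row["amount"])))
    return cleaned

def load(conn, rows):
    """Load: append the cleaned rows to the warehouse table."""
    conn.executemany("INSERT INTO sales (product, amount) VALUES (?, ?)", rows)
    conn.commit()

# Hypothetical source records, two of them incomplete.
source = [
    {"product": " Milk ", "amount": "2.50"},
    {"product": "", "amount": "1.00"},      # missing product name
    {"product": "Cream", "amount": None},   # missing amount
    {"product": "cream", "amount": "3.10"},
]

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (product TEXT, amount REAL)")
load(warehouse, transform(extract(source)))
loaded = warehouse.execute(
    "SELECT product, amount FROM sales ORDER BY product"
).fetchall()
print(loaded)  # [('cream', 3.1), ('milk', 2.5)]
```

Keeping the steps as separate functions mirrors the flow: cleaning happens on staged copies, so the sources are never modified and only consistent rows reach the DW.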
ER Model vs. Multidimensional Model
Why don’t we use the entity-relationship (ER)
model in data warehousing?
ER model: a general-purpose data model
– All types of data are treated as equal, so it is difficult
to identify the data that is important for business analysis
• No distinction between what is important and what
merely describes the important
• Normalized databases (many details that can affect
privacy and security)
– It is hard to get an overview of a large ER diagram
(e.g., over 100 entities/relations for an enterprise)
Traditional DBs generally deal with two-dimensional
data. However, querying is more efficient in a
multidimensional data storage model.
More built-in “meaning”
– What is important
– What describes the important
– What we want to optimize
Recognized by OLAP/BI tools: tools that offer powerful
query facilities based on Multi-Dimensional (MD) design
Multidimensional Model
Data is divided into facts and dimensions
A fact is the important entity, e.g., a sale
Facts have measures that can be aggregated, e.g., sales
price
Dimensions describe facts
Facts “live” in an MD cube
Goal of dimensional modeling:
– Surround facts with as much context (dimensions) as
possible
– Hint: redundancy may be OK (in well-chosen places)
– But you should not try to model all relationships in the
data (unlike E/R and OO modeling!)
Dimension
Dimensions are the core of MD databases
Dimensions are used for
– Selection of data
– Grouping of data at the right level of detail
Dimensions consist of dimension values
– Product dimension has values “milk”, “cream”, …
– Time dimension has values “1/1/2001”, “2/1/2001”, …
Dimension values may have an ordering
– Used for comparing cube data across values
– Especially used for the Time dimension
Dimensions have hierarchies with levels
– Typically 3-5 levels (of detail)
– Dimension values are organized in a tree structure
– Product: Product->Type->Category
– Store: Store->Area->City->County
– Time: Day->Month->Quarter->Year
– Dimensions have a bottom level and a top level
Levels may have attributes
– Simple, non-hierarchical information
– Day has Workday as attribute
Dimensions should contain much information
– Time dimension may contain holiday, season, events, …
– Good dimensions have 50-100 or more attributes/levels
Facts
Facts represent the subject of the desired analysis
• The important aspect of the business that should be
analyzed
A fact is identified via its dimension values
• A fact is a non-empty cell
Generally, a fact should:
• Be attached to exactly one dimension value in
each dimension
• Only be attached to dimension values in the
bottom levels
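The two rules for facts can be sketched with a sparse cube stored as a Python dictionary. The dimension names, their bottom-level values and the helper `add_fact` are hypothetical and chosen only for illustration.

```python
# Hypothetical bottom-level values for three dimensions.
BOTTOM_LEVELS = {
    "product": {"milk", "cream"},
    "store": {"store_1", "store_2"},
    "day": {"2001-01-01", "2001-01-02"},
}
DIMENSIONS = tuple(BOTTOM_LEVELS)  # ("product", "store", "day")

cube = {}  # sparse cube: only non-empty cells (facts) are stored

def add_fact(coords, sales_price):
    """Store a fact under exactly one bottom-level value per dimension."""
    if set(coords) != set(DIMENSIONS):
        raise ValueError("a fact needs one value for each dimension")
    for dim, value in coords.items():
        if value not in BOTTOM_LEVELS[dim]:
            raise ValueError(f"{value!r} is not a bottom-level value of {dim}")
    cube[tuple(coords[d] for d in DIMENSIONS)] = sales_price

add_fact({"product": "milk", "store": "store_1", "day": "2001-01-01"}, 2.5)
add_fact({"product": "cream", "store": "store_2", "day": "2001-01-02"}, 3.1)

print(len(cube))                                  # 2 facts, i.e. 2 non-empty cells
print(("milk", "store_2", "2001-01-01") in cube)  # False: an empty cell is not a fact
```

A call such as `add_fact({"product": "milk", "store": "store_1", "day": "January"}, 1.0)` raises `ValueError`, because a month is not a bottom-level value of the day dimension in this sketch.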
Measures
Measures represent the fact property that the
users want to study and optimize
Example: total sales price
A measure has two components
– Numerical value (e.g., sales price)
– Aggregation formula (e.g., SUM): used for
aggregating/combining a number of measure values
into one
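The two components of a measure can be sketched as a small Python data class. The class name `Measure`, its fields and the sample prices are hypothetical; the point is only that the same numerical property can be paired with different aggregation formulas.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass(frozen=True)
class Measure:
    """A measure pairs a numerical fact property with an aggregation formula."""
    name: str
    aggregate: Callable[[Iterable[float]], float]

    def combine(self, values: Iterable[float]) -> float:
        """Aggregate a number of measure values into one."""
        return self.aggregate(values)

# Same numerical value (sales price), two different aggregation formulas.
total_sales_price = Measure("sales_price", sum)
highest_sales_price = Measure("sales_price", max)

prices = [2.5, 3.0, 4.5]
print(total_sales_price.combine(prices))    # 10.0
print(highest_sales_price.combine(prices))  # 4.5
```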
Multidimensional Model
Example: sales of supermarkets
• Facts and measures
– Each sales record is a fact, and its sales value is a
measure
• Dimensions
– Group correlated attributes into the same
dimension
– Each sales record is associated with its values of
Product, Store and Time
Granularity: Dimensionality Hierarchy
Granularity of facts is important
– Level of detail
– Given by the combination of bottom levels
A dimensional hierarchy defines mappings from a set of
lower-level concepts to higher-level concepts.
[Figure: dimension hierarchies over 2D data. Location levels: ZipCode, Area, City, Region, Country. Time levels: Day, Week, Month, Quarter, Season, Year.]
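A hierarchy mapping and its effect on granularity can be sketched in a few lines of Python. The mappings `DAY_TO_MONTH` and `MONTH_TO_YEAR`, the sample facts and the helper `roll_up` are hypothetical, and SUM is assumed as the aggregation formula.

```python
from collections import defaultdict

# Hypothetical hierarchy mappings for the Time dimension: each lower-level
# value maps to exactly one higher-level value.
DAY_TO_MONTH = {
    "2001-01-01": "2001-01",
    "2001-01-02": "2001-01",
    "2001-02-01": "2001-02",
}
MONTH_TO_YEAR = {"2001-01": "2001", "2001-02": "2001"}

# Facts at the finest granularity: (product, day) -> sales price.
facts_by_day = {
    ("milk", "2001-01-01"): 2.5,
    ("milk", "2001-01-02"): 3.0,
    ("milk", "2001-02-01"): 4.5,
}

def roll_up(facts, mapping):
    """Aggregate facts to the next level of the time hierarchy using SUM."""
    coarser = defaultdict(float)
    for (product, time_value), amount in facts.items():
        coarser[(product, mapping[time_value])] += amount
    return dict(coarser)

facts_by_month = roll_up(facts_by_day, DAY_TO_MONTH)
facts_by_year = roll_up(facts_by_month, MONTH_TO_YEAR)
print(facts_by_month)  # {('milk', '2001-01'): 5.5, ('milk', '2001-02'): 4.5}
print(facts_by_year)   # {('milk', '2001'): 10.0}
```

Rolling up only works in one direction: once the facts are stored at month level, the daily detail cannot be recovered, which is why the granularity of the stored facts matters.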
Schema Design
A schema is a logical description of the entire
database.
Much like a database, a data warehouse also
needs to maintain a schema.
A database uses the relational model, while a data
warehouse uses a star, snowflake, or fact
constellation schema.
Star schema
A star schema consists of two types of tables:
• fact table
• dimension tables
Each dimension in a star schema is represented
by only one dimension table.
Each dimension table contains a set of
attributes.
Star schema: Components
Sales Fact Table
• Keys: time_key, item_key, branch_key, location_key
• Measures: units_sold, dollars_sold, avg_sales
Dimension tables
• time: time_key, day, day_of_the_week, month, quarter, year
• item: item_key, item_name, brand, type, supplier_type
• branch: branch_key, branch_name, branch_type
• location: location_key, street, city, state_or_province, country
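A reduced version of this star schema can be sketched with Python's built-in sqlite3 module. Only two of the four dimensions and a subset of their attributes are used, and all rows are hypothetical sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables (a subset of the attributes of time and item).
    CREATE TABLE time (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT,
                       quarter TEXT, year INTEGER);
    CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT,
                       brand TEXT, type TEXT);
    -- Fact table: one foreign key per dimension plus the measures.
    CREATE TABLE sales (
        time_key INTEGER REFERENCES time(time_key),
        item_key INTEGER REFERENCES item(item_key),
        units_sold INTEGER,
        dollars_sold REAL
    );
    INSERT INTO time VALUES (1, '2001-01-01', 'Jan', 'Q1', 2001),
                            (2, '2001-04-01', 'Apr', 'Q2', 2001);
    INSERT INTO item VALUES (1, 'milk', 'BrandA', 'dairy'),
                            (2, 'cream', 'BrandB', 'dairy');
    INSERT INTO sales VALUES (1, 1, 4, 10.0), (1, 2, 1, 3.0), (2, 1, 2, 5.0);
""")

# A typical star query: join the fact table to each dimension it needs,
# group by dimension attributes, and aggregate a measure.
result = conn.execute("""
    SELECT t.quarter, i.item_name, SUM(s.dollars_sold)
    FROM sales AS s
    JOIN time AS t ON t.time_key = s.time_key
    JOIN item AS i ON i.item_key = s.item_key
    GROUP BY t.quarter, i.item_name
    ORDER BY t.quarter, i.item_name
""").fetchall()
print(result)  # [('Q1', 'cream', 3.0), ('Q1', 'milk', 10.0), ('Q2', 'milk', 5.0)]
```

Every dimension is exactly one join away from the fact table, which is what gives the schema its star shape.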
Snowflake schema
A snowflake schema is an expanded version of a
star schema in which dimension tables are
normalized into several related tables.
Advantages
• Small saving in storage space
• Normalized structures are easier to update and
maintain
Disadvantages
• The schema is less intuitive
• Browsing through the content is more difficult
• Query performance is degraded because of additional
joins.
Snowflake schema: Example
Sales Fact Table
• Keys: time_key, item_key, branch_key, location_key
• Measures: units_sold, dollars_sold, avg_sales
Dimension tables
• time: time_key, day, day_of_the_week, month, quarter, year
• item: item_key, item_name, brand, type, supplier_key
• supplier: supplier_key, supplier_type
• branch: branch_key, branch_name, branch_type
• location: location_key, street, city_key
• city: city_key, city, province_or_street, country
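The cost of the additional joins can be sketched with sqlite3 by normalizing the item dimension into `item` and `supplier`. The tables are reduced to the columns needed for the query, and all rows are hypothetical sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The item dimension is normalized: supplier details move to their own table.
    CREATE TABLE supplier (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
    CREATE TABLE item (
        item_key INTEGER PRIMARY KEY,
        item_name TEXT,
        supplier_key INTEGER REFERENCES supplier(supplier_key)
    );
    CREATE TABLE sales (item_key INTEGER REFERENCES item(item_key), dollars_sold REAL);
    INSERT INTO supplier VALUES (1, 'wholesale'), (2, 'local');
    INSERT INTO item VALUES (1, 'milk', 1), (2, 'cream', 2), (3, 'butter', 1);
    INSERT INTO sales VALUES (1, 10.0), (2, 3.0), (3, 7.0);
""")

# Reaching supplier_type now takes two joins (sales -> item -> supplier).
# With supplier_type stored directly in item, one join would be enough.
by_supplier_type = conn.execute("""
    SELECT sp.supplier_type, SUM(s.dollars_sold)
    FROM sales AS s
    JOIN item AS i ON i.item_key = s.item_key
    JOIN supplier AS sp ON sp.supplier_key = i.supplier_key
    GROUP BY sp.supplier_type
    ORDER BY sp.supplier_type
""").fetchall()
print(by_supplier_type)  # [('local', 3.0), ('wholesale', 17.0)]
```

In exchange, each supplier type is stored once in `supplier` instead of being repeated on every item row, which is the storage and maintenance advantage listed for this schema.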
Fact Constellation Schema
A fact constellation has multiple fact tables. It is
also known as a galaxy schema.
The following diagram shows two fact tables,
namely sales and shipping.
[Figure: fact constellation schema; only part of the diagram survives]
• time: time_key, day, day_of_the_week, month, quarter, year
• item: item_key, item_name, brand, type, supplier_type
• Sales Fact Table: time_key, item_key, branch_key
• Shipping Fact Table: time_key, item_key, shipper_key, from_location
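The idea of several fact tables sharing dimension tables can be sketched with sqlite3. The tables are reduced to a single shared dimension, the measure `units_shipped` is a hypothetical name, and all rows are sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One shared dimension table.
    CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT);
    -- Two fact tables that both reference it.
    CREATE TABLE sales (item_key INTEGER REFERENCES item(item_key),
                        units_sold INTEGER);
    CREATE TABLE shipping (item_key INTEGER REFERENCES item(item_key),
                           units_shipped INTEGER);
    INSERT INTO item VALUES (1, 'milk'), (2, 'cream');
    INSERT INTO sales VALUES (1, 4), (1, 2), (2, 1);
    INSERT INTO shipping VALUES (1, 5), (2, 3);
""")

# Aggregate each fact table separately, then line the totals up
# on the shared item dimension.
comparison = conn.execute("""
    SELECT i.item_name,
           (SELECT SUM(units_sold) FROM sales WHERE item_key = i.item_key),
           (SELECT SUM(units_shipped) FROM shipping WHERE item_key = i.item_key)
    FROM item AS i
    ORDER BY i.item_name
""").fetchall()
print(comparison)  # [('cream', 1, 3), ('milk', 6, 5)]
```

Each fact table is aggregated on its own before the totals are combined. Joining the two fact tables to each other directly would pair every sales row with every shipping row of the same item and inflate the sums.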
The Complete Decision Support System
[Figure: Operational DBs -> extract, transform, load, refresh -> Data Warehouse and Data Marts -> serve (e.g., ROLAP) -> Query/Reporting and Data Mining]