You are on page 1of 40

Introduction to business

intelligence and data


analysis

IT300

1
What is Business intelligence ?
 Zeng et al. (2006) define BI as the process of
collection, treatment and diffusion of information
that has an objective, the reduction of uncertainty
in the making of all strategic decisions.

 Stackowiak et al. (2007) define BI as the process of


taking large amounts of data, analyzing that data,
and presenting a high-level set of reports that
condense the essence of that data into the basis
of business actions, enabling management to
make fundamental daily business decisions.
2
What is Business intelligence ?
 To learn from the past and forecast the future,
many companies are adopting BI tools and
systems.

 In such rough and competitive environment,


strategic decision making is extremely complex
and often requires the consideration of several
objectives while satisfying hard constraints.
As a result, the main concern of the managers is to
deliver sophisticated solutions strategies in an
attempt to reach the predefined objectives and
fulfill all system constraints
3
Business intelligence

 The institutions and firms operate in an open


system (that is impacted by external variables)

 There is a big need to make business analysis and


strategic decision making

 Is to find a match between the opportunities in


the environment and the strengths and
weaknesses of the firm
4
Decision-making support systems

 Decision making support systems are


information systems (SI) which are designed
to interactively support all phases of an end-
user’s decision making process in
organizations.
 Many organizations are turning to decision
support systems to improve decision
making.
 Turning a challenge to a learning curve
5
Decision Making Process
1. Define the problem: the manager must identify the
problem.
2. Establish decision criteria and goals : obtaining
necessary information and data.
3. Analyze Data : Formulate a model between goals and
the important variables (physical models as a scale
model of a building):
The approaches to decision making include:
 the use of models,
 the use quantitative methods,
 the analysis of trade-offs,
 establishing priorities,
 the systems approach 6
Decision Making Process
Models are often a key tool used by all decision makers.
A model is an abstraction of reality, a simplified
representation of something. For example, a child’s toy
car is a model of a real automobile.
4. Identify and evaluate various alternatives or
solutions.
5. Select the best alternative (decision theory : an
analytical approach to select the best alternative,
it is closely related to the field of game theory).
6. Implement your decision.

7
Decision Making process
Define the
1 Problem

Obtaining
2 Data
Decision making

Analyze Data
3
Create
4 alternatives

Select the
best
5 alternative
Implement
the decision
6 alternative 8
Data analysis
Information processing is the analysis of a large
quantity of data or other forms of information to
support decision making and to discover
knowledge in data.
This is indeed the biggest challenge posed by big
and often unstructured data: how to analyze it in a
useful way.
Objectives of Data analysis:
• Increase the effectiveness of the manager’s
decision making process,
• Support the manager in the decision making
process but not replace it,
• And improve the directions of the decision
9
making.
Evolution of Database Technology
1960s :
• Data collection, database creation, IMS and
network DBMS
1970s :
• Relational data model, relational DBMS
implementation
1980s:
• RDBMS, advanced data models and
application-oriented DBMS
1990s—2000s:
• Data mining, data warehousing, multimedia
databases and web databases 10
Origins of Data Warehouses
 Database developers understood that their
software was required for both transactional and
analytical processing.
 However, operational and analytical data are
separate with different requirements and different
user communities.
 Once these differences were understood, new data
bases were created specifically for analysis use.

11
Origins of Data Warehouses
 Operational processing (transactional processing)
captures, stores and manipulates data to support
daily operations.
 Information processing is the analysis of data or
other forms of information to support decision
making.
 Data warehouse can consolidate and integrate
information from many internal and external
sources and arrange it in a meaningful format for
making business decisions.

12
What is a Data Warehouse ?
 According to Inmon’s (father of data warehousing) :
It is a collection of integrated, subject-oriented,
databases designed to support the DSS function,
where each unit of data is non-volatile and relevant
to some moment in time.
 Or a DW is : A subject-oriented, integrated, time-
variant, non-updatable collection of data used in
support of management decision-making processes:
 Subject-oriented: e.g. customers, patients, products
 Integrated: Consistent naming conventions, formats,
encoding structures; from multiple data sources
 Time-variant: Can study trends and changes
 Non-updatable: Read-only, periodically refreshed 13
Need for Data Warehousing

 A DW allows its users to extract required data, for


business analysis and strategic decision making
 DW is a process, not a product
 DW is an architecture : is the way of organizing data

14
Database, Data warehouse and
Data set
 DB : contains tables, rows refer to records and
columns to fields. Most DBs are relational DBs
(relating tables to reduce redundancy & improve
DB performance via the normalization process)
 DW : is a type of DB that has been denormalized
& archived.
 Denormalization is the process of combining
some tables into a single table. This may
introduce duplicate data, but will reduce the
number of joins a query has to process.
 Data set : is a sub-set of a DW or a DB. It is usually
denormalized so that only one table is used. 15
How Do Data Warehouses Differ
From Operational Systems?
 Goals
 Structure
 Size
 Performance optimization
 Technologies used

16
Need to separate operational and
information systems
Three primary factors:
 A data warehouse centralizes data that are
scattered throughout disparate operational
systems and makes them available for DM.
 A well-designed data warehouse adds value to
data by improving their quality and consistency.
 A separate data warehouse eliminates much of the
contention for resources that results when
information applications are mixed with
operational processing.

17
Comparison of Database Types

Data warehouse Operational system


Subject oriented Transaction oriented
Large (hundreds of GB up to Small (MB up to several GB)
several TB)
Historical data Current data
Denormalized table structure Normalized table structure
(few tables, many columns per (many tables, few columns per
table) table)
Batch updates Continuous updates
Usually very complex queries Simple to complex queries

18
From the Data Warehouse to Data
Marts
 A data mart contains only those data that are
specific to a particular group. For example, the
marketing data mart may contain only data
related to items, customers, and sales.
 Data marts are confined to subjects.
 Data marts are small in size.
 Data marts are customized by department.

19
How Data Warehousing works

20
How Data Warehousing works
Extraction Transformation Loading–ETL tools
Extract Transform Load
& Clean

Sources DSA DW

DSA: A Data Staging Area is a temporary location where data from


21
source systems are copied.
Example of analytical questions: Historical
analysis of the number of passengers

•How many passengers are frequent flyers?


• How much time do passengers spend on
average in the different zones?
•How much time do passengers spend on average
before entering different zones? How is the
distribution of time spent per passenger in a
given zone?
•Which day of week are the zones used the most?
•When is there a risk of bottlenecks in specific
zones?

22
ER Model vs. Multidimensional
Model
 Why don’t we use the entity-relationship (ER)
model in data warehousing?
 ER model: a data model for general purposes
– All types of data are equal, difficult to identify the
data that is:
• important for business analysis
• No difference between: What is important ? What
just describes the important?
• Normalized databases (many details that can affect
privacy and security)
– Hard to overview a large ER diagram (e.g., over 100
entities/relations for an enterprise)
23
ER Model vs. Multidimensional
Model
 Traditional DBs generally deal with two-dimensional
data. However, querying performance in a multi-
dimensional data storage model is more efficient.
 More built in “meaning”
– What is important
– What describes the important
– What we want to optimize
 Recognized by OLAP/BI tools : Tools that offer powerful
query facilities based on Multi-Dimensional (MD) design

24
Multidimensional Model
 Data is divided into: Facts and Dimensions
 A fact is the important entity: exp a sale
 Facts have measures that can be aggregated: sales
price
 Dimensions describe facts
 Facts “live” in a MD cube
 Goal for dimensional modeling:
– Surround facts with as much context (dimensions) as
possible
– Hint: redundancy may be ok (in well-chosen places)
– But you should not try to model all relationships in the
data (unlike E/R and OO modeling!) 25
Dimension
 Dimensions are the core of MD databases
 Dimensions are used for
 Selection of data
 Grouping of data at the right level of detail
 Dimensions consist of dimension values
 Product dimension has values ”milk”, ”cream”, …
 Time dimension has values ”1/1/2001”, ”2/1/2001”,…
 Dimension values may have an ordering
 Used for comparing cube data across values
 Especially used for Time dimension

26
Dimension
 Dimensions have hierarchies with levels
 Typically 3-5 levels (of detail)
 Dimension values are organized in a tree structure
 Product: Product->Type->Category
 Store: Store->Area->City->County
 Time: Day->Month->Quarter->Year
 Dimensions have a bottom level and a top level
 Levels may have attributes
 Simple, non-hierarchical information
 Day has Workday as attribute
 Dimensions should contain much information
 Time dimension may contain holiday, season, events,…
 Good dimensions have 50-100 or more attributes/levels
27
Facts
 Facts represent the subject of the desired analysis
• The important in the business that should be
analyzed
 A fact is identified via its dimension values
• A fact is a non-empty cell
 Generally, a fact should:
• Be attached to exactly one dimension value in
each dimension
• Only be attached to dimension values in the
bottom levels

28
Measures
 Measures represent the fact property that the
users want to study and optimize
 Example: total sales price
 A measure has two components
 Numerical value: (exp: sales price)
 Aggregation formula (exp: SUM): used for
aggregating/combining a number of measure values
into one

29
Multidimensional Model
Example: sales of supermarkets
• Facts and measures
– Each sales record is a fact, and its sales value is a
measure
• Dimensions
– Group correlated attributes into the same
dimension
– Each sales record is associated with its values of
Product, store, Time

30
Granularity: Dimensionality Hierarchy
 Granularity of facts is important
 Level of detail
 Given by combination of bottom levels
 A dimensional hierarchy defines mappings from a set of
lower-level concepts to higher level concepts.
Country
Year

2D data
Region Season
Quarter

City
Month Week
Area

31
ZipCode Day
Schema Design
 A schema is a logical description of the entire
database.
 Much like a database, a data warehouse also
requires to maintain a schema.
 A database uses relational model, while a data
warehouse uses Star, Snowflake, and Fact
Constellation schema.

32
Star schema
 A star schema consists of two types of tables:
• fact table
• dimension tables
 Each dimension in a star schema is represented
with only one-dimension table.
 This dimension table contains the set of
attributes.

33
Star schema: Components

time
time_key item
day item_key
day_of_the_week Sales Fact Table item_name
month brand
quarter time_key type
year supplier_type
item_key
branch_key
branch location
location_key
branch_key location_key
branch_name units_sold street
branch_type city
dollars_sold state_or_province
country
avg_sales
Measures
34
Snowflake schema
 Snowflake schema is an expanded version of a
star schema in which dimension tables are
normalized into several related tables.
 Advantages
• Small saving in storage space
• Normalized structures are easier to update and
maintain
 Disadvantages
• A schema that is less intuitive
• The ability to browse through the content is difficult
• A degraded query performance because of additional
joins.
35
Snowflake schema : Example

time
item
time_key
day item_key supplier
day_of_the_week Sales Fact Table item_name
brand supplier_key
month
time_key type supplier_type
quarter
year item_key supplier_key

branch_key
branch location
location_key
location_key
branch_key
units_sold street
branch_name
city_key city
branch_type
dollars_sold
city_key
avg_sales city
province_or_street
Measures country
36
Fact Constellation Schema
 A fact constellation has multiple fact tables. It is
also known as galaxy schema.
 The following diagram shows two fact tables,
namely sales and shipping.

37
Fact Constellation Schema
time
time_key item Shipping Fact Table
day item_key
day_of_the_week Sales Fact Table item_name time_key
month brand
quarter time_key type item_key
year supplier_type shipper_key
item_key
branch_key from_location

branch location_key location to_location


branch_key location_key dollars_cost
branch_name
units_sold
street
branch_type dollars_sold city units_shipped
province_or_state
avg_sales country shipper
Measures shipper_key
shipper_name
location_key
38
shipper_type
38
On-Line Analytical Processing (OLAP)
 Original definition : The dynamic synthesis,
analysis, and consolidation of large volumes of
multi-dimensional data, [Codd, 1993].

 Describes a technology that is designed to


optimize the storing and querying of large
volumes of multi-dimensional data that is
aggregated (summarized) to various levels of
detail to support the analysis of this data.

39
The Complete Decision Support
System

Information Data Warehouse OLAP Servers Clients


Sources Server (Tier 2) (Tier 3)
(Tier 1)
e.g., MOLAP
OLAP
Semi-structured
Sources
serve

extract Query/Reporting
transform Data
serve
load Warehouse
refresh e.g., ROLAP
.
Data Mining
Operational serve
DB’s

Data Marts
40

You might also like