Module 4 - Diagnostics Analytics

BUSINESS INTELLIGENCE
PROGRAM
Phuong Thao
Profile
Thao is currently a Business Intelligence / Business Analytics trainer. She was previously Business
Analytics Manager at SIFT Analytics Group. She specializes in analytics project management and regularly
uses Business Intelligence and Business Analytics tools to review and understand customer data,
acquire subject matter expertise, communicate model insights, and present business performance
analysis to C-level managers in the finance and retail industries. Thao is also a trainer and speaker
at analytics events and training programs run by SIFT and its partners. With more than half a decade of
experience, Thao has used a broad range of technologies, analytical techniques, and
methodologies to analyze data from various sources, provide insights,
and support business decision making.
@PhuongThaoAnalytics #PhuongThaoAnalytics
1 Business Intelligence Introduction
2 Business Statistics
3 Descriptive Analytics
4 Diagnostics Analytics
5 Data Visualization & Storytelling
6 Business Intelligence Capstone

4. DIAGNOSTICS ANALYTICS lessons
01 Diagnostics Analytics Fundamentals
❑ Introduction
❑ Types of Diagnostics Analytics
02 Qualitative Diagnostics Analytics
❑ Introduction
❑ Understanding the Big Picture
➢ Logic Trees
➢ Question‐Based Problem Solving
➢ Cleaving frames
03 Quantitative Diagnostics Analytics
❑ Heuristics
❑ Big guns of BI
Lesson 1: Diagnostics Analytics Fundamentals
1. Introduction
There is an orientation or attitude that is found in the best problem solvers that reflects an active openness
to new ideas and data, and a suspicion of standard or conventional answers. Tetlock describes it well in
profiling the super-forecasters.
Start your analysis with summary statistics, rules of thumb (Descriptive Analytics), and heuristics to get a
feel for the data and the solution space. Before you dive into giant data sets, machine learning, Monte
Carlo simulations, or other big guns, we believe it is imperative to explore the data, learn its quality,
understand the magnitudes and direction of key relationships, and assess whether you are trying to
understand drivers (Diagnostics Analytics).
Sophisticated analysis has its place, but it is our experience that one‐day answers, supported by good
logic and simple heuristics, are often sufficient to close the book on many problems, allowing you to
move on to the more difficult ones.
WHAT IS DIAGNOSTICS ANALYTICS?
Identifying FACTORS & CAUSES from historic data
WHAT IS THE USE OF DIAGNOSTICS ANALYTICS?
1. WHY is it happening & what are the RELATIONSHIPS?
2. WHY did it happen & WHERE should we look?
TYPES OF DIAGNOSTICS ANALYTICS
1. Qualitative
2. Quantitative
2. Types of Diagnostics Analytics
Qualitative
Understanding the Big Picture
➢ Logic Trees
• Factor/Lever/Component
• Inductive Logic Tree
• Deductive Logic Tree
• Hypothesis Tree
• Decision Tree
➢ Question‐Based Problem Solving
• Root Cause
o Sherlock Holmes
o 5 Whys
o Fishbone
➢ Cleaving frames
• Price/Volume
• Collaborate/Compete
• Principal/Agent
• Asset/Options
Quantitative
Heuristics
➢ Occam’s Razor
➢ Order of Magnitude cuts
➢ Pareto
➢ Rule of 72
➢ S-curve
➢ Expected Value
➢ Bayesian Thinking
➢ Reasoning by Analogy
➢ Break-even Point
➢ Marginal Analysis
➢ Distribution of outcomes
Big guns of BI
➢ Relationship
➢ What-If, Scenario
➢ Sentiment Analytics
➢ Cohort Analytics
➢ Correlogram Analytics
➢ Multidimensional Analytics
➢ Visual Analytics
Lesson 2: Qualitative Diagnostics Analytics
1. Introduction
The Seven-Steps vs Design Thinking of Analytics Problem Solving
There is an adage that says “a well‐defined problem is a
problem half solved”; it's worth the investment of time
upfront.
The Seven-Steps – (i) Problem Definition
One-day Answers
We find there is great clarity in stating what you know about your problem at any point in the process. It helps bed
down what understandings are emerging, and what unknowns still stand between us and the answers. We call this a
one‐day answer. It conveys our current best analysis of the situation, complications or insightful observations,
and our best guess at the solution, as we iterate between our evolving workplans and our analysis.
One‐day answers help sharpen our hypotheses and make our analysis focused and efficient.
2. Understanding the Big Picture
(ii) Disaggregate – Logic Trees
Types of Logic Trees
Good problem disaggregation is at the heart of the seven-steps process. It helps us understand the
drivers or causes of the situation. At the same time, when we can see all the parts clearly, we can
determine what not to work on: the bits that are either too difficult to change or that don't impact
the problem much. When you get good at cleaving problems apart, insights come quickly.
(ii) Disaggregate – Logic Trees - MECE
It is worth stopping here for a moment to introduce an important principle in logic tree construction, the
concept of MECE. MECE stands for “mutually exclusive, collectively exhaustive”. It's a mouthful, but it is
a really useful concept. A tree that confuses or overlaps some of its branches isn't MECE.
Trees should have branches that are:
• Mutually Exclusive: The branches of the tree don't overlap, or contain partial elements of the same
factor or component. This is a little hard to get your head around, but it means that the core concept
of each trunk or branch of the problem is self‐contained, not spread across several branches.
• Collectively Exhaustive: Taken as a whole, your tree contains all of the elements of the problem, not
just some of them. If you are missing parts, you may very well miss the solution to your problem.
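The two MECE conditions can be checked mechanically once the elements of a problem are listed. The sketch below is only an illustration: the cost items, the branch names, and the helper `check_mece` are hypothetical and not part of the course material.

```python
from itertools import combinations


def check_mece(universe, branches):
    """Return (mutually_exclusive, collectively_exhaustive) for a tree level.

    universe: all elements of the problem.
    branches: mapping of branch name -> elements assigned to that branch.
    """
    sets = [set(items) for items in branches.values()]
    # Mutually exclusive: no element appears in two branches.
    mutually_exclusive = all(a.isdisjoint(b) for a, b in combinations(sets, 2))
    # Collectively exhaustive: the branches together cover every element.
    collectively_exhaustive = set().union(*sets) == set(universe)
    return mutually_exclusive, collectively_exhaustive


# Hypothetical cost items to be split into branches.
cost_items = {"rent", "salaries", "materials", "shipping"}
good_tree = {"fixed": ["rent", "salaries"], "variable": ["materials", "shipping"]}
bad_tree = {"fixed": ["rent", "salaries"], "people": ["salaries"], "variable": ["materials"]}

print(check_mece(cost_items, good_tree))  # (True, True)
print(check_mece(cost_items, bad_tree))   # (False, False)
```

In the second tree, "salaries" sits in two branches (not mutually exclusive) and "shipping" is missing (not collectively exhaustive).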
You can arrange them from left to right, right to left, or top to bottom, whatever makes the
elements easier for you to visualize. There are lots of ways to do it; in fact, we almost always try
two or three alternative disaggregations to discover which yields the most insights.
Better trees have a clearer and more complete logic of relationships linking the parts to each
other, are more comprehensive, and have no overlaps.
Components or factors are just the most obvious elements that make up a problem, like the bricks and
mortar of our earlier brick wall example. You can usually find enough information for a logical first
disaggregation with a small amount of Internet research and a team brainstorming session.
Inductive Logic Trees
Inductive logic trees take their name from the process of inductive reasoning, which works from
specific observations toward general principles. We employ inductive logic trees when we do not yet
know much about the general principles behind the problems we are interested in, but we do have some
data or insights into specific cases. Inductive trees show probabilistic relationships, not causal
ones: just because all the swans you have seen are white does not mean all swans are white (there
are black swans in Australia).
Deductive Logic Trees
Deductive logic trees take their name from the process of logical
deduction. Deductive reasoning is sometimes called top–down
reasoning.
(ii) Disaggregate – Drill-down vs Drill-up
Drilling is time-consuming if you start in the wrong direction.
(ii) Disaggregate – Drill-down vs Drill-up: Practice in Excel
(ii) Disaggregate
In many cases you will actually work on your initial tree both inductively and deductively. You will
have a sense of some of the bigger drivers and general principles, and you will have good case
examples of successful projects in the space you are looking at. So you work both from the trunks of
the tree and from the leaves, slowly and iteratively figuring out which is which.
(ii) Disaggregate - Exercise
In August, Fahasa Bookstore must decide how many of next year’s nature calendars to
order.
• Each calendar costs the bookstore $7.50 and sells for $10.
• After January 1, all unsold calendars will be returned to the publisher for a refund of
$2.50 per calendar.
Fahasa believes that the number of calendars it can sell by January 1 follows some
probability distribution with mean 200. Fahasa believes that ordering to the average
demand, that is, ordering 200 calendars, is a good decision. Is it?
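One way to explore the question is to write the profit for a given order quantity and demand, then average it over a demand distribution. The exercise does not specify the distribution, so the two discrete distributions below are assumptions chosen only because both have mean 200; the function names are hypothetical as well.

```python
def profit(order_qty, demand, unit_cost=7.50, price=10.00, refund=2.50):
    """Profit for one season: sales revenue plus refunds minus purchase cost."""
    sold = min(order_qty, demand)
    unsold = order_qty - sold
    return price * sold + refund * unsold - unit_cost * order_qty


def expected_profit(order_qty, demand_dist):
    """demand_dist maps a demand level to its probability."""
    return sum(p * profit(order_qty, d) for d, p in demand_dist.items())


# Two ASSUMED demand distributions, both with mean 200.
narrow = {100: 0.10, 150: 0.20, 200: 0.40, 250: 0.20, 300: 0.10}
wide = {100: 0.35, 200: 0.30, 300: 0.35}

for name, dist in (("narrow", narrow), ("wide", wide)):
    for qty in (100, 150, 200, 250):
        print(name, qty, round(expected_profit(qty, dist), 2))
```

Under the narrow distribution an order of 200 gives the highest expected profit of the quantities tried; under the wide one an order of 100 does better than 200. Each unsold calendar loses $5 while each missed sale forgoes only $2.50, so whether ordering the mean is a good decision depends on the spread of demand, not just its average.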
Qualitative - Exercise
Understanding the Big Picture

As in the previous Fahasa Bookstore example, Fahasa needs to place an order for next year’s
calendar. We continue to assume that the calendars sell for $10 and customer demand for the
calendars at this price is triangularly distributed with minimum value, most likely value, and
maximum value equal to 100, 175, and 300. However, there are now two other sources of
uncertainty. First, the maximum number of calendars Fahasa’s supplier can supply is uncertain
and is modeled with a triangular distribution. Its parameters are 125 (minimum), 200 (most
likely), and 250 (maximum). Once Fahasa places an order, the supplier will charge $7.50 per
calendar if he can supply the entire Fahasa order. Otherwise, he will charge only $7.25 per
calendar. Second, unsold calendars can no longer be returned to the supplier for a refund.
Instead, Fahasa will put them on sale for $5 apiece after January 1. At that price, Fahasa
believes the demand for leftover calendars is triangularly distributed with parameters 0, 50, and
75. Any calendars still left over, say, after March 1, will be thrown away.
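With three uncertain inputs, a small Monte Carlo simulation is one way to compare order quantities. The sketch below uses Python's `random.triangular(low, high, mode)` and makes several simplifying assumptions that are not stated in the exercise: quantities are treated as continuous, Fahasa pays only for the calendars actually delivered, and the $7.25 price applies to every delivered calendar when the order cannot be filled in full.

```python
import random


def profit(order_qty, demand, supply_cap, sale_demand):
    """Profit for one simulated season (assumptions are listed in the lead-in)."""
    received = min(order_qty, supply_cap)
    unit_cost = 7.50 if supply_cap >= order_qty else 7.25
    sold_full_price = min(received, demand)
    leftover = received - sold_full_price
    sold_on_sale = min(leftover, sale_demand)  # anything still unsold is discarded
    return 10.0 * sold_full_price + 5.0 * sold_on_sale - unit_cost * received


def simulate_mean_profit(order_qty, trials=10_000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        demand = rng.triangular(100, 300, 175)      # low, high, mode
        supply_cap = rng.triangular(125, 250, 200)
        sale_demand = rng.triangular(0, 75, 50)
        total += profit(order_qty, demand, supply_cap, sale_demand)
    return total / trials


for qty in (150, 175, 200, 225):
    print(qty, round(simulate_mean_profit(qty), 2))
```

Looking at the spread of simulated profits, not only the mean, would show how risky each order quantity is.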
2. Understanding the Big Picture - Question‐Based Problem Solving
Root Cause Analysis
PPC stands for pay-per-click, a model of internet marketing in which advertisers pay a fee each time
one of their ads is clicked. Essentially, it's a way of buying visits to your site, rather than
attempting to “earn” those visits organically. Search engine advertising is one of the most popular
forms of PPC.
Sherlock Holmes
Question‐Based Problem Solving
Once you have roughed out the scale and direction of your problem levers by using heuristics, it is
time to dig deeper into the analysis. But not all the analysis you will do requires huge number
crunching. Often you can get to good understanding and preliminary solutions with a straightforward
Sherlock Holmes framework. We have found that the Sherlock Holmes approach of painting a picture of
the problem by asking who, what, where, when, how, and why is a powerful root‐cause tool to quickly
focus problem solving.
5 Whys
Question‐Based Problem Solving
Root cause analysis is a problem‐solving tool that also uses questions in a clever way. The technique
of asking 5 Whys to get to the bottom of a problem was developed at the Toyota Motor Corporation.
What is often termed a fishbone diagram is laid out to visualize contributing factors to a problem.
Fishbone
2. Understanding the Big Picture – Cleaving Frames
This chart shows a toolkit of what we call cleaving frames for different kinds of problem. Good problem
solvers have toolkits like this that act as lenses to visualize potential solutions. They try on one or
more theoretical frames to see which one is likely to be the best fit for the problem at hand. Often
they combine more than one frame to make progress on particular problems.
Price/Volume: One of the key elements of our return on capital tree is the revenue drivers branch that
focuses on product pricing and volume. This frame raises questions about the nature of the competitive game:
• Are there differentiated products or commodities?
• Are there competitive markets or oligopoly markets controlled by a few players?
Each has different dynamics, and good business problem solvers build these into their disaggregation
and research plans. The kinds of elements here often include assumptions about market share, new
product entry, rate of adoption, and price and income elasticities.
Prioritization
Before we start to invest significant time and effort into work planning and analysis, we have to prune our
trees.
Lesson 3: Quantitative Diagnostics Analytics
1. Heuristics
Good problem solvers have a toolkit at their disposal that helps them work efficiently, starting with
heuristics and rules of thumb to understand the direction and magnitudes of relationships, which allows
them to focus attention on the most important issues. They don't jump right into building giant models
until they have a clear understanding of whether and where complex tools are required.
Following a structured problem approach, where we use good analytic techniques to pressure test our
hypotheses and good team processes to limit bias, allows us to avoid torturing the facts.
Heuristics are powerful tools that act as shortcuts in analysis. They help you size the different
elements of the problem to determine the efficient path in further analysis. They can be dangerous
when incorrectly applied, of course.
We use the terms heuristics, rules of thumb, and shortcuts interchangeably.
We focus on how to do powerful analyses quickly and efficiently, starting with heuristics, shortcuts,
and rules of thumb. We illustrate how you can structure and resolve many analytic issues you face with
straightforward heuristics well before any use of the complex “big guns” analysis.
(i) Occam’s Razor
The oldest of these is definitely Occam's Razor—favor the simplest solution that fits the facts—which
originated in the fourteenth century. It tells us to select the hypothesis that has the fewest assumptions.
One way of seeing why this makes sense is a simple math example: If you have four assumptions that are
independent of each other, with an 80% separate chance of being correct, the probability that all four
will be correct is just over 40%. With two assumptions and the same probabilities, it is 64%. For many
problems the fewer the assumptions you have the better. Practically speaking, this means avoiding
complex, indirect, or inferential explanations, at least as our starting point. Related to Occam's Razor are
one‐reason decision heuristics, including reasoning by elimination and a fortiori reasoning, where you
eliminate alternatives that are less attractive. The important reminder is not to get committed to a simple
answer with few assumptions when the facts and evidence are pointing to a more nuanced or complex
answer.
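The arithmetic behind the four-assumption example can be reproduced in a couple of lines. This is a minimal sketch that assumes, as the text does, that the assumptions are independent and equally likely to hold; the function name is hypothetical.

```python
def all_correct_probability(p_each, n_assumptions):
    """Probability that n independent assumptions, each true with p_each, all hold."""
    return p_each ** n_assumptions


print(f"{all_correct_probability(0.8, 4):.4f}")  # 0.4096, "just over 40%"
print(f"{all_correct_probability(0.8, 2):.4f}")  # 0.6400
```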
Why we adopt Occam's razor both in day-to-day reasoning and in science:
“The more conditions that have to be met for something to happen, the less likely the theory will seem.”
Suppose you hear a vase break. You walk into the living room and find Jimmy (a little boy) next to the broken vase.
The boy says: “A cat did it.”
But you reply: “We don't have a cat.”
Jimmy replies: “Some cat sneaked in from outside.”
You reply: “All the windows and doors were closed.”
Jimmy says: “The cat sneaked into the house before you closed them.”
You reply: “You are allergic to cats, Jimmy, but I don't see you sneezing.”
So the boy says: “It could have been a Sphynx cat, which has no fur.”
You search the whole house for the cat but cannot find it.
Jimmy says: “Maybe it has already run away.”
This conversation could, in principle, go on forever. Every objection you raise can be countered with
another hypothesis.
Before that happens, you will probably tell Jimmy to stop with these explanations and draw the much simpler
conclusion that the boy broke the vase. The cat explanation depends on too many conditions, which makes it
very unlikely. The more conditions that must be met for something to happen, the less likely the theory is.
If we manage to deduce a simpler version of our theory, we can get rid of all the statistical noise of our
experiments. Simpler models are generally better at predicting new and yet unobserved events.
(ii) Order of magnitude
Order of magnitude analysis is used to prioritize team efforts by estimating the size of different levers. In
business problems, we typically calculate the value of a 10% improvement in price, cost, or volume to
determine which is more important to focus on (assuming, of course, that each is similarly difficult or easy
to change). It applies to analyzing social issues as well. Doing an order of magnitude analysis should
provide a minimum, most likely, and maximum estimate, not simply the maximum.
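As a rough illustration of sizing levers, the sketch below compares the profit gain from a 10% improvement in price, unit cost, and volume. All figures are hypothetical, and the simple profit formula (margin times volume minus fixed cost) is an assumption made for the example.

```python
def profit(price, unit_cost, volume, fixed_cost):
    return (price - unit_cost) * volume - fixed_cost


# Hypothetical baseline business.
base = {"price": 10.0, "unit_cost": 6.0, "volume": 1000, "fixed_cost": 2000.0}

# Each lever improves one input by 10% and leaves the others unchanged.
levers = {
    "price +10%": {"price": base["price"] * 1.10},
    "unit cost -10%": {"unit_cost": base["unit_cost"] * 0.90},
    "volume +10%": {"volume": base["volume"] * 1.10},
}

base_profit = profit(**base)
gains = {name: profit(**{**base, **change}) - base_profit for name, change in levers.items()}

for name, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {gain:+.0f}")
```

With these numbers the price lever is worth about 1,000, unit cost about 600, and volume about 400. Repeating the calculation with minimum, most likely, and maximum inputs would give the range the text recommends rather than a single point.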
(iii) Pareto
Efficient analysis is often helped by the 80:20 rule, sometimes called the Pareto Principle after the Italian
economist Vilfredo Pareto, who first noticed this relationship. It describes the common phenomenon that
80% of outcomes come from 20% of causes. If you plot percent of consumption of a product on the Y‐axis
and percent of consumers on the X‐axis you will often see that 20% of consumers account for 80% of sales
of a product or service. The point of doing 80:20 analyses is again to focus your analytical effort on the
most important factors. Many business and social environments feature market structures where 80:20 or
something close to that ratio is the norm, so it's a handy device to use. The 80:20 rule may also apply in
complex system settings.
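A quick check of whether a data set follows the 80:20 pattern is to rank the items and compute the share contributed by the top 20%. The sales figures below are hypothetical and were chosen so the pattern holds exactly; `top_share` is an illustrative helper.

```python
def top_share(values, top_fraction=0.2):
    """Share of the total contributed by the largest `top_fraction` of items."""
    ranked = sorted(values, reverse=True)
    n_top = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:n_top]) / sum(ranked)


# Hypothetical sales per customer.
sales_by_customer = [500, 300, 50, 40, 30, 25, 20, 15, 10, 10]

print(f"{top_share(sales_by_customer):.0%}")  # 80%
```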
(iv) Compound growth
Compound growth is key to understanding how wealth builds, how enterprises scale quickly, and how
some populations grow. Warren Buffett said: “My wealth has come from a combination of living in America,
some lucky genes, and compound interest.” A really quick way to estimate compounding effects is to use
the Rule of 72. The rule of 72 allows you to estimate how long it takes for an amount to double given its
growth rate by dividing 72 by the rate of growth. So, if the growth rate is 5% an amount will double in about
14 years (72/5 = 14.4 years). If the growth rate is 15%, doubling occurs in four to five years.
Where do errors occur with the Rule of 72? When there is a change in the growth rate, which of course is
often the case over longer periods. This makes sense, as few things continue to compound forever (try the
old trick of putting a grain of rice on the first square of a chessboard and doubling the number on each
successive square).
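The Rule of 72 can be compared with the exact doubling time for a constant growth rate, which is ln(2) / ln(1 + r). The sketch below assumes annual compounding at a fixed rate; the function names are hypothetical.

```python
import math


def doubling_time_rule_of_72(rate_percent):
    return 72 / rate_percent


def doubling_time_exact(rate_percent):
    """Exact doubling time under constant annual compounding."""
    return math.log(2) / math.log(1 + rate_percent / 100)


for rate in (5, 15):
    approx = doubling_time_rule_of_72(rate)
    exact = doubling_time_exact(rate)
    print(f"{rate}%: rule of 72 = {approx:.1f} years, exact = {exact:.1f} years")
```

For the two rates used in the text, the shortcut stays within a fraction of a year of the exact figure.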
(v) S‐curve
A useful heuristic if you are involved in estimating the adoption rate for a new innovation is the S‐curve,
which shows a common pattern of sales with a new product or a new market. S‐curves are drawn with the
percent of full adoption potential on the Y‐axis and the years since adoption on the X‐axis. The shape of
the S will vary a lot depending on the reference class you select and particular reasons for faster or slower
take up in your case. Charles led a successful start‐up in the early days of Internet adoption. At that time,
many forecasters overestimated the impact of Internet penetration in the short term (think of Webvan or
Pets.com), but underestimated adoption in the longer term of 10 to 15 years. It looks in hindsight very
clearly to be a classic S‐curve. In 1995 fewer than 10% of Americans had access to the Internet. By 2014 it
reached 87%. The S‐curve can take many specific profiles. Like any heuristic, you don't want to apply this
rigidly or blindly, but rather use it as a frame for scoping a problem. The challenge is to get behind
sweeping statements like “the world has speeded up,” to understand why a particular technology will
have adoption at a certain rate.
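One common way to sketch an S-curve is the logistic function. The text does not prescribe a formula, so the logistic form and the parameter values below (midpoint year, steepness, ceiling) are assumptions for illustration only.

```python
import math


def adoption_share(years, midpoint_year, steepness, ceiling=1.0):
    """Logistic S-curve: share of full adoption potential reached after `years`."""
    return ceiling / (1 + math.exp(-steepness * (years - midpoint_year)))


# Hypothetical product: half of its potential reached after 10 years.
for year in range(0, 21, 5):
    share = adoption_share(year, midpoint_year=10, steepness=0.5)
    print(f"year {year:2d}: {share:.0%}")
```

Changing the midpoint and steepness to match a chosen reference class shifts the curve earlier or later and makes the take-up faster or slower.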
(vi) Expected value
Expected value is simply the value of an outcome multiplied by its probability of occurring. That is
called a single point expected value, but it is usually more useful (depending on the shape of the
distribution) to sum, over all possible outcomes, each outcome's value multiplied by its probability.
Expected value is a powerful first‐cut analytic tool to set priorities and reach conclusions on whether to
take a bet in an uncertain environment.
But be careful: Single point expected value calculations are most useful when the underlying distribution
is normal rather than skewed or long tailed. You check that by looking at the range, and whether the
median and mean of the distribution are very different from each other.
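The summed form of expected value is a one-line calculation. The bet below is hypothetical, as is the helper name.

```python
def expected_value(outcomes):
    """outcomes: (probability, value) pairs whose probabilities sum to 1."""
    outcomes = list(outcomes)
    total_probability = sum(p for p, _ in outcomes)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * value for p, value in outcomes)


# Hypothetical bet: likely gain, possible break-even, small chance of a large loss.
bet = [(0.6, 100_000), (0.3, 0), (0.1, -250_000)]

print(round(expected_value(bet)))  # 35000
```

The expected value here is positive, but the 10% chance of a large loss is exactly the kind of long tail the caution above is about.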
(vii) Bayesian thinking
Bayesian thinking is really about conditional probability, which is the probability of an event given
another event took place which also has a probability, called a prior probability.
As a simple example, look at the probability of it raining given that it is cloudy (the prior condition),
versus the probability of it raining if it is currently sunny. Rain can happen in either case, but is more likely
when the prior condition is cloudy. Bayesian analysis can be challenging to employ formally (as a
calculation), because it is difficult to precisely estimate prior probabilities.
But we often use Bayesian thinking when we think conditional probabilities are at work in the problem.
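The rain example can be made concrete with Bayes' rule. The three input probabilities below are invented for illustration, which also shows why the formal calculation is hard in practice: the result is only as good as those estimates.

```python
def posterior(prior, likelihood, likelihood_if_not):
    """P(event | evidence) from P(event), P(evidence | event), P(evidence | not event)."""
    evidence = likelihood * prior + likelihood_if_not * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical inputs.
p_rain = 0.20               # chance of rain on any given day
p_cloudy_given_rain = 0.90  # chance the sky was cloudy on a day when it rained
p_cloudy_given_dry = 0.30   # chance the sky was cloudy on a day when it stayed dry

p_rain_given_cloudy = posterior(p_rain, p_cloudy_given_rain, p_cloudy_given_dry)
p_rain_given_sunny = posterior(p_rain, 1 - p_cloudy_given_rain, 1 - p_cloudy_given_dry)

print(f"P(rain | cloudy) = {p_rain_given_cloudy:.1%}")
print(f"P(rain | sunny)  = {p_rain_given_sunny:.1%}")
```

With these inputs rain is far more likely given clouds than given sun, which matches the intuition in the text.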
(viii) Reasoning by analogy
Reasoning by analogy is an important heuristic for quick problem solving. An analogy is when you have
seen a particular problem structure and solution before that you think may apply to your current problem.
Analogies are powerful when you have the right reference class (that is, have correctly identified the
structure type), but dangerous when you don’t. To check this, we typically line up all the assumptions
that underpin a reference class and test the current case for fit with each.
(ix) Break‐even point
Break‐even point. Every start‐up company Charles and Rob see likes to talk about their cash runway: the
months before cash runs out and they need a new equity infusion. Not enough of them really know their
break‐even point, the level of sales where revenue covers cash costs. It's a simple bit of arithmetic to
calculate, but requires knowledge of marginal and fixed costs, and particularly how these change with
increased sales volume. The break‐even point in units equals fixed costs / (unit price − unit variable
cost); multiplying that by the unit price gives the break‐even point in sales dollars. Typically, the unit
price is known. You can fairly quickly calculate the costs associated with each sale, the variable costs.
The tricky part is how fixed costs will behave as you scale a business. You may face what are called
step‐fixed costs, where to double volume involves significant investment in machinery, IT infrastructure,
or sales channels.
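The break-even arithmetic, fixed costs divided by the contribution per unit, can be sketched as follows. The cost and price figures are hypothetical, and the function assumes fixed costs stay constant over the volume range (no step-fixed costs).

```python
def break_even_units(fixed_costs, unit_price, unit_variable_cost):
    """Units needed for revenue to cover fixed plus variable costs."""
    contribution_per_unit = unit_price - unit_variable_cost
    if contribution_per_unit <= 0:
        raise ValueError("unit price must exceed unit variable cost")
    return fixed_costs / contribution_per_unit


# Hypothetical start-up figures.
fixed_costs = 50_000.0
unit_price = 25.0
unit_variable_cost = 15.0

units = break_even_units(fixed_costs, unit_price, unit_variable_cost)
print(units)               # 5000.0 units
print(units * unit_price)  # 125000.0 in sales dollars
```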
(x) Marginal analysis
Marginal analysis is a related concept that is useful when you are thinking about the economics of
producing more, consuming more, or investing more in an environment of scarce resources. Rather than
just looking at the total costs and benefits, marginal analysis involves examining the cost or benefit of the
next unit. In production problems with fixed costs of
machinery and plant, marginal costs (again, the cost of producing one more unit) often fall very
quickly—favoring more production—up to the point at which incremental machinery is needed. We add
units until the marginal benefit of a unit sale is equal to the marginal cost.
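The stopping rule, add units while the next unit's benefit covers its cost, can be sketched with a simple loop. The marginal cost schedule is hypothetical: it falls at first and rises once capacity gets tight. The loop stops at the first unit whose marginal cost exceeds the marginal benefit, which assumes costs keep rising after that point.

```python
def units_to_produce(marginal_benefit, marginal_costs):
    """Count units produced before marginal cost first exceeds marginal benefit."""
    quantity = 0
    for cost in marginal_costs:
        if cost > marginal_benefit:
            break
        quantity += 1
    return quantity


# Hypothetical cost of each successive unit; each unit sells for 10.
marginal_costs = [6, 5, 4, 4, 5, 7, 9, 11, 14]

print(units_to_produce(10, marginal_costs))  # 7
```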
Marginal cost is the increase in cost (∆C) when output increases by one unit (∆Y).
(xi) Distribution of outcomes
As a final heuristic, consider the distribution of outcomes.

Descriptive Analytics
2. Descriptive Introduction – HR Analytics
The personnel department of MTP, a large communications company, is reconsidering its hiring policy. Each
applicant for a job must take a standard exam, and the hire or no-hire decision depends at least in part on
the result of the exam. The scores of all applicants have been examined closely. They are approximately
normally distributed with mean 525 and standard deviation 55.
The current hiring policy occurs in two phases. The first phase separates all applicants into three categories:
automatic accepts, automatic rejects, and maybes. The automatic accepts are those whose test scores
are 600 or above. The automatic rejects are those whose test scores are 425 or below. All other applicants
(the maybes) are passed on to a second phase where their previous job experience, special talents, and
other factors are used as hiring criteria. The personnel manager at MTP wants to calculate the percentage
of applicants who are automatic accepts or rejects, given the current standards. She also wants to know
how to change the standards to automatically reject 10% of all applicants and automatically accept 15% of
all applicants.
Objective: To determine test scores that can be used to accept or reject job applicants at MTP.
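Both parts of the MTP question follow from the normal distribution with mean 525 and standard deviation 55. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the variable names are illustrative.

```python
from statistics import NormalDist

scores = NormalDist(mu=525, sigma=55)

# Current policy: automatic accept at 600 or above, automatic reject at 425 or below.
share_accepted = 1 - scores.cdf(600)
share_rejected = scores.cdf(425)

# New policy: cutoffs that reject the bottom 10% and accept the top 15%.
reject_cutoff = scores.inv_cdf(0.10)
accept_cutoff = scores.inv_cdf(0.85)

print(f"automatic accepts: {share_accepted:.1%}")
print(f"automatic rejects: {share_rejected:.1%}")
print(f"reject at or below: {reject_cutoff:.0f}")
print(f"accept at or above: {accept_cutoff:.0f}")
```

Under the current standards roughly 9% of applicants are accepted automatically and about 3% are rejected automatically; the new targets correspond to cutoffs of roughly 455 and 582.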
Conclusion
• Start all analytic work with simple summary statistics and heuristics that help you see the size and
shape of your problem levers.
• Don't gather huge data sets or build complicated models before you have done this scoping
reconnaissance with rules of thumb.
• Be careful to know the limitations of heuristics, particularly the potential for reinforcing availability
and confirmation biases.
• Question‐based, rough‐cut problem solving can help you uncover powerful algorithms for
making good decisions and direct your empirical work (when required).
• Root cause and 5‐Whys analytics can help you push through proximate drivers to fundamental
causes in a variety of problems, and not just limited to production and operations environments.
2. Big guns of BI
We introduced a set of first‐cut heuristics and root cause thinking to make the initial analysis phase of
problem solving simpler and faster. We showed that you can often get good‐enough analytic results
quickly for many types of problems with little mathematics or model building.
But what should you do when faced with a complex problem that really does require a robustly
quantified solution? When is it time to call in the big guns—Bayesian statistics, regression analysis, Monte
Carlo simulation, randomized controlled experiments, machine learning, game theory, or
crowd‐sourced solutions?
This is certainly an arsenal of analytic weaponry that many of us would find daunting to consider
employing. Even though your team may not have the expertise to use these more complex problem
solving tools, it is important for the workforce of today to have an understanding of how they can be
applied to challenging problems. In some cases you may need to draw on outside experts; in other
instances you can learn to master these techniques yourself.
(i) Relationship – DATA MODEL
Data Model
(i) Relationship – DATA MODEL (M Language)
When I say “shape” of your data, I am talking about things like
• how many tables you import into Power BI (and Power Pivot for Excel),
• how many columns are in each table,
• which columns are in each of the tables, and
• whether you should unpivot your data so column headings become values in rows in your table.
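To make the last point concrete, the sketch below unpivots a small wide table in plain Python so that the month column headings become values in rows. It only mimics the idea of an unpivot step and is not Power BI or Power Query code; the table contents and the `unpivot` helper are hypothetical.

```python
def unpivot(rows, id_columns, name_column, value_column):
    """Turn every non-id column of each row into its own (name, value) row."""
    long_rows = []
    for row in rows:
        ids = {col: row[col] for col in id_columns}
        for col, value in row.items():
            if col not in id_columns:
                long_rows.append({**ids, name_column: col, value_column: value})
    return long_rows


# Hypothetical wide table: one column per month.
wide = [
    {"Product": "Bikes", "Jan": 120, "Feb": 95},
    {"Product": "Helmets", "Jan": 40, "Feb": 55},
]

for row in unpivot(wide, ["Product"], "Month", "Sales"):
    print(row)
```

The long shape has one row per product and month, so adding a new month adds rows rather than a new column.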
“Not all tables are created equal”

Each relationship in this chain has a performance cost (technical).

Without relationships vs. with relationships
The Star Schema is the Optimal Shape
The generally accepted approach to bringing data into Power Pivot is to bring in your data in a “Star
Schema” format. This is a technical term coming from the Kimball methodology (also known as dimensional
modelling) which describes the logical way data should be structured for optimal reporting performance.
You can see why it is called a Star Schema by looking at the following image – note the star shape.
Why should you lay out your Star Schema using the Collie Methodology?
There are a few reasons you should visually lay out your data using the Collie Methodology.
1. Firstly, it gives you a visual clue to which tables are the “Lookup” tables. They are the ones at the
top, which you have to “look up” to see.
2. Also note that in the old world of traditional Excel, you would probably have
written VLOOKUP formulae to bring the data from these Lookup tables into your single Data table
prior to creating a pivot table.
3. Finally, filters only propagate in 1 direction in the data model, and that is from the 1 side of the
relationship to the many side of the relationship. The Lookup tables are on the 1 side, so
the filters always flow “down hill”. They do not/cannot flow “up hill” automatically (but
note this can be done with advanced DAX and also by editing a relationship in Power Pivot to make
it bi-directional). So this layout gives you a visual clue which direction the filters will flow.
Another design: follow the graphical layout approach developed and recommended by Rob Collie – let’s
call it the Collie Methodology! That is to place the lookup tables at the top of the screen and the data
tables at the bottom (see the image below).
If you compare the 2 images above, you will see they have exactly the same logical relationships (links)
between the tables; it is just that they have a different visual layout.
Creating Visualizations – Matrix Table
(i) Relationship – DATA MODEL (compare Power BI vs Excel)
(i) Relationship – DATA MODEL (10.2018)
Lesson 3: Essential Visualization Principles
1. From Data to Dashboard process
Data Model
One to One Relationships
The first relationship (shown as 1) is a 1 to many relationship between the Customer table (Lookup
table) and the Sales table (Data table). The Customer Socio Economic Data table is joined to the
Customer table via a 1 to 1 relationship (shown as 2 above). If there is a benefit (to the user of
reports) of splitting this Socio Economic data into a separate table then of course you should do so.
If there is no benefit, I recommend you combine all the data from the Customer Socio Economic Data table
into the Customer table using Power Query on load.
Every relationship has a “cost” in that it will have some effect on performance. The performance impact
may not be noticeable for simple models but may become an issue with very complex models.
If you only remember 1 thing from this article, then please let it be
this: Don’t automatically accept the table structure coming from your
source data. You are now a data modeller and you need to make
decisions on the best way to load your data. Your source system is
probably not optimised for reporting (unless it is a reporting datamart)
so please don’t assume that what you have got is what you need.
2. Big guns of BI
Relationships in Power Pivot
One to Many Relationships
The one to many relationship is the foundation of Power Pivot. In the example above (from Adventure Works in
Power BI Desktop), the Customers table is on the 1 side of the relationship and the Sales table is on the many side
of the relationship. These tables are joined using a common field/column called “CustomerKey”. Customer Key
(aka customer number) is a code that uniquely identifies each customer. There can be no duplicates of the
customer key in the customer table. Conversely the customer can purchase as many times as needed and
hence the customer key can appear in the Sales table as many times as necessary. This is where the name
“one to many” comes from – the customer key occurs once and only once in the Customers table but can
appear many times in the Sales table.
Tables on the one side of the relationship are called Dimension tables (I call them Lookup tables) and the tables
on the many side of the relationship are called Fact tables (I call them Data tables).
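The one-to-many rule can be checked before loading data into a model. The sketch below is a minimal illustration in plain Python, not part of Power Pivot: the sample rows and the helper names `duplicate_keys` and `orphan_keys` are hypothetical. It confirms that the key is unique on the "one" side, may repeat on the "many" side, and that every Sales key exists in the lookup table.

```python
from collections import Counter

# Hypothetical sample rows; the real Adventure Works tables are much larger.
customers = [
    {"CustomerKey": 1, "Name": "Ann"},
    {"CustomerKey": 2, "Name": "Bao"},
    {"CustomerKey": 3, "Name": "Chi"},
]
sales = [
    {"CustomerKey": 1, "Amount": 120.0},
    {"CustomerKey": 1, "Amount": 80.0},
    {"CustomerKey": 3, "Amount": 45.5},
]

def duplicate_keys(rows, key):
    """Return the key values that appear more than once in rows."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)

def orphan_keys(data_rows, lookup_rows, key):
    """Return key values used in the data table but missing from the lookup table."""
    known = {row[key] for row in lookup_rows}
    return sorted({row[key] for row in data_rows} - known)

lookup_duplicates = duplicate_keys(customers, "CustomerKey")  # must be empty on the "one" side
sales_repeats = duplicate_keys(sales, "CustomerKey")          # repeats are fine on the "many" side
orphans = orphan_keys(sales, customers, "CustomerKey")        # data keys with no lookup row

print(lookup_duplicates, sales_repeats, orphans)
```

An empty `lookup_duplicates` list means the Customers table is a valid "one" side for this relationship.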
Lookup Tables and Data Tables

In the professional BI world, the Lookup tables would be referred to as Dimension tables and the Data tables would be
referred to as Fact tables.
Lookup Tables
You should have 1 Lookup table for each “object” that you need for reporting purposes. In the sample data above from Adventure Works, these objects are “Customers”, “Products”, “Territories” and “Time” (i.e. “Calendar”). A key feature of a Lookup table is that it contains 1 and only 1 row for every individual item in the table, and as many columns as you need to describe the object. For example, there is only 1 row for each unique Customer in the Customers table. The Customers table has lots of columns describing each customer, such as customer number, customer name and customer address, but there is only 1 row for each customer. Each one is unique based on the customer number, with no duplicates allowed.
Data Tables
The Data table in the Adventure Works data is the Sales table (it is possible to have many data tables in your data model but there is only one in this example). This Data table contains lots of rows (60,000+ in this case) and contains all of the transactional records of sales that occurred over several years. Importantly, the Data table contains one “key” column that matches the “key” in each Lookup table needed for reporting. So in this sample data, there is a date, customer number, product number, and territory key so that the Data table can be logically joined to the Lookup tables.
The ideal shape of Data tables is to have very few columns but to have as many rows as needed to bring in all the data
records. These Data tables normally have lots of rows (sometimes in the 10s or even 100s of millions).
Lesson 4: Dashboard
5. Successful Business Intelligence
Excel Dashboard Formatting
Hide from Client Tools (on the many side): this is a rule of building a dimensional model.
That way the user can’t accidentally drag the field into the pivot table and run into a relationship problem. This is actually a really good thing.
(i) Relationship – Hide in Report View/ Hide from Client tools
Hide all unnecessary columns from your client tools. By doing so, users cannot aggregate columns in the wrong way. You will need to keep any columns by which you want to slice or filter your data. Most often these will be label-type columns. Once you have created measures that carry out all the calculations you need, there is no need to keep the underlying column in sight.
(i) Relationship – DATA MODEL (Merge)
INVITED    ATTENDED
Calvin     Ellie
Ellie      Flo
Irrfan     Irrfan
Jada       Jada
Sig        Keith
Teena      Rosa
Trout      Sig
           Teena

Merge diagrams (INVITED is the LEFT table, ATTENDED is the RIGHT table):
Invited only: Calvin, Trout
Invited & Attended: Ellie, Irrfan, Jada, Sig, Teena
Attended only: Flo, Keith, Rosa
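The INVITED and ATTENDED lists can be used to sketch what each kind of merge keeps. The example below is a minimal Python illustration using sets. The six join kind names follow the options in the Power Query Merge dialog; matching each one to a particular slide diagram is an assumption. Only the matching names are shown, not the full merged rows.

```python
invited = ["Calvin", "Ellie", "Irrfan", "Jada", "Sig", "Teena", "Trout"]        # LEFT table
attended = ["Ellie", "Flo", "Irrfan", "Jada", "Keith", "Rosa", "Sig", "Teena"]  # RIGHT table

left, right = set(invited), set(attended)

# Which names survive each kind of merge.
joins = {
    "Inner": left & right,        # in both tables
    "Left Outer": left,           # everyone from LEFT
    "Right Outer": right,         # everyone from RIGHT
    "Full Outer": left | right,   # everyone from either table
    "Left Anti": left - right,    # in LEFT only (invited, did not attend)
    "Right Anti": right - left,   # in RIGHT only (attended, not invited)
}

for kind, names in joins.items():
    print(f"{kind:<12} {sorted(names)}")
```

The two anti joins are often the most useful diagnostically: they list the no-shows and the uninvited attendees directly.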
(i) Relationship – NAMING CONVENTION
Naming Conventions
It is a good practice to use simple English names for your tables and columns. Avoid using abbreviations
and underscores wherever possible.
Example: Instead of a table called Pricing_History_Sum_Tbl, simply call it Pricing. It is much easier to read
and write your DAX if you keep it simple.
Other best practices include:

•Use English business names.
•No spaces in Table Names (use CamelCase if necessary).
•Spaces are fine in Column Names, because you always refer to these in [square brackets], so it doesn’t
matter.
•Keep the names of your tables and columns short and to the point.
Measures on the other hand should always use Spaces and not
CamelCaseAsTheyAreHarderToReadThatWay. Also avoid abbreviations unless they are common in your
business setting. Some examples of good Measure names are
[Total Sales]
[Total Sales YTD]
[Total Sales All Products]
Using Spaces in Names

•Don’t use spaces in table names.
•Do use spaces in column names.
•Do use spaces in measure names.
If you use spaces in table names you will be forced
to add single quotes around the table name each
time you refer to it in a formula. This makes the code
longer, harder to read and “untidy” (IMO anyway). It
is better to use underscore_characters or
CamelCase instead of spaces (or better still use a
single noun name as mentioned above).
Columns and measures always need to be wrapped
in [Square Brackets] anyway and hence adding
spaces does not make the code any more
complex. Columns and measures are easier to read
if they have spaces.
Naming Tables, Columns And Measures In Power Query
Make naming a top priority when you’re building a dataset.
What’s wrong with this picture? Look at the names:

•The tables and columns have the same names that they had in the data
source, in this case a SQL Server database. Note the table name prefixes
of “Dim” for dimensions and “Fact” for fact tables.
•The column and measure names either don’t have spaces or use
underscores instead of spaces.
•What on earth does the measure name _PxSysF even mean?

Three things to consider when naming a table, column or measure:
•You should use human-readable names rather than any kind of technical naming convention, with
spaces where you would expect to have spaces and all vowels present. For example, that means having
names like [Sales Amount] rather than [Sales_Amount] or [SlsAmt]; similarly, prefixes like “Dim” and “Fact”
might make sense to you but won’t mean anything to your users.
•You should use the correct business terminology, the terminology that your users will know and
understand, rather than just make up some names that seem appropriate. Your users might not
understand what [Total Sales Value] is if the generally accepted term is [Net Sales Amount].
•The names you use should be consistent across all datasets that contain the same data. That means that
if you have a table called Sales in one dataset it should be called Sales in every other dataset that you
build from the same data source, not Transactions, FactSales or something else.
Column naming
Rename all columns that will be visible in the data model using concise, self-describing, meaningful and user-friendly names.
Your data model should be designed for the users who will build and consume reports, not for developers. Even if technical professionals are designing reports, field names are going to show up as labels and titles.
(i) Relationship – Duplicate vs Reference Tab
Duplicate copies a query with all the applied steps of it as a new query; an exact copy.
Reference will create a new query which has only one step: Getting data from the original query.
(i) Relationship – Search Tab
Hide all key fields used in the relationships. Searching for them gives us a nicely shortened list of columns, so we can quickly run through it and hide all of our key columns.
OTHER TIPS
Analyze in Excel
Manual Calculation Mode
Drill-Up, Drill-Down in Table
Creating Visualizations – Running Total Chart
DAX can help us create a Calendar table anchored to our sales data: it grabs the start date and the end date for sales from our Sales table, and then creates a Calendar table for exactly that range.
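The idea of a calendar anchored to the sales range can be sketched outside DAX. The example below is a minimal Python illustration with hypothetical order dates and a hypothetical helper `build_calendar`. It mirrors what a DAX calendar table does (one row per day between the first and last sale) rather than showing the DAX formula itself.

```python
from datetime import date, timedelta

# Hypothetical order dates standing in for the Sales table's date column.
sales_dates = [date(2018, 1, 29), date(2018, 2, 3), date(2018, 1, 30)]

def build_calendar(dates):
    """One row per day from the earliest to the latest date, inclusive."""
    start, end = min(dates), max(dates)
    days = (end - start).days + 1
    return [
        {"Date": d, "Year": d.year, "Month": d.month, "Day": d.day}
        for d in (start + timedelta(days=i) for i in range(days))
    ]

calendar = build_calendar(sales_dates)
print(len(calendar), calendar[0]["Date"], calendar[-1]["Date"])
```

Because the calendar has no gaps, days without sales still get a row, which is what running totals and other time calculations rely on.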
Running Total
Power BI: Modeling -> New Table
Excel: Add as New Query => To Table
Running Total: Tooltips
Running Total - Hierarchy
What is a hierarchy?
So why are hierarchies important in Power BI/Power Pivot? Simple. A data hierarchy allows you to drill up or down on your visual and reveal additional details.
Take a look at hierarchy and drill-through in action:
Create Hierarchy
Running Total – Hierarchy (Grouping Power Table)
Q&A
(i) Relationship
Aggregation Analysis
This methodology simply means calculating a value across a group or
dimension and is commonly used in data analysis. For example, you may want
to aggregate sales data for a salesperson by month - adding all of the sales
closed for each month. Then, you may want to aggregate across dimensions,
such as sales by month per sales territory. Aggregation is often done in
reporting to be able to "slice and dice" information to help managers make
decisions and view performance.
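A small sketch can make the two aggregation levels concrete. The rows and the helper `aggregate` below are hypothetical. The same list of sales is summed first by salesperson and month, then by territory and month.

```python
from collections import defaultdict

# Hypothetical closed sales: (salesperson, territory, month, amount).
sales = [
    ("Lan", "North", "2018-01", 100.0),
    ("Lan", "North", "2018-01", 50.0),
    ("Lan", "North", "2018-02", 70.0),
    ("Minh", "South", "2018-01", 30.0),
    ("Minh", "South", "2018-02", 90.0),
]

def aggregate(rows, key_of):
    """Sum the amount (last field) for each group key."""
    totals = defaultdict(float)
    for row in rows:
        totals[key_of(row)] += row[-1]
    return dict(totals)

by_person_month = aggregate(sales, lambda r: (r[0], r[2]))
by_territory_month = aggregate(sales, lambda r: (r[1], r[2]))

print(by_person_month)
print(by_territory_month)
```

Changing only the grouping key is the "slice and dice" step: the underlying rows stay the same.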
Covariance is too sensitive to the measurement scales.
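To see the scale sensitivity, the sketch below computes covariance and correlation for the same hypothetical heights expressed in metres and in centimetres. The data and function names are illustrative assumptions. Changing the unit multiplies the covariance by 100 but leaves the correlation unchanged.

```python
from math import sqrt

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical data: height in metres and weight in kilograms.
height_m = [1.50, 1.60, 1.70, 1.80, 1.90]
weight_kg = [50.0, 58.0, 66.0, 77.0, 85.0]
height_cm = [h * 100 for h in height_m]  # same heights, different unit

cov_m = covariance(height_m, weight_kg)
cov_cm = covariance(height_cm, weight_kg)
r_m = correlation(height_m, weight_kg)
r_cm = correlation(height_cm, weight_kg)

print(cov_m, cov_cm, r_m, r_cm)
```

This is why correlation, which is unit-free, is usually preferred when comparing the strength of relationships across variables.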
Causation: one event or state is the result of the occurrence of another event or state
Is there a way we can prove that causation does in fact exist?
More often than not, the answer is no. Proving causation is pretty hard.
In the web cancellation example, we used reasoning based on our knowledge of the industry and likely customer behavior to question the assumption of causality.
Again, context turns out to be critical in interpreting relationships and data. It should be the first line of defense in avoiding mistakes.
We need to distinguish between association, correlation and causation:
• Association means that variables x and y are related, but the relationship is not necessarily a correlation or a causation; it can be any kind of relationship.
• With correlation, x and y have a linear relationship with each other, which can be drawn as a straight line on a chart (y = a + b*x); for example, when x changes, y changes too.
• However, a correlation between two variables is not necessarily a causal relationship in which x causes y; for example, x and y may vary together because another variable acts on both of them…
Overfitting trap: if you are struggling to improve your model's accuracy => EDA (spend significant time exploring and analyzing the data).
Remember: the quality of your inputs decides the quality of your output.
1. Variable Identification
2. Univariate Analysis
3. Bi-variate Analysis
4. Missing values treatment
5. Outlier treatment
6. Variable transformation
7. Variable creation
Let’s take a moment to review the correlation. The score ranges from -1 to 1 and indicates if
there is a strong linear relationship — either in a positive or negative direction. So far so good.
However, there are many non-linear relationships that the score simply won’t detect.
If you are a little bit too well educated you know that the correlation matrix is symmetric.
However, relationships in the real world are rarely symmetric. More often, relationships are
asymmetric.
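Both limitations can be shown with a tiny example. In the sketch below (hypothetical data, hand-written `pearson` helper), y = x² is completely determined by x, yet the correlation score is 0 because the relationship is not linear. The score is also the same in both directions, even though x determines y while y does not determine x.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

x = [-3, -2, -1, 0, 1, 2, 3]
y_linear = [2 * v + 1 for v in x]   # straight line
y_squared = [v ** 2 for v in x]     # perfect but non-linear dependence

r_linear = pearson(x, y_linear)
r_squared = pearson(x, y_squared)
r_swapped = pearson(y_squared, x)   # same score with the roles reversed

print(r_linear, r_squared, r_swapped)
```

A scatter plot of x against y_squared would show the U-shape immediately, which is one reason to plot before trusting a correlation matrix.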
Categorical & Categorical: To find the relationship between two categorical variables, we can use the following methods:
• Two-way table: We can start analyzing the relationship by creating a two-way table of count and count%. The rows represent the categories of one variable and the columns represent the categories of the other variable. We show the count or count% of observations available in each combination of row and column categories.
• Stacked Column Chart: This method is more of a visual form of the two-way table.
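A two-way table of count and count% can be sketched as follows. The survey rows and the helper `row_percentages` are hypothetical. Here count% is taken as a percentage of each row total, which is one of several reasonable choices.

```python
from collections import Counter

# Hypothetical survey rows: (gender, preferred product).
rows = [
    ("F", "A"), ("F", "A"), ("F", "B"),
    ("M", "A"), ("M", "B"), ("M", "B"), ("M", "B"),
]

counts = Counter(rows)                       # count per (row, column) combination
row_levels = sorted({r for r, _ in rows})
col_levels = sorted({c for _, c in rows})

def row_percentages(counts, row_levels, col_levels):
    """Convert counts to percentages of each row total."""
    table = {}
    for r in row_levels:
        total = sum(counts[(r, c)] for c in col_levels)
        table[r] = {c: 100.0 * counts[(r, c)] / total for c in col_levels}
    return table

pct = row_percentages(counts, row_levels, col_levels)

for r in row_levels:
    print(r, [counts[(r, c)] for c in col_levels],
          [round(pct[r][c], 1) for c in col_levels])
```

Each printed row corresponds to one stacked column in the chart version of the same table.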
=COUNTIFS(Range1, Condition1, Range2, Condition2, …)
Categorical variables, including nominal (where numbers are simply labels) and ordinal, rank order
variables, are described by tabulating their frequencies or probability. If two categorical variables are
associated, the frequencies of values of one will depend on the frequencies of values of the other. Chi
square tests the hypothesized association between two categorical variables and contingency analysis
quantifies their association.
When Conditional Probabilities Differ from Unconditional Probabilities, There Is Evidence of Association
If the unconditional probabilities of category levels, such as sprouts versus strawberries topping, differ from the probabilities conditional on levels of another category, such as hummus or peanut butter spread, we have evidence of association. In this sandwich example, sprouts were chosen by half the students, making its unconditional probability .5. If a student chose hummus spread, the conditional probability of sprouts topping was higher (.60). If a student chose peanut butter spread, sprouts was the less likely topping choice (.40).
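The comparison of unconditional and conditional probabilities can be reproduced from a table of counts. The counts below are hypothetical, chosen only so that they yield the probabilities quoted in the sandwich example (.5, .60 and .40). The actual sample size is not given in the text.

```python
# Hypothetical counts chosen to reproduce the probabilities quoted above.
counts = {
    ("hummus", "sprouts"): 30, ("hummus", "strawberries"): 20,
    ("peanut butter", "sprouts"): 20, ("peanut butter", "strawberries"): 30,
}
total = sum(counts.values())

def p_topping(topping):
    """Unconditional probability of a topping."""
    return sum(n for (_, t), n in counts.items() if t == topping) / total

def p_topping_given_spread(topping, spread):
    """Probability of a topping among students who chose the given spread."""
    spread_total = sum(n for (s, _), n in counts.items() if s == spread)
    return counts[(spread, topping)] / spread_total

p_sprouts = p_topping("sprouts")
p_sprouts_hummus = p_topping_given_spread("sprouts", "hummus")
p_sprouts_pb = p_topping_given_spread("sprouts", "peanut butter")

print(p_sprouts, p_sprouts_hummus, p_sprouts_pb)
```

If spread and topping were unrelated, all three probabilities would be about equal; the gap between them is the evidence of association.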
Chi Square Tests Association Between Two Categorical Variables

Chi Square Is Unreliable If Cell Counts Are Sparse
Since the chi square components include expected cell counts in the denominator, sparse (with
expected counts less than five) cells inflate chi square. When sparse cells exist, we must either combine
categories or collect more data.
In the Recruiting Stars example, management was most interested in increasing the chances of hiring
Outstanding performers. Since some believed that Outstanding performers were recruited from
programs in the Home State, these categories were preserved. Same Region and Outside Region
program locations were combined. Poor and Average performance categories were combined. We
are left with a 2 x 2 contingency analysis, shown in
With fewer categories, all expected cell counts are now greater than five. For example, the expected count for a cell is (row total × column total) / grand total:
25*25/40
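The expected-count check can be sketched for a 2 x 2 table. In the example below only the margins (row total 25, column total 25, grand total 40) follow the arithmetic shown above. The individual observed cell counts are hypothetical, since the original table is not reproduced here.

```python
# Hypothetical observed counts; only the margins (25, 25, total 40) follow the text.
observed = [
    [20, 5],
    [5, 10],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell = row total * column total / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi square = sum over cells of (observed - expected)^2 / expected.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Cells with expected count below five make chi square unreliable.
sparse_cells = [e for row in expected for e in row if e < 5]

print(expected, round(chi_square, 3), sparse_cells)
```

An empty `sparse_cells` list is the signal that the combined categories are large enough for the test.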
Pros and Cons

Pros:
•It is easier to compute.
•It can also be used with nominal data.
•It does not assume anything about the data distribution.
Cons:
•The number of observations should be more than 20.
•Data must be frequency data.
•It assumes random sampling. It means the sample should be selected randomly.
•It is sensitive to small frequencies, which leads to erroneous conclusions.
•It is also sensitive to sample size.
INTERACTIVE COURSE: Introduction to Statistics in Spreadsheets (DataCamp)
Counts versus Percentages

There is no single correct way to display the data in a crosstab.
Ultimately, the data are always counts, but they can be shown as raw
counts, percentages of row totals, percentage of column totals, or
even percentage of overall total. Nevertheless, when you are looking
for relationships between 2 categorical variables, showing the counts
as percentages of row totals or percentages of column totals usually
makes any relationships stand out more clearly. Corresponding charts
are also very useful.
Categorical & Continuous: While exploring the relation between categorical and continuous variables, we can draw box plots for each level of the categorical variable.
Breaking down by Category

There is arguably no more powerful data analysis technique than
breaking down a numerical variable by a categorical variable. The
methods, especially side-by-side box plots and pivot tables, get you
started with this general comparison problem. They allow you to see
quickly, with charts and/or numerical summary measures, how two or
more categories compare.
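Breaking a numerical variable down by a categorical one can be sketched in a few lines. The order values and segment names below are hypothetical. The summary holds the kind of numbers a pivot table or side-by-side box plot would display for each category.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical order values by customer segment.
orders = [
    ("Retail", 20.0), ("Retail", 35.0), ("Retail", 50.0),
    ("Wholesale", 200.0), ("Wholesale", 260.0),
    ("Wholesale", 180.0), ("Wholesale", 400.0),
]

# Collect the numerical values under each category.
groups = defaultdict(list)
for segment, value in orders:
    groups[segment].append(value)

# One row of summary measures per category.
summary = {
    segment: {
        "count": len(values),
        "min": min(values),
        "median": median(values),
        "mean": mean(values),
        "max": max(values),
    }
    for segment, values in groups.items()
}

for segment, stats in summary.items():
    print(segment, stats)
```

A mean well above the median, as in the Wholesale group here, hints at a few large values pulling the average up.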
(ii) What-If, Scenario
Expert practitioners of business problem solving often employ existing frameworks or theories to cleave problems into insightful parts more quickly and elegantly. In fact, many business problems benefit from this particular type of problem disaggregation, because it shows the levers of revenue (price, volume, market share), costs (fixed, variable, overheads), and asset utilization so clearly, and in mathematical relationship to each other. This kind of tree also makes “what if” competitive scenario analysis easy. For example, it is easy to model both niche and broad market product strategies, and for your team to debate how realistic the assumptions are that generate the results. We often call this kind of work “what you have to believe” analysis.
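A profit lever tree of this kind can be written as one small function, which makes the "what if" comparison explicit. Everything numeric below is a hypothetical assumption, and the function names are illustrative. The point is that the team debates the inputs, while the formula stays fixed.

```python
def profit(market_size, market_share, price, variable_cost, fixed_cost, overheads):
    """Profit tree: revenue levers minus cost levers."""
    volume = market_size * market_share
    revenue = price * volume
    costs = variable_cost * volume + fixed_cost + overheads
    return revenue - costs

def break_even_share(market_size, price, variable_cost, fixed_cost, overheads, **_):
    """'What you have to believe': the market share needed for zero profit."""
    return (fixed_cost + overheads) / (market_size * (price - variable_cost))

# Hypothetical assumptions for two strategies.
scenarios = {
    "niche": dict(market_size=100_000, market_share=0.05, price=120.0,
                  variable_cost=60.0, fixed_cost=150_000.0, overheads=50_000.0),
    "broad": dict(market_size=100_000, market_share=0.30, price=70.0,
                  variable_cost=55.0, fixed_cost=300_000.0, overheads=100_000.0),
}

results = {name: profit(**inputs) for name, inputs in scenarios.items()}
required_share = {name: break_even_share(**inputs) for name, inputs in scenarios.items()}

print(results)
print(required_share)
```

Comparing `required_share` with the assumed `market_share` shows how much room for error each strategy leaves.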
Analysis ToolPak in Excel
Using a Data Table to Repeat Simulations
Data table in Excel
Data table in Excel (1-Variable) by Row
Data table in Excel (1-Variable) by Column
Data table in Excel (2-Variable)
What-if Analysis (Scenario Manager)
More than 2 variables => Scenario Manager
1. Make Your First Scenario
Step 1: Set up the First Scenario
Now we'll dig into What-If Analysis in Excel. We'll open up the Scenario Manager and begin:
1. First, select all the cells that will change. To do that, click B4, hold the Ctrl key (Command key on the Mac) while dragging from B6 down to B12, then Ctrl + click (Command + click on the Mac) B17.
2. On the ribbon, select the Data tab > What-If Analysis > Scenario Manager.
2. Big guns of BI
This displays the Scenario

Manager dialog box.
Since we haven’t created
any scenarios yet, it says
there are none defined.
Each scenario will be a set

of the cells you just
selected, containing
unique values. The first set
will be the current values.
Step 2: Now Create the First Scenario
1. In the dialog box, click Add.
2. Enter the name Original values.
3. The changing cells are what you selected. If you selected different cells by mistake, you can enter the correct ones here (see image below).
4. Enter a comment if you want. This is optional.
5. The checkboxes for Protection are only needed if you want to protect the sheet from changes. We won’t do that in this exercise, so ignore these choices.
Click OK. The Scenario Values dialog box shows you a list of all the cells in the scenario and what their current values are. Note that you can’t resize this box, so use its scrollbar to see all of them.
For now, there’s nothing to change, but note the Add button. A quick way of creating several scenarios one after another is to click this Add button after entering values. That will immediately display the Add Scenario screen. For now, click OK. That brings back the main Scenario Manager dialog, showing the first one listed.
The Manager has buttons for adding a new scenario, deleting one, editing one, merging in a scenario from another open workbook, and creating a summary. The summary is the coolest part, and we’ll do that below.
2. Create Additional Scenarios
Step 1: Add More Scenarios
Click Add. This is the same thing as clicking the Add button in the previous step.
Create 3 more scenarios using the data from the table below. The general concept is that larger venues will have higher costs (not always in proportion) along with the ability to charge higher ticket prices, resulting in greater revenues. For the sake of simplicity, assume that if a concert has more than one act, they’re combined in the Artist category.
The fastest way of entering the numbers is not to use the mouse. Just type a number, press the Tab key, type another number, press the Tab key, and so on.
After entering the last scenario, click OK to return to the main Scenario Manager screen. It should look like this:
Step 2: Switch Between
The sheet still shows the original values, so here’s the first cool feature: double-click one of the scenario names in the list. The sheet updates with those values.
Step 3: View All the Scenarios at Once
1. Click the Summary button.
2. That confirms you want to create a summary, not a PivotTable, so leave the default radio button selected.
3. It also confirms the main result cell is the Profit or Loss in B24.
Click OK. That creates a new sheet in the workbook, called Scenario Summary.
Step 4: Engaging With the Scenario Summary

This shows the values that the sheet currently displays (you could have changed these manually) as well
as the sets of numbers from all four scenarios.
Notice the small plus and minus symbols in the margins. These are part of Excel’s Group and Outline
feature, which you can use separately from Scenario Manager. The Outline button is also on the
ribbon’s Data tab, all the way on the end.
Click any of the minus signs to collapse the sheet so it shows only summary data, or click the plus signs to
expand and show detail.
Step 5: Two Things to Be Aware Of
1. None of the values are dynamic. If you change the underlying data on the original sheet, the values on this sheet will not change. You will need to create a new summary.
2. Down column C, you see Excel lists the cell references, not their labels (Artist, Venue rental, etc.). If you want to see the labels, stretch out column C and type them manually.
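The behaviour of the Scenario Summary, in particular the fact that it is a static snapshot, can be mimicked in a short sketch. The cell names, numbers and the `profit_or_loss` formula below are hypothetical stand-ins for the worksheet, not the workbook's actual layout.

```python
def profit_or_loss(cells):
    """Hypothetical stand-in for the worksheet's result cell."""
    revenue = cells["ticket_price"] * cells["tickets_sold"]
    costs = cells["artist"] + cells["venue_rental"]
    return revenue - costs

# Each scenario is a named set of values for the same changing cells.
scenarios = {
    "Original values": {"ticket_price": 40.0, "tickets_sold": 500,
                        "artist": 8000.0, "venue_rental": 5000.0},
    "Larger venue": {"ticket_price": 60.0, "tickets_sold": 1500,
                     "artist": 30000.0, "venue_rental": 25000.0},
}

# The summary is computed once and stored, like the Scenario Summary sheet.
summary = {name: profit_or_loss(cells) for name, cells in scenarios.items()}

# Changing the underlying data afterwards does not update the stored summary.
scenarios["Original values"]["tickets_sold"] = 600
stale = summary["Original values"]
fresh = profit_or_loss(scenarios["Original values"])

print(summary, stale, fresh)
```

The gap between `stale` and `fresh` is the reason a new summary has to be created after any change to the inputs.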
(iii) Sentiment Analytics
(iv) Cohort analytics
➢ What is it?
Cohort analysis is a subset of behavioural analytics which allows you to study the behaviour of a group
over time. The groups or cohorts in this context are aggregations of data points, or relevant stakeholders
in your business, and the data may come from e-commerce platforms, web applications or sales data.
➢ When do I use it?

Cohort analysis is especially useful if you want to know more about the behaviour of a group of your
stakeholders such as customers or employees. Cohort analysis is therefore more commercially relevant
and can help you identify problems and make better decisions.
➢ What business questions is it helping me to answer?

It can help you to answer:
● Who are my most profitable customers?
● What characteristics do my different customer groups share in common?
● How do particular groups of my customers behave?
● What characteristics do particular groups of employees share in common?
What is this?
How do I use it?
There are four essential steps in cohort analysis:

1/ You need to determine what questions you want answered. What are you trying to find out about your
stakeholder group or sub-groups? The whole point of analysis of any type is to provide answers to
strategically important questions so you can identify what needs to be done differently or what opportunity
needs to be exploited in order to improve some element of your business.
2/ Next you need to define the metrics that will be able to help you answer the questions you’ve identified.
Cohort analysis requires the identification of the specific properties you are going to examine in the group.
For example, gender, date of first or last purchase, level of purchase, etc.
3/ Next you need to define or identify the specific cohorts you are going to assess. This will often require you
to perform attribute contribution in order to find the relevant differences between each group or sub-group
so that you can use that difference to explain the difference in their behaviour.
4/ Finally you perform the cohort analysis. The findings are often then visualised in a graph or table. This
visualisation can help spot the patterns that will then shape decision making.
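Steps 3 and 4 can be sketched for the common case where cohorts are defined by month of first purchase. That cohort definition is an assumption chosen for illustration, and the purchase records are hypothetical. The result is a retention table: the share of each cohort still buying N months after its first purchase.

```python
from collections import defaultdict

# Hypothetical purchases: (customer, "YYYY-MM" of purchase).
purchases = [
    ("c1", "2018-01"), ("c1", "2018-02"), ("c1", "2018-03"),
    ("c2", "2018-01"), ("c2", "2018-03"),
    ("c3", "2018-02"), ("c3", "2018-03"),
    ("c4", "2018-02"),
]

def month_index(ym):
    """Turn 'YYYY-MM' into a running month number so months can be subtracted."""
    year, month = ym.split("-")
    return int(year) * 12 + int(month)

# Step 3: define cohorts by month of first purchase.
first_purchase = {}
for customer, ym in purchases:
    if customer not in first_purchase or ym < first_purchase[customer]:
        first_purchase[customer] = ym

# Step 4: distinct active customers per (cohort, months since first purchase).
active = defaultdict(set)
for customer, ym in purchases:
    cohort = first_purchase[customer]
    offset = month_index(ym) - month_index(cohort)
    active[(cohort, offset)].add(customer)

cohort_sizes = {c: len(active[(c, 0)]) for c in set(first_purchase.values())}
retention = {key: len(members) / cohort_sizes[key[0]] for key, members in active.items()}

for key in sorted(retention):
    print(key, retention[key])
```

Laid out as a table (cohorts as rows, month offsets as columns), these numbers form the familiar cohort retention grid.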
(v) Correlogram Analytics
The autocorrelation function plot (correlogram) shows a series correlated with itself, lagged by x time units. So if you take each of the correlation numbers we just calculated and plot them, you have the autocorrelation function plot, or ACF plot.
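The numbers behind an ACF plot can be computed directly. The sketch below uses a short hypothetical trending series and hand-written helpers. It computes the autocorrelation at lags 1 to 5 for the raw series and again after taking the first difference.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series with itself shifted by `lag` steps."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((v - mean) ** 2 for v in series)
    num = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return num / denom

def acf(series, max_lag):
    """Autocorrelations for lags 1..max_lag (the heights of the ACF bars)."""
    return [autocorrelation(series, lag) for lag in range(1, max_lag + 1)]

def difference(series):
    """First difference: each value minus the previous one."""
    return [b - a for a, b in zip(series, series[1:])]

# Hypothetical trending series (steady growth with uneven steps).
series = [0, 3, 6, 7, 8, 11, 12, 13, 16, 19, 20, 23, 26, 27, 28, 31, 32]

acf_raw = acf(series, 5)
acf_diff = acf(difference(series), 5)

print([round(r, 2) for r in acf_raw])
print([round(r, 2) for r in acf_diff])
```

For this series the raw autocorrelations start high and fall off gradually, which is the slow-decay pattern of a trending series, while the differenced series has a much smaller lag-1 value.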
Vertical axis: correlation coefficient
Horizontal axis: number of lags shown
Undifferenced series: This series is undifferenced and has a slow decay toward 0 correlation, meaning that the current values are much more correlated with recent values than with values further in the past. This suggests that the series is not stationary and will need to be differenced to reach a stationary series.
First difference: This plot is for the same data set, but now we’ve taken the first difference. Lag 1 is significant; after lag 1 the correlation is much smaller, suggesting the series is now stationary.
Autocorrelation function plot (correlogram)
(vi) Multidimensional Analytics
Multi-dimensional Visualization: Scatter plot matrix, Bubble Chart, Sankey Flow
(vi) Multidimensional Analytics – Scatterplot matrix
(vi) Multidimensional Analytics – Bubble charts
(vi) Multidimensional Analytics – Sankey Flow
(vi) Multidimensional Analytics – Contour Chart
We present and discuss several dimension reduction approaches:
(1) Incorporating domain knowledge to remove or combine categories,
(2) Using data summaries to detect information overlap between variables (and remove or combine redundant variables or categories),
(3) Using data conversion techniques such as converting categorical variables into numerical variables, and
(4) Employing automated reduction techniques, such as principal components analysis (PCA), where a new set of variables (which are weighted averages of the original variables) is created.
The dimensionality of a model is the number of independent or input variables used by the model. In the artificial intelligence literature, dimension reduction is often referred to as factor selection or feature extraction.
(1) Domain knowledge
Practical Considerations
The integration of expert knowledge through a discussion with the data provider (or user) will probably
lead to better results.
Practical considerations include:

• Which variables are most important for the task at hand, and which are most likely to be useless?
• Which variables are likely to contain much error?
• Which variables will be available for measurement (and what will it cost to measure them) in the future
if the analysis is repeated?
• Which variables can actually be measured before the outcome occurs? (For example, if we want to
predict the closing price of an ongoing online auction, we cannot use the number of bids as a
predictor because this will not be known until the auction closes.)
(2) Detect information overlap
Data visualization: an important initial step of data exploration is getting familiar with the data and their characteristics through summaries and graphs. The importance of this step cannot be overstated. Numerical summaries and graphs of the data are very helpful for data reduction.
(1) Summary Statistics
(2) Graphs
(3) Pivot Tables
(4) Correlation Analysis
Interpreting the results of a Principal Component Analysis in Excel using XLSTAT: how to interpret a PCA correlation matrix
The first result to look at is the correlation matrix. We can see right away that the rates of people below and above 65 are perfectly negatively correlated (r = -1). Either of the two variables could have been removed without affecting the quality of the results. We can also see that Net Domestic Migration has a low correlation with the other variables, including Net International Migration. This means that U.S. nationals and non-nationals may be moving to a state for different sets of reasons.
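Screening a correlation matrix for redundant variables can be automated in a simple way. The sketch below uses hypothetical state-level numbers (the two age columns sum to 100, so they correlate at -1) and a hypothetical threshold of 0.95. It flags pairs of variables where one could be dropped.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical state-level data; the two age columns always sum to 100.
data = {
    "Pct under 65": [88.0, 85.0, 90.0, 82.0, 87.0],
    "Pct 65 and over": [12.0, 15.0, 10.0, 18.0, 13.0],
    "Net domestic migration": [1.2, -0.4, 0.3, 0.9, -1.1],
}

def redundant_pairs(columns, threshold=0.95):
    """Pairs of variables whose absolute correlation reaches the threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs

pairs = redundant_pairs(data)
print(pairs)
```

Which variable of a flagged pair to keep is a domain decision; the correlation only shows that the two carry the same information.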
(3) Data conversion techniques

Reducing the Number of Categories in Categorical Variables
When a categorical variable has many categories, and this variable is destined to be a predictor, many
data mining methods will require converting it into many dummy variables. In particular, a variable with m
categories will be transformed into m – 1 dummy variables.
(1) Pivot tables are useful for this task: We can examine the sizes of the various categories and how the
response behaves at each category. Generally, categories that contain very few observations are
good candidates for combining with other categories. Use only the categories that are most relevant to
the analysis, and label the rest as “other.”
(2) In classification tasks (with a categorical output), a pivot table broken down by the output classes can
help identify categories that do not separate the classes. Those categories too are candidates for
inclusion in the “other” category. An example is shown in the figure below, where the distribution of the output
variable CAT.MEDV is broken down by ZN (treated here as a categorical variable). We can see that the
distribution of CAT.MEDV is identical for ZN=17.5, 90, 95, and 100 (where all neighborhoods have
CAT.MEDV=1). These four categories can then be combined into a single category. Similarly, categories
ZN=12.5, 25, 28, 30, and 70 can be combined. Further combination is also possible based on similar bars.
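The pivot-table reasoning above can be sketched in a few lines. The records below are made-up (ZN, CAT.MEDV) pairs, not the actual housing data, and the helper names are hypothetical.

```python
from collections import defaultdict

# Hypothetical (ZN, CAT.MEDV) records; illustrative values only.
records = [
    (17.5, 1), (90, 1), (90, 1), (95, 1), (100, 1),
    (12.5, 0), (12.5, 0), (25, 0), (28, 0),
    (0, 0), (0, 0), (0, 1), (0, 0),
]

def pivot_by_class(rows):
    """Pivot table: category -> [count of class 0, count of class 1]."""
    table = defaultdict(lambda: [0, 0])
    for category, label in rows:
        table[category][label] += 1
    return dict(table)

def group_by_class_share(table):
    """Group together categories whose share of class 1 is identical."""
    groups = defaultdict(list)
    for category, (n0, n1) in sorted(table.items()):
        groups[round(n1 / (n0 + n1), 2)].append(category)
    return dict(groups)

table = pivot_by_class(records)
groups = group_by_class_share(table)
print(groups)
```

Categories that land in the same group do not separate the classes from one another, so they can be merged into a single category.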
2. Big guns of BI
CAT.MEDV is a categorical variable created by splitting the median house value (MEDV) into two categories,
high and low: if MEDV ≥ $30,000, CAT.MEDV = 1; if MEDV < $30,000, CAT.MEDV = 0. (A couple of aspects of MEDV
bear noting. For one thing, it is quite low, since it dates from the 1970s. For another, there are a lot of
50s, the top value. It could be that median values above $50,000 were recorded as $50,000.)
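The rule for CAT.MEDV can be written as a one-line threshold. The sample values below are hypothetical and expressed in thousands of dollars; the count of values sitting at 50 is a quick way to check for the suspected ceiling.

```python
def cat_medv(medv_thousands, cutoff=30.0):
    """Return 1 when the median value (in $1000s) is at or above the cutoff, else 0."""
    return 1 if medv_thousands >= cutoff else 0

medv = [24.0, 21.6, 34.7, 30.0, 50.0]  # hypothetical values in $1000s
labels = [cat_medv(v) for v in medv]
at_ceiling = sum(1 for v in medv if v == 50.0)  # values at the suspected cap
print(labels, at_ceiling)
```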
2. Big guns of BI
In a time series context where we might have a categorical variable denoting season (such as month, or hour of
day) that will serve as a predictor, reducing categories can be done by examining the time series plot and
identifying similar periods. For example, the time plot in the figure below shows the quarterly revenues of
Toys “R” Us between 1992 and 1995. Only quarter 4 periods appear different, and therefore we can combine
quarters 1–3 into a single category.
[Figure: Quarterly revenues of Toys “R” Us, 1992–1995]
Sometimes the categories in a categorical variable represent intervals. Common examples are age group or
income group. If the interval values are known (e.g., category 2 is the age interval 20–30), we can replace the
categorical value (“2” in the example) with the mid-interval value (here “25”). The result will be a numerical
variable that no longer requires multiple dummy variables.
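Both conversions can be sketched as small lookup functions. Only the mapping of category 2 to the interval 20–30 comes from the text; the other interval bounds and the sample codes are assumptions for illustration.

```python
# Assumed interval bounds per age-group code (only code 2 = 20-30 is from the text).
AGE_INTERVALS = {1: (10, 20), 2: (20, 30), 3: (30, 40)}

def midpoint(code):
    """Replace an interval category code with its mid-interval value."""
    low, high = AGE_INTERVALS[code]
    return (low + high) / 2

def season_group(quarter):
    """Collapse quarters 1-3 into one category and keep quarter 4 separate."""
    return "Q4" if quarter == 4 else "Q1-Q3"

age_codes = [2, 1, 3, 2]
age_numeric = [midpoint(c) for c in age_codes]

quarters = [1, 2, 3, 4, 1, 4]
seasons = [season_group(q) for q in quarters]
print(age_numeric, seasons)
```

After the first conversion the age variable is numeric and needs no dummy columns; after the second, the four-level quarter variable needs only one dummy instead of three.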
2. Big guns of BI
(vi) Multidimensional Analytics - Principal Component Analysis (PCA) - XLSTAT
1. Dimensionality Reduction
2. PCA Introduction
3. PCA – Excel
• What is Principal Component Analysis?
• Several uses
• Options for Principal Component Analysis in Excel
using the XLSTAT software
• Results for Principal Component Analysis in XLSTAT
• XLSTAT charts for Principal Component Analysis in
Excel
4. Tutorials on how to run PCA in Excel using the
XLSTAT software
• Setting up a Principal Component Analysis in Excel
using XLSTAT
• Interpreting the results of a Principal Component
Analysis in Excel using XLSTAT
• Note on the usage of Principal Component Analysis
• Going further
2. Big guns of BI
Dimensionality reduction is one of the important techniques in machine learning. Feature vectors in real-world
problems can have a very large number of dimensions, up to several thousand. In addition, the number of data
points is often very large. Storing and computing directly on such high-dimensional data is difficult in terms of
both storage and computation speed. Reducing the number of dimensions is therefore an important step in many
problems. It can also be regarded as a data compression method.
Put simply, dimensionality reduction means finding a function that takes an original data point $x \in \mathbb{R}^D$,
with $D$ very large, as input and produces a new data point $z \in \mathbb{R}^K$ with fewer dimensions, $K < D$.
Principal Component Analysis (PCA) is the simplest of the dimensionality reduction algorithms based on a linear
model. The method rests on the observation that data are usually not distributed randomly in space but tend to
lie close to certain special lines or surfaces. PCA considers the special case in which those surfaces are
linear, that is, subspaces.
2. Big guns of BI
Principal Component Analysis

The simplest way to reduce the data from $D$ dimensions to $K < D$ is to keep only the $K$ most important elements.
However, this is certainly not the best approach, because we do not yet know how to determine which components
are more important. In the worst case, every component carries the same amount of information, so dropping any
of them loses a large amount of information.
However, if we can represent the original data vectors in a new basis in which the importance of the components
differs markedly, we can discard the least important components.
Consider two cameras photographing a person, one placed in front of the person and one placed overhead. Clearly,
the image from the front camera carries more information than the view from above. The overhead image can
therefore be dropped without losing much information about the person's shape.
PCA is the method of finding a new basis such that the information in the data is concentrated mainly in a few
coordinates, while the remaining coordinates carry only a small amount of information. To keep the computation
simple, PCA looks for an orthonormal system to serve as the new basis.
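A minimal NumPy sketch of this idea, assuming synthetic data that lie close to a line in 3-D space. The function name `pca_project` and the data are illustrative; this is one common way to compute PCA (eigen-decomposition of the covariance matrix), not the only one.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X (n x D) onto the top-k principal axes."""
    centered = X - X.mean(axis=0)
    cov = np.cov(centered, rowvar=False)        # D x D covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]           # reorder by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    W = eigvecs[:, :k]                          # D x k orthonormal basis
    Z = centered @ W                            # n x k coordinates in the new basis
    share = eigvals[:k].sum() / eigvals.sum()   # fraction of total variance kept
    return Z, W, share

rng = np.random.default_rng(0)
t = rng.normal(size=200)
# Synthetic 3-D points near a line, so a single axis carries most of the information.
X = np.column_stack([t, 2 * t, -t]) + 0.05 * rng.normal(size=(200, 3))
Z, W, share = pca_project(X, k=1)
print(Z.shape, round(float(share), 4))
```

Here D = 3 and K = 1: each point is described by one coordinate instead of three, with very little of the total variance lost.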
2. Big guns of BI
Principal Component Analysis (PCA) is one of the most popular statistical methods in data mining.
It is a powerful multivariate analysis method that lets you investigate multidimensional datasets
with quantitative variables. It is widely used in biostatistics, marketing, sociology, and many other fields.
In XLSTAT, you can run your PCA on raw data or on dissimilarity matrices, add supplementary variables or
observations, and filter out variables or observations according to different criteria to optimize the
readability of the PCA map. You can also perform rotations such as VARIMAX, customize your correlation
circle, your observations plot, or your biplots as standard Excel charts, and copy your PCA coordinates
from the results report to use them in further analyses.
2. Big guns of BI
What is Principal Component Analysis?
Principal Component Analysis is one of the most frequently used multivariate data analysis methods.
It is a projection method as it projects observations from a p-dimensional space with p variables to a
k-dimensional space (where k < p) so as to conserve the maximum amount of information
(information is measured here through the total variance of the dataset) from the initial dimensions.
PCA dimensions are also called axes or Factors. If the information associated with the first 2 or 3 axes
represents a sufficient percentage of the total variability of the scatter plot, the observations could be
represented on a 2 or 3-dimensional chart, thus making interpretation much easier.
The Principal Components tool can reduce the dimensions (the number of numeric fields) in a
database. It does this by transforming the original set of fields into a smaller set that accounts for most
of the variance (i.e., information) in the data. The new fields are called factors, or principal
components.
2. Big guns of BI
The Principal Component Analysis, a Data Mining tool

PCA can thus be considered a data mining method, as it makes it easy to extract information from
large datasets. There are several uses for it, including:
• Studying and visualizing the correlations between variables, ideally to limit the
number of variables to be measured afterwards;
• Obtaining non-correlated factors, which are linear combinations of the initial variables, so as to use
these factors in modeling methods such as linear regression, logistic regression, or discriminant
analysis;
• Visualizing observations in a 2- or 3-dimensional space in order to identify uniform or atypical groups
of observations.
2. Big guns of BI
Options for Principal Component Analysis in Excel using the XLSTAT software
• Pearson or Covariance?
• Supplementary variables and observations
• Rotations: Varimax and others
XLSTAT offers several data treatments to be used on the input data prior to Principal Component Analysis
computations:
• Correlation (Pearson), the classic PCA, which automatically standardizes the data prior to
computations to avoid inflating the impact of variables with high variances on the result.
• Covariance, which works on unstandardized variances and covariances, so variables with high
variances play stronger roles in the outputs.
• Spearman. Spearman's correlations may be more appropriate when running the PCA on variables with
different distributions.
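The difference between the correlation and covariance options can be illustrated with two synthetic, unrelated variables on very different scales. The data and the helper `first_axis` are assumptions for this sketch; it uses NumPy rather than XLSTAT.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
small = rng.uniform(0, 1, size=n)       # low-variance variable
large = rng.uniform(0, 1000, size=n)    # high-variance variable
X = np.column_stack([small, large])

def first_axis(matrix):
    """Absolute weights of the variables on the first principal axis."""
    eigvals, eigvecs = np.linalg.eigh(matrix)
    return np.abs(eigvecs[:, np.argmax(eigvals)])

cov_axis = first_axis(np.cov(X, rowvar=False))         # covariance PCA
corr_axis = first_axis(np.corrcoef(X, rowvar=False))   # correlation (Pearson) PCA
print(np.round(cov_axis, 4), np.round(corr_axis, 4))
```

With the covariance matrix, the first axis is almost entirely the high-variance variable; with the correlation matrix, both variables receive equal weight.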
2. Big guns of BI
• PCA projects the variables into a new space using a matrix that measures the degree of similarity between the variables. It is common to use
the Pearson correlation coefficient or the covariance as the index of similarity. Both have the advantage of giving positive semidefinite
matrices, whose properties are used in PCA. However, other indexes may be used. XLSTAT also makes it possible to use the Spearman correlation
coefficient because the corresponding matrices are also positive semidefinite. Running a PCA on a Spearman correlation matrix is fully
equivalent to a classic PCA (based on Pearson correlation) performed on the matrix of ranks. When you run a PCA based on Spearman
correlations, XLSTAT offers the option to display the matrix of the ranks in the report.
• Traditionally, a correlation coefficient rather than the covariance is used, because a correlation coefficient removes the effect of scale: a
variable that varies between 0 and 1 does not weigh more in the projection than a variable varying between 0 and 1000. However, in certain
areas, when the variables are supposed to be on an identical scale or we want the variance of the variables to influence factor building,
covariance is used.
• Where only a similarity matrix is available rather than a table of observations/variables, or where you want to use another similarity index,
you can carry out a PCA starting from the similarity matrix. The results obtained will only concern the variables, as no information on the
observations is available.
• Note: where PCA is carried out on a correlation matrix, it is called normalized PCA.
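The equivalence between Spearman correlation and Pearson correlation on ranks can be checked directly. The helper names and the small data set are assumptions for illustration; ties receive average ranks.

```python
from math import sqrt

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 8.0, 27.0, 64.0, 125.0]   # monotone but non-linear in x
print(round(pearson(x, y), 4), round(spearman(x, y), 4))
```

Because y increases whenever x does, the rank correlation is perfect even though the linear (Pearson) correlation is not, which is why Spearman can suit variables with different distributions.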
2. Big guns of BI
Rotations: Varimax and others

Rotations can be applied to the factors. Several methods are available, including Varimax,
Quartimax, Equamax, Parsimax, Quartimin, Oblimin, and Promax.
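For readers curious about what a rotation does, here is a sketch of Varimax using a widely used SVD-based iteration. It is not XLSTAT's implementation, and the loading matrix is hypothetical. The key property shown is that an orthogonal rotation redistributes the loadings across factors without changing how well each variable is explained overall.

```python
import numpy as np

def varimax(loadings, max_iter=50, tol=1e-8):
    """Rotate a (variables x factors) loading matrix toward the varimax criterion.

    Returns the rotated loadings and the orthogonal rotation matrix.
    """
    p, k = loadings.shape
    rotation = np.eye(k)
    objective = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient of the varimax criterion, then an SVD-based orthogonal update.
        column_ss = np.diag((rotated ** 2).sum(axis=0))
        gradient = loadings.T @ (rotated ** 3 - rotated @ column_ss / p)
        u, s, vt = np.linalg.svd(gradient)
        rotation = u @ vt
        new_objective = s.sum()
        if objective != 0.0 and new_objective / objective < 1 + tol:
            break
        objective = new_objective
    return loadings @ rotation, rotation

# Hypothetical loadings of four variables on two factors.
loadings = np.array([[0.7, 0.5], [0.8, 0.4], [0.5, -0.6], [0.4, -0.7]])
rotated, rotation = varimax(loadings)
print(np.round(rotated, 3))
```

The sum of squared loadings per variable (its communality) is identical before and after rotation; only the split between the two factors changes.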
2. Big guns of BI
Supplementary variables and observations
XLSTAT lets you add variables (qualitative or quantitative) or observations to
the PCA after it has been computed. Those variables or observations are
called supplementary. This can be used in several contexts.
Here are two examples:
• The user wants to investigate roughly how a set of dependent variables
relates to the others. The dependent variables should then be used as
supplementary variables, and the others (i.e., the independent
variables) should be used to build the PCA.
• The user simply wants to see how different categories of observations
behave in the PCA space (males vs. females, for example). In this case, a
qualitative supplementary variable (sex) may be used to color observations
according to the category they belong to. It is also possible to display the
category centroids as well as confidence ellipses around categories.
2. Big guns of BI
Results for Principal Component Analysis in XLSTAT

• What are the Correlation/Covariance matrices?
• What is Bartlett's sphericity test in PCA?
• What are Eigenvalues and inertia?
• What are contributions?
• How to interpret Squared cosines for the variables
• What are factor scores?
• Results with rotations
What is Bartlett's sphericity test in PCA?

The results of Bartlett's sphericity test are displayed. They are used to decide
whether or not to reject the hypothesis that the variables are not correlated.
XLSTAT also offers the Kaiser-Meyer-Olkin (KMO) test.
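As a sketch of what the test computes, the standard Bartlett statistic is -(n - 1 - (2p + 5)/6) · ln|R|, with p(p - 1)/2 degrees of freedom, where R is the correlation matrix of p variables observed on n rows. The data below are synthetic and the function name is an assumption; the statistic is compared against a chi-square distribution to obtain a p-value, which this sketch does not compute.

```python
import numpy as np

def bartlett_sphericity(X):
    """Bartlett's sphericity statistic and degrees of freedom for an n x p table."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    _, logdet = np.linalg.slogdet(R)             # ln|R|, computed stably
    statistic = -(n - 1 - (2 * p + 5) / 6) * logdet
    dof = p * (p - 1) // 2
    return statistic, dof

rng = np.random.default_rng(2)
base = rng.normal(size=(100, 1))
# Three variables sharing a common component, versus three unrelated variables.
correlated = np.hstack([base + 0.3 * rng.normal(size=(100, 1)) for _ in range(3)])
independent = rng.normal(size=(100, 3))

stat_corr, dof = bartlett_sphericity(correlated)
stat_ind, _ = bartlett_sphericity(independent)
print(round(float(stat_corr), 1), round(float(stat_ind), 1), dof)
```

A large statistic, as for the correlated table, argues against the hypothesis that the variables are uncorrelated, which is the situation in which PCA is worthwhile.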
2. Big guns of BI
What are Eigenvalues and inertia?
Eigenvalues measure the amount of information (inertia) summarized in
each dimension. The first dimension contains the highest amount of
inertia, followed by the second, then the third, and so on. XLSTAT
displays eigenvalues in a table and in a chart (scree plot). The number
of eigenvalues displayed is equal to the number of non-null eigenvalues.
What are contributions?
Contributions (also called absolute contributions) represent the extent
to which each variable contributed to building the corresponding PCA
axis. They help in interpreting the axes.
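A sketch of how eigenvalues, percentages of inertia, and variable contributions relate in a normalized (correlation-based) PCA. The table is synthetic, and the formula used for contributions (squared eigenvector entries, in %) is the usual definition for this case; XLSTAT's output layout may differ.

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic table: 60 observations, 4 quantitative variables (illustrative only).
latent = rng.normal(size=(60, 1))
X = np.hstack([latent + rng.normal(scale=0.5, size=(60, 1)) for _ in range(3)]
              + [rng.normal(size=(60, 1))])

R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]                 # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

inertia_pct = 100 * eigvals / eigvals.sum()       # % of variability per axis
cumulative_pct = np.cumsum(inertia_pct)           # cumulative % (scree plot line)
contributions_pct = 100 * eigvecs ** 2            # rows: variables, columns: axes
print(np.round(inertia_pct, 1))
print(np.round(contributions_pct[:, 0], 1))
```

On each axis the contributions of all variables add up to 100%, so a variable whose contribution is well above the average (here 25%) is one that largely defines that axis.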
2. Big guns of BI

How to interpret Squared cosines for the variables
Squared cosines reflect the representation quality of a variable on a PCA axis. As in other factor methods,
squared cosine analysis is used to avoid interpretation errors due to projection effects. If the squared
cosine of a variable on an axis is low, the position of the variable on this axis should not be
interpreted.
What are factor scores?
Factor scores are the coordinates of the observations on the PCA dimensions. They are displayed in a table
in XLSTAT. If supplementary data have been selected, these are displayed at the end of the table.
As for the results related to variables, XLSTAT displays the contributions of the observations (i.e., their
contribution in building the PCA axes) as well as squared cosines (i.e., their representation quality on the different axes).
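A sketch of factor scores and squared cosines on synthetic data. It assumes a normalized PCA with population standard deviations (dividing by n); XLSTAT may use a different convention, which would rescale the scores but not the squared cosines.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3))
X[:, 1] += X[:, 0]                                # make two variables correlated

Z = (X - X.mean(axis=0)) / X.std(axis=0)          # standardized data
R = Z.T @ Z / len(Z)                              # correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Z @ eigvecs                              # factor scores (observation coordinates)
var_coords = eigvecs * np.sqrt(eigvals)           # variable coordinates on each axis
var_cos2 = var_coords ** 2 / (var_coords ** 2).sum(axis=1, keepdims=True)
obs_cos2 = scores ** 2 / (scores ** 2).sum(axis=1, keepdims=True)
print(np.round(var_cos2, 3))
```

Each row of squared cosines sums to 1 across all axes. A value close to 0 on a given axis means the variable or observation is poorly represented there and its position on that axis should not be interpreted.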
2. Big guns of BI

Results with rotations

Where a rotation has been requested, the results of the rotation are displayed with the rotation matrix
first applied to the factor loadings. This is followed by the modified variability percentages associated
with each of the axes involved in the rotation. The coordinates, contributions and cosines of the
variables and observations after rotation are displayed in the following tables.
2. Big guns of BI
XLSTAT charts for Principal Component Analysis in Excel

• The correlation circle or variables chart
• The Observations charts
• The Biplots
The correlation circle or variables chart

The correlation circle (or variables chart) shows the
correlations between the components and the initial
variables. Supplementary variables can also be
displayed in the shape of vectors.
2. Big guns of BI

The Observations charts

The observations charts represent the observations in
the PCA space.
2. Big guns of BI
The Biplots
The biplots represent the observations and variables simultaneously
in the new space. Here as well the supplementary variables can be
plotted in the form of vectors. There are different types of biplots:
•Correlation biplot
•Distance biplot
•Symmetric biplot
XLSTAT lets you choose the coefficient whose square root is multiplied by the
coordinates of the variables. This coefficient lets you adjust the position of the
variable points in the biplot to make it more readable. If it is set to a value other
than 1, the length of the variable vectors can no longer be interpreted as a standard
deviation (correlation biplot) or a contribution (distance biplot).
2. Big guns of BI
Tutorials on how to run PCA in Excel using the XLSTAT software

Dataset for running a principal component analysis in Excel
The data are from the US Census Bureau and describe the changes in the population of 51 states
between 2000 and 2001. The initial dataset has been transformed to rates per 1000 inhabitants, with the
data for 2001 serving as the focus for the analysis.
Goal of this practice

Our goal is to analyze the correlations between the variables and to find out if the changes in
population in some states are very different from the ones in other states.
2. Big guns of BI
Principal Component Analysis
Principal Component Analysis is a very useful method to analyze numerical data structured in a table of M
observations by N variables. It allows you to:
• Quickly visualize and analyze correlations between the N variables,
• Visualize and analyze the M observations (initially described by the N variables) on a low-dimensional map,
the optimal view for a variability criterion,
• Build a set of P uncorrelated factors.
The limits of Principal Component Analysis stem from the fact that it is a projection method, and sometimes
the visualization can lead to false interpretations. There are, however, some tricks to avoid these pitfalls.
It is also important to note that PCA is an exploratory statistical tool and does not generally allow you to test
hypotheses. The advantage of this aspect is that PCAs may be run several times with observations or
variables being removed or added at every run, as long as those manipulations are justified in the
interpretations.
2. Big guns of BI
Setting up a Principal Component Analysis in Excel using XLSTAT
1. Selecting the data

• Once XLSTAT is activated, select the XLSTAT / Analyzing data /
Principal components analysis command (see below).
• The Principal Component Analysis dialog box will appear.
• Select the data on the Excel sheet.
• In this example, the data start from the first row, so it is quicker and
easier to use columns selection. This explains why the letters
corresponding to the columns are displayed in the selection
boxes.
• The Data format chosen is Observations/variables because of the
format of the input data.
2. Big guns of BI
2. Principal Component Analysis: what type to choose - Pearson or covariance
The PCA type that will be used during the
computations is the Correlation matrix, which
corresponds to the Pearson correlation coefficient.
Covariance matrices allocate more weight to
variables with higher variances. Spearman's
correlations may be more appropriate when running
the PCA on variables with different distributions.
2. Big guns of BI
3. Principal Component Analysis in XLSTAT, configuring outputs and charts
In the Outputs tab, we choose to activate the option
to display significant correlations in bold characters
(Test significance).
2. Big guns of BI
3. Principal Component Analysis in XLSTAT, configuring outputs and charts
In the Charts tab, in order to display the labels on all
charts, and to display all the observations
(observations charts and biplots), the filtering option
is unchecked. If there is a lot of data, displaying the
labels might slow down the global display of the
results. Displaying all the observations might make
the results unreadable. In these cases, filtering the
observations to display is recommended.
2. Big guns of BI
4. Principal Component Analysis in XLSTAT - launching the computations
The computations begin once you have clicked
on OK. You are asked to confirm the number of rows
and columns.
Note: This message can be bypassed by un-selecting
the "Ask for selections confirmation" in the XLSTAT
options panel.
Then you should confirm the axes for which you want
to display plots. In this example, the percentage of
variability represented by the first two factors is not
very high (67.72%); to avoid a misinterpretation of the
results, we have decided to complement the results
with a second chart on axes 1 and 3.
2. Big guns of BI
Interpreting the results of a Principal Component Analysis in Excel using XLSTAT

1. How to interpret a PCA correlation matrix
The first result to look at is the correlation matrix. We can see right away that the rates of people below and
above 65 are perfectly negatively correlated (r = -1), so either of the two variables could be removed without
affecting the quality of the results. We can also see that Net Domestic Migration has a low correlation with
the other variables, including Net International Migration. This suggests that U.S. nationals and
non-nationals may be moving to a state for different sets of reasons.
2. Big guns of BI

2. How to interpret Eigenvalues in Principal Component Analysis
The next table and the corresponding chart relate to a mathematical object, the eigenvalues, which
reflect the quality of the projection from the N-dimensional initial table (N=7 in this example) to a lower
number of dimensions. In this example, we can see that the first eigenvalue equals 3.567 and represents
51% of the total variability. This means that if we represent the data on only one axis, we will still be able to
see 51% of the total variability of the data.
Each eigenvalue corresponds to a factor, and each factor to one dimension. A factor is a linear
combination of the initial variables, and all the factors are uncorrelated (r=0). The eigenvalues and the
corresponding factors are sorted in descending order of how much of the initial variability they represent
(converted to %).
Broadly speaking, factor = PCA dimension = PCA axis.
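The percentages can be checked by hand. In a normalized PCA the eigenvalues sum to the number of variables (7 here), so each eigenvalue divided by 7 gives its share of the variability. The second eigenvalue below is not quoted in the tutorial text; it is only what the reported 67.72% for the first two factors implies.

```python
n_variables = 7            # N in the example
first_eigenvalue = 3.567   # reported for the first factor
first_two_pct = 67.72      # cumulative % reported for the first two factors

# Share of total variability carried by the first factor.
first_pct = 100 * first_eigenvalue / n_variables

# Implied values for the second factor (derived, not quoted in the source).
second_pct = first_two_pct - first_pct
second_eigenvalue = second_pct / 100 * n_variables

print(round(first_pct, 2), round(second_pct, 2), round(second_eigenvalue, 3))
```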
2. Big guns of BI
Ideally, the first two or three eigenvalues will correspond to a high % of the variance,
assuring us that the maps based on the first two or three factors are a good-quality
projection of the initial multi-dimensional table. In this example, the first two
factors allow us to represent 67.72% of the initial variability of the data. This is a
good result, but we'll have to be careful when we interpret the maps, as some
information might be hidden in the next factors. We can see here that although we
initially had 7 variables, the number of factors is 6. This is due to the two age
variables, which are perfectly negatively correlated (r = -1). The number of "useful"
dimensions has been automatically detected.
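Why 7 variables yield only 6 factors can be reproduced on synthetic data: when one variable is an exact linear function of another, the correlation matrix loses one dimension and one eigenvalue becomes null. The variables below are made up to mimic that situation; the 1e-8 cutoff for "numerically null" is an assumption.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 51
under_65 = rng.uniform(850, 920, size=n)      # hypothetical rates per 1000 inhabitants
over_65 = 1000 - under_65                     # perfectly negatively correlated (r = -1)
others = rng.normal(size=(n, 5))              # five unrelated synthetic variables
X = np.column_stack([under_65, over_65, others])   # 7 variables in total

R = np.corrcoef(X, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]
n_factors = int((eigvals > 1e-8).sum())       # eigenvalues that are not numerically null
print(n_factors, np.round(eigvals, 3))
```

The last eigenvalue is zero up to rounding error, so only six factors carry information.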
2. Big guns of BI

3. How to interpret results related to variables in PCA
The first map is called the correlation circle (below on axes F1 and F2). It shows a
projection of the initial variables in the factors space. When two variables are far
from the center:
• if they are close to each other, they are significantly positively correlated (r close to 1);
• if they are orthogonal, they are not correlated (r close to 0);
• if they are on opposite sides of the center, they are significantly negatively
correlated (r close to -1).
When the variables are close to the center, some information is carried on other axes,
and any interpretation might be hazardous. For example, we might be tempted to
interpret a correlation between the variables Net Domestic Migration and Net
International Migration although, in fact, there is none. This can be confirmed either
by looking at the correlation matrix or by looking at the correlation circle on axes
F1 and F3.
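The coordinates plotted on a correlation circle are the correlations between each variable and each factor. A sketch on synthetic data (the four variables are assumptions built to show the three situations described above):

```python
import numpy as np

rng = np.random.default_rng(6)
a = rng.normal(size=80)
X = np.column_stack([
    a,                                   # variable 0
    a + 0.2 * rng.normal(size=80),       # positively correlated with variable 0
    -a + 0.2 * rng.normal(size=80),      # negatively correlated with variable 0
    rng.normal(size=80),                 # unrelated variable
])

Z = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(X, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
scores = Z @ eigvecs

# Correlation-circle coordinates on F1 and F2: correlation of each variable with each factor.
circle = np.array([[np.corrcoef(Z[:, j], scores[:, k])[0, 1] for k in range(2)]
                   for j in range(X.shape[1])])
lengths = np.sqrt((circle ** 2).sum(axis=1))   # distance from the center, at most 1
print(np.round(circle, 3), np.round(lengths, 3))
```

Variables 0 and 1 fall on the same side of F1 and variable 2 on the opposite side. A variable whose vector is short on this map is mostly described by other axes, which is when reading a correlation off the map becomes hazardous.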
2. Big guns of BI

The correlation circle is useful in interpreting the meaning of the axes. In this
example, the horizontal axis is linked with age and population renewal, and the
vertical axis with domestic migration. These trends will be helpful in interpreting
the next map. To confirm that a variable is well linked with an axis, take a look at
the squared cosines table: the greater the squared cosine, the greater the link with
the corresponding axis. The closer the squared cosine of a given variable is to zero,
the more careful you have to be when interpreting the results in terms of trends on
the corresponding axis. Looking at this table, we can see that the trends for
international migration would be best viewed on an F2/F3 map.
2. Big guns of BI

4. How to interpret results related to observations in PCA
The next chart can be the ultimate goal of the Principal Component Analysis (PCA). It
enables you to look at the observations on a two-dimensional map and to identify
trends. We can see that the demographics of Nevada and Florida are unique, as are the
demographics of Utah and Alaska, two states that share common characteristics. Going
back to the table, we can confirm that Utah and Alaska have a low rate of people over
age 65. Utah has the highest birth rate in the U.S., and Alaska ranks high as well.
It is also possible to display biplots, which are simultaneous representations of
variables and observations in the PCA space.
A 3D visualization on the first three axes can be generated with XLSTAT-3DPlot.
2. Big guns of BI
Note on the usage of Principal Component Analysis
Principal component analysis is often performed before a regression, to avoid using correlated variables,
or before clustering the data, to have a better overview of the variables. The number of clusters might
sometimes be a simple guess based on the maps. The above demographic data have also been used
in the tutorial on hierarchical clustering. The ">65 pop" variable has been removed as its inclusion would
double the weight of the age variables in the analysis.
2. Big guns of BI
Going further
1. Adding supplementary variables to the PCA
It is possible to add supplementary variables to the PCA after it has been computed. This may help
improve the quality of the interpretation. In XLSTAT, those variables can be selected under the Suppl. Data tab
of the PCA dialog box. Supplementary variables can be divided into two types:
• Qualitative supplementary variables: they let you color observations on the map according to
the category they belong to. In this tutorial's example, we could have added a column defining whether
a state is mostly Republican or mostly Democrat.
• Quantitative supplementary variables: these variables can be added to see how they correlate
with the group of variables that have been used to build the PCA. In the case where PCA is
performed before a regression, the explanatory variables can be used to construct the PCA while
the dependent variable can be added as a supplementary variable. This may help to roughly
detect which explanatory variables could have the strongest effects on the dependent variable.
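A sketch of the quantitative case, assuming synthetic data: the PCA is built on the explanatory variables only, and the dependent variable is then positioned by its correlation with each factor. Variable names and coefficients are hypothetical, and this mirrors the idea rather than XLSTAT's exact output.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 120
x1 = rng.normal(size=n)
x2 = x1 + 0.3 * rng.normal(size=n)
x3 = rng.normal(size=n)
explanatory = np.column_stack([x1, x2, x3])        # active variables that build the PCA
dependent = 2.0 * x1 + 0.5 * rng.normal(size=n)    # supplementary: not used to build the axes

Z = (explanatory - explanatory.mean(axis=0)) / explanatory.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(explanatory, rowvar=False))
order = np.argsort(eigvals)[::-1]
scores = Z @ eigvecs[:, order]                     # factor scores from active variables only

# Coordinates of the supplementary variable: its correlation with each factor.
suppl_coords = np.array([np.corrcoef(dependent, scores[:, k])[0, 1]
                         for k in range(scores.shape[1])])
print(np.round(suppl_coords, 3))
```

The dependent variable lines up with the first factor, the one driven by x1 and x2, which hints at which explanatory variables are likely to matter in the later regression.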
2. Big guns of BI
(v) Multidimensional Analytics - Principal Component Analysis (PCA) - XLSTAT
NEXT MODULE: DATA VISUALIZATION (Module 5)
