PROGRAM
Phuong Thao
Profile
Thao is currently a Business Intelligence/Business Analytics trainer. She was previously Business Analytics
Manager at SIFT Analytics Group. She specializes in analytics project management and regularly
uses Business Intelligence and Business Analytics tools to review and understand customer data,
acquire subject matter expertise, communicate model insights, and present business performance
analysis to C-level managers in the finance and retail industries. Thao is also a trainer and speaker
at analytics events and training programs run by SIFT and its partners. With more than half a decade of
successful experience, Thao has used a broad range of technologies, analytical techniques, and
methodologies to analyze data from various sources, provide insights,
and support business decision making.
@PhuongThaoAnalytics #PhuongThaoAnalytics
1 Business Intelligence Introduction
2 Business Statistics
3 Descriptive Analytics
4 Diagnostics Analytics
Diagnostics Analytics lessons:
01 Fundamentals
02 Qualitative Diagnostics Analytics
03 Quantitative Diagnostics Analytics
There is an orientation or attitude that is found in the best problem solvers that reflects an active openness
to new ideas and data, and a suspicion of standard or conventional answers. Tetlock describes it well in
profiling the super-forecasters.
Lesson 1: Diagnostics Analytics Fundamentals
1. Introduction
Questions
Start your analysis with summary statistics, rules of thumb (Descriptive Analytics), and heuristics to get a
feel for the data and the solution space. Before you dive into giant data sets, machine learning, Monte
Carlo simulations, or other big guns, we believe it is imperative to explore the data, learn its quality,
understand the magnitudes and direction of key relationships, and assess whether you are trying to
understand drivers (Diagnostics Analytics).
Sophisticated analysis has its place, but it is our experience that one‐day answers, supported by good
logic and simple heuristics, are often sufficient to close the book on many problems, allowing you to
move on to the more difficult ones.
WHAT IS
DIAGNOSTICS ANALYTICS?
WHAT IS THE USE OF
DIAGNOSTICS ANALYTICS?
TYPES OF
DIAGNOSTICS ANALYTICS
Qualitative: Understanding the Big Picture
➢ Logic Trees
  • Factor/Lever/Component
  • Inductive Logic Tree
  • Deductive Logic Tree
  • Hypothesis Tree
  • Decision Tree
➢ Question‐Based Problem Solving
  • Root Cause
    o Sherlock Holmes
    o 5 Whys
    o Fishbone
➢ Cleaving frames
  • Price/Volume
  • Collaborate/Compete
  • Principal/Agent
  • Asset/Options
Quantitative: Heuristics
➢ Occam’s Razor
➢ Order of Magnitude cuts
➢ Pareto
➢ Rule of 72
➢ S-curve
➢ Expected value
➢ Bayesian Thinking
➢ Reasoning by Analogy
➢ Break-even Point
➢ Marginal Analysis
➢ Distribution of outcomes
Quantitative: Big guns of BI
➢ Relationship
➢ What-If, Scenario
➢ Sentiment Analytics
➢ Cohort Analytics
➢ Correlogram Analytics
➢ Multidimensional Analytics
➢ Visual Analytics
We find there is great clarity in stating what you know about your problem at any point in the process. It helps bed
down what understandings are emerging, and what unknowns still stand between us and the answers. We call this a
one‐day answer: it conveys our current best analysis of the situation, complications or insightful observations,
and our best guess at the solution, as we iterate between our evolving workplans and our analysis.
One‐day answers help sharpen our hypotheses and make our analysis focused and efficient.
Lesson 2: Qualitative Diagnostics Analytics
1. Introduction
The Seven-Steps vs Design Thinking of Analytics Problem Solving
There is an adage that says “a well‐defined problem is a
problem half solved”; it's worth the investment of time
upfront.
Lesson 2: Qualitative Diagnostics Analytics
2. Understanding the Big Picture
(ii) Disaggregate – Logic Trees
It is worth stopping here for a moment to introduce an important principle in logic tree construction, the
concept of MECE. MECE stands for “mutually exclusive, collectively exhaustive”. It's a mouthful, but it is a
really useful concept: no two branches should overlap, and together the branches should cover the whole
problem. Because this tree confuses or overlaps some of its branches, it isn't MECE.
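The MECE test can be checked mechanically once the items under each branch are listed. The sketch below is only an illustration: the function names and the revenue-channel example are hypothetical and do not come from the deck.

```python
from collections import Counter

def check_mece(universe, branches):
    """Return (overlaps, gaps) for a proposed split of `universe` into `branches`.

    overlaps: items placed in more than one branch (not mutually exclusive)
    gaps:     items of the universe covered by no branch (not collectively exhaustive)
    """
    counts = Counter(item for members in branches.values() for item in set(members))
    overlaps = {item for item, n in counts.items() if n > 1}
    gaps = set(universe) - set(counts)
    return overlaps, gaps

def is_mece(universe, branches):
    overlaps, gaps = check_mece(universe, branches)
    return not overlaps and not gaps

# Hypothetical example: splitting a retailer's sales channels.
channels = {"store", "web", "mobile app", "wholesale"}
good_split = {"offline": ["store", "wholesale"], "online": ["web", "mobile app"]}
bad_split = {"digital": ["web", "mobile app"], "mobile": ["mobile app"], "stores": ["store"]}

print(is_mece(channels, good_split))
print(check_mece(channels, bad_split))
```

In the second split, "mobile app" sits in two branches and "wholesale" sits in none, so the tree fails both halves of the test.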
Components or factors are just the most obvious elements that make up a problem, like the bricks and
mortar of our earlier brick wall example. You can usually find enough information for a logical first
disaggregation with a small amount of Internet research and a team brainstorming session.
Deductive logic trees take their name from the process of logical
deduction. Deductive reasoning is sometimes called top–down
reasoning.
Lesson 2: Qualitative Diagnostics Analytics
2. Understanding the Big Picture
(ii) Disaggregate – Drill-down vs Drill-up
As in the previous Fahasa Bookstore example, Fahasa needs to place an order for next year’s
calendar. We continue to assume that the calendars sell for $10 and customer demand for the
calendars at this price is triangularly distributed with minimum value, most likely value, and
maximum value equal to 100, 175, and 300. However, there are now two other sources of
uncertainty. First, the maximum number of calendars Fahasa’s supplier can supply is uncertain
and is modeled with a triangular distribution. Its parameters are 125 (minimum), 200 (most
likely), and 250 (maximum). Once Fahasa places an order, the supplier will charge $7.50 per
calendar if he can supply the entire Fahasa order. Otherwise, he will charge only $7.25 per
calendar. Second, unsold calendars can no longer be returned to the supplier for a refund.
Instead, Fahasa will put them on sale for $5 apiece after January 1. At that price, Fahasa
believes the demand for leftover calendars is triangularly distributed with parameters 0, 50, and
75. Any calendars still left over, say, after March 1, will be thrown away.
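One way to explore this example is a small Monte Carlo simulation. The sketch below is a minimal illustration under stated assumptions, not the course's own model: quantities are treated as continuous, Fahasa is assumed to pay only for the calendars it actually receives, and the function names, trial count, and candidate order quantity are hypothetical.

```python
import random

PRICE = 10.0         # regular price per calendar
SALE_PRICE = 5.0     # price after January 1
COST_FULL = 7.50     # unit cost if the supplier can fill the whole order
COST_PARTIAL = 7.25  # unit cost if the supplier can fill only part of it

def profit(order_qty, demand, capacity, sale_demand):
    """Profit for one scenario (assumes payment only for calendars received)."""
    received = min(order_qty, capacity)
    unit_cost = COST_FULL if capacity >= order_qty else COST_PARTIAL
    regular_sales = min(demand, received)
    leftover = received - regular_sales
    sale_sales = min(sale_demand, leftover)  # the rest is thrown away
    return PRICE * regular_sales + SALE_PRICE * sale_sales - unit_cost * received

def simulate(order_qty, n_trials=10_000, seed=0):
    """Average profit over random draws of the three triangular inputs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        demand = rng.triangular(100, 300, 175)    # (low, high, mode)
        capacity = rng.triangular(125, 250, 200)
        sale_demand = rng.triangular(0, 75, 50)
        total += profit(order_qty, demand, capacity, sale_demand)
    return total / n_trials

print(round(simulate(order_qty=200), 2))
```

Running `simulate` for several order quantities gives a first view of which order size looks most attractive before building a fuller model.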
Lesson 2: Qualitative Diagnostics Analytics
2. Understanding the Big Picture - Question‐Based Problem Solving
Root Cause Analysis
Lesson 2: Qualitative Diagnostics Analytics
2. Understanding the Big Picture – Cleaving Frames
The kinds of elements here often include assumptions about market share, new product entry,
rate of adoption, and price and income elasticities.
Lesson 2: Qualitative Diagnostics Analytics
2. Understanding the Big Picture
Prioritization
Before we start to invest significant time and effort into work planning and analysis, we have to prune our
trees.
Lesson 3: Quantitative Diagnostics Analytics
1. Heuristics
The oldest of these is definitely Occam's Razor—favor the simplest solution that fits the facts—which
originated in the fourteenth century. It tells us to select the hypothesis that has the fewest assumptions.
One way of seeing why this makes sense is a simple math example: If you have four assumptions that are
independent of each other, with an 80% separate chance of being correct, the probability that all four
will be correct is just over 40%. With two assumptions and the same probabilities, it is 64%. For many
problems the fewer the assumptions you have the better. Practically speaking, this means avoiding
complex, indirect, or inferential explanations, at least as our starting point. Related to Occam's Razor are
one‐reason decision heuristics, including reasoning by elimination and a fortiori reasoning, where you
eliminate alternatives that are less attractive. The important reminder is not to get committed to a simple
answer with few assumptions when the facts and evidence are pointing to a more nuanced or complex
answer.
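The arithmetic behind the four-assumption example can be reproduced in a few lines. This is a minimal sketch that assumes the assumptions are independent, as the text states; the function name is hypothetical.

```python
def joint_probability(p_each, n_assumptions):
    """Probability that all of n independent assumptions hold at once."""
    return p_each ** n_assumptions

for n in (1, 2, 4, 8):
    print(n, round(joint_probability(0.8, n), 4))
```

Each additional assumption multiplies the chance of being right by 0.8, which is why the hypothesis with fewer assumptions is the safer starting point.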
Lesson 3: Quantitative Diagnostics Analytics
1. Heuristics
(i) Occam’s Razor
Suppose you hear a vase break. You go to the living room and find Jimmy (a little boy) next to the broken vase.
This conversation could, in principle, go on forever. Every objection you raise can be countered with
another hypothesis.
Before that happens, you will probably tell Jimmy to stop with these explanations and draw the much simpler
conclusion that he broke the vase. The cat explanation relies on too many conditions, which makes it
very unlikely. The more conditions that must be met for something to happen, the less likely the theory is.
If we manage to deduce a simpler version of our theory, we can get rid of all the statistical noise of our
experiments. Simpler models are generally better at predicting new and yet unobserved events.
Lesson 3: Quantitative Diagnostics Analytics
1. Heuristics
(ii) Order of magnitude
Order of magnitude analysis is used to prioritize team efforts by estimating the size of different levers. In
business problems, we typically calculate the value of a 10% improvement in price, cost, or volume to
determine which is more important to focus on (assuming, of course, that each is similarly difficult or easy
to change). It applies to analyzing social issues as well. Doing an order of magnitude analysis should
provide a minimum, most likely, and maximum estimate, not simply the maximum.
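A quick way to size the levers is to recompute profit with each one improved by 10%. The sketch below uses hypothetical numbers (price 10, unit cost 7, volume 1,000) and assumes all costs are variable; it is an illustration, not a full profit model.

```python
def profit(price, unit_cost, volume):
    return (price - unit_cost) * volume

def lever_gains(price, unit_cost, volume, change=0.10):
    """Profit gain from improving each lever by `change`, holding the others fixed."""
    base = profit(price, unit_cost, volume)
    return {
        "price": profit(price * (1 + change), unit_cost, volume) - base,
        "cost": profit(price, unit_cost * (1 - change), volume) - base,
        "volume": profit(price, unit_cost, volume * (1 + change)) - base,
    }

# Hypothetical business: these inputs are assumptions for illustration only.
gains = lever_gains(price=10.0, unit_cost=7.0, volume=1000)
ranking = sorted(gains, key=gains.get, reverse=True)
print(ranking)
```

Rerunning the function with minimum, most likely, and maximum inputs gives the range of estimates the text asks for, rather than a single figure.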
Lesson 3: Quantitative Diagnostics Analytics
1. Heuristics
(iii) Pareto
Efficient analysis is often helped by the 80:20 rule, sometimes called the Pareto Principle after the Italian
economist Vilfredo Pareto, who first noticed this relationship. It describes the common phenomenon that
80% of outcomes come from 20% of causes. If you plot percent of consumption of a product on the Y‐axis
and percent of consumers on the X‐axis you will often see that 20% of consumers account for 80% of sales
of a product or service. The point of doing 80:20 analyses is again to focus your analytical effort on the
most important factors. Many business and social environments feature market structures where 80:20 or
something close to that ratio is the norm, so it's a handy device to use. The 80:20 rule may also apply in
complex system settings.
Lesson 3: Quantitative Diagnostics Analytics
1. Heuristics
(iv) Compound growth
Compound growth is key to understanding how wealth builds, how enterprises scale quickly, and how
some populations grow. Warren Buffett said: “My wealth has come from a combination of living in America,
some lucky genes, and compound interest.” A really quick way to estimate compounding effects is to use
the Rule of 72. The rule of 72 allows you to estimate how long it takes for an amount to double given its
growth rate by dividing 72 by the rate of growth. So, if the growth rate is 5% an amount will double in about
14 years (72/5 = 14.4 years). If the growth rate is 15%, doubling occurs in four to five years.
Where do errors occur with the rule of 72? When there is a change in the growth rate, which of course is
often the case over longer periods. This makes sense, as few things continue to compound forever (try the
old trick of putting a grain of rice on the first square of a chessboard and doubling the number on each
successive square).
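The Rule of 72 can be compared with the exact doubling time for a constant growth rate, which is ln(2) / ln(1 + r). The sketch below is a minimal check of the two examples in the text; the function names are hypothetical.

```python
import math

def doubling_time_rule_of_72(rate_percent):
    """Approximate years to double at a constant annual growth rate."""
    return 72 / rate_percent

def doubling_time_exact(rate_percent):
    """Exact years to double with annual compounding at a constant rate."""
    return math.log(2) / math.log(1 + rate_percent / 100)

for rate in (5, 15):
    print(rate, doubling_time_rule_of_72(rate), round(doubling_time_exact(rate), 2))
```

For both rates the shortcut lands within a few months of the exact answer, which is close enough for a first-cut estimate.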
Lesson 3: Quantitative Diagnostics Analytics
1. Heuristics
(v) S‐curve
A useful heuristic if you are involved in estimating the adoption rate for a new innovation is the S‐curve,
which shows a common pattern of sales with a new product or a new market. S‐curves are drawn with the
percent of full adoption potential on the Y‐axis and the years since adoption on the X‐axis. The shape of
the S will vary a lot depending on the reference class you select and particular reasons for faster or slower
take up in your case. Charles led a successful start‐up in the early days of Internet adoption. At that time,
many forecasters overestimated the impact of Internet penetration in the short term (think of Webvan or
Pets.com), but underestimated adoption in the longer term of 10 to 15 years. It looks in hindsight very
clearly to be a classic S‐curve. In 1995 fewer than 10% of Americans had access to the Internet. By 2014 it
reached 87%. The S‐curve can take many specific profiles. Like any heuristic, you don't want to apply this
rigidly or blindly, but rather use it as a frame for scoping a problem. The challenge is to get behind
sweeping statements like “the world has speeded up,” to understand why a particular technology will
have adoption at a certain rate.
Lesson 3: Quantitative Diagnostics Analytics
1. Heuristics
(vi) Expected value
Expected value is simply the value of an outcome multiplied by its probability of occurring. That is
called a single point expected value, but it is usually more useful (depending on the shape of the
distribution) to take the sum of all probabilities of possible outcomes multiplied by their values.
Expected value is a powerful first‐cut analytic tool to set priorities and reach conclusions on whether to
take a bet in an uncertain environment.
But be careful: Single point expected value calculations are most useful when the underlying distribution
is normal rather than skewed or long tailed. You check that by looking at the range, and whether the
median and mean of the distribution are very different from each other.
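The full expected value is the probability-weighted sum over all outcomes. The sketch below is a minimal illustration; the function name and the two example bets are hypothetical.

```python
def expected_value(outcomes):
    """Sum of probability * value over (probability, value) pairs."""
    total_probability = sum(p for p, _ in outcomes)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * v for p, v in outcomes)

# Hypothetical bets: a balanced one and a long-tailed one.
balanced_bet = [(0.5, 100.0), (0.5, -40.0)]
long_tail_bet = [(0.9, 0.0), (0.1, 1000.0)]

print(expected_value(balanced_bet))
print(expected_value(long_tail_bet))
```

The long-tailed bet has the higher expected value even though it pays nothing nine times out of ten, which is why the shape of the distribution should be checked before relying on a single number.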
Lesson 3: Quantitative Diagnostics Analytics
1. Heuristics
(vii) Bayesian thinking
Bayesian thinking is really about conditional probability, which is the probability of an event given
another event took place which also has a probability, called a prior probability.
As a simple example, look at the probability of it raining given that it is cloudy (the prior probability),
versus the probability of it raining if it is currently sunny. Rain can happen in either case, but is more likely
when the prior condition is cloudy. Bayesian analysis can be challenging to employ formally (as a
calculation), because it is difficult to precisely estimate prior probabilities.
But we often use Bayesian thinking when we think conditional probabilities are at work in the problem.
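The rain example can be written out with Bayes' rule, P(rain | cloudy) = P(cloudy | rain) × P(rain) / P(cloudy). The sketch below is a minimal illustration; all three input probabilities are invented for the example and are not from the deck.

```python
def posterior(prior, likelihood, likelihood_if_not):
    """P(event | evidence) from P(event), P(evidence | event), P(evidence | not event)."""
    evidence = likelihood * prior + likelihood_if_not * (1 - prior)
    return likelihood * prior / evidence

# Assumed inputs: it rains on 20% of days, 90% of rainy days are cloudy,
# and 30% of dry days are cloudy.
p_rain_given_cloudy = posterior(prior=0.2, likelihood=0.9, likelihood_if_not=0.3)
print(round(p_rain_given_cloudy, 3))
```

Seeing clouds roughly doubles the estimated chance of rain under these assumed inputs, and the result moves with the prior, which is why a poor prior estimate weakens a formal calculation.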
Lesson 3: Quantitative Diagnostics Analytics
1. Heuristics
(viii) Reasoning by analogy
Reasoning by analogy is an important heuristic for quick problem solving. An analogy is when you have
seen a particular problem structure and solution before that you think may apply to your current problem.
Analogies are powerful when you have the right reference class (that is, have correctly identified the
structure type), but dangerous when you don’t. To check this, we typically line up all the assumptions
that underpin a reference class and test the current case for fit with each.
Lesson 3: Quantitative Diagnostics Analytics
1. Heuristics
(ix) Break‐even point
Break‐even point. Every start‐up company Charles and Rob see likes to talk about their cash runway—the
months before cash runs out and they need a new equity infusion. Not enough of them really know their
break‐even point, the level of sales where revenue covers cash costs. It's a simple bit of arithmetic to
calculate, but requires knowledge of marginal and fixed costs, and particularly how these change with
increased sales volume. The break‐even point in units equals fixed costs divided by (unit price less unit
variable costs); multiply by the unit price to express it in sales dollars. Typically, the unit price is known.
You can fairly quickly calculate the costs associated with
each sale, the variable costs. The tricky part is how fixed costs will behave as you scale a business. You
may face what are called step‐fixed costs, where to double volume involves significant investment in
machinery, IT infrastructure, or sales channels.
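The break-even arithmetic is short enough to sketch directly. The figures below are hypothetical, and the sketch assumes fixed costs stay constant over the volume range, so it does not model the step-fixed costs the text warns about.

```python
def break_even_units(fixed_costs, unit_price, unit_variable_cost):
    """Units at which revenue covers fixed plus variable costs."""
    margin = unit_price - unit_variable_cost
    if margin <= 0:
        raise ValueError("unit price must exceed unit variable cost")
    return fixed_costs / margin

def break_even_sales(fixed_costs, unit_price, unit_variable_cost):
    """The same break-even point expressed in sales dollars."""
    return break_even_units(fixed_costs, unit_price, unit_variable_cost) * unit_price

# Hypothetical start-up: 120,000 fixed costs, price 50, variable cost 30 per unit.
print(break_even_units(120_000, 50, 30))
print(break_even_sales(120_000, 50, 30))
```

A step-fixed cost can be approximated by rerunning the calculation with the higher fixed-cost figure for the next volume band.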
Lesson 3: Quantitative Diagnostics Analytics
1. Heuristics
(x) Marginal analysis
Marginal analysis is a related concept that is useful when you are thinking about the economics of
producing more, consuming more, or investing more in an environment of scarce resources. Rather than
just looking at the total costs and benefits, marginal analysis involves examining the cost or benefit of the
next unit. In production problems with fixed costs of
machinery and plant, marginal costs (again, the cost of producing one more unit) often fall very
quickly—favoring more production—up to the point at which incremental machinery is needed. We add
units until the marginal benefit of a unit sale is equal to the marginal cost.
Marginal cost is the increase in cost (∆C) when
output increases by one unit (∆Y).
Lesson 3: Quantitative Diagnostics Analytics
1. Heuristics
(xi) Distribution of outcomes
The personnel department of MTP, a large communications company, is reconsidering its hiring policy. Each
applicant for a job must take a standard exam, and the hire or no-hire decision depends at least in part on
the result of the exam. The scores of all applicants have been examined closely. They are approximately
normally distributed with mean 525 and standard deviation 55.
The current hiring policy occurs in two phases. The first phase separates all applicants into three categories:
automatic accepts, automatic rejects, and maybes. The automatic accepts are those whose test scores
are 600 or above. The automatic rejects are those whose test scores are 425 or below. All other applicants
(the maybes) are passed on to a second phase where their previous job experience, special talents, and
other factors are used as hiring criteria. The personnel manager at MTP wants to calculate the percentage
of applicants who are automatic accepts or rejects, given the current standards. She also wants to know
how to change the standards to automatically reject 10% of all applicants and automatically accept 15% of
all applicants.
Objective: To determine test scores that can be used to accept or reject job applicants at MTP.
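Both of the personnel manager's questions are normal-distribution calculations. The sketch below uses Python's standard-library `statistics.NormalDist` with the mean and standard deviation given above; the function names are hypothetical and this is one possible way to do the calculation, not necessarily the tool used in the course.

```python
from statistics import NormalDist

scores = NormalDist(mu=525, sigma=55)  # exam scores, approximately normal

def current_policy_shares(accept_at=600, reject_at=425):
    """Share of applicants automatically accepted and rejected today."""
    share_accept = 1 - scores.cdf(accept_at)
    share_reject = scores.cdf(reject_at)
    return share_accept, share_reject

def cutoffs_for_targets(reject_share=0.10, accept_share=0.15):
    """Score cutoffs that reject and accept the requested shares."""
    reject_cutoff = scores.inv_cdf(reject_share)
    accept_cutoff = scores.inv_cdf(1 - accept_share)
    return reject_cutoff, accept_cutoff

share_accept, share_reject = current_policy_shares()
reject_cutoff, accept_cutoff = cutoffs_for_targets()
print(f"automatic accepts: {share_accept:.1%}, automatic rejects: {share_reject:.1%}")
print(f"new reject cutoff: {reject_cutoff:.0f}, new accept cutoff: {accept_cutoff:.0f}")
```

The first function answers the current-policy question with the cumulative distribution; the second inverts it to find the scores that match the target percentages.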
Descriptive Analytics
2. Descriptive Introduction – HR Analytics
Lesson 3: Quantitative Diagnostics Analytics
1. Heuristics
Conclusion
• Start all analytic work with simple summary statistics and heuristics that help you see the size and
shape of your problem levers.
• Don't gather huge data sets or build complicated models before you have done this scoping
reconnaissance with rules of thumb.
• Be careful to know the limitations of heuristics, particularly the potential for reinforcing availability
and confirmation biases.
• Question‐based, rough‐cut problem solving can help you uncover powerful algorithms for
making good decisions and direct your empirical work (when required).
• Root cause and 5‐Whys analytics can help you push through proximate drivers to fundamental
causes in a variety of problems, and not just limited to production and operations environments.
Lesson 3: Quantitative Diagnostics Analytics
2. Big guns of BI
We introduced a set of first‐cut heuristics and root cause thinking to make the initial analysis phase of
problem solving simpler and faster. We showed that you can often get good‐enough analytic results
quickly for many types of problems with little mathematics or model building.
But what should you do when faced with a complex problem that really does require a robustly
quantified solution? When is it time to call in the big guns—Bayesian statistics, regression analysis, Monte
Carlo simulation, randomized controlled experiments, machine learning, game theory, or
crowd‐sourced solutions?
This is certainly an arsenal of analytic weaponry that many of us would find daunting to consider
employing. Even though your team may not have the expertise to use these more complex problem
solving tools, it is important for the workforce of today to have an understanding of how they can be
applied to challenging problems. In some cases you may need to draw on outside experts, in other
instances you can learn to master these techniques yourself.
Lesson 3: Quantitative Diagnostics Analytics
2. Big guns of BI
(i) Relationship – DATA MODEL
Data Model
Lesson 3: Quantitative Diagnostics Analytics
2. Big guns of BI
(i) Relationship – DATA MODEL (M Language)
Lesson 3: Quantitative Diagnostics Analytics
2. Big guns of BI
(i) Relationship – DATA MODEL
Why should you lay out your Star Schema using the Collie Methodology?
There are a few reasons you should visually lay out your data using the Collie Methodology.
1. Firstly, it gives you a visual clue to which ones are the “Lookup” tables. They are the ones above
that you have to “look up” to see.
2. Also note that in the old world of traditional Excel, you would probably have
written VLOOKUP formulae to bring the data from these Lookup tables into your single Data table
prior to creating a pivot table.
3. Finally, filters only propagate in 1 direction in the data model, and that is from the 1 side of the
relationship to the many side of the relationship. The Lookup tables are on the 1 side, so
the filters always flow “down hill”. They do not and cannot flow “up hill” automatically (but
note this can be done with advanced DAX and also by editing a relationship in Power Pivot to make
it bi-directional). So this layout gives you a visual clue to the direction in which the filters will flow.
One to One Relationships
The first relationship (shown as 1) is a 1 to many relationship between
the Customer table (Lookup table) and the Sales table (Data
table). The Customer Socio Economic Data table is joined to the
Customer table via a 1 to 1 relationship (shown as 2 above). If there is
a benefit (to the user of reports) in splitting this Socio Economic data
into a separate table, then of course you should do so. If there is no
benefit, I recommend you combine all the data from the Customer Socio
Economic Data table into the Customer table using Power Query on
load.
Every relationship has a “cost” in that it will have some effect on
performance. The performance impact may not be noticeable for
simple models but may become an issue with very complex models.
If you only remember 1 thing from this article, then please let it be
this: Don’t automatically accept the table structure coming from your
source data. You are now a data modeller and you need to make
decisions on the best way to load your data. Your source system is
probably not optimised for reporting (unless it is a reporting datamart),
so please don’t assume that what you have got is what you need.
The one to many relationship is the foundation of Power Pivot. In the example above (from Adventure Works in
Power BI Desktop), the Customers table is on the 1 side of the relationship and the Sales table is on the many side
of the relationship. These tables are joined using a common field/column called “CustomerKey”. Customer Key
(aka customer number) is a code that uniquely identifies each customer. There can be no duplicates of the
customer key in the customer table. Conversely the customer can purchase as many times as needed and
hence the customer key can appear in the Sales table as many times as necessary. This is where the name
“one to many” comes from – the customer key occurs once and only once in the Customers table but can
appear many times in the Sales table.
Tables on the one side of the relationship are called Dimension tables (I call them Lookup tables) and the tables
on the many side of the relationship are called Fact tables (I call them Data tables).
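The one-to-many rule can be checked before loading data: the key must be unique in the Lookup table, and every key in the Data table should exist in the Lookup table. The sketch below is a plain-Python illustration with made-up rows; only the column name `CustomerKey` comes from the text, and the function name is hypothetical.

```python
from collections import Counter

def validate_one_to_many(lookup_rows, data_rows, key):
    """Return (duplicate keys in the lookup table, data keys missing from it)."""
    counts = Counter(row[key] for row in lookup_rows)
    duplicates = sorted(k for k, n in counts.items() if n > 1)
    orphans = sorted({row[key] for row in data_rows} - set(counts))
    return duplicates, orphans

# Made-up sample rows for illustration.
customers = [
    {"CustomerKey": 1, "Name": "An"},
    {"CustomerKey": 2, "Name": "Binh"},
    {"CustomerKey": 2, "Name": "Binh (duplicate)"},
]
sales = [
    {"CustomerKey": 1, "Amount": 120},
    {"CustomerKey": 1, "Amount": 80},
    {"CustomerKey": 3, "Amount": 45},
]

duplicates, orphans = validate_one_to_many(customers, sales, "CustomerKey")
print(duplicates, orphans)
```

Key 1 appearing twice in the sales rows is fine because that table is the many side; key 2 appearing twice in the customer rows is what breaks the relationship.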
Lookup Tables
You should have 1 Lookup table for each “object” that you need for reporting purposes. In the sample data above
from Adventure Works, these objects are “Customers”, “Products”, “Territories” and “Time” (i.e. “Calendar”). A key feature
of a Lookup table is that it contains 1 and only 1 row for every individual item in the table and as many columns as you need to
describe the object. For example, there is only 1 row for each unique Customer in the Customers table. The Customers table has
lots of columns describing each customer, such as customer number, customer name, and customer address, but there is
only 1 row for each customer. Each one is unique based on the customer number; no duplicates are allowed.
Data Tables
The Data table in the Adventure Works data is the Sales table (it is possible to have many data tables in your data model,
but there is only one in this example). This Data table contains lots of rows (60,000+ in this case) and contains all of the
transactional records of sales that occurred over several years. Importantly, the Data table contains one “key” column
for each Lookup table needed for reporting, matching the “key” in that Lookup table. So in this sample data, there is a date, customer
number, product number, and territory key so that the Data table can be logically joined to the Lookup tables.
The ideal shape of Data tables is to have very few columns but as many rows as needed to bring in all the data
records. These Data tables normally have lots of rows (sometimes in the 10s or even 100s of millions).
Lesson 4: Dashboard
5. Successful Business Intelligence
Excel Dashboard Formatting
Hide from Client tools (on the Many side): This is a rule of building a Dimensional Model.
That way the user can’t accidentally drag the field into the pivot table and get into a relationship problem. This is
actually a really good thing.
Lesson 3: Quantitative Diagnostics Analytics
2. Big guns of BI
(i) Relationship – Hide in Report View/ Hide from Client tools
Hide all unnecessary columns from your client tools. By doing so, users cannot aggregate columns in the
wrong way. You will need to keep any columns by which you want to slice or filter your data. Most often
these will be label-type columns. Once you have created measures that carry out all the calculations you
need, there is no need to keep the underlying columns visible.
Lesson 3: Quantitative Diagnostics Analytics
2. Big guns of BI
(i) Relationship – DATA MODEL (Merge)
INVITED    ATTENDED
Calvin     Ellie
Ellie      Flo
Irrfan     Irrfan
Jada       Jada
Sig        Keith
Teena      Rosa
Trout      Sig
           Teena
[Figure: Venn diagram of the two lists, with INVITED as the LEFT table and ATTENDED as the RIGHT table]
INVITED only (LEFT): Calvin, Trout
Invited & Attended (both): Ellie, Irrfan, Jada, Sig, Teena
ATTENDED only (RIGHT): Flo, Keith, Rosa
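The INVITED/ATTENDED lists can be used to sketch what each kind of merge returns. The six join names below are the ones merge tools such as Power Query commonly offer; the slides do not name them, so treat the labels as an assumption. Only the names are compared here, so each result is the set of people that would appear in the merged table.

```python
invited = ["Calvin", "Ellie", "Irrfan", "Jada", "Sig", "Teena", "Trout"]         # LEFT
attended = ["Ellie", "Flo", "Irrfan", "Jada", "Keith", "Rosa", "Sig", "Teena"]   # RIGHT

left, right = set(invited), set(attended)

merges = {
    "inner": sorted(left & right),        # in both lists
    "left outer": sorted(left),           # everyone invited
    "right outer": sorted(right),         # everyone who attended
    "full outer": sorted(left | right),   # anyone in either list
    "left anti": sorted(left - right),    # invited but did not attend
    "right anti": sorted(right - left),   # attended without an invitation
}

for kind, names in merges.items():
    print(f"{kind:11s} {names}")
```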
Lesson 3: Quantitative Diagnostics Analytics
2. Big guns of BI
(i) Relationship – NAMING CONVENTION
Naming Conventions
It is a good practice to use simple English names for your tables and columns. Avoid using abbreviations
and underscores wherever possible.
Example: Instead of a table called Pricing_History_Sum_Tbl, simply call it Pricing. It is much easier to read
and write your DAX if you keep it simple.
Measures, on the other hand, should always use spaces and not
CamelCaseAsTheyAreHarderToReadThatWay. Also avoid abbreviations unless they are common in your
business setting. Some examples of good Measure names are:
[Total Sales]
[Total Sales YTD]
[Total Sales All Products]
• A common problem: the column and measure names either don’t have spaces or use
underscores instead of spaces.
• You should use human-readable names rather than any kind of technical naming convention, with
spaces where you would expect to have spaces and all vowels present. For example, that means having
names like [Sales Amount] rather than [Sales_Amount] or [SlsAmt]; similarly, prefixes like “Dim” and “Fact”
might make sense to you but won’t mean anything to your users.
• You should use the correct business terminology, the terminology that your users will know and
understand, rather than just making up names that seem appropriate. Your users might not
understand what [Total Sales Value] is if the generally accepted term is [Net Sales Amount].
• The names you use should be consistent across all datasets that contain the same data. That means that
if you have a table called Sales in one dataset, it should be called Sales in every other dataset that you
build from the same data source, not Transactions, FactSales or something else.
Column naming
Rename all columns that will be visible in the
data model using concise, self-describing,
meaningful and user-friendly names.
Design your data model for the users who
build and consume reports, not for developers.
Even if technical professionals are
designing reports, field names are going to
show up as labels and titles.
Lesson 3: Quantitative Diagnostics Analytics
2. Big guns of BI
(i) Relationship – Duplicate vs Reference Tab
Duplicate copies a query with all the applied steps of it as a new query; an exact copy.
Reference will create a new query which has only one step: Getting data from the original query.
Lesson 3: Quantitative Diagnostics Analytics
2. Big guns of BI
(i) Relationship – Search Tab
Search for the key fields used in relationships: that gives us a nice
short list of columns, so we can hide all of our relationship key columns.
We then quickly run through the list and hide all of these.
Lesson 3: Quantitative Diagnostics Analytics
OTHER TIPS
Analyze in Excel
Lesson 3: Quantitative Diagnostics Analytics
OTHER TIPS
Manual Calculation Mode
Lesson 3: Quantitative Diagnostics Analytics
OTHER TIPS
Drill-Up, Drill-Down in Table
Lesson 3: Quantitative Diagnostics Analytics
OTHER TIPS
Creating Visualizations – Running Total Chart
Create Hierarchy
Lesson 3: Quantitative Diagnostics Analytics
OTHER TIPS
Running Total – Hierarchy (Grouping Power Table)
Lesson 3: Quantitative Diagnostics Analytics
OTHER TIPS
Q&A
Lesson 1: Diagnostics Analytics Fundamentals
2. Types of Diagnostics Analytics
Qualitative
Understanding the Big Picture
➢ Logic Trees
• Factor/Lever/Component
• Inductive Logic Tree
• Deductive Logic Tree
• Hypothesis Tree
• Decision Tree
➢ Question‐Based Problem Solving
• Root Cause
o Sherlock Holmes
o 5 Whys
o Fishbone
➢ Cleaving frames
• Price/Volume
• Collaborate/Compete
• Principal/Agent
• Asset/Options
Quantitative
Heuristics
➢ Occam’s Razor
➢ Order of Magnitude cuts
➢ Pareto
➢ Rule of 72
➢ S-curve
➢ Expected Curve
➢ Bayesian Thinking
➢ Reasoning by Analogy
➢ Break-even Point
➢ Marginal Analysis
➢ Distribution of outcomes
Big guns of BI
➢ Relationship
➢ What-If, Scenario
➢ Sentiment Analytics
➢ Cohort Analytics
➢ Correlogram Analytics
➢ Multidimensional Analytics
➢ Visual Analytics
Lesson 3: Quantitative Diagnostics Analytics
2. Big guns of BI
(i) Relationship
Aggregation Analysis
This methodology simply means calculating a value across a group or
dimension and is commonly used in data analysis. For example, you may want
to aggregate sales data for a salesperson by month - adding all of the sales
closed for each month. Then, you may want to aggregate across dimensions,
such as sales by month per sales territory. Aggregation is often done in
reporting to be able to "slice and dice" information to help managers make
decisions and view performance.
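As a rough sketch of the aggregation just described, the snippet below sums hypothetical closed deals first by salesperson and month, then by territory and month. The rows and the helper name `aggregate` are invented for illustration.

```python
from collections import defaultdict

# Hypothetical closed sales: (salesperson, territory, month, amount)
deals = [
    ("An", "North", "2020-01", 100),
    ("An", "North", "2020-01", 50),
    ("An", "North", "2020-02", 70),
    ("Binh", "South", "2020-01", 200),
    ("Binh", "South", "2020-02", 30),
]

def aggregate(rows, key):
    """Sum the amount (last field) over the group returned by key(row)."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[-1]
    return dict(totals)

# Sales per salesperson per month.
by_person_month = aggregate(deals, key=lambda r: (r[0], r[2]))
# The same data sliced by another dimension: territory per month.
by_territory_month = aggregate(deals, key=lambda r: (r[1], r[2]))

print(by_person_month)
print(by_territory_month)
```

Changing only the `key` function is the code equivalent of "slicing and dicing" the same data along a different dimension.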
Causation: one event or state is the result of the occurrence of another event or state
Again, context turns out to be critical in interpreting relationships and data.
It should be the first line of defense in avoiding mistakes.
Overfitting trap: if you are struggling to improve a model's accuracy, go back to EDA
(exploratory data analysis: spend significant time exploring and analyzing the data).
Remember that the quality of your inputs decides the quality of your output.
1. Variable Identification
2. Univariate Analysis
3. Bi-variate Analysis
4. Missing values treatment
5. Outlier treatment
6. Variable transformation
7. Variable creation
Let’s take a moment to review the correlation. The score ranges from -1 to 1 and indicates if
there is a strong linear relationship — either in a positive or negative direction. So far so good.
However, there are many non-linear relationships that the score simply won’t detect.
If you are a little bit too well educated you know that the correlation matrix is symmetric.
However, relationships in the real world are rarely symmetric. More often, relationships are
asymmetric.
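A small sketch of both points, using invented data: the Pearson score is 1 for a straight line, 0 for a perfectly deterministic parabola, and identical whichever variable is listed first. The helper `pearson` is written out by hand here so that the example needs no external library.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]     # y depends linearly on x
parabola = [x * x for x in xs]       # y is fully determined by x, but not linearly

r_linear = pearson(xs, linear)
r_parabola = pearson(xs, parabola)

print(r_linear, r_parabola)
# The score is symmetric: swapping the variables changes nothing.
print(pearson(xs, parabola) == pearson(parabola, xs))
```

Knowing x tells you y exactly in the parabola case, while knowing y only narrows x down to two values; the correlation score reports neither fact.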
Categorical & Categorical: To find the relationship between two categorical variables, we can use the following
methods:
• Two-way table: We can start analyzing the relationship by creating a two-way table of count and count%.
The rows represent the categories of one variable and the columns represent the categories of the other
variable. We show the count or count% of observations in each combination of row and column
categories.
• Stacked Column Chart: This method is a more visual form of the two-way table.
=COUNTIFS(Range1, Condition1, Range2, Condition2, …)
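For comparison, the same two-way table can be sketched in Python: each cell is what one COUNTIFS call with two conditions would return. The observations and the variable names (`two_way`, `two_way_pct`) are made up for the example.

```python
from collections import Counter

# Hypothetical observations: (gender, preferred channel)
observations = [
    ("F", "Online"), ("F", "Online"), ("F", "Store"),
    ("M", "Online"), ("M", "Store"), ("M", "Store"),
]

counts = Counter(observations)   # one count per (row category, column category)
total = len(observations)
rows = sorted({r for r, _ in observations})
cols = sorted({c for _, c in observations})

# Two-way table of counts and of count% (share of all observations).
two_way = {r: {c: counts[(r, c)] for c in cols} for r in rows}
two_way_pct = {r: {c: 100 * counts[(r, c)] / total for c in cols} for r in rows}

for r in rows:
    print(r, two_way[r], two_way_pct[r])
```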
Categorical variables, including nominal (where numbers are simply labels) and ordinal, rank order
variables, are described by tabulating their frequencies or probability. If two categorical variables are
associated, the frequencies of values of one will depend on the frequencies of values of the other. Chi
square tests the hypothesized association between two categorical variables and contingency analysis
quantifies their association.
When Conditional Probabilities Differ from Joint
Probabilities, There Is Evidence of Association
Since the chi square components include expected cell counts in the denominator, sparse (with
expected counts less than five) cells inflate chi square. When sparse cells exist, we must either combine
categories or collect more data.
In the Recruiting Stars example, management was most interested in increasing the chances of hiring
Outstanding performers. Since some believed that Outstanding performers were recruited from
programs in the Home State, these categories were preserved. Same Region and Outside Region
program locations were combined. Poor and Average performance categories were combined. We
are left with a 2 x 2 contingency analysis, shown in
With fewer categories, all expected cell counts are now greater than five, e.g.
(row total × column total) / N = 25*25/40 = 15.625
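The expected-count rule can be checked with a short sketch. The observed counts below are invented (they are not the Recruiting Stars data); they were chosen only so that the row and column totals are 25 and 15 out of 40, which reproduces the 25*25/40 calculation for one cell.

```python
# Hypothetical 2 x 2 observed counts
# (rows: program location, columns: performance category)
observed = [[20, 5],
            [5, 10]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under no association: row total * column total / N
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi square: sum of (observed - expected)^2 / expected over all cells
chi_square = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(2) for j in range(2)
)

# Cells with an expected count below five would inflate chi square.
sparse = [e for row in expected for e in row if e < 5]

print(expected)
print(round(chi_square, 4), sparse)
```

Because the expected count sits in the denominator, a cell with a very small expected count can dominate the statistic, which is why sparse categories are combined first.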
Lesson 3: Quantitative Diagnostics Analytics
2. Big guns of BI
(ii) What-If, Scenario
Lesson 3: Quantitative Diagnostics Analytics
2. Big guns of BI
(iii) Sentiment Analytics
Lesson 3: Quantitative Diagnostics Analytics
2. Big guns of BI
(iv) Cohort analytics
➢ What is it?
Cohort analysis is a subset of behavioural analytics which allows you to study the behaviour of a group
over time. The groups or cohorts in this context are aggregations of data points, or relevant stakeholders
in your business, and the data may come from e-commerce platforms, web applications or sales data.
2/ Next you need to define the metrics that will be able to help you answer the questions you’ve identified.
Cohort analysis requires the identification of the specific properties you are going to examine in the group.
For example, gender, date of first or last purchase, level of purchase, etc.
3/ Next you need to define or identify the specific cohorts you are going to assess. This will often require you
to perform attribute contribution in order to find the relevant differences between each group or sub-group
so that you can use that difference to explain the difference in their behaviour.
4/ Finally you perform the cohort analysis. The findings are often then visualised in a graph or table. This
visualisation can help spot the patterns that will then shape decision making.
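The steps above can be sketched with a tiny, invented order log. Here the cohort property is the month of first purchase and the metric is the number of active customers in each later month; both choices, and all names in the code, are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical orders: (customer, order month as "YYYY-MM")
orders = [
    ("c1", "2020-01"), ("c1", "2020-02"), ("c1", "2020-03"),
    ("c2", "2020-01"), ("c2", "2020-03"),
    ("c3", "2020-02"), ("c3", "2020-03"),
]

def month_index(ym):
    """Turn "YYYY-MM" into a running month number so months can be subtracted."""
    year, month = ym.split("-")
    return int(year) * 12 + int(month)

# Define the cohorts: each customer belongs to the month of their first purchase.
first_month = {}
for customer, ym in sorted(orders, key=lambda o: o[1]):
    first_month.setdefault(customer, ym)

# Perform the analysis: cohort -> months since first purchase -> active customers.
active = defaultdict(lambda: defaultdict(set))
for customer, ym in orders:
    cohort = first_month[customer]
    offset = month_index(ym) - month_index(cohort)
    active[cohort][offset].add(customer)

cohort_table = {
    cohort: {offset: len(members) for offset, members in sorted(offsets.items())}
    for cohort, offsets in sorted(active.items())
}

for cohort, row in cohort_table.items():
    print(cohort, row)
```

Each printed row is one line of the usual cohort table; plotting the rows against the month offset gives the retention curves that are typically visualised.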
Lesson 3: Quantitative Diagnostics Analytics
2. Big guns of BI
(v) Correlogram Analytics
The autocorrelation function plot (correlogram) shows a series correlated with itself, lagged
by x time units. So if you take each of the correlation numbers we just calculated and
plot them, you have the autocorrelation function plot, or ACF plot.
Undifferenced series: This series is undifferenced and has a slow decay toward 0 correlation, meaning that
the current values are much more correlated with recent values than with values
further in the past. This suggests that the series is not stationary and will need to be
differenced to reach a stationary series.
First difference: This plot is for the same data set, but now we’ve taken the first
difference. Lag 1 is significant; after lag 1 the correlation is much less,
suggesting a stationary series now.
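A minimal sketch of the idea, assuming a simulated random walk rather than the data set behind the slides: `acf` computes the sample autocorrelation at each lag, and `difference` takes the first difference. For a random walk the differenced series is pure noise, so its autocorrelation should be small at every lag, which is a simpler pattern than the one on the slide.

```python
import random

def acf(series, max_lag):
    """Sample autocorrelation of the series for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - lag] for t in range(lag, n)) / denom
        for lag in range(max_lag + 1)
    ]

def difference(series):
    """First difference: change from one observation to the next."""
    return [b - a for a, b in zip(series, series[1:])]

# Hypothetical non-stationary series: a seeded random walk of 200 steps.
rng = random.Random(0)
walk, level = [], 0.0
for _ in range(200):
    level += rng.gauss(0, 1)
    walk.append(level)

acf_raw = acf(walk, 5)                # decays slowly from 1
acf_diff = acf(difference(walk), 5)   # close to 0 after lag 0

print([round(v, 2) for v in acf_raw])
print([round(v, 2) for v in acf_diff])
```

Plotting each list as bars against the lag number gives the two correlograms being compared.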
Lesson 3: Quantitative Diagnostics Analytics
2. Big guns of BI
(vi) Multidimensional Analytics
Multi-dimensional Visualization
Contour Chart
Lesson 3: Quantitative Diagnostics Analytics
2. Big guns of BI
(vi) Multidimensional Analytics – Contour Chart
Lesson 3: Quantitative Diagnostics Analytics
2. Big guns of BI
(vi) Multidimensional Analytics
The dimensionality of a model is the number of independent or input variables used by the model. In
the artificial intelligence literature, dimension reduction is often referred to as factor selection or feature
extraction.
Practical Considerations
The integration of expert knowledge through a discussion with the data provider (or user) will probably
lead to better results.
Numerical summaries and graphs of the data are very helpful for data reduction
CAT.MEDV has been created by categorizing median value (MEDV) into two categories, high and
low. (There are a couple of aspects of MEDV, the median house value, that bear noting. For one thing, it is
quite low, since it dates from the 1970s. For another, there are a lot of 50s, the top value. It could be that
median values above $50,000 were recorded as $50,000.)
If MEDV ≥ $30,000, CAT.MEDV = 1. If MEDV < $30,000, CAT.MEDV = 0.
CAT.MEDV is identical for ZN=17.5, 90, 95, and 100 (where all neighborhoods have CAT.MEDV=1).
These four categories can then be combined into a single category. Similarly categories ZN=12.5,
25, 28, 30, and 70 can be combined.
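One way to find categories that can be combined is to compute the share of CAT.MEDV = 1 within each ZN category and group the categories whose shares match. The records below are invented for illustration and are not the real housing data; exact matching of shares is also a simplification, since in practice "similar" shares may be enough.

```python
from collections import defaultdict

# Hypothetical (ZN, MEDV in $1000s) records
records = [
    (17.5, 33.0), (90, 50.0), (95, 31.5), (100, 42.0),
    (12.5, 21.0), (12.5, 35.0), (25, 24.0), (25, 30.0),
    (0, 18.0), (0, 22.0),
]

def cat_medv(medv):
    """1 if MEDV >= 30 (i.e. $30,000), else 0."""
    return 1 if medv >= 30 else 0

# Share of CAT.MEDV = 1 within each ZN category.
by_zn = defaultdict(list)
for zn, medv in records:
    by_zn[zn].append(cat_medv(medv))
share_high = {zn: sum(flags) / len(flags) for zn, flags in by_zn.items()}

# ZN categories with the same share behave identically and can be merged.
merged = defaultdict(list)
for zn, share in sorted(share_high.items()):
    merged[share].append(zn)

for share, zns in sorted(merged.items()):
    print(f"share of CAT.MEDV=1: {share:.1f} -> ZN categories {zns}")
```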
Dimensionality reduction is one of the important techniques in machine learning. Feature vectors in
real-world problems can have very high dimensionality, up to several thousand. In addition, the number of data
points is also often very large. Storing and computing directly on such high-dimensional data is difficult
in terms of both storage and computation speed. Reducing the dimensionality of the data is therefore an important step in many
problems. It is also regarded as a data compression method.
Put simply, dimensionality reduction means finding a function that takes an original data point
x ∈ R^D, with D very large, as input and produces a new data point z ∈ R^K with dimensionality K < D.
Principal Component Analysis (PCA) is the simplest of the
dimensionality reduction algorithms based on a linear model. The method rests on the observation that
data are usually not distributed randomly in space but tend to lie close to certain special
lines or surfaces. PCA considers the special case in which those surfaces are linear, i.e. they are
subspaces.
Lesson 3: Quantitative Diagnostics Analytics
2. Big guns of BI
(vi) Multidimensional Analytics - Principal Component Analysis (PCA) - XLSTAT
PCA is the method of finding a new basis such that the information in the data is concentrated mainly in a
few coordinates, while the remaining coordinates carry only a small amount of information. To keep the computation simple, PCA looks for an
orthonormal system to serve as the new basis.
Principal Component Analysis (PCA) is one of the
most popular data mining statistical methods.
Principal Component Analysis (PCA) is a powerful
and popular multivariate analysis method that lets
you investigate multidimensional datasets with
quantitative variables. It is widely used in biostatistics,
marketing, sociology, and many other fields.
Principal Component Analysis is one of the most frequently used multivariate data analysis methods.
It is a projection method as it projects observations from a p-dimensional space with p variables to a
k-dimensional space (where k < p) so as to conserve the maximum amount of information
(information is measured here through the total variance of the dataset) from the initial dimensions.
PCA dimensions are also called axes or Factors. If the information associated with the first 2 or 3 axes
represents a sufficient percentage of the total variability of the scatter plot, the observations could be
represented on a 2 or 3-dimensional chart, thus making interpretation much easier.
The Principal Components tool can reduce the dimensions (the number of numeric fields) in a
database. It does this by transforming the original set of fields into a smaller set that accounts for most
of the variance (i.e., information) in the data. The new fields are called factors, or principal
components.
• The study and visualization of the correlations between variables to hopefully be able to limit the
number of variables to be measured afterwards;
• Obtaining non-correlated factors which are linear combinations of the initial variables so as to use
these factors in modeling methods such as linear regression, logistic regression or discriminant
analysis.
• Visualizing observations in a 2- or 3-dimensional space in order to identify uniform or atypical groups
of observations.
Options for Principal Component Analysis in Excel using the XLSTAT software
• Pearson or Covariance?
• Supplementary variables and observations
• Rotations: Varimax and others
XLSTAT offers several data treatments to be used on
the input data prior to Principal Component Analysis
computations:
• Correlation (Pearson), the classic PCA, that
automatically standardizes the data prior to
computations to avoid inflating the impact of variables
with high variances on the result.
• Covariance, that works on unstandardized variances
and covariances (variables with high variances will play
stronger roles in the outputs). Covariance matrices
allocate more weight to variables with higher variances.
• Spearman. Spearman's correlations may be more
appropriate when running the PCA on variables with
different distributions.
• PCA projects the variables into a new space using a matrix that shows the degree of similarity between the
variables. It is common to use the Pearson correlation coefficient or the covariance as the index of similarity. Pearson correlation and covariance
have the advantage of giving positive semidefinite matrices, whose properties are used in PCA. However, other indexes may be used. XLSTAT
also makes it possible to use the Spearman correlation coefficient because the corresponding matrices are also positive semidefinite. Running a
PCA on a Spearman correlation matrix is fully equivalent to a classic PCA (based on Pearson correlation) performed on the matrix of ranks. When
you run a PCA based on Spearman correlations, XLSTAT offers the option to display the matrix of the ranks in the report.
• Traditionally, a correlation coefficient rather than the covariance is used as using a correlation coefficient removes the effect of scale: thus a
variable which varies between 0 and 1 does not weigh more in the projection than a variable varying between 0 and 1000. However in certain
areas, when the variables are supposed to be on an identical scale or we want the variance of the variables to influence factor building,
covariance is used.
• Where only a similarity matrix is available rather than a table of observations/variables, or where you want to use another similarity index, you can
carry out a PCA starting from the similarity matrix. The results obtained will only concern the variables as no information on the observations was
available.
• Note: where PCA is carried out on a correlation matrix, it is called normalized PCA
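The effect of scale can be seen in a two-variable sketch. The data are invented, and the closed-form eigenvalue formula used here only works for a 2 x 2 symmetric matrix; it stands in for what a tool such as XLSTAT computes and is not its implementation.

```python
from math import sqrt

def eigenvalues_2x2(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], largest first."""
    mid = (a + d) / 2
    half_gap = sqrt(((a - d) / 2) ** 2 + b ** 2)
    return mid + half_gap, mid - half_gap

def cov(xs, ys):
    """Sample covariance (variance when xs is ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical variables on very different scales
rate = [0.1, 0.4, 0.3, 0.8, 0.6]       # varies between 0 and 1
volume = [300, 100, 900, 500, 700]     # varies between 0 and 1000

var_r, var_v, cov_rv = cov(rate, rate), cov(volume, volume), cov(rate, volume)
r = cov_rv / sqrt(var_r * var_v)

cov_eigs = eigenvalues_2x2(var_r, cov_rv, var_v)   # PCA on the covariance matrix
corr_eigs = eigenvalues_2x2(1.0, r, 1.0)           # normalized PCA (correlation matrix)

# Share of total variability carried by the first axis in each case.
cov_share = cov_eigs[0] / sum(cov_eigs)
corr_share = corr_eigs[0] / sum(corr_eigs)

print(f"covariance PCA:  first axis = {cov_share:.4%}")
print(f"correlation PCA: first axis = {corr_share:.4%}")
```

With covariance, the first axis is almost entirely the large-scale variable; with correlation, both variables weigh equally and the first axis reflects only how strongly they are related.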
Supplementary variables and observations
XLSTAT lets you add variables (qualitative or quantitative) or observations to
the PCA after it has been computed. Those variables or observations are
called supplementary. This can be used in several contexts.
Here are two examples:
•If the user wants to investigate roughly how a set of dependent variables
relates to the others. The set of dependent variables should be used here as
a set of supplementary variables and the others (i.e. independent
variables) should be used to build the PCA.
•If the user simply wants to see how different categories of observations
behave in the PCA space (Males vs Females for example). In this case, a
qualitative supplementary variable (sex) may be used to color observations
according to the sex they belong to. It is also possible to display the
category centroids as well as confidence ellipses around categories.
The first result to look at is the correlation matrix. We can see right away that the rates of people below and above 65 are perfectly negatively
correlated (r = -1), since one is the complement of the other. Either of the two variables could have been removed without affecting the quality of
the results. We can also see that Net Domestic Migration is only weakly correlated with the other variables, including Net International Migration.
This suggests that U.S. nationals and non-nationals may be moving to a state for different sets of reasons.
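The r = -1 result follows directly from the two age rates being complements of each other. A small NumPy check, using invented percentages rather than the tutorial's data, shows the effect:

```python
import numpy as np

# Made-up percentages of the population under 65 for five hypothetical states.
pct_under_65 = np.array([88.1, 86.5, 87.0, 82.4, 89.3])

# The share above 65 is the complement, so it is an exact linear function
# of the first variable with a negative slope.
pct_over_65 = 100.0 - pct_under_65

# Pearson correlation between the two variables.
r = np.corrcoef(pct_under_65, pct_over_65)[0, 1]
print(round(r, 6))
```

Any pair of variables tied together this way carries the same information twice, which is why one of them can be dropped.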
Lesson 3: Quantitative Diagnostics Analytics
2. Big guns of BI
(vi) Multidimensional Analytics - Principal Component Analysis (PCA) - XLSTAT
• The next table and the corresponding chart relate to a mathematical object, the eigenvalues, which reflect
the quality of the projection from the N-dimensional initial table (N=7 in this example) to a lower number of
dimensions. In this example, we can see that the first eigenvalue equals 3.567 and represents 51% of the total
variability. This means that if we represent the data on only one axis, we will still be able to see 51% of the total
variability of the data.
• Each eigenvalue corresponds to a factor, and each factor to one dimension. A factor is a linear combination of
the initial variables, and all the factors are uncorrelated (r=0). The eigenvalues and the corresponding factors are
sorted in descending order of how much of the initial variability they represent (converted to %).
Ideally, the first two or three eigenvalues will correspond to a high % of the
variance, assuring us that the maps based on the first two or three factors
are a good-quality projection of the initial multi-dimensional table. In this
example, the first two factors allow us to represent 67.72% of the initial
variability of the data. This is a good result, but we'll have to be careful
when we interpret the maps, as some information might be hidden in the
next factors. We can see here that although we initially had 7 variables, the
number of factors is 6. This is due to the two age variables, which are
perfectly negatively correlated (r = -1), so one of them adds no information.
The number of "useful" dimensions has been automatically detected.
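In a normalized PCA the eigenvalues sum to the number of variables, so with the figures quoted above the first factor's share is 3.567 / 7, or about 51%. The NumPy sketch below reproduces the same kind of eigenvalue table on invented data (not the tutorial's demographic table): five variables, two of which are exact complements, so only four factors have non-zero variability. The tolerance `1e-10` is an assumption made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50

# Three made-up variables plus a pair of complementary rates (a and 100 - a).
base = rng.normal(size=(n, 3))
a = rng.uniform(80.0, 90.0, size=n)
X = np.column_stack([base, a, 100.0 - a])

# Normalized PCA works on the correlation matrix.
R = np.corrcoef(X, rowvar=False)

# Eigenvalues sorted in descending order.
eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]

# Variability per factor (%) and cumulative variability (%).
percent = 100.0 * eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(percent)

# Factors whose eigenvalue is not numerically zero.
n_factors = int(np.sum(eigenvalues > 1e-10))

print(np.round(eigenvalues, 3))
print(np.round(cumulative, 2))
print(n_factors)
```

The cumulative column is what tells you how much of the initial variability a two- or three-factor map retains.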
Lesson 3: Quantitative Diagnostics Analytics
2. Big guns of BI
(vi) Multidimensional Analytics - Principal Component Analysis (PCA) - XLSTAT
linked with age and population renewal, and the vertical axis with domestic migration. These trends will be
helpful in interpreting the next map. To confirm that a variable is well linked with an axis, take a look at the
squared cosines table: the greater the squared cosine, the greater the link with the corresponding axis. The
closer the squared cosine of a given variable is to zero, the more careful you have to be when interpreting
the results in terms of trends on the corresponding axis. Looking at this table we can see that the trends for
Principal component analysis is often performed before a regression, to avoid using correlated variables,
or before clustering the data, to have a better overview of the variables. The number of clusters might
sometimes be a simple guess based on the maps. The above demographic data have also been used
in the tutorial on hierarchical clustering. The ">65 pop" variable has been removed as its inclusion would
double the weight of the age variables in the analysis.
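The squared cosines check mentioned on this slide can be sketched in NumPy. This is an illustration on random data, not XLSTAT's output: in a normalized PCA the squared cosine of a variable on a factor equals the squared correlation between them, and the squared cosines of a variable sum to 1 over all factors. The names `cos2` and `cos2_first_two` are chosen only for this example.

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up table: 30 observations, 4 variables.
X = rng.normal(size=(30, 4))

# Normalized PCA on the correlation matrix.
R = np.corrcoef(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = np.clip(eigenvalues[order], 0.0, None)  # guard against tiny negatives
eigenvectors = eigenvectors[:, order]

# Loadings: correlation between each variable (row) and each factor (column).
loadings = eigenvectors * np.sqrt(eigenvalues)

# Squared cosines of the variables on each factor.
cos2 = loadings ** 2

# Quality of representation of each variable on the first two factors.
cos2_first_two = cos2[:, :2].sum(axis=1)

print(np.round(cos2, 3))
print(np.round(cos2_first_two, 3))
```

A variable with a low value in `cos2_first_two` is poorly represented on the two-factor map and should be interpreted with care.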
Lesson 3: Quantitative Diagnostics Analytics
2. Big guns of BI
(vi) Multidimensional Analytics - Principal Component Analysis (PCA) - XLSTAT
Going further
1. Adding supplementary variables to the PCA
It is possible to add supplementary variables to the PCA after it has been computed. This may help
improve the interpretation. In XLSTAT, those variables can be selected under the Suppl. Data tab
of the PCA dialog box. Supplementary variables can be divided into two types:
• Qualitative supplementary variables: they let you color observations on the map according to
the category they belong to. In this tutorial's example, we could have added a column indicating whether
a state is mostly Republican or mostly Democrat.
• Quantitative supplementary variables: these variables can be added to see how they correlate
with the group of variables that have been used to build the PCA. Where PCA is
performed before a regression, the explanatory variables can be used to construct the PCA while
the dependent variable can be added as a supplementary variable. This may help to roughly
detect which explanatory variables could have the strongest effects on the dependent variable.
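The quantitative case can be sketched in NumPy as follows. The data are invented and the names `y` and `supp_corr` are hypothetical: the PCA is built from the explanatory variables only, and the dependent variable is then positioned by its correlation with each factor.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up explanatory variables (used to build the PCA).
X = rng.normal(size=(40, 3))

# Hypothetical dependent variable, driven mostly by the first explanatory variable.
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=40)

# Normalized PCA on the explanatory variables only.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
scores = Z @ eigenvectors[:, order]

# Supplementary variable: correlation of y with each factor.
supp_corr = np.array(
    [np.corrcoef(y, scores[:, k])[0, 1] for k in range(scores.shape[1])]
)

print(np.round(supp_corr, 3))
```

On a correlation circle, `y` would be drawn at these coordinates next to the active variables; the explanatory variables pointing in the same direction are the first candidates to examine in the regression.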
Lesson 3: Quantitative Diagnostics Analytics
2. Big guns of BI
(vi) Multidimensional Analytics - Principal Component Analysis (PCA) - XLSTAT