Guidebook To Data Analyst

DISCLAIMER
1. This guidebook summarizes selected aspects of Business Analytics that may be
relevant to the competition’s examination. Participants are encouraged to use
other materials on Business Analytics as well to support their work during the
competition.
2. The knowledge in this guidebook is drawn from various sources, such as
published e-books, news and journal articles, whose origins are cited in the
Reference part. RBAC’s organizers do not own any of the frameworks, models or
suggestions mentioned in this guidebook.
3. This guidebook is designed for internal use only, not for publication or
commercial benefit.
TABLE OF CONTENTS
A. GENERAL KNOWLEDGE IN BUSINESS ANALYTICS
I. Descriptive Analytics
II. Diagnostic Analytics
III. Predictive Analytics
IV. Prescriptive Analytics
B. DATA ANALYSIS PROCESS
I. Step 1 – Define Problems
II. Step 2 – Data Collection
III. Step 3 – Data Cleaning
IV. Step 4 – Data Analysis
V. Step 5 – Data Visualization
C. HOW TO MAKE RECOMMENDATION
I. Common Traps During Recommendation Process
II. Disney Creative Strategy: From Dreamer, Realist To Critic
D. SLIDES PRESENTATION
I. The Pyramid Principle
II. Important Notes When Applying The Pyramid Principle
A. GENERAL KNOWLEDGE IN BUSINESS ANALYTICS
Business analytics comprises solutions used to build models and simulations in order
to create scenarios, understand current realities and predict future states. It is a
multi-stage process: each stage uses data analysis to reach a particular type of
conclusion, and together the stages support the best possible strategy for
organizational action.
The Gartner Analytic Ascendancy Model describes four stages of data analytics
maturity (descriptive, diagnostic, predictive and prescriptive), which indicate an
organization's ability to practice data exploration and decision-making effectively.
I. DESCRIPTIVE ANALYTICS
Descriptive analytics answers the question ‘What happened in the past?’. It uses a
range of historical data to draw comparisons, summarize individual variables with
statistics, discover trends, spot anomalies, test hypotheses, and check assumptions.
1. Best Practices
● Rule-based and directly
● Pattern tracking
● Clustering
● Summary statistics
● Data aggregation
● Data mining
2. Examples
● Identify which customer segments generated the highest dollar amount in sales
last year.
● Uncover which social media platforms delivered the best return on advertising
investment last quarter.
● Track MoM and YoY revenue growth or decline.
● Track demand for SKUs across geographic locations last year.
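The MoM and YoY tracking mentioned in the examples can be sketched in a few lines of Python. This is only an illustration: the revenue figures and the `pct_change` helper are hypothetical and do not come from any competition dataset.

```python
# Hypothetical monthly revenue figures, keyed by (year, month).
revenue = {
    (2022, 1): 100.0,
    (2022, 12): 150.0,
    (2023, 1): 120.0,
}


def pct_change(current, previous):
    """Percentage change from `previous` to `current`."""
    return (current - previous) / previous * 100


# MoM compares a month with the month before it.
mom = pct_change(revenue[(2023, 1)], revenue[(2022, 12)])
# YoY compares a month with the same month one year earlier.
yoy = pct_change(revenue[(2023, 1)], revenue[(2022, 1)])

print(f"MoM: {mom:.1f}%  YoY: {yoy:.1f}%")
```

With these made-up numbers, January 2023 is down 20% month over month but up 20% year over year, which is why both measures are usually tracked together.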
II. DIAGNOSTIC ANALYTICS

Diagnostic analytics deals with ‘Why did it happen?’. The goal is to identify root
causes for business results and isolate confounding information. Diagnostic analytics
seeks to tell stories with data. Understanding why a trend is developing or why a
problem occurred will make your business intelligence actionable.
1. Best practices
● Root cause analysis
● Data discovery
● Drill-down
● Data mining
● Data correlations
2. Examples
● Identify shared characteristics and behaviors of profitable customer segments
that may explain why they are spending more.
● Investigate the differences in marketing campaign performance by comparing
high-performing and poor-performing social media ads.
● Determine correlation by comparing the timing of key initiatives to MoM and
YoY revenue growth or decline.
● Look at any related factors to see if they contribute to the demand for particular
SKUs across geographic locations.
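As a rough illustration of the "data correlations" practice, the sketch below computes a Pearson correlation coefficient by hand. The `pearson` helper and the ad-spend and revenue numbers are assumptions made up for this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: monthly ad spend and revenue in the same currency unit.
ad_spend = [10, 20, 30, 40, 50]
revenue = [25, 45, 65, 85, 105]

r = pearson(ad_spend, revenue)
print(round(r, 3))
```

A coefficient close to 1 or -1 signals a strong linear relationship, but it does not by itself show that one variable causes the other, so it is a starting point for root cause analysis rather than a conclusion.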
III. PREDICTIVE ANALYTICS
Predictive analytics is associated with ‘What will happen?’ and can also be called
predictive modeling. It forecasts possible future outcomes and identifies the likelihood
of those events happening. One of the most valuable forms of predictive analytics is
what-if analysis, which involves changing various values to see how those changes
will affect the outcome.
1. Best practices
● Ensure that all relevant variables impacting the outcome are considered during
prediction.
● Test model building with multiple algorithms and see which one fits best based
on accuracy and response times.
● Decide how accurate your predictive analytics need to be.
● Plan for disruption and continue to refine your predictive analytics models.
● Test and periodically retest model accuracy with new sets of data.
2. Examples
● Forecast the revenue of a particular customer segment.
● Predict the schemes that will attract more customers in the upcoming
campaigns.
● Create more accurate projections regarding financial attributes and metrics for
the next fiscal year.
● Predict demand for various products in different regions at specific periods next
year.
IV. PRESCRIPTIVE ANALYTICS
Prescriptive analytics builds on predictive analytics to answer ‘How can we make it
happen?’ by recommending (prescribing) actions based on desired potential
outcomes. In the wider view of applying business analytics for organizational success,
prescriptive analytics delivers business value through recommendations grounded in
the data and in the predicted results.
1. Best practices
● Learn from previous transactions.
● Utilize business rules, algorithms, machine learning, and computational
modeling procedures.
● Support what-if scenarios.
● Apply Simulation, Decision Analysis, and Optimization-based decision-making.
2. Examples
● Determine how to penetrate new market segments.
● Generate the plan to promote new products next quarter.
● Identify ways to optimize cost and manage risk.
● Determine how to optimize warehousing & supply chain.
B. DATA ANALYSIS PROCESS
Step 1: Define Problems
Step 2: Data Collection
Step 3: Data Cleaning
Step 4: Data Analysis
Step 5: Data Visualization
I. STEP 1 - DEFINE PROBLEMS

1. MECE framework
1.1 What is MECE
When processing large quantities of data, it is hard to work with the information if
you don’t know where to start. One solution is the MECE framework (Mutually
Exclusive, Collectively Exhaustive), a problem-structuring principle that organizes
information and data efficiently and comprehensively.
● ME: no overlaps, each item is independent.
● CE: thorough, no missed item, all aspects of the problem must be included.
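The two MECE conditions can be checked mechanically when the items being grouped are known. The sketch below is one possible way to do it; the `check_mece` helper, the customer IDs and the segment names are all hypothetical.

```python
def check_mece(universe, segments):
    """Return (overlaps, gaps) for a proposed segmentation of `universe`.

    overlaps: items that appear in more than one segment (violates ME)
    gaps:     items that appear in no segment at all (violates CE)
    """
    seen = set()
    overlaps = set()
    for members in segments.values():
        overlaps |= seen & set(members)
        seen |= set(members)
    gaps = set(universe) - seen
    return overlaps, gaps


# Hypothetical customer IDs and a proposed segmentation.
customers = {1, 2, 3, 4, 5}
segments = {
    "new": {1, 2},
    "returning": {2, 3, 4},  # customer 2 is counted twice
}

overlaps, gaps = check_mece(customers, segments)
print("Overlaps:", overlaps)
print("Gaps:", gaps)
```

A segmentation is MECE only when both returned sets are empty. Here customer 2 overlaps two segments and customer 5 is not covered by any.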
1.2 Why use MECE

Generally, when facing practical problems, it is necessary to define the issues, put
forward hypotheses or assumptions, and find solutions on the premise of fully
mastering the facts.
MECE is a very useful tool to analyze the problem step by step in a thorough way to
find a solution. MECE analysis has two steps:
● Confirm what the problems are.
● Find the entry point of MECE.
1.3. How can MECE be applied
1.3.1. Issue tree
The issue tree is one method of dividing and arranging information: it breaks the
big problem down into smaller, solvable ones, giving users a clear and systematic
view of the problem. Some basic principles for creating an issue tree are:
● Break the big problem into separate branches.
● Apply the MECE principle, so that the tree covers the whole problem.
● Focus on broad categories instead of small details when making branches.
● Focus on the most impactful parts of the problem (based on data).
1.3.2 Hypothesis tree

A hypothesis tree is a mapping of all MECE hypotheses that address the problem.
It is fairly similar to an issue tree; however, a hypothesis tree organizes a problem
around hypotheses and often offers a more direct route to a solution. When
making a hypothesis tree, it is important to:
● Understand the problem thoroughly.
● Write down the problem statement. Make the statement clear so that there is no
ambiguity.
● List the options to solve the problem, using a MECE tree. Check that the options
do not overlap and that no option has been left out.
● Consider each option individually, weighing its pros and cons. Leave out options
that are illogical, and add any new insight as an option as you understand the
problem better.
● Select the best option.
1.3.3 Decision tree

A decision tree is the mapping of decisions and their potential outcomes. It helps users
weigh the benefits and drawbacks of each solution. By using the decision tree, the
decision-maker can have a clear view of all possible outcomes and evaluate them to
choose the best option.
Tip: You may use any shapes you want when mapping. However, the meaning of each
shape should be defined at the beginning so that the map stays consistent.
Some recommended steps in drawing a decision tree:
Step 1: Start with the main decision: Draw a small box to present the start point, then
draw lines from the box to present each possible action or solution. Label them.
Step 2: Add chance and decision nodes to expand the tree.
● Draw a box if another decision is needed
● Draw a circle if the results are uncertain
● Leave it blank if the problem is solved (for now)
● From each decision node, draw possible solutions. From each chance node,
draw lines representing possible outcomes
● (Optional) Include the probability of each outcome
Step 3: Continue to expand until every line reaches an endpoint (where there are
no more choices to be made or results to consider). Then, assign value for each
outcome for better comparison.
You can add triangles to signify the endpoints.
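Once values and probabilities are assigned, a decision tree can be evaluated by working backwards from the endpoints. The sketch below shows one way to do this in Python. The dictionary layout, the `expected_value` helper and the launch example are assumptions for illustration, not a standard notation.

```python
def expected_value(node):
    """Evaluate a decision tree by rolling back from the endpoints.

    endpoint:  {"value": number}
    chance:    {"chance": [(probability, child), ...]}  -> weighted average
    decision:  {"decision": {label: child, ...}}        -> best option
    """
    if "value" in node:
        return node["value"]
    if "chance" in node:
        return sum(p * expected_value(child) for p, child in node["chance"])
    return max(expected_value(child) for child in node["decision"].values())


# Hypothetical decision: launch a product or do nothing.
tree = {
    "decision": {
        "launch": {"chance": [(0.6, {"value": 100}), (0.4, {"value": -50})]},
        "do nothing": {"value": 0},
    }
}

options = tree["decision"]
best = max(options, key=lambda label: expected_value(options[label]))
print(best, expected_value(tree))
```

In this made-up case, launching has an expected value of 0.6 × 100 + 0.4 × (−50) = 40, which beats doing nothing. Choosing by expected value is only one possible decision rule; a risk-averse decision-maker may weigh the downside more heavily.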
2. S.M.A.R.T criteria
When looking into a business problem, setting goals is essential to achieving the best
possible outcome. There are various frameworks for setting effective goals, each
offering a different perspective. This guidebook focuses on the popular S.M.A.R.T
criteria.
The acronym stands for:

S — Specific — should target a specific (clearly stated) area for improvement.
M — Measurable — should have numbers or indicators to measure progress.
A — Achievable — should stretch you a little but within your skill/knowledge range.
R — Relevant — should be in line with the bigger picture and vision.
T — Timely — should specify when deadlines come to show a sense of urgency.
Beyond the original criteria, there are also extensions (Evaluate and Re-adjust)
that help users monitor their progress effectively and efficiently. These provide
more flexibility and motivation, making the goals easier to achieve.
Examples:
Criterion | SHOULD NOT | SHOULD
Specific | I want to make more money from the restaurant. | I want to increase the number of daily customers of the restaurant.
Measurable | I will increase the number of daily customers. | I will increase the number of daily customers from 20/day to 50/day.
Achievable | I will increase the number of daily customers from 20/day to 5000/day in a short time. | I will increase the number of daily customers from 20/day to 50/day, and I can achieve this goal by gradually improving the quality to attract more customers.
Relevant | I want to improve the total daily customers, so I will invest in real estate. | I want to improve the total daily customers, so I will try to improve the ingredients’ quality by looking for a reputable supplier.
Timely (Time-bound) | I will work hard to raise the number of daily customers by 50% eventually. | I will work hard to raise the number of daily customers by 50% within 2 weeks.
II. STEP 2 - DATA COLLECTION

This guidebook does not cover this step, as the competition already provides datasets
to participants.
III. STEP 3 - DATA CLEANING

When combining multiple data sources, there are many opportunities for data to be
duplicated or mislabeled. If data is incorrect, outcomes and algorithms are unreliable.
That is why data cleaning (also called data cleansing or data scrubbing) is one of the
most important steps for every business analyst: it yields accurate, credible insights
from data, which support an effective decision-making process.
1. Definition
Data cleaning is the process of fixing or removing incorrect, corrupted, wrongly
formatted, duplicate, or incomplete data within a dataset.
2. Different types of data issues

● Invalid values: Some fields have a well-known set of valid values, e.g. gender must
only be “F” (Female) or “M” (Male). In this case, it’s easy to detect wrong values.
● Formats: The most common issue. The same value can appear in different
formats, like a name written as “Name, Surname” or “Surname, Name”.
● Attribute dependencies: When the value of a feature depends on the value of
another feature. For example, if we have some school data, the “number of
students” is related to whether the person “is a teacher?”. If someone is not a
teacher, he/she cannot have any students.
● Uniqueness: It is possible to find repeated data in features that should hold unique
values. For example, we cannot have two products with the same ID.
● Missing values: Some features in the dataset may have blank or null values.
● Misspellings: Incorrectly written values.
● Misfielded values: When a feature contains the values of another.
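Several of these issue types can be detected with simple checks. The following sketch scans a handful of records for invalid values, missing values, duplicate IDs and a broken attribute dependency. The records and field names are hypothetical and only mirror the examples above.

```python
from collections import Counter

# Hypothetical records; the field names are illustrative only.
rows = [
    {"id": "P1", "gender": "F", "is_teacher": True, "students": 30},
    {"id": "P2", "gender": "X", "is_teacher": False, "students": 0},
    {"id": "P2", "gender": "M", "is_teacher": False, "students": 12},
    {"id": "P3", "gender": None, "is_teacher": True, "students": 25},
]

issues = []
id_counts = Counter(row["id"] for row in rows)
for index, row in enumerate(rows):
    if row["gender"] is None:
        issues.append((index, "missing value: gender"))
    elif row["gender"] not in {"F", "M"}:
        issues.append((index, "invalid value: gender"))
    if id_counts[row["id"]] > 1:
        issues.append((index, "uniqueness: duplicate id"))
    if not row["is_teacher"] and row["students"] > 0:
        issues.append((index, "attribute dependency: non-teacher with students"))

for issue in issues:
    print(issue)
```

The script only reports the problems with their row positions. Deciding whether to fix, drop or keep each row is left to the analyst.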
3. Seven characteristics of quality data
● Validity: The degree to which your data conforms to defined business rules or
constraints. Some common constraints include:
○ Mandatory constraints: Certain columns cannot be empty.
○ Data-type constraints: Values in a column must all be of the same data type.
○ Range constraints: Numbers or dates must fall between minimum and maximum
values.
○ Foreign-key constraints: The values in a column must be defined in a column of
another table that contains unique values.
○ Unique constraints: A field, or a combination of fields, must be unique in a dataset.
○ Regular expression patterns: Text fields must match a specified pattern.
○ Cross-field validation: Certain conditions that span multiple fields must hold.
○ Set-membership constraints: A subcategory of foreign-key constraints. Values for
a column must come from a set of discrete values or codes.
● Accuracy: Ensure your data is close to the true values.
● Completeness: The degree to which all required data is known.
● Consistency: Ensure your data is consistent within the same dataset and/or
across multiple data sets.
● Uniformity: The degree to which the data is specified using the same unit of
measure.
● Traceability: Being able to find (and access) the source of the data.
● Timeliness: How quickly and recently the data has been updated.
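The validity constraints listed above can be written down as explicit checks. The sketch below covers mandatory, data-type, range, set-membership and regular-expression constraints for a single record. The field names, the limits and the deliberately simple email pattern are assumptions chosen for the example.

```python
import re

# Hypothetical rules for an order record; names and limits are assumptions.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
VALID_STATUSES = {"new", "shipped", "returned"}


def validate(order):
    """Return the list of constraints that `order` violates."""
    errors = []
    if not order.get("order_id"):  # mandatory constraint
        errors.append("mandatory: order_id")
    if not isinstance(order.get("quantity"), int):  # data-type constraint
        errors.append("data type: quantity")
    elif not 1 <= order["quantity"] <= 1000:  # range constraint
        errors.append("range: quantity")
    if order.get("status") not in VALID_STATUSES:  # set-membership constraint
        errors.append("set membership: status")
    if not EMAIL_PATTERN.match(order.get("email", "")):  # regex pattern
        errors.append("pattern: email")
    return errors


good = {"order_id": "A1", "quantity": 3, "status": "new", "email": "a@example.com"}
bad = {"order_id": "", "quantity": 0, "status": "lost", "email": "not-an-email"}

print(validate(good))
print(validate(bad))
```

Returning a list of violated rules, rather than a single pass/fail flag, makes it easier to see which quality characteristic a record fails on.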
4. Process
Step 1: Remove duplicate or irrelevant observations
Remove unwanted observations from your dataset, including duplicate observations
or irrelevant observations.
● Duplicate observations: repeated data entries.
● Irrelevant observations: observations that do not fit into the specific problem
you are trying to analyze.
Step 2: Fix structural errors

Some types of structural errors could include strange naming conventions, typos, or
incorrect capitalization. These inconsistencies can cause mislabeled categories or
classes. For example, you may find “N/A” and “Not Applicable” both appear, but they
should be analyzed as the same category.
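One common way to fix this kind of structural error is to map every known variant to a single canonical label. The sketch below does so for the “N/A” example. The `CANONICAL` mapping and the `normalize_label` helper are hypothetical and would need to be extended for a real dataset.

```python
# Hypothetical mapping of known variants to one canonical label.
CANONICAL = {
    "n/a": "N/A",
    "na": "N/A",
    "not applicable": "N/A",
}


def normalize_label(raw):
    """Collapse extra whitespace, ignore case, and map known variants."""
    tidy = " ".join(raw.split())
    return CANONICAL.get(tidy.lower(), tidy)


labels = ["N/A", " not applicable ", "Not  Applicable", "Retail"]
cleaned = [normalize_label(label) for label in labels]
print(cleaned)
```

Labels that are not in the mapping are only tidied up (extra spaces removed), so unknown categories are never silently merged.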
Step 3: Filter unwanted outliers

Outliers are values that stand out and are significantly different from the others. In
this step, it is necessary to determine the validity of outliers. Only consider removing
an outlier when it proves to be irrelevant for analysis or is a mistake.
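A simple way to shortlist candidate outliers is the 1.5 × IQR rule, which is a widely used convention rather than something this guidebook prescribes. The sketch below applies it with Python's `statistics.quantiles`. The order counts are invented, and the flagged values still need the validity check described above before anything is removed.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical daily order counts; 95 may be a typo or a genuine spike.
daily_orders = [10, 12, 11, 13, 12, 11, 95, 12]
flagged = iqr_outliers(daily_orders)
print(flagged)
```

Note that different tools compute quartiles slightly differently, so the exact fences may not match Excel's on small samples.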
Step 4: Handle missing data
You can’t ignore missing data because many algorithms will not accept missing
values. There are a few ways to deal with missing data. None is optimal, but
each can be considered.
● Drop: When the missing values in a column are few and far between, the easiest
way to handle them is to drop the rows that contain them. However, doing this
loses information, so be mindful of that before you remove anything.
● Impute: This method fills in the missing values based on other observations.
Statistical techniques like the median, mean, or linear regression are helpful if
there aren’t many missing values. You can also replace missing data with entries
from another “similar” database. However, there is a risk of losing the integrity
of the data, because you are then working from assumptions and not actual
observations.
● Flag: Missing data can be informative, especially if there is a pattern in play.
For example, you conduct a survey, and most women refuse to answer a
particular question. In such cases, simply flagging the data as missing can
preserve those subtle insights:
○ For numeric data: mark the value as missing (for example, in an indicator
column), then fill in 0.
○ For categorical data: introduce a ‘missing’ category.
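The three options can be compared side by side on a small example. In the sketch below, the survey rows and field names are hypothetical, the median is used as one possible imputation choice, and the `age_missing` indicator column is an assumption about how the flag might be stored.

```python
import statistics

# Hypothetical survey rows; None marks a missing answer.
rows = [
    {"age": 25, "city": "Hanoi"},
    {"age": None, "city": "Hue"},
    {"age": 35, "city": None},
    {"age": 30, "city": "Hanoi"},
]

# Drop: keep only the rows that have no missing values.
dropped = [r for r in rows if None not in r.values()]

# Impute: replace missing ages with the median of the observed ages.
median_age = statistics.median(r["age"] for r in rows if r["age"] is not None)
imputed = [{**r, "age": median_age if r["age"] is None else r["age"]} for r in rows]

# Flag: record that the value was missing, then fill with 0 or 'missing'.
flagged = [
    {
        "age": 0 if r["age"] is None else r["age"],
        "age_missing": r["age"] is None,
        "city": "missing" if r["city"] is None else r["city"],
    }
    for r in rows
]

print(len(dropped), median_age)
print(flagged[1])
```

Dropping halves this tiny dataset, which illustrates the information loss mentioned above. Imputing and flagging keep every row but introduce assumptions that should be noted in the analysis.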
Step 5: Validate and Quality Assurance

At the end of the data cleaning process, you should be able to answer these questions
as a part of basic validation:
● Does the data make sense?
● Does the data follow the appropriate rules for its field?
● Does it prove or disprove your working theory, or bring any insight to light?
● Can you find trends in the data to help you form your next theory?
● If not, is that because of a data quality issue?
IV. STEP 4 – DATA ANALYSIS

1. Techniques to analyze data
1.1 Regression analysis
Regression analysis is a statistical method that estimates the relationship between
independent and dependent variables. It can also be used to predict the value of the
dependent variable given values of the independent variables.
Examples:
● Relationship between price (independent variable X) and quantity of a product
(dependent variable Y) in a store.
● Relationship between obesity rate (independent variable X) and life expectancy
(dependent variable Y) of Vietnamese people.
Three kinds of regression analysis:

● Linear Regression
● Multiple Linear Regression
● Non-linear Regression
Regression analysis can be done in Excel with the Data Analysis tool of the Analysis
ToolPak, an Excel add-in that is enabled under Options.
Steps to activate Analysis Toolpak in Excel:
1. Click the File tab, click Options, and then click the Add-Ins category.
2. In the Manage box, select Excel Add-ins and then click Go.
If you're using Excel for Mac, in the file menu go to Tools > Excel Add-ins.
3. In the Add-Ins box, check the Analysis ToolPak check box, and then click OK.
○ If Analysis ToolPak is not listed in the Add-Ins available box, click Browse
to locate it.
○ If you are prompted that the Analysis ToolPak is not currently installed
on your computer, click Yes to install it.
1.1.1 Linear Regression
● Application:
Examining the relationship between 1 dependent variable (Y) and 1 independent
variable (X) to estimate Y from X.
● Linear regression model:
$Y = \beta_0 + \beta_1 x + u$
○ $\beta_0$: Y-intercept
○ $\beta_1$: population slope parameter
○ $u$: random error
● Linear regression equation:
Coefficients of the linear regression model are estimated using sample data:
$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
○ $\hat{\beta}_0$, $\hat{\beta}_1$: sample estimates of the intercept and the slope
Example: The relationship (regression equation) of advertising cost (X) and sales (Y)
(in millions USD) (instructions can be found via Detailed Instructions for Regression
Analysis).
Sales = 162.207 + 23.259 * (Advertising)
If advertising costs increase by 1 million USD, sales are predicted to increase by
23.259 million USD on average.
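For readers who want to see where such coefficients come from, the sketch below fits a simple linear regression by ordinary least squares in plain Python. The `fit_line` helper and the five data points are hypothetical; they are not the dataset behind the equation above.

```python
def fit_line(xs, ys):
    """Ordinary least squares estimates (intercept, slope) for y = b0 + b1*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: advertising cost (X) and sales (Y).
advertising = [1, 2, 3, 4, 5]
sales = [12, 15, 19, 20, 24]

b0, b1 = fit_line(advertising, sales)
print(f"Sales = {b0:.2f} + {b1:.2f} * Advertising")
print(f"Predicted sales at Advertising = 6: {b0 + b1 * 6:.2f}")
```

The slope is read the same way as in the example above: each additional unit of advertising is associated with `b1` additional units of sales on average.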
1.1.2 Multiple Linear Regression
● Application
Examining the relationship between 1 dependent variable (Y) and 2 or more
independent variables (Xi) to estimate the value of Y from the values of X.
● Multiple regression model:
$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + u$
○ $\beta_1, \beta_2, \dots, \beta_k$: population slope parameters, each holding all
else constant
○ $u$: random error
● Multiple regression equation:
Coefficients of the multiple regression model are estimated using sample data:
$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i} + \dots + \hat{\beta}_k x_{ki}$
○ $\hat{\beta}_1, \hat{\beta}_2, \dots, \hat{\beta}_k$: estimated slope coefficients,
each holding all else constant
Example: The relationship (regression equation) between Education Spending (X1),
Unemployment Rate (X2), Employee Compensation (X3), and GDP (Y) in Euro area
countries (instructions can be found via Detailed Instructions for Regression Analysis).
GDP = 24,497.596 + 3.984 * (Education Spending) - 2,188.899 * (Unemployment
Rate) + 1.434 * (Employee Compensation) (after first multiple regression)
GDP = -11,139.247 + 2.040 * (Employee Compensation) (after backward
regression)
1.2 Time series analysis

Time series analysis is a statistical technique used to identify trends and cycles over
time. A time series is a sequence of data points that measure the same variable at
different points in time (for example, weekly sales figures). By looking at time-related
trends, analysts are able to forecast how the variable of interest may fluctuate in the
future.
When conducting time series analysis, the main patterns you’ll be looking out for in
your data are:
● Trends: Sustained increases or decreases over an extended time period.
● Seasonality: Predictable fluctuations in the data that repeat at regular intervals
due to seasonal factors. For example, you might see a peak in swimwear sales in
summer around the same time every year.
● Cyclic patterns: Fluctuations that do not follow a fixed, predictable period.
Cyclic patterns are not due to seasonality, but rather may occur as a result of
economic or industry-related conditions.
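A moving average is one simple way to separate the trend from seasonal swings. The sketch below is an illustration only: the `moving_average` helper and the quarterly sales figures are hypothetical, and the window of 4 is chosen to match the assumed yearly season length.

```python
def moving_average(series, window):
    """Trailing moving average; returns one value per full window."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]


# Hypothetical quarterly sales with a seasonal peak every third quarter.
sales = [10, 12, 20, 11, 14, 16, 24, 15]

trend = moving_average(sales, window=4)
print(trend)
```

Because each window covers exactly one full year, the seasonal peak is averaged out and the smoothed values rise steadily, which exposes the underlying upward trend.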
2. Tools for Business Analytics

2.1 Statistical calculations
● Mean ($\bar{X}$): the average of a data set.
Formula:
○ For sample: $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$
○ For population: $\mu = \frac{1}{N}\sum_{i=1}^{N} X_i$
● Median: the middle value of an ordered data set.
To calculate the median:
○ Arrange all the recorded values in order of size
○ Find the middle value
Position of median: i = (n+1)/2. If n is even, the median is the average of the two
middle values.
● Mode: The mode is the value that occurs most often in the group. This can be a
group of either numbers or categories.
If there are two modes in the group of numbers, the group is described as
bimodal.
● Standard Deviation
Measures the typical distance between a single observation and the mean.
The larger the standard deviation, the more the values differ from the mean,
and therefore the more widely spread out they are.
Formula:
○ For sample: $S = \sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}}$
○ For population: $\sigma = \sqrt{\frac{\sum_{i=1}^{N}(X_i - \mu)^2}{N}}$
● Standard Error
The estimated standard deviation of the sample mean. It shows how precisely the
sample mean estimates the population mean: the lower, the better. In regression
output, the standard error plays a similar role: it shows how far the data points are
from the regression line on average, so a lower value signifies that the distances
between the data points and the expected values are smaller.
Formula: $SE = \frac{S}{\sqrt{n}}$
● Range
The difference between the largest and smallest data values.
Formula: Range = Maximum - Minimum
● Interquartile Range (IQR)
The difference between the 75th and 25th percentiles (the 3rd and 1st quartiles).
This represents the range of the middle 50% of the data and serves as a robust
measure of the variation in the data.
Formula: $IQR = Q_3 - Q_1$ (75th percentile minus 25th percentile)
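All of the measures in this section can be reproduced with Python's built-in `statistics` module, which is handy for checking Excel results. The sample below is hypothetical. Note that `statistics.quantiles` uses its default "exclusive" method here, so quartiles on small samples may differ slightly from those of other tools.

```python
import math
import statistics

# Hypothetical sample of daily customer counts.
data = [20, 22, 25, 25, 28, 30, 32]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
sample_sd = statistics.stdev(data)  # divides by n - 1
population_sd = statistics.pstdev(data)  # divides by n
standard_error = sample_sd / math.sqrt(len(data))
value_range = max(data) - min(data)
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(mean, median, mode, value_range, iqr)
print(round(sample_sd, 3), round(population_sd, 3), round(standard_error, 3))
```

The sample standard deviation is always at least as large as the population version on the same data, because it divides by n − 1 instead of n.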
2.2 Excel
2.2.1 Data Preparation
After importing the data and before visualizing it, we first need to clean and prepare
it. The following Excel functions and shortcuts make the job easier and more
efficient.
a. Text functions
Function | Definition | Syntax
LOWER | Converts all characters in a text string to lowercase | LOWER(string)
PROPER | Converts all characters in a text string to proper case | PROPER(string)
UPPER | Converts all characters in a text string to uppercase | UPPER(string)
CONCATENATE | Joins multiple text strings into one | CONCATENATE(string1, string2, string3...)
TRIM | Removes duplicate spaces, and spaces at the start and end of a text string | TRIM(string)
VALUE | Converts a text value into a number | VALUE(text)
b. Mathematical functions
Function | Definition | Syntax
SUM | Adds all the numbers given as arguments | SUM(numerical arguments)
PRODUCT | Multiplies all the numbers given as arguments | PRODUCT(numerical arguments)
SQRT | Returns the square root of a number | SQRT(number)
CEILING | Rounds a number up to the nearest multiple of significance | CEILING(number, significance)
FLOOR | Rounds a number down to the nearest multiple of significance | FLOOR(number, significance)
ROUND | Rounds a number to a specified number of digits | ROUND(number, num_digits)
SUMIF | Adds the values of given cells based on a single condition | SUMIF(criteria range, criteria, sum range)
SUMIFS | Adds the values of given cells based on multiple conditions | SUMIFS(sum range, criteria range1, criteria1, criteria range2, criteria2...)
c. Statistical functions
Function | Definition | Syntax
AVERAGE | Returns the average of the numbers given as arguments | AVERAGE(numerical arguments)
MAX | Returns the maximum of a given set of values | MAX(numerical arguments)
MIN | Returns the minimum of a given set of values | MIN(numerical arguments)
MEDIAN | Returns the median of the numbers given as arguments | MEDIAN(numerical arguments)
COUNT | Counts the number of cells that contain numerical values | COUNT(range of cells)
COUNTA | Counts the number of non-empty cells | COUNTA(range of cells)
COUNTIF | Counts the number of cells that meet a single condition | COUNTIF(criteria range, criteria)
COUNTIFS | Counts the number of cells that meet multiple conditions | COUNTIFS(criteria range1, criteria1, criteria range2, criteria2...)
AVERAGEIF | Returns the average of cells based on a single condition | AVERAGEIF(criteria range, criteria, average range)
AVERAGEIFS | Returns the average of cells based on multiple conditions | AVERAGEIFS(average range, criteria range1, criteria1, criteria range2, criteria2,...)
QUARTILE | Returns the quartile of a dataset (1, 2 or 3) | QUARTILE(range, quartile)
STDEV | Returns the standard deviation based on the dataset | STDEV(dataset)
d. Date time functions
Function | Definition | Syntax
MONTH | Returns the month number (between 1 and 12) of any date argument | MONTH(date argument)
DAY | Returns the day number (between 1 and 31) of any date argument | DAY(date argument)
YEAR | Returns the year of any date argument | YEAR(date argument)
DATEDIF | Returns the difference in years, months, or days between two dates | DATEDIF(old date, recent date, unit), where unit is "Y" for years, "M" for months and "D" for days
WEEKDAY | Returns the day of the week as a number, e.g. 1 for Sunday, 2 for Monday and so on | WEEKDAY(date, 1)
WEEKNUM | Returns the week number of a date in a year | WEEKNUM(date)
EOMONTH | Returns the last day of the month a specified number of months before or after a date | EOMONTH(date, specified number of months)
TEXT | Changes the date into the day name (Monday, Tuesday) to analyze day trends | TEXT(date, "dddd")
e. Logical functions
Function | Definition | Syntax
AND | Returns TRUE if all conditions are TRUE, else FALSE | AND(Condition1, Condition2,...)
OR | Returns TRUE if any of the conditions is TRUE, else FALSE | OR(Condition1, Condition2,...)
IF | A conditional function: executes the TRUE statement if the condition is TRUE, else executes the FALSE statement | IF(Condition, TRUE statement, FALSE statement)
f. Lookup functions
Function | Definition | Syntax
MATCH | Returns the relative position of a value in an array that matches the given criteria. Match type: 1 (largest value less than or equal to the lookup value; array sorted ascending), 0 (exact match), -1 (smallest value greater than or equal to the lookup value; array sorted descending) | MATCH(lookup value, lookup array, match type (0/1/-1))
VLOOKUP | Looks for a value in the leftmost column of the table, and then returns a value in the same row from a column you specify | VLOOKUP(lookup value, table array, column_index_number, TRUE/FALSE), where TRUE is an approximate match and FALSE is an exact match
HLOOKUP | Looks for a value in the top row of the table, and then returns a value in the same column from a row you specify | HLOOKUP(lookup value, table array, row_index_number, TRUE/FALSE)
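To make the lookup logic concrete, the sketch below re-creates exact-match versions of VLOOKUP and MATCH in Python. It is an analogy, not Excel's implementation: the `vlookup` and `match` helpers and the product table are hypothetical, and only the exact-match modes (FALSE and 0) are covered.

```python
# Hypothetical product table; the first column is the lookup key.
table = [
    ("SKU-1", "Notebook", 2.5),
    ("SKU-2", "Pen", 1.0),
    ("SKU-3", "Backpack", 20.0),
]


def vlookup(lookup_value, rows, column_index):
    """Exact-match lookup in the first column; column_index starts at 1."""
    for row in rows:
        if row[0] == lookup_value:
            return row[column_index - 1]
    return None  # Excel would show #N/A here


def match(lookup_value, values):
    """1-based position of an exact match, or None if the value is absent."""
    if lookup_value in values:
        return values.index(lookup_value) + 1
    return None


price = vlookup("SKU-2", table, 3)
position = match("Pen", [row[1] for row in table])
print(price, position)
```

As in Excel, positions and column indexes are counted from 1, and the lookup only searches the first column of the table.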
2.2.2 Pivot table

A pivot table is a data summarization tool that is used in the context of data
processing. Pivot tables are used to summarize, sort, reorganize, group, count, total
or average data stored in a database. Below are some of the tools you may need
when using pivot tables:
● Insert/Remove subtotals and grand totals: subtotal is the sum of a set of
numbers, which is then added to another set(s) of numbers to make the grand
total. The Subtotal feature is not limited to only totaling subsets of values within
a data set. It allows you to group and summarize your data using SUM, COUNT,
AVERAGE, MIN, MAX, and other functions.
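● Summarize values by / Show values as: choose how the values are aggregated (for
example, Sum, Count or Average) and how the results are displayed (for example, as
a percentage of the grand total).
● Grouping: You can group numbers in a pivot table to create frequency
distribution tables. This helps in analyzing numerical values by grouping them
into ranges.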
● Sorting: To help you locate data that you want to analyze in a PivotTable more
easily, you can sort text entries (from A to Z or Z to A), numbers (from smallest
to largest or largest to smallest), and dates and times (from oldest to newest or
newest to oldest).
● Filtering: There are several methods for filtering data: insert one or more slicers,
apply filters to any field in the PivotTable's Row field with AutoFilter, add filters
to the PivotTable's Filter field.
● Pivot charts: A pivot chart is the visual representation of a pivot table in Excel.
Pivot charts and pivot tables are connected with each other.
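What a pivot table computes can be sketched in a few lines of Python, which may help when checking Excel results. The sales records and field names below are hypothetical. The example uses region as rows, product as columns and the sum of amount as values.

```python
from collections import defaultdict

# Hypothetical sales records.
sales = [
    {"region": "North", "product": "A", "amount": 100},
    {"region": "North", "product": "B", "amount": 50},
    {"region": "South", "product": "A", "amount": 70},
    {"region": "North", "product": "A", "amount": 30},
]

# Rows = region, columns = product, values = SUM of amount.
pivot = defaultdict(lambda: defaultdict(int))
for record in sales:
    pivot[record["region"]][record["product"]] += record["amount"]

# Subtotal per region, then the grand total across regions.
subtotals = {region: sum(cols.values()) for region, cols in pivot.items()}
grand_total = sum(subtotals.values())

print(dict(pivot["North"]), subtotals, grand_total)
```

Swapping `+=` for a counter or an average would correspond to changing the "Summarize values by" setting in Excel.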
2.2.3 Other tools
● Power BI
● Tableau
3. Tips with complex data (Nowogrodzki 2020)

● Keep the raw data raw: don't manipulate it without first making a copy.
● Visualize the information to check for outliers (for example, with histograms or
box-and-whisker plots, or by calculating descriptive statistics and variation measures).
● Show your workflow: document your entire data workflow, including which version
of the data you used, the clean-up and quality-checking steps, and any processing
you ran.
● Start early: this will give you more time to analyze the data and provide insights.
V. STEP 5 - DATA VISUALIZATION

After preparing the datasets, we visualize them for better analysis, using different
types of charts and following the rules below to give readers a better experience.
1. Types
For more details about the types of data, please access this Step 5-Section 1.
Suitable types of chart for different relationships:
Relationship | Most likely graph(s) | Keywords | Maximum # of dimensions
Time series | Trend line; Column chart; Heat map; Sparklines | Change; Rise/increase; Fluctuation; Growth; Decline/decrease; Trend | 4
Part to whole | Stacked column chart; Stacked area chart; Pareto chart (for two simultaneous parts to whole) | Rate or rate of total; Percent or percentage of total; Share; “Accounts for X percent” | 4
Ranking | Sorted bar/column chart | Larger than; Smaller than; Equal to; Greater than; Less than | 4
Deviation | Line chart; Column/bar chart; Bullet graph | Plus or minus; Variance; Difference; Relative to | 4
Distribution | Box/whisker plot; Histogram | Frequency; Distribution; Range; Concentration; Normal curve, bell curve | 4
Correlation | Scatterplot; Table pane | Increases with; Decreases with; Changes with; Varies with | 6
Geospatial | Choropleth (filled gradient) map | N/A | 2
*Six data dimensions: X-axis placement, Y-axis placement, Size, Shape, Colour,
Animation (interactive only, often used to display time).
Detailed instructions can be found in the e-book Essentials of Business Analytics,
pp. 119-121.
2. Six Meta-rules for Data Visualization
● Simplicity over Complexity
The simplest chart is usually the one that communicates most clearly. Use the “not
wrong” chart – not the “cool” chart.
● Direct representation
Always directly represent the relationship you are trying to communicate. Don’t leave
it to the viewer to derive the relationship from other information.
● Single dimensionality
Generally, do not ask viewers to compare in two dimensions. Comparing differences
in length is easier than comparing differences in area.
● Use color properly
Never place color on top of color: a color is not perceived in absolute terms, and it
looks different depending on what surrounds it.
As with shape, color should carry meaning for the viewer. Hence, it should be used
consistently and in line with how people perceive color.
● Use viewers’ experience to your advantage
Do not violate the primal perceptions of your viewers. Remember, up means more and
the maximum percent level is 100%, etc.
When a chart violates these perceptions, it can confuse viewers and provoke a
negative emotional reaction.
● Represent the data story with integrity
Chart with graphical and ethical integrity. Do not lie, either by mistake or intentionally.
It is important that, at all times, the effect in the data must be accurately reflected in
the visualization.
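One way to check the last rule is to compare the effect as drawn with the effect in the data. The sketch below is only an illustration under stated assumptions: the function name `drawn_ratio` and the sales figures are hypothetical, and the check covers just one common case, a bar chart whose value axis does not start at zero.

```python
def drawn_ratio(value_a, value_b, axis_start=0.0):
    """Ratio of two bar lengths as drawn when the value axis starts at axis_start."""
    if min(value_a, value_b) <= axis_start:
        raise ValueError("axis_start must be below both values")
    return (value_a - axis_start) / (value_b - axis_start)


# Hypothetical figures: sales grew by 10%.
sales_2020, sales_2021 = 100.0, 110.0

true_ratio = sales_2021 / sales_2020                                  # effect in the data
honest = drawn_ratio(sales_2021, sales_2020, axis_start=0.0)          # bars start at zero
truncated = drawn_ratio(sales_2021, sales_2020, axis_start=90.0)      # axis cut at 90

print(f"data: {true_ratio:.2f}x, zero-based bars: {honest:.2f}x, "
      f"truncated axis: {truncated:.2f}x")
```

With a zero-based axis the bars differ by the same factor as the data; with the axis cut at 90, the second bar is drawn twice as long as the first even though the underlying increase is 10%.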
3. Create effective graphs with a clear visual of information

3.1 Avoid clutter
Remove as much as you can while still ensuring the end user gets the right insight
from your dashboard. For example:
- Remove the chart border
- Remove gridlines
- Remove data markers
- Clean up axis labels that carry no informative value
- Label data directly
- Use color consistently
[Figure: the same chart BEFORE and AFTER removing clutter]
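For readers who build charts in code, the checklist above could be captured once and reused. The sketch below is an assumption-laden example, not a prescribed setup: it maps part of the checklist to matplotlib `rcParams` keys, and `label_points` is a hypothetical helper that prepares the text for direct data labels. The block itself only uses the standard library.

```python
# Assumed mapping from the declutter checklist to matplotlib rcParams keys.
DECLUTTER_STYLE = {
    "axes.spines.top": False,    # remove chart border (top edge)
    "axes.spines.right": False,  # remove chart border (right edge)
    "axes.grid": False,          # remove gridlines
    "lines.marker": "None",      # remove data markers
}


def label_points(categories, values, fmt="{:,.0f}"):
    """Build direct data labels so readers need not look values up on an axis."""
    return [f"{category}: {fmt.format(value)}"
            for category, value in zip(categories, values)]


labels = label_points(["North", "South"], [1200, 950])
```

If matplotlib is installed, the dictionary can be applied around the plotting code with `with matplotlib.rc_context(rc=DECLUTTER_STYLE):`, and each label can be placed next to its bar or line end instead of relying on a legend.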
3.2 Highlight important elements

● Bold, italics, and underlining: Use them for titles, labels, captions, and short
word sequences to differentiate elements. Bolding is generally preferred over
italics and underlining because it adds minimal noise to the design while
clearly highlighting chosen elements.
● Case and typeface: Uppercase text in short word sequences is easily scanned,
which can work well when applied to titles, labels, and keywords. Avoid using
different fonts as a highlighting technique.
● Color is an effective highlighting technique when used sparingly and generally
in concert with other highlighting techniques (for example, bold).
● Size is another way to attract attention and signal importance.
3.3 Eliminate distractions

Here are some specific considerations to help you identify potential distractions:
● Not all data are equally important.
● When detail isn’t needed, summarize.
● Ask yourself: would eliminating this change anything?
● Push necessary, but non‐message‐impacting items to the background.
[Figure: the same chart BEFORE and AFTER eliminating distractions]
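The advice "when detail isn't needed, summarize" can be sketched as follows: categories that contribute only a small share of the total are merged into a single "Other" bucket before charting. The function name, the 5% threshold, and the sample figures are all assumptions chosen for illustration.

```python
def summarize_small_categories(totals, min_share=0.05, other_label="Other"):
    """Merge categories whose share of the grand total is below min_share."""
    grand_total = sum(totals.values())
    if grand_total <= 0:
        raise ValueError("totals must sum to a positive number")

    summary = {}
    other = 0
    for name, value in totals.items():
        if value / grand_total < min_share:
            other += value          # too small to show on its own
        else:
            summary[name] = value   # keep as a separate category
    if other:
        summary[other_label] = other
    return summary


# Hypothetical revenue by product line.
revenue = {"A": 50, "B": 40, "C": 6, "D": 3, "E": 1}
chart_data = summarize_small_categories(revenue)
print(chart_data)
```

The threshold is a judgment call: if one of the small categories is the point of the message, it should stay visible rather than be folded into "Other".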
HOW TO MAKE RECOMMENDATION
I. COMMON TRAPS DURING RECOMMENDATION PROCESS
After gaining insights from the analytic process, it is time to gather all useful
information and decide how to solve the business question. However, some common
challenges may arise while developing recommendations, including:
● Having too many insights, some of which contradict each other, so that no final
decision can be made
● Being too focused on, and biased towards, ONE specific idea, so that other
factors are ignored
● Lacking critical thinking: focusing only on the pros and not the cons, or only on
short-term rather than long-term solutions
● Lacking balance between practicality and creativity
II. DISNEY CREATIVE STRATEGY: FROM DREAMER, REALIST TO CRITIC
Disney Creative Strategy is a framework modeled on Walt Disney's creative process
and formalized by Robert Dilts. It aims to create solutions to a problem by balancing
reality and creativity. This 3-stage framework helps teams avoid being overwhelmed
by too many ideas and avoid being either too practical or too dreamy.
1. First stage: The Dreamer - What?
This phase allows all team members to list out all creative ideas that they think of with
no restrictions, building a pool of innovative ideas.
● List out ALL insights and problems we identified from the analytic process
● For each issue, what is our end goal or measure of success?
● What is the solution for each problem? List out ALL of the solutions, even the
craziest ones! Do not just stick to ONE specific way of thinking.
2. Second stage: The Realist - How?

In the next phase, the team has to put practicality and logic first. The goal is to
turn imaginative ideas into a SMART (specific, measurable, actionable, relevant,
timed) plan. Starting from the large pool of creative thoughts from The Dreamer
stage, the team investigates the feasibility of each one:
● How can this idea be applied in reality?
● How can this idea be assessed? What are the metrics?
● What is the specific action plan for this idea?
● When and where can we apply this idea?
Then, teams can eliminate impractical ideas.
3. Third stage: The Critic - Why?

The last phase is the time to apply critical thinking to your ideas, identifying all the
drawbacks and obstacles each idea may encounter when put into practice.
Questions the team can ask include:
● Why can't we execute the idea?
● What are the weaknesses of this idea?
● What is missing or redundant?
● Do the cons outweigh the pros?
● What can be added to minimize or get rid of the drawbacks?
SLIDES PRESENTATION
In this era of information overload, it is important for presenters to communicate
their ideas in a succinct, comprehensive, and structured way, backed up by arguments
and data.
I. THE PYRAMID PRINCIPLE

The Pyramid Principle can be applied to the structure of a slide presentation: the core
statement, the storyline, is put at the top of the pyramid, in other words at the
beginning of a slide. It is followed by the arguments and then, at the bottom of the
pyramid, by the supporting data that prove the arguments.
1. The Storyline
The storyline acts as a complete, strategic conclusion for each slide, summarizing
all the key messages the presenter wants to deliver. The ultimate goal of writing a
set of storylines is to build a coherent story for your presentation and to enable
the presenter to give an “elevator pitch”. Specifically, readers should be able to
grasp your whole presentation just by reading the storyline of each slide.
[Figure: example of a storyline]
2. The Argument
After presenting your conclusion, it is necessary to provide critical arguments to
support your recommendation. In this part, you should include the findings based on
the break-down arguments you have put on your issue tree.
[Figure: examples of arguments]
3. The Support
Lastly, supporting evidence should be included to back up the arguments and
conclusion above convincingly. Data visualizations, graphs, or charts that express
your identified insights, or further explanation, can be presented here.
To properly present a chart, the title with the conclusion for the chart should be placed
at the top. Data or chart explanations can be included at the bottom of the chart. You
can highlight significant numbers on the chart with different shapes or colors.
[Figure: example of support]
II. IMPORTANT NOTES WHEN APPLYING THE PYRAMID PRINCIPLE
1. Avoid using full sentences
Use short, succinct phrases instead of full sentences. The one exception is the
headline, as it has to deliver the message of the whole slide.
2. Explain one idea or one group of ideas on each slide

Make full use of the issue tree to break the business problem into smaller points,
and present only one point at a time. This helps readers stay focused on your
explanation and not be overwhelmed by many ideas at once.
3. Stay concise, consistent, and precise

Use a consistent slide structure throughout your proposal, with three parts: storyline,
arguments, and support. Try not to mix up the structure, as this makes it difficult for
readers to follow the proposal.
Only include the arguments and supporting data/charts/graphs that significantly
impact the conclusion (the storyline).
Example of a mixed structure:
Slide 1 - Storyline at the top, argument, support
Slide 2 - Half of the storyline, support, storyline at the bottom
REFERENCE
Alvarado, J n.d., ‘Descriptive, Predictive, Prescriptive, and Diagnostic Analytics: A Quick Guide’, sigma, viewed 27 September 2021, <https://www.sigmacomputing.com/blog/descriptive-predictive-prescriptive-and-diagnostic-analytics-a-quick-guide/>.
Amran, A 2021, ‘Issue trees’, untools, blog, viewed 27 September 2021, <https://untools.co/issue-trees?fbclid=IwAR36AaqX7_oQAEA6svrpvfZMqpl1h37JQjB707RrkrXPpKyD-BISLDcwXx4>.
Awware 2021, MECE framework: general principles and examples in marketing, blog, viewed 18 September 2021, <https://awware.co/blog/mece-framework>.
Dilmegani, C 2021, ‘Data Cleaning in 2021: What it is, Steps to Clean Data & Tools’, AIMultiple, viewed 24 September, <https://research.aimultiple.com/data-cleaning/>.
Elmansy, R 2016, ‘Disney’s Creative Strategy: The Dreamer, The Realist and The Critic’, Designorate, viewed 27 September 2021, <https://www.designorate.com/disneys-creative-strategy/>.
Gartner n.d., ‘Business Analytics’, Information Technology Glossary, viewed 27 September 2021, <https://www.gartner.com/en/information-technology/glossary/business-analytics>.
Knaflic, CN 2015, Storytelling with Data: A data visualization guide for business professionals, John Wiley & Sons, Inc., Hoboken, New Jersey, <https://books.google.com.vn/books?id=retRCgAAQBAJ&lpg=PR9&ots=Kp9GJnSu6W&dq=%3Fstorytelling%20with%20data&lr&pg=PR9#v=onepage&q=?storytelling%20with%20data&f=false>.
Kowalewski, M 2020, ‘Data Cleaning In 5 Easy Steps + Examples’, Iterators, viewed 24 September, <https://www.iteratorshq.com/blog/data-cleaning-in-5-easy-steps/>.
LucidChart n.d., What is a Decision Tree Diagram, LucidChart, viewed 27 September 2021, <https://www.lucidchart.com/pages/decision-tree>.
MBA Crystal Ball n.d., MECE Framework McKinsey, blog, viewed 23 September 2021, <https://www.mbacrystalball.com/blog/strategy/mece-framework/>.
Microsoft n.d., Get started with Power BI Desktop, Microsoft, viewed 27 September 2021, <https://docs.microsoft.com/en-us/power-bi/fundamentals/desktop-getting-started>.
Microsoft n.d., What is Power BI?, Microsoft, viewed 27 September 2021, <https://powerbi.microsoft.com/en-us/what-is-power-bi/>.
Nowogrodzki, A 2020, ‘Eleven tips for working with large data sets’, Nature, 13 January, viewed 27 September 2021, <https://www.nature.com/articles/d41586-020-00062-z>.
Pochiraju, B & Seshadri, S 2019, Essentials of Business Analytics: An Introduction to the Methodology and Its Applications, vol. 264, Springer, <https://link.springer.com/content/pdf/10.1007/978-3-319-68837-4.pdf>.
Presentation Load 2020, ‘Present Better with the Pyramid Principle’, Presentation Load, 7 July, viewed 27 September 2021, <https://blog.presentationload.com/present-better-pyramid-principle/>.
Sánchez, P 2016, ‘Data cleansing & data transformation’, QuantDare, viewed 24 September, <https://quantdare.com/data-cleansing-and-transformation/>.
StudiousGuy n.d., MECE Framework, StudiousGuy, viewed 27 September 2021, <https://studiousguy.com/mece-framework/>.
Tableau 2021, ‘Guide To Data Cleaning: Definition, Benefits, Components, And How To Clean Your Data’, Tableau.com, viewed 24 September, <https://www.tableau.com/learn/articles/what-is-data-cleaning>.
Tableau n.d., ‘Turn insights into action with the Tableau product suite’, Tableau, viewed 27 September 2021, <https://www.tableau.com/products>.
Tableau n.d., ‘Tutorial: Get Started with Tableau Desktop’, Tableau, viewed 27 September 2021, <https://help.tableau.com/current/guides/get-started-tutorial/en-us/get-started-tutorial-home.htm>.
Wang, Z, Lu, F, Zhao, X & Zhang, Q 2020, MECE Guided Systematic Analysis and Fussy
Assessment of Major Sports Event Broadcasting Center Service Quality—Take the 7th
Military World Games as an Example, Springer, Cham, viewed 18 September 2021,
<https://link-springer-com.ezproxy.lib.rmit.edu.au/chapter/10.1007/978-3-030-55506-1_20#citeas>.
RMIT BUSINESS ANALYST CHAMPION 2021

TRAN LIEN HUONG
Email: s3866467@rmit.edu.vn
Phone number: 0333 112 680

TRAN THUY MINH NHIEN
Email: s3819824@rmit.edu.vn
Phone number: 0906 674 553

Fanpage: https://www.facebook.com/RBAChampion
LinkedIn: https://www.linkedin.com/company/rba-champion
Email: rba.champion@gmail.com
