
EXCEL DATA ANALYTICS

WHY DO WE NEED DATA?
PALEXY REPORT
DATA ANALYTICS – FLOW

Business Problem Definition
- Activities: Decision variable; Initial predictions; Scope & constraint
- Questions: What is being analyzed & predicted? Are there any quick insights? Is the data available?

Data Preparation
- Activities: Data collection; Cleaning & transformation
- Questions: Where is the data? Is it in a standard form to analyze? Is the data available?

Descriptive Step
- Activities: Visualization; Statistical
- Questions: Does visualization show any patterns? Are the insights valuable?

Diagnostic Step
- Activities: Drill up; Drill down
- Questions: Does the data show any relationship? Time series analytics. What are the patterns and relationships?

Predictive
- Activities: Define algorithm; Training model; Model deployment
- Questions: What algorithm do we choose? Model validation & benchmarking

Prescriptive
- Activities: Recommendation; Optimization
- Questions: Apply to business to optimize the business operation
• Nominal: no order (qualitative)
• Ordinal: ordered (qualitative)
• Interval: ordered, but with no true zero (quantitative). Example: IQ
• Ratio: ordered, with a true zero (quantitative)
TITANIC DATASET
• PassengerId is the unique id of the row; it has no effect on Survived.
• Survived is binary (0 or 1):
• 1 = Survived
• 0 = Not Survived
• Pclass (Passenger Class) is the socio-economic status of the passenger. It is a categorical ordinal feature with 3 unique values (1, 2 or 3):
• 1 = Upper Class
• 2 = Middle Class
• 3 = Lower Class
• Name, Sex and Age are self-explanatory.
• SibSp is the total number of the passenger's siblings and spouses.
• Parch is the total number of the passenger's parents and children.
• Ticket is the ticket number of the passenger.
• Fare is the passenger fare.
• Cabin is the cabin number of the passenger (the room on the ship).
• Embarked is the port of embarkation, i.e. where the passenger boarded. It is a categorical feature with 3 unique values (C, Q or S):
• C = Cherbourg
• Q = Queenstown
• S = Southampton
EXERCISE
• Read through the meaning of each column in the Titanic dataset. What type of data does each column hold?
• What can be analyzed from this data?
• Statistics is a range of procedures for gathering, organizing, analyzing and presenting quantitative and qualitative data
• Descriptive: describe the data
• Inferential: infer or estimate something about the population from a sample
CATEGORICAL
Class         Count of Data
300-400       497
400-500       503
Grand Total   1000


NUMERICAL: CENTER – MEAN, MEDIAN, MODE

• Mean: sensitive to extreme values; used for normally distributed data
• Median: not sensitive to extreme values; used for skewed data
• Mode: the most frequently occurring value; a dataset may have no mode, one mode, or many
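A minimal Python sketch of the comparison above, using only the standard library. The salary figures are hypothetical illustration data, not taken from the slides. The point is that one extreme value pulls the mean a long way while the median barely moves.

```python
from statistics import mean, median, multimode


def center_summary(values):
    """Return the three measures of center for a list of numbers."""
    return {
        "mean": mean(values),
        "median": median(values),
        # multimode returns a list, so it can hold one mode or several
        "modes": multimode(values),
    }


# Hypothetical salaries (illustrative data only).
salaries = [300, 320, 320, 350, 400]
with_outlier = salaries + [5000]  # add one extreme value

print(center_summary(salaries))
print(center_summary(with_outlier))
```

With the extreme value added, the mean jumps from 338 to 1115, while the median only moves from 320 to 335.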

• CENTER
• VARY
• SHAPE
• OUTLIERS
VARIANCE

• VARIANCE LARGE: A LOT OF VARIABILITY IN THE GROUP
• VARIANCE SMALL: LITTLE VARIABILITY IN THE GROUP

• A sample is a set of data extracted from the entire population. The variance calculated from a sample is called the sample variance.
STANDARD DEVIATION
• CONVERTS THE VARIABILITY INTO THE SAME UNITS AS THE ORIGINAL MEASURE
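As a rough illustration of variance and standard deviation, the sketch below uses Python's standard `statistics` module on a small made-up list of scores (the data is an assumption for illustration). It contrasts the population formulas, which divide by n, with the sample formulas, which divide by n - 1.

```python
from statistics import pstdev, pvariance, stdev, variance


def spread_summary(values):
    """Population and sample measures of spread for the same data."""
    return {
        "population_variance": pvariance(values),  # divides by n
        "sample_variance": variance(values),       # divides by n - 1
        "population_sd": pstdev(values),           # square root of population variance
        "sample_sd": stdev(values),                # square root of sample variance
    }


# Illustrative scores (not from the slides); their mean is 5.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

summary = spread_summary(scores)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

The standard deviation is in the same units as the scores themselves, whereas the variance is in squared units.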
SHAPE

• BELL SHAPE / NORMAL DISTRIBUTION SHAPE

• roughly 68% of the data will be within one standard deviation around the mean (“around” meaning the
range from one standard deviation below the mean to one standard deviation above the mean)
• roughly 95% of the data will be within two standard deviations around the mean
• roughly 99.7% of the data will be within three standard deviations around the mean
Question: What percentage of American men are between 63 and 75 inches tall?

Answer:
• SD = 3 inches
• Mean = 69 inches
• 69 − 2 × 3 = 63
• 69 + 2 × 3 = 75
• About 95% of all men will be between 63 and 75 inches tall, because that range covers two standard deviations around the mean. In other words, only about 5% will be either shorter than 63 inches or taller than 75 inches.
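The height answer can be checked numerically. This is a small sketch that assumes heights follow an exact normal distribution with the mean and SD given above; `share_between` is a helper name introduced here for illustration.

```python
from statistics import NormalDist

# Assumption: heights are exactly normal with mean 69 and SD 3 (inches).
heights = NormalDist(mu=69, sigma=3)


def share_between(dist, low, high):
    """Probability that a value from dist falls between low and high."""
    return dist.cdf(high) - dist.cdf(low)


# One, two and three standard deviations around the mean.
for k in (1, 2, 3):
    low, high = 69 - k * 3, 69 + k * 3
    print(f"within {k} SD ({low} to {high}): {share_between(heights, low, high):.1%}")
```

The exact two-SD share is slightly above 95%, which is why the rule is stated as "roughly 95%".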
SKEWNESS
By Chebyshev's theorem, we can be confident that at least 75% of any dataset lies within two standard deviations on either side of the mean.
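The 75% figure comes from Chebyshev's bound 1 - 1/k², which holds for any distribution with k > 1 standard deviations. A minimal sketch (the function name is chosen here for illustration):

```python
def chebyshev_lower_bound(k):
    """Minimum share of any dataset within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return 1 - 1 / k**2


for k in (2, 3, 4):
    print(f"within {k} SD: at least {chebyshev_lower_bound(k):.1%}")
```

Unlike the 68/95/99.7 rule, this bound does not assume a bell-shaped distribution, so it is weaker (75% rather than 95% for two standard deviations).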
SD = 15 grades
KURTOSIS
QUANTILES
BOX PLOT
Survey: what proportion of customers drink coffee?
POPULATION: 100,000 total customers
SAMPLE: 5,000 responses
STATISTIC: 73% of respondents say they drink coffee
PARAMETER: Proportion of all 100,000 customers that drink coffee

DESCRIPTIVE: 73% of the 5,000 responses say they drink coffee
INFERENTIAL: What can we conclude about the proportion in the population?
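One common way to move from the sample statistic to a statement about the population is a confidence interval for a proportion. The sketch below uses the normal approximation and assumes the 5,000 responses are a simple random sample. It ignores the finite-population correction, and the 95% level is an assumption, not something stated on the slide.

```python
from math import sqrt
from statistics import NormalDist


def proportion_confidence_interval(p_hat, n, confidence=0.95):
    """Normal-approximation confidence interval for a population proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    standard_error = sqrt(p_hat * (1 - p_hat) / n)
    margin = z * standard_error
    return p_hat - margin, p_hat + margin


low, high = proportion_confidence_interval(p_hat=0.73, n=5000)
print(f"95% CI for the population proportion: {low:.3f} to {high:.3f}")
```

Under these assumptions the interval is roughly 71.8% to 74.2%, so the sample pins down the population proportion to within about 1.2 percentage points.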
HYPOTHESIS TESTING
Objectives:
1) Comparison: Is there any difference between two datasets/groups?
2) Relationship: Is there any connection between two groups/columns/variables?

Null Hypothesis (H0)
- What you observed is nothing new
Alternative Hypothesis (H1)
- What you observed is new and important

STEPS:
1. Formulate H0 and H1
2. Gather data
3. Choose the type of test
4. Choose the confidence level
5. Decide whether to reject or fail to reject H0
1. Z-Test (the sample is assumed to be normally distributed; the population mean and population standard deviation are known)
Null: Mean of sample = Mean of population
Alternate: Mean of sample ≠ Mean of population
2. T-Test (the sample is assumed to be normally distributed; the population standard deviation is unknown and is estimated from the sample)
Used to compare the means of at most two samples (for three or more, use ANOVA).
Types of T-Test: One-sample, Independent and Paired.
Large t-score = the groups are different
Small t-score = the groups are similar
Null: Mean of one sample = Mean of the other sample
Alternate: Mean of one sample ≠ Mean of the other sample
Independent t-Test: used to compare two population means when the sample sizes are small and the population variances are unknown (equal or unequal).
Paired t-Test: used to compare two population means when there is a physical reason to pair the data and the two sample sizes are equal.
3. F-Test: used to compare two variances (and hence two standard deviations); applies to all sample sizes.
4. ANOVA Test (Analysis of Variance). Assumptions: random sample, homogeneity of variance, normality. Two types: One-Way ANOVA and Multi-Way ANOVA.
Used to compare three or more samples with a single test.
Null: All sample means are equal
Alternate: At least one sample mean differs from the others
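To make the "large t-score vs. small t-score" idea concrete, here is a minimal sketch of the independent two-sample t statistic in its unequal-variance (Welch) form, using only the standard library. The three groups are made-up data. The sketch computes the statistic only, not a p-value, since the standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, variance


def welch_t_statistic(a, b):
    """t statistic for two independent samples (unequal-variance form)."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Illustrative groups (not from the slides).
group_a = [10, 12, 14, 16, 18]
group_b = [11, 13, 15, 17, 19]  # close to group_a
group_c = [30, 32, 34, 36, 38]  # far from group_a

print(welch_t_statistic(group_a, group_b))  # small magnitude: groups look similar
print(welch_t_statistic(group_a, group_c))  # large magnitude: groups look different
```

The sign only says which group has the larger mean; it is the magnitude of t that is compared against a critical value.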
HYPOTHESIS TESTING

A type 1 error occurs when a null hypothesis is rejected although in reality it is true.
You think you have found something new, but in fact you have not.
A type 2 error occurs when we fail to reject a null hypothesis that is actually false.
You don't realize you have found something special when in fact you have.

Alpha (α) is the probability of a type 1 error.
Beta (β) is the probability of a type 2 error.
The type 1 error probability α is set by the confidence level you choose: α = 1 − confidence level. A test with a 95% confidence level has a 5% chance of a type 1 error when the null hypothesis is true.
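The link between α and type 1 errors can be seen in a small simulation. This sketch repeatedly draws samples from a population where the null hypothesis is true and counts how often a two-tailed z-test rejects it anyway. The sample size, number of trials and random seed are arbitrary choices made for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def type_one_error_rate(trials=2000, n=30, alpha=0.05, seed=0):
    """Share of true-null experiments that a two-tailed z-test wrongly rejects."""
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    rejections = 0
    for _ in range(trials):
        # H0 is true here: the population really has mean 0 and SD 1.
        sample = [rng.gauss(0, 1) for _ in range(n)]
        z = (mean(sample) - 0) / (1 / sqrt(n))
        if abs(z) > critical:
            rejections += 1  # a false alarm, i.e. a type 1 error
    return rejections / trials


rate = type_one_error_rate()
print(f"observed type 1 error rate: {rate:.3f}")
```

The observed rate should land close to the chosen α of 0.05, with some run-to-run noise.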

You start an A/B test, running the control version (A) of your website against a test version (B) that contains the image.
After 5 days, (B) outperforms the control version with a staggering 25% increase in conversions at an 85% level of confidence.
You stop the test and implement the image in your banner. However, after a month, you notice that your month-on-month conversions have actually decreased.
That's because you encountered a type 1 error: your test version didn't actually beat your control version in the long run.
A real-life example of a type 2 error
Let’s say that you run an e-commerce store that sells high-end, complicated
hardware for tech-savvy customers. In an attempt to increase conversions,
you have the idea to implement an FAQ below your product page.

“Should we add an FAQ at the bottom of the product page”? You launch an
A/B test to see if the variation (B) could outperform your control version (A).

After a week, you do not notice any difference in conversions: both versions
seem to convert at the same rate and you start questioning your assumption.
Three days later, you stop the test and keep your product page as it is.

At this point, you assume that adding an FAQ to your store didn’t have any
effect on conversions.

Two weeks later, you hear that a competitor has implemented an FAQ at the
same time and observed tangible gains in conversions. You decide to re-run
the test for a month in order to get more statistically relevant results based on
an increased level of confidence (say 95%).
After a month – surprise – you discover positive gains in conversions for the
variation (B). Adding an FAQ at the bottom of your product page has indeed
brought your company more sales than the control version.
That’s right – your first test encountered a type 2 error!
APPLY THE CENTRAL LIMIT THEOREM
• ASSUMPTION:
• Sample is large
• Normally distributed
EXAMPLE
A social scientist believes that people in this area have a higher IQ than average.
• H0: µ = 100
• H1: µ > 100
• α = 0.05
After the survey:
• Sample size: n = 256
• Average IQ of the sample: 102.5
• Reject H0?
Assumptions:
• The sample is large
• The population is normally distributed
• σ = 16
Sampling distribution of the sample mean:
• µx̄ = µ = 100
• σx̄ = σ/√n = 16/√256 = 1
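The IQ example can be worked through as a right-tailed z-test. The sketch below uses the numbers given above; the function name and return format are introduced here for illustration.

```python
from math import sqrt
from statistics import NormalDist


def right_tailed_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Return (z, p_value, reject_h0) for H0: mu = mu0 against H1: mu > mu0."""
    standard_error = sigma / sqrt(n)           # 16 / sqrt(256) = 1
    z = (sample_mean - mu0) / standard_error   # distance from mu0 in standard errors
    p_value = 1 - NormalDist().cdf(z)          # area in the right tail
    return z, p_value, p_value < alpha


z, p_value, reject = right_tailed_z_test(sample_mean=102.5, mu0=100, sigma=16, n=256)
print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject}")
```

Here z = 2.5, which is beyond the 5% right-tail critical value of about 1.645, so H0 is rejected at α = 0.05.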
Z-TEST
• High p-values: your data are likely under a true null
• Low p-values: your data are unlikely under a true null
Left-Tailed Test
H1: parameter < value
Notice the inequality points to the left.
Decision Rule: Reject H0 if t.s. < c.v. (t.s. = test statistic, c.v. = critical value)

Right-Tailed Test
H1: parameter > value
Notice the inequality points to the right.
Decision Rule: Reject H0 if t.s. > c.v.

Two-Tailed Test
H1: parameter ≠ value
Another way to write "not equal" is "< or >".
Notice the inequality points to both sides.
Decision Rule: Reject H0 if t.s. < c.v. (left) or t.s. > c.v. (right)
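The three decision rules can be written as one small function. This sketch assumes a z-test, so the critical values come from the standard normal distribution; the `tail` argument and its string values are naming choices made here for illustration.

```python
from statistics import NormalDist


def reject_h0(test_statistic, alpha, tail):
    """Apply the left-, right- or two-tailed decision rule for a z-test."""
    std_normal = NormalDist()
    if tail == "left":
        return test_statistic < std_normal.inv_cdf(alpha)
    if tail == "right":
        return test_statistic > std_normal.inv_cdf(1 - alpha)
    if tail == "two":
        # alpha is split evenly between the two tails
        critical = std_normal.inv_cdf(1 - alpha / 2)
        return test_statistic < -critical or test_statistic > critical
    raise ValueError("tail must be 'left', 'right' or 'two'")


print(reject_h0(-1.8, 0.05, "left"))   # compared with about -1.645
print(reject_h0(-1.8, 0.05, "two"))    # compared with about -1.96 and +1.96
print(reject_h0(2.5, 0.05, "right"))   # compared with about +1.645
```

The same test statistic can lead to different decisions depending on the tail: a two-tailed test splits α across both sides, so its critical values sit further from zero.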
CORRELATION
The more time boys spend playing video games, the less time they spend exercising.

CORRELATION ≠ CAUSATION

LINEAR REGRESSION
Y = ax + b
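A minimal sketch of fitting Y = ax + b by ordinary least squares and measuring the correlation, written out by hand with the standard library. The gaming and exercise hours are invented illustration data, not figures from the slides.

```python
from math import sqrt
from statistics import mean


def fit_line(xs, ys):
    """Least-squares slope a and intercept b for y = a*x + b."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = y_bar - a * x_bar
    return a, b


def pearson_r(xs, ys):
    """Pearson correlation coefficient between xs and ys."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


# Hypothetical weekly hours (illustrative data only).
gaming_hours = [1, 2, 3, 4, 5]
exercise_hours = [9, 8, 6, 4, 3]

a, b = fit_line(gaming_hours, exercise_hours)
r = pearson_r(gaming_hours, exercise_hours)
print(f"exercise = {a:.2f} * gaming + {b:.2f}, r = {r:.3f}")
```

A strongly negative r describes how closely the two variables move together in this data; it does not show that gaming causes less exercise.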
DATA VISUALIZATION IN EXCEL
VISUALIZATION PRINCIPLES
MISTAKES
DATA STORYTELLING
STEP 1: Data source
STEP 2: Set KPI
STEP 3: Choose chart to visualize
STEP 4: Set filter
STEP 5: Present
ADD DATA FROM CONNECTIONS
BUILDING SUNBURST CHART AND TREE MAPS
INSERT VISUALIZATION
TOP 10
INSERT SLICER
INSERT TIMELINE
CREATE MACRO TO CLEAR FILTERS
EXCEL STATISTICS
ANALYSIS TOOLPAK
MEAN-MEDIAN-MODE
NORMAL DISTRIBUTION
MY ADD-INS
CHI SQUARE TEST EXCEL
TESTING EXCEL SKILLS
https://www.isograd.com/EN/freetestselection.php
VLOOKUP & INDEX MATCH
Task                                                              INDEX MATCH   VLOOKUP
Look up from left to right                                        Easy          Easy
Look up from right to left                                        Easy          Hard
Look up when the lookup column is the first column of the table   Easy          Easy
Look up when the lookup column is the last column of the table    Easy          Hard
PIVOT TABLES
CALCULATE

=CALCULATE(COUNT([CustomerKey]),Customers[Gender] = "M")
EXCEL FORMAT CELLS
EXCEL FUNCTION – DAY OF THE MONTH
EXCEL FUNCTION – DAY NAME OF THE WEEK
EXCEL FUNCTION – DAY NUMBER OF WEEK
EXCEL FUNCTION – DAY NUMBER OF YEAR
EXCEL FUNCTION – WEEK NUMBER OF YEAR
EXCEL FUNCTION – MONTH NAME
EXCEL FUNCTION – MONTH NUMBER
EXCEL FUNCTION – CALENDAR QUARTER
EXCEL FUNCTION – CALENDAR YEAR
SUMIF
COUNTIF
EXCEL VBA
INSERT AND FORMAT TEXT

• Same structure => does not have header


INSERT AND FORMAT TEXT

• Personal Macro Workbook: macros saved there are available in every workbook on your computer

VBA RECORDER
RUN MACRO – CTRL + SHIFT + J
ALT + F11
ERROR: SAVE FILE MACRO VBA EXCEL
FIX THE NAME
VBA CONCEPTS
INSERT MODULE – STORE VBA CODE
ADD PROCEDURE
RUN STEP BY STEP
Run step by step: F8
Stop running step by step: Reset
VARIABLE TYPE
VARIABLE NAME
EXPANDING A MACRO WITH THE IF STATEMENT
OFFSET

Move down 4 rows and right 2 columns


FOR AND FOR NEXT STATEMENTS
DO WHILE AND DO UNTIL STATEMENTS
DO WHILE LOOP
FIND NUMBER NOT IN FORMULA
REVISE MACRO SHORTCUT KEY
COMMUNICATION

• Communicating and sharing your work
• Questions: Who's it for? What needs do they have? What questions do they have?
WHO’S IT FOR?
PEOPLE ON YOUR TEAM      PEOPLE OUTSIDE YOUR TEAM
Managers                 Partners
Teammates                Clients
….                       ….
TECHNICAL THINGS         WHY DO YOU NEED US TO DO THIS PROJECT?

BOSS, SENIOR, LEAD
GET YOUR BOSS INVOLVED
CONFLICT WITH YOUR TEAMMATES
PEOPLE OUTSIDE YOUR TEAM
BREAK IT DOWN
DATABASE
POWER PIVOT

[Total Sales] = SUM(Sales[ExtendedAmount])


In the Power Pivot window, select Home, From Database (see #1 below), From Access (#2).
Inserting a New Pivot Table
SUM(), COUNT(), COUNTROWS(), MIN(), MAX(), COUNTBLANK(), and DIVIDE()
CONDITIONAL FORMATTING
SAME PERIOD LAST YEAR
CREATE 1 SLICER
