ANALYTICS
WHY WE
NEED DATA?
PALEXY REPORT
DATA ANALYTICS – FLOW
DATASET
• Parch feature is the total number of the passenger's parents and children.
• Ticket feature is the ticket number of the passenger.
• Fare feature is the passenger fare (ticket price).
• Cabin feature is the cabin number of the passenger (room number on the ship).
• Embarked is the port of embarkation (where the passenger boarded). It is a categorical feature with 3 unique values (C, Q or S):
• C = Cherbourg
• Q = Queenstown
• S = Southampton
EXERCISE
• Read through the meaning of the columns in the Titanic dataset. What is the data type of each column?
• What can be analyzed from this data?
• Statistics is a range of procedures for gathering, organizing, analyzing and presenting quantitative and qualitative data
• Descriptive: describe the data
• Inferential: infer or estimate something about a population from a sample
CATEGORICAL
Class      Count of Data
300-400    497
400-500    503
• CENTER
• VARY
• SHAPE
• OUTLIERS
VARIANCE
• BELL SHAPE/NORMAL
DISTRIBUTION SHAPE
• roughly 68% of the data will be within one standard deviation around the mean (“around” meaning the
range from one standard deviation below the mean to one standard deviation above the mean)
• roughly 95% of the data will be within two standard deviations around the mean
• roughly 99.7% of the data will be within three standard deviations around the mean
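The three percentages above can be checked numerically. The sketch below assumes the data follow an exact normal distribution; the helper name share_within is hypothetical and only used for illustration.

```python
import math

def share_within(k: float) -> float:
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    # For a normal variable, P(|X - mean| <= k * sd) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {share_within(k):.2%}")
```

Real datasets are only approximately normal, so the observed shares will usually differ a little from these values.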
Question: What percentage of
American men are between 63 and
75 inches tall?
Answer:
• Mean = 69 inches
• SD = 3 inches
• 69 – 2 × 3 = 63
• 69 + 2 × 3 = 75
• About 95% of all men will be between 63 and 75 inches tall. In other words, only about 5% will be either shorter than 63 inches or taller than 75 inches.
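The height answer can be reproduced with Python's standard library. This is a minimal sketch that assumes heights are normally distributed with the mean (69 inches) and SD (3 inches) given in the example; the variable names are illustrative.

```python
from statistics import NormalDist

# Assumed model: heights ~ Normal(mean=69, sd=3), as stated in the example.
heights = NormalDist(mu=69, sigma=3)

# Share of men between 63 and 75 inches, i.e. within two SDs of the mean.
share_between = heights.cdf(75) - heights.cdf(63)
print(f"between 63 and 75 inches: {share_between:.1%}")
```

The exact figure is slightly above 95%, which is why the rule of thumb says "roughly".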
SKEWNESS
By Chebyshev’s theorem, we can be confident that at least 75% of any dataset lies within two standard deviations on either side of the mean, whatever the shape of its distribution
SD = 15 grades
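The 75% figure comes from the general Chebyshev bound: at least 1 − 1/k² of the data lies within k standard deviations of the mean. A small sketch of that formula follows; the function name chebyshev_lower_bound is hypothetical.

```python
def chebyshev_lower_bound(k: float) -> float:
    """Minimum share of any dataset within k standard deviations of the mean."""
    if k <= 1:
        # For k <= 1 the bound is zero or negative and says nothing useful.
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k**2

for k in (2, 3):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.1%} of the data")
```

Unlike the 68/95/99.7 rule, this bound does not assume a normal distribution, which is why it is weaker.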
KURTOSIS
QUANTILES
BOX PLOT
Survey: what proportion of customers drink coffee?
POPULATION: 100,000 total customers
SAMPLE: 5,000 responses
STATISTIC: 73% say they do drink coffee
PARAMETER: Proportion of all 100,000 customers that drink coffee
DESCRIPTIVE: 73% of the 5,000 responses say they drink coffee
INFERENTIAL: What can we conclude about the proportion in the population?
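One common way to move from the sample statistic to the population parameter is a confidence interval for a proportion. The sketch below uses the normal approximation and assumes the 5,000 responses are a simple random sample; it ignores the finite population correction. The function name proportion_ci is hypothetical.

```python
import math
from statistics import NormalDist

def proportion_ci(p_hat: float, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Normal-approximation confidence interval for a population proportion."""
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

low, high = proportion_ci(p_hat=0.73, n=5000)
print(f"95% CI for the share of coffee drinkers: {low:.3f} to {high:.3f}")
```

With a sample this large the interval is narrow, roughly 73% plus or minus 1.2 percentage points.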
HYPOTHESIS TESTING
Objectives:
1) Comparison: Is there any difference between two datasets/groups?
2) Relationship: Is there any connection between two groups/columns/variables?
STEPS:
1. Formulate H0 and H1
2. Gather data
3. Choose the type of test
4. Choose the confidence level
5. Decide whether to reject or fail to reject the H0 formulated in Step 1
1. Z-Test (the sample is assumed to be normally distributed; the population mean and population standard deviation are known)
Null : Mean of sample = Mean of population
Alternate : Mean of sample ≠ Mean of population
2. T-Test (the sample is assumed to be normally distributed; the population standard deviation is unknown and is estimated from the sample)
Used to compare the means of at most two samples (for three or more, we use ANOVA).
Types of T-Test : One-sample, Independent and Paired.
Large t-score (in absolute value) = the groups are likely different
Small t-score = the groups are likely similar
Null : Mean of one sample = Mean of other sample
Alternate : Mean of one sample ≠ Mean of other sample
Independent t-Test: This test is used to compare two population means when the sample sizes are small and the population variances are unknown (whether equal or unequal).
Paired t-Test: This test is used to compare two population means when there is a physical reason to pair the data and the two sample sizes are equal.
3. F-Test: This test is used to compare two variances (equivalently, two standard deviations) and applies for all sample sizes.
4. ANOVA Test (Analysis of Variance). Assumptions: random sample, homogeneity of variance, normality. Two types: One-Way ANOVA and Multi-Way ANOVA.
Used to compare multiple (three or more) samples with a single test rather than multiple pairwise tests; it can be used to compare datasets as well.
Null : All group means are equal
Alternate : At least one group mean differs from the others
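To make the t-score and the ANOVA comparison concrete, the sketch below computes an independent (Welch, unequal-variance) t statistic and a one-way ANOVA F statistic by hand. The three groups are made-up numbers and the function names welch_t and anova_f are hypothetical; in practice a statistics library would also supply the p-values.

```python
from statistics import mean, variance

def welch_t(a: list[float], b: list[float]) -> float:
    """Independent two-sample t statistic without assuming equal variances."""
    standard_error = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / standard_error

def anova_f(*groups: list[float]) -> float:
    """One-way ANOVA F statistic: between-group variance over within-group variance."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical data for illustration only.
group_a, group_b, group_c = [1, 2, 3], [2, 3, 4], [5, 6, 7]
t_score = welch_t(group_a, group_c)
f_score = anova_f(group_a, group_b, group_c)
print(f"t (A vs C) = {t_score:.3f}, F (A, B, C) = {f_score:.3f}")
```

A t-score far from zero or a large F suggests the group means differ; whether the difference is significant still depends on the degrees of freedom and the chosen confidence level.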
HYPOTHESIS TESTING
A real-life example of a type 1 error
You start an A/B test, running the control version (A) of your website against a test version (B) that contains the image.
After 5 days, (B) outperforms the control version with a staggering 25% increase in conversions at an 85% level of confidence.
You stop the test and implement the image in your banner. However, after a month, you notice that your month-on-month conversions have actually decreased.
That’s because you’ve encountered a type 1 error (a false positive): your test version didn’t actually beat your control version in the long run.
A real-life example of a type 2 error
Let’s say that you run an e-commerce store that sells high-end, complicated
hardware for tech-savvy customers. In an attempt to increase conversions,
you have the idea to implement an FAQ below your product page.
“Should we add an FAQ at the bottom of the product page”? You launch an
A/B test to see if the variation (B) could outperform your control version (A).
After a week, you do not notice any difference in conversions: both versions
seem to convert at the same rate and you start questioning your assumption.
Three days later, you stop the test and keep your product page as it is.
At this point, you assume that adding an FAQ to your store didn’t have any
effect on conversions.
Two weeks later, you hear that a competitor has implemented an FAQ at the
same time and observed tangible gains in conversions. You decide to re-run
the test for a month in order to get more statistically relevant results based on
an increased level of confidence (say 95%).
After a month – surprise – you discover positive gains in conversions for the
variation (B). Adding an FAQ at the bottom of your product page has indeed
brought your company more sales than the control version.
That’s right – your first test encountered a type 2 error!
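Both stories hinge on how much evidence the test had when it was stopped. One way to check an A/B result is a two-proportion z-test, sketched below with made-up visitor and conversion counts (a 25% lift on a small sample). The function name two_proportion_z_test is hypothetical, and the normal approximation is assumed to hold.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple[float, float]:
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # conversion rate under H0: no difference
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: A converts 80 of 2000 visitors, B converts 100 of 2000.
z, p_value = two_proportion_z_test(80, 2000, 100, 2000)
for confidence in (0.85, 0.95):
    verdict = "significant" if p_value < 1 - confidence else "not significant"
    print(f"{confidence:.0%} confidence: {verdict} (z = {z:.2f}, p = {p_value:.3f})")
```

With these counts the lift passes at 85% confidence but not at 95%, which illustrates how a low confidence level makes a type 1 error more likely, while a short test with few visitors makes a type 2 error more likely.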
APPLY THE CENTRAL
LIMIT THEOREM
• ASSUMPTION:
• Sample is large
• Population is normally distributed
EXAMPLE
A Social Scientist believes that people in this area have higher IQ than average
• H0 : µ = 100
• H1 : µ > 100
• α = 0.05
After survey:
• Sample size : n = 256
• Average IQ of the sample : 102.5
• Reject H0?
Assumption:
• Sample is large
• Population is normally distributed
• σ = 16
Sampling distribution of the sample mean: µx̄ = µ = 100
σx̄ = σ/√n = 16/√256 = 1
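The remaining step of the IQ example is to turn the sample mean into a z-score and a p-value. The sketch below does this for the right-tailed test H1: µ > 100, using only the numbers given above; the variable names are illustrative.

```python
import math
from statistics import NormalDist

mu0, sigma, n = 100, 16, 256      # hypothesized mean, population SD, sample size
sample_mean, alpha = 102.5, 0.05

standard_error = sigma / math.sqrt(n)           # 16 / 16 = 1
z = (sample_mean - mu0) / standard_error        # distance from mu0 in standard errors
p_value = 1 - NormalDist().cdf(z)               # right-tailed, because H1 is mu > 100
reject_h0 = p_value < alpha

print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

Here z is 2.5 and the p-value is well below 0.05, so the data support the claim that the average IQ in this area is above 100.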
Z-TEST
• High p-values: your data are likely with a true null
• Low p-values: your data are unlikely with a true null
Left Tailed Test
H1: parameter < value
Notice the inequality points to the left
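Decision Rule: Reject H0 if the test statistic (t.s.) is less than the critical value (c.v.) in a left-tailed test, or greater than the critical value in a right-tailed test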
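The decision rule can be written as a short function. This sketch assumes a z-test, so critical values come from the standard normal distribution; the names critical_value and reject_h0 are hypothetical.

```python
from statistics import NormalDist

def critical_value(alpha: float, tail: str) -> float:
    """Critical z value for a one-tailed test at significance level alpha."""
    standard_normal = NormalDist()
    if tail == "left":
        return standard_normal.inv_cdf(alpha)
    if tail == "right":
        return standard_normal.inv_cdf(1 - alpha)
    raise ValueError("tail must be 'left' or 'right'")

def reject_h0(test_statistic: float, alpha: float, tail: str) -> bool:
    """Apply the rule: reject if t.s. < c.v. (left) or t.s. > c.v. (right)."""
    cv = critical_value(alpha, tail)
    return test_statistic < cv if tail == "left" else test_statistic > cv

print(critical_value(0.05, "left"), critical_value(0.05, "right"))
print(reject_h0(2.5, alpha=0.05, tail="right"))
print(reject_h0(-1.0, alpha=0.05, tail="left"))
```

For a t-test the same rule applies, but the critical value would come from the t distribution with the appropriate degrees of freedom.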
Y = ax + b
CORRELATION
The more time boys spend playing video games, the less time they spend exercising
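A negative relationship like this one can be quantified with the Pearson correlation coefficient and summarized by a fitted line Y = ax + b. The sketch below uses made-up hours for five boys; the data and the function name fit_line_and_correlation are hypothetical.

```python
import math

def fit_line_and_correlation(x: list[float], y: list[float]) -> tuple[float, float, float]:
    """Return slope a, intercept b (least squares) and Pearson correlation r."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    a = sxy / sxx                    # slope of Y = ax + b
    b = mean_y - a * mean_x          # intercept
    r = sxy / math.sqrt(sxx * syy)   # ranges from -1 to 1
    return a, b, r

# Hypothetical weekly hours: video games (x) and exercise (y).
gaming_hours = [1, 2, 3, 4, 5]
exercise_hours = [5, 4, 4, 2, 1]
a, b, r = fit_line_and_correlation(gaming_hours, exercise_hours)
print(f"Y = {a:.2f}x + {b:.2f}, r = {r:.3f}")
```

A value of r close to -1 indicates a strong negative linear relationship; it does not by itself show that gaming causes less exercise.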
=CALCULATE(COUNT([CustomerKey]),Customers[Gender] = "M")
EXCEL FORMAT CELLS
EXCEL FUNCTION – DAY OF THE MONTH
EXCEL FUNCTION – DAY NAME OF THE WEEK
EXCEL FUNCTION – DAY NUMBER OF WEEK
EXCEL FUNCTION – DAY NUMBER OF YEAR
EXCEL FUNCTION – WEEK NUMBER OF YEAR
EXCEL FUNCTION – MONTH NAME
EXCEL FUNCTION – MONTH NUMBER
EXCEL FUNCTION – CALENDAR QUARTER
EXCEL FUNCTION – CALENDAR YEAR
SUMIF
COUNTIF
EXCEL VBA
INSERT AND FORMAT TEXT
Managers Partners
Teammates Clients
…. ….
CONFLICT WITH YOUR TEAMMATES
PEOPLE OUTSIDE YOUR TEAM
BREAK IT DOWN
DATABASE
POWER PIVOT