Professional Documents
Culture Documents
Statistical Computing
Learning Objectives
Well-informed decision
Problems Solved
Complex Problems
Data
Introduction to Statistics
Although both forms of analysis provide results, quantitative analysis provides more insight and a
clearer picture. This is why statistical analysis is important for business.
Major Categories of Statistics
There are two major categories of statistics: descriptive analytics and inferential analytics.
Descriptive analytics organizes the data and focuses on the main characteristics of the data.
HIGH
NUMBER OF STUDENT
MEDIUM
MATH SCORES
LOW
1847369250
0 1 2 3 4 5 6 7 8 9
Inferential analytics is valuable when it is not possible to examine each member of the population.
Tall
Inferential Method
Medium Categorize height as tall, medium, and short.
Take a sample from the population to study.
Short
Descriptive Method
Record the height of each and every person.
Provide the tallest, shortest, and average height of the
population.
Statistical Analysis Considerations
What is the
chance that
I will get a spade?
Events
Consider that we have tabulated the salaries of a group of employees in a company and we want to create a
frequency distribution table out of it.
Salary ($k) ID
100 1
100 2
50 3
300 4
300 5
List of employees
300 6
200 7
200 8
400 9
Probability Distribution Function: Use Case
Number of
employees
Salary
Probability Distribution Function: Use Case
3
2
Number of
employees
Number of
employees
Salary Range
Represents individual
Represents salary range
salaries
Note: Binned distribution doesn’t allow you to point out the salary of a particular employee.
Probability Distribution Function: Use Case
To sort out the problems with binned distribution and individual distribution, we use a distribution function.
Number of
employees
Continuous distribution
50-200 200-400
Salary Range
Note: Using probability distribution function allows you to find the probabilities of all the discrete
points.
Probability Distribution Function: Properties
p(x)
Equal probabilities of 1 f(x)= 1 , for 1≥ x ≥0
occurrence of random
variable
x
1
Normal Distribution
f(X)
0 Bell-shaped curve
1
02 Symmetric around the
mean
03 Mean = Median
σ
04 Total area under the curve is
1
μ X 05 Almost all the values fall within
3σ
Normal Distribution: Use Case
Consider a scenario where John and Dany have 200 and 150 followers on Twitter and Facebook, respectively, and
we are supposed to determine who is more popular.
I have
I have 150
200 followers
followers
Twitter
Po
st
comme
nts
Normal Distribution
f(X f(X
) )
μ = 190 X μ = 100 X
Consider the mean for Twitter distribution as 190 Consider the mean for Facebook distribution as
and standard deviation as 5 100 and standard deviation as 10
Z-Scores
Indicate how many standard deviations away from the mean does the point x lies
Consider the mean for twitter distribution as 190 Consider the mean for Facebook distribution as
and standard deviation as 5 100 and standard deviation as 10
f(X f(X
) )
μ = 190 X μ = 100 X
Z = X-μ Z =2 Z = X-μ Z =5
σ σ
Standard Normal Distribution
It is a normal distribution curve with mean at 0 and standard deviation 1. They are most commonly used to
compare two independent events.
f(X f(X
) ) z=5
z=2
μ = 0, σ = 1 X μ = 0, σ = 1 X
Note: From the above z-scores, we can infer that Dany is far more popular than John as she has a
z-score of 5 which constitutes 99 percentile of the follower data.
Central Limit Theorem
Central Limit Theorem
Sampling distribution of sample means is normal if the sample size is large enough.
The mean of sampling distribution of sample means can be used to infer the population mean.
f(X
) n = 18
n = 10
n=6
n=2
SE = σ/ n
μ = μp X
Note: μp: population mean, μ: sample mean of the sampling distribution, n: sample size, σ : population standard
deviation.
Confidence Intervals
Hypothesis Test: Use Case
The mean of sampling distribution of sample means can be used to infer the population mean.
XYZ company has finished the development of its most awaited product and wants to
test this product in the market on a certain set of customers.
Hypothesis Test: Use Case
The management decided to test NPS score of its new product out of a sample of learners.
Mean of the NPS Score of the population Mean of the NPS Score of the
before product launch= 80 and standard sample= 90
deviation = 10
This leads the management to proceed with a point estimate that the mean NPS of the population
will be 90 approximately.
Hypothesis Test: Use Case
The management now took n samples of size 25 each such that plotting the sampling distribution of all these
sample means resulted in a normal distribution with population mean of 80.
Normal Curve
SE = 10/5 = 2
7 7 μ = 80 8 8
4 8 2 6
Hypothesis Test: Use Case
From the below z-score, the management can easily infer that the sample with sample mean of 90 came from a
different distribution. (centered around 90)
Normal Curve
SE = 10/5 = 2
Z-score of 5
7 7 μ = 80 8 8 9
4 8 2 6 0
Interval Estimate
If the management opens the product for all its customers, the NPS score will definitely increase and will be
somewhere near 90.
7 7 μ = 80 8 8 9 μl 9 μ
4 8 2 6 0 0 u
Note: One can never be sure with a point estimate. Therefore, to measure the certainty there is
always a range associated with a point estimate.
Confidence Intervals
The management wants to be 95% sure that if it launches its new product for the entire customer base, the new
mean NPS will be 90.
Standard
Normal Curve
SE = 2
μl 9 μ - μ= 1.
0 u 1.9 0 96
6
95% Confidence
Interval
μl = 90 - 1.96 SE = 86.08
μ = 90 + 1.96 SE = 93.92
u
Constructing Confidence Intervals
Consider the previous scenario where the company tested its new product against a sample, and the mean NPS
came out to be 90.
SE = 2 Probability <
0.0001
μ= 9 μ=0 5
80 Z-score = 5
0
From the above z-score and probability, the management can clearly state that this mean (sample
mean of 90) is a part of different distribution.
Levels of Significance
Standard
Normal Curve
Significance
5% level
μ=0 = 1.64
Corresponding
critical z value
In the above case, if any sample mean has a z-score > 1.64 then it is considered unlikely, given the
alpha levels as 5%.
Null Hypothesis vs. Alternate Hypothesis
Null hypothesis can be rejected or accepted based on one tailed or two tailed tests.
5% 2.5% 2.5%
Calculating a sample
statistic
Type I and Type II Errors
2.5% 2.5%
Test of Association:
To determine whether one variable is associated with a different variable
Example: Determine whether the sales for different cell phones depend on the city or
country where they are sold.
Test of Independence:
To determine whether the observed value of one variable depends on the observed value
of a different variable
Example: Determine whether the color of the car that a person chooses is independent of
the person’s gender.
The test is usually applied when there are two categorical variables from a single population.
Chi-Square Test: Example
Null Hypothesis
fo .75 .25
Alternative Hypothesis
The formula for calculating expected and observed frequencies using chi-square:
(0,0) (0,1) (0,2) Correlation coefficient measures the extent to which two
variables tend to change together.
(1,0) (1,1) (1,2)
(2,0) (2,1) (2,2) The coefficient describes both the strength and direction of
the relationship.
3 × 3 matrix (simple square matrix)
Correlation Matrix
A correlation matrix that is calculated for the stock market will probably show the short-term, medium-term,
and long-term relationships between data variables.
What Is A/B Testing?
A/B Testing
We compare one page against the other for a difference in an element of the control page.
A/B Testing: Steps
When you’re nearing the end of a test, and your statistical confidence
level is still low, but you’re seeing a decent difference in conversion rate
between the variants, make a gut-check call and remove the poorer
performing pages.
A/B Testing: Advantages and Disadvantages
Advantages Disadvantages
After the launch of the Noob Guide into online marketing, company XU created a landing page that used
PayWithATweet.com to help expand the distribution of the guide.
Web Analyst
Scenario
Data was gathered by inserting KISSinsights (now Qualaroo) widget in the page which prompted many people to
provide their emails to get the pdf download.
Formulating Test Hypothesis
Test Hypothesis:
Will allowing visitors to download the PDF by providing their email address work better than receiving
it in exchange for a tweet, considering that not everyone has a Twitter account or is willing to share
such information with their followers.
Perform A/B Testing
At the end of this test, XU decided to discuss a possible hybrid option that might
produce the best of both worlds: a single version that lets the user decide whether
they want to pay with a tweet or provide their email.
Knowledge Check
Knowledge
Check On an intelligence test with a mean of 100 and a standard deviation of 15, Cersei scored 85.
What is Cersei's z-score?
1
a. -2
b. -1
c. 1
d. 2
Knowledge
Check On an intelligence test with a mean of 100 and a standard deviation of 15, Cersei scored 85.
What is Cersei's z-score?
1
a. -2
b. -1
c. 1
d. 2
a. F(x) = 1
b. F(x) = 1/2
c. F(x) = x
d. F(x) = log(x)
Knowledge
Check A continuous random variable x can take any value between 0 and 1. Its probability density
function is assumed to be uniform. What is the explicit form of its probability density
2 function f(x)?
a. F(x) = 1
b. F(x) = 1/2
c. F(x) = x
d. F(x) = log(x)
Since, domain of f(x) is restricted to 0 and 1 and f(x) is a uniform distribution, f(0) = 0 and f(1) =1
Knowledge
Check
In chi-square test, there is no association of variables if:
3
d. Both a and b
Knowledge
Check
What’s the first step of running an A/B test?
4
d. Both a and b
Before starting with A/B test, you must set your success metric.
Knowledge
Check
Which of the following is a false statement about A/B testing?
5
c. You have to at least run the test for a couple of weeks to get correct results
c. You have to at least run the test for a couple of weeks to get correct results
When you run a multivariate test, you use one page and dynamically supply multiple versions of multiple elements.
Lesson End Project
Problem Scenario: Proda, an electric utility company, manufactures devices that keep the electrical
grid operating efficiently. It enables grid operators to respond to the demand for reliable energy
with trends like digitalization, self-healing grids, and reinforced infrastructure.
Last month engineers in Proda developed two updated versions of EISR. It is an electrical insulator
material that impedes the free flow of electrons. Due to their resistivity, these insulators are
installed to prevent line damage from arcing. The CEO of Proda claimed that the updated versions
of
EISR performed much better. To test the same, the delivery team of Proda released the two
versions
among two customer groups (grouped based on random sampling) and found out that version 1
has
a mean feedback score of 50, whereas version 2 has a mean feedback score of 40.
However, to maintain standards, the company is trying to finalize on one of the two versions. You,
as
a statistician, must do some level of statistical testing to help the management arrive on the final
version.
Lesson End Project (Contd.)
Consider the mean and standard deviation for version 1 distribution as 70 and 5 and for version 2
distribution as 80 and 8.
Instructions to Perform:
▪ Initialize the null and alternate hypothesis
▪ Convert Version 1 distribution to standard normal distribution
▪ Convert Version 2 distribution to standard normal distribution
▪ Compare the Z-scores
Key Takeaways