
Data Analysis

From theory to implementation


Using Excel, Python, and Flourish

Lecture #7/8
Inference
Normal distribution & Hypothesis testing

Dr. Ghoniem Lawaty
GHONIEM.GHONIEM@GMAIL.COM
MIS, DA, ML, Digitization and Micro-Services, TOGAF, DEVOPS
Certified ATM for CMMI SCAMPI (A) method, DA,DS,ML, ICAgile
https://www.linkedin.com/in/ghoniem-abdel-azim-mostafa-33860691
Topics
• Inference in DA
• What is it?
• Why?
• Sampling
• Random sample
• Criteria
• How to calculate
• Normal distribution
• Why?
• How?
• Rules of normal distribution
• Role of normal distribution in inference
• Difference between normal distribution and standard normal distribution
• Practical examples
• Hypothesis testing
• What?
• Why?
• Rules
• Steps
• Practical examples
What is inference in DA?
➢ What is inference?
• Inference is the use of samples to acquire
new knowledge about the population
➢ Why?
• Measuring the whole population is usually too costly
• Inference is by nature predictive
➢ How can we improve the accuracy of inference?
• Larger sample size
• Higher accuracy rate (confidence level)
• Lower margin of error
➢ How?
• Normal distribution
• Estimation: point and interval
• Hypothesis testing
Normal distribution vs. standard normal distribution
➢ Z-score
• Also called the standard score
• Represents how many standard deviations a value lies below or
above the mean
➢ Z-score values and ranges
• Zero means equal to the mean
• Between -1 and +1: area of about 68%
• Between -2 and +2: area of about 95%
• Between -3 and +3: area of about 99.7%
➢ Rule of calculation
• Z-Score = (X - U) / S, where X is the value, U is the mean, and S is the standard deviation
➢ What do we mean by Z-Score = 1.2?
• It means that the value is 1.2 standard deviations above
the mean
• So X = 1.2 * S + U
➢ How can we calculate the area under the curve?
• We can use the Z table to look up the area, which
represents the proportion of the population that falls
in that range
➢ The normal distribution uses the actual values, while the standard normal
distribution uses Z-score values
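The Z-score rule and the 68% / 95% / 99.7% areas above can be checked with a short Python sketch. It uses only the standard library; the mean of 100 and standard deviation of 10 are hypothetical values chosen for illustration, and the helper names `z_score` and `area_within` are not from the lecture.

```python
from statistics import NormalDist


def z_score(x, mean, std):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std


def area_within(k):
    """Share of a normal population lying within k standard deviations of the mean."""
    standard_normal = NormalDist(mu=0, sigma=1)
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


# Hypothetical population (illustration only): mean 100, standard deviation 10.
z = z_score(112, mean=100, std=10)   # 112 is 1.2 standard deviations above the mean
x = 1.2 * 10 + 100                   # converting Z = 1.2 back to the actual value

print(z, x)
for k in (1, 2, 3):
    print(f"within ±{k} standard deviations: {area_within(k):.4f}")
```

The loop prints the exact areas that the rounded 68%, 95% and 99.7% figures approximate.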
Z-score and normalized values
➢ Difference between a standardized value and a normalized value:
• Both put values on a common scale,
which simplifies comparison and
visualization
• Z-scores almost always lie between -4 and +4
• Normalized (min-max scaled) values lie between 0 and 1
➢ Example:
• Values: 1, 2, 3, 4, 5
• Min = 1
• Max = 5
• Avg = 3
• Stddev = 1.58 (sample standard deviation)
• Z-Score(1) = (1-3)/1.58 = -1.26
• Z-Score(5) = (5-3)/1.58 = 1.26
• X_Normalized(1) = (1-1)/(5-1) = 0
• X_Normalized(5) = (5-1)/(5-1) = 1
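The example above can be reproduced in Python as a minimal sketch. It assumes the sample standard deviation (dividing by n - 1), which is what gives 1.58 for these five values; the variable names are illustrative.

```python
from statistics import mean, stdev

values = [1, 2, 3, 4, 5]

avg = mean(values)
std = stdev(values)  # sample standard deviation, about 1.58

# Standardization: centre on the mean, scale by the standard deviation.
z_scores = [(v - avg) / std for v in values]

# Min-max normalization: map the smallest value to 0 and the largest to 1.
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

for v, z, n in zip(values, z_scores, normalized):
    print(f"{v}: z = {z:+.2f}, normalized = {n:.2f}")
```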
Calculating the area under the curve
➢ First calculate the Z-score value
➢ Then go to the Z table and find the row and column
whose values sum to the Z-score; the cell where they
meet holds the area
➢ Example: Z = 1.25
• Go to row 1.2
• Go to column 0.05
• The value in that cell is 0.3944
➢ This means that 39.44% of the population lies
between Z = 0 and Z = 1.25
➢ So if the population is 10000, then 3944
elements lie between Z = 0 and Z = 1.25
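As a cross-check on the table lookup, the same area can be computed directly. This is a sketch using Python's standard library; `area_from_mean` is an illustrative helper name, not part of the lecture.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1


def area_from_mean(z):
    """Area under the standard normal curve between 0 and z (what the Z table lists)."""
    return abs(standard_normal.cdf(z) - 0.5)


area = area_from_mean(1.25)
population = 10_000

print(round(area, 4))             # 0.3944
print(round(area * population))   # 3944
```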
Z-score vs. area under the curve
➢ The Z-score represents the position of a value
under the curve
➢ The area under the curve represents the
proportion of the population that falls between the mean and
that position
➢ As in the previous example (Z = 1.25):
• 39.44% of the
population lies between Z = 0 and Z = 1.25
• So if the population is 10000, then
3944 elements lie between Z = 0 and
Z = 1.25
Inference decision tree
and lifecycle model
[Figure: decision tree starting from the population. If the parameter is claimed (guessed): hypothesis testing using sample data. If the parameter is unknown: interval estimation using sample data.]
Sample size without known population (N)
• Example:
• Calculate the sample size needed to estimate
the average age of students, with accuracy 99% and
accepted error 2 years, given that the age
variance = 50
• Answer:
• Required accuracy (confidence level): 99%
• Accepted error: e = 2 years
• Variance: S^2 = 50 (standard deviation about 7.07 years)
• Solution
• Using the first law: n >= (Z^2 * S^2) / e^2
• Z = 2.58, from the Z table for 99%
• n >= (2.58^2 * 50) / (2^2) = 83.2, so n = 84
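• Without knowing the standard deviation (estimating a proportion): n >= (Z^2 * p * q) / e^2
• If we know that 60% accept the change to digital
transformation (p = 0.6, q = 0.4), with accepted error e = 0.2
• n >= (2.58^2 * 0.6 * 0.4) / (0.2)^2 = 39.9, so n = 40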
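The two sample-size rules can be sketched in Python as follows. The function names are illustrative. The first call assumes the values given in the problem statement (variance 50, accepted error 2 years, Z = 2.58 for 99%), and the second assumes p = 0.6 and e = 0.2 as used in the proportion calculation.

```python
import math


def sample_size_mean(z, variance, error):
    """Minimum sample size for estimating a mean: n >= Z^2 * variance / e^2."""
    return math.ceil(z**2 * variance / error**2)


def sample_size_proportion(z, p, error):
    """Minimum sample size for estimating a proportion: n >= Z^2 * p * (1 - p) / e^2."""
    return math.ceil(z**2 * p * (1 - p) / error**2)


# Z = 2.58 is the table value for 99% confidence.
print(sample_size_mean(z=2.58, variance=50, error=2))      # 84
print(sample_size_proportion(z=2.58, p=0.6, error=0.2))    # 40
```

Both functions round up, since a fractional result still requires one more whole observation.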
Sample size with known population (N)
• Example
• Assume that we need to estimate the
average student age
• Required accuracy: 99%
• Accepted error: e = 1% (0.01)
• Population: N = 1000000
• Solution
• Using the law for a known population: n >= N / (1 + N * e^2)
• e = 1% = 0.01
• n >= 1000000 / (1 + 1000000 * 0.01^2) = 1000000 / 101 = 9901
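A minimal Python sketch of the known-population case, assuming the intended rule is n = N / (1 + N * e^2) (commonly known as Slovin's formula), with the error e written as a fraction and squared. The function name is illustrative.

```python
import math


def sample_size_known_population(population, error):
    """n = N / (1 + N * e^2), with e as a fraction (1% -> 0.01)."""
    return math.ceil(population / (1 + population * error**2))


print(sample_size_known_population(1_000_000, 0.01))  # 9901
```

For a very large population the result approaches 1 / e^2, which is 10000 for e = 1%.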
Normal and standard distribution examples
• Police Academy
• 5000 students have applied to join the
police academy
• Average student height: 165 cm
• Standard deviation = 10 cm
• The academy accepts students between 145 cm and 185 cm
• Calculate:
• Number of accepted students
• Number of students taller than 175 cm
• Answer:
• Area between 145 and 185 = area between [-2S, +2S] =
0.95
• Total accepted students = 0.95 * 5000 = 4750 students
• Students taller than 175 cm
• Calculate Z-Score = (X - U) / S
• = (175-165)/10 = 1
• Look up Z = 1.00 in the Z table (row 1.0, column 0.00)
• Get the area = 0.3413
• Number of students = total number of applicants * (whole area -
area below 175 cm)
• = 5000 * [1 - (0.5 + 0.3413)] = 5000 * 0.1587 = about 794
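The police academy figures can be cross-checked without a Z table. This sketch assumes the upper-tail share is applied to all 5000 applicants; the variable names are illustrative.

```python
from statistics import NormalDist

heights = NormalDist(mu=165, sigma=10)  # applicant heights in cm, from the example
applicants = 5000

# Share of applicants inside the accepted range of 145 cm to 185 cm.
share_accepted = heights.cdf(185) - heights.cdf(145)

# Share of applicants taller than 175 cm (upper tail beyond Z = 1).
share_above_175 = 1 - heights.cdf(175)

print(round(share_accepted, 4), round(applicants * share_accepted))
print(round(share_above_175, 4), round(applicants * share_above_175))
```

The exact shares (about 0.9545 and 0.1587) give counts that differ slightly from those obtained with the rounded 0.95 and the four-digit table value.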
Normal and standard distribution examples
• K12 Result
• Assume that the number of students is 1 million
• We took a sample; the average was 65
• Standard deviation = 5
• The pass grade is 50
• Calculate:
• Number of failed students
• Answer:
• We need the students who scored less than 50
• Calculate Z-Score = (X - U) / S
• = (50-65)/5 = -3
• Look up Z = 3.00 in the Z table (row 3.0, column 0.00); the curve is symmetric, so the sign is ignored
• Get the area = 0.4987
• Number of students = total number of students * (whole area
- remaining area)
• = 1000000 * [1 - (0.5 + 0.4987)] = 1300
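The failed-student count is a lower-tail area, which can be sketched in Python as below. The variable names are illustrative; the exact tail area is slightly larger than the 0.0013 obtained from the four-digit table value.

```python
from statistics import NormalDist

students = 1_000_000
mean_grade, std_grade, pass_grade = 65, 5, 50

# Z-score of the pass grade: three standard deviations below the mean.
z = (pass_grade - mean_grade) / std_grade

# Lower-tail area: share of students scoring below the pass grade.
share_failed = NormalDist().cdf(z)
failed = round(students * share_failed)

print(z, round(share_failed, 5), failed)
```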
AnalyticA Journey-Management Levels Support
Statistical Estimation
• What is a confidence interval?
• The interval that is expected to contain the true
population value at a given confidence
level.
• How?
• We should have the following:
• Suitable sample
• Confidence level
• Standard error
Statistical Estimation
• Example
• We took a sample of Egyptian families to estimate
net income
• Sample size: 5000 families
• Average: 6000 EGP
• Standard deviation: 71 EGP
• Required confidence level: 95%
• Can the average income be greater than 7000 EGP?
• Answer
• Sampling error (margin of error) = Z * 71 / sqrt(5000)
• Z(95%) = 1.96, rounded to 2
• Error value = +- (2 * 71) / 70.71 = +- 2
• Interval estimate at 95% = [6000-2, 6000+2]
• = [5998, 6002]
• 7000 is far outside this interval, so we cannot accept that the average income exceeds
7000 EGP
Statistical Estimation
• Example
• We took a sample of students in a math exam
• Sample size: 100
• Average: 35 points
• Standard deviation: 3
• Required confidence level: 99%
• Calculate the confidence interval
• Answer
• Sampling error (margin of error) = Z * 3 / sqrt(100)
• Z(99%) = 2.58
• Error value = +- (2.58 * 3) / 10 = +- 0.774
• Interval estimate at 99% = [35-0.774, 35+0.774]
• = [34.226, 35.774]
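Both estimation examples follow the same rule: mean plus or minus Z * S / sqrt(n). A minimal Python sketch, with an illustrative function name and the Z values taken from the slides (2.58 for 99%, and 2 as the rounded value for 95%):

```python
import math


def confidence_interval(mean, std, n, z):
    """Interval estimate: mean +/- z * std / sqrt(n)."""
    margin = z * std / math.sqrt(n)
    return mean - margin, mean + margin


# Math exam example: n = 100, mean 35, standard deviation 3, 99% confidence.
low, high = confidence_interval(mean=35, std=3, n=100, z=2.58)
print(round(low, 3), round(high, 3))   # 34.226 35.774

# Family income example: n = 5000, mean 6000, standard deviation 71, 95% confidence.
low_income, high_income = confidence_interval(mean=6000, std=71, n=5000, z=2)
print(round(low_income), round(high_income))   # 5998 6002
```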
What is Hypothesis testing
[Figure: hypothesis testing steps. Set up hypothesis → Set up significance level → Determine suitable test statistic → Determine critical region → Perform computation → Decision making]
What is Hypothesis testing
• What is hypothesis testing?
• A branch of inferential statistics
• Depends on dividing the normal distribution
curve into an acceptance region and a rejection region
• Acceptance and rejection are decided
according to the hypothesis
• Steps
• Set up the hypothesis
• Set up the significance level
• Determine a suitable test statistic
• Determine the critical region
• Perform the computation
• Decision making
• Accept H0 if the calculated value is in the acceptance region
• Reject H0 if the calculated value is outside the acceptance
region
Hypothesis Example
• Example: Milk Machine
• A production company decided to test a machine that is designed to
fill a bottle with 120 gm
• A sample of 100 bottles was selected
• Average weight was 118.5 gm, with standard deviation 5 gm
• Test the hypothesis that:
• The average is different from 120 gm, at significance level 1%
• Answer
• Set up the hypothesis:
• H0: U = 120, H1: U <> 120
• Set up the significance level
• Significance level 1% (two-tailed), so Z(99%) = +-2.58
• Calculated value = (118.5-120)/(5/sqrt(100)) = -3
• -3 is below -2.58, so the value falls in the rejection region; we reject H0, which
says the average is 120, and accept the alternative hypothesis.
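The milk machine test can be sketched in Python as a two-tailed Z test. The function names are illustrative, and the p-value is an addition not shown on the slide; the decision itself compares the calculated value with the critical value, as in the steps above.

```python
import math
from statistics import NormalDist


def z_statistic(sample_mean, claimed_mean, std, n):
    """Calculated value: (sample mean - claimed mean) / (std / sqrt(n))."""
    return (sample_mean - claimed_mean) / (std / math.sqrt(n))


def two_tailed_test(z, alpha):
    """Return (reject H0?, critical value, p-value) for a two-tailed Z test."""
    critical = NormalDist().inv_cdf(1 - alpha / 2)   # about 2.58 for alpha = 0.01
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return abs(z) > critical, critical, p_value


z = z_statistic(sample_mean=118.5, claimed_mean=120, std=5, n=100)
reject, critical, p_value = two_tailed_test(z, alpha=0.01)

print(z, round(critical, 2), reject, round(p_value, 4))
```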


Hypothesis Example
• Example: Education System
• A teacher said that the new education system decreases the
student level to under 75%
• A sample of 49 students was taken
• Average was 65.5, with standard deviation 6.4
• Test the hypothesis at significance level 5%
• Answer
• Set up the hypothesis:
• H0: U >= 75, H1: U < 75
• Set up the significance level
• Significance level 5% (one-tailed, left), so Z(95%) = -1.645
• Calculated value = (65.5-75)/(6.4/sqrt(49))
• = -10.39
• -10.39 is below -1.645, so the value falls in the rejection region; we reject H0, which
says the level is at least 75%, and accept the alternative
hypothesis, which says it is less than 75%
Hypothesis Example
• Example: Education System 2018
• In 2015, the average student level in math was 36
• In 2018, a sample of 64 students was taken
• Average was 40, with standard deviation 8
• Test the hypothesis that no improvement
occurred in the student level under the new system, at
significance level 5%
• Answer
• Set up the hypothesis:
• H0: U <= 36 (no improvement), H1: U > 36 (improvement)
• Set up the significance level
• Significance level 5% (one-tailed, right), so Z(95%) = +1.645
• Calculated value = (40-36)/(8/sqrt(64))
• = 4
• 4 is above +1.645, so the value falls in the rejection region; we
reject H0, which says there is no improvement, and accept the
alternative hypothesis, which says the level is above 36
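The two education examples are one-tailed tests. The sketch below assumes a one-sided critical value (about 1.645 at the 5% level) and, for the 2018 example, an alternative hypothesis of improvement (mean above 36). The function name and the `alternative` parameter are illustrative, and the inputs are the sample figures stated in each example.

```python
import math
from statistics import NormalDist


def one_tailed_z_test(sample_mean, claimed_mean, std, n, alpha, alternative):
    """One-tailed Z test.

    alternative="less"    tests H1: mean < claimed_mean (left tail)
    alternative="greater" tests H1: mean > claimed_mean (right tail)
    Returns (calculated value, critical value, reject H0?).
    """
    z = (sample_mean - claimed_mean) / (std / math.sqrt(n))
    critical = NormalDist().inv_cdf(1 - alpha)   # about 1.645 for alpha = 0.05
    if alternative == "less":
        return z, -critical, z < -critical
    if alternative == "greater":
        return z, critical, z > critical
    raise ValueError('alternative must be "less" or "greater"')


# Education System: is the level below 75?
z1, crit1, reject1 = one_tailed_z_test(65.5, 75, 6.4, 49, 0.05, "less")
print(round(z1, 2), round(crit1, 3), reject1)

# Education System 2018: has the level improved above 36?
z2, crit2, reject2 = one_tailed_z_test(40, 36, 8, 64, 0.05, "greater")
print(round(z2, 2), round(crit2, 3), reject2)
```

Passing the direction as a parameter keeps the left-tailed and right-tailed cases in one function, since they differ only in the sign of the critical value and the direction of the comparison.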
References
➢ Sessions
➢ https://sites.google.com/view/technologyops/software-approaches/data-analysis
➢ TechnologyOps Portal
➢ https://sites.google.com/view/technologyops
