Lecture #7/8
Inference
Normal distribution & Hypothesis testing
Dr. Ghoniem Lawaty
GHONIEM.GHONIEM@GMAIL.COM
MIS, DA, ML, Digitization and Micro-Services, TOGAF, DEVOPS
Certified ATM for CMMI SCAMPI (A) method, DA,DS,ML, ICAgile
https://www.linkedin.com/in/ghoniem-abdel-azim-mostafa-33860691
Topics
• Inference in DA
• What is?
• Why?
• Sampling
• Random sample
• Criteria
• How to calculate
• Normal distribution
• Why?
• How?
• Rules of Normal distribution
• Role of normal distribution in inference
• Difference between normal distribution and standard normal distribution
• Practical samples
• Hypothesis testing
• What?
• Why?
• Rules
• Steps
• Practical samples
What is inference in DA?
What is Inference?
• Inference is the use of samples to acquire
new knowledge about the population
Why?
• Applying a study to the whole population is too costly
• Inference is by nature predictive
How to improve the accuracy of inference?
• Larger sample size
• Higher accuracy rate
• Lower mean error
How?
• Normal distribution
• Estimation: point and interval
• Hypothesis testing
Normal distribution Vs Standard normal distribution
Z-Score
• Also called the standard score
• Represents how many standard deviations a value
lies below or above the mean
Z-Score values and ranges
• Zero means equal to the mean
• Between -1 and +1: area is about 68%
• Between -2 and +2: area is about 95%
• Between -3 and +3: area is about 99.7%
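The areas quoted above are rounded figures. As a minimal sketch, they can be checked with the standard normal distribution from Python's standard library; the helper name `area_within` is an assumption introduced here for illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)

def area_within(k):
    """Area under the standard normal curve between -k and +k (hypothetical helper)."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"between -{k} and +{k}: {area_within(k):.4f}")
# between -1 and +1: 0.6827
# between -2 and +2: 0.9545
# between -3 and +3: 0.9973
```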
Rule of calculation
• Z-Score = (X-U)/S, where X is the value, U is the mean and S is the standard deviation
What do we mean by Z-Score = 1.2?
• It means that the value is 1.2 standard deviations
above the mean
• So X = 1.2 * S + Mean
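A short sketch of the conversion in both directions may help here. The mean of 100 and standard deviation of 15 are hypothetical values chosen only for illustration, and the function names are assumptions, not part of the lecture.

```python
def z_score(x, mean, std):
    """Z-Score = (X - U) / S."""
    return (x - mean) / std

def value_from_z(z, mean, std):
    """Inverse of the Z-Score rule: X = Z * S + U."""
    return z * std + mean

# Hypothetical population: mean 100, standard deviation 15.
mean, std = 100.0, 15.0

x = value_from_z(1.2, mean, std)   # the value that sits 1.2 standard deviations above the mean
z = z_score(x, mean, std)          # converting back recovers the Z-Score

print(f"x = {x:.1f}, z = {z:.1f}")
```

With these assumed numbers, a Z-Score of 1.2 corresponds to a value of about 118.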
How can we calculate the area under the curve?
• We can use the Z table to look up the area, which
represents the proportion of population elements
that fall in that range
The normal distribution uses the actual values, while the standard
normal distribution uses Z-Score values (mean 0, standard deviation 1)
Z-Score and normalization values
Difference between standardized value and normalized value:
• Both are used to put values on the same
scale, in order to simplify comparison and
visualization
• Z score typically lies between about -4 and 4
• Normalized (scaled) score lies between 0 and 1
Example:
• Values : 1,2,3,4,5
• Min = 1
• Max = 5
• Avg = 3
• Stddev = 1.58 (sample standard deviation)
• Z-Score(1) = (1-3)/1.58 = -1.26
• Z-Score(5) = (5-3)/1.58 = 1.26
• X_Normalized(1)= (1-1)/(5-1) = 0
• X_Normalized(5)= (5-1)/(5-1) = 1
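The example above can be reproduced with a few lines of Python. This is a sketch that assumes the sample standard deviation (dividing by n-1), which is what gives 1.58 for these values.

```python
from statistics import mean, stdev

values = [1, 2, 3, 4, 5]

avg = mean(values)   # 3
sd = stdev(values)   # sample standard deviation (n-1), about 1.58

# Standardization: Z-Score = (X - mean) / stddev
z_scores = [(v - avg) / sd for v in values]

# Min-max normalization: (X - min) / (max - min), always between 0 and 1
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

for v, z, n in zip(values, z_scores, normalized):
    print(f"value={v}  z={z:+.2f}  normalized={n:.2f}")
```

The first and last rows should match the hand-calculated Z-Scores of -1.26 and 1.26 and the normalized values 0 and 1.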
Calculating the area under the curve
We can calculate it from the Z-Score value:
go to the Z table, find the row for the first decimal
place and the column for the second decimal place,
and read the equivalent value
Example: Z=1.25
• Go to row: 1.2
• Go to column: 0.05
• The equivalent value is 0.3944
This means: 39.44% of the population lies
between Z = 0 (the mean) and Z = 1.25
So if the population is 10000, then about 3944
elements lie between Z = 0 and Z = 1.25
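As a cross-check of the table lookup, the same area can be computed directly. This sketch uses `statistics.NormalDist` from the Python standard library instead of a printed Z table; the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

z = 1.25
# Area between the mean (Z = 0) and Z = 1.25, the same quantity the Z table gives.
area = std_normal.cdf(z) - std_normal.cdf(0.0)

population = 10_000
expected_count = round(population * area)

print(f"area = {area:.4f}")                   # area = 0.3944
print(f"expected count = {expected_count}")   # expected count = 3944
```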
Z-Score Vs area under the curve
The Z-Score represents the position of a value
under the curve,
while the area under the curve represents the
portion of the population that lies within the
corresponding range.
As in the previous example:
• 39.44% of the population lies between Z = 0 and Z = 1.25
• So if the population is 10000, then about 3944 elements lie between Z = 0 and Z = 1.25
Inference decision tree and lifecycle model
[Figure: decision tree starting from the population, with two branches: the parameter is unknown, or the parameter is claimed (guessed). Lifecycle steps shown:]
• Setup significance level
• Determine suitable test statistic
• Determine critical region
• Perform computation
• Decision making
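The lifecycle steps can be sketched as a two-tailed Z test. This is a minimal illustration, not a definitive implementation: the function name `two_tailed_z_test` and its parameters are assumptions, the population standard deviation is assumed known, and the numbers are the ones used in this lecture's worked hypothesis-testing example.

```python
from math import sqrt
from statistics import NormalDist

def two_tailed_z_test(sample_mean, claimed_mean, sigma, n, confidence=0.99):
    """Hypothetical helper: test H0: U = claimed_mean against H1: U <> claimed_mean."""
    # Setup significance level
    alpha = 1.0 - confidence
    # Determine critical region: reject H0 when |z| exceeds the critical value
    critical = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    # Perform computation: Z test statistic for a sample mean
    z = (sample_mean - claimed_mean) / (sigma / sqrt(n))
    # Decision making
    reject = abs(z) > critical
    return z, critical, reject

z, critical, reject = two_tailed_z_test(
    sample_mean=118.5, claimed_mean=120.0, sigma=5.0, n=100
)
print(f"z = {z:.2f}, critical = +-{critical:.2f}, reject H0: {reject}")
# z = -3.00, critical = +-2.58, reject H0: True
```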
What is Hypothesis testing
• What is hypothesis testing?
• A branch of inferential statistics
• Depends on dividing the normal distribution curve into accept and reject regions
• Acceptance and rejection are performed by computation
• Steps
• Setup hypothesis
• Setup significance level
• Determine critical region
• Perform computation
• Decision making
• Answer
• Setup hypothesis:
• H0 : U = 120, H1 : U <> 120
• Setup significance level
• Z(99) = +-2.58 (two-tailed critical values at the 99% confidence level)
• Perform computation
• Calculated value = (118.5-120)/(5/sqrt(100)) = -3
• -3 is below -2.58, so the value has been found in the reject area, so we reject H0, that