Answer to original
question Collect data
The Basic Types of Statistics
• DESCRIPTIVE STATISTICS is relevant in several
different situations:
1. When a researcher needs to summarize or
describe the distribution of a single variable.
These statistics are called univariate (“one
variable”) descriptive statistics.
Studies
Experimental
Comparative Studies
Non-Experimental
• The purpose of a survey is to quantify
population characteristics
Experiments & Observational Studies
Experiment Terminology
Experimental Unit Treatment Response
Observational Study
Unit Treatment Response
patient smoking lung cancer
Note:
Only a well-designed and well-executed
experiment can reliably establish causation.
An observational study is useful for identifying
possible causes of effects, but it cannot
reliably establish causation.
SAMPLING
Sampling
• We can NOT study the whole population
Sampling is the rule in statistics
Sampling
• Samples must be collected in a way to allow
for generalizations to be made to the entire
population
• To accomplish this goal, the sample must
entail an element of chance (A random
sample must be used)
• The most fundamental type of random sample
is a Simple Random Sample
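A minimal sketch of drawing a simple random sample, assuming the sampling frame is available as a Python list. The frame of patient IDs 1 to 500, the sample size of 20 and the function name `simple_random_sample` are hypothetical and chosen only for illustration.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units from the sampling frame, each equally likely."""
    rng = random.Random(seed)      # a seed makes the draw reproducible
    return rng.sample(frame, n)    # sampling without replacement


# Hypothetical sampling frame: patient IDs 1..500
frame = list(range(1, 501))
sample = simple_random_sample(frame, 20, seed=1)
print(sorted(sample))
```

Sampling without replacement guarantees that no unit is selected twice.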
• Population
– Entire aggregation of cases that meets a specific
set of criteria
• Sample
– Subset of entities that make up the population
• Unit/ Element
– Most basic unit about which information is
collected
• Sampling Frame
– Listing of accessible population from which you’ll
draw your sample
Sampling
• Sampling
– Selection of a number of study units from a
defined study population
• Representative
– Includes all the characteristics of the population
from which it is drawn
Non-Probability Samples
• Convenience Sampling
– The use of the most conveniently available people (i.e., those relevant to
the study) as study participants
• Example
– Distributing a questionnaire to first 100 asthmatic
patients attending outpatient
• Example
– Distributing questionnaire to 200 students leaving
the hospital library
• Quota Sampling
– Identifying strata of the population.
– Specifying the proportions of elements needed
from various strata of population
– Example
• Distributing questionnaire to 200 students leaving the
hospital library
BUT
giving 150 to male students and 50 to female students
Probability Samples
• Each unit of sample is chosen by chance
Simple Random Sample
• To select a simple random sample
– Prepare sampling frame
Stratified Random Sample
• Population is divided into homogeneous strata
To select a stratified random sample
Example
Size of the Population
Sample size
Sampling interval
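The three quantities listed above (population size, sample size, sampling interval) are the ones used in systematic sampling, where the interval is k = N / n. That this example is about systematic sampling is an assumption, since the slide heading is missing. A minimal sketch, with a hypothetical frame of 1000 units and a sample of 100:

```python
import random


def systematic_sample(frame, n, seed=None):
    """Pick every k-th unit after a random start, where k = N // n."""
    N = len(frame)                 # size of the population (frame)
    k = N // n                     # sampling interval
    rng = random.Random(seed)
    start = rng.randrange(k)       # random start within the first interval
    return [frame[start + i * k] for i in range(n)]


# Hypothetical frame of 1000 numbered units, sample size 100 -> interval 10
units = list(range(1, 1001))
picked = systematic_sample(units, 100, seed=7)
print(picked[:5])
```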
Cluster Sample
• Selection of groups of study units (clusters)
instead of selection of units individually
Completely Randomized Design
The treatments are allocated entirely by
chance to the experimental units.
Completely Randomized Design
Example:
Which of two varieties of tomatoes (A & B) yield a
greater quantity of market quality fruit?
Completely Randomized Design
Divide the field into plots and randomly
allocate the tomato varieties (treatments) to
each plot (unit).
8 plots: 4 get variety A and 4 get variety B
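The random allocation in this tomato example can be sketched as follows. The plot numbering 1 to 8 and the function name `randomize_plots` are assumptions made for illustration; only the two varieties and the 8 plots come from the slide.

```python
import random


def randomize_plots(n_plots=8, treatments=("A", "B"), seed=None):
    """Allocate treatments to plots entirely by chance, in equal numbers."""
    per_treatment = n_plots // len(treatments)
    labels = [t for t in treatments for _ in range(per_treatment)]
    rng = random.Random(seed)
    rng.shuffle(labels)            # chance alone decides which plot gets what
    return dict(zip(range(1, n_plots + 1), labels))


layout = randomize_plots(seed=42)
for plot, variety in layout.items():
    print(f"plot {plot}: variety {variety}")
```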
[Figure: field layout of the 8 plots, with the uphill direction marked]
Completely Randomized Design
Note:
Randomization is an attempt to make the
treatment groups as similar as possible — we
can only expect to achieve this when there is a
large number of experimental units to choose
from.
Data Collection
1- Identify your study question
2- Define your variables
3- Define your study design
4- Calculate your sample size
5- Define your inclusion and exclusion criteria
6- Design your DATA collection sheet
7- Define your instruments you are going to use
8- Go and collect your data
9- Enter your data
Define your question
E.g.:
• What is the number of …?
• Is there a relationship between …?
• Does this (variable) … affect that (variable)?
Your question will lead you to
your variable(s)
Data Collection Sheet
A- Personal characteristics
– Age (continuous, categorical)
– Sex
– Residence (by district, by site: urban vs. rural)
– Income (continuous, categorical)
– No. of children
B- Study characteristics
– No. of patient days
– No. of bacterial growth
– Satisfaction with food
– Percent of CO in classrooms
– Etc;…….
Prepare Your Coding Sheet
• Code your variables (especially categorical)
Numerical Presentation
Objective:
At the end of this session participants should be able to:
• Recognize the advantages and limitations of ordered array
• Explain the method of construction of an ordered array
• Explain the method of construction of a frequency distribution, a
cumulative frequency distribution and cross tabulation
• Tabulate a given set of data and Comment on the results
• Compute a percentage distribution and a cumulative percentage
distribution
Results are presented as a mass of unordered data (raw data)
63 40 32 24 29 36 48 19 23 39
[Ordered array]
18 19 19 21 22 23 24 24 29 30
32 35 36 39 40 42 48 51 63 63
• Age of youngest subject = 18
• Age of eldest subject = 63
• About ½ of the subjects are below the age of 30
• On a computer, an ordered array is produced by sorting
[Raw data]
R&R University Primary Illiterate R&R
Secondary Prep. Secondary Illiterate Primary
Prep. Illiterate Primary Prep. Illiterate
[Ordered array]
A AB O O B
AB B A A B
AB AB B B A
O AB B A AB
K = 1 + 3.322 (log10 n)
Where:
• K is the number of class intervals.
• n is the number of individuals.
• The estimate based on this formula can be increased or decreased for convenience and
clear presentation
Example: If n = 275 then K = 1 + 3.322 × 2.4393 = 9.10 ≈ 9
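The formula above can be checked with a short sketch. The function name `sturges_classes` is an assumption used only for illustration.

```python
import math


def sturges_classes(n):
    """Suggested number of class intervals: K = 1 + 3.322 * log10(n)."""
    return 1 + 3.322 * math.log10(n)


k = sturges_classes(275)
print(round(k, 2))   # unrounded estimate
print(round(k))      # number of classes after rounding
```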
2- Width of the class interval(W)
Class intervals should be of the same width, although this is sometimes impossible.
W = Range (R)/K
Where:
• R=largest observation - smallest observation in the data set
17 22 13 25 16 19 14 18 26 14.9
23 22 19.7 12 17 24 26 13 18 20
• Smallest = 12
• Largest = 26
• Rounding outward to the convenient limits 10 and 30: R = 30 − 10 = 20
• If width = 5 then no. of categories = 20/5 = 4 intervals
Weight (Kg) Tally Frequency
10- //// 5
15- //// // 7
20- //// 5
25-30 /// 3
Total 20
Distribution of a sample of subjects by weight
Weight (Kg)   Frequency   CF   %     Cum. %
10-           5           5    25    25
15-           7           12   35    60
20-           5           17   25    85
25-30         3           20   15    100
Total         20               100
The cumulative frequency or cumulative % makes it easier to obtain
the frequency or % of values falling within two or
more contiguous class intervals.
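The weight table above can be reproduced from the 20 raw values with a short sketch. The function name `frequency_table` and the choice to count an interval's upper limit only in the last class are assumptions made for illustration.

```python
weights = [17, 22, 13, 25, 16, 19, 14, 18, 26, 14.9,
           23, 22, 19.7, 12, 17, 24, 26, 13, 18, 20]
edges = [10, 15, 20, 25, 30]   # class limits: 10-, 15-, 20-, 25-30


def frequency_table(data, edges):
    """Return rows of (lower, upper, frequency, CF, %, cumulative %)."""
    rows, cumulative, n = [], 0, len(data)
    for i, (lo, hi) in enumerate(zip(edges, edges[1:])):
        is_last = i == len(edges) - 2
        # each class is "lo to less than hi"; the last class also includes hi
        f = sum(1 for x in data if lo <= x < hi or (is_last and x == hi))
        cumulative += f
        rows.append((lo, hi, f, cumulative, 100 * f / n, 100 * cumulative / n))
    return rows


table = frequency_table(weights, edges)
for lo, hi, f, cf, pct, cum_pct in table:
    print(f"{lo}-{hi}: f={f} CF={cf} %={pct:.0f} cum%={cum_pct:.0f}")
```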
3- Methods of writing class intervals to avoid overlap
A                    B          C       D
15 to less than 20   15-19.9    15-19   15-
20 to less than 25   20-24.9    20-24   20-
25 to less than 30   25-29.9    25-29   25-
30 to less than 35   30-34.9    30-34   30-35
A: clearest, but takes a lot of space
B: quantitative continuous
C: quantitative discrete
D: continuous and discrete
row.
1 2 6 7 3 5 5 2 2
6 2 5 1 3 1 8 1 1
4 1 1 4 4 4 6 1 2
2 1 0 3 3 4 3 1 4
2 3 3 7 4 2 6 1
1 8 4 3 3 5 2 1
Why ?
• Attract the reader’s attention
The human brain is more tolerant of visual
presentations than it is of numerical ones, and
can assimilate information more rapidly and
retain it far longer when pictures are used.
■Bar Graph
■Pie Chart
■Both these are graphical means for
Frequency Distribution
Rating          Frequency
Poor            2
Below Average   3
Average         5
Above Average   9
Excellent       1
Total           20
Relative Frequency Distribution
Rating          Relative Frequency   Percent Frequency
Poor            .10                  10
Below Average   .15                  15
Average         .25                  25
Above Average   .45                  45
Excellent       .05                  5
Total           1.00                 100
Example calculations: 1/20 = .05 (relative frequency for Excellent); .10(100) = 10 (percent frequency for Poor)
Bar Graph
[Figure: bar graph of frequency by rating (Poor, Below Average, Average, Above Average, Excellent)]
Pie Chart
■ The pie chart is a commonly used graphical device
for presenting relative frequency distributions for
qualitative data.
■ First draw a circle; then use the relative
frequencies to subdivide the circle
into sectors that correspond to the
relative frequency for each class.
■ Since there are 360 degrees in a circle,
a class with a relative frequency of .25 would
consume .25(360) = 90 degrees of the circle.
Pie Chart
91 78 93 57 75 52 99 80 97 62
71 69 72 89 66 75 79 75 72 76
104 74 62 68 97 105 77 65 80 109
85 97 88 68 83 68 71 69 67 74
62 82 98 101 79 105 79 69 62 73
Students' wt (Kg)   Relative Frequency   Percent Frequency
50-59               .04                  4
60-69               .26                  26
70-79               .32                  32
80-89               .14                  14
90-99               .14                  14
100-109             .10                  10
Total               1.00                 100
Example calculations for the 50-59 class: 2/50 = .04; .04(100) = 4
Relative Frequency and
Percent Frequency Distributions
■ Insights Gained from the Percent Frequency
Distribution
• Only 4% of the students' weights are in the 50-59 kg class.
• 30% of the students' weights are under 70 kg.
• The greatest percentage (32% or almost one-third)
of the students' weights are in the 70-79 kg class.
• 10% of the students' weights are 100 kg or more.
Dot Plot
• One of the simplest graphical
summaries of data is a dot plot.
• A horizontal axis shows the range of
data values.
• Then each data value is represented by
a dot placed above the axis.
Dot Plot
[Figure: dot plot of students' weight; horizontal axis: Weight (Kg), from 50 to 110]
Example
[Figure: histogram showing the distribution of a group of cholera patients by age (years)]
Histogram
■ Moderately Skewed Left
– A longer tail to the left
– Example: exam scores
[Figure: relative frequency histogram]
Histogram
■ Moderately Skewed Right
– A longer tail to the right
– Example: housing values
[Figure: relative frequency histogram]
Histogram
■ Highly Skewed Right
– A very long tail to the right
– Example: executive salaries
[Figure: relative frequency histogram]
Cumulative Distributions
[Figure: cumulative distribution of weight (Kg), horizontal axis from 50 to 110, with the point (89.5, 76) marked]
Summary Statistics
Introduction
(1) Measures of Central Tendency
Mid-range, Mode, Median, Arithmetic mean
1- Mid-range
A- Ungrouped data
Mid-range = (18 + 30) / 2 = 24 Kg
B- Grouped data
Mid-range = (LL. of first interval + UL. of last interval) / 2
Example:
Body weight (kg)   Frequency
25-                5
30-                2
35-                14
40-                9
60-75              4
Total              34
Mid-range = (25 + 75) / 2 = 50 Kg
Advantages
• Easy
• Quick
Disadvantages
• Used only with quantitative variables
• Rough measure
2- Mode
The observation or observations of highest frequency
A) Ungrouped data Examples: Weight (kg)
18 11 16 14 19 15 13 12
No Mode
16 12 16 14 18 16 14 12
Mode = 16 Kg
The mode is the most frequently occurring value in a set of discrete data.
There can be more than one mode if two or more values are equally
common.
Example
Suppose the results of an end of term Statistics exam were distributed as
follows:
Student Score
1 94
2 81
3 56
4 90
5 70
6 65
7 90
8 90
9 30
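As a sketch, the mode of these exam scores and of the two ungrouped weight examples can be found with `statistics.multimode` from the Python standard library, which returns every value tied for the highest frequency. Treating "all values tied" as "no mode" is the convention used on the earlier slide, not something the function itself reports.

```python
from statistics import multimode

scores = [94, 81, 56, 90, 70, 65, 90, 90, 30]
weights_a = [18, 11, 16, 14, 19, 15, 13, 12]   # every value occurs once
weights_b = [16, 12, 16, 14, 18, 16, 14, 12]

print(multimode(scores))      # [90]: the score 90 occurs three times
print(multimode(weights_b))   # [16]

# When every value is tied, the slide's convention is to report "no mode"
modes_a = multimode(weights_a)
print("No mode" if len(modes_a) == len(weights_a) else modes_a)
```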
B) Grouped data
Examples:
Weight (kg)   Frequency
25-           14
60-75         4
Total         43
1st mode = (25 + 30) / 2 = 27.5 Kg
2nd mode = (35 + 40) / 2 = 37.5 Kg
Bld. Gr.   Frequency
A          10
B          14
AB         25
O          9
Total      58
Mode is AB
Advantages
• Easy
• Used with all types of variables
• Not affected by extreme observations
Disadvantages
• Neglects the less frequent observations
• Sometimes there is no mode
• The distribution may be bi-modal or multi-modal
The median
The median is the middle observation in a set
• 50% of the data have a value less than the median, and 50% of the
data have a value greater than the median.
• The median is the value halfway through the ordered data set, below
and above which there lies an equal number of data values.
Calculation of the median from raw data
Let n = the number of observations
If n is odd, the median x̃ is the ((n + 1)/2)th observation.
If n is even, the median is the mean of the (n/2)th observation
and the (n/2 + 1)th observation.
A) Ungrouped data
Odd number of observations:
• Arrange observations in ascending order
• Rank of median = (n + 1)/2
Example:
• Raw data → 24 – 18 – 22 – 20 – 16 kg
• Arranged data → 16 – 18 – 20 – 22 – 24 kg
• Rank = (5+1)/2 = 3
• Median = value of 3rd observation = 20 kg
Even number of observations:
Example:
• Raw data → 26 – 24 – 18 – 22 – 20 – 16 kg
• Arranged data → 16 – 18 – 20 – 22 – 24 – 26 kg
• Median = (20+22) / 2 = 21 kg
Example
With an odd number of data values, for example 21, we have:
Data:         96 48 27 72 39 70 7 68 99 36 95 4 6 13
              34 74 65 42 28 54 69
Ordered data: 4 6 7 13 27 28 34 36 39 42 48 54 65 68
              69 70 72 74 95 96 99
Median: the 11th (middle) data point, so the median is 48
With an even number of data values, for example 20, we have:
Data:         57 55 85 24 33 49 94 2 8 51 71 30 91 6 47 50
              65 43 41 7
Ordered data: 2 6 7 8 24 30 33 41 43 47 49 50 51 55 57 65
              71 85 91 94
Median: halfway between the two 'middle' data points,
in this case halfway between 47 and 49, and
so the median is 48
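Both worked examples can be verified with a short sketch. The helper `median_by_rank` is a hypothetical name; it follows the rank rule stated earlier and is compared against `statistics.median` from the standard library.

```python
from statistics import median

odd_data = [96, 48, 27, 72, 39, 70, 7, 68, 99, 36, 95, 4, 6, 13,
            34, 74, 65, 42, 28, 54, 69]
even_data = [57, 55, 85, 24, 33, 49, 94, 2, 8, 51, 71, 30, 91, 6, 47, 50,
             65, 43, 41, 7]


def median_by_rank(data):
    """Median from ranks: middle value (n odd) or mean of the two middle values (n even)."""
    ordered = sorted(data)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]            # ((n + 1)/2)th observation
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2  # (n/2)th and (n/2 + 1)th


print(median_by_rank(odd_data), median(odd_data))
print(median_by_rank(even_data), median(even_data))
```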
Calculate the median age, weight and height of the group
Calculation of the median from a grouped frequency distribution
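Solution: n = 60 ⇒ n/2 = 30
The median is in the 2nd class.
median ≈ l.c.b. + ((n/2 − F) / f) × w
where l.c.b. is the lower class boundary of the median class, F is the cumulative frequency before that class, f is its frequency and w is its width.
Solution: n/2 = 5×5
The median is in the 3rd class.
median ≈ l.c.b. + ((n/2 − F) / f) × w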
ordinal
Disadvantage
• Does not take all observations into consideration
4- The arithmetic mean
A) Ungrouped data
Mean = (24 + 20 + 22 + 16 + 18) / 5 = 100 / 5 = 20 Kg
Notation…
When referring to the number of observations in a
population, we use uppercase letter N
        Population   Sample
Size    N            n
Mean    μ            x̄
Arithmetic Mean…
Sample Mean
Population Mean
Statistics is a pattern language…
        Population   Sample
Size    N            n
Mean    μ            x̄
Exercise
Body Mass Index: 24.4 30.4 21.4 25.1 21.3 23.8 20.8 22.9
20.9 23.2 21.1 23.0 20.6 26.0
Weight (kg)   Frequency fj   Mid-point of interval Xj   fjXj
15-           3              20                         60
25-           6              30                         180
35-           8              40                         320
45-           2              50                         100
55-65         1              60                         60
Total         Σfj = 20                                  ΣfjXj = 720
X̄ = ΣfjXj / Σfj = 720 / 20 = 36 Kg
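The grouped mean of 36 kg can be checked with a short sketch that weights each class mid-point by its frequency. The variable names are chosen only for illustration.

```python
midpoints = [20, 30, 40, 50, 60]    # Xj: mid-points of 15-, 25-, 35-, 45-, 55-65
frequencies = [3, 6, 8, 2, 1]       # fj

n = sum(frequencies)                                         # sum of fj
total = sum(f * x for f, x in zip(frequencies, midpoints))   # sum of fj * Xj
grouped_mean = total / n
print(n, total, grouped_mean)
```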
Advantages
Disadvantages
Exercise:
Weight: 83.9 99.0 63.8 71.3 65.3 79.6 70.3 69.2 56.4 66.2
Height: 185 180 173 168 175 183 184 174 164 169 205 161
177 174
Mean, Median, Mode…
If a distribution is symmetrical,
the mean, median and mode may coincide…
median
mode
mean
Mean, Median, Mode…
If a distribution is asymmetrical, say skewed
to the left or to the right, the three
measures may differ. E.g.:
Measures of Variability…
Measures of central location fail to tell the whole
story about the distribution; that is, how much are
the observations spread out around the mean value?
Measures of Relative
Importance
Proportion = Number of observations having a given characteristic / Total number of observations
Summary Statistics
Objectives
After this session participants will be able to do the following
Compute and interpret the following measures of dispersion:
• Range
• Standard deviation
• Variance
• Coefficient of variation
Choose and apply the suitable measure of dispersion
• They are also called measures of spread or
variation.
• Definition:
1- Range
• It is the simplest measure of dispersion.
• Definition:
• It is the difference between the highest and lowest
values.
• Advantage:
• It is quick and easy to calculate.
• Disadvantage:
• It does not use directly the majority of the
observations.
• It is very sensitive to extreme values.
• Example: The following data represent the
weight of 10 persons:
• 20, 60, 53, 80, 89, 56, 42, 46, 88, 95 kg
• Find the range
• Answer: largest observation = 95
• smallest observation = 20
• The range = 95 − 20 = 75 kg
Age (years) Frequency
15- <25 4
25- <35 8
35- <45 26
45- < 55 8
55- < 65 4
Total 50
Ø Compute the range
Answer :
Upper limit of last interval =65
Lower limit of first interval =15
Range = 65 - 15 = 50 years
2- Standard Deviation
• Definition:
• It is a measure of the spread of data around their mean.
• It is the positive square root of the variance.
• The standard deviation and variance are never negative (they are
zero only when all observations are equal).
• Advantage:
• It is the preferred measure of dispersion.
• It uses all of the measurements in the set.
• Disadvantage:
• It is influenced by a few (or even only one) extreme
values.
Steps of calculation:
1. Determine the sum of observations (ΣX)
2. Find (ΣX)²
3. Find the square of each observation X²
4. Find the sum of the squared observations (ΣX²)
5. Apply the formula:
S = √[ (ΣX² − (ΣX)²/n) / (n − 1) ]
Example (Ungrouped data)
Answer
ΣX = 14 + 15 + 16 + 17 + 18 = 80
(ΣX)² = (80)² = 6400
ΣX² = 196 + 225 + 256 + 289 + 324
= 1290
S = √[ (1290 − 6400/5) / (5 − 1) ] = √(10/4) ≈ 1.6 kg
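The calculation steps can be sketched in code and compared with `statistics.stdev`, which also uses the n − 1 denominator. The observations 14 to 18 are taken from the answer above; the variable names are illustrative.

```python
import math
from statistics import stdev

x = [14, 15, 16, 17, 18]
n = len(x)

sum_x = sum(x)                      # sum of X
sum_x_sq = sum(v * v for v in x)    # sum of X squared

# S = sqrt( (sum of X^2 - (sum of X)^2 / n) / (n - 1) )
s = math.sqrt((sum_x_sq - sum_x ** 2 / n) / (n - 1))
print(sum_x, sum_x_sq, round(s, 2))
print(round(stdev(x), 2))           # library result for comparison
```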
Example (grouped data)
Weight (kg)   fj   xj   fj xj   fj xj²
15-           3    20   60      1200
25-           6    30   180     5400
35-           8    40   320     12800
45-           2    50   100     5000
55-65         1    60   60      3600
Total         20        720     28000
With Σfj xj = 720 and Σfj xj² = 28000:
S = √[ (28000 − (720)²/20) / (20 − 1) ]
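The slide stops at the formula. As a sketch, the expression can be evaluated directly from the table's mid-points and frequencies; the numerical result below is computed here and is not stated in the original.

```python
import math

midpoints = [20, 30, 40, 50, 60]    # xj
frequencies = [3, 6, 8, 2, 1]       # fj

n = sum(frequencies)
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))        # 720
sum_fx_sq = sum(f * x * x for f, x in zip(frequencies, midpoints))  # 28000

# Grouped-data form of the same formula: weight each term by its frequency
s_grouped = math.sqrt((sum_fx_sq - sum_fx ** 2 / n) / (n - 1))
print(round(s_grouped, 2))
```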
3- Variance
Variance and its related measure, standard deviation, are arguably
the most important statistics.
They are used to measure variability and also play a vital role in
almost all statistical inference procedures.
Statistics is a pattern language…
            Population   Sample
Size        N            n
Mean        μ            x̄
Variance    σ²           s²
Variance…[Check your calculator?]
population mean
population size
sample mean
Application…
The following sample consists of the number of
jobs six randomly selected students applied for:
17, 15, 23, 7, 9, 13.
Find its mean and variance.
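A minimal sketch of this application using the Python standard library, where `statistics.variance` computes the sample variance (denominator n − 1):

```python
from statistics import mean, variance

jobs = [17, 15, 23, 7, 9, 13]

sample_mean = mean(jobs)
sample_variance = variance(jobs)    # sum of squared deviations / (n - 1)

# Same variance written out step by step
deviations_sq = [(x - sample_mean) ** 2 for x in jobs]
by_hand = sum(deviations_sq) / (len(jobs) - 1)

print(sample_mean, sample_variance, by_hand)
```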
Sample Mean & Variance…
Sample Mean
Sample Variance
Standard Deviation…
The standard deviation is simply the square root
of the variance, thus:
Statistics is a pattern language…
                     Population   Sample
Size                 N            n
Mean                 μ            x̄
Variance             σ²           s²
Standard Deviation   σ            s
Students work
7. For the following data {7, 2, 9, 7, 5}, calculate the
a. Mean
b. Median
c. Mode
d. Range
e. Variance
f. standard deviation
g. what percentile is the number “9” in the data set?
Students work
•The following are average weights of 30 students
of UoS
65 67 70 71 68 69 65 68 65 68 69 83 90 45 49 67 68 69 70
71 72 71 71 72 71 72 74 70 65 66
• Using Excel, calculate: mean, SE, median, mode,
SD, range.
• In Excel, Tools > Data Analysis (you may need to “add in”…)
can produce all of these statistics.
Empirical Rule – The standard
deviation and the normal distribution
For unimodal, moderately symmetrical sets of
data (i.e. approximately Normally distributed data):
• Approximately 68% of observations lie within 1 standard
deviation of the mean.
• Approximately 95% of observations lie within 2 standard
deviations of the mean.
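These percentages can be checked against the standard Normal distribution with `statistics.NormalDist` (Python 3.8 or later). This is only a sketch of where the rule comes from; real data will match it only approximately.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

within = {}
for k in (1, 2, 3):
    # area between -k and +k standard deviations from the mean
    within[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"within {k} SD: {within[k]:.4f}")
```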
The Empirical Rule
99.7% of data are within 3 standard deviations of the mean
[Figure: normal curve centred on x̄ with marks at x̄ ± s, x̄ ± 2s and x̄ ± 3s. Areas: 34% + 34% within 1 standard deviation (68%); 13.5% on each side between 1 and 2 standard deviations (95% within 2); 2.4% on each side between 2 and 3; 0.1% in each tail beyond 3]
z-Scores and Location
• By itself, a raw score or X value provides very little
information about how that particular score
compares with other values in the distribution.
• For example, your score (X) = 53. This score may
be a relatively low score, or an average score, or
an extremely high score depending on the mean
and standard deviation for the distribution from
which the score was obtained.
• If you transformed your score (X) into a z-score,
the value of the z-score tells exactly where your
score (x) is located relative to all the other scores
in the class.
z-Scores and Location (cont.)
• The process of changing an X value into a z-score
involves creating a signed number, called a z-score,
such that
a. The sign of the z-score (+ or –) identifies
whether the X value is located above the
mean (positive) or below the mean (negative).
b. The numerical value of the z-score
corresponds to the number of standard
deviations between X and the mean of the
distribution (class average).
z-Scores and Location (cont.)
• Thus, a score (x) that is located two standard
deviations above the mean will have a z-score
of +2.00. And, a z-score of +2.00 always
indicates a location above the mean by two
standard deviations.
Definition of z-score
Population z-score: z = (x − µ) / σ
Sample z-score: z = (x − x̄) / s
In either case, the z-score tells us how
many standard deviations above (if z > 0)
or
below (if z < 0) the mean an observation is.
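A minimal sketch of the definition. The score of 53 echoes the earlier example; the two class means and the standard deviation of 3 are hypothetical values chosen to show how the same raw score can sit above or below the mean.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


# Same raw score, two hypothetical classes
z_low_mean = z_score(53, mean=50, sd=3)    # score is above this class mean
z_high_mean = z_score(53, mean=59, sd=3)   # score is below this class mean
print(z_low_mean, z_high_mean)
```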
Interpretation of z-Scores
• If z = 0 an observation is at the mean.
• If z > 0 the observation is above the mean in
value, e.g. if z = 2.00 the observation is 2 SDs
above the mean.
• If z < 0 the observation is below the mean in
value, e.g. if z = -1.00 the observation is 1 SD
below the mean.
The Empirical Rule (z-scores)
99.7% of data are within 3 standard deviations of the mean
[Figure: normal curve showing 68% within 1 standard deviation (34% + 34%), 95% within 2 standard deviations (13.5% on each side between 1 and 2), 2.4% on each side between 2 and 3, and 0.1% in each tail]
Outliers based on z-scores
• By the empirical rule, an observation with a
z-score < -2.00 or z-score > 2.00
might be characterized as a mild outlier.
Measures of Shape –
Skewness and Kurtosis
Statistical software packages will give some
measure of skewness and kurtosis for a
given numeric variable.
Skewness measures departure from symmetry
and is usually characterized as being left or
right skewed as seen previously.
Kurtosis measures “peakedness” of a
distribution and comes in two forms,
platykurtosis and leptokurtosis.
Skewness
Pearson’s Skewness Coefficient
Skewness = (x̄ − median) / s
If skewness < −.20: severe left skewness
If skewness > +.20: severe right skewness
Fisher’s Measure of Skewness has a complicated
formula but most software packages compute it.
Example: Skewness = −.5786, suggesting slight left skewness.
Example: Skewness = 1.944, suggesting strong right skewness.
Kurtosis
Measures peakedness of a distribution.
Normal distribution
has Kurtosis = 0.
P(cruise) = 12/50 = 0.24
Certain and impossible
• The probability of an event is a number
between 0 and 1.
• If an event (E) is certain to happen, then
P(E) = 1
• e.g. A die is thrown:
P(integers) = 6/6 = 1
• If an event (E) is impossible, then P(E) = 0
[Figure: probability scale from 0 (certain not to happen; chance 0%) through ½ (equally likely to happen or not to happen; chance 50%) to 1 (certain to happen; chance 100%)]
P(H) = 1/2 and P(T) = 1/2
Applied Probability
25 times
What is the probability of getting 5 or 6
when a die is thrown?
P(5 or 6) = 2/6 = 1/3 = 0.333
Probability
• A jar contains 12 blue ,
8 green, and 5 red marbles
–If you reach in & choose 1
–What is the P it is blue?
–What is the P it is not blue?
–What is the P it is not black?
• What is the P it is blue?
P = 12/25 = .48
• What is the P it is not blue
P = 13/25 = .52
• What is the P it is not black
P = 25/25 = 1
Example:
P(red marbles) = 5/10 = 1/2 = 0.5
The Addition Rule
• It is applied for mutually exclusive events:
– Cannot occur together.
• toss 1 coin, H and T are mutually exclusive, can
get one or the other, not both
P (A or B) = P (A) + P (B).
The Addition Rule
• Example
• Probability of getting a head or tail when
you toss a coin
Example:
You roll a die. Find the probability that you roll a number less
than 3 or a 4.
= 1/52 × 1/52
= 1/2704 = 0.00037
• Example 1: toss 2 coins or 1 coin 2 times, H1
and T2 are independent
Dihybrids
For a dihybrid cross, YyRr × YyRr, what is the probability of an F2 plant having the genotype YYRR?
Probability that an egg from a YyRr parent will receive the Y and the R alleles = ½ × ½ = ¼
Probability that a sperm from a YyRr parent will receive the Y and the R alleles = ½ × ½ = ¼
The overall probability of an F2 plant having the genotype YYRR
= ¼ × ¼ = 1/16.
• A Biostatistics Class has 17 boys
and 16 girls.
• One student is chosen at random.
• The Probability that the student
is a girl is:
• # of students = 16 + 17
• # of students = 33
• # of girls = 16
• P = 16/33 = .485
What’s the probability of …
• getting a 6 on a die
• a letter chosen from the word RABBIT
being a B
• getting a number less than 3 on a
die
• a person’s birthday being on a Sunday
this year
Two dice are rolled. Let us define event E as the set of possible
outcomes where the sum of the numbers on the faces of the two dice is
equal to 5.
S = { (1,1),(1,2),(1,3),(1,4),(1,5),(1,6)
(2,1),(2,2),(2,3),(2,4),(2,5),(2,6)
(3,1),(3,2),(3,3),(3,4),(3,5),(3,6)
(4,1),(4,2),(4,3),(4,4),(4,5),(4,6)
(5,1),(5,2),(5,3),(5,4),(5,5),(5,6)
(6,1),(6,2),(6,3),(6,4),(6,5),(6,6) }
E = {(1,4),(2,3),(4,1), (3,2)}
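The sample space S and the event E can be enumerated with a short sketch. The step from E to a probability, P(E) = |E| / |S|, assumes the 36 outcomes are equally likely (fair dice), which the slide does not state explicitly.

```python
from fractions import Fraction
from itertools import product

# All ordered outcomes of two dice: (first die, second die)
S = list(product(range(1, 7), repeat=2))

# Event E: the two faces sum to 5
E = [outcome for outcome in S if sum(outcome) == 5]

p_E = Fraction(len(E), len(S))      # equally likely outcomes assumed
print(len(S), E, p_E)
```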
Proportion under
the normal curve
Steps for finding the area using the Z-score table:
1- Calculate the Z-score using the formula z = (x − μ)/σ and round the answer to the
hundredth (two decimals).
2- Note the absolute value, by ignoring the sign.
3- Find the Z value to the first decimal in the leftmost column, then move along the
row to the column matching the second decimal place.
4- Note down the area given inside the table.
2. Standardize the value and sketch
the Standard Normal Curve
This is the score of Aysha in two tests, compared
to the class
Test A: as a z-score, z = (78-70) / 8 = 1.00
Test B: as a z-score , z = (78 - 66) / 6 = 2.00
Find the area beyond 69; subtract from this the area
beyond 76.
Find z for 69: = 1.125. “Area beyond z” = 0.1314.
Find z for 76: = 2.00. “Area beyond z” = 0.0228.
0.1314 - 0.0228 = 0.1086 .
Thus 10.86% of scores fall between 69 and 76 (52 out of
480).
Question 1: The test scores of students in a class test have a mean of 70
and a standard deviation of 12. What percentage of
students is expected to score more than 85?
Solution:
The z score for the given data is
z = (85 − 70)/12 = 1.25
From the z score table, the fraction of the data below this z score is 0.8944.
This means 89.44% of the students have test scores of 85 or below, and hence the percentage of
students who are above the test score of 85 = (100 − 89.44)% = 10.56%
The z score of the employees with a salary less than 3000 = (3000 − 4000)/600
= −1.67 (approx)
The z score of the employees with a salary more than 4500 = (4500 − 4000)/600
= 0.83 (approx)
Therefore, the fraction of data between the z scores of −1.67 and 0.83 = 0.7967 − 0.0475 = 0.7492
Hence, 74.92% of clerical level employees are within the salary bracket [3000, 4500].
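The two table look-ups above can be cross-checked in software with `statistics.NormalDist`. A mean salary of 4000 and a standard deviation of 600 are read off the z-score calculations above. Because the code does not round z to two decimals, its salary result differs slightly from the table-based 0.7492.

```python
from statistics import NormalDist

# Test scores: mean 70, SD 12; proportion scoring more than 85
scores = NormalDist(mu=70, sigma=12)
p_above_85 = 1 - scores.cdf(85)

# Salaries: mean 4000, SD 600; proportion between 3000 and 4500
salaries = NormalDist(mu=4000, sigma=600)
p_salary_band = salaries.cdf(4500) - salaries.cdf(3000)

print(f"P(score > 85) = {p_above_85:.4f}")
print(f"P(3000 <= salary <= 4500) = {p_salary_band:.4f}")
```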
Problems: Normal Distribution
• If the random variable X has a normal distribution with
mean 40 and std. dev. 5, calculate the following
probabilities.
– P(X > 43) =
– P(X = 40) =
– P (Y < 18) =
In the following example, the mean time it takes expectant mothers to
locate a baby's face in a crowd is 77 milliseconds, with a standard
deviation of 10 milliseconds.
What proportion of expectant mothers took 90 milliseconds (X = 90)
or less to recognize the babies' faces?
In the following example, the average miles per gallon (MPG) a Ford motor car
gets is 23, with a standard deviation of 5. How many miles to the gallon do
the top 10% of Ford cars get?
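Both problems can be sketched with `statistics.NormalDist`, assuming the recognition times and the MPG values are Normally distributed (the slides imply this but do not say so). The first question is a forward look-up (value to proportion); the second is the inverse (proportion to value), for which `inv_cdf` is used.

```python
from statistics import NormalDist

# Recognition time: mean 77 ms, SD 10 ms; proportion taking 90 ms or less
recognition = NormalDist(mu=77, sigma=10)
p_90_or_less = recognition.cdf(90)

# MPG: mean 23, SD 5; the top 10% lie above the 90th percentile
mpg = NormalDist(mu=23, sigma=5)
top_10_cutoff = mpg.inv_cdf(0.90)

print(f"P(X <= 90) = {p_90_or_less:.4f}")
print(f"Top 10% get at least {top_10_cutoff:.1f} MPG")
```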
Chapter 9:
Basics of Hypothesis Testing
In Chapter 9:
Parameters Statistics
Vary No Yes
Calculated No Yes
Sampling Distributions of a Mean
x̄ ~ N(µ, SE_x̄)
where SE_x̄ = σ / √n
The sample mean (x̄) based on large samples will have a Normal sampling
distribution with an expectation equal to the population mean and a standard
error equal to the standard deviation of the population divided by the square
root of the sample size n.
Hypothesis Testing…
•Any study starts by identifying the hypotheses
behind the study
zstat = (x̄ − µ0) / SE_x̄ = (185 − 170) / 5 = 3.00
Reasoning Behind zstat
x̄ ~ N(170, 5)
Sampling distribution of x̄
under H0: µ = 170 for n = 64 ⇒
3 P-value
• The P-value answers the question: What is the
probability of the observed test statistic or one more
extreme when H0 is true?
• This corresponds to the AUC (Area under Curve) in the
tail of the Standard Normal distribution beyond the
zstat.
• Convert z statistics to P-value:
For Ha: μ > μ0 ⇒ P = area to the right of zstat (right tail
beyond zstat)
For Ha: μ < μ0 ⇒ P = area to the left of zstat (left tail beyond zstat)
For Ha: μ ≠ μ0 ⇒ P = 2 × one-tailed P-value
• Use Table B or software to find these probabilities (next
two slides).
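As a sketch of the "software" route, the three conversion rules can be written as one function using `statistics.NormalDist`. The function name `p_value` and the labels "greater", "less" and "two-sided" are assumptions chosen for illustration.

```python
from statistics import NormalDist


def p_value(z_stat, alternative="two-sided"):
    """Convert a z statistic to a P-value under the standard Normal distribution."""
    std_normal = NormalDist()
    if alternative == "greater":        # Ha: mu > mu0, right tail
        return 1 - std_normal.cdf(z_stat)
    if alternative == "less":           # Ha: mu < mu0, left tail
        return std_normal.cdf(z_stat)
    # Ha: mu != mu0, double the tail beyond |z_stat|
    return 2 * (1 - std_normal.cdf(abs(z_stat)))


p_one_sided = p_value(0.6, "greater")
p_two_sided = p_value(0.6)
print(f"{p_one_sided:.4f} {p_two_sided:.4f}")
print(f"{p_value(3.0, 'greater'):.4f}")
```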
One-sided (Tailed) P-value for zstat of 0.6
One-sided (Tailed) P-value for zstat of 3.0
Two-Sided (Tailed) P-Value
• One-sided Ha ⇒ AUC in tail beyond zstat
• Two-sided Ha ⇒ consider potential deviations in both
directions ⇒ double the one-sided P-value
Examples: If one-sided P = 0.0010, then two-sided
P = 2 × 0.0010 = 0.0020.
If one-sided P = 0.2743, then two-sided P = 2 ×
0.2743 = 0.5486.
Interpretation
• The P-value answers the question: What is the
probability of the observed test statistic …
when H0 is true?
• Thus, smaller and smaller P-values provide
stronger and stronger evidence against H0
• Small P-value ⇒ strong evidence
Interpretation
Conventions*
P > 0.10 ⇒ non-significant evidence against H0
0.05 < P ≤ 0.10 ⇒ marginally significant evidence
0.01 < P ≤ 0.05 ⇒ significant evidence against H0
P ≤ 0.01 ⇒ highly significant evidence against H0
Examples
P = .27 ⇒ non-significant evidence against H0
P = .01 ⇒ highly significant evidence against H0
* It is unwise to draw firm borders for “significance”
Interpreting the p-value…
• Overwhelming Evidence (Highly Significant)
• Strong Evidence (Significant)
• Weak Evidence (Not Significant)
• No Evidence (Not Significant)
[Figure: scale of evidence with p = .001 and p = .27 marked]
α-Level (Used in some situations)
Is the average GPA 2.7?
Population of 5 million college students
(Imagine that 2.7 was the mean GPA for U.S. college
students in 1990)
Zstat = (X̄ − µ0) / SE(X̄) = (X̄ − µ0) / (s / √n)
= (2.91 − 2.7) / (.61 / √100) = 3.44
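The test statistic above can be reproduced with a short sketch. The one-sided alternative (mean GPA greater than 2.7) is an assumption based on the grade-inflation framing of the example, and the variable names are illustrative.

```python
import math
from statistics import NormalDist

sample_mean = 2.91     # observed mean GPA
mu_0 = 2.7             # hypothesised mean under H0
sd = 0.61              # standard deviation used on the slide
n = 100                # sample size

standard_error = sd / math.sqrt(n)
z_stat = (sample_mean - mu_0) / standard_error

# Assumed one-sided alternative Ha: mu > 2.7 (right-tail area beyond z_stat)
p_one_sided = 1 - NormalDist().cdf(z_stat)
print(f"z = {z_stat:.2f}, one-sided P = {p_one_sided:.5f}")
```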
Example: Grade Inflation (cont’d)
p-value calculation and interpretation
Interpretation:
• We conclude that the mean GPA of U.S.
college students today is greater than 2.70,
which is what it was back in 1990.