
Introduction

STATISTICS and BIOSTATISTICS


Statistics is the art and science of data. It deals
with
•Planning Research
•Collecting Data
•Describing Data
•Summarizing- Presenting Data
•Analyzing Data
•Interpreting Results
• Reaching decisions or discovering new
knowledge
* Biostatistics:
The tools of statistics are employed in many fields:
business, education, psychology, agriculture,
economics, … etc.
When the data analyzed/managed are derived
from the biological science and medicine,
we use the term biostatistics to distinguish this
particular application of statistical tools and
concepts.
Goals of Biostatistics
• Improvement of the intellectual content of
the data

• Organization of data into understandable


forms

• Reliance on test of experience as a standard


of validity
The Cycle of Statistical Investigation
1. Real problems and curiosity: pose the question
2. Design the method of data collection
3. Collect data
4. Summary and analysis of data
5. Interpret the results: what do they mean?
6. Answer to the original question
Data:
• The raw material of Statistics is data.
• We may define data as figures.
– Figures result from the process of counting or from
taking a measurement.
• For example:
• - When a hospital administrator counts the number of patients (counting).
• - When a nurse records a patient's temperature, weight, height, or arterial blood pressure (measurement).
* Sources of Data:
We search for suitable data to serve as the raw
material for our investigation.
Such data are available from one or more of the
following sources:
1- Routinely kept records.
For example:
- Hospital medical records contain immense
amounts of information on patients.
2- External sources.
• The data needed to answer a question may
already exist in the form of “published reports,
commercially available data banks, or the
research literature”,
• i.e. someone else has already asked the same
question.
3- Surveys:
The source may be a survey, if the data needed is
about answering certain questions.
For example:
If the administrator of a clinic wishes to obtain
information regarding the mode of transportation
used by patients to visit the clinic,
then a survey may be conducted among
patients to obtain this information.
4- Experiments.
Frequently the data needed to answer
a question are available only as the
result of an experiment.
For example:
If a nurse wishes to know which of several strategies is
best for maximizing patient compliance, she might
conduct an experiment in which the different
strategies of motivating compliance are tried with
different patients.
Variable (Measurement):
It is a characteristic that takes on different values in different persons, places, or things.
For example:
- heart rate,
- the heights of adult males,
- the weights of preschool children,
- the ages of patients seen in a dental clinic.
• Constant
• An observation that does not vary from time to time or from person to person, e.g. number of fingers, number of eyes.
Types of Data
• There are different ways to classify variables
and measurements
– Categorical measurements place observations into
unordered categories

– Ordinal measurements place observations into categories


that can be put into rank order

– Quantitative measurements impose equal spacing


between ordered intervals
Type of Variables (Measurements)
Quantitative Variables
Observations are along a numeric scale.
For example:
- the heights of adult males,
- the weights of preschool children,
- Age
- Blood pressure
- Volume
- Density
- Mass

Qualitative Variables (Categorical or Nominal)
Many characteristics are not capable of being measured. Some of them can be ordered or ranked.
For example:
- Sex,
- Blood Group
- Taste
- Color
- Social classes based on income, education, etc.
Quantitative variables
A discrete variable
is characterized by gaps or interruptions in the values that it can assume.
For example:
- The number of daily admissions to a general hospital,
- The number of decayed, missing or filled teeth per child in an elementary school.

A continuous variable
can assume any value within a specified relevant interval of values assumed by the variable.
For example:
- Height,
- weight,
- skull circumference.
No matter how close together the observed heights of two people, we can find another person whose height falls somewhere in between.
KNOW YOUR
Qualitative Variables
(Categorical or Nominal)
Qualitative Variables with two categories
(Binary = Dichotomous=0/1)
• These often relate to the presence or absence of
some attribute
• Male/female
• Disease/No disease
• Married/Unmarried
• Diabetic/non diabetic
• Smoker/non smoker (ex-smoker?)
• Hypertensive/normotensive
Qualitative Variables with more
than two categories
• 1= Nominal
– Country of birth
– Blood group A/B/AB/O
– Married/Single/Divorced/Separated/ Widow

• 2= Ordinal: qualitative variables whose categories can


be put in a definite order
– STAGE OF CANCER classified as stage I, stage II, stage III,
stage IV
– OPINION classified as strongly agree (5), agree (4), neutral
(3), disagree (2), strongly disagree (1); the so-called Likert scale
Any quantitative variable can be converted
into categorical one (two categories or
ordinal)
Quantitative variable     Categorization   Ordinal          Two categories
Systolic blood pressure   >140 mm Hg       Hypertensive     Normal / Abnormal
                          90-140           Normal
                          <90              Hypotensive
Blood glucose             >120             Hyperglycaemia   Normal / Abnormal
                          80-120           Normal
                          <80              Hypoglycaemia
Smoking                   >10/day          Heavy            Smoker / Non-smoker
                          5-10             Moderate
                          <5               Mild
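The conversion of a quantitative variable into categories can be sketched in Python. This is only an illustration: the function names and the sample readings are hypothetical, and the treatment of the boundary values 90 and 140 as "Normal" follows the "90-140" row of the table.

```python
def categorize_sbp(sbp_mmhg):
    """Ordinal category for systolic blood pressure, using the table's cut-offs."""
    if sbp_mmhg > 140:
        return "Hypertensive"
    if sbp_mmhg >= 90:
        return "Normal"
    return "Hypotensive"


def dichotomize_sbp(sbp_mmhg):
    """Two-category version of the same variable: Normal vs. Abnormal."""
    return "Normal" if categorize_sbp(sbp_mmhg) == "Normal" else "Abnormal"


readings = [85, 118, 140, 162]  # hypothetical readings in mm Hg
for r in readings:
    print(r, categorize_sbp(r), dichotomize_sbp(r))
```

Note that information is lost at each step: the ordinal version no longer distinguishes 91 from 139, and the two-category version no longer distinguishes high from low.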
2 Types of Inaccuracies
• Be objective when you collect your data
• Avoid imprecision (the random inability to get the same result upon repetition, i.e. irreproducibility)
• Avoid bias (a systematic deviation from the truth)
* A population:
It is the largest collection of values of a random
variable for which we have an interest at a
particular time.
For example:
The weights of all the children enrolled in a
certain elementary school.
Populations may be finite or infinite.
* A sample:
It is a part of a population.
For example:
The weights of only a fraction of these
children.
Types of Studies and Study Design
The Basic Types of Statistics
• DESCRIPTIVE STATISTICS is relevant in several
different situations:
1. When a researcher needs to summarize or
describe the distribution of a single variable.
These statistics are called univariate (“one
variable”) descriptive statistics.

For example, percentages, averages, and graphs can all be used to describe single variables.
2. When the researcher wishes to describe the
relationship between two or more variables.
These statistics are called bivariate (“two
variable”) or multivariate (more than two
variable) descriptive statistics.

These statistics, called measures of association, allow


us to quantify the strength and direction of a
relationship.
• INFERENTIAL STATISTICS
– This second class of statistical techniques becomes relevant when we wish to generalize to a population: the total collection of all cases in which the researcher is interested and that he or she wishes to understand better
• Populations can theoretically range from
inconceivable in size (“all humanity”) to quite small
(all 35-year-olds currently residing in Sharjah) but are
usually fairly large, and social scientists almost never
have the resources or time to test every case in a
population.

• Hence the need for inferential statistics, which involve using information from samples (carefully chosen subsets of the population) to make inferences about populations
Research classifications
• Observational vs. Experimental
Observational – researcher collects info on
attributes or measurements of interest, but
does not influence results.

Experimental – researcher deliberately influences events and investigates the effects of the intervention, e.g. clinical trials and laboratory experiments.
We often use these when we are interested in studying the effect of a treatment on individuals or experimental units.
Studies
– Observational (Surveys)
– Comparative Studies
  • Experimental
  • Non-Experimental
• The purpose of a survey is to quantify population characteristics
• Comparative studies are done to quantify relationships between variables
Experiments & Observational Studies
We conduct an experiment when it is possible (ethically, physically, etc.) for the experimenter to determine which experimental units receive which treatment.
Experiment Terminology
Experimental Unit   Treatment                Response
patient             drug                     cholesterol
patient             pre-surgery antibiotic   infection
mouse               radiation                mortality
In an observational study, we compare the units that happen to have received each of the treatments.
Observational Study
Unit       Treatment            Response
patient    smoking              lung cancer
hospital   ICU staffing level   ICU mortality
e.g. You cannot assign patients to a control (non-smoking) group and a treatment (smoking) group.
Note:
Only a well-designed and well-executed experiment can reliably establish causation.
An observational study is useful for identifying possible causes of effects, but it cannot reliably establish causation.
SAMPLING

• We usually can NOT study the whole population
• Data yielded from the sample are then used to infer population characteristics
• Sampling saves time and money and can be more accurate when taken in the right way
Sampling is the rule in statistics
Sampling
• Samples must be collected in a way to allow
for generalizations to be made to the entire
population
• To accomplish this goal, the sample must
entail an element of chance (A random
sample must be used)
• The most fundamental type of random sample
is a Simple Random Sample
• Population
– Entire aggregation of cases that meets a specific
set of criteria

• Sample
– Subset of entities that make up the population

• Unit/ Element
– Most basic unit about which information is
collected

• Sampling Frame
– Listing of accessible population from which you’ll
draw your sample

Sampling
• Sampling
– Selection of a number of study units from a
defined study population

• Representative
– Includes all the characteristics of the population
from which it is drawn

Non-Probability Samples
• Convenience Sampling
– The use of the most conveniently available people (i.e., those relevant to the study) as study participants
• Example
– Distributing a questionnaire to the first 100 asthmatic patients attending the outpatient clinic
• Example
– Distributing a questionnaire to 200 students leaving the hospital library
• Quota Sampling
– Identifying strata of the population.
– Specifying the proportions of elements needed
from various strata of population

– Example
• Distributing a questionnaire to 200 students leaving the hospital library, BUT giving 150 to male students and 50 to female students
Probability Samples
• Each unit of the sample is chosen by chance
• All units have an equal or at least a known chance of being included in the sample
Simple Random Sample
• To select a simple random sample
– Prepare sampling frame
– Draw a sample of desired size using
• Table of random numbers
• Computer
• Mechanical device
Stratified Random Sample
• Population is divided into homogeneous strata
• Aims at achieving a more representative sample
– Urban vs. rural areas
– Age groups
To select a stratified random sample

• Divide population into strata (subgroups)
• Prepare sampling frame for each stratum
• Draw a sample of desired size from each stratum
Example

• Suppose you are drawing a sample of 300 CoS students and you wish to have proportional representation from every major. If only 10% of the students were majoring in DAB (Biotechnology), simple random or systematic sampling could result in a sample with very few (or even no) DAB students.
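A stratified sample with proportional allocation guarantees that DAB students make up 10% of the sample. The sketch below assumes a hypothetical sampling frame of 3000 students split into two strata ("DAB" and "Other majors"); only the sample size of 300 and the 10% share come from the example.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical sampling frame: 3000 students, 10% of them majoring in DAB.
frame = {
    "DAB": [f"DAB-{i}" for i in range(300)],
    "Other majors": [f"OTH-{i}" for i in range(2700)],
}
total = sum(len(students) for students in frame.values())
sample_size = 300

sample = {}
for major, students in frame.items():
    # Proportional allocation: each stratum contributes its population share.
    n_stratum = round(sample_size * len(students) / total)
    # Simple random sample within the stratum.
    sample[major] = random.sample(students, n_stratum)

for major, chosen in sample.items():
    print(major, len(chosen))
```

With these assumed numbers the sample contains 30 DAB students and 270 students from other majors, whatever the random draw.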
Systematic Random Sample

• Individuals are chosen at regular intervals from the sampling frame.
• Calculate the sampling interval:
• sampling interval (K) = study population (N) / sample size (n)
• Randomly select an integer between 1 and K as the starting point
• Then take every K-th unit by repeatedly adding the sampling interval
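The steps of systematic sampling can be sketched as follows. The function name `systematic_sample` and the frame of 1000 numbered units are hypothetical, and the sketch assumes N is at least n, rounding K down when N is not an exact multiple of n.

```python
import random


def systematic_sample(frame, n):
    """Pick every k-th unit after a random start, where k = N // n."""
    k = len(frame) // n              # sampling interval K = N / n (rounded down)
    start = random.randint(1, k)     # random integer between 1 and k inclusive
    positions = range(start, start + k * n, k)  # 1-based positions in the frame
    return [frame[p - 1] for p in positions]


frame = list(range(1, 1001))            # hypothetical frame of N = 1000 units
chosen = systematic_sample(frame, 50)   # n = 50, so K = 20
print(chosen[:5])
```

Only the starting point is random; every later unit is fixed by the interval, which is why a frame with a periodic pattern can bias a systematic sample.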

Cluster Sample
• Selection of groups of study units (clusters)
instead of selection of units individually

– Divide population into clusters (e.g. geographically)
– Randomly sample clusters
– Measure all units within sampled clusters
Completely Randomized Design
The treatments are allocated entirely by
chance to the experimental units.

Example:
Which of two varieties of tomatoes (A & B) yields a greater quantity of market quality fruit?
Factors that may affect yield:
• different soil fertility levels
• exposure to wind/sun
• soil pH levels
• soil water content etc.
Divide the field into plots and randomly allocate the tomato varieties (treatments) to each plot (unit).
8 plots – 4 get variety A
UPHILL
1 (A)   2 (A)   3 (B)   4 (A)
5 (B)   6 (A)   7 (B)   8 (B)
What if the field sloped upward from left to right?
Randomly assign A & B varieties in each strip of similar elevation.
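A completely randomized allocation of the 8 plots can be sketched in a few lines. The variable names are illustrative, and the result differs on each run because no seed is fixed.

```python
import random

plots = list(range(1, 9))                    # the 8 plots in the field
plots_for_a = set(random.sample(plots, 4))   # 4 plots chosen by chance get variety A

# Every remaining plot gets variety B.
allocation = {p: ("A" if p in plots_for_a else "B") for p in plots}
print(allocation)
```

This design ignores position in the field, so by chance most A plots could end up on one side of a slope; randomizing separately within strips of similar elevation avoids that.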

Note:
Randomization is an attempt to make the treatment groups as similar as possible; we can only expect to achieve this when there is a large number of experimental units to choose from.
Data Collection
1- Identify your study question
2- Define your variables
3- Define your study design
4- Calculate your sample size
5- Define your inclusion and exclusion criteria
6- Design your DATA collection sheet
7- Define the instruments you are going to use
8- Go and collect your data
9- Enter your data
Define your question
What do you want to know about?
What do you want to count?
E.g.:
• What is the number of …..?
• Is there a relationship between …..?
• Does this (variable) ….. affect that (variable)?
Your question will lead you to your variable(s)
Data Collection Sheet

A- Personal characteristics
– Age (continuous, categorical)
– Sex
– Residence (by district, by site: urban vs. rural)
– Income (continuous, categorical)
– No. of children
B- Study characteristics
– No. of patient days
– No. of bacterial growth
– Satisfaction with food
– Percent of CO in classrooms
– Etc.
Prepare Your Coding Sheet
• Code your variables (especially categorical)

• Prepare your coding sheet


Go and collect your data
Enter your DATA
1- Identify your study question
What is the frequency of overweight and obesity among students of the university?

2- Define your variables
How are overweight and obesity diagnosed?
Calculate the BMI (body mass index)
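As a sketch of how the variable could be derived from the data collection sheet: BMI is weight in kilograms divided by the square of height in metres. The category cut-offs below are the commonly cited WHO adult values and are an assumption here, as the slides do not state them; the function names and the example student are hypothetical.

```python
def bmi(weight_kg, height_m):
    """Body mass index = weight (kg) divided by height (m) squared."""
    return weight_kg / height_m ** 2


def bmi_category(value):
    """Ordinal category; cut-offs are the commonly cited WHO adult values (assumed)."""
    if value < 18.5:
        return "Underweight"
    if value < 25:
        return "Normal"
    if value < 30:
        return "Overweight"
    return "Obese"


value = bmi(80, 1.75)   # hypothetical student: 80 kg, 1.75 m
print(round(value, 1), bmi_category(value))
```

Weight and height are the quantitative variables actually collected; BMI and its category are derived from them afterwards.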
Data Organization & Summarization

Numerical Presentation

Objective:
At the end of this session participants should be able to:
• Recognize the advantages and limitations of ordered array
• Explain the method of construction of an ordered array
• Explain the method of construction of a frequency distribution, a
cumulative frequency distribution and cross tabulation
• Tabulate a given set of data and Comment on the results
• Compute a percentage distribution and a cumulative percentage
distribution
Results are first presented as a mass of unordered data (raw data).

Ordered array: a listing of the values of a collection in order of magnitude from the smallest value to the largest value or the reverse.
It can be used with:
• Quantitative data.
• Qualitative ordinal data.
Advantages:
Enables one to determine quickly the value of the smallest measurement and the largest measurement.
[Raw data]
18 21 19 22 24 63 51 30 42 35

63 40 32 24 29 36 48 19 23 39

[Ordered array]
18 19 19 21 22 23 24 24 29 30

32 35 36 39 40 42 48 51 63 63
• Age of youngest subject = 18
• Age of eldest subject = 63
• About ½ of the subjects are below the age of 30
• On a computer, an ordered array is obtained by sorting
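On a computer, the ordered array for the ages above is produced by sorting. A minimal sketch, with illustrative variable names:

```python
ages = [18, 21, 19, 22, 24, 63, 51, 30, 42, 35,
        63, 40, 32, 24, 29, 36, 48, 19, 23, 39]

ordered = sorted(ages)   # ascending ordered array
print(ordered)
print("youngest:", ordered[0], "eldest:", ordered[-1])
print("below 30:", sum(a < 30 for a in ages), "of", len(ages))
```

Nine of the 20 subjects are below 30, which is the "about ½" read off the ordered array.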
[Raw data]
R&R University Primary Illiterate R&R
Secondary Prep. Secondary Illiterate Primary
Prep. Illiterate Primary Prep. Illiterate

[Ordered array]

Illiterate Illiterate Illiterate Illiterate R&R


R&R Primary Primary Primary Prep.
Prep. Prep. Secondary Secondary University
[Blood groups of 20 adults]

A AB O O B
AB B A A B
AB AB B B A
O AB B A AB

Blood group Tally Freq.


A //// 5
B //// / 6
AB //// / 6
O /// 3
Total 20
Distribution of a sample of adults by blood group

Blood group Freq. %


A 5 25
B 6 30
AB 6 30
O 3 15
Total 20 100
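The tally and the percentage distribution for the 20 blood groups can be reproduced with a counter. This is a sketch with illustrative variable names, using the data listed above.

```python
from collections import Counter

blood_groups = ("A AB O O B AB B A A B "
                "AB AB B B A O AB B A AB").split()

freq = Counter(blood_groups)   # tally: one count per category
n = len(blood_groups)
for group in ["A", "B", "AB", "O"]:
    print(f"{group:<3}{freq[group]:>3}{100 * freq[group] / n:>6.0f}%")
print("Total", n)
```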
Grouped data
To group = to select a set of contiguous, non-overlapping intervals such that each value in
the set of observations can be placed in one and only one of the intervals (referred to as
class intervals)

1- How many intervals to include?


• Too few …. Undesirable ….. Loss of information.

• Too many ….. the objective of summarization will not be met.


Optimum 4-12 or 6-15 (Daniel)

As a guide (not final) we can use Sturges' rule to estimate the number of intervals (K):

K = 1 + 3.322 (log10 n)

Where:
• n is the number of individuals.
• The estimate based on this formula can be increased or decreased for convenience and clear presentation.
Example: If n = 275 then K = 1 + 3.322 × 2.4393 ≈ 9
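Sturges' rule is easy to check numerically. The function name `sturges_k` is illustrative, and rounding to the nearest integer is an assumption, since the rule is only a guide.

```python
import math


def sturges_k(n):
    """Sturges' rule: K = 1 + 3.322 * log10(n), rounded to the nearest integer."""
    return round(1 + 3.322 * math.log10(n))


print(sturges_k(275))   # the slide's example with n = 275
```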
2- Width of the class interval (W)
Intervals should be of the same width, although this is sometimes impossible.

W = Range (R) / K

Where:
• R = largest observation - smallest observation in the data set

Rule: W = 5-10 units and multiples of 10

When these widths are employed it is generally good practice to have the lower limit of each interval end in zero or 5.
Rule: LL of first interval ≤ smallest observation.
UL of last interval ≥ largest observation.
Computer programs need user input regarding interval widths & number of intervals desired.

Weight (Kg) of 20 subjects

17 22 13 25 16 19 14 18 26 14.9
23 22 19.7 12 17 24 26 13 18 20

• Smallest = 12
• Largest = 26
• Taking LL of first interval = 10 and UL of last interval = 30, R = 30 - 10 = 20
• If width = 5 then no. of intervals = 20/5 = 4
Weight (Kg) Tally Frequency
10- //// 5
15- //// // 7
20- //// 5
25-30 /// 3
Total 20
Distribution of a sample of subjects by weight

Weight (Kg)   Frequency   CF   %     Cum. %
10-           5           5    25    25
15-           7           12   35    60
20-           5           17   25    85
25-30         3           20   15    100
Total         20               100
The cumulative frequency (CF) or cumulative % makes it easy to obtain the frequency or % of values within two or more contiguous class intervals.
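The grouped frequency table for the 20 weights can be rebuilt in code. In this sketch each interval includes its lower limit and excludes its upper limit, except the last one, which is closed at 30 as written in the table; the variable names are illustrative.

```python
weights = [17, 22, 13, 25, 16, 19, 14, 18, 26, 14.9,
           23, 22, 19.7, 12, 17, 24, 26, 13, 18, 20]

lower_limits = [10, 15, 20, 25]   # class intervals 10-, 15-, 20-, 25-30
width = 5

freq = [sum(lo <= w < lo + width for w in weights) for lo in lower_limits]
# The last interval is closed at 30, so count a value of exactly 30 if present.
freq[-1] += sum(w == lower_limits[-1] + width for w in weights)

n = len(weights)
cum = 0
rows = []
for lo, f in zip(lower_limits, freq):
    cum += f   # cumulative frequency up to and including this interval
    rows.append((lo, f, cum, 100 * f / n, 100 * cum / n))
    print(f"{lo}-  freq={f}  CF={cum}  %={100 * f / n:.0f}  cum%={100 * cum / n:.0f}")
```

For instance, the cumulative % of 60 in the second row says that 60% of subjects weigh less than 20 Kg.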
3- Methods of writing class intervals to avoid overlap

A                    B         C       D
15 to less than 20   15-19.9   15-19   15-
20 to less than 25   20-24.9   20-24   20-
25 to less than 30   25-29.9   25-29   25-
30 to less than 35   30-34.9   30-34   30-35

A: clearest, but takes a lot of space; for continuous & discrete data
B: for quantitative continuous data
C: for quantitative discrete data
D: for continuous & discrete data; best


• A concise but informative title should be given, which should include the date, place or whatever else is common to all the entries in the table.
• A heading should be provided for the rows, which gives a brief description of the variable which changes in value from row to row.
• A heading should be provided for the columns.
• The units of measurement should be given for all entries in the table.
• Notes should be used to give, where appropriate, the source of data and definitions of terms.
• A closed-ended table is preferred.

Closed-ended   Open one side   Open two sides
10-            < 15            < 15
15-            15-             15-
20-            20-             20-
25-29          25-29           25 + (25-) (25 and over)
No. of notified cholera cases per week over 52 epidemiologic weeks

1 2 6 7 3 5 5 2 2
6 2 5 1 3 1 8 1 1
4 1 1 4 4 4 6 1 2
2 1 0 3 3 4 3 1 4
2 3 3 7 4 2 6 1
1 8 4 3 3 5 2 1

No. of cases Tally Freq.


0- //// //// //// 14
2- //// //// //// /// 18
4- //// //// // 12
6-8 //// /// 8
Total 52
Distribution of epidemiologic weeks by number of cholera
cases

No. of cases Weeks %


0-1 14 26.9
2-3 18 34.6
4-5 12 23.1
6-8 8 15.4
Total 52 100.0
Contingency tables

Length of stay by gender for a sample of inpatients

           Length of stay (days)
Gender     < 10   10-19   20+   Total
Male       3      1       1     5
Female     39     10      2     51
Total      42     11      3     56

             TB
Smoking      Yes   No    Total
Smoker       10    50    60
Non smoker   5     75    80
Total        15    125   140
Graphical Presentation

Why?
• Attract the reader's attention
The human brain is more tolerant of visual presentations than it is of numerical ones, and can assimilate information more rapidly and retain it far longer when pictures are used.
• To compare two or more numbers
• To express the distribution of individual objects or measurements into different categories.
• To express the change in some quantity over a period of time.
• To express the relationship between two measurements, in a situation where they occur in
Summarizing Qualitative
Data
■Frequency Distribution (shows how many)
■Relative Frequency Distribution (shows
what fraction)
■Percent Frequency Distribution (shows
what percentage)

■Bar Graph
■Pie Chart
■Both these are graphical means for
Frequency Distribution

A frequency distribution is a tabular summary of data showing the frequency (or number) of items in each of several nonoverlapping classes.

The objective is to provide insights about the data that cannot be quickly obtained by looking only at the original data.
Example: Rotana Hotel
Guests staying at Rotana Hotel were
asked to rate the quality of their
accommodations as being excellent,
above average, average, below average, or
poor. The ratings provided by a sample of 20 guests are:
Frequency Distribution

Rating Frequency

Poor 2
Below Average 3
Average 5
Above Average 9
Excellent 1
Total 20
Relative Frequency Distribution

The relative frequency of a class is the fraction or proportion of the total number of data items belonging to the class.

A relative frequency distribution is a tabular summary of a set of data showing the relative frequency for each class.
Percent Frequency Distribution
The percent frequency of a class is the relative frequency multiplied by 100.

A percent frequency distribution is a tabular summary of a set of data showing the percent frequency for each class.
Relative Frequency and
Percent Frequency Distributions

Rating          Relative Frequency   Percent Frequency
Poor            .10                  10
Below Average   .15                  15
Average         .25                  25
Above Average   .45                  45
Excellent       .05                  5
Total           1.00                 100

For example, for Excellent: 1/20 = .05; for Poor: .10(100) = 10
Bar Graph

• A bar graph is a graphical device for depicting qualitative or quantitative discrete variables.
• On one axis (usually the horizontal axis), we specify the labels used for the classes.
• A frequency, relative frequency, or percent frequency scale can be used for the other axis (usually the vertical axis).
• Using a bar of fixed width drawn above each class label, we extend the height appropriately.
• The bars are separated to emphasize the fact that each class is a separate category.
• The length of the bars should be proportional to the frequency of the event.
Bar Graph

[Figure: bar graph "Rotana Hotel Quality Ratings". Horizontal axis: Rating (Poor, Below Average, Average, Above Average, Excellent); vertical axis: Frequency (scale 1 to 10).]
Pie Chart
■ The pie chart is a commonly used graphical device
for presenting relative frequency distributions for
qualitative data.
■ First draw a circle; then use the relative
frequencies to subdivide the circle
into sectors that correspond to the
relative frequency for each class.
■ Since there are 360 degrees in a circle,
a class with a relative frequency of .25 would
consume .25(360) = 90 degrees of the circle.
Pie Chart

[Figure: pie chart "Rotana Hotel Quality Ratings". Sectors: Poor 10%, Below Average 15%, Average 25%, Above Average 45%, Excellent 5%.]
Example: Rotana Hotel

■ Insights Gained from the Preceding Pie Chart


• One-half of the customers surveyed gave Rotana
a quality rating of “above average” or “excellent”
(looking at the left side of the pie). This might
please the manager.
• For each customer who gave an “excellent” rating,
there were two customers who gave a “poor”
rating (looking at the top of the pie). This should
displease the manager.
Summarizing Quantitative
Data
• Frequency Distribution
• Relative Frequency and Percent
Frequency Distributions
• Dot Plot
• Histogram
• Cumulative Distributions
• Ogive
Example: Students’ weights

■ Sample of weight for 50 students

91 78 93 57 75 52 99 80 97 62
71 69 72 89 66 75 79 75 72 76
104 74 62 68 97 105 77 65 80 109
85 97 88 68 83 68 71 69 67 74
62 82 98 101 79 105 79 69 62 73

Including a line in the table for every possible


weight is not a good idea.
Need to categorize.
Frequency Distribution
• Guidelines for Selecting Number of Classes
• Use between 5 and 20 classes.
• Data sets with a larger number of elements
usually require a larger number of classes.
• Smaller data sets usually require fewer classes
Frequency Distribution
• Guidelines for Selecting Width of Classes
• Use classes of equal width.
• Approximate Class Width = (Largest Data Value - Smallest Data Value) / Number of Classes
Frequency Distribution
For the students' weights, if we choose six classes:
Approximate Class Width = (109 - 52)/6 = 9.5 ≈ 10

Students weight(Kg) Frequency


50-59 2
60-69 13
70-79 16
80-89 7
90-99 7
100-109 5
Total 50
Relative Frequency and
Percent Frequency Distributions

Students wt (Kg)   Relative Frequency   Percent Frequency
50-59              .04                  4
60-69              .26                  26
70-79              .32                  32
80-89              .14                  14
90-99              .14                  14
100-109            .10                  10
Total              1.00                 100

For example, for the 50-59 class: 2/50 = .04 and .04(100) = 4
Relative Frequency and
Percent Frequency Distributions
■ Insights Gained from the Percent Frequency
Distribution
• Only 4% of the students' weights are in the 50-59 Kg class.
• 30% of the students' weights are under 70 Kg.
• The greatest percentage (32% or almost one-third)
of the students' weights are in the 70-79 Kg class.
• 10% of the students' weights are 100 Kg or more.
Dot Plot
• One of the simplest graphical
summaries of data is a dot plot.
• A horizontal axis shows the range of
data values.
• Then each data value is represented by
a dot placed above the axis.
[Figure: dot plot of students' weight. Horizontal axis: Weight (Kg), from 50 to 110; each dot above the axis represents one data value.]

Not used much anymore. Common when graphical drawing tools were primitive.
Histogram
• Another common graphical presentation of quantitative data is a histogram.
• The variable of interest is placed on the horizontal axis.
• A rectangle is drawn above each class interval with its height corresponding to the interval's frequency, relative frequency, or percent frequency.
• Unlike a bar graph, a histogram has no natural separation between rectangles of adjacent classes.

It is a bar graph without spaces between the bars because the data are in continuous intervals
Histogram
[Figure: histogram of students' weight (Kg). Horizontal axis: classes 50-59, 60-69, 70-79, 80-89, 90-99, 100-109; vertical axis: Frequency (scale 2 to 18).]


Histogram (Common categories)
■ Symmetric
– Left tail is the mirror image of the right tail
– Examples: heights and weights of people
[Figure: symmetric histogram. Vertical axis: Relative Frequency (0 to .35).]
Example

Distribution of a group of cholera patients by age

Age (years) Frequency %


25- 3 14.3
30- 5 23.8
35- 7 33.3
45- 4 19.0
60-65 2 9.5
Total 21 100
[Figure: histogram "Distribution of a group of cholera patients by age". Horizontal axis: Age (years), 25 to 65; vertical axis: % (0 to 35).]
Histogram
■ Moderately Skewed Left
– A longer tail to the left
– Example: exam scores
[Figure: histogram skewed to the left. Vertical axis: Relative Frequency (0 to .35).]
Histogram
■ Moderately Right Skewed
– A longer tail to the right
– Example: housing values
[Figure: histogram moderately skewed to the right. Vertical axis: Relative Frequency (0 to .35).]
Histogram
■ Highly Skewed Right
– A very long tail to the right
– Example: executive salaries
[Figure: histogram highly skewed to the right. Vertical axis: Relative Frequency (0 to .35).]
Cumulative Distributions

Cumulative frequency distribution - shows the number of items with values less than or equal to the upper limit of each class.

Cumulative relative frequency distribution - shows the proportion of items with values less than or equal to the upper limit of each class.

Cumulative percent frequency distribution - shows the percentage of items with values less than or equal to the upper limit of each class.
Cumulative Distributions
■ Students' weight
Weight (Kg)   Cumulative Frequency   Cumulative Relative Frequency   Cumulative Percent Frequency
≤ 59          2                      .04                             4
≤ 69          15                     .30                             30
≤ 79          31                     .62                             62
≤ 89          38                     .76                             76
≤ 99          45                     .90                             90
≤ 109         50                     1.00                            100

For example, for the ≤ 69 row: 2 + 13 = 15, 15/50 = .30, and .30(100) = 30
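The three cumulative columns follow from the class frequencies by running sums. A minimal sketch, with illustrative variable names and the frequencies taken from the students' weight frequency distribution:

```python
from itertools import accumulate

upper_limits = [59, 69, 79, 89, 99, 109]
freq = [2, 13, 16, 7, 7, 5]   # class frequencies for 50-59, 60-69, ..., 100-109
n = sum(freq)

cum_freq = list(accumulate(freq))           # running total of frequencies
cum_rel = [c / n for c in cum_freq]         # cumulative relative frequency
cum_pct = [100 * c / n for c in cum_freq]   # cumulative percent frequency

for ul, c, r, p in zip(upper_limits, cum_freq, cum_rel, cum_pct):
    print(f"<= {ul}: {c:>2}  {r:.2f}  {p:.0f}")
```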
Ogive

■ An ogive is a graph of a cumulative distribution.


■ The data values are shown on the horizontal axis.
■ Shown on the vertical axis are the:
• cumulative frequencies, or
• cumulative relative frequencies, or
• cumulative percent frequencies
■ The frequency (one of the above) of each class is
plotted as a point.
■ The plotted points are connected by straight lines.
Ogive
■ Students weight
• Because the class limits for the weight data are
50-59, 60-69, and so on, there appear to be one-
unit gaps from 59 to 60, 69 to 70, and so on.
• These gaps are eliminated by plotting points
halfway between the class limits.
• Thus, 59.5 is used for the 50-59 class, 69.5 is
used for the 60-69 class, and so on.
Ogive with
Cumulative Percent Frequencies
[Figure: ogive of students' weight. Horizontal axis: Weight (Kg), 50 to 110; vertical axis: Cumulative Percent Frequency (20 to 100); the point (89.5, 76) is marked.]
Summary Statistics

Introduction

• Statistical methods can be used to summarize data.
• Measures of average are also called measures of
central tendency and include the mean, median,
mode, and midrange.
• Measures that determine the spread of data
values are called measures of variation or
measures of dispersion and include the range,
variance, and standard deviation.
(1) Measures of Central Tendency
• Mid-range
• Mode
• Median
• Arithmetic mean
1- Mid-range
A- Ungrouped data

Example: Body weight (kg): 28, 30, 22, 18, 29

$$\text{Mid-range} = \frac{\text{Smallest obs.} + \text{Largest obs.}}{2}$$

$$\text{Mid-range} = \frac{18 + 30}{2} = 24\ \text{Kg}$$
B- Grouped data
$$\text{Mid-range} = \frac{\text{LL of first interval} + \text{UL of last interval}}{2}$$

Example:
Body weight (kg)   Frequency
25-                5
30-                2
35-                14
40-                9
60-75              4
Total              34

$$\text{Mid-range} = \frac{25 + 75}{2} = 50\ \text{Kg}$$
Advantages
• Easy
• Quick

Disadvantages
• Used only with quantitative variables
• Affected by extreme (outlying) observations
• Neglects all intermediate observations
• Rough measure
2- Mode
The observation or observations of highest frequency
A) Ungrouped data Examples: Weight (kg)

69, 67, 70, 73, 69, 71


Mode = 69 Kg

12, 14, 16, 18, 16, 14


Mode = 14, 16 Kg

18 11 16 14 19 15 13 12
No Mode

16 12 16 14 18 16 14 12
Mode = 16 Kg
The mode is the most frequently occurring value in a set of discrete data.
There can be more than one mode if two or more values are equally
common.

Example
Suppose the results of an end of term Statistics exam were distributed as
follows:

Student Score
1 94
2 81
3 56
4 90
5 70
6 65
7 90
8 90
9 30
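For the exam scores above, the mode can be found with Python's standard library. `statistics.multimode` (Python 3.8 or later) returns every value tied for the highest frequency, which also covers the bimodal case; the variable names are illustrative.

```python
from statistics import multimode

scores = [94, 81, 56, 90, 70, 65, 90, 90, 30]   # exam scores from the table
print(multimode(scores))

print(multimode([12, 14, 16, 18, 16, 14]))      # bimodal weights example
```

One caveat: when every value occurs exactly once, `multimode` returns all the values, whereas the convention used in these slides is to report "No Mode".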
B) Grouped data

Examples:
Weight (kg)   Frequency
25-           14
30-           2
35-           14
40-           9
60-75         4
Total         43

Interval of 1st mode = 25-30 Kg
Interval of 2nd mode = 35-40 Kg

$$\text{1st mode} = \frac{25 + 30}{2} = 27.5\ \text{Kg}$$

$$\text{2nd mode} = \frac{35 + 40}{2} = 37.5\ \text{Kg}$$
Bld. Gr.   Frequency
A          10
B          14
AB         25
O          9
Total      58
Mode is AB
Advantages
• Easy
• Used with all types of variables
• Not affected by extreme observations
Disadvantages
• Neglects the less frequent observations
• Sometimes there is no mode
• The distribution may be bi-modal or multi-modal
The median
The median is the middle observation in a set
• 50% of the data have a value less than the median, and 50% of the
data have a value greater than the median.
• The median is the value halfway through the ordered data set, below
and above which there lies an equal number of data values.
Calculation of the median from raw data
Let n = the number of observations

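If n is odd, the median $\tilde{x}$ is the $\left(\frac{n+1}{2}\right)$th observation in the ordered data.

If n is even, the median is the mean of the $\left(\frac{n}{2}\right)$th observation and the $\left(\frac{n}{2} + 1\right)$th observation in the ordered data.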
A) Ungrouped data
Odd number of observations:
• Arrange observations in ascending order
• Rank of median = (n + 1)/2
Example:
• Raw data: 24, 18, 22, 20, 16 kg
• Arranged data: 16, 18, 20, 22, 24 kg
• Rank = (5+1)/2 = 3
• Median = value of 3rd observation = 20 kg
Even number of observations:

• Arrange observations in ascending order
• Rank of two middle observations = (n/2), (n/2)+1
Example:
• Raw data: 26, 24, 18, 22, 20, 16 kg
• Arranged data: 16, 18, 20, 22, 24, 26 kg
• Rank = (6/2), (6/2)+1 = 3, 4
• Median = Average of 2 middle observations = (20+22) / 2 = 21 kg
Example
With an odd number of data values, for example 21, we have:

Data:         96 48 27 72 39 70 7 68 99 36 95 4 6 13 34 74 65 42 28 54 69
Ordered data: 4 6 7 13 27 28 34 36 39 42 48 54 65 68 69 70 72 74 95 96 99
Median:       48, leaving ten values below and ten values above

With an even number of data values, for example 20, we have:

Data:         57 55 85 24 33 49 94 2 8 51 71 30 91 6 47 50 65 43 41 7
Ordered data: 2 6 7 8 24 30 33 41 43 47 49 50 51 55 57 65 71 85 91 94
Median:       halfway between the two 'middle' data points, in this case
              halfway between 47 and 49, and so the median is 48
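The rank rule for odd and even n can be sketched in Python as below. The helper name `median_by_rank` is an assumption made for illustration; the standard library's `statistics.median` is used only as a cross-check.

```python
from statistics import median

def median_by_rank(values):
    """Median using the rank rule: (n + 1)/2 for odd n, n/2 and n/2 + 1 for even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Rank (n + 1)/2, converted to a 0-based index.
        return ordered[(n + 1) // 2 - 1]
    # Mean of the observations ranked n/2 and n/2 + 1.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

# The two data sets from the example above.
odd_data = [96, 48, 27, 72, 39, 70, 7, 68, 99, 36, 95,
            4, 6, 13, 34, 74, 65, 42, 28, 54, 69]
even_data = [57, 55, 85, 24, 33, 49, 94, 2, 8, 51,
             71, 30, 91, 6, 47, 50, 65, 43, 41, 7]

print(median_by_rank(odd_data), median(odd_data))
print(median_by_rank(even_data), median(even_data))
```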
Calculate the median age, weight and height of the group
Calculation of the median from a grouped frequency distribution

How can we estimate the median for the following?

Height (cm)    1-5   6-10   11-15   16-20
Frequency, f   2     3      4       2       n = 11
Calculation of the median from a grouped frequency distribution

To estimate the median of a grouped frequency distribution, we
• locate the class containing the median, using the total
frequency divided by 2 (n/2),
• use

$\text{median} \approx \text{l.c.b.} + \dfrac{\frac{n}{2} - F}{f} \times w$

where
l.c.b. is the Lower Category Boundary of the class containing the median,
F is the cumulative frequency up to, but not including, the class containing the
median
(think of n/2 − F as the distance along the class to the median),
f is the frequency of the class containing the median,
w is the width of the class containing the median.
• or use reasoning, which saves the need to remember the formula.
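The grouped-median formula can be sketched as a small Python function. The function name `grouped_median` and its argument names are assumptions for illustration. The caller supplies the lower boundary and width of each class, because the slides choose boundaries differently for counts and for measurements.

```python
def grouped_median(lower_boundaries, widths, freqs):
    """Estimate the median as l.c.b. + ((n/2 - F) / f) * w."""
    n = sum(freqs)
    if n <= 0:
        raise ValueError("total frequency must be positive")
    half = n / 2
    cumulative = 0  # F: cumulative frequency before the current class
    for lcb, w, f in zip(lower_boundaries, widths, freqs):
        if cumulative + f >= half:
            # This class contains the median.
            return lcb + (half - cumulative) / f * w
        cumulative += f
    raise ValueError("inputs must have the same length")

# Patient counts: classes 10-12, 13-15, 16-18 with boundaries 10, 13, 16.
patients = grouped_median([10, 13, 16], [3, 3, 3], [20, 24, 16])

# Heights: classes 1-5, 6-10, 11-15, 16-20 with boundaries 0.5, 5.5, 10.5, 15.5.
heights = grouped_median([0.5, 5.5, 10.5, 15.5], [5, 5, 5, 5], [2, 3, 4, 2])

print(patients, heights)
```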
Estimate the median for the following:

2. Patients number   10-12   13-15   16-18
   Frequency, f      20      24      16

Solution:
n = 60, so n/2 = 30
The median is in the 2nd class.

$\text{median} \approx \text{l.c.b.} + \dfrac{\frac{n}{2} - F}{f} \times w$

Distance along class: n/2 − F = 30 − 20 = 10
Class width = 16 − 13 = 3
(As the data give the number of patients, the boundaries are 13 and 16, not 12.5 and 15.5.)

$\text{median} \approx 13 + \dfrac{10}{24} \times 3 \approx 14.3$
Grouped Data
Estimate the median for the following:

Height (cm)    1-5   6-10   11-15   16-20
Frequency, f   2     3      4       2       n = 11

Solution:
n/2 = 5.5
The median is in the 3rd class.

$\text{median} \approx \text{l.c.b.} + \dfrac{\frac{n}{2} - F}{f} \times w$

Distance along class: n/2 − F = 5.5 − 5 = 0.5
Class width = 15.5 − 10.5 = 5

$\text{median} \approx 10.5 + \dfrac{0.5}{4} \times 5 \approx 11.1$
Grouped Data
Estimate the median for the following:

Height (cm)    1-5   6-10   11-15   16-20
Frequency, f   7     10     8       5       n = 30

n/2 = 15
The median is in the 2nd class.

$\text{median} \approx \text{l.c.b.} + \dfrac{\frac{n}{2} - F}{f} \times w$

where F is the cumulative frequency before the class containing the median (here F = 7).

Distance along class: n/2 − F = 15 − 7 = 8
Frequency of class, f = 10
Width of class, w = 10.5 − 5.5 = 5

$\text{median} \approx 5.5 + \dfrac{8}{10} \times 5 = 9.5$
Estimate the median for the following:

1. Length (cm)    21-30   31-35   36-40   41-50
   Frequency, f   5       7       10      6

n = 28, so n/2 = 14
The median is in the 3rd class.

$\text{median} \approx \text{l.c.b.} + \dfrac{\frac{n}{2} - F}{f} \times w$

where F is the cumulative frequency before the class containing the median (here F = 12).

Distance along class: n/2 − F = 14 − 12 = 2
Class width = 40.5 − 35.5 = 5

$\text{median} \approx 35.5 + \dfrac{2}{10} \times 5 = 36.5$
Advantages
• Used with quantitative variables and qualitative ordinal variables
• Not affected by extreme (outlying) observations
• Suitable for biological values

Disadvantage
• Does not take all observations into consideration
4- The arithmetic mean
A) Ungrouped data

$\text{Mean} = \dfrac{\text{Sum of all observations } (\sum X)}{\text{Number of observations } (n)}$

Example:
• 24 – 20 – 22 – 16 – 18 kg
• X1 – X2 – X3 – X4 – X5

$\text{Mean} = \dfrac{24 + 20 + 22 + 16 + 18}{5} = \dfrac{100}{5} = 20$ kg
Notation…
When referring to the number of observations in a
population, we use the uppercase letter N.

When referring to the number of observations in a
sample, we use the lowercase letter n.

The arithmetic mean for a population is denoted with the
Greek letter “mu”: μ

The arithmetic mean for a sample is denoted with an
“x-bar”: x̄
Statistics is a pattern language…

         Population   Sample
Size     N            n
Mean     μ            x̄
Arithmetic Mean…

Sample mean: $\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$
Population mean: $\mu = \dfrac{\sum_{i=1}^{N} x_i}{N}$
Exercise

Body Mass Index: 24.4 30.4 21.4 25.1 21.3 23.8 20.8 22.9
20.9 23.2 21.1 23.0 20.6 26.0

Compute the mean, median, mid range, and mode

Weight (kg)   Frequency fj   Mid-point of interval Xj   fjXj
15-           3              20                         60
25-           6              30                         180
35-           8              40                         320
45-           2              50                         100
55-65         1              60                         60
Total         Σfj = 20                                  ΣfjXj = 720

$\bar{X} = \dfrac{\sum f_j X_j}{\sum f_j} = \dfrac{720}{20} = 36$ kg
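The grouped mean in the table can be reproduced with a short Python sketch. The names `intervals`, `freqs` and `grouped_mean` are illustrative assumptions. Each class is represented by its mid-point, as in the table.

```python
# Class intervals (lower, upper) in kg and their frequencies, from the table.
intervals = [(15, 25), (25, 35), (35, 45), (45, 55), (55, 65)]
freqs = [3, 6, 8, 2, 1]

# Mid-point Xj of each interval.
midpoints = [(lower + upper) / 2 for lower, upper in intervals]

# Sum of fj * Xj divided by the sum of fj.
total_fx = sum(f * x for f, x in zip(freqs, midpoints))
grouped_mean = total_fx / sum(freqs)

print(total_fx, grouped_mean)
```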
Advantages

• Takes all observations into consideration.

Disadvantages

• Used with quantitative variables only.

• Affected by extreme observations.

• Cannot be obtained from an open-ended table.
Exercise:

Weight: 83.9 99.0 63.8 71.3 65.3 79.6 70.3 69.2 56.4 66.2

88.7 59.7 64.6 78.8

Height: 185 180 173 168 175 183 184 174 164 169 205 161

177 174

For each variable compute the mean and median

Mean, Median, Mode…
If a distribution is symmetrical,
the mean, median and mode may coincide…
[Figure: symmetrical distribution with the mean, median and mode marked at the same point]

Mean, Median, Mode…
If a distribution is asymmetrical, say skewed
to the left or to the right, the three
measures may differ.
Measures of Variability…
Measures of central location fail to tell the whole
story about the distribution; that is, how much are
the observations spread out around the mean value?

For example, two sets of class grades
are shown. The mean (= 50) is the
same in each case…

But the red class has greater variability
than the blue class.
[Figure: two distributions of class grades, red and blue, with the same mean]
Measures of Relative
Importance

$\text{Proportion} = \dfrac{\text{Number of observations having a given characteristic}}{\text{Total number of observations}}$

Percentage = proportion multiplied by 100

Rate = a measure of the “speed” at which events are
occurring (e.g., the incidence rate of a certain disease is the
speed with which new cases occur in the community;
10 new cases in 10 days = 1 case/day).
Ratio = the fraction a/b for two mutually
exclusive groups (e.g., male/female ratio)

Example: The biostatistics class has 15 male and 35
female students.
• Proportion of males = 15/(15 + 35) = 15/50 = 0.3
• Percentage of males = 0.3 × 100 = 30%
• Ratio of males to females = 15/35 = 3:7
Summary Statistics

Objectives
After this session participants will be able to do the following
Compute and interpret the following measures of dispersion:
• Range
• Standard deviation
• Variance
• Coefficient of variation
Choose and apply the suitable measure of dispersion

• They are also called measures of spread or
variation.

• Definition:

• Dispersion refers to the spread of the values
around the central tendency.

• It indicates how the values disperse or spread
around the average.
1- Range
• It is the simplest measure of dispersion.
• Definition:
• It is the difference between the highest and lowest
values.
• Advantage:
• It is quick and easy to calculate.
• Disadvantages:
• It does not directly use the majority of the
observations.
• It is very sensitive to extreme values.
• Example: The following data represent the
weights of 10 persons:
• 20 – 60 – 53 – 80 – 89 – 56 – 42 – 46 – 88 – 95 kg
• Find the range
• Answer: largest observation = 95
• smallest observation = 20
• The range = 95 − 20 = 75 kg
Age (years) Frequency

15- <25 4
25- <35 8
35- <45 26
45- < 55 8
55- < 65 4
Total 50
Ø Compute the range
Answer :
Upper limit of last interval =65
Lower limit of first interval =15
Range = 65 - 15 = 50 years

2- Standard Deviation
• Definition:
• It is a measure of the spread of data around their mean.
• It is the positive square root of the variance.
• The values of the standard deviation and variance are never
negative (they are zero only when all observations are equal).
• Advantages:
• It is the preferred measure of dispersion.
• It uses all of the measurements in the set.
• Disadvantage:
• It is influenced by a few (or even only one) extreme
values.
Steps of calculation:
1. Determine the sum of observations (ΣX)
2. Find (ΣX)²
3. Find the square of each observation, X²
4. Find the sum of the squared observations (ΣX²)

$S = \sqrt{\dfrac{\sum X^2 - \dfrac{(\sum X)^2}{n}}{n - 1}}$
Example (Ungrouped data)

The following data represent the weights of 5 persons:

14 – 15 – 16 – 17 – 18 kg

• Find the standard deviation.

Answer
ΣX = 14 + 15 + 16 + 17 + 18 = 80
(ΣX)² = (80)² = 6400
ΣX² = 196 + 225 + 256 + 289 + 324 = 1290

$S = \sqrt{\dfrac{1290 - \dfrac{6400}{5}}{5 - 1}} = \sqrt{\dfrac{10}{4}} \approx 1.6$ kg
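The four calculation steps translate directly into Python. This is a minimal sketch with assumed variable names (`sum_x`, `sum_x2`, `s`). The standard library's `statistics.stdev`, which also divides by n − 1, serves as a cross-check.

```python
from math import sqrt
from statistics import stdev

weights = [14, 15, 16, 17, 18]  # kg, from the example above
n = len(weights)

sum_x = sum(weights)                   # step 1: sum of observations
sum_x_squared = sum_x ** 2             # step 2: square of that sum
sum_x2 = sum(x * x for x in weights)   # steps 3 and 4: sum of squared observations

# Shortcut formula for the sample standard deviation.
s = sqrt((sum_x2 - sum_x_squared / n) / (n - 1))

print(sum_x, sum_x_squared, sum_x2)
print(round(s, 2), round(stdev(weights), 2))
```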
Example (grouped data)

Weight (kg)   fj    xj    fj xj    fj xj²
15-           3     20    60       1200
25-           6     30    180      5400
35-           8     40    320      12800
45-           2     50    100      5000
55-65         1     60    60       3600
Total         20          720      28000
                          Σfjxj    Σfjxj²

$S = \sqrt{\dfrac{28000 - \dfrac{(720)^2}{20}}{20 - 1}} = \sqrt{\dfrac{2080}{19}} \approx 10.5$ kg
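For grouped data the same shortcut formula applies, with each class mid-point weighted by its frequency. The sketch below uses assumed names (`sum_fx`, `sum_fx2`, `s_grouped`) and the mid-points and frequencies from the table.

```python
from math import sqrt

midpoints = [20, 30, 40, 50, 60]  # xj, kg
freqs = [3, 6, 8, 2, 1]           # fj

n = sum(freqs)
sum_fx = sum(f * x for f, x in zip(freqs, midpoints))       # sum of fj * xj
sum_fx2 = sum(f * x * x for f, x in zip(freqs, midpoints))  # sum of fj * xj squared

# Grouped-data standard deviation with n - 1 in the denominator.
s_grouped = sqrt((sum_fx2 - sum_fx ** 2 / n) / (n - 1))

print(sum_fx, sum_fx2, round(s_grouped, 2))
```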
3- Variance
Variance and its related measure, standard deviation, are arguably
the most important statistics.
They are used to measure variability, and they also play a vital role in
almost all statistical inference procedures.

Population variance is denoted by σ²
(lowercase Greek letter “sigma” squared)

Sample variance is denoted by s²
(lowercase “s” squared)
Statistics is a pattern language…

           Population   Sample
Size       N            n
Mean       μ            x̄
Variance   σ²           s²
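Variance…[Check your calculator?]

The variance of a population is:

$\sigma^2 = \dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$

where μ is the population mean and N is the population size.

The variance of a sample is:

$s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

where x̄ is the sample mean.

Note! The denominator is the sample size (n) minus one!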
Application…
The following sample consists of the number of
jobs six randomly selected students applied for:
17, 15, 23, 7, 9, 13.
Find its mean and variance.
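Sample Mean & Variance…
Sample Mean

$\bar{x} = \dfrac{17 + 15 + 23 + 7 + 9 + 13}{6} = \dfrac{84}{6} = 14$ jobs

Sample Variance

$s^2 = \dfrac{\sum (x_i - \bar{x})^2}{n - 1} = \dfrac{(17 - 14)^2 + (15 - 14)^2 + \dots + (13 - 14)^2}{6 - 1} = \dfrac{166}{5} = 33.2$

Sample Variance (shortcut method)

$s^2 = \dfrac{\sum x_i^2 - \dfrac{(\sum x_i)^2}{n}}{n - 1} = \dfrac{1342 - \dfrac{84^2}{6}}{5} = \dfrac{166}{5} = 33.2$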
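As a check on the jobs example, the sketch below computes the sample variance both from the definition and by the shortcut method. Variable names are assumptions for illustration. `statistics.variance` uses the same n − 1 denominator.

```python
from statistics import mean, variance

jobs = [17, 15, 23, 7, 9, 13]  # number of jobs each student applied for
n = len(jobs)
x_bar = mean(jobs)

# Definition: squared deviations from the sample mean, divided by n - 1.
s2_definition = sum((x - x_bar) ** 2 for x in jobs) / (n - 1)

# Shortcut: sum of squares minus (sum squared) / n, divided by n - 1.
s2_shortcut = (sum(x * x for x in jobs) - sum(jobs) ** 2 / n) / (n - 1)

print(x_bar, s2_definition, s2_shortcut, variance(jobs))
```

Both routes should agree up to floating-point rounding, which is why the shortcut is convenient on a calculator.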
Standard Deviation…
The standard deviation is simply the square root
of the variance, thus:

Population standard deviation: $\sigma = \sqrt{\sigma^2}$

Sample standard deviation: $s = \sqrt{s^2}$
Statistics is a pattern language…

                     Population   Sample
Size                 N            n
Mean                 μ            x̄
Variance             σ²           s²
Standard Deviation   σ            s
Students work
7. For the following data {7, 2, 9, 7, 5}, calculate the
a. Mean
b. Median
c. Mode
d. Range
e. Variance
f. standard deviation
g. what percentile is the number “9” in the data set?

Students work
• The following are average weights of 30 students
of UoS:
65 67 70 71 68 69 65 68 65 68 69 83 90 45 49 67 68 69 70
71 72 71 71 72 71 72 74 70 65 66
• Using Excel, calculate: mean, SE, median, mode,
SD, range.
• Using Tools > Data Analysis in Excel (you may need to “add in” this tool first),
you can produce all of these statistics.
Empirical Rule – The standard
deviation and the normal distribution
For unimodal, moderately symmetrical sets of
data (i.e., Normally distributed data), approximately:
• 68% of observations lie within 1 standard
deviation of the mean.
• 95% of observations lie within 2 standard
deviations of the mean.
The Empirical Rule
[Figure: normal curve centred on the mean x̄. About 68% of observations lie within 1 standard deviation of the mean, between x̄ − s and x̄ + s, with 34% on each side.]


Empirical Rule

• One Sigma Rule: approximately 68% of
the data values will lie within one standard
deviation of the mean.
• That is, one can expect a deviation of
more than one sigma from the mean to
occur about once in every three observations.
• This is true because approximately 32%
(roughly 1/3) of the values are
outside one standard deviation from the
mean.
The Empirical Rule
99.7% of data are within 3 standard deviations of the mean
95% within 2 standard deviations
68% within 1 standard deviation

[Figure: normal curve with the horizontal axis marked at x̄ − 3s, x̄ − 2s, x̄ − s, x̄, x̄ + s, x̄ + 2s and x̄ + 3s. Areas on each side of the mean: 34% within one s, 13.5% between one and two s, 2.4% between two and three s, 0.1% beyond three s.]
z-Scores and Location
• By itself, a raw score or X value provides very little
information about how that particular score
compares with other values in the distribution.
• For example, your score (X) = 53. This score may
be a relatively low score, or an average score, or
an extremely high score depending on the mean
and standard deviation for the distribution from
which the score was obtained.
• If you transformed your score (X) into a z-score,
the value of the z-score tells exactly where your
score (x) is located relative to all the other scores
in the class.
z-Scores and Location (cont.)
• The process of changing an X value into a z-score
involves creating a signed number, called a z-score,
such that
a. The sign of the z-score (+ or –) identifies
whether the X value is located above the
mean (positive) or below the mean (negative).
b. The numerical value of the z-score
corresponds to the number of standard
deviations between X and the mean of the
distribution (class average).

z-Scores and Location (cont.)
• Thus, a score (x) that is located two standard
deviations above the mean will have a z-score
of +2.00. And, a z-score of +2.00 always
indicates a location above the mean by two
standard deviations.

Definition of z-score
Population z-score: $z = \dfrac{x - \mu}{\sigma}$
Sample z-score: $z = \dfrac{x - \bar{x}}{s}$
In either case, the z-score tells us how
many standard deviations above (if z > 0)
or below (if z < 0) the mean an observation is.
Interpretation of z-Scores
• If z = 0 an observation is at the mean.
• If z > 0 the observation is above the mean in
value, e.g. if z = 2.00 the observation is 2 SDs
above the mean.
• If z < 0 the observation is below the mean in
value, e.g. if z = -1.00 the observation is 1 SD
below the mean.

The Empirical Rule (z-scores)
99.7% of data are within 3 standard deviations of the mean
95% within 2 standard deviations
68% within 1 standard deviation

[Figure: normal curve with the horizontal axis in z-scores from −3.00 to 3.00. Areas on each side of the mean: 34%, 13.5%, 2.4% and 0.1%.]

The Empirical Rule (z-scores)
Therefore for normally distributed data:
• 68% of observations have z-scores between
-1.00 and 1.00
• 95% of observations have z-scores between
-2.00 and 2.00
• 99.7% of observations have z-scores between
-3.00 and 3.00
Outliers based on z-scores
• When we consider the empirical rule, an
observation with a
z-score < -2.00 or z-score > 2.00
might be characterized as a mild outlier.

• Any observation with a
z-score < -3.00 or z-score > 3.00
might be characterized as an extreme outlier.
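The sample z-score and the outlier cut-offs above can be sketched in Python as follows. The function names `z_scores` and `classify` are assumptions for illustration, and the data are the 30 student weights from the "Students work" exercise. This is one possible way to apply the rule, not a prescribed procedure.

```python
from statistics import mean, stdev

def z_scores(values):
    """Sample z-score (x - mean) / s for every observation."""
    x_bar = mean(values)
    s = stdev(values)
    return [(x - x_bar) / s for x in values]

def classify(z):
    """Label a z-score using the cut-offs of 2 and 3 standard deviations."""
    if abs(z) > 3:
        return "extreme outlier"
    if abs(z) > 2:
        return "mild outlier"
    return "typical"

weights = [65, 67, 70, 71, 68, 69, 65, 68, 65, 68, 69, 83, 90, 45, 49,
           67, 68, 69, 70, 71, 72, 71, 71, 72, 71, 72, 74, 70, 65, 66]

for x, z in zip(weights, z_scores(weights)):
    label = classify(z)
    if label != "typical":
        print(x, round(z, 2), label)
```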
Measures of Shape –
Skewness and Kurtosis
Statistical software packages will give some
measure of skewness and kurtosis for a
given numeric variable.
Skewness measures departure from symmetry
and is usually characterized as being left or
right skewed as seen previously.
Kurtosis measures “peakedness” of a
distribution and comes in two forms,
platykurtosis and leptokurtosis.
Skewness
Pearson’s Skewness Coefficient

$\text{Skewness} = \dfrac{\bar{x} - \text{median}}{s}$

If skewness < −0.20: severe left skewness
If skewness > +0.20: severe right skewness

Fisher’s Measure of Skewness has a complicated
formula, but most software packages compute it.

Fisher’s skewness > 1.00: moderate right tail skewness
Fisher’s skewness > 2.00: severe right tail skewness
Fisher’s skewness < −1.00: moderate left tail skewness
Fisher’s skewness < −2.00: severe left tail skewness
Skewness

[Figure: two example distributions]
Skewness = −0.5786, suggesting slight left skewness.
Skewness = 1.944, suggesting strong right skewness.
Kurtosis
Measures the peakedness of a distribution.
On the scale used here (excess kurtosis), the Normal distribution
has kurtosis = 0.

Leptokurtic distributions are more
peaked than the Normal, with fatter tails:
kurtosis > 0

Platykurtic distributions are less
peaked than the Normal (a squashed Normal):
kurtosis < 0
4- Coefficient of Variation (CV)
• It is a measure of relative variation.
• It is independent of the units used.
• Therefore it can be used for comparing sets of
measurements in different units.

• Suppose we want to compare the dispersion of
two sets of data. If we use the standard deviation
for comparison, it may lead to fallacious results.
• It may be that the two variables involved are
measured in different units. For example, we may
wish to know, for a certain population, whether
serum cholesterol levels, measured in milligrams per
100 ml, are more variable than body weight,
measured in pounds.
• The coefficient of variation is a measure that is
independent of the units of measurement, and it is
the solution for such comparisons.
• It is expressed as a percentage:

$CV = \dfrac{S}{\text{Mean}} \times 100$

That is, the standard deviation
expressed as a percentage of
the mean.
• Example
The following are the ages of 5 children:
5 – 3 – 4 – 7 – 6 years
• Find the coefficient of variation
Answer
ΣX = 5 + 3 + 4 + 7 + 6 = 25
Mean = 25/5 = 5 years, s = 1.58 years
CV = (1.58 / 5) × 100 = 31.6%
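The coefficient of variation for the ages example can be checked with a few lines of Python. The helper name `coefficient_of_variation` is an assumption for illustration. It uses the sample standard deviation (n − 1 denominator), which matches s = 1.58 in the example.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return stdev(values) / mean(values) * 100

ages = [5, 3, 4, 7, 6]  # years
cv = coefficient_of_variation(ages)
print(round(stdev(ages), 2), round(cv, 1))
```

Because the CV is a ratio, rescaling every observation (for example, years to months) leaves it unchanged, which is what makes it usable across different units.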
Basic Probability
• Probability is a mathematical construction
that determines the likelihood of occurrence
of events that are subject to chance.

• When we say an event is subject to chance,
we mean that the outcome is in doubt and
there are at least two possible outcomes.
Example: before a match, the first formal interaction soccer referees have with team captains is a coin toss.
Law 8 states that “the team that wins the toss of a coin decides which goal it will attack in the first half”...
• The probability, P, of an event is the ratio
of successful outcomes, called successes, to all
possible outcomes, called possibilities:

$P = \dfrac{\text{Successes}}{\text{Possibilities}}$
Empirical Probability
Empirical (or statistical) probability is based on
observations obtained from probability experiments.

The empirical probability of an event E is the relative
frequency of event E:

$P(E) = \dfrac{\text{Frequency of event } E}{\text{Total frequency}} = \dfrac{f}{n}$

Example:
A travel agent determines that in every 50 reservations she makes, 12 will be for a
cruise.
What is the probability that the next reservation she makes will be for a cruise?

$P(\text{cruise}) = \dfrac{12}{50} = 0.24$
Certain and impossible
• The probability of an event is a number
between 0 and 1.
• If an event (E) is certain to happen, then
P(E) = 1
• e.g. A die is thrown:
$P(\text{integer}) = \dfrac{6}{6} = 1$

Certain and impossible
• If an event (E) is impossible, then P(E) = 0
• e.g. A die is thrown:
$P(\text{getting a 0}) = \dfrac{0}{6} = 0$
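CHANCE
• Chance is how likely it is that something will
happen. To state a chance, we use a percent.

[Figure: probability scale from 0 to 1 with the matching chance scale from 0% to 100%]
Probability 0 (chance 0%): certain not to happen
Probability ½ (chance 50%): equally likely to happen or not to happen
Probability 1 (chance 100%): certain to happen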
Likelihood

When the probability is greater than
0.5, the event is likely to
happen.
When the probability is smaller than
0.5, the event is unlikely to
happen.
What is the probability that the sun will rise
in the west tomorrow? Zero.

Probability of getting a head when flipping a coin
Example 1:
An unbiased coin

The possible outcomes are Head (H) or Tail (T).

$P(H) = \dfrac{1}{2}$ and $P(T) = \dfrac{1}{2}$
Applied Probability

If you tossed a coin 50
times, how many times
would you expect it to land on tails?

25 times (50 × ½)

What is the probability of getting a 5 or a 6
when a die is thrown?

The total number of possible outcomes is 6.

$P(5 \text{ or } 6) = \dfrac{2}{6} = \dfrac{1}{3} \approx 0.333$
Probability
• A jar contains 12 blue ,
8 green, and 5 red marbles
–If you reach in & choose 1
–What is the P it is blue?
–What is the P it is not blue?
–What is the P it is not black?
• What is the P it is blue?
P = 12/25 = .48
• What is the P it is not blue
P = 13/25 = .52
• What is the P it is not black
P = 25/25 = 1
Example:

There are 5 red marbles,
3 green marbles and 2
black marbles.

$P(\text{red marble}) = \dfrac{5}{10} = \dfrac{1}{2} = 0.5$
The Addition Rule
• It is applied to mutually exclusive events:
– Events that cannot occur together.
• Toss 1 coin: H and T are mutually exclusive; you can
get one or the other, not both.

The probability that event A or B will occur (if they are
mutually exclusive) is given by

P(A or B) = P(A) + P(B).
The Addition Rule

• Example
• Probability of getting a head or a tail when
you toss a coin:

• P(H or T) = (1/2) + (1/2) = 1

Example: in the F2 generation,
what is the probability of
getting purple flowers from
the cross Pp × Pp?
P(purple) = P(PP or Pp) = P(PP) + P(Pp)
= (1/4) + (1/2) = 3/4
The Addition Rule

Example:
You roll a die. Find the probability that you roll a number less
than 3 or a 4.

The events are mutually exclusive.

P(roll a number less than 3 or roll a 4)
= P(number is less than 3) + P(4)
$= \dfrac{2}{6} + \dfrac{1}{6} = \dfrac{3}{6} = 0.5$
The Addition Rule
Example: If a card is drawn at random from a pack of 52
playing cards, find the probability that
either a ‘king’ or a ‘queen’ is drawn.

Number of kings = 4; number of queens = 4

$P(\text{king or queen}) = P(\text{king}) + P(\text{queen}) = \dfrac{4}{52} + \dfrac{4}{52} = \dfrac{8}{52} = \dfrac{2}{13} \approx 0.15$
The Addition Rule
Example:
100 college students were surveyed and asked how many hours
a week they spent studying. The results are in the table below.
Find the probability that a student spends between 5 and 10
hours or more than 10 hours studying.

          Less than 5   5 to 10   More than 10   Total
Male      11            22        16             49
Female    13            24        14             51
Total     24            46        30             100

The events are mutually exclusive.

P(5 to 10 hours or more than 10 hours) = P(5 to 10) + P(more than 10)
$= \dfrac{46}{100} + \dfrac{30}{100} = \dfrac{76}{100} = 0.76$
Multiplication Rule
• The multiplication rule states that the
probability that two or more independent
events will occur together is the product of
their individual probabilities
• It happens for Independent events
– Occurrence of one doesn't affect probability of
the other
Multiplication Rule

• Two cards are drawn one after the other at
random from a pack of 52 playing cards. The first
card drawn is put back into the pack and the pack
is shuffled before the second card is drawn.

• Find the probability that both the first and the second
card drawn are the king of clubs.

P(first is the king of clubs and second is the king of clubs)
$= \dfrac{1}{52} \times \dfrac{1}{52} = \dfrac{1}{2704} \approx 0.00037$
• Example 1: toss 2 coins, or 1 coin 2 times; H1
and T2 are independent

• Probability of getting a head on the first toss and a tail on the second:
P(H1 and T2) = (1/2) × (1/2) = 1/4
The Multiplication and Addition Rules Applied to
Monohybrid Crosses
Example: in the F2 generation, what is the
probability (P) of getting homozygous
round seeds (RR), and what is the P of getting
heterozygous round seeds, from the cross Rr × Rr?

P(RR) = P(R) × P(R) = (1/2) × (1/2) = ¼ (multiplication rule)

P(rR) = P(r) × P(R) = (1/2) × (1/2) = ¼
P(Rr) = P(R) × P(r) = (1/2) × (1/2) = ¼
P of heterozygous round seeds
= P(rR) + P(Rr) = ¼ + ¼ = ½ (addition rule)
Monohybrids
Probability that an egg from F1 (Pp) will receive a (p) allele = ½
Probability that a sperm from F1 (Pp) will receive a (p) allele = ½
The probability that two recessive alleles will unite at fertilization = ½ × ½ = ¼

Dihybrids
For a dihybrid cross, YyRr × YyRr, what is the probability of an F2 plant having the genotype YYRR?
Probability that an egg from a YyRr parent will receive the Y and the R alleles = ½ × ½ = ¼
Probability that a sperm from a YyRr parent will receive the Y and the R alleles = ½ × ½ = ¼
The overall probability of an F2 plant having the genotype YYRR
= ¼ × ¼ = 1/16.
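The dihybrid result can also be obtained by listing every equally likely egg and sperm combination. The sketch below is an illustration under the assumption that the two genes assort independently. The helper name `gametes` is hypothetical.

```python
from fractions import Fraction
from itertools import product

def gametes(genotype):
    """All equally likely gametes for a two-gene genotype such as 'YyRr'."""
    first_gene, second_gene = genotype[:2], genotype[2:]
    # One allele from each gene goes into every gamete.
    return [a + b for a, b in product(first_gene, second_gene)]

eggs = gametes("YyRr")
sperm = gametes("YyRr")

# Every egg can meet every sperm: 4 x 4 = 16 equally likely pairings.
offspring = list(product(eggs, sperm))
favourable = [pair for pair in offspring if pair == ("YR", "YR")]

p_yyrr = Fraction(len(favourable), len(offspring))
print(eggs, p_yyrr)
```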
• A Biostatistics Class has 17 boys
and 16 girls.
• One student is chosen at random.
• The Probability that the student
is a girl is:
• # of students = 16 + 17
• # of students = 33
• # of girls = 16
• P = 16/33 = .485
What’s the probability of …
• getting a 6 on a die
• a letter chosen from the word RABBIT
being a B
• getting a number less than 3 on a
die
• a person’s birthday being on a Sunday
this year
Two dice are rolled. Let us define event E as the set of possible
outcomes where the sum of the numbers on the faces of the two dice is
equal to 5.

S = { (1,1),(1,2),(1,3),(1,4),(1,5),(1,6)
(2,1),(2,2),(2,3),(2,4),(2,5),(2,6)
(3,1),(3,2),(3,3),(3,4),(3,5),(3,6)
(4,1),(4,2),(4,3),(4,4),(4,5),(4,6)
(5,1),(5,2),(5,3),(5,4),(5,5),(5,6)
(6,1),(6,2),(6,3),(6,4),(6,5),(6,6) }

E = {(1,4),(2,3),(4,1), (3,2)}

P(the sum of the faces of the two dice is equal to 5) = 4 × (1/6 × 1/6) = 4/36 = 1/9
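The sample space S and the event E can be enumerated directly. This is a minimal Python sketch with assumed names (`sample_space`, `event`, `p_event`). It counts outcomes instead of multiplying probabilities, and arrives at the same fraction.

```python
from fractions import Fraction
from itertools import product

# All 36 ordered outcomes (first die, second die).
sample_space = list(product(range(1, 7), repeat=2))

# Event E: the two faces sum to 5.
event = [(a, b) for a, b in sample_space if a + b == 5]

p_event = Fraction(len(event), len(sample_space))
print(event, p_event)
```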


Continuous Probability Distributions

• Probability Density Functions

– A continuous random variable is one that can
assume an uncountable number of values.

– We cannot list the possible values because there is
an infinite number of them.
Is cholesterol a problem for young boys?
•The level of cholesterol in the blood is important
because high cholesterol levels may increase the risk of
heart disease.
•The distribution of blood cholesterol levels in a large
population of people of the same age and sex is
roughly Normal.
•For 14-year-old boys, the mean is μ = 170 milligrams
of cholesterol per deciliter of blood (mg/dl) and the
standard deviation is σ = 30 mg/dl.
•Levels above 240 mg/dl may require medical
attention.
•What proportion of 14-year-old boys have more than
240 mg/dl of cholesterol?
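One way to answer this kind of question without a printed table is Python's standard-library `statistics.NormalDist`. The sketch below assumes the Normal model stated above (μ = 170, σ = 30). The variable names are illustrative.

```python
from statistics import NormalDist

# Normal model for cholesterol in 14-year-old boys (mg/dl).
cholesterol = NormalDist(mu=170, sigma=30)

# Standardize 240 mg/dl, then take the area above it.
z = (240 - cholesterol.mean) / cholesterol.stdev
p_above_240 = 1 - cholesterol.cdf(240)

print(round(z, 2), round(p_above_240, 4))
```

The result is an upper-tail area, so it should match the "area beyond z" column of a z-table up to rounding.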
1. Draw the Normal curve
[Figure: Normal curve of cholesterol levels for 14-year-old boys; the shaded proportion under the curve marks the boys who may require medical attention.]

2. Standardize the value and sketch
the standard Normal curve

The steps for finding the area using the z-score table are:
1. Calculate the z-score using the formula z = (x − μ)/σ and round the answer to the
nearest hundredth (two decimal places).
2. Note the absolute value, ignoring the sign.
3. Read the z value to the first decimal place in the leftmost column, then move along the
row to the column showing the second decimal place.
4. Note down the area given inside the table.
Aysha scored 78 on each of two tests. Compared
to the class:
Test A: as a z-score, z = (78 − 70) / 8 = 1.00
Test B: as a z-score, z = (78 − 66) / 6 = 2.00

Conclusion: relative to the class, Aysha did much better on Test B.


2. z-scores enable us to determine the relationship
between one score and the rest of the scores, using just
one table for all normal distributions.
e.g. If we have 480 scores, normally distributed with a
mean of 60 and an SD of 8, how many would be 76 or
above?
(a) Graph the problem.
(b) Work out the z-score for 76:
z = (X − X̄) / s = (76 − 60) / 8 = 16 / 8 = 2.00

(c) We need to know the size of the area beyond z
(remember: the area under the Normal curve corresponds
directly to the proportion of scores).
From the table, the area beyond z = 2.00 is 0.0228.

(d) So, as a proportion of 1, 0.0228 of scores are likely to
be 76 or more.
As a percentage, 0.0228 × 100 = 2.28%.

As a number, 0.0228 × 480 = 10.94, i.e. about 11 scores.


How many scores would be 54 or less?
Graph the problem.

z = (X − X̄) / s = (54 − 60) / 8 = −6 / 8 = −0.75

Use the table, ignoring the sign of z: “area beyond z” for
0.75 = 0.2266. Thus 22.7% of scores (109 scores) are 54
or less.
How many scores would be 76 or less?

Subtract the area above 76 from the total area:
1.000 − 0.0228 = 0.9772. Thus 97.72% of scores are 76 or
less.
How many scores fall between the mean and 76?

Use the “area between the mean and z” column in the
table.
For z = 2.00, the area is 0.4772. Thus 47.72% of scores lie
between the mean and 76.
How many scores fall between 69 and 76?

Find the area beyond 69; subtract from this the area
beyond 76.
Find z for 69: (69 − 60) / 8 = 1.125. “Area beyond z” = 0.1314.
Find z for 76: (76 − 60) / 8 = 2.00. “Area beyond z” = 0.0228.
0.1314 − 0.0228 = 0.1086.
Thus 10.86% of scores fall between 69 and 76 (52 out of
480).
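The table look-ups in this worked example can be cross-checked with `statistics.NormalDist`. The sketch below assumes the same model (mean 60, SD 8, 480 scores). Variable names are illustrative. Small differences from the table answers are expected, because the table rounds z to two decimals (for example, z = 1.125 is read as 1.12).

```python
from statistics import NormalDist

scores = NormalDist(mu=60, sigma=8)
n_scores = 480

p_76_or_more = 1 - scores.cdf(76)                 # area beyond z = 2.00
p_54_or_less = scores.cdf(54)                     # area below z = -0.75
p_76_or_less = scores.cdf(76)                     # total area minus upper tail
p_mean_to_76 = scores.cdf(76) - scores.cdf(60)    # area between mean and z
p_69_to_76 = scores.cdf(76) - scores.cdf(69)      # difference of two tails

for label, p in [("76 or more", p_76_or_more), ("54 or less", p_54_or_less),
                 ("76 or less", p_76_or_less), ("mean to 76", p_mean_to_76),
                 ("69 to 76", p_69_to_76)]:
    print(label, round(p, 4), round(p * n_scores, 1))
```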
Question 1: The test scores of students in a class test have a mean of 70
and a standard deviation of 12. What is the probable percentage of
students who scored more than 85?

Solution:
The z score for the given data is

z = (85 − 70)/12 = 1.25

From the z score table, the fraction of the data below this z score is 0.8944.

This means 89.44% of the students have test scores of 85 or below, and hence the percentage of
students who are above the test score of 85 = (100 − 89.44)% = 10.56%

Hence, the required probable percentage is 10.56%.


Question 2: An organization made a survey of the monthly salary of its
clerical level employees, in dollars. The data revealed a mean of $4000 with
a standard deviation of $600. Find what percentage of employees are in the
salary bracket [3000, 4500].

The z score for a salary of 3000 = (3000 − 4000)/600

= −1.67 (approx)

The z score for a salary of 4500 = (4500 − 4000)/600

= 0.83 (approx)

From the z score table, the fraction of the data below a

z score of −1.67 = 0.0475

z score of 0.83 = 0.7967

Therefore, the fraction of data between the z scores of −1.67 and 0.83 = 0.7967 − 0.0475 = 0.7492

Hence, 74.92% of clerical level employees are within the salary bracket [3000, 4500].
Problems: Normal Distribution
• If the random variable X has a normal distribution with
mean 40 and std. dev. 5, calculate the following
probabilities.
– P(X > 43) =

– P(X < 38) =

– P(X = 40) =

– P(X > 23) =


Problem: Normal
•The time (Y) it takes your professor to drive home each night is
normally distributed with mean 15 minutes and standard
deviation 2 minutes. Find the following probabilities. Draw a
picture of the normal distribution and show (shade) the area that
represents the probability you are calculating.
– P(Y > 25) =

– P( 11 < Y < 19) =

– P (Y < 18) =
In the following example, the mean time it takes expectant mothers to
locate a baby's face in a crowd is 77 milliseconds, with a standard
deviation of 10 milliseconds.
What proportion of expectant mothers took 90 milliseconds (X = 90)
or less to recognize the babies' faces?
In the following example, the average miles per gallon (MPG) a Ford motor car
gets is 23, with a standard deviation of 5. At least how many miles to the gallon do
the top 10% of Ford cars get?
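These two problems run in opposite directions: the first goes from a value to an area, the second from an area back to a value. A possible sketch with `statistics.NormalDist` follows. It assumes both variables are Normally distributed, which the problems imply but do not state, and the variable names are illustrative.

```python
from statistics import NormalDist

# Value to area: proportion of mothers taking 90 ms or less.
recognition = NormalDist(mu=77, sigma=10)
p_90_or_less = recognition.cdf(90)

# Area to value: MPG that separates the top 10% from the other 90%.
mpg = NormalDist(mu=23, sigma=5)
top_10_cutoff = mpg.inv_cdf(0.90)

print(round(p_90_or_less, 4), round(top_10_cutoff, 2))
```

`inv_cdf` plays the role of reading the z-table backwards: find the z whose lower area is 0.90, then convert it to the original units with mean + z × SD.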
Chapter 9:
Basics of Hypothesis Testing
In Chapter 9:

9.1 Null and Alternative Hypotheses


9.2 Test Statistic
9.3 P-Value
9.4 Significance Level
9.5 One-Sample z Test
9.6 Power and Sample Size
Terms Introduced in previous chapters

• Population ≡ all possible values
• Sample ≡ a portion of the population
• Statistical inference ≡ generalizing from a sample to
a population with a calculated degree of certainty
• Two forms of statistical inference
– Hypothesis testing
– Estimation
• Parameter ≡ a characteristic of the population, e.g., the population
mean µ
• Statistic ≡ a quantity calculated from data in the sample, e.g., the sample
mean (x̄)
Distinctions Between Parameters and
Statistics

             Parameters        Statistics
Source       Population        Sample
Notation     Greek (e.g., μ)   Roman (e.g., x̄)
Vary         No                Yes
Calculated   No                Yes
Sampling Distributions of a Mean

The sampling distribution of a mean (SDM)
describes the behavior of a sample mean:

$\bar{x} \sim N(\mu, SE_{\bar{x}})$ where $SE_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$

A sample mean (x̄) based on a large sample will have a Normal sampling
distribution with an expectation equal to the population mean and a standard
error equal to the standard deviation of the population divided by the square
root of the sample size n.
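The standard error formula is easy to explore numerically. In the sketch below the function name `standard_error` and the values of σ and n are hypothetical, chosen only to show how the standard error shrinks as the sample size grows.

```python
from math import sqrt

def standard_error(sigma, n):
    """Standard error of the mean: sigma divided by the square root of n."""
    if n <= 0:
        raise ValueError("n must be positive")
    return sigma / sqrt(n)

# Hypothetical population SD and sample sizes, for illustration only.
sigma = 40
for n in (16, 64, 256):
    print(n, standard_error(sigma, n))
```

Quadrupling the sample size halves the standard error, since n enters through its square root.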
Hypothesis Testing…
• Any study starts by identifying the hypotheses
behind the study.

• We decide between two hypotheses.

• The null hypothesis is
• H0: there is NO difference

• The alternative hypothesis, or research hypothesis,
is
• Ha: there is a difference
Hypothesis Testing…

• In the language of statistics, if the data show a
difference, then you are rejecting the null
hypothesis in favor of the alternative hypothesis.
• That is, there is enough evidence to support the
alternative hypothesis.
Hypothesis Testing…
• There are two possible errors.

• A Type I error occurs when we reject a true
null hypothesis.

• P(Type I error) = α [usually 0.05 or 0.01]

Hypothesis Testing…

A Type II error occurs when we don’t
reject a false null hypothesis [i.e., we retain
the false null hypothesis].
Hypothesis Testing…

• The probability of a Type I error is denoted as α (Greek letter alpha). The probability
of a Type II error is β (Greek letter beta).
• The two probabilities are inversely related. Decreasing one increases the other, for a
fixed sample size.
• In other words, you can’t make α and β both very small at an arbitrary sample size.
Getting both small may require a much larger sample.
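
The trade-off between α and β can be illustrated with a short calculation. This is a sketch under stated assumptions, not taken from the slides: it uses a right-sided one-sample z test with known σ, and the function name `type_ii_error` and the values µ0 = 170, true mean 180, σ = 40 are hypothetical choices for illustration.

```python
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution


def type_ii_error(alpha, mu0, mu_a, sigma, n):
    """beta for a right-sided one-sample z test when the true mean is mu_a."""
    se = sigma / n ** 0.5
    z_crit = Z.inv_cdf(1 - alpha)   # reject H0 when zstat >= z_crit
    shift = (mu_a - mu0) / se       # distance of the true mean from mu0, in SE units
    return Z.cdf(z_crit - shift)    # P(fail to reject H0 | mu = mu_a)


# Fixed sample size: lowering alpha raises beta
for alpha in (0.10, 0.05, 0.01):
    beta = type_ii_error(alpha, mu0=170, mu_a=180, sigma=40, n=64)
    print(f"n=64   alpha={alpha:.2f}  beta={beta:.3f}")

# Larger sample: beta drops at the same alpha
beta_large = type_ii_error(0.05, mu0=170, mu_a=180, sigma=40, n=256)
print(f"n=256  alpha=0.05  beta={beta_large:.3f}")
```

At a fixed n, each reduction in α pushes β up; only the larger sample brings β down without loosening α.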
Hypothesis Testing
The critical concepts are these:
1. There are two hypotheses, the null and the alternative
hypotheses.
2. The procedure begins with the assumption that the null
hypothesis is true.
3. The goal is to determine whether there is enough evidence to
infer that the alternative hypothesis is true, or the null is not
likely to be true.
4. There are two possible decisions:
• Conclude that there is enough evidence to support the
alternative hypothesis. Reject the null.
• Conclude that there is not enough evidence to support the
alternative hypothesis. Fail to reject the null.
Hypothesis Testing Steps
A. Null and alternative hypotheses
B. Test statistic
C. P-value and interpretation
D. Significance level (optional)
1 Null and Alternative Hypotheses
• Convert the research question to null and
alternative hypotheses
• The null hypothesis (H0) is a claim of “no
difference” in the population or between
treatments
• The alternative hypothesis (Ha) claims “H0 is
false”
• Collect data and seek evidence against H0 as a
way of supporting Ha
Illustrative Example: “Body Weight”
• The problem: In the 1970s, 20–29-year-old
men in the U.S. had a mean μ body weight of
170 pounds. Standard deviation σ was 40
pounds. We test whether mean body weight
in the population now differs.
• Null hypothesis H0: μ = 170 (“no difference”)
• The alternative hypothesis can be either Ha: μ
> 170 (one-sided (Tailed) test) or
Ha: μ ≠ 170 (two-sided (Tailed) test)
2 Test Statistic
This is an example of a one-sample test of a
mean when σ is known. Use this statistic to
test the problem:
$z_{\text{stat}} = \dfrac{\bar{x} - \mu_0}{SE_{\bar{x}}}$
where $\mu_0 \equiv$ population mean assuming $H_0$ is true
and $SE_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$
“Body Weight” Example: z statistic
• For the illustrative example, μ0 = 170
• We know σ = 40
• Take an SRS (simple random sample) of n = 64. Therefore
$SE_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}} = \dfrac{40}{\sqrt{64}} = 5$
• If we found a sample mean of 173, then
$z_{\text{stat}} = \dfrac{\bar{x} - \mu_0}{SE_{\bar{x}}} = \dfrac{173 - 170}{5} = 0.60$
Illustrative Example: z statistic
If we found a sample mean of 185, then
$z_{\text{stat}} = \dfrac{\bar{x} - \mu_0}{SE_{\bar{x}}} = \dfrac{185 - 170}{5} = 3.00$
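
The arithmetic in the body weight example can be reproduced in a few lines. This is a minimal sketch; the helper name `z_statistic` is an assumption made for illustration, while the numbers come from the example itself.

```python
import math


def z_statistic(xbar, mu0, sigma, n):
    """z statistic for a one-sample test of a mean when sigma is known."""
    se = sigma / math.sqrt(n)   # standard error of the mean
    return (xbar - mu0) / se


# Body weight example: mu0 = 170, sigma = 40, n = 64
z_173 = z_statistic(xbar=173, mu0=170, sigma=40, n=64)
z_185 = z_statistic(xbar=185, mu0=170, sigma=40, n=64)

print(f"sample mean 173: zstat = {z_173:.2f}")
print(f"sample mean 185: zstat = {z_185:.2f}")
```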
Reasoning Behind zstat

Sampling distribution of x̄ under H0: µ = 170 for n = 64 ⇒ $\bar{x} \sim N(170, 5)$
3 P-value
• The P-value answers the question: What is the
probability of the observed test statistic or one more
extreme when H0 is true?
• This corresponds to the AUC (area under the curve) in the
tail of the Standard Normal distribution beyond the
zstat.
• Convert z statistics to P-values:
For Ha: μ > μ0 ⇒ P = Pr(Z ≥ zstat) = right tail beyond zstat
For Ha: μ < μ0 ⇒ P = Pr(Z ≤ zstat) = left tail beyond zstat
For Ha: μ ≠ μ0 ⇒ P = 2 × one-tailed P-value
• Use Table B or software to find these probabilities (next
two slides).
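
As one way of doing the "software" lookup, the conversion rules above can be sketched with Python's standard library. The function name `p_value` and the labels `"greater"`, `"less"`, and `"two-sided"` are naming assumptions made for this sketch, not notation from the slides.

```python
from statistics import NormalDist


def p_value(zstat, alternative="two-sided"):
    """Convert a z statistic to a P-value under the standard Normal curve."""
    z = NormalDist()  # mean 0, standard deviation 1
    if alternative == "greater":      # Ha: mu > mu0, right tail
        return 1 - z.cdf(zstat)
    if alternative == "less":         # Ha: mu < mu0, left tail
        return z.cdf(zstat)
    if alternative == "two-sided":    # Ha: mu != mu0, double the smaller tail
        return 2 * (1 - z.cdf(abs(zstat)))
    raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")


for zstat in (0.6, 3.0):
    print(
        f"zstat={zstat}: "
        f"right-tail P={p_value(zstat, 'greater'):.4f}, "
        f"two-sided P={p_value(zstat):.4f}"
    )
```

Taking the absolute value of zstat in the two-sided branch means the doubling always uses the smaller tail, whichever side the statistic falls on.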
[Figure: One-sided (Tailed) P-value for zstat of 0.6]
[Figure: One-sided (Tailed) P-value for zstat of 3.0]
Two-Sided (Tailed) P-Value
• One-sided Ha ⇒ AUC in tail beyond zstat
• Two-sided Ha ⇒ consider potential deviations in both
directions ⇒ double the one-sided P-value
Examples: If one-sided P = 0.0010, then two-sided
P = 2 × 0.0010 = 0.0020.
If one-sided P = 0.2743, then two-sided P = 2 ×
0.2743 = 0.5486.
Interpretation
• The P-value answers the question: What is the
probability of the observed test statistic …
when H0 is true?
• Thus, smaller and smaller P-values provide
stronger and stronger evidence against H0
• Small P-value ⇒ strong evidence
Interpretation
Conventions*
P > 0.10 ⇒ non-significant evidence against H0
0.05 < P ≤ 0.10 ⇒ marginally significant evidence
0.01 < P ≤ 0.05 ⇒ significant evidence against H0
P ≤ 0.01 ⇒ highly significant evidence against H0

Examples
P = .27 ⇒ non-significant evidence against H0
P = .01 ⇒ highly significant evidence against H0
* It is unwise to draw firm borders for “significance”
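Interpreting the p-value…
[Figure: p-value scale marked at 0, .01, .05, and .10]
0 to .01 ⇒ Overwhelming Evidence (Highly Significant); example: p = .001
.01 to .05 ⇒ Strong Evidence (Significant)
.05 to .10 ⇒ Weak Evidence (Not Significant)
above .10 ⇒ No Evidence (Not Significant); example: p = .27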
α-Level (Used in some situations)

• Let α ≡ probability of erroneously rejecting H0


• Set α threshold (e.g., let α = .10, .05, or whatever)
• Reject H0 when P ≤ α
• Retain H0 when P > α
• Example: Set α = .10. Find P = 0.27 ⇒ retain H0
• Example: Set α = .01. Find P = .001 ⇒ reject H0
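
The decision rule above is simple enough to state as code. This is a small sketch; the function name `decide` and the returned strings are illustrative assumptions.

```python
def decide(p, alpha):
    """Apply the fixed-level rule: reject H0 when P <= alpha, otherwise retain H0."""
    if not 0 <= p <= 1 or not 0 < alpha < 1:
        raise ValueError("p must be in [0, 1] and alpha in (0, 1)")
    return "reject H0" if p <= alpha else "retain H0"


# The two examples from the slide
print(decide(p=0.27, alpha=0.10))
print(decide(p=0.001, alpha=0.01))
```

Note that the rule uses "≤", so a P-value exactly equal to α leads to rejection.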
(Summary) One-Sample z Test
A. Hypothesis statements
H0: µ = µ0 vs.
Ha: µ ≠ µ0 (two-sided or tailed test) or
Ha: µ < µ0 (left-sided or tailed test) or
Ha: µ > µ0 (right-sided or tailed test)
B. Test statistic
$z_{\text{stat}} = \dfrac{\bar{x} - \mu_0}{SE_{\bar{x}}}$ where $SE_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$
C. P-value: convert zstat to P value
D. Significance statement (usually not necessary)
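
Steps A through C of the summary can be collected into one routine. The following is a sketch under the stated conditions (σ known, summary statistics supplied by the caller); the names `one_sample_z_test` and `ZTestResult` and the `alternative` labels are hypothetical, chosen only for this illustration. Step D is left to the reader because the slides treat it as optional.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import NormalDist


@dataclass
class ZTestResult:
    se: float        # standard error of the mean
    zstat: float     # test statistic
    p_value: float   # P-value for the chosen alternative


def one_sample_z_test(xbar, mu0, sigma, n, alternative="two-sided"):
    """One-sample z test of H0: mu = mu0 when sigma is known."""
    if sigma <= 0 or n <= 0:
        raise ValueError("sigma and n must be positive")
    z = NormalDist()
    se = sigma / sqrt(n)
    zstat = (xbar - mu0) / se
    if alternative == "two-sided":    # Ha: mu != mu0
        p = 2 * (1 - z.cdf(abs(zstat)))
    elif alternative == "greater":    # Ha: mu > mu0
        p = 1 - z.cdf(zstat)
    elif alternative == "less":       # Ha: mu < mu0
        p = z.cdf(zstat)
    else:
        raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")
    return ZTestResult(se=se, zstat=zstat, p_value=p)


# Usage with the body weight example (sample mean 185, right-sided alternative)
result = one_sample_z_test(xbar=185, mu0=170, sigma=40, n=64, alternative="greater")
print(f"SE = {result.se:.2f}, zstat = {result.zstat:.2f}, P = {result.p_value:.4f}")
```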
5 Conditions for z test
• σ known (not from data)
• Population approximately Normal or large
sample (central limit theorem)
• SRS
• Data valid
The Lake Wobegon Example
“where all the children are above average”

• Let X represent Wechsler Adult Intelligence Scale (WAIS) scores
• Typically, X ~ N(100, 15)
• Take SRS of n = 9 from Lake Wobegon population
• Data ⇒ {116, 128, 125, 119, 89, 99, 105, 116, 118}
• Calculate: x̄ = 112.8
• σ = 15
• Does the sample mean provide strong evidence that
population mean μ > 100?
Example: “Lake Wobegon”
A. Hypotheses:
H0: µ = 100 versus
Ha: µ > 100 (one-sided)
Ha: µ ≠ 100 (two-sided)
B. Test statistic:
$SE_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}} = \dfrac{15}{\sqrt{9}} = 5$
$z_{\text{stat}} = \dfrac{\bar{x} - \mu_0}{SE_{\bar{x}}} = \dfrac{112.8 - 100}{5} = 2.56$
C. P-value: P = Pr(Z ≥ 2.56) = 0.0052

P = .0052 ⇒ it is unlikely the sample came from this
null distribution ⇒ strong evidence against H0
Two-Sided P-value: Lake Wobegon
• Ha: µ ≠ 100
• Considers random deviations “up” and “down” from μ0 ⇒ tails
above and below ±zstat
• Thus, two-sided P = 2 × 0.0052 = 0.0104
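
The Lake Wobegon calculation can also be run directly from the nine raw scores. This is a minimal sketch; the variable names are assumptions. It works from the unrounded sample mean (1015/9), so the results agree with the slide values only up to rounding.

```python
import math
from statistics import NormalDist, fmean

# Data and null values from the Lake Wobegon example
scores = [116, 128, 125, 119, 89, 99, 105, 116, 118]
mu0, sigma = 100, 15

n = len(scores)
xbar = fmean(scores)              # unrounded sample mean
se = sigma / math.sqrt(n)         # 15 / sqrt(9)
zstat = (xbar - mu0) / se

p_one_sided = 1 - NormalDist().cdf(zstat)   # Ha: mu > 100
p_two_sided = 2 * p_one_sided               # Ha: mu != 100 (valid here because zstat > 0)

print(f"n = {n}, x-bar = {xbar:.1f}, SE = {se:.1f}, zstat = {zstat:.2f}")
print(f"one-sided P = {p_one_sided:.4f}, two-sided P = {p_two_sided:.4f}")
```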
Example: Grade inflation?
(Has mean GPA increased since 1990?)

Population of 5 million college students: Is the average GPA 2.7?
(Imagine that 2.7 was mean GPA for U.S. college students in 1990)

Sample of 100 college students: How likely is it that 100 students would
have an average GPA as large as 2.91 if the population average was 2.7?
Example: Grade inflation?
Has mean GPA increased since 1990?

µ = current mean GPA of U.S. college students

H0: µ = 2.7 (mean GPA now is same as it was in 1990)
Ha: µ > 2.7 (mean GPA now is greater than it was in 1990)

The alternative hypothesis reflects the research hypothesis that
the mean GPA for college students is greater than it was
in 1990.
Example: Grade inflation?
(Has mean GPA increased since 1990?)

x̄ = 2.91
σ = .61
n = 100
GPA distribution reasonably normal.

How likely are we to obtain a sample mean this large
when sampling from a population whose mean µ = 2.70?
Example: Grade Inflation (cont’d)
Test Statistic for a Single Population Mean (µ)

$Z_{\text{stat}} = \dfrac{\bar{X} - \mu_0}{SE(\bar{X})} = \dfrac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} = \dfrac{2.91 - 2.7}{.61/\sqrt{100}} = 3.44$
Example: Grade Inflation (cont’d)
p-value calculation and interpretation

P(Z > 3.44) = .0003. Therefore the probability
that chance variation alone would produce a
sample mean of 2.91 or larger when sampling from a
population whose mean is actually 2.7 is .0003,
or 3 out of 10,000. It is highly unlikely that
chance variation would produce this result!
Decision:
• Because our p-value = .0003 < .05 we reject
the null hypothesis in favor of the
alternative.

Interpretation:
• We conclude that the mean GPA of U.S.
college students today is greater than 2.70,
which is what it was back in 1990.
