CHAPTER 1: Introduction
Learning objectives:
After completing this chapter, the student will be able to:
1. Define Statistics and Biostatistics
2. Enumerate the importance and limitations of statistics
3. Define and identify the different types of data and understand why we need
to classify variables
Statistical thinking has nowadays become essential in many fields of study. Its
usefulness has spread to such diverse fields as agriculture, business, accounting,
marketing, economics, management, medicine, political science, psychology, sociology,
engineering, journalism, meteorology, tourism, etc. In biomedical research, meaningful
conclusions can only be drawn from data collected under a valid scientific design
and analyzed with appropriate statistical methods. Therefore, the selection of an appropriate study
design is important to provide an unbiased and scientific evaluation of the research
questions. Each design is based on a certain rationale and is applicable in certain
experimental situations.
Biostatistics is the segment of statistics that deals with data arising from biological
processes or medical experiments. Thus biostatistics is the application of statistical
techniques in a health-related area (application of statistical methods to biological,
medical and public health data).
Why biostatistics?
Because some statistical methods are more heavily used in health applications than
elsewhere (e.g., survival analysis, longitudinal data analysis).
The word statistics, on the other hand, has two meanings. In the more common
usage, statistics (plural sense) refers to numerical information (aggregates of facts).
Examples include statistics of births, disease cases, imports, exports, etc. In these
examples, statistics are numbers or facts. The subject of statistics (singular sense)
has a much broader meaning than just collecting and publishing numerical
information. Statistics in this sense may be defined as the science of
JU / Biostatistics
i. Collection of data
ii. Organization of data
iii. Presentation of data
iv. Analysis of data and
v. Interpretation of data
Stage 1: Data collection
Is the process of gathering information or data about the variable of interest for
a specific purpose.
It constitutes the first step in a statistical investigation.
Utmost care must be exercised in collecting data because they form the
foundation of statistical analysis. If the data are faulty, the conclusions drawn
can never be reliable. The data may be available from existing published or
unpublished sources, or they may be collected by the investigator. That is,
data may be obtained from either primary or secondary sources.
Stage 2: Organization of data
Is the process of editing, classification and tabulation of data.
• Editing: is the process of checking and correcting the collected data for omissions,
inconsistencies, irrelevant answers and wrong computations.
• Classification: is the task of grouping the collected and edited data into
similar categories based on some criteria.
• Tabulation: is putting the classified data in the form of a table.
Arranging or classifying data in a suitable order makes the information
easier to present.
Stage 3: Presentation of data
The organized data can now be presented in the form of tables and diagrams.
At this stage, large data will be presented in tables in a very summarized and
condensed manner.
The main purpose of data presentation is to facilitate statistical analysis.
Graphs and diagrams may also be used to give the data a vivid meaning and
make the presentation attractive.
Hence statistics has already become a very important subject area, and various
tools of statistics are being used to solve problems in everyday life, in research,
marketing, production planning, quality control and other areas. Nevertheless,
statistics has its own limitations and it can also be misused.
Specific Uses:-
Statistics condenses and summarizes complex data. The original set of data (raw data) is
normally voluminous and disorganized unless it is summarized and expressed in a few
numerical values.
Statistics facilitates comparison of data. Measures obtained from different sets of data can be
compared to draw conclusions about those sets. Statistical values such as averages,
percentages, ratios, etc., are the tools that can be used for the purpose of comparing sets of
data.
Statistics helps in predicting future trends. Statistics is extremely useful for analyzing
past and present data and predicting future trends.
Statistics influences the policies of government. The results of statistical studies in areas such as
taxation, the unemployment rate, the performance of every sort of military equipment,
etc., may convince a government to review its policies and plans with a view to meeting
national needs and aspirations.
Statistical methods are very helpful in formulating and testing hypotheses and in developing
new theories.
Limitations:-
Statistics doesn’t deal with single (individual) values. Statistics deals only with aggregate
values. But in some situations a single individual is highly important to consider,
for example the sun, a bus driver, a president, etc.
Statistics can’t directly deal with qualitative characteristics. It only deals with data which
can be quantified. For example, it does not deal with marital status itself (married, single, divorced,
widowed), but it does deal with the number of married, the number of single and the number of divorced persons.
Statistical conclusions are not universally true. Statistical conclusions are true only under
certain conditions or only on average. The conclusions drawn from the analysis of a
sample may differ from the conclusions that would be drawn from the entire
population. For this reason, statistics is not an exact science.
For example, suppose 100 males and 2 females live on an island for one year, and the 2
females marry two of the males. This means that 100% of the females are married to 2% of the
males. Based on this information one might try to conclude that “the birth rate of males is
higher than that of females”. This conclusion may or may not be true.
Statistical interpretations require a high degree of skill and understanding of the subject.
It requires extensive training to read and interpret statistics in their proper context. Wrong
conclusions may result if inexperienced people try to interpret statistical results.
Statistics can be misused. Sometimes statistical figures can be misleading unless they are
carefully interpreted.
For example, an official report on the Ethio-Somalia anti-terrorist mission stated that the mission
eliminated 25% of the terrorists on the first day, 50% on the second day and 75% on the third day. However,
it is doubtful how the outcome of the mission was measured and quantified. This leads to
misuse of statistical figures.
Misuse of Statistics
Statistics can be misused in several ways unless the necessary data are collected
wisely and appropriate methods are applied. Statistics can be misused in the
following ways
a) They can be used for wrong purposes; that is for purposes that are
different from the purpose for which they are collected.
b) They can be collected incorrectly and so are biased
c) They can be analyzed carelessly and the results obtained are misleading.
Nominal Scale:- Data that represent categories or names. There is no implied order
to the categories of nominal data. In these types of data, individuals are simply
placed in the proper category or group, and the number in each category is
counted. Each item must fit into exactly one category. For example, in a study of
propranolol, patients with myocardial infarction may be recorded as treated or control,
and by survival status.
Other examples:
Religion: Christianity, Islam, Hinduism, etc.
Sex: Male, Female
Eye color: brown, black, etc.
Ordinal Scale:- Used whenever observations are not only different from category to
category, but can also be ranked according to some criterion. The variables deal with
relative differences rather than with quantitative differences. The spaces or
intervals between the categories are not necessarily equal.
Ordinal data are data which can have meaningful inequalities. The inequality signs
< or > may assume any meaning like ‘stronger, softer, weaker, better than’, etc.
Example:
Patients may be characterized as unimproved, improved & much improved.
Letter grading system, authority, career, etc.
Individuals may be classified according to socio-economic status as low, medium &
high. It is usually impossible to infer that the difference between members of one
category and the next adjacent category is equal to the difference between members
of any other two adjacent categories.
Interval Scale: With this scale it is possible not only to order measurements; the
distance between any two measurements is also known, although quotients (ratios) are
not meaningful. There is no true zero point, only an arbitrary one. Interval data are the
type of information in which an increase from one level to the next always reflects
the same increase in the characteristic. Interval data can be added or subtracted, but
they may not be multiplied or divided.
Example:
A temperature of zero degrees does not indicate a lack of heat. Consider the two common
temperature scales, Celsius (C) and Fahrenheit (F). The same
difference exists between 10°C (50°F) and 20°C (68°F) as between 25°C (77°F) and
35°C (95°F), i.e., the measurement scale is composed of equal-sized intervals. But we
cannot say that a temperature of 20°C is twice as hot as a temperature of 10°C,
because the zero point is arbitrary.
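The temperature example can be checked numerically. The sketch below is only an illustration; the function name `celsius_to_fahrenheit` is an assumption introduced here, not part of the notes. It shows that equal Celsius differences stay equal in Fahrenheit, while the ratio 20/10 is not preserved.

```python
def celsius_to_fahrenheit(c):
    """Convert a Celsius reading to Fahrenheit (F = 9C/5 + 32)."""
    return c * 9 / 5 + 32

# Differences are meaningful on an interval scale:
# 10 -> 20 degrees C and 25 -> 35 degrees C are the same step in Fahrenheit.
diff_low = celsius_to_fahrenheit(20) - celsius_to_fahrenheit(10)
diff_high = celsius_to_fahrenheit(35) - celsius_to_fahrenheit(25)

# Ratios are not meaningful: the ratio depends on the arbitrary zero point.
ratio_c = 20 / 10
ratio_f = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)

print("Fahrenheit differences:", diff_low, diff_high)
print("Ratio in Celsius:", ratio_c, "Ratio in Fahrenheit:", ratio_f)
```

Because the same two temperatures give a ratio of 2 on one scale and about 1.36 on the other, "twice as hot" has no fixed meaning for interval data.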
Ratio Scale:- Characterized by the fact that equality of ratios as well as equality of
intervals may be determined. Fundamental to ratio scales is a true zero point.
E.g., variables such as age, height, length, volume, rate, time, amount of rainfall,
etc. are measured on a ratio scale.
Elementary unit: Is the specific person, business, product, and so on, with some
characteristics to be measured or categorized (information is recorded on it).
E.g., for the weight of a particular person in the class, the person is the elementary
unit.
methods used to gather the required information from the units under
investigation. The quality of data greatly affects the final output of an investigation.
Hence, utmost care should be given to the data collection process and every
possible precaution should be taken to ensure accuracy while collecting data.
Otherwise, with inaccurate and inadequate data, the whole analysis is likely to be
faulty and the decisions to be taken will also be misleading.
i) Direct Observation
In this approach, an investigator stays at the place of the survey and notes down first-hand
information. Direct observation can be used to discover a variety of
information including consumer behavior, working methods & other aspects of
social & economic behavior. Direct observation is more experimental and is usually
applied in scientific studies. It is time consuming and costly. The method
is also highly subjective.
ii) Interview Method-
It is a conversation between two parties, initiated by the interviewer in order to
obtain the required information. The interviewer prepares in advance a series of questions
selected for his/her work & conducts the interview. Interviewing is a
Types of Questions
1. Open-ended Questions: permit free responses that should be recorded in the
respondent’s own words. The respondent is not given any possible answers to
choose from. Such questions are useful to obtain information on:
- Facts with which the researcher is not very familiar,
- Opinions, attitudes, and suggestions of informants, or
- Sensitive issues
2. Closed Questions: offer a list of possible options or answers from which the
respondents must choose. When designing closed questions one should try to:
- Offer a list of options that are exhaustive and mutually exclusive,
and
Sample survey is simply the process of learning about the population on the basis
of a sample drawn from it. Thus, in the sampling technique, only a part of the
universe is studied instead of every unit, and conclusions are drawn on
that basis for the entire universe. A sample is a subset of population units. The
process of sampling involves three elements:
a. Selecting the sample.
b. Collecting the information, and
c. Making an inference about the population.
for it. First, it is always possible to determine the extent of sampling errors.
Secondly, other types of errors to which a survey is subject, such as
inaccuracy of information, incompleteness of returns, etc., are likely to be
more serious in a complete census than in a sample survey. This is because
more effective precautions can be taken in a sample survey to ensure that
information is accurate and complete. For these reasons, not only can the total
error be expected to be smaller in a sample survey, but the sample result can also
be used with a greater degree of confidence because of our knowledge of the
probable size of the error. Thirdly, it is possible to use the services of experts
and to give thorough training to the investigators in a sample survey, which
further reduces the possibility of errors. Follow-up work can also be
undertaken much more effectively in the sampling method. Indeed, even a
complete census can only be tested for accuracy by some type of sampling
check.
iv) More Detailed Information: Since the sampling technique saves time and
money, it is possible to collect more detailed information in a sample survey.
For example, if the population consists of 1000 persons in a survey of the
consumption pattern of the people, the two alternative techniques available are
as follows:
We may collect the necessary data from each one of the 1000
people through a questionnaire containing, say, 10 questions (census method),
or
we may take a sample of 100 persons (i.e., 10% of the population) and prepare a
questionnaire containing as many as 100 questions. The expenses involved in
the latter case would be almost the same as in the former, but it would enable ten
times more information to be obtained about each person surveyed.
v) Sampling Method is the only Method that can be used in Certain Cases:
There are some cases in which the census method is inapplicable and the only
practicable means is provided by the sample method. For example, if one is
interested in testing the breaking strength of chalks manufactured in a factory,
under the census method all the chalks would be broken in the process of
testing. Hence, the census method is impracticable and the sample method
must be used.
vi) The Sample Method is often used to Judge the Accuracy of the
Information Obtained on a Census Basis: For example, in the population
census, which is conducted periodically (every 10 years in our country), the field
officers employ the sample method to determine the accuracy of information
obtained by the enumerators on the census basis.
Demerits
Despite the various advantages of sampling, it is not completely free from
limitations.
i. A sample survey must be carefully planned and executed; otherwise the results
obtained may be inaccurate and misleading. Of course, care must be taken even for a
complete count, but serious errors may arise in sampling if the
sampling procedure is not sound.
ii. Sampling generally requires the services of experts. In the absence of
qualified and experienced persons, the information obtained from sample
surveys cannot be relied upon. In India, shortage of experts in the sampling
field is a serious hurdle in the way of reliable statistics.
iii. When the sampling plan is very complicated, it may require more
time, labor and money than a complete count. This is so if the size of the sample
is a large proportion of the total population and if complicated weighting
procedures are used. With each additional complication in the survey, the
chances of error multiply and greater care has to be taken, which in turn
needs more time and labor.
iv. If the information is required for each and every unit in the domain of study,
complete enumeration survey is necessary.
Tables
The use of tables for presenting data involves grouping the data into mutually
exclusive categories of the variable and counting the number of occurrences
(frequency) in each category.
Based on the purpose for which the table is designed and the complexity of the
relationship, a table could be either a simple frequency table or a cross-tabulation.
The simple frequency table is used when the individual observations involve only
a single variable, whereas the cross-tabulation is used to obtain the frequency
distribution of one variable by the subsets of another variable.
Principles of Table Construction
1. Tables should be as simple as possible.
2. Tables should be self-explanatory.
3. If data are not original, their source should be given in a footnote.
Examples:
a) Simple Frequency Table (Qualitative Data)
Table 1: Overall Immunization Status of children in Adami Tulu Woreda, Feb. 1995
Immunization Status     Number   Percent
Not Immunized           75       35.7
Partially Immunized     57       27.1
Fully Immunized         78       37.2
Total                   210      100.0
Source: Fikru T. et al. EPI Coverage in Adami Tulu, Eth. J. Health Dev. 1997;
11(2): 109-113
Tables of the above type are also known as simple or one-way tables
c) Cross Tabulations
Table 3: TT Immunization Status by Marital Status of the women of child bearing age,
Assendabo town, 1996
Immunization Status
0,2,3,1,1,3,4,2,0,3,4,2,2,1,0,4,1,2,2,3
Construct: ungrouped frequency distribution, relative frequency (Rf), percentage frequency
(Pf).
No. of children/family   Frequency   Rf            Pf
0                        3           3/20 = 0.15   15
1                        4           4/20 = 0.20   20
2                        6           6/20 = 0.30   30
3                        4           4/20 = 0.20   20
4                        3           3/20 = 0.15   15
Total                    20          1             100
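The ungrouped frequency distribution above can be reproduced with a short script. This is a minimal sketch using the 20 observations listed in the example; the names `children`, `frequency_table` and `table` are illustrative assumptions.

```python
from collections import Counter

# Number of children per family for the 20 families in the example.
children = [0, 2, 3, 1, 1, 3, 4, 2, 0, 3, 4, 2, 2, 1, 0, 4, 1, 2, 2, 3]

def frequency_table(values):
    """Return rows of (value, frequency, relative frequency, percentage frequency)."""
    counts = Counter(values)
    n = len(values)
    return [(v, counts[v], counts[v] / n, 100 * counts[v] / n)
            for v in sorted(counts)]

table = frequency_table(children)
for value, f, rf, pf in table:
    print(f"{value}\t{f}\t{rf:.2f}\t{pf:.0f}")
```

The relative frequencies always sum to 1 and the percentage frequencies to 100, which is a quick check on any hand-built table.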
When we deal with large sets of data, a good overall picture and sufficient
information can often be conveyed by grouping the data into a number of classes.
To group a set of observations we select a set of contiguous non-overlapping
intervals such that each value in the set of observations can be placed in one, and
only one of the intervals, called class intervals.
One of the first considerations when data are to be grouped is how many intervals
to include. We seldom use fewer than 6 or more than 15 class intervals. If there are
fewer than six intervals the data have been summarized too much and the
information they contain has been lost; the exact number we use in a given
Sturges' formula for deciding the number of class intervals is given by:
$k = 1 + 3.322 \log_{10} n$,
where k = number of class intervals and n = number of observations.
There are two types of frequency distribution
a) Inclusive
b) Exclusive
a) In inclusive type of frequency distribution, the upper limit of one class does
not coincide with the lower limit of the next class.
b) In exclusive type of frequency distribution, the upper limit of one class
coincides with the lower limit of the next class.
Example 2.1 The following data illustrate the inclusive type of frequency distribution.
Consider the following raw data on weights of 30 adults (aged less than 20):
The range for the above ungrouped data is 49 - 12 = 37. Normally it is desirable to divide the
range into 6 to 10 classes. Consider the class 11 - 15. If an adult has a weight of 11 or 15, he/she
will be put in this class. For this class, 11 is the lower limit and 15 is the upper limit, and both are
included in the class. In the 'exclusive' type, however, one of the class limits (usually the upper) is
excluded from the class; the above frequency distribution can also be rewritten in the following
exclusive way:
Terms (Definition): -
1. Class Limit: - the values that separate one class from another, leaving a gap between classes. The two class limits
are:-
■ Upper class limit (UCL)
■ Lower class limit (LCL)
There is a gap between UCL_i and LCL_(i+1).
Unit of measurement (U):- is the smallest possible distance between two consecutive
measures.
U is usually taken as 1.
2. Class Boundary:- has two parts. These are
■ Upper class boundary (Ucb)
■ Lower class boundary (Lcb).
1. $R = X_{max} - X_{min} = 39 - 6 = 33$
2. $K = 1 + 3.322 \log_{10} 20 = 5.32 \approx 5$
3. $W = R/K = 33/5 = 6.6 \approx 7$
4. Determine $LCL_1 = X_{min} = 6$
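The four steps above can be sketched in code. This is an illustration under stated assumptions: k is rounded to the nearest integer and the width is rounded up, as in the worked steps, and the function name `sturges_classes` and the default `unit=1` are introduced here for the example only.

```python
import math

def sturges_classes(x_min, x_max, n, unit=1):
    """Build inclusive class limits following the steps of the worked example.

    Assumes k is rounded to the nearest integer and the class width is
    rounded up to a whole number, as done in the notes.
    """
    value_range = x_max - x_min                  # step 1: R = Xmax - Xmin
    k = round(1 + 3.322 * math.log10(n))         # step 2: Sturges' formula
    width = math.ceil(value_range / k)           # step 3: W = R / K, rounded up
    limits = []
    for i in range(k):
        lower = x_min + i * width                # step 4: first LCL is Xmin
        upper = lower + width - unit
        limits.append((lower, upper))
    # Class boundaries lie half a unit outside the class limits.
    boundaries = [(lo - unit / 2, hi + unit / 2) for lo, hi in limits]
    return k, width, limits, boundaries

k, width, limits, boundaries = sturges_classes(6, 39, 20)
print(k, width)
print(limits)
print(boundaries)
```

With Xmin = 6, Xmax = 39 and n = 20 this gives five classes of width 7, the last being 34-40 with boundaries 33.5-40.5.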
20-26   6   23   19.5-26.5   12   14
27-33   5   30   26.5-33.5   17   8
34-40   3   37   33.5-40.5   20   3
The choice of a particular form among the different possibilities will depend on
personal preference and/or the type of data.
Bar charts and pie charts are commonly used for qualitative or quantitative
discrete data.
Histograms and frequency polygons are used for quantitative continuous
data.
i) Bar-Chart (Bar diagram): A series of equally spaced bars of equal width (base),
where the height of each bar represents the frequency (or amount) associated with each class.
Usually applied for categorical random variables.
A bar chart can be either vertical or horizontal.
There are various types of bar charts
1. Simple Bar Charts: Represent data by a series of bars, the height (length) of
each bar indicating the size of the figure represented.
Example: The following table shows the (arbitrary) number of students in the
faculty of Medical Sciences. Show these numbers graphically, using a simple bar
chart.
Sex      Year
         I    II   III  IV   V
Female   25   20   15   15   10
Male     55   55   50   55   50
Total    80   75   65   70   60
Fig 1: Vertical bar chart for number of students in the Faculty of Medical
Sciences (in 2002). Vertical axis: No. of students.
2. Component Bar Charts: are like ordinary bar charts except that the bars are
subdivided into component parts. This sort of chart is constructed when
each total figure is built up from two or more component figures.
A pie chart is a circle divided by radial lines into sections (like slices of a cake
or a pie; hence the name) so that the area of each section is proportional to the
size of the figure presented. It is a convenient way of showing the sizes of
component figures in proportion to each other and to the overall total.
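To draw a pie chart by hand, each figure is converted to an angle out of 360 degrees. The sketch below is an illustration only: it reuses the counts from Table 1 (immunization status) as example data, and the names `pie_angles` and `status_counts` are assumptions introduced here.

```python
def pie_angles(counts):
    """Return the central angle (in degrees) of each pie section."""
    total = sum(counts.values())
    return {label: 360 * value / total for label, value in counts.items()}

# Counts taken from Table 1 (immunization status), used here as example data.
status_counts = {
    "Not Immunized": 75,
    "Partially Immunized": 57,
    "Fully Immunized": 78,
}

angles = pie_angles(status_counts)
for label, angle in angles.items():
    print(f"{label}: {angle:.1f} degrees")
```

Since each angle is proportional to its count, the angles always add up to 360 degrees.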
A line graph is appropriate when we need to present the movement or variation in a variable. It is
quite simple to draw and indicates the increase or decrease in a variable over time or across
observations. Line graphs can be used for discrete data. Recall that in the case of continuous data
we assumed that the average value of each class is its midpoint. Thus we can plot the frequencies
for each class against its mid-point and join these points to obtain a line graph.
E.g. Consider the data on time (in hours) that 20 college students devoted to leisure activities
during a typical school week:
Class limit   Class boundary   Frequency
6-10          5.5-10.5         1
11-15         10.5-15.5        2
16-20         15.5-20.5        3
21-25         20.5-25.5        5
26-30         25.5-30.5        4
31-35         30.5-35.5        3
36-40         35.5-40.5        2
Total                          20
Fig 6: Histogram (vertical axis: Frequency; horizontal axis: Class boundary)
3. Frequency Polygon: Is the line graph that displays the data using a line that connects points
plotted for the frequencies at the class marks, i.e., the height of the point above each class mark
represents the frequency of that class.
A frequency polygon can also be superimposed on a histogram.
[Figure: frequency polygon. Vertical axis: Frequency; horizontal axis: Class boundaries, 5.5 to 40.5]
Fig 10: Mcf & Lcf with their intersection (horizontal axis: Class boundary; the intersection point is labeled Median)
Exercise. The following table is a grouped frequency distribution of money spent per visit by a
random sample of 100 customers at a department store.
When we want to compare groups of numbers, it is good to have a single value
that is considered a good representative of each group. This single value is called the
average of the group. The tendency of statistical data to concentrate at certain
values is called “Central Tendency”, and the various methods of determining the
actual value at which the data tend to concentrate are called measures of central
tendency or averages. An average that is representative is called a typical average, and
an average that is not representative and has only a theoretical value is called a descriptive
average.
There are different types of measure of central tendency (and measure of position)
o Mean (Arithmetic, Geometric, and Harmonic)
o Median (the middle value)
o Mode (the most frequently appearing value)
o Quantiles (quartiles, Deciles, percentiles)
The choice of average depends on which one best fits the property under discussion.
Arithmetic Mean (A.M)
- Is defined as the value each item in the distribution would have if all the values
were shared out equally among all the items.
- Is the measure to which we usually refer in everyday life when we use the word
“average.”
- Obtained by adding all the values in a population or sample and dividing by the
number of values that are added.
• For raw data:
Sample mean: $\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Population mean: $\mu = \frac{X_1 + X_2 + \dots + X_N}{N} = \frac{1}{N}\sum_{i=1}^{N} X_i$
• For an ungrouped frequency distribution:
$\bar{x} = \frac{1}{\sum_{i=1}^{k} f_i}\sum_{i=1}^{k} f_i x_i$, where k is the number of classes
and $f_i$ is the number of occurrences of $x_i$.
• For a grouped frequency distribution the same formula is used,
where $x_i$ is the class mark of the ith class and $f_i$ is the frequency of the ith class.
Example: Calculate the mean for the following age distribution.
Class   Frequency
6-10    35
11-15   23
16-20   15
21-25   12
26-30   9
31-35   6
o Solutions:
o First find the class marks.
o Find the product of frequency and class marks.
o Find the mean using the formula.
Class   fi    xi   xifi
6-10    35    8    280
11-15   23    13   299
16-20   15    18   270
21-25   12    23   276
26-30   9     28   252
31-35   6     33   198
Total   100        1575
$\bar{x} = \frac{1}{\sum_{i=1}^{k} f_i}\sum_{i=1}^{k} f_i x_i = \frac{1}{100}(1575) = 15.75$
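The grouped-mean calculation can be sketched as follows, using the age distribution from the example. The function name `grouped_mean` and the variable names are illustrative assumptions.

```python
def grouped_mean(class_limits, frequencies):
    """Mean of a grouped frequency distribution using class marks."""
    # The class mark is the midpoint of the lower and upper class limits.
    marks = [(lower + upper) / 2 for lower, upper in class_limits]
    total_frequency = sum(frequencies)
    weighted_sum = sum(f * x for f, x in zip(frequencies, marks))
    return weighted_sum / total_frequency

age_classes = [(6, 10), (11, 15), (16, 20), (21, 25), (26, 30), (31, 35)]
age_frequencies = [35, 23, 15, 12, 9, 6]

mean_age = grouped_mean(age_classes, age_frequencies)
print(mean_age)
```

The result depends on the assumption that every value in a class sits at the class mark, so it approximates the mean of the raw data rather than reproducing it exactly.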
Special properties of A. M
i) The sum of deviations of a set of items from their mean is always zero, i.e.
$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$
Proof: $\sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0$.
ii. If $\bar{x}_1$ is the mean of $n_1$ observations, $\bar{x}_2$ is the mean of $n_2$ observations, ..., and $\bar{x}_k$
is the mean of $n_k$ observations, then the mean of all the observations in all groups, often called the
combined mean, is given by:-
$\bar{x}_c = \frac{n_1\bar{x}_1 + n_2\bar{x}_2 + \dots + n_k\bar{x}_k}{n_1 + n_2 + \dots + n_k} = \frac{1}{\sum_{i=1}^{k} n_i}\sum_{i=1}^{k} n_i\bar{x}_i$
Example:- In a class there are 30 females and 70 males. If the females averaged 60 in an examination
and the males averaged 72, find the mean of the entire class.
Solution:-
Females: $\bar{x}_1 = 60$, $n_1 = 30$
Males: $\bar{x}_2 = 72$, $n_2 = 70$
$\bar{x}_c = \frac{n_1\bar{x}_1 + n_2\bar{x}_2}{n_1 + n_2} = \frac{30(60) + 70(72)}{30 + 70} = \frac{1800 + 5040}{100} = 68.4$
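The combined-mean formula can be checked with a few lines of code. This is a minimal sketch; `combined_mean` is a name introduced here for illustration.

```python
def combined_mean(sizes, means):
    """Combined mean of several groups from their sizes and group means."""
    total_size = sum(sizes)
    # Each group mean is weighted by the number of observations in the group.
    return sum(n * m for n, m in zip(sizes, means)) / total_size

# 30 females averaging 60 and 70 males averaging 72, as in the example.
class_mean = combined_mean([30, 70], [60, 72])
print(class_mean)
```

Note that the combined mean is not the simple average of 60 and 72 (which is 66), because the two groups differ in size.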
iii. If a wrong figure has been used when calculating the mean, the correct mean can be
obtained without repeating the whole process using:
$\text{Correct Mean} = \text{Wrong Mean} + \frac{\text{Correct Value} - \text{Wrong Value}}{n}$, where n is the total number of
observations.
Example: - The average weight of 10 students was calculated to be 65 kg. Later it was discovered
that one weight was misread as 40 kg instead of 80 kg. Calculate the correct average weight.
$\text{Correct Mean} = \text{Wrong Mean} + \frac{\text{Correct Value} - \text{Wrong Value}}{n} = 65 + \frac{80 - 40}{10} = 69$ kg.
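The correction formula can be verified against a recomputation from totals. In this sketch the function name `corrected_mean` is an assumption, and the totals are derived only from the figures given in the example.

```python
def corrected_mean(wrong_mean, wrong_value, correct_value, n):
    """Adjust a mean after one misrecorded value is replaced by the correct one."""
    return wrong_mean + (correct_value - wrong_value) / n

fixed = corrected_mean(wrong_mean=65, wrong_value=40, correct_value=80, n=10)

# Cross-check: rebuild the total, swap the wrong value, and divide again.
wrong_total = 65 * 10
recomputed = (wrong_total - 40 + 80) / 10

print(fixed, recomputed)
```

Both routes give the same answer because replacing one value changes the total by exactly the difference between the correct and the wrong value.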
iv) Weighted A. M
o When different importance is to be given to different data, a weighted mean is
appropriate.
o Weights are assigned to each item in proportion to its relative importance.
o Let $x_1, x_2, \dots, x_n$ be the values of the items of a series and $w_1, w_2, \dots, w_n$ their corresponding
weights; the weighted mean, denoted by $\bar{x}_w$, is defined as:-
$\bar{x}_w = \frac{1}{\sum_{i=1}^{n} w_i}\sum_{i=1}^{n} w_i x_i$
• Example:- A student obtained the following percentages in an examination: English 60,
biology 75, mathematics 63, physics 59, and chemistry 55. Find the student's weighted
arithmetic mean if weights 1, 2, 1, 3, 3 respectively are allotted to the subjects.
• Solution :-
$\bar{x}_w = \frac{1}{\sum_{i=1}^{n} w_i}\sum_{i=1}^{n} w_i x_i = \frac{60 \times 1 + 75 \times 2 + 63 \times 1 + 59 \times 3 + 55 \times 3}{1 + 2 + 1 + 3 + 3} = \frac{615}{10} = 61.5$
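A short sketch of the weighted mean, using the marks and weights from the example. The function name `weighted_mean` and the variable names are illustrative assumptions.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum of w_i * x_i divided by sum of w_i."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

# English, biology, mathematics, physics, chemistry.
marks = [60, 75, 63, 59, 55]
subject_weights = [1, 2, 1, 3, 3]

student_mean = weighted_mean(marks, subject_weights)
print(student_mean)
```

With equal weights the function reduces to the ordinary arithmetic mean, which here would be 62.4 instead of 61.5.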
Merit and Demerits of A. M
►Advantages:-
■ It is strictly defined.
■ It does not require the observations to be arranged in order.
■ It is based on all observations.
■ It is suitable for further algebraic treatment.
■ It is a stable average, i.e. it is not much affected by fluctuations of sampling.
■ It is easy to calculate and simple to understand.
► Demerits:-
■ It is much affected by extreme observations.
■ It can be a number which does not occur in the series.
■ It can’t be calculated for a frequency distribution with open-ended classes.
Geometric mean (G.M)
- G.M is defined as the nth root of the product of n items or values of a series.
- If there are two items, we take the square root; if there are three items, the cube root; and so on.
- Symbolically, let $x_1, x_2, x_3, \dots, x_n$ be the n values of a variable x; then their G.M is defined as:
$G.M = \sqrt[n]{x_1 \cdot x_2 \cdot x_3 \cdots x_n}$ for raw data
$G.M = \left(x_1^{f_1} \cdot x_2^{f_2} \cdot x_3^{f_3} \cdots x_n^{f_n}\right)^{1/N}$ for a frequency distribution, where $N = \sum_{i=1}^{n} f_i$
- If the number of observations is three or more, the computation of the nth root is very
tedious. To simplify the computation, logarithms are used:
$\log G.M = \frac{1}{n}\sum_{i=1}^{n} \log x_i$
$\text{Antilog}(\log G.M) = \text{Antilog}\left[\frac{1}{n}\sum_{i=1}^{n} \log x_i\right]$
$G.M = \text{Antilog}\left[\frac{1}{n}\sum_{i=1}^{n} \log x_i\right]$ for raw data
$G.M = \text{Antilog}\left[\frac{1}{N}\sum_{i=1}^{n} f_i \log x_i\right]$ for a frequency distribution, where $N = \sum_{i=1}^{n} f_i$
Example: - Find the geometric mean of 3, 9, 27.
Solution: - $G.M = \sqrt[n]{x_1 \cdot x_2 \cdot x_3 \cdots x_n} = \sqrt[3]{3 \times 9 \times 27} = \sqrt[3]{729} = 9$
Note: - The geometric mean is useful and appropriate for finding averages of ratios or growth
rates.
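The logarithm route to the geometric mean can be sketched as below. This is an illustration, not the notes' own procedure: it uses natural logarithms and `math.exp` as the antilog (the base does not affect the result), and the function name `geometric_mean` is an assumption.

```python
import math

def geometric_mean(values):
    """Geometric mean via logs: antilog of the mean of the logarithms."""
    if any(v <= 0 for v in values):
        # log is undefined for zero and negative values.
        raise ValueError("geometric mean requires strictly positive values")
    mean_log = sum(math.log(v) for v in values) / len(values)
    return math.exp(mean_log)

gm = geometric_mean([3, 9, 27])
print(round(gm, 6))
```

Working with logs avoids forming the full product, which can become very large for long series. The guard clause reflects the demerits listed for the geometric mean: it cannot be computed when any value is zero or negative.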
Merit and Demerits of Geometric mean (G.M)
►Advantages:-
i) It is least affected by extreme value
ii) It is based on all observation
iii) It is suitable for further algebraic treatment
► Demerits:-
i) Its calculation is somewhat complicated
ii) It can’t be calculated if any of the values is 0.
iii) If one or more of the values are negative, either the geometric mean can’t be
calculated or an absurd value will be obtained.
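Harmonic Mean (H.M)
- H.M is the reciprocal of the arithmetic mean of the reciprocals of the values. For raw data:
$H.M = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$
And in the case of a frequency distribution:
$H.M = \frac{n}{\sum_{i=1}^{k} \frac{f_i}{x_i}}$, where $n = \sum_{i=1}^{k} f_i$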
- If $x_1, x_2, x_3, \dots, x_n$ are the values of the items of a series and $w_1, w_2, \dots, w_n$ their corresponding weights,
the weighted harmonic mean, denoted by $H.M_w$, is:
$H.M_w = \frac{\sum_{i=1}^{n} w_i}{\sum_{i=1}^{n} \frac{w_i}{x_i}}$
Example:- Find the harmonic mean of the following data: 20, 30.
Solution:- $H.M = \frac{2}{\frac{1}{20} + \frac{1}{30}} = 24$
Note:- The Harmonic Mean is useful and appropriate in finding average speeds and average
rates.
Merit and Demerits of H. M
►Advantages:-
i) It is based on all observation
ii) It is a good mean for a highly variable series
iii) It gives more weight to small values & less weight to large values.
► Demerits:-
i) Its calculation is complicated
ii) If any value is 0, it can’t be calculated
iii) Its value is generally not a member of the series.
E.g. A driver covers a distance of 300 km at an average speed of 60 km/hr and makes the return trip at
an average speed of 50 km/hr. What is his average speed for the total distance?
$H.M = \frac{2}{1/60 + 1/50} = \frac{6000}{110} = 54.55$ km/hr.
Note that $A.M = \frac{60 + 50}{2} = 55$ km/hr.
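The reason the harmonic mean is the right average for speeds over equal distances can be seen by comparing it with total distance divided by total time. A minimal sketch, with the function name `harmonic_mean` introduced here as an assumption:

```python
def harmonic_mean(values):
    """Harmonic mean: n divided by the sum of reciprocals."""
    if any(v <= 0 for v in values):
        raise ValueError("harmonic mean requires strictly positive values")
    return len(values) / sum(1 / v for v in values)

# Average speed over two 300 km legs driven at 60 km/hr and 50 km/hr.
hm_speed = harmonic_mean([60, 50])

# Direct check: total distance divided by total travel time.
total_distance = 300 + 300
total_time = 300 / 60 + 300 / 50
direct_speed = total_distance / total_time

print(round(hm_speed, 2), round(direct_speed, 2))
```

The two figures agree (about 54.55 km/hr), whereas the arithmetic mean of 55 km/hr overstates the speed because more time is spent on the slower leg.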
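E.g. Let a and b be any two positive values. Show that their arithmetic mean is greater
than or equal to their geometric mean, and that their geometric mean is greater than or equal to
their harmonic mean.
$X = \frac{a+b}{2}$, $G = \sqrt{ab}$, $H = \frac{2}{1/a + 1/b}$; we want to show that $X \ge G \ge H$.
To show $X \ge G$:
$(\sqrt{a} - \sqrt{b})^2 \ge 0 \Rightarrow a - 2\sqrt{ab} + b \ge 0 \Rightarrow a + b \ge 2\sqrt{ab} \Rightarrow \frac{a+b}{2} \ge \sqrt{ab}$, i.e. $X \ge G$.
To show $G \ge H$:
$a - 2\sqrt{ab} + b \ge 0 \Rightarrow a + b \ge 2\sqrt{ab} \Rightarrow (a+b)\sqrt{ab} \ge 2\sqrt{ab}\sqrt{ab} = 2ab \Rightarrow \sqrt{ab} \ge \frac{2ab}{a+b} = \frac{2}{1/a + 1/b}$, i.e. $G \ge H$.
Therefore $X \ge G \ge H$.
Similarly prove:
$\sqrt{A.M \times H.M} = G.M$, where A.M and H.M are the usual abbreviations.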
$\text{Mode} = L_{mo} + \left[\frac{d_1}{d_1 + d_2}\right] W$
where $L_{mo}$ = LCB of the modal class,
$d_1 = f_m - f_{pm}$,
$d_2 = f_m - f_{sm}$,
$f_m$ = frequency of the modal class,
$f_{pm}$ = frequency of the class preceding the modal class,
$f_{sm}$ = frequency of the class succeeding the modal class,
W = class width.
Example:- The following is the distribution of the size of certain farms selected at random from a
district.
Class (size of farms)   5-15  15-25  25-35  35-45  45-55  55-65  65-75
F                          8     12     17     29     31      5      3
Find the mode of the distribution.
Solution:- Modal class = 45-55
L = 45, fm = 31, fpm = 29, fsm = 5, W = 10
Then $\text{Mode} = 45 + \left[\dfrac{31 - 29}{(31 - 29) + (31 - 5)}\right] \times 10 = 45.71$
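As a quick check of the grouped-mode formula, the following sketch recomputes the farm-size example. The function name `grouped_mode` and its argument layout are assumptions made for illustration, and it assumes equal class widths and a single modal class.

```python
def grouped_mode(lower_bounds, freqs, width):
    """Mode of grouped data: L + d1 / (d1 + d2) * W (equal class widths assumed)."""
    m = max(range(len(freqs)), key=lambda i: freqs[i])    # index of the modal class
    f_m = freqs[m]
    f_prev = freqs[m - 1] if m > 0 else 0                 # class preceding the modal class
    f_next = freqs[m + 1] if m < len(freqs) - 1 else 0    # class succeeding the modal class
    d1, d2 = f_m - f_prev, f_m - f_next
    return lower_bounds[m] + d1 / (d1 + d2) * width

lowers = [5, 15, 25, 35, 45, 55, 65]
freqs = [8, 12, 17, 29, 31, 5, 3]
mode_value = grouped_mode(lowers, freqs, 10)
print(round(mode_value, 2))
```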
Merit and Demerit of Mode
Merit
- It is not affected by extreme observations.
- Easy to calculate and understand.
Demerit
- It is not rigidly defined.
- It is not based on all observations.
- It is not suitable for further mathematical treatment.
- It is not a stable average, i.e. it is affected by fluctuations of sampling to some extent.
- Often its value is not unique.
Note: being the point of maximum density, mode is especially useful in
finding the most popular size in studies relating to marketing, trade, business,
and industry. It is the appropriate average to be used to find the ideal size.
The Median ($\tilde{x}$)
- In a distribution, the median is the value of the variable which divides it into two equal parts.
- In an ordered series of data, the median is the observation lying exactly in the
middle of the series. It is the middlemost value in the sense that the number
of values less than the median is equal to the number of values greater than it.
- If X1, X2, …, Xn are the observations, then the numbers arranged in ascending
order will be X[1], X[2], …, X[n], where X[i] is the ith smallest value.
⇒ X[1] ≤ X[2] ≤ … ≤ X[n]
$\tilde{X} = \begin{cases} X_{[(n+1)/2]}, & \text{if } n \text{ is odd} \\ \dfrac{1}{2}\left\{X_{[n/2]} + X_{[(n/2)+1]}\right\}, & \text{if } n \text{ is even} \end{cases}$
Remark:
The median class is the class with the smallest cumulative frequency (less than
type) greater than or equal to $\dfrac{n}{2}$.
Example: Find the median of the following distribution.
Class Frequency
40-44 7
45-49 10
50-54 22
55-59 15
60-64 12
65-69 6
70-74 3
Solutions:
First find the less than cumulative frequency.
Identify the median class.
Find median using formula.
40-44 7 7
45-49 10 17
50-54 22 39
55-59 15 54
60-64 12 66
65-69 6 72
70-74 3 75
$\dfrac{n}{2} = \dfrac{75}{2} = 37.5$
39 is the first cumulative frequency to be greater than or equal to 37.5
⇒ 50-54 is the median class.
Lm = 49.5, w = 5
n = 75, Fpm = 17, fm = 22
$\Rightarrow \tilde{X} = L_m + \dfrac{w}{f_m}\left(\dfrac{n}{2} - F_{pm}\right) = 49.5 + \dfrac{5}{22}(37.5 - 17) = 54.16$
Exe:- Find the median of the following distribution
Class   50-60  60-70  70-80  80-90  90-100  100-110
Fi         20     21     50     40      53      16
<cfi       20     41     91    131     184     200
Demerits:
It is not a good representative of data if the number of items is small.
It is not amenable to further algebraic treatment.
It is susceptible to sampling fluctuations.
Quantiles
When a distribution is arranged in order of magnitude of items, the median is the
value of the middle term. Other measures that likewise depend on position in the
distribution, namely quartiles, deciles and percentiles, are collectively called quantiles.
►Quartiles
- Are the three values which divide the given data into four equal parts. They are denoted by
Q1, Q2 and Q3.
Q1 = lower or first quartile; it covers 25% of the distribution
Q2 = the middle or second quartile; it covers 50% of the distribution
Q3 = the upper or third quartile; it covers 75% of the distribution.
For raw (ungrouped) data, first arrange the observations in increasing order of magnitude. Then
the ith quartile is given by
Qi = [i(n+1)/4]th value, i = 1, 2, 3
In dividing i(n+1) by 4 there may be a remainder. Let q be the quotient and r the remainder of the
division. Then,
Qi = qth value + (r/4)[(q+1)th value − qth value]
E.g. The following are yields of barley (kg/plot) from 14 plots:
30, 32, 35, 38, 40, 42, 48, 49, 52, 55, 58, 60, 62 and 65. Find the first and third quartiles.
Solution: For Q1, 1(14+1)/4 = 15/4, so q = 3 and r = 3.
Q1 = 3rd value + 3/4 (4th value − 3rd value)
= 35 + 3/4 (38 − 35) = 37.25
For Q3, 3(15)/4 = 45/4, so q = 11 and r = 1.
Q3 = 11th value + 1/4 (12th value − 11th value) = 58 + 1/4 (2) = 58.5 (upper quartile)
Q2 is the median: 2(15)/4 = 30/4, so q = 7 and r = 2, giving Q2 = 48 + 2/4 (49 − 48) = 48.5.
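The quotient-and-remainder rule for raw data can be sketched in a few lines. This is an illustrative helper, not a library routine: the name `quartile` is an assumption, and software packages often use slightly different interpolation conventions, so results may differ from built-in quantile functions.

```python
def quartile(values, i):
    """i-th quartile (i = 1, 2, 3) using the i(n+1)/4 position rule."""
    xs = sorted(values)
    n = len(xs)
    q, r = divmod(i * (n + 1), 4)   # quotient q and remainder r of i(n+1)/4
    if q < 1:                       # position falls before the first value
        return xs[0]
    if q >= n:                      # position falls at or beyond the last value
        return xs[-1]
    # q-th value plus r/4 of the gap to the (q+1)-th value (1-based positions)
    return xs[q - 1] + r / 4 * (xs[q] - xs[q - 1])

yields = [30, 32, 35, 38, 40, 42, 48, 49, 52, 55, 58, 60, 62, 65]
q1 = quartile(yields, 1)
q3 = quartile(yields, 3)
print(q1, q3)
```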
E.g. For the data given below, compute the quartiles, D3, D7, P15 and P88, and interpret them.
Marks   Below 10  10-20  20-40  40-60  60-80  Above 80
f           10      15     25     30     14       6
<cfi        10      25     50     80     94     100
Solution:-
Q1: size of the (N/4)th item = 25th item. The quartile class is the first class whose less-than
cumulative frequency is ≥ iN/4, so 10-20 is the quartile class.
L = 10, w = 10, fq1 = 15, Fpq1 = 10.
$Q_1 = L_{q1} + \dfrac{w}{f_{q1}}\left(\dfrac{N}{4} - F_{pq1}\right) = 10 + \dfrac{10}{15}(25 - 10) = 20$
The marks of 25% of the students are below 20.
Q2: size of the (2N/4)th item = 50th item, so 20-40 is the quartile class.
L = 20, w = 20, fq2 = 25, Fpq2 = 25
$Q_2 = 20 + \dfrac{20}{25}(50 - 25) = 40$
The marks of half of the students are below 40.
D3: size of the (3N/10)th item = 30th item, so 20-40 is the decile class.
L = 20, w = 20, fd3 = 25, Fpd3 = 25
$D_3 = 20 + \dfrac{20}{25}(30 - 25) = 24$
The marks of 30% of the students are below 24.
D7: size of the (7N/10)th item = 70th item, so 40-60 is the decile class.
L = 40, w = 20, fd7 = 30, Fpd7 = 50
$D_7 = 40 + \dfrac{20}{30}(70 - 50) = 53.33$
The marks of 70% of the students are below 53.33.
P15: size of the (15N/100)th item = 15th item, so 10-20 is the percentile class.
L = 10, w = 10, fp15 = 15, Fpp15 = 10
$P_{15} = 10 + \dfrac{10}{15}(15 - 10) = 13.33$
The marks of 15% of the students are below 13.33.
P88: size of the (88N/100)th item = 88th item, so 60-80 is the percentile class.
L = 60, w = 20, fp88 = 14, Fpp88 = 80
$P_{88} = 60 + \dfrac{20}{14}(88 - 80) = 71.43$
The marks of 88% of the students are below 71.43.
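All of the grouped quantiles above follow the same pattern: locate the class containing the required item, then interpolate within it. The sketch below illustrates that pattern. The function name `grouped_quantile` is hypothetical, and two assumptions are made about the open-ended classes: "Below 10" is treated as 0-10 and the last class is given a width of 20. Neither assumption affects the quantiles computed here.

```python
def grouped_quantile(lowers, widths, freqs, k, parts):
    """k-th of `parts` quantiles for grouped data: L + (w / f) * (kN/parts - Fp)."""
    target = k * sum(freqs) / parts      # position of the required item
    cum = 0                              # cumulative frequency before the current class
    for low, w, f in zip(lowers, widths, freqs):
        if cum + f >= target:            # first class whose <cf reaches the target
            return low + w / f * (target - cum)
        cum += f
    raise ValueError("k must not exceed parts")

lowers = [0, 10, 20, 40, 60, 80]         # 0 is an assumed lower bound for "Below 10"
widths = [10, 10, 20, 20, 20, 20]        # last width is assumed (open-ended class)
freqs = [10, 15, 25, 30, 14, 6]

results = {
    "Q1": grouped_quantile(lowers, widths, freqs, 1, 4),
    "Q2": grouped_quantile(lowers, widths, freqs, 2, 4),
    "D3": grouped_quantile(lowers, widths, freqs, 3, 10),
    "D7": grouped_quantile(lowers, widths, freqs, 7, 10),
    "P15": grouped_quantile(lowers, widths, freqs, 15, 100),
    "P88": grouped_quantile(lowers, widths, freqs, 88, 100),
}
for name, value in results.items():
    print(name, round(value, 2))
```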
Exe: Considering the following distribution
Calculate:
a) All quartiles.
b) The 7th decile.
CHAPTER – FOUR
The degree to which numerical data tend to spread about an average is called the
dispersion or variation of the data.
In general, the greater the spread from the average, the greater the variability.
Objectives of Measuring variation or Dispersion
o To judge the reliability of a measure of central tendency,
o To compare two or more groups of numbers in terms of their variability,
o To serve as a basis for further statistical analysis, and
o To describe how the measurements vary about the center of the distribution.
Measures of variation can be either Absolute or Relative Measures.
Absolute Measures of Dispersion
The measures of dispersion which are expressed in terms of the original unit of a series are
termed as absolute measures. Such measures are not suitable for comparing the variability of two
distributions which are expressed in different units of measurement and different average size.
Relative Measures of Dispersion
Relative measures of dispersions are a ratio or percentage of a measure of absolute dispersion to
an appropriate measure of central tendency and are thus pure numbers independent of the units
of measurement. For comparing the variability of two distributions (even if they are measured in
the same unit), we compute the relative measure of dispersion instead of absolute measures of
dispersion.
Types of Measure of Dispersion
There are various measures of dispersion, of which the most commonly used are:
1. Range (R) and Relative Range (RR)
2. Mean Deviation (M.D) and Coefficient of Mean Deviation (C.M.D)
3. Variance (s²), Standard Deviation (s) and Coefficient of Variation (CV).
1. Range (R)
The range is the largest score minus the smallest score. It is a quick and dirty measure of
variability, although when a test is given back to students they very often wish to know the range
of scores. Because the range is greatly affected by extreme scores, it may give a distorted picture
of the scores.
The following two distributions have the same range, 13, yet appear to differ greatly in the
amount of variability.
Distribution 1: 32 35 36 36 37 38 40 42 42 43 43 45
Distribution 2: 32 32 33 33 33 34 34 34 34 34 35 45
For this reason, among others, the range is not the most important measure of variability.
$M.D(\bar{x}) = \dfrac{\sum_{i=1}^{n} |X_i - \bar{x}|}{n}$
o For the case of Frequency distribution it is given as:
$M.D(\bar{x}) = \dfrac{\sum_{i=1}^{k} f_i\,|X_i - \bar{x}|}{n}$, where $n = \sum f_i$
B) Mean Deviation about the Median ($\tilde{x}$)
$M.D(\tilde{x}) = \dfrac{\sum_{i=1}^{n} |X_i - \tilde{x}|}{n}$
o For the case of Frequency distribution it is given as:
$M.D(\tilde{x}) = \dfrac{\sum_{i=1}^{k} f_i\,|X_i - \tilde{x}|}{n}$
C) Mean Deviation about the Mode ($\hat{x}$)
$M.D(\hat{x}) = \dfrac{\sum_{i=1}^{n} |X_i - \hat{x}|}{n}$
o For the case of Frequency Distribution it is given as:
$M.D(\hat{x}) = \dfrac{\sum_{i=1}^{k} f_i\,|X_i - \hat{x}|}{n}$
Examples:
1. The following are the numbers of visits made by ten mothers to the local doctor's
surgery:
8, 6, 5, 5, 7, 4, 5, 9, 7, 4. Find the mean deviation about the mean, median and mode.
Solutions: First calculate the three averages: $\bar{x} = 6$, $\tilde{x} = 5.5$, $\hat{x} = 5$
Then take the deviations of each observation from these averages.
Xi           4    4    5    5    5    6    7    7    8    9    total
|Xi − 6|     2    2    1    1    1    0    1    1    2    3    14
|Xi − 5.5|   1.5  1.5  0.5  0.5  0.5  0.5  1.5  1.5  2.5  3.5  14
|Xi − 5|     1    1    0    0    0    1    2    2    3    4    14
$M.D(\bar{x}) = \dfrac{\sum_{i=1}^{10} |X_i - 6|}{10} = 14/10 = 1.4$
$M.D(\tilde{x}) = \dfrac{\sum_{i=1}^{10} |X_i - 5.5|}{10} = 14/10 = 1.4$
$M.D(\hat{x}) = \dfrac{\sum_{i=1}^{10} |X_i - 5|}{10} = 14/10 = 1.4$
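The three mean deviations for the ten visit counts can be reproduced with a short sketch. It uses Python's standard `statistics` module for the averages. The helper name `mean_deviation` is introduced here only for illustration.

```python
from statistics import mean, median, mode

visits = [8, 6, 5, 5, 7, 4, 5, 9, 7, 4]

def mean_deviation(values, centre):
    """Average absolute deviation of the values from the chosen centre."""
    return sum(abs(x - centre) for x in values) / len(values)

md_mean = mean_deviation(visits, mean(visits))
md_median = mean_deviation(visits, median(visits))
md_mode = mean_deviation(visits, mode(visits))
print(md_mean, md_median, md_mode)
```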
2. Find mean deviation about mean, median and mode for the following distribution. (Exercise)
class Frequency
40-44 7
45-49 10
50-54 22
55-59 15
60-64 12
65-69 6
70-74 3
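Examples:
►Calculate the C.M.D about the mean, median and mode for the data in example 1 above.
$C.M.D(\bar{x}) = \dfrac{M.D(\bar{x})}{\bar{x}} = 1.4/6 = 0.233$, $C.M.D(\tilde{x}) = \dfrac{M.D(\tilde{x})}{\tilde{x}} = 1.4/5.5 = 0.255$ and
$C.M.D(\hat{x}) = \dfrac{M.D(\hat{x})}{\hat{x}} = 1.4/5 = 0.28$.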
3. Variance and Standard Deviation
Variance
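Is the "average squared deviation from the mean".
Population variance: $\sigma^2 = \dfrac{1}{N}\sum (X_i - \mu)^2$, i = 1, 2, 3, ..., N
For the case of frequency distribution it is expressed as: $\sigma^2 = \dfrac{1}{N}\sum f_i (X_i - \mu)^2$, i = 1, 2, ..., k
Sample variance (s²): $s^2 = \dfrac{1}{n-1}\sum (X_i - \bar{x})^2$, i = 1, 2, 3, ..., n
For the case of frequency distribution it is expressed as:
$s^2 = \dfrac{1}{n-1}\sum f_i (X_i - \bar{x})^2$, i = 1, 2, 3, ..., k
Short-cut formula:
$s^2 = \dfrac{1}{n-1}\left(\sum X_i^2 - n\bar{x}^2\right)$ for raw data, $s^2 = \dfrac{1}{n-1}\left(\sum f_i X_i^2 - n\bar{x}^2\right)$ for a frequency distribution.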
Standard Deviation
There is a problem with variances.
Recall that the deviations were squared. That means the units were also squared.
To get the units back the same as the original data values, the square root must be taken.
$\sigma = \sqrt{\sigma^2}$ and $s = \sqrt{s^2}$
Examples: Find the variances and standard deviations of the following sample data.
1. 5, 17, 12, 10.
2. The data given in the form of a frequency distribution below.
Solutions: 1. $\bar{x} = 11$
xi            5    10   12   17   total
(Xi − x̄)²    36    1    1   36    74
$s^2 = \dfrac{1}{n-1}\sum (X_i - \bar{x})^2 = 74/3 = 24.67$, $s = \sqrt{s^2} = \sqrt{24.67} = 4.97$
2.
class frequency
40-44 7
45-49 10
50-54 22
55-59 15
60-64 12
65-69 6
70-74 3
$\bar{x} = 55$
Xi (C.M)       42    47    52    57    62    67    72    total
fi(Xi − x̄)²   1183  640   198   60    588   864   867   4400
$s^2 = \dfrac{1}{n-1}\sum f_i (X_i - \bar{x})^2 = 4400/74 = 59.46$, $S = \sqrt{59.46} = 7.71$
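Both variance calculations can be verified with a few lines of code. The sketch below uses `statistics.variance` (which divides by n − 1) for the raw data and applies the frequency-distribution formula directly to the class marks. The variable names are illustrative.

```python
from math import sqrt
from statistics import variance

# Raw sample data
raw = [5, 17, 12, 10]
s2_raw = variance(raw)            # sample variance, divisor n - 1
s_raw = sqrt(s2_raw)

# Grouped data: class marks and frequencies from the table above
marks = [42, 47, 52, 57, 62, 67, 72]
freqs = [7, 10, 22, 15, 12, 6, 3]
n = sum(freqs)
xbar = sum(f * x for f, x in zip(freqs, marks)) / n
s2_grouped = sum(f * (x - xbar) ** 2 for f, x in zip(freqs, marks)) / (n - 1)
s_grouped = sqrt(s2_grouped)

print(round(s2_raw, 2), round(s_raw, 2))
print(xbar, round(s2_grouped, 2), round(s_grouped, 2))
```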
City-1   25   24   23   26   17
City-2   22   21   24   22   20
City-3   32   27   35   24   28
Then, which city do you think has the most consistent temperature, based on these data?
Examples:
1. Two sections were given introduction to statistics examinations. The following information
was given.
Value Section 1 Section 2
Mean 78 90
Sd 6 5
Student A from section 1 scored 90 and student B from section 2 scored 95. Relatively speaking,
who performed better?
Solutions: Calculate the standard score of both students.
$Z_A = \dfrac{X_A - \bar{X}_1}{S_1} = \dfrac{90 - 78}{6} = 2$
$Z_B = \dfrac{X_B - \bar{X}_2}{S_2} = \dfrac{95 - 90}{5} = 1$
Student A performed better relative to his section because the score of student A is two
standard deviations above the mean score of his section, while the score of student B is only one
standard deviation above the mean score of his section.
2. Two groups of people were trained to perform a certain task and tested to find out which
group is faster to learn the task. For the two groups the following information was given:
b) Suppose person A from group one took 9.2 minutes while person B
from group two took 9.3 minutes. Who was faster in performing the task? Why?
Solutions:
a) Use the coefficient of variation.
$C.V_1 = \dfrac{S_1}{\bar{X}_1} \times 100 = \dfrac{1.2}{10.4} \times 100 = 11.54\%$
$C.V_2 = \dfrac{S_2}{\bar{X}_2} \times 100 = \dfrac{1.3}{11.9} \times 100 = 10.92\%$
Since C.V2 < C.V1, group 2 is more consistent.
b) Calculate the standard scores of A and B.
$Z_A = \dfrac{X_A - \bar{X}_1}{S_1} = \dfrac{9.2 - 10.4}{1.2} = -1$
$Z_B = \dfrac{X_B - \bar{X}_2}{S_2} = \dfrac{9.3 - 11.9}{1.3} = -2$
Person B is faster relative to his group because the time taken by person B is two standard deviations shorter than the
average time taken by group 2, while the time taken by person A is only one standard deviation
shorter than the average time taken by group 1.
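The coefficient of variation and the standard score are simple enough to wrap in two small helpers. The sketch below recomputes the figures of this example, taking the group means (10.4 and 11.9 minutes) and standard deviations (1.2 and 1.3 minutes) from the solution above. The function names are assumptions made for illustration.

```python
def coefficient_of_variation(sd, mean):
    """Relative dispersion as a percentage of the mean."""
    return sd / mean * 100

def z_score(x, mean, sd):
    """Number of standard deviations by which x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

cv1 = coefficient_of_variation(1.2, 10.4)
cv2 = coefficient_of_variation(1.3, 11.9)
z_a = z_score(9.2, 10.4, 1.2)
z_b = z_score(9.3, 11.9, 1.3)

print(round(cv1, 2), round(cv2, 2))
print(round(z_a, 2), round(z_b, 2))
```

A smaller C.V. indicates the more consistent group. For completion times, a more negative standard score indicates the relatively faster person.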
3. Compare the performance of the following two students
Candidate Marks in economics Marks in Acct. Total
A 84 75 159
B 74 85 159
The average mark for economics is 60 with a standard deviation of 13, and for accounting it is 50
with a standard deviation of 11. Whose performance is better, A or B?
Z scores for A:
Economics: $\dfrac{84 - 60}{13} = 1.846$
Accounting: $\dfrac{75 - 50}{11} = 2.273$
Total Z score for A = 1.846 + 2.273 = 4.119
Z scores for B:
Economics: $\dfrac{74 - 60}{13} = 1.077$
Accounting: $\dfrac{85 - 50}{11} = 3.182$
Total Z score for B = 1.077 + 3.182 = 4.259
Since B's total Z score is higher, B's performance is better than A's.
Moments
If X is a variable that assumes the values X1, X2, ..., Xn, then
the rth moment is defined as: $\bar{x^r} = \dfrac{x_1^r + x_2^r + x_3^r + \dots + x_n^r}{n} = \dfrac{1}{n}\sum_{i=1}^{n} x_i^r$
For the case of frequency distribution this is expressed as: $\bar{x^r} = \dfrac{1}{n}\sum_{i=1}^{k} f_i x_i^r$
If r = 1, it is the simple arithmetic mean; this is called the 1st moment.
The rth moment about the mean (the rth central moment) is denoted by µr
and defined as:
$\mu_r = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^r}{n} = \dfrac{n-1}{n} \cdot \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^r}{n-1}$
For the case of frequency distribution this is expressed as:
$\mu_r = \dfrac{\sum_{i=1}^{k} f_i (x_i - \bar{x})^r}{n}$
If r = 2, it is the population variance; this is called the second central
moment.
If we assume n − 1 ≈ n, it is also the sample variance.
Examples: 1) Find the first two moments for the following set of numbers: 2, 3, 7
2) Find the first three central moments of the numbers in problem 1.
Solutions: 1) Use the rth moment formula.
$\bar{x^r} = \dfrac{1}{n}\sum_{i=1}^{n} x_i^r$: $\bar{x^1} = (2+3+7)/3 = 4$, $\bar{x^2} = (2^2+3^2+7^2)/3 = 20.67$
2) Use the rth central moment formula.
$\mu_1 = \dfrac{(2-4) + (3-4) + (7-4)}{3} = 0$, µ2 = ?, µ3 = ?
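Raw and central moments differ only in whether the mean is subtracted before raising to the power r. The sketch below shows both for the numbers 2, 3, 7. The function names are illustrative. The same `central_moment` helper can be called with r = 2 or r = 3 to check answers to the remaining parts of the example.

```python
def raw_moment(values, r):
    """r-th moment about the origin: (1/n) * sum of x_i ** r."""
    return sum(x ** r for x in values) / len(values)

def central_moment(values, r):
    """r-th moment about the mean: (1/n) * sum of (x_i - mean) ** r."""
    m = raw_moment(values, 1)
    return sum((x - m) ** r for x in values) / len(values)

data = [2, 3, 7]
first = raw_moment(data, 1)
second = raw_moment(data, 2)
mu1 = central_moment(data, 1)
print(first, round(second, 2), mu1)
```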
Measure of Shapes
Skewness
Skewness is concerned with the shape of the curve, not its size.
Skewness is the degree of asymmetry, or departure from symmetry, of a distribution.
3. Some characteristics of the annual family income distribution (in birr) in two regions are
as follows:
- Kurtosis
Kurtosis is the degree of peakedness of a distribution, usually taken relative to a normal
distribution.
A distribution having a relatively high peak → Leptokurtic
A distribution whose curve is flat-topped → Platykurtic
The normal distribution, which is neither very high peaked nor flat-topped → Mesokurtic
Measure of Kurtosis
The moment coefficient of kurtosis, denoted by α4:
$\alpha_4 = \dfrac{\mu_4}{\mu_2^2} = \dfrac{\mu_4}{\sigma^4}$
Where:- µ4 is the 4th moment about the mean,
µ2 is the 2nd moment about the mean,
σ is the population standard deviation.
The peakedness of a distribution depends on the value of α4:
If α4 > 3 then the curve is leptokurtic.
If α4 = 3 then the curve is mesokurtic.
If α4 < 3 then the curve is platykurtic.
Exercise
Compute a measure of kurtosis and give your interpretation
Value (xi)      3   4   5    6    7    8    9    10
Frequency (f)   4   6   10   26   24   15   10   5
CHAPTER FIVE
ELEMENTARY PROBABILITY
Learning Objectives
At the end of this chapter, the student will be able to
Understand the concepts and characteristics of probabilities
Determine sample spaces and total number of outcomes in a sequence of events, using
the fundamental counting rule.
Compute probabilities of events and conditional probabilities
Introduction
Probability is one of those elusive concepts that virtually everyone knows but which is
nearly impossible to define entirely adequately.
Probability theory is the foundation upon which the logic of inference is built. It helps us
to cope with uncertainty.
In general, probability is the chance of an outcome of an experiment.
It is the measure of how likely an outcome is to occur.
Definitions of Basic Probability Terms
Experiment: any process which generates well defined results or outcomes.
Random Experiment: It is an experiment that can be repeated any number of times
under similar conditions and it is possible to enumerate the total number of outcomes
without predicting an individual outcome.
Example: If a fair die is rolled once it is possible to list all the possible outcomes i.e.1, 2, 3, 4,
5, 6 but it is not possible to predict which outcome will occur.
Outcome: The result of a single trial of an experiment.
Sample space (S): The set of all possible outcome of an experiment.
Event: Any subset of sample space.
Remark: If S (sample space) has n members then there are exactly $2^n$ subsets or
events.
Equally Likely Events: events which have the same chance of occurrence.
Complement of an event: the complement of an event A means non-occurrence of A.
It is denoted by $A'$, $A^c$ or $\bar{A}$, and contains those points of the sample space which
do not belong to A.
Elementary event: an event having only a single element or sample point.
Mutually Exclusive (ME) Events: two events that cannot occur simultaneously (which
cannot happen at the same time), i.e. they have no intersection.
For example, if we roll a fair die, then the experiment is rolling the die and the sample
space is S = {1, 2, 3, 4, 5, 6}.
Suppose we are interested in the events E1, getting an even number, and E2, getting an odd
number.
E1 = {2, 4, 6}, E2 = {1, 3, 5}. Clearly E1 ∩ E2 = Φ.
Thus E1 and E2 are mutually exclusive events.
Independent Event: two events are independent if the occurrence of one event does not
affect the occurrence or non-occurrence of the other event. Otherwise, they are dependent
events.
Addition rule
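Suppose that a procedure designated by 1 can be performed in n1 ways. Assume that a second
procedure designated by 2 can be performed in n2 ways. Suppose furthermore that it is not
possible for both procedures 1 and 2 to be performed together. Then the number of ways in which we can
perform 1 or 2 is n1 + n2. In general, if there are k procedures and the ith procedure can be performed
in ni ways, the number of ways in which we can perform procedure 1 or 2 or ... or k is
$\sum_{i=1}^{k} n_i$, assuming that no two procedures are performed together.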
Example Suppose that we are planning a trip and are deciding between bus and train
transportation. If there are 3 bus routes and 2 train routes to go from A to B, find the available
routes for the trip.
There are 3+2=5 possible routes for someone to go from A to B.
Multiplication Rule
Suppose that procedure 1 can be performed in n1 ways. Let us assume procedure 2 can be
performed in n2 ways. Suppose also that each way of doing procedure 1 may be followed
by any way of doing procedure 2. Then the procedure consisting of 1 followed by 2 may be
performed in n1 * n2 ways.
Example: There are four blood types, A, B, AB, and O. Blood can also be Rh+ and Rh-.
Finally, a blood donor can be classified as either male or female. How many different ways
can a donor have his or her blood labeled?
Solution
Since there are 4 possibilities for blood type, 2 possibilities for Rh factor, and 2
possibilities for the gender of the donor, there are 4 *2 *2= 16, different classification
categories.
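The multiplication rule can be confirmed by listing every combination explicitly. The sketch below enumerates the blood-label categories with `itertools.product` from the standard library. The label strings are illustrative.

```python
from itertools import product

blood_types = ["A", "B", "AB", "O"]
rh_factors = ["Rh+", "Rh-"]
genders = ["male", "female"]

# Every (type, Rh, gender) triple is one classification category.
labels = list(product(blood_types, rh_factors, genders))
count_by_rule = len(blood_types) * len(rh_factors) * len(genders)

print(len(labels), count_by_rule)
```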
Exercise: The digits 0, 1, 2, 3, and 4 are to be used in 4 digit identification card. How many
different cards are possible if
a) Repetitions are permitted.
b) Repetitions are not permitted
Permutation: An arrangement of n objects in a specified order is called permutation of the objects.
Permutation Rules:
1. The number of permutations of n distinct objects taken all together is n!,
where $n! = n \times (n-1) \times (n-2) \times \dots \times 3 \times 2 \times 1$
2. The arrangement of n objects in a specified order using r objects at a time is called the
permutation of n objects taken r objects at a time. It is written as $_nP_r$ and the
formula is $_nP_r = \dfrac{n!}{(n-r)!}$
3. The number of permutations of n objects in which k1 are alike, k2 are alike, etc. (m kinds in all) is
$\dfrac{n!}{k_1! \times k_2! \times \dots \times k_m!}$
Example: How many different permutations can be made from the letters in the word
"BIOSTATISTICS"?
Solutions:
Here n = 13, of which one is B, three are I, one is O, three are S, three are T, one is A and
one is C. There are $\dfrac{13!}{1!\,3!\,1!\,3!\,3!\,1!\,1!} = 28828800$ permutations.
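Rule 3 can be applied mechanically by counting how often each letter occurs. The following sketch does this for the word in the example. The function name `distinct_arrangements` is chosen for illustration.

```python
from collections import Counter
from math import factorial

def distinct_arrangements(word):
    """n! divided by the factorial of each letter's multiplicity."""
    total = factorial(len(word))
    for multiplicity in Counter(word).values():
        total //= factorial(multiplicity)   # exact integer division at every step
    return total

letter_counts = Counter("BIOSTATISTICS")
arrangements = distinct_arrangements("BIOSTATISTICS")
print(dict(letter_counts), arrangements)
```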
Combination:- A selection of objects without regard to order is called a combination.
Combination rule:- The number of combinations of r objects selected from n objects is given as
follows:-
$_nC_r = \dfrac{n!}{(n-r)!\,r!}$
Example: In how many ways can 5 injured persons be selected from 10 injured people in a
certain car accident?
Solution:
n = 10, r = 5: $_nC_r = \dfrac{n!}{(n-r)!\,r!} = \dfrac{10!}{5!\,5!} = 252$ ways
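Python's `math` module (version 3.8 or later) provides `comb` and `perm`, which makes the difference between selection and arrangement easy to see. The sketch below evaluates both for n = 10 and r = 5.

```python
from math import comb, factorial, perm

n, r = 10, 5
selections = comb(n, r)      # order ignored: n! / ((n - r)! r!)
arrangements = perm(n, r)    # order matters: n! / (n - r)!

# Each selection of r objects can be ordered in r! ways.
print(selections, arrangements, arrangements // factorial(r))
```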
Exercise:- Among 15 packs of drugs, two are defective. In how many ways can a
pharmacologist choose three of the packs for inspection so that:
a) There is no restriction,
b) None of the defective packs is included,
c) Only one of the defective packs is included,
d) Both of the defective packs are included.
Approaches to measuring Probability
There are four different conceptual approaches to the study of probability theory. These are:
The classical approach.
The frequentist approach.
The axiomatic approach.
The subjective approach.
The classical approach: This approach is used when:
- All outcomes are equally likely.
- Total number of outcome is finite, say N.
Definition: If a random experiment with N equally likely outcomes is conducted and out of these
NA outcomes are favorable to the event A, then the probability that event A occurs, denoted
P(A), is defined as:
$P(A) = \dfrac{N_A}{N} = \dfrac{\text{No. of outcomes favourable to } A}{\text{Total number of outcomes}} = \dfrac{n(A)}{n(S)}$
Example: in the rolling of the die , each of the six sides is equally likely to be observed . So, the
probability that a 4 will be observed is equal to 1/6.
Exercise: A box of 80 aspirin consists of 12 defective and 70 non defective aspirin tablets. If 8
of this tablet are selected at random, what is the probability
a) All will be defective.
b) 6 will be non defective
Required p ( A∩B )
a. p ( A∩B )= p ( B / A ) . p ( A )=( 4 /10 ) ( 3 /9 )=2/15
b. p ( A∩B )= p ( A ) . p ( B )=( 4/10 ) ( 4 /10 )=4 /25
Total Probability Theorem
If we know the conditional probabilities of a given event under all conditions, then we can obtain
the un-conditional probability of the same event using the law of total probability
Consider two events – B: Flight is delayed, A: There are severe thunderstorms
Now a flight may be delayed (event B happens) due to many reasons, one of which is
severe thunderstorms (event A happens). Clearly, these two events are related.
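In fact, the (unconditional) probability of delay can be written as the sum of two
terms, each a conditional probability weighted by the probability of its condition: the probability
of a delay given severe thunderstorms times the probability of severe thunderstorms, plus the
probability of a delay given no severe thunderstorms times the probability of no severe thunderstorms:
$P(B) = P(B|A)\,P(A) + P(B|A^c)\,P(A^c)$
where A is the "Storm" condition and $A^c$ is the "No Storm" condition.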
Bayes’ theorem
Prior probability →new information →application of bayes theorem →posterior
probability
We now pose the following opposite question: Given that the event B has occurred, what is
the probability that any single one of the events A’s occur?
We call P (Ai|B) as the posterior probability of event Ai, that is, the probability of Ai after event
B is observed. P(Ai) by itself is then the prior probability, the belief we have in the likelihood of
Ai in the absence of any additional information
A direct result of P(A∩B) = P(B∩A) gives us the posterior probability from the conditional
probability and the prior probability:
$P(A_i|B) = \dfrac{P(B|A_i)\,P(A_i)}{\sum_{j=1}^{n} P(B|A_j)\,P(A_j)}$
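Bayes' theorem amounts to multiplying each prior by its conditional probability and then dividing by the total. The sketch below shows this. The function name `posterior` and the two-event numbers are hypothetical values chosen only to illustrate the calculation. They do not come from the exercises.

```python
def posterior(priors, likelihoods):
    """Return P(A_i | B) for every i, given P(A_i) and P(B | A_i)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]   # P(B | A_i) * P(A_i)
    total = sum(joint)                                     # P(B), by total probability
    return [j / total for j in joint]

# Hypothetical illustration: two mutually exclusive, exhaustive events.
priors = [0.3, 0.7]        # P(A_1), P(A_2)
likelihoods = [0.8, 0.1]   # P(B | A_1), P(B | A_2)
post = posterior(priors, likelihoods)
print([round(p, 4) for p in post])
```

The posterior probabilities always sum to 1, because the denominator is exactly the sum of the numerators.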
Exercise:
1. Suppose that it is known that a fraction 0.001 of the people in a town have tuberculosis (TB).
A tuberculosis test is given with the following properties. If the person does have TB, then the
test will indicate it with probability 0.999. If he does not have TB, then there is a probability 0.002 that the test will
erroneously indicate that he does. For one randomly selected person, the test shows that he has
TB. What is the probability that he really does?
2. Five percent of the people have high blood pressure. Of the people with high blood pressure,
75 percent drink alcohol; whereas, only 50 percent of the people without high blood pressure
drink alcohol. What percent of the drinker have high blood pressure?
3. A drug store sells three different brands of drugs. Of its drug sales, 50% are drug type 1 (the
least expensive), 30% are drug type 2, and 20% are drug type 3. Each manufacturer offers a 1-year
warranty for inspection. It is known that 25% of type 1 drugs require warranty
inspection, whereas the corresponding percentages for drug types 2 and 3 are 20% and 10%,
respectively.
A. What is the probability that a randomly selected customer has bought a drug type 1 that
will need inspection while under warranty?
B. What is the probability that a randomly selected customer bought a drug that will need
inspection while under warranty?
C. If a customer returns to the store with a drug that will needs warranty inspection, what is
the probability that it is a drug type 1? A drug type 2? A drug type 3?
4. A diagnostic test for a certain disease is 95 percent accurate, in that if a person has the
disease, it will detect it with a probability of 0.95, and if a person does not have the disease, it
will give a negative result with a probability of 0.95. Suppose that only 0.5 percent of the
population has the disease in question. A person is chosen at random from this population. The
test indicates that this person has the disease. What is the (conditional) probability that he or
she does have the disease?
CHAPTER - SIX
RANDOM VARIABLE AND PROBABILITY DISTRIBUTIONS
Continuous random variables: are variables that can assume all values between any two given
values. Continuous random variables can assume an infinite number of values, including decimal
and fractional values.
Examples:
Definition: a probability distribution consists of the values a random variable can assume and the
corresponding probabilities of those values.
Example: Consider the experiment of tossing a coin three times. Let X be the number of heads.
Construct the probability distribution of X.
Solution:
Calculate the probability of each possible distinct value of X and express X in the form of
frequency distribution.
X =x 0 1 2 3
P ( X=x ) 1/8 3/8 3/8 1/8
Probability distribution is denoted by P for discrete and by f for continuous random variable.
Since the values of a probability distribution are probabilities, they must be numbers in
the interval from 0 to 1.
Since a random variable has to take on one of its values, the sum of all the values of a
probability distribution must be equal to 1.
1. $P(x) \ge 0$, if X is discrete;
   $f(x) \ge 0$, if X is continuous.
2. $\sum_{x} P(X = x) = 1$, if X is discrete;
   $\int_{x} f(x)\,dx = 1$, if X is continuous.
Introduction to expectation
Definition: 1. Let a discrete random variable X assume the values X1, X2, …, Xn with the probabilities
P(X1), P(X2), …, P(Xn) respectively. Then the expected value of X, denoted by E(X), is defined as:
$E(X) = X_1 P(X_1) + X_2 P(X_2) + \dots + X_n P(X_n) = \sum_{i=1}^{n} X_i P(X_i)$
2. Let X be a continuous random variable assuming values in the interval (a, b) such
that $\int_a^b f(x)\,dx = 1$; then
$E(X) = \int_a^b x f(x)\,dx$
Examples: What is the expected value of a random variable X obtained by tossing a coin three
times where X is the number of heads
X =x 0 1 2 3
P ( X=x ) 1/8 3/8 3/8 1/8
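The distribution in the table, and the expected value defined above, can be obtained by enumerating the eight equally likely outcomes of three tosses. The sketch below uses exact fractions so that no rounding is involved. The variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 2**3 equally likely sequences of heads (H) and tails (T).
outcomes = list(product("HT", repeat=3))
head_counts = Counter(seq.count("H") for seq in outcomes)

# P(X = x) = number of sequences with x heads / total number of sequences
dist = {x: Fraction(c, len(outcomes)) for x, c in sorted(head_counts.items())}
expected = sum(x * p for x, p in dist.items())   # E(X) = sum of x * P(X = x)

print(dist)
print(expected)
```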
Suppose a charity organization is mailing printed return-address stickers to over one million
homes in Ethiopia. Each recipient is asked to donate $1, $2, $5, $10, $15, or $20. Based on
past experience, the amount a person donates is believed to follow the following probability
distribution:
Solution:
Variance of X: $\text{var}(X) = E(X^2) - [E(X)]^2$
Where:
$E(X^2) = \sum_{i=1}^{n} x_i^2\, P(X = x_i)$, if X is discrete
$E(X^2) = \int_{x} x^2 f(x)\,dx$, if X is continuous.
Example: Let X be the number of ears affected by one or more episodes of otitis media (ear infection)
during the first two years of life. Suppose the probability distribution function for this random
variable is given below. The expected value and variance of the number of ears affected by ear infection
during the first two years of life are computed as follows:
Interpretation: the mean number of ears affected by otitis media during the first two years of life is
1.26. The population standard deviation of X is $\sigma_x = \sqrt{\sigma_x^2}$, or $\sqrt{0.452} = 0.673$ in our example.
Exercise: Consider the random variable representing the number of episodes of diarrhoea in the first 2
years of life. Suppose this random variable has a probability mass function as below
R 0 1 2 3 4 5 6
P(X=r) 0.129 0.264 0.271 0.185 0.095 0.039 0.017
What is the expected number of episodes of diarrhoea in the first 2 years of life?
Compute the variance and SD for the random variable representing number of episodes of diarrhea
in the first 2 years of life. ?
1. Binomial Distribution: One of the most widely used of all discrete probability distributions is
the binomial distribution. A binomial experiment is a probability experiment that satisfies the
following four requirements called assumptions of a binomial distribution.
$P(X = x) = \binom{n}{x} p^x q^{n-x}$, x = 0, 1, 2, ..., n
And this is sometimes written as: X ~ Bin(n, p)
When using the binomial formula to solve problems, we have to identify three things:
The number of trials ( n )
The probability of a success on any one trial ( p ) and
The number of successes desired ( X ).
Examples:
1. Suppose that an examination consists of six true and false questions, and assume that a student has no
knowledge of the subject matter. The probability that the student will guess the correct answer to the first
question is 30%. Likewise, the probability of guessing each of the remaining questions correctly is also
30%.
a) What is the probability of getting more than three correct answers?
b) What is the probability of getting at least two correct answers?
c) What is the probability of getting at most three correct answers?
d) What is the probability of getting less than five correct answers?
Soln: Let X = the number of correct answers that the student gets.
X ~ Bin(n = 6, p = 0.30)
$\Rightarrow P(X = x) = \binom{6}{x}\, 0.3^x\, 0.7^{6-x}$, x = 0, 1, 2, ..., 6
a) P(X > 3) = ?
⇒ P(X > 3) = P(X = 4) + P(X = 5) + P(X = 6)
= 0.060 + 0.010 + 0.001 = 0.071
Thus, we may conclude that if the student guesses each answer with a 30% chance of being correct, the
probability is 0.071 (or 7.1%) that more than three of the questions are answered correctly by the
student.
b) P(X ≥ 2) = ?
P(X ≥ 2) = P(X = 2) + P(X = 3) + P(X = 4) + P(X = 5) + P(X = 6)
= 0.324 + 0.185 + 0.060 + 0.010 + 0.001 = 0.58
c) P(X ≤ 3) = ?
P(X ≤ 3) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3)
= 0.118 + 0.303 + 0.324 + 0.185 = 0.93
d) P(X < 5) = ?
P(X < 5) = 1 − P(X ≥ 5)
= 1 − {P(X = 5) + P(X = 6)} = 1 − (0.010 + 0.001) = 0.989
2. Ten patients are treated surgically. For each person there is a 70% chance of successful surgery (i.e., p=
0.7). What is the probability that at most five surgeries are successful?
P[at most five successful cases]=P[five or fewer successful cases]
= P[five successful cases] + P[four successful cases] + P[three successful cases]
+ P[two successful cases] + P[one successful case] + P[no successful case]
= 0.1029 + 0.0368 + 0.0090 + 0.0014 + 0.0001 + 0.0000
= 0.1502
3. Suppose that 4% of all patients treated with a certain type of drug develop side effects. If eight of these
people are randomly selected from across the country and tested, what is the probability that exactly three
of them develop the side effect? Assume that each patient is treated independently of the others.
In this problem, n = 8, X = 3, p = 0.04, and q = (1 − p) = 0.96.
Substituting these numbers into the binomial formula (see the above equation) we get:
$P(X = 3) = \binom{8}{3}(0.04)^3(0.96)^5 = 0.0029$ or 0.29%.
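The binomial probabilities in the three examples above can be recomputed directly from the formula. The sketch below is one way to do so with the standard library only. The helper names `binom_pmf` and `binom_cdf` are assumptions for illustration.

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def binom_cdf(k, n, p):
    """P(X <= k), obtained by summing the individual probabilities."""
    return sum(binom_pmf(x, n, p) for x in range(k + 1))

p_more_than_3 = 1 - binom_cdf(3, 6, 0.30)   # example 1(a): six questions
p_at_most_5 = binom_cdf(5, 10, 0.70)        # example 2: ten surgeries
p_exactly_3 = binom_pmf(3, 8, 0.04)         # example 3: eight patients

print(round(p_more_than_3, 4), round(p_at_most_5, 4), round(p_exactly_3, 4))
```

Small differences from hand calculations in the last decimal place can arise when the individual terms are rounded before they are added.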
Exercise: Suppose that in a certain malarious area past experience indicates that the probability that a person
with a high fever will be positive for malaria is 0.7. Consider 3 randomly selected patients (with high
fever) in that same area.
1) What is the probability that no patient will be positive for malaria?
2) What is the probability that exactly one patient will be positive for malaria?
3) What is the probability that exactly two of the patients will be positive for malaria?
4) What is the probability that all patients will be positive for malaria?
5) Find the mean and the SD of the probability distribution given above.
Remark: If X is a binomial random variable with parameters n and p, then
E(X) = np, Var(X) = npq
Poisson Distribution: A random variable X is said to have a Poisson distribution if its probability distribution is given by:
$$P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}, \quad x = 0, 1, 2, \ldots$$
where λ = the average number.
- The Poisson distribution depends only on the average number of occurrences per unit of time or space.
- The Poisson distribution is used as a distribution of rare events, such as the number of misprints, natural disasters like earthquakes, accidents, hereditary conditions, the number of patients admitted to a hospital emergency room per day, arrivals, etc.
- The process that gives rise to such events is called Poisson process.
Note that instead of time, the Poisson random variable may be considered in the experiment
of counting the number x of times a particular event occurs during a given unit of area,
volume, etc.
Examples: If 1.6 accidents can be expected at an intersection on any given day, what is the probability that there will be 3 accidents on any given day?
$X \sim \text{Poisson}(1.6) \Rightarrow P(X = x) = \frac{1.6^x e^{-1.6}}{x!}$
$P(X = 3) = \frac{1.6^3 e^{-1.6}}{3!} = 0.1378$
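The Poisson formula is easy to evaluate directly. Below is a small Python sketch (an illustration added here, with the hypothetical helper name `poisson_pmf`) that recomputes the intersection example with λ = 1.6.

```python
from math import exp, factorial


def poisson_pmf(x: int, lam: float) -> float:
    """P(X = x) for a Poisson random variable with mean lam."""
    return lam**x * exp(-lam) / factorial(x)


# Probability of exactly 3 accidents when 1.6 are expected per day.
p_three = poisson_pmf(3, 1.6)
print(f"P(X = 3) = {p_three:.4f}")
```

The same function can be summed over a range of x values to obtain probabilities such as "at most" or "at least" a given count.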
1. On average, five smokers pass a certain street corner every ten minutes. What is the probability that during a given 10 minutes the number of smokers passing will be
a) 6 or fewer b) 7 or more c) exactly 8? (Exercise)
2. Patients arrive at a certain hospital at an average rate of two every 10 minutes. The
number of arrivals is distributed according to a Poisson distribution. What is the
probability that there will be :
Note:
The Poisson probability distribution provides a close approximation to the binomial probability distribution when n is large and p is quite small or quite large, with λ = np:
$$P(X = x) = \frac{(np)^x e^{-np}}{x!}, \quad x = 0, 1, 2, \ldots$$
where λ = np = the average number.
Usually we use this approximation if np ≤ 5. In other words, if n > 20 and np ≤ 5 [or n(1 − p) ≤ 5], then we may use the Poisson distribution as an approximation to the binomial distribution.
Example:
1. Find the binomial probability P(X = 3) by using the Poisson distribution if p = 0.01 and n = 200.
Solution: Here λ = np = 200(0.01) = 2, so
$P(X = 3) \approx \frac{2^3 e^{-2}}{3!} = 0.1804$
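One way to see how close the approximation is for this example is to compute both the exact binomial probability and its Poisson approximation. The sketch below is an added illustration under the example's values (n = 200, p = 0.01, x = 3); the variable names are arbitrary.

```python
from math import comb, exp, factorial

n, p, x = 200, 0.01, 3
lam = n * p  # Poisson mean used in the approximation, lambda = np

# Exact binomial probability P(X = 3).
exact = comb(n, x) * p**x * (1 - p) ** (n - x)

# Poisson approximation with lambda = np.
approx = lam**x * exp(-lam) / factorial(x)

print(f"lambda = {lam}")
print(f"binomial (exact)        = {exact:.4f}")
print(f"Poisson (approximation) = {approx:.4f}")
```

With n large and p small, the two values agree to about two decimal places, which is why the approximation is considered acceptable under the rule n > 20 and np ≤ 5.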
Introduction: There are many continuous probability distributions, such as the normal distribution, the t distribution, the chi-square distribution, and the F distribution. In this section, we will concentrate on the normal distribution.
1. Normal Distribution
A random variable X is said to have a normal distribution if its probability density function is given by
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}, \quad -\infty < x < \infty,\ -\infty < \mu < \infty,\ \sigma > 0$$
where μ = E(X) and σ² = Variance(X).
μ and σ² are the parameters of the normal distribution.
Properties of Normal Distribution:
1. It is bell shaped, symmetrical about its mean, and mesokurtic. The maximum ordinate is at x = μ, where the exponent is zero, and is given by
$f(\mu) = \frac{1}{\sigma\sqrt{2\pi}}$
2. It is asymptotic to the x-axis, i.e., it extends indefinitely in either direction from the mean.
3. It is a continuous distribution.
4. It is a family of curves, i.e., every unique pair of mean and standard deviation defines a different
normal distribution. Thus, the normal distribution is completely described by two parameters:
mean and standard deviation.
5. Total area under the curve sums to 1, i.e., the area of the distribution on each side of the mean is 0.5: $\int_{-\infty}^{\infty} f(x)\,dx = 1$
6. It is unimodal, i.e., values mound up only in the center of the curve.
7. Mean = Median = Mode = μ
8. The probability that a random variable will have a value between any two points is equal to the
area under the curve between those points.
Note: To facilitate the use of the normal distribution, the following distribution, known as the standard normal distribution, was derived by using the transformation
$$Z = \frac{X - \mu}{\sigma} \Rightarrow f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2}$$
Properties of the Standard Normal Distribution:
Mean is zero
Variance is one
- Areas under the standard normal distribution curve have been tabulated in various ways. The most common ones are the areas between Z = 0 and a positive value of Z.
- Given a normal distributed random variable X with
1. Find the area under the standard normal distribution which lies
Solution:
Solution:
Area = P(Z > −0.35)
= P(−0.35 < Z < 0) + P(Z > 0)
= P(0 < Z < 0.35) + P(Z > 0)
= 0.1368 + 0.50 = 0.6368
Solution:
Area = P(Z < −0.35)
= 1 − P(Z > −0.35)
= 1 − 0.6368 = 0.3632
Solution
Solution
3. A random variable X has a normal distribution with mean 80 and standard deviation 4.8. What is the probability that it will take a value
a) less than 87.2? b) greater than 76.4? c) between 81.2 and 86.0?
Solution
a)
$P(X < 87.2) = P\left(\frac{X - \mu}{\sigma} < \frac{87.2 - \mu}{\sigma}\right)$
$= P\left(Z < \frac{87.2 - 80}{4.8}\right)$
= P(Z < 1.5)
= P(Z < 0) + P(0 < Z < 1.5)
= 0.50 + 0.4332 = 0.9332
b)
$P(X > 76.4) = P\left(\frac{X - \mu}{\sigma} > \frac{76.4 - \mu}{\sigma}\right)$
$= P\left(Z > \frac{76.4 - 80}{4.8}\right)$
= P(Z > −0.75)
= P(Z > 0) + P(0 < Z < 0.75)
= 0.50 + 0.2734 = 0.7734
c)
$P(81.2 < X < 86.0) = P\left(\frac{81.2 - \mu}{\sigma} < \frac{X - \mu}{\sigma} < \frac{86.0 - \mu}{\sigma}\right)$
$= P\left(\frac{81.2 - 80}{4.8} < Z < \frac{86.0 - 80}{4.8}\right)$
= P(0.25 < Z < 1.25)
= P(0 < Z < 1.25) − P(0 < Z < 0.25)
= 0.3944 − 0.0987 = 0.2957
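The three probabilities in this example can also be obtained without a table. The sketch below is an added illustration that uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) with the example's mean 80 and standard deviation 4.8; the variable names are arbitrary.

```python
from statistics import NormalDist

# X ~ Normal(mean = 80, standard deviation = 4.8), as in the example.
X = NormalDist(mu=80, sigma=4.8)

p_a = X.cdf(87.2)                # a) P(X < 87.2)
p_b = 1 - X.cdf(76.4)            # b) P(X > 76.4)
p_c = X.cdf(86.0) - X.cdf(81.2)  # c) P(81.2 < X < 86.0)

print(f"a) {p_a:.4f}")
print(f"b) {p_b:.4f}")
print(f"c) {p_c:.4f}")
```

The cumulative distribution function gives P(X < x) directly, so the standardization step and the split into areas around Z = 0 are handled internally. Results may differ from table-based answers in the fourth decimal place because table entries are rounded.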
4. A normal distribution has mean 62.4. Find its standard deviation if 20.05% of the area under the normal curve lies to the right of 72.9.
Solution
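$P(X > 72.9) = 0.2005 \Rightarrow P\left(\frac{X - \mu}{\sigma} > \frac{72.9 - \mu}{\sigma}\right) = 0.2005$
$\Rightarrow P\left(Z > \frac{72.9 - 62.4}{\sigma}\right) = 0.2005$
$\Rightarrow P\left(Z > \frac{10.5}{\sigma}\right) = 0.2005$
$\Rightarrow P\left(0 < Z < \frac{10.5}{\sigma}\right) = 0.50 - 0.2005 = 0.2995$
And from the table, P(0 < Z < 0.84) = 0.2995
$\Leftrightarrow \frac{10.5}{\sigma} = 0.84$
⇒ σ = 12.5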
5. A random variable has a normal distribution with σ = 5. Find its mean if the probability that the random variable will assume a value less than 52.5 is 0.6915.
Solution
$P(Z < z) = P\left(Z < \frac{52.5 - \mu}{5}\right) = 0.6915$
⇒ P(0 < Z < z) = 0.6915 − 0.50 = 0.1915.
But from the table,
P(0 < Z < 0.5) = 0.1915
$\Leftrightarrow z = \frac{52.5 - \mu}{5} = 0.5$
⇒ μ = 50
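Examples 4 and 5 work backwards from a probability to a z value. A table lookup can be replaced by the inverse of the standard normal cumulative distribution function. The following Python sketch is an added illustration using `statistics.NormalDist.inv_cdf` (standard library, Python 3.8 or later); the variable names are assumptions made for this example.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Example 4: mean 62.4, and 20.05% of the area lies to the right of 72.9.
# The z value with area 0.2005 to its right has area 1 - 0.2005 to its left.
z4 = Z.inv_cdf(1 - 0.2005)
sigma = (72.9 - 62.4) / z4

# Example 5: sigma = 5, and P(X < 52.5) = 0.6915.
z5 = Z.inv_cdf(0.6915)
mu = 52.5 - 5 * z5

print(f"Example 4: z = {z4:.2f}, sigma = {sigma:.1f}")
print(f"Example 5: z = {z5:.2f}, mu = {mu:.1f}")
```

Because the inverse function is not limited to two decimal places of z, the results can differ slightly from the table-based answers before rounding.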
6. Of a large group of men, 5% are less than 60 inches in height and 40% are between 60 & 65
inches. Assuming a normal distribution, find the mean and standard deviation of heights.
Solution (Exercise)
The Normal Approximation to the Binomial Distribution: As the sample size gets larger, the binomial distribution approaches the normal distribution in shape regardless of the value of p (probability of success). For large n, the binomial distribution is cumbersome to analyze without a computer. Fortunately, the normal distribution is a good approximation for binomial distribution problems for large values of n. The commonly accepted guideline for using the normal approximation to the binomial probability distribution is that np and n(1 − p) are both greater than 5.
Example: Suppose that a physician claimed that 70% of his patients returned for annual examination. In a year in which 80 new (first-time) patients were served at the clinic, what is the probability that 60 or more of the patients will return for another examination, i.e., P(X ≥ 60)?
The solution to this problem can be illustrated as follows:
First, the two guidelines that np and n(1 − p) should be greater than 5 are satisfied: np = 80 × 0.70 = 56 > 5, and n(1 − p) = 80(1 − 0.70) = 24 > 5.
Second, we need to find the mean and the standard deviation of the binomial distribution. The mean is equal to np = 80 × 0.70 = 56 and the standard deviation is the square root of np(1 − p), i.e., the square root of 16.8, which is equal to 4.0988. Because a discrete count is being approximated by a continuous curve, we apply the continuity correction and use X = 59.5 in place of 60. Using the Z equation we get Z = (X − mean)/standard deviation = (59.5 − 56)/4.0988 = 0.85. From the table, the probability for this Z score is 0.3023, which is the probability between the mean (56) and 59.5. We must subtract this table value from 0.5 in order to get the answer, i.e., P(X ≥ 60) = 0.5 − 0.3023 = 0.1977. Therefore, the probability is 19.77% that 60 or more of the 80 first-time patients will return to the clinic for another examination.
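To see how well the approximation performs in this example, the normal-approximation answer can be compared with the exact binomial sum. The sketch below is an added illustration using only the Python standard library; the variable names are arbitrary, and the continuity-corrected cut-off of 59.5 follows the worked solution.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 80, 0.70
mean = n * p                # np = 56
sd = sqrt(n * p * (1 - p))  # sqrt(np(1 - p)) = sqrt(16.8)

# Normal approximation with continuity correction: P(X >= 60) ~ P(Y > 59.5).
approx = 1 - NormalDist(mean, sd).cdf(59.5)

# Exact binomial probability: sum P(X = k) for k = 60, ..., 80.
exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(60, n + 1))

print(f"mean = {mean:.1f}, sd = {sd:.4f}")
print(f"normal approximation = {approx:.4f}")
print(f"exact binomial       = {exact:.4f}")
```

The approximate value differs slightly from the table-based 0.1977 because the table rounds Z to two decimal places.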
CHAPTER 7
7. Sampling and Sampling Distribution
Introduction
Given a variable X, if we arrange its values in ascending order and assign a probability to each of the values, or if we present Xi in the form of a relative frequency distribution, the result is called the Sampling Distribution of X.
Definitions:
1. Parameter: Characteristic or measure obtained from a population.
2. Statistic: Characteristic or measure obtained from a sample.
3. Sampling: The process or method of sample selection from the population.
4. Sampling unit: the ultimate unit to be sampled or elements of the population to be
sampled.
Examples: -
- If somebody studies the socio-economic status of households, households are the sampling unit.
- If one studies the performance of freshman students in some college, the student is the sampling unit.
5. Sampling frame: is the list of all elements in a population.
Examples: -List of households.
-List of students in the registrar office.
6. Errors in sample survey:
There are two types of errors
a) Sampling error:
- Is the discrepancy between the population value and the sample value.
- May arise due to inappropriate sampling techniques being applied.
b) Non-sampling errors: are errors due to procedural bias, such as:
- Incorrect responses
- Measurement errors
- Errors at different stages in processing the data.
Advantages of the sampling approach over the census approach are:
- Reduced cost
- Greater speed
- Greater accuracy
- Greater scope
- More detailed information can be obtained.
- Let N = population size, n = sample size, and $k = \frac{N}{n}$ = sampling interval.
- Choose any number between 1 and k. Suppose it is j (1 ≤ j ≤ k).
- The j-th unit is selected first, and then the (j + k)-th, (j + 2k)-th, etc., until the required sample size is reached.
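The selection rule above can be sketched in a few lines of Python. This is an added illustration, not part of the original notes: the function name `systematic_sample`, the `seed` parameter, and the list of 100 numbered households are all hypothetical, and the sketch assumes that N is a multiple of n (otherwise k is rounded down).

```python
import random


def systematic_sample(population, n, seed=None):
    """Select n units: a random start j in 1..k, then every k-th unit."""
    N = len(population)
    k = N // n                 # sampling interval k = N / n (rounded down)
    rng = random.Random(seed)
    j = rng.randint(1, k)      # random start with 1 <= j <= k
    # Units j, j + k, j + 2k, ... (1-based positions, hence the "- 1").
    return [population[j - 1 + i * k] for i in range(n)]


# Hypothetical sampling frame: households numbered 1 to 100.
households = list(range(1, 101))
sample = systematic_sample(households, n=10, seed=1)
print(sample)
```

Only the starting point is random; once j is chosen, the rest of the sample is fixed by the interval k.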
2. Non Random Sampling or non-probability sampling.
- It is a sampling technique in which the choice of individuals for a sample depends on the
basis of convenience, personal choice or interest.
Examples:
Judgment sampling.
Convenience sampling
Quota Sampling.
1. Judgment Sampling
- In this case, the person taking the sample has direct or indirect control over which
items are selected for the sample.
2. Convenience Sampling
- In this method, the decision maker selects a sample from the population in a manner
that is relatively easy and convenient.
3. Quota Sampling
- In this method, the decision maker requires the sample to contain a certain number of
items with a given characteristic. Many political polls are, in part, quota sampling.
Note:
Let N = population size and n = sample size.
1. Suppose simple random sampling is used.
- We have $N^n$ possible samples if sampling is with replacement.
- We have $\binom{N}{n}$ possible samples if sampling is without replacement.
2. From here onwards, we assume that samples are drawn from a given population using simple random sampling.
Sampling Distribution of the sample mean
- Sampling distribution of the sample mean is a theoretical probability distribution that shows
the functional relationship between the possible values of a given sample mean based on
samples of size n and the probability associated with each value, for all possible samples of
size n drawn from that particular population.
- There are commonly three properties of interest of a given sampling distribution.
Its Mean
Its Variance
Example: Suppose we have a population of size N = 5, consisting of the ages of five children: 6, 8, 10, 12, and 14.
⇒ Population mean = μ = 10
Population variance = σ² = 8
Take samples of size 2 with replacement and construct the sampling distribution of the sample mean.
Solution: N = 5, n = 2
We have $N^n = 5^2 = 25$ possible samples since sampling is with replacement.
Step 1: Draw all possible samples:
        6         8         10        12        14
6     (6, 6)    (6, 8)    (6, 10)   (6, 12)   (6, 14)
8     (8, 6)    (8, 8)    (8, 10)   (8, 12)   (8, 14)
10    (10, 6)   (10, 8)   (10, 10)  (10, 12)  (10, 14)
12    (12, 6)   (12, 8)   (12, 10)  (12, 12)  (12, 14)
14    (14, 6)   (14, 8)   (14, 10)  (14, 12)  (14, 14)
Step 2: Calculate the mean for each sample:
        6     8     10    12    14
6       6     7     8     9     10
8       7     8     9     10    11
10      8     9     10    11    12
12      9     10    11    12    13
14      10    11    12    13    14
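$\mu_{\bar{X}} = \frac{\sum \bar{X}_i f_i}{\sum f_i} = \frac{250}{25} = 10 = \mu$
b) Find the variance of $\bar{X}$, say $\sigma_{\bar{X}}^2$.
$\sigma_{\bar{X}}^2 = \frac{\sum (\bar{X}_i - \mu_{\bar{X}})^2 f_i}{\sum f_i} = \frac{100}{25} = 4 \neq \sigma^2$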
Remark:
1. In general, if sampling is with replacement, $\sigma_{\bar{X}}^2 = \frac{\sigma^2}{n}$.
2. If sampling is without replacement, $\sigma_{\bar{X}}^2 = \frac{\sigma^2}{n}\left(\frac{N - n}{N - 1}\right)$.
3. In any case the sample mean is an unbiased estimator of the population mean, i.e., $\mu_{\bar{X}} = \mu \Rightarrow E(\bar{X}) = \mu$ (Show!)
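The example and the remark can be checked by listing every possible sample. The sketch below is an added illustration in Python (standard library only) that enumerates all samples of size 2 from the population 6, 8, 10, 12, 14, both with and without replacement, and compares the variance of the sample means with the two formulas; the variable names are arbitrary.

```python
from itertools import combinations, product
from statistics import mean, pvariance

population = [6, 8, 10, 12, 14]
N, n = len(population), 2

mu = mean(population)           # population mean
sigma2 = pvariance(population)  # population variance (divides by N)

# All N**n ordered samples with replacement, and all C(N, n) samples without.
means_wr = [mean(s) for s in product(population, repeat=n)]
means_wor = [mean(s) for s in combinations(population, n)]

var_wr = pvariance(means_wr)    # compare with sigma^2 / n
var_wor = pvariance(means_wor)  # compare with (sigma^2 / n)(N - n)/(N - 1)

print(len(means_wr), mean(means_wr), var_wr, sigma2 / n)
print(len(means_wor), mean(means_wor), var_wor,
      sigma2 / n * (N - n) / (N - 1))
```

In both cases the mean of the sample means equals the population mean, while the variance is smaller without replacement because of the factor (N − n)/(N − 1).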
- Sampling may be from a normally distributed population or from a non-normally distributed population.
- When sampling is from a normally distributed population, the distribution of $\bar{X}$ will possess the following property.
1. The distribution of X̄ will be normal
approximately normally distributed with mean μ and variance $\frac{\sigma^2}{n}$, when the sample size is large.
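The large-sample behaviour of the sample mean can be illustrated by simulation. The sketch below is an added illustration and rests on assumptions not found in the notes: the population is taken to be exponential with mean 1 and variance 1 (an arbitrary choice of a skewed, non-normal population), the sample size is n = 40, and 5000 samples are drawn with a fixed seed.

```python
import random
from statistics import pvariance

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 40                  # size of each sample
reps = 5000             # number of samples drawn

# Assumed population: exponential with mean 1 and variance 1.
sample_means = [
    sum(rng.expovariate(1.0) for _ in range(n)) / n
    for _ in range(reps)
]

mean_of_means = sum(sample_means) / reps
var_of_means = pvariance(sample_means)

print(f"mean of sample means     = {mean_of_means:.3f} (population mean = 1)")
print(f"variance of sample means = {var_of_means:.4f} (sigma^2 / n = {1 / n:.4f})")
```

A histogram of `sample_means` would look roughly bell shaped even though the assumed population is strongly skewed, with centre close to μ and spread close to σ²/n.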