
JU / Biostatistics

CHAPTER 1: Introduction
Learning objectives:
After completing this chapter, the student will be able to:
1. Define Statistics and Biostatistics
2. Enumerate the importance and limitations of statistics
3. Define and identify the different types of data and understand why we need
to classify variables

Statistical thinking has nowadays become essential in many fields of study. Its
usefulness has spread to such diverse fields as agriculture, business, accounting,
marketing, economics, management, medicine, political science, psychology, sociology,
engineering, journalism, metrology, tourism, etc. In biomedical research, meaningful
conclusions can only be drawn from data collected under a valid scientific design
and analyzed using appropriate statistical methods. Therefore, the selection of an
appropriate study design is important to provide an unbiased and scientific evaluation
of the research questions. Each design is based on a certain rationale and is applicable
in certain experimental situations.

1.1. Definition of Biostatistics

Biostatistics is the branch of statistics that deals with data arising from biological
processes or medical experiments. Thus biostatistics is the application of statistical
techniques in health-related areas (the application of statistical methods to biological,
medical and public health data).

Why biostatistics?
Because some statistical methods are more heavily used in health applications than
elsewhere (e.g., survival analysis and longitudinal data analysis).

The word statistics, on the other hand, has two meanings. In the more common
usage, statistics (plural sense) refers to numerical information (aggregates of facts).
Examples include statistics of births, disease cases, imports, exports, etc. In these
examples statistics are numbers or facts. The subject of statistics (singular sense)
has a much broader meaning than just collecting and publishing numerical
information. Statistics in this sense may be defined as the science of


collecting, organizing, presenting, analyzing and interpreting data to assist in
making more effective decisions. This definition points out five stages in any
statistical investigation:

i. Collection of data
ii. Organization of data
iii. Presentation of data
iv. Analysis of data and
v. Interpretation of data
Stage 1: Data collection
• Is the process of gathering information or data about the variable of interest for
our specific purpose.
• Constitutes the first step in a statistical investigation.
• Utmost care must be exercised in collecting data because they form the
foundation of statistical analysis. If the data are faulty, the conclusions drawn
can never be reliable. The data may be available from existing published or
unpublished sources, or else may be collected by the investigator; i.e.,
data may be obtained from either primary or secondary sources.
Stage 2: Organization of data
• Is the process of editing, classification and tabulation of data.
- Editing: is the process of checking and correcting data for omissions,
inconsistencies, irrelevant answers and wrong computations in the
collected data.
- Classification: is the task of grouping the collected and edited data into
similar categories based on some criteria.
- Tabulation: is putting the classified data in the form of a table.
• Arranging or classifying data in a suitable order makes the information
easier to present.
Stage 3: Presentation of data
• The organized data can now be presented in the form of tables and diagrams.
At this stage, large data sets are presented in tables in a summarized and
condensed manner.
• The main purpose of data presentation is to facilitate statistical analysis.
• Graphs and diagrams may also be used to give the data a vivid meaning and
make the presentation attractive.


Stage 4: Analysis of data
• This is the stage where we critically study the data to draw conclusions about
the population parameter. The purpose of data analysis is to extract
information useful for decision making. Analysis may involve highly
complex and sophisticated mathematical techniques. However, this
material covers only the most commonly used methods of statistical analysis,
such as the calculation of averages, the computation of measures
of dispersion, and regression and correlation analysis.
Stage 5: Interpretation of data
• This is the stage where we draw valid conclusions from the results obtained
through data analysis. Interpretation means drawing conclusions from the
data which form the basis for decision making. The interpretation of data is a
difficult task and necessitates a high degree of skill and experience. If data
that have been analyzed are not properly interpreted, the whole purpose of
the investigation may be defeated and fallacious conclusions may be drawn.
Great care is therefore needed when interpreting results.

Data: are the measurements or observations (values) for a variable (factor).
• A collection of data values forms a data set.
• Each value in the data set is called a data value or datum.

1.2. Characteristics of statistical data

Some of the characteristics that statistical data must possess are:

i) It must be in aggregates. This means that statistics are 'a number of facts.' A
single fact, even though numerically stated, cannot be called statistics.
ii) It must be enumerated or estimated according to a reasonable standard of
accuracy. This means that if aggregates of numerical facts are to be called
'statistics' they must be reasonably accurate. This is necessary because
statistical data are to serve as a basis for statistical investigations. If the basis
happens to be incorrect the results are bound to be misleading.
iii) It must have been collected in a systematic manner for a predetermined
purpose. Numerical data can be called statistics only if they have been
compiled in a properly planned manner and for a purpose about which the
enumerator had a definite idea.


iv) It must be comparable. Numerical facts may be placed in relation to each
other either in point of time, space or condition.
1.3. Classification of Statistics
There are two broad branches of statistics:
Descriptive statistics: Statistical methods that deal with organizing and
summarizing a given set of data into a meaningful form. Most of the statistical
information in newspapers, magazines, reports and other publications comes from
data that have been summarized and presented in a form that is easy for the reader to
understand.
► Here there is no generalization or conclusion about the population.
► It consists of collection, organization and presentation of data.
E.g. Frequency distribution, measures of central tendency (such as mean,
median), measures of dispersion (such as range, variance, standard deviation, etc.)
• Descriptive statistics does not go beyond describing the data themselves.
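
As a rough illustration of the descriptive measures named above, the following Python sketch summarizes a small data set. The blood pressure readings are hypothetical values invented for this example, and the standard library `statistics` module is only one possible tool.

```python
import statistics

# Hypothetical systolic blood pressure readings (mmHg) for eight patients.
readings = [118, 125, 130, 122, 140, 135, 128, 126]

# Measures of central tendency
mean = statistics.mean(readings)
median = statistics.median(readings)

# Measures of dispersion
value_range = max(readings) - min(readings)
variance = statistics.variance(readings)  # sample variance (n - 1 denominator)
std_dev = statistics.stdev(readings)      # sample standard deviation

print(f"mean = {mean}, median = {median}, range = {value_range}")
print(f"variance = {variance:.2f}, standard deviation = {std_dev:.2f}")
```

These numbers only describe the eight readings themselves; nothing is claimed about any larger population.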

Inferential statistics: Is the process of drawing conclusions (inferences) about a
population based on the information obtained from a sample. Because of time,
cost and other constraints, data are collected from only a small portion of the group
(the sample). The major contribution of statistics is that it enables us to use data
from the sample to make estimates and test claims about the characteristics of a
population. This process is referred to as statistical inference, which:
- Involves formulating and testing hypotheses, determining relationships among
variables, and making predictions.
- Is used to describe, infer, estimate or approximate the characteristics of the
target population.
Examples
• In a sample, 40% of the patients expressed a positive attitude toward the
service of the hospital. (Descriptive)
• As a result of the recent reduction in drug production by manufacturers, we
can expect the cost of medication to double in the next year. (Inference
from a sample survey)
• Based on a recent survey of residents, most Americans are in
favor of lowering smoke-related risks. (Inference)
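
To make the step from description to inference concrete, the sketch below turns the 40% sample figure from the first example into an interval estimate for the population proportion. The sample size of 200 is an assumption made only for illustration, and the normal-approximation (Wald) interval is one common method among several.

```python
import math

def proportion_ci(p_hat, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion.

    p_hat: sample proportion, n: sample size, z: 1.96 for about 95% confidence.
    """
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * standard_error, p_hat + z * standard_error

# 40% is the sample proportion from the example; n = 200 is hypothetical.
low, high = proportion_ci(0.40, 200)
print(f"Sample proportion: 0.40, approximate 95% CI: {low:.3f} to {high:.3f}")
```

The sample proportion alone is a descriptive statement; the interval is an inferential statement about the population from which the sample was drawn.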


1.4. Applications, Uses and Limitations of Statistics

Rationale for studying statistics:

• More and more things are now measured quantitatively in medicine and
public health.
• The planning, conduct, and interpretation of much of medical research are
becoming increasingly reliant on statistical methods. Is this new drug or
procedure better than the one commonly in use? How much better? What,
if any, are the risks of side effects associated with its use? In testing a new
drug, how many patients must be treated, and in what manner, in order to
demonstrate its worth? What is the normal variation in some clinical
measurement?
• Public health and medicine are becoming increasingly quantitative.

Hence statistics has become a very important subject area, and various
tools of statistics are used to solve problems in everyday life, in research,
marketing, production planning, quality control and other areas. Nevertheless,
statistics has its own limitations and it can also be misused.

Specific Uses:-

• Statistics condenses and summarizes complex data. The original set of data (raw data) is
normally voluminous and disorganized unless it is summarized and expressed in a few
numerical values.

• Statistics facilitates comparison of data. Measures obtained from different sets of data can be
compared to draw conclusions about those sets. Statistical values such as averages,
percentages, ratios, etc., are the tools that can be used for the purpose of comparing sets of
data.


• Statistics helps in predicting future trends. Statistics is extremely useful for analyzing
past and present data and predicting future trends.

• Statistics influences the policies of government. The results of statistical studies in areas such as
taxation, the unemployment rate, the performance of military equipment,
etc., may convince a government to review its policies and plans with a view to meeting
national needs and aspirations.

• Statistical methods are very helpful in formulating and testing hypotheses and in developing
new theories.

Limitations:-

• Statistics doesn't deal with single (individual) values. Statistics deals only with aggregate
values. But in some situations a single individual is highly important to consider.
Examples: the sun, a bus driver, a president, etc.

• Statistics can't directly deal with qualitative characteristics. It only deals with data which
can be quantified. For example, it does not deal with marital status itself (married, single, divorced,
widowed) but with the number married, the number single and the number divorced.

• Statistical conclusions are not universally true. Statistical conclusions are true only under
certain conditions or true only on average. The conclusions drawn from the analysis of the
sample may differ from the conclusions that would be drawn from the entire
population. For this reason, statistics is not an exact science.

Example: suppose 100 males and 2 females live on an island for one year, and the 2
females are married to 2 of the males. This means that 100% of the females are married
but only 2% of the males. Based on this information one may be tempted to conclude that
"the birth rate of males is higher than that of females". This conclusion may or may not be true.
• Statistical interpretations require a high degree of skill and understanding of the subject.
It requires extensive training to read and interpret statistics in their proper context. Wrong
conclusions may result if inexperienced people try to interpret statistical results.


• Statistics can be misused. Sometimes statistical figures can be misleading unless they are
carefully interpreted.

Example: an official report by the head of the ministers on the Ethio-Somalia anti-terrorist
mission stated that 25% of the terrorists were eliminated on the first day, 50% on the
second day and 75% on the third day. However, it is doubtful how the outcome of the
mission was measured and quantified. This leads to misuse of statistical figures.

Misuse of Statistics
Statistics can be misused in several ways unless the necessary data are collected
carefully and appropriate methods are applied. Statistics can be misused in the
following ways:
a) They can be used for the wrong purposes; that is, for purposes that are
different from the purpose for which they were collected.
b) They can be collected incorrectly and so be biased.
c) They can be analyzed carelessly, so that the results obtained are misleading.

1.5. Scales of Measurement


Any aspect of an individual that is measured (like blood pressure) or recorded
(like age or sex), and that can take different values for different individuals or
cases, is called a variable. It is helpful to divide variables into different types, as
different statistical methods are applicable to each. The main division is into
qualitative (or categorical) and quantitative (or numerical) variables.
Qualitative variable: a variable or characteristic which cannot be measured in
quantitative form but can only be identified by name or category, for instance
place of birth, ethnic group, type of drug, stage of breast cancer (I, II, III, or IV),
degree of pain (minimal, moderate, severe or unbearable).

Quantitative variable: A quantitative variable is one that can be measured and
expressed numerically. It can be of two types (discrete or continuous). The
values of a discrete variable are usually whole numbers, such as the number of
episodes of diarrhea in the first five years of life. A continuous variable is a
measurement on a continuous scale. Examples include weight, height, blood
pressure, age, etc.


Although the types of variables could be broadly divided into categorical
(qualitative) and quantitative, it has been a common practice to see four basic types
of data (scales of measurement).

Nominal Scale:- Data that represent categories or names. There is no implied order
to the categories of nominal data. In these types of data, individuals are simply
placed in the proper category or group, and the number in each category is
counted. Each item must fit into exactly one category. For example, in a study of
propranolol in patients with myocardial infarction, each patient may be recorded as
treated or control.
Other examples:
• Religion: Christianity, Islam, Hinduism, etc.
• Sex: Male, Female
• Eye color: brown, black, etc.

Ordinal Scale:- Applies whenever observations are not only different from category to
category, but can also be ranked according to some criterion. The variables deal with
relative differences rather than with quantitative differences. The spaces or
intervals between the categories are not necessarily equal.

Ordinal data are data which can have meaningful inequalities. The inequality signs
< or > may assume any meaning like 'stronger, softer, weaker, better than', etc.

Example:
Patients may be characterized as unimproved, improved & much improved.
Other examples include letter grading systems, levels of authority, career ranks, etc.
Individuals may be classified according to socio-economic status as low, medium &
high. It is usually impossible to infer that the difference between members of one
category and the next adjacent category is the same for every pair of adjacent categories.

Interval Scale: With this scale it is not only possible to order measurements, but
the distance between any two measurements is also known; quotients, however, are
not meaningful. There is no true zero point, only an arbitrary one. Interval data are the
type of information in which an increase from one level to the next always reflects
the same increase in the characteristic. Interval data can be added or subtracted, but
they may not be multiplied or divided.


Example:
A temperature of zero degrees does not indicate a lack of heat. Consider the two common
temperature scales, Celsius (C) and Fahrenheit (F). The same
difference exists between 10°C (50°F) and 20°C (68°F) as between 25°C (77°F) and
35°C (95°F); i.e., the measurement scale is composed of equal-sized intervals. But we
cannot say that a temperature of 20°C is twice as hot as a temperature of 10°C,
because the zero point is arbitrary.
Ratio Scale:- Characterized by the fact that equality of ratios as well as equality of
intervals may be determined. Fundamental to ratio scales is a true zero point.
E.g. Variables such as age, height, length, volume, rate, time, amount of rainfall,
etc. are measured on a ratio scale.
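
The difference between interval and ratio scales in the temperature example above can be checked numerically. The sketch below uses the standard Celsius-to-Fahrenheit and Celsius-to-Kelvin conversions; the function names are chosen only for this illustration.

```python
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

def c_to_k(celsius):
    """Convert degrees Celsius to kelvin (a scale with a true zero)."""
    return celsius + 273.15

# Equal intervals are preserved: both 10-degree Celsius gaps
# correspond to the same 18-degree Fahrenheit gap.
gap_low = c_to_f(20) - c_to_f(10)
gap_high = c_to_f(35) - c_to_f(25)
print(gap_low, gap_high)

# Ratios are not preserved on interval scales: "20 is twice 10" holds
# only for the Celsius numbers, not for the same temperatures in Fahrenheit.
ratio_c = 20 / 10
ratio_f = c_to_f(20) / c_to_f(10)
# On the Kelvin scale, which has a true zero, the ratio is meaningful.
ratio_k = c_to_k(20) / c_to_k(10)
print(ratio_c, round(ratio_f, 2), round(ratio_k, 3))
```

Because the ratio changes with the choice of interval scale, the statement "twice as hot" has no meaning for Celsius or Fahrenheit readings.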

1.6. Definitions of Some Basic Terms

Population: Is the totality of cases (items) under consideration in a given
investigation or research. E.g. all patients in JUSH, workers in a
hospital, etc.

Sample: Is a subgroup or part of the population selected by some method
(sampling technique) in order to estimate the population
parameter.
E.g. 250 physicians of JUSH out of a total of 2000 physicians.

Elementary unit: Is the specific person, business, product, and so on, with some
characteristic to be measured or categorized (information is recorded on it).
E.g. If the weight of a particular person in the class is measured, the person is the
elementary unit.

Sampling unit: Is one of the finite number of distinct, non-overlapping & identifiable
units obtained by dividing the population for the purpose of sample selection.
E.g. To know the average income per family, the head of the family is a
sampling unit.


Variables: A variable in statistics is any characteristic which can take on different
values when data are collected.
Variables can be classified as continuous or discrete. Variables,
whether discrete or continuous, are denoted by capital letters like X & Y.

i) Continuous Variables: are usually obtained by measurement, not by
counting. These are variables which can take any decimal value when
collected. For example, variables such as age, height, length, volume, rate, time,
amount of rainfall, etc. are continuous variables.
ii) Discrete Variables: are obtained by counting. A discrete variable always
takes whole number values.
E.g. Number of children per family, number of patients per household, etc.

Parameter: Is a characteristic or measure of the population (a population constant); it is
a numerical value obtained by measuring the whole population.
E.g. Population mean, population proportion, population ratio, etc.

Statistic: Is a characteristic or measure obtained from the sample; it is used to
estimate the corresponding population parameter.
Sampling: Is the method of obtaining a sample from the population.
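
The relationship between population, sample, parameter and statistic can be sketched in a few lines of Python. The population below is simulated (hypothetical ages of 2000 individuals) and simple random sampling is assumed; real studies may use other sampling techniques.

```python
import random
import statistics

random.seed(1)  # fixed seed so that the sketch is repeatable

# Hypothetical population: ages of 2000 individuals.
population = [random.randint(25, 65) for _ in range(2000)]

# Parameter: a value computed from every member of the population.
population_mean = statistics.mean(population)

# Sampling: draw a simple random sample of 250 without replacement.
sample = random.sample(population, 250)

# Statistic: a value computed from the sample only; it estimates the parameter.
sample_mean = statistics.mean(sample)

print(f"Parameter (population mean age): {population_mean:.2f}")
print(f"Statistic (sample mean age):     {sample_mean:.2f}")
```

In practice the parameter is usually unknown, which is why the statistic is used as its estimate.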

CHAPTER 2: Introduction to Methods of Data Collection, Organization & Presentation
Learning Objectives
At the end of this chapter, the student will be able to:
1. Identify the different methods of data organization and presentation
2. Understand the criteria for selecting a method to organize and present
data
3. Identify the different methods of data collection and the criteria used to
select a method of data collection

Types and Method of Data Collection


Collection of data implies a systematic and meaningful assembly of information
for the accomplishment of the objective of a statistical investigation. It refers to the


methods used to gather the required information from the units under
investigation. The quality of the data greatly affects the final output of an investigation.
Hence, utmost care should be given to the data collection process and every
possible precaution should be taken to ensure accuracy while collecting data.
Otherwise, with inaccurate and inadequate data, the whole analysis is likely to be
faulty and the decisions taken will be misleading.

2.1. Source of Data


Statistical data may be obtained from either a primary or a secondary source. A
primary source is a source from which first-hand information is gathered. A
secondary source, on the other hand, is one that makes available data which were
collected earlier by some other agency; it may be published or unpublished.
Published sources include publications of research institutions, publications of
financial & commercial institutions, different reports, etc.
Unpublished sources include records maintained by private firms & business
houses who may not like to release their data to outsiders.

2.2. Method of Data Collection


A) Method of Primary Data Collection
The objective of the survey, the nature of the information required, operational
feasibility & cost often determine the choice among the various methods of data
collection.
Data can be collected by any one or more of the following methods:

i) Direct Observation
In this approach, an investigator stays at the place of the survey and notes down
first-hand information. Direct observation can be used to discover a variety of
information including consumer behavior, working methods & other aspects of
social & economic behavior. Direct observation is more experimental and is usually
applied in scientific studies. It is time consuming and costly. The method is also
highly subjective.
ii) Interview Method
It is a conversation between two parties, initiated by the interviewer in order to
obtain the required information. The interviewer prepares in advance a series of questions
selected for his/her work & conducts the interview. Interviewing is a


technique that is primarily used to gain an understanding of the underlying reasons
and motivations for people's attitudes, preferences or behavior. Interviews can be
undertaken on a personal one-to-one basis or in a group. They can be conducted at
work, at home, in the street or in a shopping centre, or at some other agreed location.
The interview may be face to face or by telephone.
• A face-to-face interview is advantageous for questioning a person's motives &
attitudes about some characteristic or behavior.
• A telephone interview is relatively less time consuming.
Limitations:
• Respondents are sometimes unwilling & reluctant to supply the
information.
• Respondents differ in ability & motivation in clearly supplying the
information.
• Requires a highly experienced & skilled interviewer.
• The personal bias & prejudice of the interviewer may affect the result.
• A telephone interview excludes those who don't have a telephone.
iii) Questionnaire Method
Under this method, a list of questions related to the survey is prepared and sent to
the various respondents by hand, post, website, email, etc. However, this method
cannot be used if the respondent is illiterate.
The following are the major points that we need to take into account while
preparing the questionnaire. The number of questions should be small.
Respondents are naturally not comfortable with lengthy questionnaires, which
usually bore them. If a lengthy questionnaire is unavoidable,
it should preferably be divided into two or more parts.
The questions should be short, clear, simple, and unambiguous. Moreover, the
questions must be arranged in a logical order so that a natural and spontaneous
reply to each is induced. For instance, it is not appropriate to ask a person how
many packets of cigarettes he/she smokes before asking whether he/she smokes or
not.

Questions of a sensitive nature should be avoided. Sensitive questions are those
questions that are too personal or pecuniary, such as source of income, drinking habits,
etc. The logic here is that respondents do not willingly answer sensitive questions.


Such information, if necessary, may be gathered through interviews or through
other indirect questions.
Mail questionnaires should be accompanied by a covering letter, which should
state the purpose of the questionnaire, promise confidentiality of responses, etc.
Furthermore, the questions should preferably be designed so that they can easily be
answered as yes/no.

Assignment 1: Prepare a questionnaire for collecting data from students at Jimma
University to study their awareness about HIV/AIDS and related factors.

B) Use of documentary sources: Clinical and other personal records, death
certificates, published mortality statistics, census publications, etc. Examples
include:
1. Official publications of the Central Statistical Authority
2. Publications of the Ministry of Health and other ministries
3. Newspapers and journals
4. International publications such as those of WHO, the World Bank and
UNICEF
5. Records of hospitals or other health institutions
Although data from documents are less time consuming and
relatively cheap to obtain, care should be taken regarding the quality and completeness of
the data. There could be differences in objectives between the primary author of
the data and the user.

Types of Questions
1. Open-ended Questions: permit free responses that should be recorded in the
respondent’s own words. The respondent is not given any possible answers to
choose from. Such questions are useful to obtain information on:
- Facts with which the researcher is not very familiar,
- Opinions, attitudes, and suggestions of informants, or
- Sensitive issues
2. Closed Questions: offer a list of possible options or answers from which the
respondents must choose. When designing closed questions one should try to:
- Offer a list of options that are exhaustive and mutually exclusive,
and


- Keep the number of options as few as possible.
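
The guideline that closed-question options be exhaustive and mutually exclusive can be checked mechanically when the options are numeric ranges. The following is a minimal sketch; the function name `check_options` and the age brackets are hypothetical and serve only as an illustration.

```python
def check_options(options, lowest, highest):
    """Check integer-range options for gaps and overlaps.

    options: list of (label, low, high) tuples with inclusive bounds.
    Returns (gaps, overlaps): values covered by no option, and values
    covered by more than one option, within lowest..highest.
    """
    covered = {}
    for label, low, high in options:
        for value in range(low, high + 1):
            covered.setdefault(value, []).append(label)
    gaps = [v for v in range(lowest, highest + 1) if v not in covered]
    overlaps = sorted(v for v, labels in covered.items() if len(labels) > 1)
    return gaps, overlaps

# Hypothetical answer options for "What is your age?" (respondents aged 15-49).
good = [("15-24", 15, 24), ("25-34", 25, 34), ("35-49", 35, 49)]
bad = [("15-25", 15, 25), ("25-34", 25, 34), ("36-49", 36, 49)]

print(check_options(good, 15, 49))  # no gaps, no overlaps
print(check_options(bad, 15, 49))   # age 35 has no option; age 25 has two
```

In the second set, a 25-year-old could tick two boxes (not mutually exclusive) and a 35-year-old could tick none (not exhaustive).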

Types of Survey: Census and Sample Method


Under the census or complete enumeration survey method, data are collected for
each and every unit (patients, household, field, shop, factory etc.) of the population
(universe), which is the complete set of items, which are of interest in any
particular situation.

Sample survey is simply the process of learning about the population on the basis
of a sample drawn from it. Thus in the sampling technique instead of every unit of
the universe only a part of the universe is studied and the conclusions are drawn on
that basis for the entire universe. A sample is a subset of population units. The
process of sampling involves three elements:
a. Selecting the sample.
b. Collecting the information, and
c. Making an inference about the population.

Advantages of sampling over census

The sampling technique has the following merits over the complete enumeration
survey:
i) Less Time-consuming: Since the sample is a study of a part of the population,
considerable time and labour are saved when a sample survey is carried out.
Time is saved not only in collecting data but also in processing it.
ii) Less Cost: Although the amount of effort and expense involved in collecting
information is generally greater per unit in a sample than in a complete census,
the total financial burden of a sample survey is generally less than that of a
complete census. This is because in sampling we study only a
part of the population, so the total expense of collecting data is less than that
required when the census method is adopted. This is a great advantage,
particularly in an underdeveloped economy where much of the information
would be difficult to collect by the census method for lack of adequate
resources.
iii) More Reliable Results: Although the sampling technique involves certain
inaccuracies owing to sampling errors, the result obtained is generally more
reliable than that obtained from a complete count. There are several reasons


for it. First, it is always possible to determine the extent of sampling errors.
Secondly, other types of errors to which a survey is subject, such as
inaccuracy of information, incompleteness of returns, etc., are likely to be
more serious in a complete census than in a sample survey. This is because
more effective precautions can be taken in a sample survey to ensure that
information is accurate and complete. For these reasons, not only can the total
error be expected to be smaller in a sample survey, but the sample result can also
be used with a greater degree of confidence because of our knowledge of the
probable size of the error. Thirdly, it is possible to use the services of experts
and to give thorough training to the investigators in a sample survey, which
further reduces the possibility of errors. Follow-up work can also be
undertaken much more effectively in the sampling method. Indeed, even a
complete census can only be tested for accuracy by some type of sampling
check.
iv) More Detailed Information: Since the sampling technique saves time and
money, it is possible to collect more detailed information in a sample survey.
For example, if the population in a survey of people's consumption patterns
consists of 1000 persons, the two alternative techniques available are
as follows: We may collect the necessary data from each one of the 1000
people through a questionnaire containing, say, 10 questions (census method);
or
we may take a sample of 100 persons (i.e., 10% of the population) and prepare a
questionnaire containing as many as 100 questions. The expenses involved in
the latter case would be almost the same as in the former, but it would provide ten
times more information on each person surveyed.
v) Sampling Method is the only Method that can be used in Certain Cases:
There are some cases in which the census method is inapplicable and the only
practicable means is provided by the sample method. For example, if one is
interested in testing the breaking strength of chalks manufactured in a factory,
under the census method all the chalks would be broken in the process of
testing. Hence, the census method is impracticable and the sample method
must be used.
vi) The Sample Method is often used to Judge the Accuracy of the
Information Obtained on a Census Basis: For example, in the population


census, which is conducted periodically (every 10 years in our country), the field
officers employ the sample method to determine the accuracy of the information
obtained by the enumerators on the census basis.
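
The arithmetic behind item (iv) above can be written out as a short check. The sketch assumes, as a simplification, that the cost of a survey is roughly proportional to the total number of answers collected; the variable names are illustrative only.

```python
population_size = 1000

# Census method: every person answers a short questionnaire.
census_questions = 10
census_answers = population_size * census_questions

# Sample method: 10% of the population answers a longer questionnaire.
sample_percent = 10
sample_size = population_size * sample_percent // 100
sample_questions = 100
sample_answers = sample_size * sample_questions

# Same total number of answers (a rough proxy for cost) ...
print(census_answers, sample_answers)
# ... but ten times as much detail on each person surveyed.
print(sample_questions / census_questions)
```

Under this assumption the two designs collect the same number of answers, while the sample design records ten times as many items per respondent.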
Demerits
Despite the various advantages of sampling, it is not completely free from
limitations.
i. A sample survey must be carefully planned and executed, otherwise the results
obtained may be inaccurate and misleading. Of course, care must be taken even
for a complete count, but serious errors may arise in sampling if the
sampling procedure is not sound.
ii. Sampling generally requires the services of experts. In the absence of
qualified and experienced persons, the information obtained from sample
surveys cannot be relied upon. In India, a shortage of experts in the sampling
field is a serious hurdle in the way of reliable statistics.

iii. When the sampling plan is very complicated, it may require more
time, labor and money than a complete count. This is so if the size of the sample
is a large proportion of the total population and if complicated weighting
procedures are used. With each additional complication in the survey, the
chances of error multiply and greater care has to be taken, which in turn
needs more time and labor.

iv. If the information is required for each and every unit in the domain of study,
a complete enumeration survey is necessary.

Choosing the Sources of Data

Decision-makers should consider the advantages and disadvantages of each data source
and choose the one that leads to information that is cost-effective, relevant, timely
and important for immediate use.

Advantages of Primary Data

• Primary data give more reliable, accurate and adequate information that
suits the objective and purpose of an investigation.
• A primary source usually shows data in greater detail.
• Primary data are free from the errors that may arise from copying figures from
publications, which is the case with secondary data.


Disadvantages of Primary Data

• The process of collecting primary data is time consuming and costly.
• Primary data often give misleading information due to lack of integrity of
investigators and non-cooperation of respondents in providing answers to
certain delicate questions.

 Advantage of Secondary Data


 It is readily available and hence convenient and much quicker to certain
than primary data,
 It reduces time, cost and effort as compared to primary data,
 Secondary data may be available in subjects(cases) where it is impossible to
collect primary data, for example in regions where there is war.
 Some Disadvantage of Secondary Data:
 Data obtained may not be sufficiently accurate,
 Data that exactly suit our purpose(according to the want) may not be
found,
 Error may be made while copying figures.

2.3 Data Presentation (cross sectional & time series data)

2.3.1 Tabular Presentation (Frequency Distribution)


Frequency: - is the number of times a certain value of the variable is repeated in a
given class.
Frequency Distribution: - is a table that shows data classified into a number of
classes, with the number of observations falling in each class (the frequency).

Types of Frequency Distribution

a) Categorical Frequency Distribution: - here the classification criterion is qualitative,
i.e. a qualitative random variable is used.

Tables
The use of tables for presenting data involves grouping the data into mutually
exclusive categories of the variable and counting the number of occurrences
(frequency) to each category.


Based on the purpose for which the table is designed and the complexity of the
relationship, a table could be either of simple frequency or cross-tabulation.
The simple frequency table is used when the individual observations involve only
a single variable, whereas the cross-tabulation is used to obtain the frequency
distribution of one variable by the subsets of another variable.
Principles of Table Construction
1. Tables should be as simple as possible.
2. Tables should be self-explanatory.
3. If data are not original, their source should be given in a footnote.
Examples:
a) Simple Frequency Table (Qualitative Data)
Table 1: Overall Immunization Status of children in Adami Tulu Woreda, Feb. 1995
Immunization Status Number Percent
Not Immunized 75 35.7
Partially Immunized 57 27.1
Fully Immunized 78 37.2
Total 210 100.0
Source: Fikru T. et al. EPI Coverage in Adami Tulu, Eth. J. Health Dev. 1997;
11(2): 109-113
Tables of the above type are also known as simple or one-way tables

b) Simple Frequency Table (Quantitative Discrete Data)


Table 2: Prevalence of S. mansoni infection among elementary and junior school children by
age, Wendo Genet, April 1994
Age (in yrs) No. Examined No. of positive Percent
5-9 143 38 26.5
10-14 322 94 29.2
15-19 55 25 46.3
Total 520 157 30.2
Source: Belay, R. et al. Magnitude of Schistosoma mansoni and intestinal helminthic infections
among school children in Wendo Genet Zuria, Eth. J. Health Dev. 1997; 11(2):125-129

c) Cross Tabulations
Table 3: TT Immunization Status by Marital Status of women of child-bearing age,
Assendabo town, 1996
                       Immunization Status
Marital        Immunized        Non-Immunized
Status         No.     %        No.     %        Total
Single          58    24.7      177    75.3       235
Married        156    34.7      294    65.3       450
Divorced        10    35.7       18    64.3        28
Widowed          7    50.0        7    50.0        14
Total          231    31.8      496    68.2       727
Source: Mickael A. et al. Tetanus Toxoid immunization coverage among women of child-bearing
age in Assendabo town, Bulletin of JIHS, 1996, 7(1):13-20

Tables of the above type are also known as Two-way tables.


d) Higher Order Table: used when it is desired to
represent three or more characteristics in a single table.
Table 4: Distribution of Health Professionals by Sex and Residence
                              Residence
Profession   Sex       Urban        Rural         Total
Doctor       Male      8(10.0)      35(21.0)      43(17.7)
             Female    2(3.0)       16(10.0)      18(7.4)
Nurse        Male      46(58.0)     36(22.0)      82(33.7)
             Female    23(29.0)     77(47.0)      100(41.2)
Total                  79(100.0)    164(100.0)    243(100.0)

Advantages of Tabular Layout


 It enables required figures to be located more quickly.
 It enables comparisons between different classes to be made more
easily.
 It reveals patterns within the figures which can’t be seen in the narrative
form.
 It often takes up less room.

b) Numerical Frequency Distribution: - Here the classification criterion is quantitative. It is
of two types: simple (ungrouped) frequency distribution and grouped
frequency distribution.
i) Simple (Ungrouped) Frequency Distribution: - is a distribution that lists the individual data
values along with their frequencies.
* Usually used when the data range is small.
E.g. Raw data on the number of children per family in a certain community are given as,


0,2,3,1,1,3,4,2,0,3,4,2,2,1,0,4,1,2,2,3
Construct the ungrouped frequency distribution, with relative frequency (Rf) and percentage
frequency (Pf).
No. of children per family   Frequency   Rf             Pf
0                            3           3/20 = 0.15    15
1                            4           4/20 = 0.20    20
2                            6           6/20 = 0.30    30
3                            4           4/20 = 0.20    20
4                            3           3/20 = 0.15    15
Total                        20          1              100
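
The table above can be reproduced with a short script. This is only a sketch in Python; the function name `ungrouped_frequency_table` is illustrative and does not come from the text.

```python
from collections import Counter
from fractions import Fraction

# Raw data from the example: number of children per family
children = [0, 2, 3, 1, 1, 3, 4, 2, 0, 3, 4, 2, 2, 1, 0, 4, 1, 2, 2, 3]

def ungrouped_frequency_table(data):
    """Return rows of (value, frequency, relative frequency, percentage frequency)."""
    counts = Counter(data)          # frequency of each distinct value
    n = len(data)                   # total number of observations
    return [(value, f, Fraction(f, n), 100 * f / n)
            for value, f in sorted(counts.items())]

for value, freq, rf, pf in ungrouped_frequency_table(children):
    print(value, freq, float(rf), pf)
```

The relative frequencies sum to 1 and the percentage frequencies sum to 100, as in the Total row.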

Cumulative Frequency Distribution: - is a frequency distribution that displays the
sum of the frequencies of all classes up to, or from, a given class.

There are two types of cumulative frequency: -
a) Less than cumulative frequency (Lcf): used when interest focuses on the total
number of observations below a specified value.
b) More than cumulative frequency (Mcf): used when interest
focuses on the total number of observations above a specified value.
E.g.
Class frequency Lcf Mcf
0 3 3 20
1 4 7 17
2 6 13 13
3 4 17 7
4 3 20 3
Total 20
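
As a rough illustration, the Lcf and Mcf columns above are running totals taken from the top and from the bottom of the frequency column. The helper name below is an assumption made for this sketch.

```python
from itertools import accumulate

# Class frequencies from the table above (classes 0, 1, 2, 3, 4)
freqs = [3, 4, 6, 4, 3]

def cumulative_frequencies(freqs):
    """Return (Lcf, Mcf) lists for a list of class frequencies."""
    lcf = list(accumulate(freqs))                   # running total from the first class
    mcf = list(accumulate(reversed(freqs)))[::-1]   # running total from the last class
    return lcf, mcf

lcf, mcf = cumulative_frequencies(freqs)
print(lcf)
print(mcf)
```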

ii) Grouped Frequency Distribution: -

When we deal with large sets of data, a good overall picture and sufficient
information can often be conveyed by grouping the data into a number of classes.
To group a set of observations we select a set of contiguous non-overlapping
intervals such that each value in the set of observations can be placed in one, and
only one of the intervals, called class intervals.
One of the first considerations when data are to be grouped is how many intervals
to include. We seldom use fewer than 6 or more than 15 class intervals. If there are
fewer than six intervals the data have been summarized too much and the
information they contain has been lost; the exact number we use in a given

situation will depend mainly on the number of measurements or observations we
have to group.

Sturges' formula for deciding the number of class intervals is given by:
k = 1 + 3.322 log10 n,
where k = number of class intervals and n = number of observations.
There are two types of grouped frequency distribution:
a) Inclusive: the upper limit of one class does
not coincide with the lower limit of the next class.
b) Exclusive: the upper limit of one class
coincides with the lower limit of the next class.

Example 2.1 The following table illustrates the inclusive type of frequency distribution.
It summarizes data on the weights of 30 adults (aged less than 20):

Class (weight)   Frequency
11 ─ 15          2
16 ─ 20          3
21 ─ 25          3
26 ─ 30          5
31 ─ 35          6
36 ─ 40          6
41 ─ 45          3
46 ─ 50          2
Total            30

The range for the above ungrouped data is 49 - 12 = 37. Normally it is desirable to divide the
range into 6 to 10 classes. Consider the class 11 - 15. If an adult has weight of 11 or 15, he/she
will be put in this class. For this class, 11 is the lower limit and 15 is the upper limit and both are
included in the class. But in case of 'exclusive' frequency, mostly one of the limits of class is
excluded from the class; the above frequency distribution can be reformed in the following
exclusive way also:

Class (weight)   Frequency
10 ─ 15          2
15 ─ 20          2
20 ─ 25          4
25 ─ 30          3
30 ─ 35          6
35 ─ 40          6
40 ─ 45          4
45 ─ 50          3
Total            30

- The measure 20 is now included in the class 20 - 25 and not in 15 - 20.
- In the table above, the data is grouped in class intervals of 5. It can also be
grouped into class intervals of 10 or 15. The table can also have unequal classes,
but this is not desirable.
- The very purpose of grouping will be lost if there are too few, too many, or
unequal intervals.

An inclusive frequency distribution (Table a) can be converted into an exclusive
one (Table b) as follows:

Table a (Inclusive fd)            Table b (Exclusive fd)
Class (weight)   Frequency        Class (weight)   Frequency
11 ─ 15          2                10.5 ─ 15.5      2
16 ─ 20          3                15.5 ─ 20.5      3
21 ─ 25          3                20.5 ─ 25.5      3
26 ─ 30          5                25.5 ─ 30.5      5
31 ─ 35          6                30.5 ─ 35.5      6
36 ─ 40          6                35.5 ─ 40.5      6
41 ─ 45          3                40.5 ─ 45.5      3
46 ─ 50          2                45.5 ─ 50.5      2
Total            30               Total            30

Terms (Definition): -
1. Class Limit: - the values that separate one class from another, with a gap between
consecutive classes. The two class limits are:-
■Upper class limit (UCL)
■Lower class limit (LCL)
There is a gap between UCLi and LCLi+1.
Unit of measurement (U):- is the smallest possible distance between two consecutive
measures.
U is usually taken as 1.
2. Class Boundary:-have two parts. These are
■Upper class boundary (Ucb)
■Lower class boundary (Lcb).


→Separate one class from the other.


→No gap between Ucbi &Lcbi+1
Lcbi =LCLi -U/2
Ucbi =UCLi+ U/2
E.g.
Class Limit Class Boundary
1-5 0.5-5.5
6-10 5.5-10.5
11-15 10.5-15.5
16-20 15.5-20.5
3. Class Width (w): -Is the difference between lower & upper class boundary of any
Class.
W=Ucbi-Lcbi
Or W = UCLi+1-UCLi
Or W = LCLi+1-LCLi
Or W = Lcbi+1-Lcbi
Or W = Ucbi+1-Ucbi
4. Class Mark (Mid Point =Mi)
Is the midpoint of a class interval or the average value of the lower and upper class limits
i.e Mi= (LCLi+ UCLi)/2 , Mi= (Lcbi+Ucbi)/2
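
The definitions above (class boundary, class width, class mark) can be sketched in a few lines. The function name `describe_classes` and the dictionary keys are illustrative choices, and U is assumed to be 1 by default as stated above.

```python
def describe_classes(limits, unit=1):
    """For each (LCL, UCL) pair return its boundaries, width and class mark."""
    rows = []
    for lcl, ucl in limits:
        lcb = lcl - unit / 2          # lower class boundary
        ucb = ucl + unit / 2          # upper class boundary
        rows.append({
            "limits": (lcl, ucl),
            "boundaries": (lcb, ucb),
            "width": ucb - lcb,       # W = Ucb - Lcb
            "mark": (lcl + ucl) / 2,  # M = (LCL + UCL) / 2
        })
    return rows

# Class limits from the example table above
limits = [(1, 5), (6, 10), (11, 15), (16, 20)]
for row in describe_classes(limits):
    print(row)
```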

Steps Needed to Construct Grouped Frequency Distribution


1. Calculate the range (R)
R = Xmax - Xmin
2. Calculate the number of classes using Sturges' formula
k = 1 + 3.322 log10 n, where k = number of classes
n = number of observations
n = Σfi
Round k to a whole number, e.g. k = 4.5 ≈ 5.
3. Calculate the class width
W = R/k, rounded up to the next whole number.
4. Identify the starting point: - LCL1 = Xmin
LCL2 = Xmin + W, and so on.
E.g. Construct a grouped frequency distribution for the following raw data.
11, 29, 6, 33, 14, 31, 22, 27, 19, 20,
21, 18, 17, 22, 38, 23, 26, 34, 39, 27

1. R= Xmax-Xmin, 39-6=33
2. K=1+3.322 log20 =5.32 ~ 5
3. W=R/K , 33/5 = 6.6~ 7
4. Determine LCL1=Xmin=6

Class limit   Frequency   Mi   Class boundary   Lcf   Mcf
6-12          2           9    5.5-12.5         2     20
13-19         4           16   12.5-19.5        6     18
20-26         6           23   19.5-26.5        12    14
27-33         5           30   26.5-33.5        17    8
34-40         3           37   33.5-40.5        20    3
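
The four construction steps can be sketched as follows. This is a minimal illustration that assumes whole-number data (U = 1) and rounds k to the nearest whole number, as the worked example does; the function name is hypothetical.

```python
import math

def grouped_frequency_distribution(data):
    """Return a list of (LCL, UCL, frequency) for whole-number data."""
    r = max(data) - min(data)                 # step 1: range
    k_exact = 1 + 3.322 * math.log10(len(data))
    k = math.floor(k_exact + 0.5)             # step 2: Sturges' k, rounded to nearest
    w = math.ceil(r / k)                      # step 3: class width, rounded up
    lower = min(data)                         # step 4: first lower class limit
    table = []
    while lower <= max(data):
        upper = lower + w - 1                 # inclusive upper class limit (U = 1)
        freq = sum(lower <= x <= upper for x in data)
        table.append((lower, upper, freq))
        lower += w
    return table

raw = [11, 29, 6, 33, 14, 31, 22, 27, 19, 20,
       21, 18, 17, 22, 38, 23, 26, 34, 39, 27]
for row in grouped_frequency_distribution(raw):
    print(row)
```

For the 20 observations above this gives the same five classes and frequencies as the table.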

2.3.2. Diagrammatic Presentation of Data


An appropriately drawn graph allows readers to rapidly obtain an overall grasp of the
data presented. The relationship between numbers of various magnitudes can
usually be seen more quickly and easily from a graph than from a table.

The choice of the particular form among the different possibilities will depend on
personal choices and/or the type of the data.
 Bar chart and pie chart are commonly used for quantitative or qualitative
discrete data
 Histograms and frequency polygons are used for quantitative continuous
data.
i) Bar-Chart (Bar diagram): A series of equally spaced bars having equal width (base),
where the height of each bar represents the frequency (or amount) associated with each class.
 Usually applied for categorical random variables.
 A bar chart could be either vertical or horizontal.
 There are various types of bar charts
1. Simple Bar Charts: Represent data by a series of bars, the height (length) of
each bar indicating the size of the figure represented.

Example: The following table shows the (arbitrary) number of students in the
faculty of Medical Sciences. Show these numbers graphically, using simple bar
chart.
                     Year
Sex       I     II    III   IV    V
Female    25    20    15    15    10
Male      55    55    50    55    50
Total     80    75    65    70    60

Fig 1: Vertical bar chart for number of students in the Faculty of Medical
Sciences (in 2002); vertical axis: No. of students
2. Component Bar Charts: are like ordinary bar charts except that the bars are
subdivided into component parts. This sort of chart is constructed when
each total figure is built up from two or more component figures.

Fig 2 : Number of students in the Faculty of Medical Sciences (in 2002)


3. Multiple Bar Charts: The component figures are shown as separate bars
adjoining each other. The height of each bar represents the actual value of
the component figure


Fig 3: Number of students in the Faculty of Medical Sciences (in 2002)


ii) Pie Chart

A pie chart is a circle divided by radial lines into sections (like slices of a cake
or a pie; hence the name) so that the area of each section is proportional to the
size of the figure presented. It is a convenient way of showing the sizes of
component figures in proportion to each other and to the overall total.

Fig 4: Number of students in the Faculty of Medical Sciences (in 2002)


iii) Pictogram: it represents the magnitude of certain things by their pictures.
- It is not frequently used.


2.3.3 Graphic Presentation of Data


1. Line Graph
Suppose you are provided with data on the number of patients who visited a certain health center
(month-wise for the year 2004), as given in the table below.
Month Number of visitors
January 76
February 85
March 86
April 90
May 82
June 98
July 105
August 108
September 110
October 115
November 118
December 106

A line graph is appropriate when we need to present the movement or variation in a variable. It is
quite simple to draw and indicates the increase or decrease in a variable over time or across
observations. Line graphs can be used for discrete data. Recall that in the case of continuous data
we assumed that the average value of each class is its midpoint. Thus we can plot the frequencies
for each class against its midpoint and join these points to obtain a line graph.

Fig 5: Line graph


2. Histogram: usually used to present quantitative data.
It is a graph consisting of a series of rectangles whose bases are equal to the class widths of the
corresponding classes and whose heights are proportional to the class frequencies.
 It is constructed from a grouped frequency distribution.
 In histogram we use class boundaries in the X-axis.


E.g. Consider the data on time (in hours) that 20 college students devoted to leisure activities
during a typical school week:
Class limit class boundary frequency
6-10 5.5-10.5 1
11-15 10.5-15.5 2
16-20 15.5-20.5 3
21-25 20.5-25.5 5
26-30 25.5-30.5 4
31-35 30.5-35.5 3
36-40 35.5-40.5 2
Total 20

Fig 6: Histogram (vertical axis: frequency; horizontal axis: class boundary)
3. Frequency Polygon: is a line graph that displays the data using a line that connects
points plotted for the frequencies at the class marks, i.e. the height of the point above
each class mark represents the frequency of that class.
 A frequency polygon can also be super imposed on a histogram.
Fig. 7: Frequency polygon superimposed on a histogram (vertical axis: frequency;
horizontal axis: class boundaries 5.5, 10.5, 15.5, 20.5, 25.5, 30.5, 35.5, 40.5)


4. Cumulative Frequency Polygon (Ogive):
This is a line graph obtained by plotting the cumulative frequency distribution(y- axis) against
class boundaries (x-axis).
E.g.
Class boundary fi Lcf Mcf
5.5-10.5 1 1 20
10.5-15.5 2 3 19
15.5-20.5 3 6 17
20.5-25.5 5 11 14
25.5-30.5 4 15 9
30.5-35.5 3 18 5
35.5-40.5 2 20 2
Total 20

Fig 8: Less than ogive (vertical axis: cf; horizontal axis: class boundaries 5.5 to 40.5)
Fig 9: More than ogive (vertical axis: cf; horizontal axis: class boundaries 5.5 to 40.5)
Fig 10: Mcf & Lcf ogives with their intersection, which falls at the median on the
class boundary axis

Exercise. The following table is a grouped frequency distribution of money spent per visit by a
random sample of 100 customers at a department store.

Amount spent   No. of customers
3-7            10
8-12           30
13-17          35
18-22          20
23-27          5
Total          100

i) For each of the above classes state: -
a) class limit
b) class boundary
c) the class width
d) the class mark
ii) Construct the relative frequency distribution.
iii) Construct a histogram and superimpose the frequency polygon.
iv) Construct both less than and more than types of ogive.


CHAPTER 3: Measures of Central Tendency


Learning objectives
At the end of this chapter, the student will be able to:
1. Identify the different methods of data summarization
2. Compute appropriate summary values for a set of data
3. Appreciate the properties and limitations of summary values

Why measure of central tendency:


 To comprehend the data easily (to describe the center of the distribution).
 To facilitate comparison.
 To make further statistical analysis.
3.1. Introduction

When we want to make comparisons between groups of numbers, it is good to have a single value
that is considered a good representative of each group. This single value is called the
average of the group. The tendency of statistical data to get concentrated at certain
values is called “Central Tendency” and the various methods of determining the
actual value at which the data tend to concentrate are called measures of central
tendency or averages. An average that is representative is called a typical average, and
an average that is not representative and has only a theoretical value is called a descriptive
average.

 A typical average should possess the following properties.


o It should be strictly defined.
o It should be based on all observation under investigation.
o It should be affected as little as possible by extreme observations.
o It should be capable of further algebraic treatment.
o It should be easy to calculate and simple to understand.

3.2. Types of Central Tendency

There are different types of measure of central tendency (and measure of position)
o Mean (Arithmetic, Geometric, and Harmonic)
o Median (the middle value)
o Mode (the most frequently appearing value)
o Quantiles (quartiles, Deciles, percentiles)


The choice of average depends on which one best fits the property under discussion.

Arithmetic mean ($\bar{x}$):

 Is defined as the value each item in the distribution would have if all the values
were shared out equally among all the items.
 Is the measure to which we usually refer in everyday life when we use the word
“average.”
 Obtained by adding all the values in a population or sample and dividing by the
number of values that are added.
• For raw data:
Sample mean: $\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Population mean: $\mu = \frac{X_1 + X_2 + \dots + X_N}{N} = \frac{1}{N}\sum_{i=1}^{N} X_i$
• For an ungrouped frequency distribution:
$\bar{x} = \frac{1}{\sum_{i=1}^{k} f_i}\sum_{i=1}^{k} f_i x_i$, where k is the number of classes
and fi is the number of occurrences of xi.

Example: obtain the mean of the following numbers 2,7,8,2,7,3,7


Solution
xi fi xifi
2 2 4
3 1 3
7 3 21
8 1 8
Total 7 36
$\bar{x} = \frac{1}{\sum_{i=1}^{k} f_i}\sum_{i=1}^{k} f_i x_i = \frac{36}{7} \approx 5.14$
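
As a quick check of this example, the mean computed from the frequency table should agree with the mean of the raw numbers. The sketch below uses an illustrative function name that is not part of the text.

```python
from collections import Counter

def mean_from_frequencies(table):
    """table: list of (x_i, f_i) pairs; returns sum(f_i * x_i) / sum(f_i)."""
    total_fx = sum(f * x for x, f in table)
    total_f = sum(f for _, f in table)
    return total_fx / total_f

data = [2, 7, 8, 2, 7, 3, 7]
table = sorted(Counter(data).items())   # [(2, 2), (3, 1), (7, 3), (8, 1)]
print(round(mean_from_frequencies(table), 2))
```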
• For grouped data: $\bar{x} = \frac{1}{\sum_{i=1}^{k} f_i}\sum_{i=1}^{k} f_i x_i$

where xi is the class mark of the ith class and fi is the frequency of the ith class.
Example: calculate the mean for the following age distribution.
Class   Frequency
6-10    35
11-15   23
16-20   15
21-25   12
26-30   9
31-35   6
o Solutions:
o First find the class marks.
o Find the product of frequency and class marks
o Find mean using the formula.

Class fi xi xifi
6-10 35 8 280
11-15 23 13 299
16-20 15 18 270
21-25 12 23 276
26-30 9 28 252
31-35 6 33 198
Total 100 1575

$\bar{x} = \frac{1}{\sum_{i=1}^{k} f_i}\sum_{i=1}^{k} f_i x_i = \frac{1}{100}(1575) = 15.75$
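
The three solution steps (class marks, products, mean) can be sketched as below. The function name `grouped_mean` is an assumption for illustration, and class marks are taken as the average of the class limits, as defined earlier.

```python
def grouped_mean(classes):
    """classes: list of (LCL, UCL, frequency); uses class marks as x_i."""
    total_f = sum(f for _, _, f in classes)
    total_fx = sum(f * (lcl + ucl) / 2 for lcl, ucl, f in classes)  # f_i * class mark
    return total_fx / total_f

age_classes = [(6, 10, 35), (11, 15, 23), (16, 20, 15),
               (21, 25, 12), (26, 30, 9), (31, 35, 6)]
print(grouped_mean(age_classes))
```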

Exercise: - Marks of 75 students are summarized in the following frequency distribution:


Marks Frequency
40-44 7
45-49 10
50-54 22
55-59 f1
60-64 f2
65-69 6
70-74 3
If 20% of the students have marks between 55 and 59,
o Find the missing frequency
o Find the mean of the distribution.

Special properties of A. M
i) The sum of the deviations of a set of items from their mean is always zero, i.e.
$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$
Proof: $\sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0$
ii) If $\bar{x}_1$ is the mean of $n_1$ observations, $\bar{x}_2$ is the mean of $n_2$ observations, ..., and $\bar{x}_k$
is the mean of $n_k$ observations, then the mean of all the observations in all groups, often called the
combined mean, is given by: -
$\bar{x}_c = \frac{n_1\bar{x}_1 + n_2\bar{x}_2 + \dots + n_k\bar{x}_k}{n_1 + n_2 + \dots + n_k} = \frac{\sum_{i=1}^{k} n_i\bar{x}_i}{\sum_{i=1}^{k} n_i}$
Example: - In a class there are 30 females and 70 males. If the females averaged 60 in an examination
and the males averaged 72, find the mean of the entire class.

Solution: -
Females: $\bar{x}_1 = 60$, $n_1 = 30$
Males: $\bar{x}_2 = 72$, $n_2 = 70$
$\bar{x}_c = \frac{n_1\bar{x}_1 + n_2\bar{x}_2}{n_1 + n_2} = \frac{30(60) + 70(72)}{30 + 70} = \frac{1800 + 5040}{100} = 68.4$
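
A minimal sketch of the combined mean, with a hypothetical helper name, reproduces the value 68.4 for this class:

```python
def combined_mean(groups):
    """groups: list of (n_i, mean_i); returns the mean of all observations pooled."""
    total_n = sum(n for n, _ in groups)
    total_sum = sum(n * mean for n, mean in groups)   # n_i * mean_i is the group total
    return total_sum / total_n

class_groups = [(30, 60), (70, 72)]   # (females, males)
print(combined_mean(class_groups))
```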
iii. If a wrong figure has been used when calculating the mean, the correct mean can be
obtained without repeating the whole process using:
Correct Mean = Wrong Mean + (Correct Value − Wrong Value)/n, where n is the total number of
observations.
Example: - The average weight of 10 students was calculated to be 65 kg. Later it was discovered
that one weight was misread as 40 kg instead of 80 kg. Calculate the correct average weight.
Correct Mean = Wrong Mean + (Correct Value − Wrong Value)/n = 65 + (80 − 40)/10 = 69 kg.
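
The correction rule can be written as a one-line function. This is a sketch only and the name `corrected_mean` is illustrative.

```python
def corrected_mean(wrong_mean, n, wrong_value, correct_value):
    """Adjust a mean of n observations after one value was misrecorded."""
    return wrong_mean + (correct_value - wrong_value) / n

print(corrected_mean(65, 10, wrong_value=40, correct_value=80))
```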
iv) Weighted A. M
o When different importance is to be given to different data, a weighted mean is
appropriate.
o Weights are assigned to each item in proportion to its relative importance.
o Let x1, x2, …, xn be the values of the items of a series and w1, w2, ..., wn their corresponding
weights; the weighted mean, denoted by $\bar{x}_w$, is defined as: -
$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
• Example: - A student obtained the following percentages in an examination: English 60,
biology 75, mathematics 63, physics 59, and chemistry 55. Find the student's weighted
arithmetic mean if weights 1, 2, 1, 3, 3 respectively are allotted to the subjects.
• Solution: -
$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} = \frac{60(1) + 75(2) + 63(1) + 59(3) + 55(3)}{1 + 2 + 1 + 3 + 3} = \frac{615}{10} = 61.5$
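
The weighted mean in this example can be checked with a short sketch (the function name is an assumption):

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# English, biology, mathematics, physics, chemistry
marks = [60, 75, 63, 59, 55]
weights = [1, 2, 1, 3, 3]
print(weighted_mean(marks, weights))
```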
Merit and Demerits of A. M
►Advantages:-
■ It is strictly defined
■ It does not require the observations to be arranged in order
■ It is based on all observations
■ It is suitable for further algebraic treatment
■ It is a stable average, i.e. it is affected comparatively little by fluctuations of sampling.
■ It is easy to calculate and simple to understand.
► Demerits:-
■ It is much affected by extreme observations.
■ It can be a number which does not exist in the series.
■ It can't be calculated for frequency distributions with open-ended classes.
Geometric mean (G.M)
- G.M is defined as the nth root of the product of n items or values of a series.
- If there are two items, we take the square root; if there are three items, the cube root; and so on.
- Symbolically, let x1, x2, x3, …, xn be the n values of a variable x; then their G.M is defined as:
G.M = $\sqrt[n]{x_1 \cdot x_2 \cdot x_3 \cdots x_n}$ for raw data
G.M = $\left(x_1^{f_1} \cdot x_2^{f_2} \cdot x_3^{f_3} \cdots x_n^{f_n}\right)^{1/N}$ for a frequency distribution, where $N = \sum_{i=1}^{n} f_i$
- If the number of observations is three or more, computing the nth root is
tedious; to simplify the computation, logarithms are used:
log G.M = $\frac{1}{n}\sum_{i=1}^{n} \log x_i$
G.M = antilog $\left[\frac{1}{n}\sum_{i=1}^{n} \log x_i\right]$ for raw data
G.M = antilog $\left[\frac{1}{N}\sum_{i=1}^{n} f_i \log x_i\right]$ for a frequency distribution, where $N = \sum_{i=1}^{n} f_i$
Example: - Find the geometric mean of 3, 9, 27
Solution: - G.M = $\sqrt[n]{x_1 \cdot x_2 \cdots x_n} = \sqrt[3]{3 \times 9 \times 27} = \sqrt[3]{729} = 9$
Note: - The geometric mean is useful and appropriate for finding averages of ratios or growth
rates.
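
The logarithm route described above can be sketched as follows. The function name and the optional `freqs` argument are illustrative; the sketch assumes all values are strictly positive.

```python
import math

def geometric_mean(values, freqs=None):
    """G.M = antilog[(1/N) * sum(f_i * log10(x_i))]; freqs default to 1 (raw data)."""
    if freqs is None:
        freqs = [1] * len(values)
    if any(x <= 0 for x in values):
        raise ValueError("geometric mean needs strictly positive values")
    n = sum(freqs)
    log_sum = sum(f * math.log10(x) for x, f in zip(values, freqs))
    return 10 ** (log_sum / n)        # antilog of the mean logarithm

print(round(geometric_mean([3, 9, 27]), 6))
```

Passing frequencies gives the same result as repeating each value, e.g. `geometric_mean([3, 27], [2, 2])` agrees with `geometric_mean([3, 3, 27, 27])`.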
Merit and Demerits of Geometric mean (G.M)
►Advantages:-
i) It is less affected by extreme values than the arithmetic mean
ii) It is based on all observations
iii) It is suitable for further algebraic treatment
► Demerits:-
i) Its calculation is somewhat complicated
ii) It can't be meaningfully calculated if any of the values is 0
iii) If one or more of the values are negative, either the geometric mean can't be
calculated or an absurd value will be obtained.
Harmonic Mean (H.M)


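The harmonic mean of x1, x2, x3, …, xn is denoted by H.M and defined as:
H.M = $\frac{1}{\frac{1}{n}\sum_{i=1}^{n} \frac{1}{x_i}} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$
And in the case of a frequency distribution:
H.M = $\frac{1}{\frac{1}{n}\sum_{i=1}^{k} \frac{f_i}{x_i}} = \frac{n}{\sum_{i=1}^{k} \frac{f_i}{x_i}}$, where $n = \sum_{i=1}^{k} f_i$
- If x1, x2, x3, …, xn are the values of the items of a series and w1, w2, …, wn their corresponding weights,
the weighted harmonic mean is given by:
H.Mw = $\frac{1}{\frac{1}{\sum_{i=1}^{n} w_i}\sum_{i=1}^{n} \frac{w_i}{x_i}} = \frac{\sum_{i=1}^{n} w_i}{\sum_{i=1}^{n} \frac{w_i}{x_i}}$
Example: - Find the harmonic mean of the following data: 20, 30
Solution: - H.M = $\frac{2}{\frac{1}{20} + \frac{1}{30}} = 24$
Note: - The harmonic mean is useful and appropriate in finding average speeds and average
rates.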
Merit and Demerits of H. M
►Advantages:-
i) It is based on all observations
ii) It is a good mean for a highly variable series
iii) It gives more weight to small values and less weight to large values.
► Demerits:-
i) Its calculation is complicated
ii) If any value is 0, it can't be calculated
iii) Its value is generally not a member of the series.
E.g. A driver covers a 300 km distance at an average speed of 60 km/hr and makes the return trip at
an average speed of 50 km/hr. What is his average speed for the total distance?
H.M = $\frac{2}{1/60 + 1/50} = \frac{6000}{110}$ = 54.55 km/hr
(equivalently, 600 km covered in 300/60 + 300/50 = 11 hours).
Note that A.M = $\frac{60 + 50}{2}$ = 55 km/hr
G.M = $\sqrt{60 \times 50}$ = 54.77 km/hr
In general, A.M ≥ G.M ≥ H.M
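
The three means for the driver example can be compared numerically. The sketch below uses illustrative function names and assumes positive values only.

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    return math.prod(xs) ** (1 / len(xs))     # nth root of the product

def harmonic_mean(xs):
    return len(xs) / sum(1 / x for x in xs)   # n divided by the sum of reciprocals

speeds = [60, 50]   # km/hr, out and back over the same distance
am = arithmetic_mean(speeds)
gm = geometric_mean(speeds)
hm = harmonic_mean(speeds)
print(round(am, 2), round(gm, 2), round(hm, 2))

# Cross-check: total distance divided by total time equals the harmonic mean
print(round(600 / (300 / 60 + 300 / 50), 2))
```

For two values the product A.M × H.M also equals the square of G.M, which the values above satisfy up to rounding.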


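E.g. Let a and b be any two positive values. Show that their arithmetic mean is greater
than or equal to their geometric mean, and that their geometric mean is greater than or equal to
their harmonic mean.
$\bar{X} = \frac{a+b}{2}$, $G = \sqrt{ab}$, $H = \frac{2}{1/a + 1/b} = \frac{2ab}{a+b}$; we want to show that $\bar{X} \ge G \ge H$.
Since $(\sqrt{a} - \sqrt{b})^2 \ge 0$, we have $a - 2\sqrt{ab} + b \ge 0$, so $a + b \ge 2\sqrt{ab}$
and $\frac{a+b}{2} \ge \sqrt{ab}$, i.e. $\bar{X} \ge G$.
Multiplying $a + b \ge 2\sqrt{ab}$ by $\sqrt{ab} > 0$ gives $(a+b)\sqrt{ab} \ge 2ab$,
so $\sqrt{ab} \ge \frac{2ab}{a+b}$, i.e. $G \ge H$.
Therefore $\bar{X} \ge G \ge H$, with equality only when a = b.
Similarly, prove that for two positive values
$\sqrt{A.M \times H.M} = G.M$, where A.M and H.M are the usual abbreviations.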

The mode ($\hat{x}$)

►Mode is a value which occurs most frequently in a set of values.
►The data set may have more than one mode or no mode at all.
►In the case of a discrete distribution, the mode is the value having the maximum frequency;
for grouped data it lies in the modal class.
►Modal class is the class interval that contains the highest frequency of observations.
Examples:-
►Find the mode of 5, 3, 5, 8, 9
Mode = 5. The data are unimodal.
►Find the mode of 8, 9, 9, 7, 8, 2, 5
Modes = 8 and 9. The data are bimodal.
►Find the mode of 4, 12, 3, 6, and 7.
There is no mode for these data.
• For ungrouped frequency distribution
■Discrete series: the mode equals the value of the variable corresponding to the maximum
frequency.
V   2   3   4    5
F   5   8   12   1
• Mode is 4.
• For grouped data
■Continuous series (class frequency distribution):


Mode = $L_{mo} + \left[\frac{d_1}{d_1 + d_2}\right] W$
Where Lmo = LCB (lower class boundary) of the modal class
d1 = fm - fpm,
d2 = fm - fsm, where fm = frequency of the modal class
fpm = frequency of the class preceding the modal class
fsm = frequency of the class succeeding the modal class
W = class width
Example: - The following is the distribution of the size of certain farms selected at random from a
district.
Class (size of farms)   5-15   15-25   25-35   35-45   45-55   55-65   65-75
F                       8      12      17      29      31      5       3
Find the mode of the distribution.
Solution: - Modal class = 45-55
L = 45, fm = 31, fpm = 29, fsm = 5, W = 10
Then the mode = $45 + \left[\frac{31 - 29}{(31 - 29) + (31 - 5)}\right] 10 = 45.71$
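
A minimal sketch of the grouped-mode formula is given below. The function name is hypothetical, and treating a missing neighbouring class as having frequency 0 is an assumption of this sketch, not something stated in the text.

```python
def grouped_mode(boundaries, freqs):
    """boundaries: list of (lower, upper) class boundaries; freqs: class frequencies."""
    m = max(range(len(freqs)), key=lambda i: freqs[i])        # index of the modal class
    f_m = freqs[m]
    f_prev = freqs[m - 1] if m > 0 else 0                     # assumption: 0 if no preceding class
    f_next = freqs[m + 1] if m < len(freqs) - 1 else 0        # assumption: 0 if no succeeding class
    lower, upper = boundaries[m]
    d1, d2 = f_m - f_prev, f_m - f_next
    return lower + d1 / (d1 + d2) * (upper - lower)

farm_classes = [(5, 15), (15, 25), (25, 35), (35, 45), (45, 55), (55, 65), (65, 75)]
farm_freqs = [8, 12, 17, 29, 31, 5, 3]
print(round(grouped_mode(farm_classes, farm_freqs), 2))
```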
Merit and Demerit of Mode
Merit
- It is not affected by extreme observations.
- Easy to calculate and understand.
Demerit
- It is not rigid.
- It is not based on all observations.
- It is not suitable for further mathematical treatment.
- It is not stable average. i.e. it is affected by fluctuations of sampling to some extent .
- Often its value is not unique.
Note: being the point of maximum density, mode is especially useful in
finding the most popular size in studies relating to marketing, trade, business,
and industry. It is the appropriate average to be used to find the ideal size.
The Median ($\tilde{x}$)
- In a distribution, the median is the value of the variable which divides it into two equal parts.
- In an ordered series of data, the median is the observation lying exactly in the
middle of the series. It is the middle-most value in the sense that the number
of values less than the median is equal to the number of values greater than it.
- If X1, X2, …, Xn are the observations, then the numbers arranged in ascending
order will be X[1], X[2], …, X[n], where X[i] is the ith smallest value.
⇒ X[1] ≤ X[2] ≤ … ≤ X[n]

- For ungrouped raw data

$\tilde{X} = \begin{cases} X_{[(n+1)/2]}, & \text{if } n \text{ is odd} \\ \frac{1}{2}\left(X_{[n/2]} + X_{[(n/2)+1]}\right), & \text{if } n \text{ is even} \end{cases}$

Example: Find the median of the following numbers.


a) 6, 5, 2, 8, 9, 4.
b) 2, 1, 8, 3, 5.
Solutions:
a) First order the data: 2, 4, 5, 6, 8, 9
Here n = 6 (even), so
$\tilde{X} = \frac{1}{2}\left(X_{[n/2]} + X_{[(n/2)+1]}\right) = \frac{1}{2}\left(X_{[3]} + X_{[4]}\right) = \frac{1}{2}(5 + 6) = 5.5$
b) Order the data: 1, 2, 3, 5, 8
Here n = 5 (odd), so
$\tilde{X} = X_{[(n+1)/2]} = X_{[3]} = 3$
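
The odd and even cases can be sketched in a few lines; the function below is illustrative and simply follows the definition for raw data.

```python
def median(data):
    """Middle value of the ordered data (average of the two middle values if n is even)."""
    xs = sorted(data)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]                    # X[(n+1)/2] in 1-based indexing
    return (xs[mid - 1] + xs[mid]) / 2    # average of X[n/2] and X[n/2 + 1]

print(median([6, 5, 2, 8, 9, 4]))
print(median([2, 1, 8, 3, 5]))
```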
Exe:- Find the median of the following numbers:
a) 2, 1,8,3,5
b) 6, 5, 2,8,9,4
 For grouped data (Class Frequency Distribution):

If data are given in the form of a continuous frequency distribution, the
median is defined as:
$\tilde{X} = L_m + \frac{w}{f_m}\left(\frac{n}{2} - F_{pm}\right)$
Where:
Lm = lower class boundary of the median class.
w = the size of the median class.
n = total number of observations.
Fpm = the cumulative frequency (less than type) preceding the median class.
fm = the frequency of the median class.

Remark:

The median class is the class with the smallest cumulative frequency (less than
type) greater than or equal to $\frac{n}{2}$.
Example: Find the median of the following distribution.
Class Frequency
40-44 7
45-49 10
50-54 22
55-59 15
60-64 12
65-69 6
70-74 3
Solutions:
 First find the less than cumulative frequency.
 Identify the median class.
 Find median using formula.

Class Frequency Cum.Freq(less than type)

40-44 7 7
45-49 10 17
50-54 22 39
55-59 15 54
60-64 12 66
65-69 6 72
70-74 3 75

$\frac{n}{2} = \frac{75}{2} = 37.5$
39 is the first cumulative frequency to be greater than or equal to 37.5
⇒ 50-54 is the median class.
Lm = 49.5, w = 5
n = 75, Fpm = 17, fm = 22
⇒ $\tilde{X} = L_m + \frac{w}{f_m}\left(\frac{n}{2} - F_{pm}\right) = 49.5 + \frac{5}{22}(37.5 - 17) = 54.16$
Exe:- Find the median of the following distribution
Class   50-60   60-70   70-80   80-90   90-100   100-110
Fi      20      21      50      40      53       16
<cfi    20      41      91      131     184      200

Merits and Demerits of Median


Merits:-
. Median is a positional average and hence not influenced by extreme
observations.
. It can be calculated in the case of open end intervals.
. Median can be located even if the data are incomplete.

Demerits:
 It is not a good representative of data if the number of items is small.
 It is not amenable to further algebraic treatment.
 It is susceptible to sampling fluctuations.

Quantiles
When a distribution is arranged in order of magnitude of items, the median is the
value of the middle term. Other measures that depend on their positions in the
distribution, namely quartiles, deciles, and percentiles, are collectively called quantiles.
►Quartiles
-Are the three values which divide the given data into four equal parts; they are denoted by
Q1, Q2 and Q3.
Q1 = lower or first quartile; it covers 25% of the distribution
Q2 = the middle or second quartile; it covers 50% of the distribution
Q3 = the upper or third quartile; it covers 75% of the distribution.
For raw (ungrouped) data, first arrange the observations in increasing order of magnitude. Then
the ith quartile is given by
Qi = [i(n+1)/4]th value, i = 1, 2, 3
In dividing i(n+1) by 4, there may be a remainder. Let q be the quotient and r the remainder of the
division. Then,
Qi = qth value + (r/4)[(q+1)th value − qth value]
E.g. The following are yields of barley (kg/plot) from 14 plots:
30, 32, 35, 38, 40, 42, 48, 49, 52, 55, 58, 60, 62 and 65. Find the first and third quartiles.
Solution: For Q1: 1(14+1)/4 = 15/4, so q = 3 and r = 3
Q1 = 3rd value + 3/4 (4th value − 3rd value)
= 35 + 3/4 (38 − 35) = 37.25
For Q3: 3(14+1)/4 = 45/4, so q = 11 and r = 1
Q3 = 11th value + 1/4 (12th value − 11th value) = 58 + 1/4 (2) = 58.5 (upper quartile)
For Q2: 2(14+1)/4 = 30/4, so q = 7 and r = 2
Q2 = 7th value + 2/4 (8th value − 7th value) = 48 + 1/2 (1) = 48.5, which is the median.
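
The quotient and remainder rule for raw data can be sketched as below. The function name is illustrative, and returning the smallest or largest observation when the position falls outside the data is an assumption of this sketch.

```python
def quartile(data, i):
    """ith quartile (i = 1, 2, 3) using the [i(n+1)/4]th-value rule."""
    xs = sorted(data)
    q, r = divmod(i * (len(xs) + 1), 4)       # quotient and remainder of i(n+1)/4
    if q < 1:
        return xs[0]                          # assumption: clamp to the smallest value
    if q >= len(xs):
        return xs[-1]                         # assumption: clamp to the largest value
    # qth value + (r/4) * [(q+1)th value - qth value], with 1-based positions
    return xs[q - 1] + r / 4 * (xs[q] - xs[q - 1])

barley = [30, 32, 35, 38, 40, 42, 48, 49, 52, 55, 58, 60, 62, 65]
print(quartile(barley, 1), quartile(barley, 2), quartile(barley, 3))
```

Under this rule the second quartile of the barley data is 48.5, the same as the median of the 14 values.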

- The ith quartile for a grouped frequency distribution is given by
$Q_i = L_{q_i} + \left[\frac{\frac{in}{4} - F_{pq_i}}{f_{q_i}}\right] W$
Where Qi = the ith quartile
Lqi = the lower boundary of the class in which the ith quartile is located
Fpqi = the cumulative frequency of the class immediately preceding the class containing Qi.
fqi = the frequency of the class containing Qi.
W = class width and n = sample size
Example: Find the 1st and 3rd quartiles for the frequency distribution used in the median example above.
Class   Frequency   Cum. Freq. (less than type)
40-44   7           7
45-49   10          17
50-54   22          39
55-59   15          54
60-64   12          66
65-69   6           72
70-74   3           75
Solution: n = 75, n/4 = 75/4 = 18.75; therefore the 3rd class is the 1st quartile class.
Q1 = 49.5 + [(18.75 − 17)/22] × 5 = 49.90
Similarly, compute Q2 and Q3.
►Deciles
- Deciles are measures that divide the frequency distribution in to ten equal
parts.
- The values of the variables corresponding to these divisions are denoted
D1, D2, .. .D9 often called the first, the second,…, the ninth deciles
respectively.
D1 covers 10% of the distribution, D2 covers 20% of the distribution, D3 covers 30% of
the distribution, ..., D9 covers 90% of the distribution.
The ith decile can be obtained in a similar way to the quartile, except that 4 is replaced by 10.
► Percentiles
- Percentiles are measures that divide the frequency distribution in to
hundred equal parts.
- The values of the variables corresponding to these divisions are denoted
P1, P2,.. P99 often called the first, the second,…, the ninety-ninth percentile
respectively.
Note that
i) Q1=P25 Q2=D5=P50 =Median, Q3=P75
ii) D1=P10 D2=P20, D3=P30,….D9=P90.


E.g. For the data given below, compute the quartiles, D3, D7, P15 and P88, and interpret the results.
Marks   Below 10   10-20   20-40   40-60   60-80   Above 80
f       10         15      25      30      14      6
<cf     10         25      50      80      94      100
Solution: Here N = 100.
Q1: size of the (N/4)th item = 25th item. The first class whose less-than cumulative frequency reaches 25 is 10-20, so this is the quartile class.
L = 10, w = 10, fq1 = 15, Fpq1 = 10
Q1 = Lq1 + [(N/4 – Fpq1)/fq1] W = 10 + (10/15)(25 – 10) = 20
The marks of 25% of the students are below 20.
Q2: size of the (2N/4)th item = 50th item; the quartile class is 20-40.
L = 20, w = 20, fq2 = 25, Fpq2 = 25
Q2 = 20 + (20/25)(50 – 25) = 40
The marks of half of the students are below 40.
D3: size of the (3N/10)th item = 30th item; the decile class is 20-40.
L = 20, w = 20, fd3 = 25, Fpd3 = 25
D3 = 20 + (20/25)(30 – 25) = 24
The marks of 30% of the students are below 24.
D7: size of the (7N/10)th item = 70th item; the decile class is 40-60.
L = 40, w = 20, fd7 = 30, Fpd7 = 50
D7 = 40 + (20/30)(70 – 50) = 53.33
The marks of 70% of the students are below 53.33.
P15: size of the (15N/100)th item = 15th item; the percentile class is 10-20.
L = 10, w = 10, fp15 = 15, Fpp15 = 10
P15 = 10 + (10/15)(15 – 10) = 13.33
The marks of 15% of the students are below 13.33.
P88: size of the (88N/100)th item = 88th item; the percentile class is 60-80.
L = 60, w = 20, fp88 = 14, Fpp88 = 80
P88 = 60 + (20/14)(88 – 80) = 71.43
The marks of 88% of the students are below 71.43.
Exercise: Consider the following distribution and calculate:
a) All quartiles.
b) The 7th decile.
c) The 90th percentile.

Values     Frequency
140-150    17
150-160    29
160-170    42
170-180    72
180-190    84
190-200    107
200-210    49
210-220    34
220-230    31
230-240    16
240-250    12

CHAPTER – FOUR

MEASURE OF DISPERSION (VARIATION)

- The degree to which numerical data tend to spread about an average is called the dispersion or variation of the data.
- In general, the greater the spread from the average, the greater the variability.
Objectives of Measuring Variation or Dispersion
o To judge the reliability of a measure of central tendency,
o To compare two or more groups of numbers in terms of their variability,
o To serve as a basis for further statistical analysis, and
o To describe how the measurements vary about the center of the distribution.
Measures of variation can be either absolute or relative measures.
Absolute Measures of Dispersion
The measures of dispersion which are expressed in terms of the original unit of a series are
termed as absolute measures. Such measures are not suitable for comparing the variability of two
distributions which are expressed in different units of measurement and different average size.
Relative Measures of Dispersion
Relative measures of dispersions are a ratio or percentage of a measure of absolute dispersion to
an appropriate measure of central tendency and are thus pure numbers independent of the units
of measurement. For comparing the variability of two distributions (even if they are measured in
the same unit), we compute the relative measure of dispersion instead of absolute measures of
dispersion.
Types of Measure of Dispersion
There are various measures of dispersion, of which the most commonly used are:
1. Range (R) and Relative Range (RR)
2. Mean Deviation (M.D) and Coefficient of Mean Deviation (C.M.D)
3. Variance (s²), Standard Deviation (s) and Coefficient of Variation (CV).
1. Range (R)
The range is the largest score minus the smallest score. It is a quick and dirty measure of
variability, although when a test is given back to students they very often wish to know the range
of scores. Because the range is greatly affected by extreme scores, it may give a distorted picture
of the scores.

The following two distributions have the same range, 13, yet appear to differ greatly in the
amount of variability.

Distribution 1: 32 35 36 36 37 38 40 42 42 43 43 45
Distribution 2: 32 32 33 33 33 34 34 34 34 34 35 45

For this reason, among others, the range is not the most important measure of variability.

R = L − S, where L = largest observation and S = smallest observation
Range for grouped data:
If the data are given as a continuous frequency distribution, the range is computed as:
R = UCLk − LCL1, where UCLk is the upper class limit of the last class and LCL1 is the lower class limit of the first class.

Merits and Demerits of range


Merits:
- It is rigidly defined.
- It is easy to calculate and simple to understand.
Demerits:
- It is not based on all observations.
- It is highly affected by extreme observations.
- It is affected by fluctuations in sampling.
- It is not amenable to further algebraic treatment.
- It cannot be computed in the case of an open-ended distribution.
- It is very sensitive to the size of the sample.
Relative Range (RR):
- It is also sometimes called the coefficient of range and is given by:
RR = (L − S)/(L + S) = R/(L + S)
Exercise:
1. Find the relative range of the two distributions above.
2. If the range and relative range of a series are 4 and 0.25 respectively, what is the value of the:
a) Smallest observation
b) Largest observation

2. Mean Deviation (M.D)

- The M.D of a set of items is defined as the arithmetic mean of the absolute deviations of the values from a given average.
M.D(A) = (1/n) Σ |Xi − A|, i = 1, ..., n, where A is any average
- Depending on the type of average used, we have different types of mean deviation.
In the frequency distribution formulas below, k is the number of distinct values (or classes) and n = Σ fi.
A) Mean Deviation about the Mean (x̄)
M.D(x̄) = (1/n) Σ |Xi − x̄|, i = 1, ..., n
o For a frequency distribution it is given as:
M.D(x̄) = (1/n) Σ fi|Xi − x̄|, i = 1, ..., k
B) Mean Deviation about the Median (x̃)
M.D(x̃) = (1/n) Σ |Xi − x̃|, i = 1, ..., n
o For a frequency distribution it is given as:
M.D(x̃) = (1/n) Σ fi|Xi − x̃|, i = 1, ..., k
C) Mean Deviation about the Mode (x̂)
M.D(x̂) = (1/n) Σ |Xi − x̂|, i = 1, ..., n
o For a frequency distribution it is given as:
M.D(x̂) = (1/n) Σ fi|Xi − x̂|, i = 1, ..., k

Examples:
1. The following are the numbers of visits made by ten mothers to the local doctor's surgery:
8, 6, 5, 5, 7, 4, 5, 9, 7, 4. Find the mean deviation about the mean, median and mode.
Solution: First calculate the three averages: x̄ = 6, x̃ = 5.5, x̂ = 5
Then take the deviations of each observation from these averages.
Xi          4    4    5    5    5    6    7    7    8    9    Total
|Xi − 6|    2    2    1    1    1    0    1    1    2    3    14
|Xi − 5.5|  1.5  1.5  0.5  0.5  0.5  0.5  1.5  1.5  2.5  3.5  14
|Xi − 5|    1    1    0    0    0    1    2    2    3    4    14
M.D(x̄) = (1/10) Σ |Xi − 6| = 14/10 = 1.4
M.D(x̃) = (1/10) Σ |Xi − 5.5| = 14/10 = 1.4
M.D(x̂) = (1/10) Σ |Xi − 5| = 14/10 = 1.4
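The calculation above can be reproduced with a few lines of Python. This is a minimal sketch: `mean_deviation` is a hypothetical helper name, and the averages come from the standard library `statistics` module.

```python
from statistics import mean, median, mode


def mean_deviation(data, center):
    """Arithmetic mean of the absolute deviations from a chosen average."""
    return sum(abs(x - center) for x in data) / len(data)


visits = [8, 6, 5, 5, 7, 4, 5, 9, 7, 4]
for name, avg in (("mean", mean(visits)),
                  ("median", median(visits)),
                  ("mode", mode(visits))):
    print(name, avg, mean_deviation(visits, avg))
```

For this data set all three mean deviations equal 1.4, as in the worked solution. In general they differ, and the mean deviation about the median is never larger than the other two.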


2. Find mean deviation about mean, median and mode for the following distribution. (Exercise)
class Frequency
40-44 7
45-49 10
50-54 22
55-59 15
60-64 12
65-69 6
70-74 3

Coefficient of Mean Deviation (C.M.D)

C.M.D = M.D/Average
- C.M.D(x̄) = M.D(x̄)/x̄, C.M.D(x̃) = M.D(x̃)/x̃, C.M.D(x̂) = M.D(x̂)/x̂

Examples:
►Calculate the C.M.D about the mean, median and mode for the data in example 1 above.
- C.M.D(x̄) = M.D(x̄)/x̄ = 1.4/6 = 0.233, C.M.D(x̃) = M.D(x̃)/x̃ = 1.4/5.5 = 0.255 and
- C.M.D(x̂) = M.D(x̂)/x̂ = 1.4/5 = 0.28
3. Variance and Standard Deviation
Variance
- Is the "average squared deviation from the mean".
- Population variance: σ² = (1/N) Σ (Xi − μ)², i = 1, 2, 3, ..., N
- For a frequency distribution it is expressed as: σ² = (1/N) Σ fi(Xi − μ)², i = 1, 2, ..., k
- Sample variance (s²): s² = [1/(n − 1)] Σ (Xi − x̄)², i = 1, 2, 3, ..., n
- For a frequency distribution it is expressed as:
s² = [1/(n − 1)] Σ fi(Xi − x̄)², i = 1, 2, 3, ..., k
Short-cut formula:
s² = [1/(n − 1)](Σ Xi² − n x̄²) for raw data, s² = [1/(n − 1)](Σ fiXi² − n x̄²) for a frequency distribution.
Standard Deviation
- There is a problem with variances.
- Recall that the deviations were squared. That means the units were also squared.
- To get the units back the same as the original data values, the square root must be taken.
σ = √σ² and s = √s²
Examples: Find the variance and standard deviation of the following sample data.
1. 5, 17, 12, 10
2. The data given in the form of a frequency distribution (below).
Solutions:
1. x̄ = 11
Xi          5    10   12   17   Total
(Xi − x̄)²   36   1    1    36   74
s² = [1/(n − 1)] Σ (Xi − x̄)² = 74/3 = 24.67 ⇒ s = √s² = √24.67 = 4.97
2.
Class    Frequency
40-44    7
45-49    10
50-54    22
55-59    15
60-64    12
65-69    6
70-74    3
x̄ = 55
Xi (class mark)   42     47    52    57   62    67    72    Total
fi(Xi − x̄)²       1183   640   198   60   588   864   867   4400
s² = [1/(n − 1)] Σ fi(Xi − x̄)² = 4400/74 = 59.46 ⇒ s = √59.46 = 7.71
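Both variance calculations can be verified with one small function. The sketch below assumes a hypothetical helper `sample_variance` that treats raw data as a frequency distribution in which every frequency is 1. For the grouped data the class marks stand in for the observations.

```python
from math import sqrt


def sample_variance(values, freqs=None):
    """s^2 = [1/(n-1)] * sum of f_i * (x_i - mean)^2, with n = sum of f_i."""
    if freqs is None:                      # raw data: each value occurs once
        freqs = [1] * len(values)
    n = sum(freqs)
    xbar = sum(f * x for x, f in zip(values, freqs)) / n
    return sum(f * (x - xbar) ** 2 for x, f in zip(values, freqs)) / (n - 1)


raw_var = sample_variance([5, 17, 12, 10])
grouped_var = sample_variance([42, 47, 52, 57, 62, 67, 72],
                              [7, 10, 22, 15, 12, 6, 3])
print(round(raw_var, 2), round(sqrt(raw_var), 2))
print(round(grouped_var, 2), round(sqrt(grouped_var), 2))
```

The results agree with the hand calculation (24.67 and 4.97 for the raw data, 59.46 and 7.71 for the grouped data).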

Coefficient of Variation (CV)

- Is defined as the ratio of the standard deviation to the mean, usually expressed as a percentage.
- CV = (S/x̄) × 100%
Example:
An analysis of the monthly wages paid (in birr) to workers in two firms, A and B, belonging to the same industry gives the following results:
Value          Firm A   Firm B
Mean wage      52.5     47.5
Median wage    50.5     45.5
Variance       100      121
In which firm, A or B, is there greater variability in individual wages?
Solution: Calculate the coefficient of variation for both firms.
- CVA = (SA/x̄A) × 100% = (10/52.5) × 100% = 19.05% and CVB = (SB/x̄B) × 100% = (11/47.5) × 100% = 23.16%
- Since CVA < CVB, there is greater variability in individual wages in firm B.
Exercise:
1. A meteorologist interested in the consistency of temperatures in three cities during a given week collected the following data. The temperatures for five days of the week in the three cities were:
City-1   25   24   23   26   17
City-2   22   21   24   22   20
City-3   32   27   35   24   28
Which city do you think has the most consistent temperature, based on these data?
2. Two groups of people were trained to perform a certain task and tested to find out which group is faster to learn the task. For the two groups the following information was given:

Value                Group one   Group two
Mean                 10.4 min    11.9 min
Standard deviation   1.2 min     1.3 min

Relatively speaking, which group is more consistent in its performance?
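As a quick check of the two-firm wage example, the coefficient of variation can be computed as below. The function name `coefficient_of_variation` is a hypothetical choice. The standard deviations are obtained as the square roots of the given variances.

```python
from math import sqrt


def coefficient_of_variation(std_dev, mean):
    """CV = (standard deviation / mean) * 100, expressed in percent."""
    return std_dev / mean * 100


cv_a = coefficient_of_variation(sqrt(100), 52.5)   # firm A
cv_b = coefficient_of_variation(sqrt(121), 47.5)   # firm B
print(round(cv_a, 2), round(cv_b, 2))
```

The larger CV (firm B) indicates the greater relative variability. The same function can be applied to the exercises once the mean and standard deviation of each group are known.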


Standard Scores (Z-scores)
- If X is a measurement from a distribution with mean X̄ and standard deviation S, then its value in standard units is
Z = (X − μ)/σ for a population, Z = (X − X̄)/S for a sample
- Z gives the deviation from the mean in units of standard deviation.
- Z gives the number of standard deviations a particular observation lies above or below the mean.
- It is used to compare two observations coming from different groups.

Examples:
1. Two sections were given an Introduction to Statistics examination. The following information was given.
Value   Section 1   Section 2
Mean    78          90
Sd      6           5
Student A from section 1 scored 90 and student B from section 2 scored 95. Relatively speaking, who performed better?
Solution: Calculate the standard score of both students.
ZA = (XA − X̄1)/S1 = (90 − 78)/6 = 2
ZB = (XB − X̄2)/S2 = (95 − 90)/5 = 1
- Student A performed better relative to his section, because the score of student A is two standard deviations above the mean score of his section, while the score of student B is only one standard deviation above the mean score of his section.
2. Two groups of people were trained to perform a certain task and tested to find out which group is faster to learn the task. For the two groups the following information was given:

Value        Group one   Group two
Mean         10.4 min    11.9 min
Stan. dev.   1.2 min     1.3 min

Relatively speaking:
a) Which group is more consistent in its performance?
b) Suppose person A from group one takes 9.2 minutes while person B from group two takes 9.3 minutes. Who was faster in performing the task? Why?

Solutions:
a) Use the coefficient of variation.
C.V1 = (S1/X̄1) × 100 = (1.2/10.4) × 100 = 11.54%
C.V2 = (S2/X̄2) × 100 = (1.3/11.9) × 100 = 10.92%
Since C.V2 < C.V1, group 2 is more consistent.
b) Calculate the standard scores of A and B.
ZA = (XA − X̄1)/S1 = (9.2 − 10.4)/1.2 = −1
ZB = (XB − X̄2)/S2 = (9.3 − 11.9)/1.3 = −2
Person B is faster relative to his or her group, because the time taken by person B is two standard deviations shorter than the average time taken by group 2, while the time taken by person A is only one standard deviation shorter than the average time taken by group 1.
3. Compare the performance of the following two students.
Candidate   Marks in economics   Marks in accounting   Total
A           84                   75                    159
B           74                   85                    159
The average mark for economics is 60 with a standard deviation of 13, and that for accounting is 50 with a standard deviation of 11. Whose performance is better, A or B?
Z scores for A:
Economics: (84 − 60)/13 = 1.846
Accounting: (75 − 50)/11 = 2.273
Total Z score for A = 1.846 + 2.273 = 4.119
Z scores for B:
Economics: (74 − 60)/13 = 1.077
Accounting: (85 − 50)/11 = 3.182
Total Z score for B = 1.077 + 3.182 = 4.259
Since B's total Z score is higher, B's performance is better than A's.
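The comparison of the two candidates can be sketched in Python. The dictionary layout and the names `z_score` and `total_z` are illustrative assumptions. The subject means and standard deviations are the ones given in the example.

```python
def z_score(x, mean, sd):
    """Standard score: number of standard deviations x lies from the mean."""
    return (x - mean) / sd


# (mean, standard deviation) of each subject, as given in the example
subjects = {"economics": (60, 13), "accounting": (50, 11)}
marks = {
    "A": {"economics": 84, "accounting": 75},
    "B": {"economics": 74, "accounting": 85},
}


def total_z(candidate):
    """Sum of the candidate's standard scores over all subjects."""
    return sum(z_score(marks[candidate][s], *subjects[s]) for s in subjects)


print(round(total_z("A"), 3), round(total_z("B"), 3))
```

Although both candidates have the same raw total (159), B's marks are further above average in the subject with the smaller spread, which is why B's total standard score is higher.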
Moments
- If X is a variable that assumes the values X1, X2, ..., Xn, then
- The rth moment is defined as: x̄r = (x1^r + x2^r + x3^r + ... + xn^r)/n = (1/n) Σ xi^r, i = 1, ..., n
- For a frequency distribution this is expressed as: x̄r = (1/n) Σ fi xi^r, i = 1, ..., k
- If r = 1, it is the simple arithmetic mean; this is called the 1st moment.
- The rth moment about the mean (the rth central moment) is denoted by µr and defined as:
µr = Σ (xi − x̄)^r / n = [(n − 1)/n] × [Σ (xi − x̄)^r / (n − 1)], i = 1, ..., n
- For a frequency distribution this is expressed as:
µr = Σ fi(xi − x̄)^r / n, i = 1, ..., k
- If r = 2, it is the population variance; this is called the second central moment.
- If we assume n − 1 ≈ n, it is also the sample variance.
Examples: 1) Find the first two moments for the following set of numbers: 2, 3, 7
2) Find the first three central moments of the numbers in problem 1.
Solutions: 1) Use the rth moment formula.
x̄r = (1/n) Σ xi^r ⇒ x̄1 = (2 + 3 + 7)/3 = 4, x̄2 = (2² + 3² + 7²)/3 = 20.67
2) Use the rth central moment formula.
µ1 = [(2 − 4) + (3 − 4) + (7 − 4)]/3 = 0, µ2 = ?, µ3 = ?
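The two moment formulas translate directly into code. In this sketch the names `raw_moment` and `central_moment` are hypothetical, and both functions divide by n as in the definitions above.

```python
def raw_moment(xs, r):
    """r-th moment: (1/n) * sum of x_i^r."""
    return sum(x ** r for x in xs) / len(xs)


def central_moment(xs, r):
    """r-th moment about the mean: (1/n) * sum of (x_i - mean)^r."""
    m = raw_moment(xs, 1)
    return sum((x - m) ** r for x in xs) / len(xs)


numbers = [2, 3, 7]
print(raw_moment(numbers, 1), round(raw_moment(numbers, 2), 2))
print(central_moment(numbers, 1))
```

Calling `central_moment(numbers, 2)` and `central_moment(numbers, 3)` is one way to check the two central moments left to the reader.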
Measure of Shapes
Skewness
- Skewness is concerned with the shape of the curve, not its size.
- Skewness is the degree of asymmetry, or departure from symmetry, of a distribution.
- A skewed frequency distribution is one that is not symmetrical.
- If the frequency curve (smoothed frequency polygon) of a distribution has a longer tail to the right of the central maximum than to the left, the distribution is said to be skewed to the right, or to have positive skewness.
- If it has a longer tail to the left of the central maximum than to the right, it is said to be skewed to the left, or to have negative skewness.
- For a moderately skewed distribution, the following relation holds among the three commonly used measures of central tendency: Mean − Mode = 3 × (Mean − Median)
Measure of Skewness: denoted by α3
There are various measures of skewness:
1. The Pearsonian coefficient of skewness
α3 = (Mean − Mode)/Standard deviation = (x̄ − x̂)/s
2. Bowley's coefficient of skewness (coefficient of skewness based on quartiles)
α3 = [(Q3 − Q2) − (Q2 − Q1)]/(Q3 − Q1) = (Q3 + Q1 − 2Q2)/(Q3 − Q1)
3. The moment coefficient of skewness
α3 = µ3/µ2^(3/2) = µ3/σ³, where σ is the population standard deviation

o The shape of the curve is determined by the value of α3:
- If α3 > 0, the distribution is positively skewed.
- If α3 = 0, the distribution is symmetric.
- If α3 < 0, the distribution is negatively skewed.
Examples: 1. Suppose the mean, the mode and the standard deviation of a certain distribution are 32, 30.5 and 10 respectively. What is the shape of the curve representing the distribution?
Soln: Use the Pearsonian coefficient of skewness.
α3 = (Mean − Mode)/Standard deviation = (32 − 30.5)/10 = 0.15
α3 > 0 ⇒ the distribution is positively skewed.
2. In a frequency distribution, the coefficient of skewness based on the quartiles is given to be 0.5. If the sum of the upper and lower quartiles is 28 and the median is 11, find the values of the upper and the lower quartiles.
Soln:
Given: α3 = 0.5, median = Q2 = 11
Q1 + Q3 = 28 ....................................... (*)
Required: Q1 and Q3
α3 = (Q3 + Q1 − 2Q2)/(Q3 − Q1) = 0.5. Substituting the given values, (28 − 22)/(Q3 − Q1) = 0.5, so
Q3 − Q1 = 12 ………………………… (**)
Solving (*) and (**): Q1 = 8, Q3 = 20
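The two coefficient formulas used in these examples can be written as small functions. The names `pearson_skewness` and `bowley_skewness` are hypothetical. The second call simply confirms that the solved quartiles reproduce the given coefficient of 0.5.

```python
def pearson_skewness(mean, mode, sd):
    """Pearsonian coefficient: (mean - mode) / standard deviation."""
    return (mean - mode) / sd


def bowley_skewness(q1, q2, q3):
    """Bowley's coefficient: (Q3 + Q1 - 2*Q2) / (Q3 - Q1)."""
    return (q3 + q1 - 2 * q2) / (q3 - q1)


print(pearson_skewness(32, 30.5, 10))   # example 1
print(bowley_skewness(8, 11, 20))       # example 2, using the solved quartiles
```

A positive result indicates a longer right tail, and a negative result a longer left tail.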

3. Some characteristics of the annual family income distribution (in birr) in two regions are as follows:

Region   Mean   Median   Standard Deviation
A        6250   5100     960
B        6980   5500     940
a) Calculate the coefficient of skewness for each region.
b) For which region is the income distribution more skewed? Give your interpretation for each region.
c) For which region is the income more consistent? (exercise)
4. For a moderately skewed frequency distribution, the mean is 10 and the median is 8.5.
If the coefficient of variation is 20%, find the Pearsonian coefficient of skewness and
the probable mode of the distribution.
5. The sum of fifteen observations, whose mode is 8, was found to be 150 with
coefficients of variation of 20%. Then, calculate the Pearsonian coefficient of
skewness and give appropriate conclusion.

Kurtosis
- Kurtosis is the degree of peakedness of a distribution, usually taken relative to a normal distribution.
- A distribution having a relatively high peak is called leptokurtic.
- If the curve representing a distribution is flat-topped, it is called platykurtic.
- The normal distribution, which is neither very high-peaked nor flat-topped, is called mesokurtic.

Measure of Kurtosis
- The moment coefficient of kurtosis, denoted by α4:
α4 = µ4/µ2² = µ4/σ⁴
where µ4 is the 4th moment about the mean, µ2 is the 2nd moment about the mean and σ is the population standard deviation.
- The peakedness depends on the value of α4:
- If α4 > 3, the curve is leptokurtic.
- If α4 = 3, the curve is mesokurtic.
- If α4 < 3, the curve is platykurtic.

Example: If the first four central moments of a distribution are:
µ1 = 0, µ2 = 16, µ3 = −60, µ4 = 162
a) Compute a measure of skewness.
b) Compute a measure of kurtosis and give your interpretation.
Solutions:
a) α3 = µ3/µ2^(3/2) = −60/16^(3/2) = −60/64 = −0.94 < 0 ⇒ The distribution is negatively skewed.
b) α4 = µ4/µ2² = 162/16² = 162/256 = 0.63 < 3 ⇒ The curve is platykurtic.
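The moment coefficients in this example are easy to verify numerically. The sketch below uses hypothetical function names and takes the central moments as given inputs rather than computing them from data.

```python
def moment_skewness(mu2, mu3):
    """Moment coefficient of skewness: mu3 / mu2^(3/2)."""
    return mu3 / mu2 ** 1.5


def moment_kurtosis(mu2, mu4):
    """Moment coefficient of kurtosis: mu4 / mu2^2."""
    return mu4 / mu2 ** 2


alpha3 = moment_skewness(16, -60)
alpha4 = moment_kurtosis(16, 162)
shape = ("leptokurtic" if alpha4 > 3
         else "mesokurtic" if alpha4 == 3
         else "platykurtic")
print(round(alpha3, 2), round(alpha4, 2), shape)
```

With real data an exact comparison with 3 is rarely meaningful; in practice one asks whether α4 is clearly above or below 3.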

Exercise
Compute a measure of kurtosis and give your interpretation.
Value (xi)      3   4   5    6    7    8    9    10
Frequency (f)   4   6   10   26   24   15   10   5

CHAPTER FIVE
ELEMENTARY PROBABILITY
Learning Objectives
At the end of this chapter, the student will be able to:
- Understand the concepts and characteristics of probabilities
- Determine sample spaces and the total number of outcomes in a sequence of events, using the fundamental counting rule
- Compute probabilities of events and conditional probabilities
Introduction
- Probability is one of those elusive concepts that virtually everyone knows but which is nearly impossible to define entirely adequately.
- Probability theory is the foundation upon which the logic of inference is built. It helps us to cope with uncertainty.
- In general, probability is the chance of an outcome of an experiment.
- It is the measure of how likely an outcome is to occur.
Definitions of Basic Probability Terms
- Experiment: any process which generates well-defined results or outcomes.
- Random experiment: an experiment that can be repeated any number of times under similar conditions, and for which it is possible to enumerate all the possible outcomes without being able to predict an individual outcome.
Example: If a fair die is rolled once, it is possible to list all the possible outcomes, i.e. 1, 2, 3, 4, 5, 6, but it is not possible to predict which outcome will occur.
- Outcome: the result of a single trial of an experiment.
- Sample space (S): the set of all possible outcomes of an experiment.
- Event: any subset of the sample space.
Remark: If S (the sample space) has n members, then there are exactly 2^n subsets or events.
- Equally likely events: events which have the same chance of occurrence.
- Complement of an event: the complement of an event A means the non-occurrence of A and is denoted by A', A^c or Ā. It contains those points of the sample space which do not belong to A.
- Elementary event: an event having only a single element or sample point.
- Mutually exclusive (ME) events: two events that cannot occur simultaneously (cannot happen at the same time), i.e. they have no intersection.
For example, if we roll a fair die, then the experiment is rolling the die and the sample space is S = {1, 2, 3, 4, 5, 6}.
If we are interested in the events E1 (getting an even number) and E2 (getting an odd number), then
E1 = {2, 4, 6}, E2 = {1, 3, 5}. Clearly E1 ∩ E2 = Φ.
Thus E1 and E2 are mutually exclusive events.
- Independent events: two events are independent if the occurrence of one event does not affect the occurrence or non-occurrence of the other event. Otherwise, they are dependent events.


Example: What is the sample space for each of the following experiments?
1. Toss a die one time.
2. Toss a coin two times.
3. A light bulb is manufactured and tested for the length of its life (time).
Solution
1. S = {1, 2, 3, 4, 5, 6}
2. S = {HH, HT, TH, TT}
3. S = {t: t ≥ 0}
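Finite sample spaces such as the first two can be enumerated mechanically. The sketch below uses `itertools.product` from the Python standard library. The variable names are illustrative, and the third sample space is continuous, so it cannot be listed this way.

```python
from itertools import product

die = list(range(1, 7))                                  # one roll of a die
two_tosses = ["".join(p) for p in product("HT", repeat=2)]  # two coin tosses

print(die)
print(two_tosses)
# A sample space with n members has 2^n subsets (events)
print(2 ** len(die), 2 ** len(two_tosses))
```

For two tosses there are 4 outcomes and therefore 16 possible events, including the empty set and the whole sample space.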
5.3 Counting Rules
In order to calculate probabilities, we have to know
- The number of elements of an event
- The number of elements of the sample space.
That is, in order to judge what is probable, we have to know what is possible.
- In order to determine the number of outcomes, one can use several rules of counting:
- The addition rule
- The multiplication rule
- The permutation rule
- The combination rule

Addition rule
Suppose that a procedure, designated by 1, can be performed in n1 ways. Assume that a second procedure, designated by 2, can be performed in n2 ways. Suppose furthermore that it is not possible for procedures 1 and 2 to be performed together. Then the number of ways in which we can perform procedure 1 or procedure 2 is n1 + n2. This can be generalized as follows: if there are k procedures and the ith procedure may be performed in ni ways, i = 1, 2, ..., k, then the number of ways in which we can perform procedure 1 or 2 or ... or k is given by n1 + n2 + ... + nk = Σ ni (i = 1, ..., k), assuming that no two procedures can be performed together.
Example: Suppose that we are planning a trip and are deciding between bus and train transportation. If there are 3 bus routes and 2 train routes to go from A to B, find the number of available routes for the trip.
There are 3 + 2 = 5 possible routes for someone to go from A to B.

Multiplication Rule
Suppose that procedure 1 can be performed in n1 ways and procedure 2 can be performed in n2 ways. Suppose also that each way of doing procedure 1 may be followed by any way of doing procedure 2. Then the procedure consisting of 1 followed by 2 may be performed in n1 × n2 ways.
Example: There are four blood types: A, B, AB and O. Blood can also be Rh+ or Rh-. Finally, a blood donor can be classified as either male or female. In how many different ways can a donor have his or her blood labeled?
Solution
Since there are 4 possibilities for blood type, 2 possibilities for the Rh factor and 2 possibilities for the gender of the donor, there are 4 × 2 × 2 = 16 different classification categories.
Exercise: The digits 0, 1, 2, 3 and 4 are to be used in a 4-digit identification card. How many different cards are possible if
a) Repetitions are permitted.
b) Repetitions are not permitted.
Permutation: An arrangement of n objects in a specified order is called a permutation of the objects.
Permutation Rules:
1. The number of permutations of n distinct objects taken all together is n!,
where n! = n × (n − 1) × (n − 2) × ... × 3 × 2 × 1
2. The arrangement of n objects in a specified order using r objects at a time is called the permutation of n objects taken r at a time. It is written as nPr and the formula is
nPr = n!/(n − r)!
3. The number of permutations of n objects in which k1 are alike, k2 are alike, ..., km are alike is
n!/(k1! × k2! × ... × km!)
Example: How many different permutations can be made from the letters in the word "BIOSTATISTICS"?
Solution:
Here n = 13, of which one is B, three are I, one is O, three are S, three are T, one is A and one is C. There are
13!/(1! 3! 1! 3! 3! 1! 1!) = 28,828,800 permutations.
Combination: The selection of objects without regard to order is called a combination.
Combination rule: The number of combinations of r objects selected from n objects is given as follows:
nCr = n!/[(n − r)! r!]
Example: In how many ways can 5 injured persons be selected from 10 injured people in a certain car accident?
Solution:
n = 10, r = 5 ⇒ nCr = n!/[(n − r)! r!] = 10!/(5! 5!) = 252 ways
Exercise: Among 15 packs of drugs, two are defective. In how many ways can a pharmacologist choose three of the packs for inspection so that:
a) There is no restriction,
b) None of the defective packs is included,
c) Only one of the defective packs is included,
d) Both of the defective packs are included.
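The worked counting examples can be checked with the Python standard library (`math.comb` and `math.prod` need Python 3.8 or later). The helper name `distinct_arrangements` is a hypothetical choice for rule 3 of the permutation rules.

```python
from collections import Counter
from math import comb, factorial, prod


def distinct_arrangements(word):
    """n! / (k1! * k2! * ...), where the k's count the repeated letters."""
    counts = Counter(word)
    return factorial(len(word)) // prod(factorial(c) for c in counts.values())


blood_labels = 4 * 2 * 2            # multiplication rule: type, Rh factor, gender
print(blood_labels)
print(distinct_arrangements("BIOSTATISTICS"))
print(comb(10, 5))                  # 5 injured persons chosen from 10
```

`math.perm(n, r)` gives nPr in the same way and may be useful for the identification card exercise.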
Approaches to Measuring Probability
There are four different conceptual approaches to the study of probability theory. These are:
- The classical approach.
- The frequentist approach.
- The axiomatic approach.
- The subjective approach.
The classical approach: This approach is used when:
- All outcomes are equally likely.
- The total number of outcomes is finite, say N.
Definition: If a random experiment with N equally likely outcomes is conducted and, out of these, NA outcomes are favorable to the event A, then the probability that event A occurs, denoted P(A), is defined as:
P(A) = NA/N = (No. of outcomes favourable to A)/(Total number of outcomes) = n(A)/n(S)

Example: In the rolling of a die, each of the six sides is equally likely to be observed. So the probability that a 4 will be observed is equal to 1/6.
Exercise: A box of 80 aspirin consists of 12 defective and 70 non defective aspirin tablets. If 8
of this tablet are selected at random, what is the probability
a) All will be defective.
b) 6 will be non defective

The Frequentist Approach
This is based on the relative frequencies of outcomes belonging to an event.
Definition: The probability of an event A is the proportion of outcomes favorable to A in the long run when the experiment is repeated under the same conditions.
P(A) = lim (N→∞) NA/N
Example: If records show that 60 out of 100,000 bulbs produced are defective, what is the probability that a newly produced bulb is defective?
Soln: Let A be the event that the newly produced bulb is defective.
P(A) = lim (N→∞) NA/N ≈ 60/100,000 = 0.0006
Axiomatic Approach:
Let E be a random experiment and S be a sample space associated with E. With each event A we associate a real number P(A), called the probability of A, which satisfies the following properties, called the axioms (or postulates) of probability.
1. P(A) ≥ 0
2. P(S) = 1, where S is the sure event.
3. If A and B are mutually exclusive events, the probability that one or the other occurs equals the sum of the two probabilities, i.e.
P(A∪B) = P(A) + P(B)
4. P(A') = 1 − P(A)
5. 0 ≤ P(A) ≤ 1
6. P(ø) = 0, where ø is the impossible event.
Conditional probability and Independency
Conditional Events: If the occurrence of one event has an effect on the occurrence of the other event, then the two events are conditional or dependent events.
Example: The chance that a patient with some disease survives the next year depends on his having survived to the present time. Such probabilities are called conditional.
Conditional probability of an event: The conditional probability of an event A given that B has already occurred, denoted p(A/B), is
p(A/B) = p(A∩B)/p(B), p(B) ≠ 0
Remark: (1) p(A'/B) = 1 − p(A/B)
(2) p(B'/A) = 1 − p(B/A)
Examples
1. Suppose in country X the chance that an infant lives to age 25 is 0.95, whereas the chance that he lives to age 65 is 0.65. For the latter, it is understood that to survive to age 65 means to survive both from birth to age 25 and from age 25 to 65. What is the chance that a person 25 years of age survives to age 65?
Solution: Let A = the event that an infant will survive to age 25
B = the event that he/she lives to age 65
Given: P(A) = 0.95, P(A∩B) = 0.65
Then P(B/A) = P(A∩B)/P(A) = 0.65/0.95 = 0.684.
That is, a person aged 25 has a 68.4 percent chance of living to age 65.
2. For a student enrolling as a freshman at a certain university, the probability is 0.25 that he/she will get a scholarship and 0.75 that he/she will graduate. If the probability is 0.2 that he/she will get a scholarship and will also graduate, what is the probability that a student who gets a scholarship will graduate?
Solution: Let A = the event that a student will get a scholarship
B = the event that a student will graduate
Given: p(A) = 0.25, p(B) = 0.75, p(A∩B) = 0.20
Required: p(B/A)
p(B/A) = p(A∩B)/p(A) = 0.20/0.25 = 0.80
3. If the probability that a research project will be well planned is 0.60 and the probability that it will be well planned and well executed is 0.54, what is the probability that it will be well executed given that it is well planned?
Solution: Let A = the event that a research project will be well planned
B = the event that a research project will be well executed
Given: p(A) = 0.60, p(A∩B) = 0.54
Required: p(B/A)
p(B/A) = p(A∩B)/p(A) = 0.54/0.60 = 0.90
4. In a certain university there are 20 foreign and 80 local lecturers, from which two lecturers are chosen without replacement. Events A and B are defined as
A = the first selected lecturer is foreign,
B = the second selected lecturer is foreign
a. What is the probability that both lecturers are foreign?
b. What is the probability that the second lecturer is foreign?
Solution: Exercise
Note: For any two events A and B the following relation holds.
p(B) = p(B/A) p(A) + p(B/A') p(A')
Probability of Independent Events
Two events A and B are independent if and only if p(A∩B) = p(A) p(B).
In that case p(A/B) = p(A) and p(B/A) = p(B).
Example: A box contains four black and six white balls. What is the probability of getting two black balls in drawing one after the other under the following conditions?
a. The first ball drawn is not replaced
b. The first ball drawn is replaced
Solution: Let A = the first ball drawn is black
B = the second ball drawn is black
Required: p(A∩B)
a. p(A∩B) = p(B/A) p(A) = (3/9)(4/10) = 2/15
b. p(A∩B) = p(A) p(B) = (4/10)(4/10) = 4/25
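The two drawing schemes can be compared with exact fractions. In this sketch the function name `p_two_black` and its arguments are hypothetical. With replacement the two draws are independent. Without replacement the second probability is conditional on the first draw.

```python
from fractions import Fraction


def p_two_black(black, white, replace):
    """Probability that two successive draws are both black."""
    total = black + white
    first = Fraction(black, total)                    # p(A)
    if replace:
        second = first                                # independent: p(B/A) = p(B)
    else:
        second = Fraction(black - 1, total - 1)       # dependent: p(B/A)
    return first * second


print(p_two_black(4, 6, replace=False))   # case a
print(p_two_black(4, 6, replace=True))    # case b
```

The output is 2/15 for case a and 4/25 for case b, matching the solution above.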
Total Probability Theorem
If we know the conditional probabilities of a given event under all conditions, then we can obtain the unconditional probability of the same event using the law of total probability.
Consider two events. B: the flight is delayed, A: there are severe thunderstorms.
- Now a flight may be delayed (event B happens) for many reasons, one of which is severe thunderstorms (event A happens). Clearly, these two events are related.
- In fact, the (unconditional) probability of delay can be considered as the sum of two joint probabilities: the probability that there is a delay and there are severe thunderstorms, plus the probability that there is a delay and there are no severe thunderstorms. Each joint probability is the conditional probability of delay weighted by the probability of the condition.

[Figure: the sample space divided into Storm and No Storm, with the Delay event overlapping both regions (delay when there is a storm, delay when there is no storm)]

P(delay) = P(delay & storm) + P(delay & no storm)
P(B) = P(B∩A) + P(B∩A')
P(B) = P(B/A) P(A) + P(B/A') P(A')
Definition (total probability theorem): Let A1, ..., An be n mutually exclusive events whose union gives the sample space S. Hence the events Ai constitute a partition of S. For any event B, a subset of S, we have
P(B) = Σ P(B/Ai) P(Ai), i = 1, ..., n

Bayes’ theorem
Prior probability → new information → application of Bayes' theorem → posterior
probability
- We now pose the opposite question: given that the event B has occurred, what is
the probability that any single one of the events Ai has occurred?
We call P(Ai/B) the posterior probability of event Ai, that is, the probability of Ai after event
B is observed. P(Ai) by itself is then the prior probability, the belief we have in the likelihood of
Ai in the absence of any additional information.
Since P(Ai∩B) = P(B∩Ai), that is, P(Ai/B) P(B) = P(B/Ai) P(Ai), the posterior probability
follows directly from the conditional probability and the prior probability:
\[ P(A_i/B) = \frac{P(B/A_i)\,P(A_i)}{\sum_{j=1}^{n} P(B/A_j)\,P(A_j)} \]
This is called Bayes' formula.


Definition: If A1, A2, ..., An are mutually disjoint events whose union is S, with P(Ai) ≠ 0 (i = 1, 2, ..., n), then for any
arbitrary event B which is a subset of S such that P(B) > 0, we have
\[ P(A_i/B) = \frac{P(B/A_i)\,P(A_i)}{\sum_{j=1}^{n} P(B/A_j)\,P(A_j)} \]
Each term in Bayes' theorem has a special name, which you should be familiar with:
- P(Ai): prior probability (of class Ai)
- P(Ai/B): posterior probability (of class Ai given the observation B)
- P(B/Ai): likelihood (conditional probability of observation B given class Ai)
- P(B): a normalization constant that does not affect the decision
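
The total probability theorem and Bayes' formula can be sketched together in a few lines. The numbers below are hypothetical (a 1% prevalence and a test that is positive with probability 0.90 for diseased and 0.05 for healthy people); they are not taken from the exercises, and the helper name `posterior` is only illustrative.

```python
def posterior(priors, likelihoods):
    """Return ([P(Ai/B) for each i], P(B)) from priors P(Ai) and likelihoods P(B/Ai)."""
    # Joint probabilities P(B and Ai) = P(B/Ai) * P(Ai)
    joint = [p * l for p, l in zip(priors, likelihoods)]
    # Total probability theorem: P(B) is the sum of the joint probabilities
    p_b = sum(joint)
    # Bayes' formula: divide each joint probability by the normalization constant
    return [j / p_b for j in joint], p_b


# Hypothetical partition: A1 = diseased, A2 = healthy
priors = [0.01, 0.99]
likelihoods = [0.90, 0.05]  # P(test positive / Ai)
post, p_b = posterior(priors, likelihoods)
print(round(p_b, 4), [round(p, 4) for p in post])
```

With these assumed inputs the posterior probability of disease stays far below the test's 0.90 detection rate, because the prior is so small.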

Exercise:
1. Suppose that it is known that a fraction 0.001 of the people in a town have tuberculosis (TB).
A tuberculosis test is given with the following properties: if the person does have TB, then the
probability that the test indicates it is 0.999. If he does not have TB, then there is a probability 0.002
that the test will erroneously indicate that he does. For one randomly selected person, the test shows
that he has TB. What is the probability that he really does?
2. Five percent of the people have high blood pressure. Of the people with high blood pressure,
75 percent drink alcohol; whereas only 50 percent of the people without high blood pressure
drink alcohol. What percent of the drinkers have high blood pressure?
3. A drug store sells three different brands of drugs. Of its drug sales, 50% are drug type 1 (the
least expensive), 30% are drug type 2, and 20% are drug type 3. Each manufacturer offers a 1-year
warranty for inspection. It is known that 25% of drug type 1's require warranty
inspection, whereas the corresponding percentages for drug types 2 and 3 are 20% and 10%,
respectively.
A. What is the probability that a randomly selected customer has bought a drug type 1 that
will need inspection while under warranty?
B. What is the probability that a randomly selected customer bought a drug that will need
inspection while under warranty?
C. If a customer returns to the store with a drug that needs warranty inspection, what is
the probability that it is a drug type 1? A drug type 2? A drug type 3?
4. A diagnostic test for a certain disease is 95 percent accurate, in that if a person has the
disease, it will detect it with a probability of 0.95, and if a person does not have the disease, it
will give a negative result with a probability of 0.95. Suppose that only 0.5 percent of the
population has the disease in question. A person is chosen at random from this population. The
test indicates that this person has the disease. What is the (conditional) probability that he or
she does have the disease?

CHAPTER - SIX
RANDOM VARIABLE AND PROBABILITY DISTRIBUTIONS

Definition: A random variable is a numerical description of the outcomes of an experiment, or a
numerical-valued function defined on the sample space, usually denoted by capital letters.
If X is a random variable, then it is a function from the elements of the sample space to the set of
real numbers, i.e.
X is a function X: S → R
A random variable takes a possible outcome and assigns a number to it. Usually numbers can be
associated with the outcomes of an experiment.
For example, the number of heads that come up when a coin is tossed four times is 0, 1, 2, 3 or 4.
Sometimes, we may find a situation where the elements of a sample space are categories. In such
cases, we can assign numbers to the categories. Example: Flip a coin three times and let X be the
number of heads in the three tosses.
⇒ S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
⇒ X(HHH) = 3, X(HHT) = X(HTH) = X(THH) = 2,
X(HTT) = X(THT) = X(TTH) = 1,
X(TTT) = 0
so the possible values of X are {0, 1, 2, 3}.
X assumes each of these values with some probability.
Random variables are of two types:
Discrete random variables: variables which can assume only a specific number of values. They
have values that can be counted.
Examples:
- Toss a coin n times and count the number of heads.
- Number of children in a family.
- Number of patients injured in car accidents per week.
- Number of malaria-infected persons in Gibe Wereda.
- Number of bacteria per two cubic centimeters of water.

Continuous random variables: variables that can assume all values between any two given
values. Continuous random variables can assume an infinite number of values, including decimal
and fractional values.
Examples:
- Height of students at a certain college.
- Temperature over a 24-hour period, ranging from 62 to 78 degrees centigrade.
- Lifetime of light bulbs.
- Length of time required to reduce the progression of chronic kidney disease by
reducing expression of proinflammatory cytokines.

Definition: A probability distribution consists of the values a random variable can assume and the
corresponding probabilities of those values.

Example: Consider the experiment of tossing a coin three times. Let X be the number of heads.
Construct the probability distribution of X.

Solution:

- First identify the possible values that X can assume.
- Calculate the probability of each possible distinct value of X and express the result in the form of a
frequency distribution.

X = x       0     1     2     3
P(X = x)    1/8   3/8   3/8   1/8
A probability distribution is denoted by P for a discrete random variable and by f for a continuous random variable.
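
The same table can be built by listing the sample space and counting heads. This is a small sketch under the usual assumption of a fair coin, so that all eight outcomes are equally likely.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 2^3 equally likely outcomes of three tosses, e.g. ('H', 'T', 'H')
outcomes = list(product("HT", repeat=3))

# X maps each outcome to its number of heads; count how often each value occurs
counts = Counter(outcome.count("H") for outcome in outcomes)

# P(X = x) = (number of outcomes with x heads) / (total number of outcomes)
distribution = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

for x, p in distribution.items():
    print(x, p)
```

The probabilities are non-negative and add up to 1, which is what any probability distribution must satisfy.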

General rules which apply to any probability distribution:

- Since the values of a probability distribution are probabilities, they must be numbers in
the interval from 0 to 1.
- Since a random variable has to take on one of its values, the sum of all the values of a
probability distribution must be equal to 1.

In general, the properties of a probability distribution are:

1. \( P(x) \ge 0 \) if X is discrete; \( f(x) \ge 0 \) if X is continuous.
2. \( \sum_x P(X = x) = 1 \) if X is discrete; \( \int_x f(x)\,dx = 1 \) if X is continuous.

Note:
1. If X is a continuous random variable, then
\[ P(a < X < b) = \int_a^b f(x)\,dx \]
2. The probability of a fixed value of a continuous random variable is zero, so
\[ P(a < X < b) = P(a \le X < b) = P(a < X \le b) = P(a \le X \le b) \]
3. If X is a discrete random variable taking integer values, then
\[ P(a < X < b) = \sum_{x=a+1}^{b-1} P(x) \]
\[ P(a \le X < b) = \sum_{x=a}^{b-1} P(x) \]
\[ P(a < X \le b) = \sum_{x=a+1}^{b} P(x) \]
\[ P(a \le X \le b) = \sum_{x=a}^{b} P(x) \]

Probability means area for a continuous random variable.

Introduction to expectation

Definition:
1. Let a discrete random variable X assume the values X1, X2, ..., Xn with the probabilities
P(X1), P(X2), ..., P(Xn) respectively. Then the expected value of X, denoted E(X), is defined as:
\[ E(X) = X_1 P(X_1) + X_2 P(X_2) + \dots + X_n P(X_n) = \sum_{i=1}^{n} X_i P(X_i) \]
2. Let X be a continuous random variable assuming values in the interval (a, b) such
that \( \int_a^b f(x)\,dx = 1 \). Then
\[ E(X) = \int_a^b x\,f(x)\,dx \]

Example: What is the expected value of a random variable X obtained by tossing a coin three
times, where X is the number of heads?

Soln: First construct the probability distribution of X.

X = x       0     1     2     3
P(X = x)    1/8   3/8   3/8   1/8

⇒ E(X) = X1 P(X1) + X2 P(X2) + ... + Xn P(Xn)
= 0(1/8) + 1(3/8) + 2(3/8) + 3(1/8)
= 1.5

Example: Suppose a charity organization is mailing printed return-address stickers to over one million
homes in Ethiopia. Each recipient is asked to donate $1, $2, $5, $10, $15, or $20. Based on
past experience, the amount a person donates is believed to follow the following probability
distribution:

X = x       $1    $2    $5    $10   $15    $20
P(X = x)    0.1   0.2   0.3   0.2   0.15   0.05

How much is an average donor expected to contribute?

Solution:

X = x         $1    $2    $5    $10   $15    $20    Total
P(X = x)      0.1   0.2   0.3   0.2   0.15   0.05   1
x P(X = x)    0.1   0.4   1.5   2     2.25   1      7.25

\[ E(X) = \sum_{i=1}^{6} x_i P(X = x_i) = \$7.25 \]
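
The expected value of a discrete random variable is a probability-weighted sum, which is easy to check by machine. The helper below is a minimal sketch (the name `expected_value` is illustrative); it is applied to the donation table and to the three-coin example.

```python
def expected_value(values, probs):
    """E(X) = sum of x * P(X = x) over all values of a discrete random variable."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in zip(values, probs))


# Donation example: amounts in dollars and their probabilities
donation = expected_value([1, 2, 5, 10, 15, 20], [0.1, 0.2, 0.3, 0.2, 0.15, 0.05])

# Number of heads in three tosses of a fair coin
heads = expected_value([0, 1, 2, 3], [1 / 8, 3 / 8, 3 / 8, 1 / 8])

print(round(donation, 2), heads)
```

The check on the sum of the probabilities guards against a mistyped table before any expectation is computed.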

Mean and Variance of a random variable

Let X be a given random variable.

1. The expected value of X is its mean ⇒ Mean of X = E(X)
2. The variance of X is given by:
\[ \operatorname{Var}(X) = E(X^2) - [E(X)]^2 \]
where
\[ E(X^2) = \sum_{i=1}^{n} x_i^2\,P(X = x_i), \quad \text{if X is discrete} \]
\[ E(X^2) = \int_x x^2 f(x)\,dx, \quad \text{if X is continuous} \]

Example: Let X be the number of ears affected by one or more episodes of otitis media (ear infection)
during the first two years of life. Suppose the probability distribution for this random
variable is given below. The expected value and variance of the number of ears affected by ear infection
during the first two years of life are computed as follows:

x    P(X = x)   x P(X = x)   μx     (x − μx)² P(X = x)
0    0.13       0(0.13)      1.26   (0 − 1.26)²(0.13)
1    0.48       1(0.48)      1.26   (1 − 1.26)²(0.48)
2    0.39       2(0.39)      1.26   (2 − 1.26)²(0.39)
                E(X) = 1.26         Var(X) = 0.452

Interpretation: the mean number of ears affected by otitis media during the first two years of life is
1.26. The population standard deviation of X is σx = √(σx²) = √0.452 = 0.673 in our example.
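
As a cross-check, the variance can be computed both from its definition and from the shortcut E(X²) − [E(X)]². The sketch below uses the otitis media table; the variable names are illustrative.

```python
from math import sqrt

xs = [0, 1, 2]            # number of ears affected
ps = [0.13, 0.48, 0.39]   # P(X = x)

mean = sum(x * p for x, p in zip(xs, ps))

# Definition: probability-weighted squared deviations from the mean
var_definition = sum((x - mean) ** 2 * p for x, p in zip(xs, ps))

# Shortcut: E(X^2) - [E(X)]^2
var_shortcut = sum(x * x * p for x, p in zip(xs, ps)) - mean ** 2

sd = sqrt(var_definition)
print(round(mean, 2), round(var_definition, 3), round(sd, 3))
```

Both routes give the same variance up to floating-point rounding, which is a useful habit when checking hand calculations.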
Exercise: Consider the random variable X representing the number of episodes of diarrhoea in the first 2
years of life. Suppose this random variable has the probability mass function below.

r           0       1       2       3       4       5       6
P(X = r)    0.129   0.264   0.271   0.185   0.095   0.039   0.017

What is the expected number of episodes of diarrhoea in the first 2 years of life?

Compute the variance and SD of the random variable representing the number of episodes of diarrhoea
in the first 2 years of life.

Common Discrete Probability Distributions

1. Binomial Distribution: One of the most widely used of all discrete probability distributions is
the binomial distribution. A binomial experiment is a probability experiment that satisfies the
following four requirements, called the assumptions of a binomial distribution.

1. The experiment consists of n identical trials.
2. Each trial has only two possible mutually exclusive outcomes, a success or a
failure.
3. The probability of each outcome does not change from trial to trial, and
4. The trials are independent; strictly, this means we must sample with replacement.
Note that if the sample size, n, is less than 5% of the population, the independence
assumption is not of great concern. Therefore the acceptable sample size for using the
binomial distribution with samples taken without replacement is n < 5% of N, where n is
the sample size and N stands for the size of the population.
Some examples of binomial experiments are: the birth of children (male or female); true-false or
multiple-choice questions (correct or incorrect answers); the number of lifetime miscarriages
experienced by a randomly selected woman over the age of 50 who has had 5 lifetime pregnancies;
and registering a newly produced product as defective or non-defective.
Definition: The outcomes of the binomial experiment and the corresponding probabilities of
these outcomes are called the binomial distribution.
Let p = the probability of success
q = 1 − p = the probability of failure on any given trial
Then the probability of getting x successes in n trials becomes:
\[ P(X = x) = \binom{n}{x} p^x q^{n-x}, \quad x = 0, 1, 2, \dots, n \]
And this is sometimes written as: X ~ Bin(n, p)
When using the binomial formula to solve problems, we have to identify three things:
- The number of trials (n)
- The probability of a success on any one trial (p), and
- The number of successes desired (x).
Examples:
1. Suppose that an examination consists of six true and false questions, and assume that a student has no
knowledge of the subject matter. The probability that the student will guess the correct answer to the first
question is 30%. Likewise, the probability of guessing each of the remaining questions correctly is also
30%.
a) What is the probability of getting more than three correct answers?
b) What is the probability of getting at least two correct answers?
c) What is the probability of getting at most three correct answers?
d) What is the probability of getting less than five correct answers?
Soln: Let X = the number of correct answers that the student gets.
X ~ Bin(n = 6, p = 0.30)
\[ P(X = x) = \binom{6}{x} (0.3)^x (0.7)^{6-x}, \quad x = 0, 1, 2, \dots, 6 \]
a) P(X > 3) = ?
P(X > 3) = P(X = 4) + P(X = 5) + P(X = 6)
= 0.060 + 0.010 + 0.001 = 0.071
Thus, we may conclude that if each question is answered by guessing with a 30% chance of being
correct, the probability is 0.071 (or 7.1%) that more than three of the questions are answered
correctly by the student.
b) P(X ≥ 2) = ?
P(X ≥ 2) = P(X = 2) + P(X = 3) + P(X = 4) + P(X = 5) + P(X = 6)
= 0.324 + 0.185 + 0.060 + 0.010 + 0.001 = 0.58
c) P(X ≤ 3) = ?
P(X ≤ 3) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3)
= 0.118 + 0.303 + 0.324 + 0.185 = 0.93
d) P(X < 5) = ?
P(X < 5) = 1 − P(X ≥ 5)
= 1 − {P(X = 5) + P(X = 6)} = 1 − (0.010 + 0.001) = 0.989
2. Ten patients are treated surgically. For each person there is a 70% chance of successful surgery (i.e., p=
0.7). What is the probability that at most five surgeries are successful?
P[at most five successful cases]=P[five or fewer successful cases]
= P[five successful cases] + P[four successful cases] + P[three successful cases]
+ P[two successful cases] + P[one successful case] + P[no successful case]
= 0.1029 + 0.0368 + 0.0090 + 0.0014 + 0.0001 + 0.0000
= 0.1502
3. Suppose that 4% of all patients treated with a certain type of drug develop side effects. If eight of these
people are randomly selected from across the country and tested, what is the probability that exactly three
of them develop the side effect? Assume that each patient is treated independently of the others.
In this problem, n = 8, x = 3, p = 0.04, and q = (1 − p) = 0.96.
Substituting these numbers into the binomial formula (see the above equation) we get:
P(X = 3) = C(8, 3) (0.04)³ (0.96)⁵ = 56 × 0.000064 × 0.8154 = 0.0029, or 0.29%.
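
The binomial formula is simple to evaluate directly, which helps when checking tail probabilities built from several terms. The sketch below recomputes the three worked examples; the function names `binom_pmf` and `binom_cdf` are illustrative, not from any particular library.

```python
from math import comb


def binom_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for X ~ Bin(n, p): C(n, x) * p^x * q^(n - x)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def binom_cdf(x: int, n: int, p: float) -> float:
    """P(X <= x), summing the probabilities of 0, 1, ..., x successes."""
    return sum(binom_pmf(k, n, p) for k in range(x + 1))


more_than_three = 1 - binom_cdf(3, 6, 0.30)  # Example 1(a): about 0.070
at_most_five = binom_cdf(5, 10, 0.70)        # Example 2: about 0.150
exactly_three = binom_pmf(3, 8, 0.04)        # Example 3: about 0.0029

print(round(more_than_three, 4), round(at_most_five, 4), round(exactly_three, 4))
```

Small differences from the hand calculations in the third decimal place come from rounding each term before adding.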
Exercise: Suppose that in a certain malarious area past experience indicates that the probability that a person
with a high fever will be positive for malaria is 0.7. Consider 3 randomly selected patients (with high
fever) in that same area.
1) What is the probability that no patient will be positive for malaria?
2) What is the probability that exactly one patient will be positive for malaria?
3) What is the probability that exactly two of the patients will be positive for malaria?
4) What is the probability that all patients will be positive for malaria?
5) Find the mean and the SD of the probability distribution given above.
Remark: If X is a binomial random variable with parameters n and p, then
E(X) = np, Var(X) = npq

Poisson Distribution: A random variable X is said to have a Poisson distribution if its probability
distribution is given by:
\[ P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}, \quad x = 0, 1, 2, \dots \]
where λ = the average number of occurrences.

- The Poisson distribution depends only on the average number of occurrences per unit
of time or space.
- The Poisson distribution is used as a distribution of rare events, such as: number of
misprints, natural disasters like earthquakes, accidents, hereditary, the number of
patients admitted to a hospital emergency room per day, arrivals, etc.
- The process that gives rise to such events is called a Poisson process.

Note that instead of time, the Poisson random variable may be considered in the experiment
of counting the number x of times a particular event occurs during a given unit of area,
volume, etc.

Example: If 1.6 accidents can be expected at an intersection on any given day, what is the
probability that there will be 3 accidents on any given day?

Solution: Let X = the number of accidents, λ = 1.6
\[ X \sim \text{Poisson}(1.6) \Rightarrow P(X = x) = \frac{1.6^x e^{-1.6}}{x!} \]
\[ P(X = 3) = \frac{1.6^3 e^{-1.6}}{3!} = 0.1378 \]
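
The Poisson formula can be evaluated in one line. This is a minimal sketch for the accident example; `poisson_pmf` is an illustrative name.

```python
from math import exp, factorial


def poisson_pmf(x: int, lam: float) -> float:
    """P(X = x) for a Poisson random variable with mean lam."""
    return lam ** x * exp(-lam) / factorial(x)


p_three = poisson_pmf(3, 1.6)  # about 0.1378
print(round(p_three, 4))

# Sanity check: the probabilities over x = 0, 1, 2, ... should add up to 1
total = sum(poisson_pmf(x, 1.6) for x in range(50))
```

Summing the first fifty terms is enough here because the terms shrink very quickly once x exceeds the mean.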
Exercises:
1. On average, five smokers pass a certain street corner every ten minutes. What is
the probability that during a given 10 minutes the number of smokers passing will be
a) 6 or fewer b) 7 or more c) exactly 8?
2. Patients arrive at a certain hospital at an average rate of two every 10 minutes. The
number of arrivals is distributed according to a Poisson distribution. What is the
probability that there will be:

a. No arrivals during any period of ten minutes?
b. Exactly one arrival during this time period?
c. More than two arrivals during this time period?

If X is a Poisson random variable with parameter λ, then

E(X) = λ, Var(X) = λ

Note:

The Poisson probability distribution provides a close approximation to the binomial probability
distribution when n is large and p is quite small or quite large, with λ = np (when p is quite large,
the approximation is applied to the number of failures, with λ = n(1 − p)).
\[ P(X = x) = \frac{(np)^x e^{-np}}{x!}, \quad x = 0, 1, 2, \dots \]
where λ = np = the average number of occurrences.

Usually we use this approximation if np ≤ 5. In other words, if n > 20 and np ≤ 5 [or
n(1 − p) ≤ 5], then we may use the Poisson distribution as an approximation to the binomial distribution.

Example:

1. Find the binomial probability P(X = 3) by using the Poisson distribution if p = 0.01
and n = 200.

Solution:

Using Poisson, λ = np = 0.01 × 200 = 2
\[ P(X = 3) = \frac{2^3 e^{-2}}{3!} = 0.1804 \]
Using binomial, n = 200, p = 0.01
\[ P(X = 3) = \binom{200}{3} (0.01)^3 (0.99)^{197} = 0.1814 \]
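
The two answers can be compared side by side. The sketch below evaluates the exact binomial probability and its Poisson approximation for n = 200 and p = 0.01; the variable names are illustrative.

```python
from math import comb, exp, factorial

n, p, x = 200, 0.01, 3
lam = n * p  # Poisson mean matched to the binomial mean

# Poisson approximation with lambda = np
poisson = lam ** x * exp(-lam) / factorial(x)

# Exact binomial probability; note the exponent n - x = 197 on q
binomial = comb(n, x) * p ** x * (1 - p) ** (n - x)

print(round(poisson, 4), round(binomial, 4))
```

The two values agree to about two decimal places, which is typical when n is large and np is small.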

Common Continuous Probability Distribution

Introduction: There are many continuous probability distributions, such as, normal distribution, the t
distribution, the chi-square distribution, and F distribution. In this section, we will concentrate on the
normal distribution.

1. Normal Distribution

A random variable X is said to have a normal distribution if its probability density function is given by
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}, \quad -\infty < x < \infty,\ -\infty < \mu < \infty,\ \sigma > 0 \]
where μ = E(X) and σ² = Variance(X).
μ and σ² are the parameters of the normal distribution.
Properties of the Normal Distribution:
1. It is bell shaped, symmetrical about its mean, and mesokurtic. The maximum ordinate
is at x = μ and is given by
\[ f(\mu) = \frac{1}{\sigma\sqrt{2\pi}} \]
2. It is asymptotic to the x-axis, i.e., it extends indefinitely in either direction from the mean.
3. It is a continuous distribution.
4. It is a family of curves, i.e., every unique pair of mean and standard deviation defines a different
normal distribution. Thus, the normal distribution is completely described by two parameters:
mean and standard deviation.
5. The total area under the curve is 1, i.e., the area of the distribution on each side of the mean is
0.5: \( \int_{-\infty}^{\infty} f(x)\,dx = 1 \)
6. It is unimodal, i.e., values mound up only in the center of the curve.
7. Mean = Median = Mode = μ
8. The probability that a random variable will have a value between any two points is equal to the
area under the curve between those points.
Note: To facilitate the use of the normal distribution, the following distribution, known as the standard
normal distribution, was derived by using the transformation
\[ Z = \frac{X - \mu}{\sigma} \Rightarrow f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} z^2} \]
Properties of the Standard Normal Distribution:

Same as a normal distribution, but also:

- Mean is zero
- Variance is one
- Standard deviation is one

- Areas under the standard normal distribution curve have been tabulated in various ways. The
most common ones are the areas between Z = 0 and a positive value of Z.
- Given a normally distributed random variable X with mean μ and standard deviation σ,
\[ P(a < X < b) = P\left(\frac{a-\mu}{\sigma} < \frac{X-\mu}{\sigma} < \frac{b-\mu}{\sigma}\right) \]
\[ \Rightarrow P(a < X < b) = P\left(\frac{a-\mu}{\sigma} < Z < \frac{b-\mu}{\sigma}\right) \]
Note:

P(a < X < b) = P(a ≤ X < b) = P(a < X ≤ b) = P(a ≤ X ≤ b)


Examples:

1. Find the area under the standard normal distribution which lies

a) Between Z = 0 and Z = 0.96

Solution: Area = P(0 < Z < 0.96) = 0.3315

b) Between Z = −1.45 and Z = 0

Solution:
Area = P(−1.45 < Z < 0)
= P(0 < Z < 1.45)
= 0.4265

c) To the right of Z = −0.35

Solution:
Area = P(Z > −0.35)
= P(−0.35 < Z < 0) + P(Z > 0)
= P(0 < Z < 0.35) + P(Z > 0)
= 0.1368 + 0.50 = 0.6368

d) To the left of Z = −0.35

Solution:
Area = P(Z < −0.35)
= 1 − P(Z > −0.35)
= 1 − 0.6368 = 0.3632

e) Between Z = −0.67 and Z = 0.75

Solution:
Area = P(−0.67 < Z < 0.75)
= P(−0.67 < Z < 0) + P(0 < Z < 0.75)
= P(0 < Z < 0.67) + P(0 < Z < 0.75)
= 0.2486 + 0.2734 = 0.5220

f) Between Z = 0.25 and Z = 1.25

Solution:
Area = P(0.25 < Z < 1.25)
= P(0 < Z < 1.25) − P(0 < Z < 0.25)
= 0.3944 − 0.0987 = 0.2957
2. Find the value of z if
a) The normal curve area between 0 and z (positive) is 0.4726

Solution

P(0 < Z < z) = 0.4726, and from the table
P(0 < Z < 1.92) = 0.4726
⇔ z = 1.92 (by uniqueness of the area)
b) The area to the left of z is 0.9868

Solution

P(Z < z) = 0.9868
= P(Z < 0) + P(0 < Z < z)
= 0.50 + P(0 < Z < z)
⇒ P(0 < Z < z) = 0.9868 − 0.50 = 0.4868
and from the table
P(0 < Z < 2.22) = 0.4868
⇔ z = 2.22

3. A random variable X has a normal distribution with mean 80 and standard deviation 4.8.
What is the probability that it will take a value

a) Less than 87.2


b) Greater than 76.4
c) Between 81.2 and 86.0

Solution

X is normal with mean μ = 80 and standard deviation σ = 4.8.

a)
\[ P(X < 87.2) = P\left(\frac{X-\mu}{\sigma} < \frac{87.2-\mu}{\sigma}\right) = P\left(Z < \frac{87.2-80}{4.8}\right) \]
= P(Z < 1.5)
= P(Z < 0) + P(0 < Z < 1.5)
= 0.50 + 0.4332 = 0.9332

b)
\[ P(X > 76.4) = P\left(\frac{X-\mu}{\sigma} > \frac{76.4-\mu}{\sigma}\right) = P\left(Z > \frac{76.4-80}{4.8}\right) \]
= P(Z > −0.75)
= P(Z > 0) + P(0 < Z < 0.75)
= 0.50 + 0.2734 = 0.7734

c)
\[ P(81.2 < X < 86.0) = P\left(\frac{81.2-\mu}{\sigma} < \frac{X-\mu}{\sigma} < \frac{86.0-\mu}{\sigma}\right) = P\left(\frac{81.2-80}{4.8} < Z < \frac{86.0-80}{4.8}\right) \]
= P(0.25 < Z < 1.25)
= P(0 < Z < 1.25) − P(0 < Z < 0.25)
= 0.3944 − 0.0987 = 0.2957
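
Table look-ups like these can be cross-checked with the Python standard library, whose `statistics.NormalDist` class provides the normal cumulative distribution function. This is a minimal sketch for the example with mean 80 and standard deviation 4.8.

```python
from statistics import NormalDist

X = NormalDist(mu=80, sigma=4.8)

# a) P(X < 87.2): area to the left of 87.2
p_less = X.cdf(87.2)

# b) P(X > 76.4): complement of the area to the left of 76.4
p_greater = 1 - X.cdf(76.4)

# c) P(81.2 < X < 86.0): difference of two left-hand areas
p_between = X.cdf(86.0) - X.cdf(81.2)

print(round(p_less, 4), round(p_greater, 4), round(p_between, 4))
```

The computed values match the table-based answers to about three decimal places; the remaining difference comes from the four-digit rounding of the printed table.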

4. A normal distribution has mean 62.4. Find its standard deviation if 20.05% of the area under
the normal curve lies to the right of 72.9.

Solution

\[ P(X > 72.9) = 0.2005 \Rightarrow P\left(\frac{X-\mu}{\sigma} > \frac{72.9-\mu}{\sigma}\right) = 0.2005 \]
\[ \Rightarrow P\left(Z > \frac{72.9-62.4}{\sigma}\right) = 0.2005 \]
\[ \Rightarrow P\left(Z > \frac{10.5}{\sigma}\right) = 0.2005 \]
\[ \Rightarrow P\left(0 < Z < \frac{10.5}{\sigma}\right) = 0.50 - 0.2005 = 0.2995 \]
And from the table, P(0 < Z < 0.84) = 0.2995
\[ \Leftrightarrow \frac{10.5}{\sigma} = 0.84 \Rightarrow \sigma = 12.5 \]

5. A random variable has a normal distribution with σ = 5. Find its mean if the probability
that the random variable will assume a value less than 52.5 is 0.6915.

Solution

\[ P(Z < z) = P\left(Z < \frac{52.5-\mu}{5}\right) = 0.6915 \]
⇒ P(0 < Z < z) = 0.6915 − 0.50 = 0.1915
But from the table,
P(0 < Z < 0.5) = 0.1915
\[ \Leftrightarrow z = \frac{52.5-\mu}{5} = 0.5 \Rightarrow \mu = 50 \]
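
Problems 4 and 5 run the table backwards: a probability is given and the z value is looked up. In code this is the inverse CDF, available as `NormalDist.inv_cdf` in the Python standard library. The sketch below repeats both solutions; the variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Problem 4: 20.05% of the area lies to the right of 72.9, and the mean is 62.4
z4 = Z.inv_cdf(1 - 0.2005)     # z with area 0.7995 to its left, about 0.84
sigma = (72.9 - 62.4) / z4     # from z = (x - mu) / sigma

# Problem 5: P(X < 52.5) = 0.6915 with sigma = 5
z5 = Z.inv_cdf(0.6915)         # about 0.5
mu = 52.5 - 5 * z5             # from z = (x - mu) / sigma

print(round(sigma, 1), round(mu, 1))
```

Because the code does not round z to two decimals, the answers differ from the table-based ones only in later decimal places.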
6. Of a large group of men, 5% are less than 60 inches in height and 40% are between 60 & 65
inches. Assuming a normal distribution, find the mean and standard deviation of heights.
Solution (Exercise)
The Normal Approximation to the Binomial Distribution: As the sample size gets larger, the
binomial distribution approaches the normal distribution in shape regardless of the value of p
(probability of success). For large samples, the binomial distribution is cumbersome to
analyze without a computer. Fortunately, the normal distribution is a good approximation for
binomial distribution problems for large values of n. The commonly accepted guideline for
using the normal approximation to the binomial probability distribution is that np and n(1 − p)
are both greater than 5.
Example: Suppose that a physician claimed that 70% of his patients returned for annual
examination. In a year in which 80 new (first-time) patients were served at the clinic, what is the
probability that 60 or more of the patients will return for another examination, i.e., P(X ≥ 60)?
The solution to this problem can be illustrated as follows:
First, the two guidelines that np and n(1 − p) should be greater than 5 are satisfied: np =
80 × 0.70 = 56 > 5, and n(1 − p) = 80(1 − 0.70) = 24 > 5.
Second, we need to find the mean and the standard deviation of the binomial distribution. The
mean is equal to np = 80 × 0.70 = 56 and the standard deviation is the square root of np(1 − p),
i.e., the square root of 16.8, which is equal to 4.0988. Because a discrete count is being approximated
by a continuous curve, a continuity correction is applied and 59.5 is used in place of 60. Using the Z
equation we get Z = (X − mean)/standard deviation = (59.5 − 56)/4.0988 = 0.85. From the table, the
probability for this Z score is 0.3023, which is the probability between the mean (56) and 59.5. We must
subtract this table value 0.3023 from 0.5 in order to get the answer, i.e., P(X ≥ 60)
= 0.5 − 0.3023 = 0.1977. Therefore, the probability is 19.77% that 60 or more of the 80
first-time patients will return to the clinic for another examination.
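
To see how good the approximation is, the normal answer can be set beside the exact binomial tail. The sketch below assumes the same n = 80 and p = 0.70 and uses the continuity-corrected cut-off 59.5; the variable names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 80, 0.70
mean = n * p                    # 56
sd = sqrt(n * p * (1 - p))      # square root of 16.8

# Normal approximation with continuity correction: P(X >= 60) ~ P(Y > 59.5)
approx = 1 - NormalDist(mean, sd).cdf(59.5)

# Exact binomial tail: sum of P(X = k) for k = 60, ..., 80
exact = sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(60, n + 1))

print(round(approx, 4), round(exact, 4))
```

The approximate value differs slightly from 0.1977 because the code does not round Z to 0.85 before looking up the area.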


CHAPTER 7
7. Sampling and Sampling Distribution
Introduction
Given a variable X, if we arrange its values in ascending order and assign probability to each
of the values or if we present Xi in a form of relative frequency distribution the result is called
Sampling Distribution of X.
Definitions:
1. Parameter: Characteristic or measure obtained from a population.
2. Statistic: Characteristic or measure obtained from a sample.
3. Sampling: The process or method of sample selection from the population.
4. Sampling unit: the ultimate unit to be sampled, or the elements of the population to be
sampled.
Examples:
- If somebody studies the socio-economic status of households, households are the
sampling unit.
- If one studies the performance of freshman students in some college, the student is the
sampling unit.
5. Sampling frame: the list of all elements in a population.
Examples:
- List of households.
- List of students in the registrar office.
6. Errors in sample survey:
There are two types of errors
a) Sampling error:
- Is the discrepancy between the population value and the sample value.
- May arise due to inappropriate sampling techniques applied.
b) Non-sampling errors: are errors due to procedural bias, such as:
- Incorrect responses
- Measurement errors
- Errors at different stages in processing the data.
Advantages of the sampling approach over the census approach are:

- Reduced cost
- Greater speed
- Greater accuracy
- Greater scope
- More detailed information can be obtained.

- There are two types of sampling.

1. Random Sampling or probability sampling.
- Is a method of sampling in which all elements in the population have a pre-assigned non-zero
probability of being included in the sample.
Examples:
- Simple random sampling
- Stratified random sampling
- Cluster sampling
- Systematic sampling
1. Simple Random Sampling:
If a sample of size n is drawn from a population of size N in such a way that every
possible sample of size n has the same probability of being selected, then the sample is
called a simple random sample and the method of sampling is called simple random
sampling (SRS). That is, SRS is a sampling technique in which every item of the population
has an equal chance of being included in the sample.
There are two types of SRS:
- SRS with replacement: here a unit, after being selected, is put back into the population
before the next selection.
- SRS without replacement: here a given unit does not have a chance to be included
in the sample more than once.
Simple random sampling can be done using either the lottery method or a table of
random numbers.
2. Stratified Random Sampling:
- The population is divided into non-overlapping but exhaustive groups called
strata.
- Simple random samples are chosen from each stratum.
- Elements in the same stratum should be more or less homogeneous, while elements in
different strata differ.
- It is applied if the population is heterogeneous.
- Some of the criteria for dividing a population into strata are: Sex (male, female); Age
(under 18, 18 to 28, 29 to 39, etc); Occupation (blue-collar, professional, other).
3. Cluster Sampling:
- The population is divided in to non-overlapping groups called clusters.
- A simple random sample of groups or cluster of elements is chosen and all the sampling
units in the selected clusters will be surveyed.
- Clusters are formed in a way that elements within a cluster are heterogeneous, i.e.
observations in each cluster should be more or less dissimilar.
- Cluster sampling is useful when it is difficult or costly to generate a simple random
sample. For example, to estimate the average annual household income in a large city
we use cluster sampling, because to use simple random sampling we need a complete
list of households in the city from which to sample. To use stratified random sampling,
we would again need the list of households. A less expensive way is to let each block
within the city represent a cluster. A sample of clusters could then be randomly
selected, and every household within these clusters could be interviewed to find the
average annual household income.
4. Systematic Sampling:
- A complete list of all elements within the population (sampling frame) is required.
- The procedure starts by determining the first element to be included in the sample.
- Then the technique is to take every kth item from the sampling frame.
- Let N = population size, n = sample size, and k = N/n = sampling interval.
- Choose any number between 1 and k. Suppose it is j (1 ≤ j ≤ k).
- The jth unit is selected first, and then the (j + k)th, (j + 2k)th, ... units, until
the required sample size is reached.
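
The procedure above can be sketched as a short function. The sampling frame of 100 numbered units is hypothetical, the function name `systematic_sample` is illustrative, and the interval is taken as the integer part of N/n, which is an assumption for cases where N is not an exact multiple of n.

```python
import random


def systematic_sample(frame, n, rng=None):
    """Pick every k-th unit from the frame after a random start j, 1 <= j <= k."""
    rng = rng or random.Random()
    k = len(frame) // n       # sampling interval (integer part of N / n)
    j = rng.randint(1, k)     # random start, inclusive on both ends
    # Units j, j + k, j + 2k, ... in 1-based numbering (hence the "- 1")
    return [frame[j - 1 + i * k] for i in range(n)]


# Hypothetical sampling frame: N = 100 units numbered 1 to 100
frame = list(range(1, 101))
sample = systematic_sample(frame, 10, random.Random(0))
print(sample)
```

Passing a seeded `random.Random` makes the draw reproducible; leaving it out gives a fresh random start each time.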
2. Non Random Sampling or non-probability sampling.
- It is a sampling technique in which the choice of individuals for a sample depends on the
basis of convenience, personal choice or interest.
Examples:
 Judgment sampling.
 Convenience sampling
 Quota Sampling.
1. Judgment Sampling
- In this case, the person taking the sample has direct or indirect control over which
items are selected for the sample.
2. Convenience Sampling
- In this method, the decision maker selects a sample from the population in a manner
that is relatively easy and convenient.
3. Quota Sampling
- In this method, the decision maker requires the sample to contain a certain number of
items with a given characteristic. Many political polls are, in part, quota sampling.
Note:
Let N = population size, n = sample size.
1. Suppose simple random sampling is used.
- We have \( N^n \) possible samples if sampling is with replacement.
- We have \( \binom{N}{n} \) possible samples if sampling is without replacement.
2. From here onwards we consider that samples are drawn from a given population
using simple random sampling.
Sampling Distribution of the sample mean
- Sampling distribution of the sample mean is a theoretical probability distribution that shows
the functional relationship between the possible values of a given sample mean based on
samples of size n and the probability associated with each value, for all possible samples of
size n drawn from that particular population.
- There are commonly three properties of interest of a given sampling distribution:
- Its mean
- Its variance
- Its functional form.

Steps for the construction of the sampling distribution of the mean

1. From a finite population of size N, randomly draw all possible samples of size n.
2. Calculate the mean for each sample.
3. Summarize the means obtained in step 2 in terms of a frequency distribution or relative
frequency distribution.

Example: Suppose we have a population of size N = 5, consisting of the ages of five
children: 6, 8, 10, 12, and 14.
⇒ Population mean = μ = 10
Population variance = σ² = 8
Take samples of size 2 with replacement and construct the sampling distribution of the sample
mean.
Solution: N = 5, n = 2
- We have N^n = 5² = 25 possible samples since sampling is with replacement.
Step 1: Draw all possible samples:
       6        8        10        12        14
6      (6,6)    (6,8)    (6,10)    (6,12)    (6,14)
8      (8,6)    (8,8)    (8,10)    (8,12)    (8,14)
10     (10,6)   (10,8)   (10,10)   (10,12)   (10,14)
12     (12,6)   (12,8)   (12,10)   (12,12)   (12,14)
14     (14,6)   (14,8)   (14,10)   (14,12)   (14,14)
Step 2: Calculate the mean for each sample:
       6     8     10    12    14
6      6     7     8     9     10
8      7     8     9     10    11
10     8     9     10    11    12
12     9     10    11    12    13
14     10    11    12    13    14

Step 3: Summarize the means obtained in step 2 in terms of a frequency distribution.

X̄      Frequency
6      1
7      2
8      3
9      4
10     5
11     4
12     3
13     2
14     1

a) Find the mean of X̄, say \( \mu_{\bar X} \)
\[ \mu_{\bar X} = \frac{\sum \bar X_i f_i}{\sum f_i} = \frac{250}{25} = 10 = \mu \]
b) Find the variance of X̄, say \( \sigma^2_{\bar X} \)
\[ \sigma^2_{\bar X} = \frac{\sum (\bar X_i - \mu_{\bar X})^2 f_i}{\sum f_i} = \frac{100}{25} = 4 \ne \sigma^2 \]
Remark:
1. In general, if sampling is with replacement,
\[ \sigma^2_{\bar X} = \frac{\sigma^2}{n} \]
2. If sampling is without replacement,
\[ \sigma^2_{\bar X} = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right) \]
3. In any case the sample mean is an unbiased estimator of the population mean, i.e.
\( \mu_{\bar X} = \mu \Rightarrow E(\bar X) = \mu \) (Show!)
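
The construction in the example, and the first and third remarks, can be verified by listing every possible sample. The sketch below uses the five ages from the example and sampling with replacement; the variable names are illustrative.

```python
from itertools import product
from statistics import mean, pvariance

population = [6, 8, 10, 12, 14]
n = 2

# Step 1 and 2: all N^n ordered samples with replacement, and the mean of each
sample_means = [sum(s) / n for s in product(population, repeat=n)]

mu = mean(population)             # population mean
sigma2 = pvariance(population)    # population variance (divides by N)

# Step 3: summarize the sampling distribution of the sample mean
mu_xbar = mean(sample_means)
var_xbar = pvariance(sample_means)

print(len(sample_means), mu_xbar, var_xbar, sigma2 / n)
```

The mean of the sample means equals the population mean, and their variance equals the population variance divided by n, as the remarks state for sampling with replacement.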
- Sampling may be from a normally distributed population or from a non-normally
distributed population.
- When sampling is from a normally distributed population, the distribution of X̄
will possess the following properties.
1. The distribution of X̄ will be normal.
2. The mean of X̄ is equal to the population mean, i.e.
\[ \mu_{\bar X} = \mu \]
3. The variance of X̄ is equal to the population variance divided by the sample size, i.e.
\[ \sigma^2_{\bar X} = \frac{\sigma^2}{n} \]
\[ \Rightarrow \bar X \sim N\left(\mu, \frac{\sigma^2}{n}\right) \]
\[ \Rightarrow Z = \frac{\bar X - \mu}{\sigma/\sqrt{n}} \sim N(0, 1) \]
Central Limit Theorem
Given a population of any functional form with mean μ and finite variance σ², the
sampling distribution of X̄, computed from samples of size n from the population, will be
approximately normally distributed with mean μ and variance σ²/n when the sample size is
large.
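
A small simulation can illustrate the theorem. The sketch below is only an illustration under assumed settings: the population is exponential with mean 1 and variance 1 (a clearly non-normal, skewed shape), the sample size is 30, and 5000 samples are drawn with a fixed seed so the run is reproducible.

```python
import random
from statistics import mean, pvariance

rng = random.Random(42)   # fixed seed for a reproducible run
n, reps = 30, 5000        # sample size and number of repeated samples

# Exponential population with rate 1: mean 1, variance 1, strongly skewed
sample_means = [
    sum(rng.expovariate(1.0) for _ in range(n)) / n
    for _ in range(reps)
]

center = mean(sample_means)       # should be close to mu = 1
spread = pvariance(sample_means)  # should be close to sigma^2 / n = 1/30

# For a normal curve, about 68% of values fall within one standard deviation
sd = spread ** 0.5
within_one_sd = sum(abs(m - center) <= sd for m in sample_means) / reps

print(round(center, 3), round(spread, 4), round(within_one_sd, 3))
```

Even though the individual observations are far from normal, the sample means cluster around μ with variance near σ²/n, and the share within one standard deviation is close to what a normal curve would give.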
