



Dr. Harish Kumar Thota

▪ Introduction
▪ Sampling and sampling designs
▪ Collection of Data
▪ Presentation of Data
▪ Measures of central tendency
▪ Measures of variation or dispersion
▪ Normal distribution
▪ Null hypothesis
▪ Tests of statistical significance
▪ Conclusion
▪ References

• “Statistics” comes from the Italian word ‘statista’, meaning statesman, or
the German word ‘statistik’, meaning political state. The science of
statistics is said to have originated from two main sources, viz., (1)
government records and (2) mathematics. It started with the registration of
heads of families in ancient Egypt and the Roman census on military
strength, births, deaths, etc.
• John Graunt (1620-1674), who was neither a physician nor a
mathematician, is regarded as the Father of Health Statistics.
• Statistics is a field of study concerned with techniques or methods of
collection of data, classification, summarizing, interpretation, drawing
inferences, testing of hypotheses, and making recommendations.
• Biostatistics is the term used when the tools of statistics are applied to
data derived from the biological sciences. While conducting an oral
examination, an investigator makes observations according to his judgement
of the situation, which depends on his skill, knowledge and experience. When
the same observer repeats the procedure, or when it is carried out by
another investigator, there may be some variability in opinions. This
variability in measurement can be handled using statistics. Epidemiology
and statistics are called Sister Sciences. Epidemiology collects facts
relating to groups of population in places, times and situations, whereas
biostatistics converts these facts into figures and then translates them
back into facts to interpret the significance of the results. Statistics
is also called the ‘Science of figures’.
• Uses of biostatistics-
1. To test whether the difference between two population groups is real or
due to chance.
2. To study the correlation between attributes in the same population.
3. To measure mortality and morbidity.
4. To fix priorities in public health programs.
5. To assess the state of oral health in the community and to determine the
availability and utilization of dental care facilities.
6. To determine the success or failure of specific oral health care programs.
7. To evaluate the achievements of public health programs.


• Sample- A sample is a part of a population, which is called the
“universe”, “reference” or “parent population”. It is basically a subset of
a population.
• Sampling- the process of selecting a sample.
• Sampling frame- the set of sampling units from which a sample is to be
selected, e.g., a list of names, places, ages.

Ideal requirements of sample

1. Efficiency- Ability of a sample to yield the desired information.
2. Representativeness- Sample should not differ from parent population.
3. Measurability- The design of the sample should be such that the
investigator is able to estimate the extent to which findings from the
sample are likely to differ from the parent population.
4. Size- The sample should be large enough to minimize sampling variability
and to allow estimates of population characteristics to be made with
5. Coverage- Adequate coverage is essential if the sample has to remain
representative. Refusal, non-follow up, withdrawals make the sample
non representative.
6. Goal selection- selection should be oriented towards objectives and
research design.
7. Feasibility- should be simple enough to be carried out in practice.
8. Economy and cost efficiency- The sample design should save time and

 Types of sampling
Purposive or non-probability sampling
It is the procedure of selecting a sample from a population without the use
of probability. Deliberate or purposive selection of individuals is done.
Random or probability sampling
It is a sample in which every individual has an equal chance of being selected.

 Sampling Methods

a) Simple random sampling – each and every unit has an equal chance of
being selected. Selection is by chance. It is carried out either by the
lottery method or with a table of random numbers. In the lottery method,
units are numbered on slips, which are shuffled and drawn by a blindfolded
investigator. In the other method, a table in which the digits 0-9 are
arranged randomly is used, and selection is done in a horizontal or
vertical direction.
b) Systematic random sampling – one unit is selected at random, and
additional units are then selected at evenly spaced intervals until a
sample of the required size has been drawn.
c) Stratified random sampling – the population is subdivided into groups
(strata); simple random selection is done from each stratum or group.
d) Cluster sampling – the population forms natural groups such as villages
or the children of a school. The sample is then selected by any of the
above methods.
e) Multiphase sampling – a part of the information is collected from the
whole sample and a part from a sub-sample.
e.g., students from a school are examined, students with malocclusion
are selected, and then students with skeletal malocclusion are selected.
f) Multistage sampling – it is done in various stages; the 1st stage is to
select the groups or clusters, and samples are taken in subsequent stages:
1st stage Choice of states within country
2nd stage Choice of towns within each state
3rd stage Choice of neighborhoods within each town.
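
The first three methods above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 20 numbered units, the two strata, and the function names are hypothetical, and Python's `random` module stands in for the lottery or the table of random numbers.

```python
import random

def simple_random_sample(units, n, rng):
    """Every unit has an equal chance of selection (lottery method)."""
    return rng.sample(units, n)

def systematic_sample(units, n, rng):
    """Pick a random start, then take every k-th unit."""
    k = len(units) // n                 # sampling interval
    start = rng.randrange(k)            # random start within the first interval
    return [units[start + i * k] for i in range(n)]

def stratified_sample(strata, n_per_stratum, rng):
    """Simple random selection inside each stratum (group)."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

rng = random.Random(42)                 # fixed seed so the run is repeatable
population = list(range(1, 21))         # hypothetical units numbered 1..20
strata = {"boys": list(range(1, 11)), "girls": list(range(11, 21))}

srs = simple_random_sample(population, 5, rng)
sys_sample = systematic_sample(population, 5, rng)
strat = stratified_sample(strata, 2, rng)
print(srs, sys_sample, strat)
```

In the systematic sample the interval is 20 / 5 = 4, so consecutive selected units are always 4 apart, whatever the random start.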

 Errors of sampling
 Sampling errors – occur due to faulty sample design and small
size of sample.
 Non sampling errors – Coverage errors are due to non-cooperation
and non-response of the informant, observational errors are due to
imperfect experimental technique and interviewer bias, and processing
errors may occur in statistical analysis.


Data are a set of values of one or more variables recorded on one or more
individuals. Data consist of discrete observations of people or events that
carry little meaning when considered alone. They need to be transformed
into information by reducing, summarizing and adjusting them in such a
way that comparisons over time and place are possible. Data are of two types-
1. Qualitative data- When data are collected on the basis of
attributes or qualities like sex, malocclusion, cavities etc.
2. Quantitative data- When the data are collected through
measurement, e.g., arch length, arch width. It is of 2 types-
a) Discrete- When the variable under observation takes
only fixed values like whole numbers, the data are discrete,
e.g., the DMF teeth.
b) Continuous- If the variable can take any value in a
given range, including decimals, the data are called continuous,
like arch length, mesiodistal width of the erupted teeth.
Sources - data can be collected by
a) Primary Source- The data are obtained by the investigator himself. They can be
collected by
i) Direct personal interviews
ii) Oral health examination
iii) Questionnaire method
b) Secondary Source- Data that have already been recorded are utilized to serve
the purpose of the objective of the study,
e.g., records of the O.P.D. of dental clinics.
2 main types of data presentation
Graphic presentation
Charts – bar charts
pie chart
doughnut chart
Diagrams – histograms
• Bar charts: they represent the set of data by the length of a bar which is
proportional to magnitude of the data. They are of 3 types, 1) simple bar
2) multiple bar 3) component bar.

• Pie chart: here instead of comparing the length of bar, the areas of
segments of a circle are compared. The area of each segment depends
upon the percentage, which is converted to angle and drawn.
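
The conversion from percentage to angle can be sketched as follows: since the full circle is 360 degrees, each 1 per cent corresponds to 3.6 degrees. The category names and counts below are hypothetical and serve only as an illustration.

```python
def pie_angles(counts):
    """Convert category counts to pie-chart segment angles in degrees."""
    total = sum(counts.values())
    return {name: 360 * value / total for name, value in counts.items()}

# Hypothetical counts of patients by malocclusion class (illustration only)
counts = {"Class I": 50, "Class II": 30, "Class III": 20}
angles = pie_angles(counts)
print(angles)
```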

• Pictogram: here pictures or symbols are used to present the data

• Histogram: is a set of vertical bars whose areas are proportional to
frequencies presented. Class intervals are given on horizontal axis and
frequencies along the vertical axis.

• Line chart: it shows the trends or changes in data varying with the
constant, at even intervals. It emphasizes the flow of a constant and rate
of change, rather than amount of change.

• Frequency curve: it is a graphical display of a frequency table. The midpoints
of each frequency bar are located and then connected to
form a polygon.
▪ Central tendency refers to the middle value of the observations, which serves
as a single estimate of a series of data and enables comparison. The objective
of a measure of central tendency is to condense the entire mass of data and to
facilitate comparison. It is of three types: Mean, Median and Mode.
• Mean: It is the average value obtained by summing all observations and
dividing by the total number of observations.
• It is the simplest method to measure the central tendency.
 It is easy to understand and calculate.
 It is based on all values in the data.
 It is rigidly defined.
 It is not much affected by sampling fluctuations.
X It cannot be calculated if any observations are missing.
X It is affected by extreme values.
X It cannot be located graphically.
X It may be a number which is not present in the given data.

▪ Median: The data are arranged in either ascending or descending order and the
value of the middle observation is located.
 It is rigidly defined.
 It is easy to calculate and understand.
 It is not affected by extreme values.
 It can be located just by an inspection in many cases.
 It can be located on graph.
 It can be calculated for the data based on ordinal scale.

X Not based upon all values of given data.
X In case of larger samples, it’s difficult to arrange in an order.
X It is not capable of further mathematical treatment.

▪ Mode is the predominant or most commonly occurring value in a distribution
of data. Sometimes there is no single mode; a distribution can be bimodal or trimodal.
 It is easy to understand and calculate.
 It is not affected by extreme values.
 It can be calculated even if extreme values are unknown.
 It is applicable for both qualitative and quantitative data.
X It is not rigidly defined.
X It is not based upon all values of data.
X It is not capable of further mathematical treatment.
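
As a small illustration of the three measures, the sketch below uses Python's standard `statistics` module on a hypothetical set of DMF scores (the data are invented for the example). Dropping the single extreme value shows why the mean is described as affected by extreme values while the median is not.

```python
import statistics

# Hypothetical DMF teeth scores for nine children (illustration only)
scores = [1, 2, 2, 3, 4, 2, 5, 3, 14]

mean = statistics.mean(scores)          # sum of observations / number of observations
median = statistics.median(scores)      # middle value after sorting
modes = statistics.multimode(scores)    # most frequent value(s); may be more than one

# Same data without the extreme value 14
trimmed = scores[:-1]
mean_trimmed = statistics.mean(trimmed)
median_trimmed = statistics.median(trimmed)

print(mean, median, modes)
print(mean_trimmed, median_trimmed)
```

Here the mean falls from 4 to 2.75 when the extreme value is removed, while the median only moves from 3 to 2.5.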
▪ The scatteredness or the variation of observation from their average is
dispersion. The objective is to study the variability of data and accounting
the variability in data.
▪ Types – Range, Mean deviation, Standard deviation.
▪ The difference between the highest and lowest values in given data is called
the range. It is the simplest measure of dispersion.
 It is easy to understand.
 It can be quickly calculated.
X Value fluctuates with size of distribution.
X Unstable in repeated sampling.
X Not suitable for precise and accurate studies.
 It is of no practical importance as it does not indicate anything about
dispersion of values between two extreme values.
Example - 1 2 3 4 5 6 7 8 9
The lowest value is 1 and the highest value is 9,
hence the range of these values is 9 - 1 = 8 (from 1 to 9).
The range is not of much practical importance. It indicates only the
extreme values and tells nothing about the dispersion of values between
these two extremes.

The Mean Deviation:

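 It is the average of the deviations from the arithmetic mean, taken without regard to sign.

 M.D. = (sum of deviations from the mean) / (number of observations), i.e. $\text{M.D.} = \frac{\sum |x_i - \bar{x}|}{n}$

For the example values 1 to 9, the mean is 45/9 = 5.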
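
The range and the mean deviation for the example values 1 to 9 can be worked out as in the sketch below. One assumption is made explicit: the deviations are taken as absolute values, since signed deviations from the mean always sum to zero.

```python
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]

value_range = max(values) - min(values)             # highest minus lowest
mean = sum(values) / len(values)                    # 45 / 9
# Mean deviation: average of the absolute deviations from the mean
mean_deviation = sum(abs(x - mean) for x in values) / len(values)

print(value_range, mean, round(mean_deviation, 2))
```

The absolute deviations are 4, 3, 2, 1, 0, 1, 2, 3, 4, which sum to 20, so the mean deviation is 20/9, about 2.22.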


Standard Deviation:

 In simple terms, it is defined as the "Root Mean Square Deviation." The
standard deviation is the most frequently used measure of deviation.

1. First of all, take the deviation of each value from the arithmetic mean.
2. Then, square each deviation.
3. Add up the squared deviations.
4. Divide the result by the number of observations N [or (N - 1) in case the
sample size is less than 30].
5. Then take the square root, which gives the standard deviation.

$S.D. = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$
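
The five steps can be followed literally in code. The sketch below reuses the values 1 to 9 as an assumed example, computes the result with both divisors (N and N - 1), and cross-checks against the standard library functions `statistics.pstdev` and `statistics.stdev`.

```python
import math
import statistics

values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
n = len(values)
mean = sum(values) / n

deviations = [x - mean for x in values]             # step 1
squared = [d ** 2 for d in deviations]              # step 2
total = sum(squared)                                # step 3
sd_n = math.sqrt(total / n)                         # steps 4-5, dividing by N
sd_n_minus_1 = math.sqrt(total / (n - 1))           # steps 4-5, dividing by N - 1

# Cross-check against the standard library
assert math.isclose(sd_n, statistics.pstdev(values))
assert math.isclose(sd_n_minus_1, statistics.stdev(values))
print(round(sd_n, 3), round(sd_n_minus_1, 3))
```

The squared deviations sum to 60, giving about 2.582 with divisor N and about 2.739 with divisor N - 1.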

 It is an abstract number that gives us an idea of the ‘spread’ of the dispersion.
 The larger the standard deviation, the greater the dispersion of values about the
mean.



The shape of the curve will depend upon the mean and standard deviation

which in turn will depend upon the number and nature of observations.

In a normal curve

o The area between one standard deviation on either side of the mean

will include approximately 68 % of the values in the distribution. The

area between two standard deviations on either side of the mean will

cover most of the values, approximately 95 per cent of the values, and

the area between three standard deviations will include 99.7 per cent

of the values. These limits on either side of the mean are called

"confidence limits."
Standard normal curve

1. The standard normal curve is a smooth, bell-shaped, perfectly
symmetrical curve based on an infinitely large number of observations.
2. The total area of the curve is 1, its mean is 0 and its standard
deviation is 1.
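
The approximate areas of 68, 95 and 99.7 per cent can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `area_within` is made up for this example.

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)           # standard normal curve: mean 0, SD 1

def area_within(k):
    """Area under the curve between -k and +k standard deviations."""
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(k, round(100 * area_within(k), 1))
```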

 Tests of significance

The different samples drawn from the same population have

different estimates. The difference in the estimates is called

sampling variability. Hence, when dealing with 2 or more samples

one is interested to know whether the difference in the values is

due to sampling variations or not.

Null hypothesis:

The first step in testing a hypothesis is to set up a hypothesis
appropriate to the problem. The null hypothesis asserts that there is
no real difference between the sample and the population in the particular
matter under consideration, and that any difference found is accidental and
arises out of sampling variation. E.g., to test the association between
thumb sucking and upper anterior proclination, the null hypothesis
would be that there is no association between thumb sucking and upper
anterior proclination.

Type I and type II errors

Even in the best research there is a possibility that the researcher will

make a mistake regarding the relationship between 2 variables.

There are 2 possible errors.

o Type I

o Type II

Type I error- (false-positive)

 Occurs if an investigator rejects a null hypothesis that is actually

true in the population.

Type II error-(false-negative)

 Occurs if the investigator fails to reject a null hypothesis that is

actually false in the population.
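
A Type I error can be made concrete with a small simulation. In the sketch below, both samples are always drawn from the same population, so the null hypothesis is true by construction and every rejection is a false positive. The 5 per cent significance level, the sample size of 40, the known standard deviation and the fixed random seed are all assumptions chosen for illustration.

```python
import random
from statistics import NormalDist

rng = random.Random(0)                   # fixed seed so the run is repeatable
ALPHA = 0.05                             # assumed significance level
z_crit = NormalDist().inv_cdf(1 - ALPHA / 2)   # about 1.96 for a two-sided test

def rejects_null(n=40, sigma=1.0):
    """Draw two samples from the SAME population and run a z test."""
    a = [rng.gauss(0, sigma) for _ in range(n)]
    b = [rng.gauss(0, sigma) for _ in range(n)]
    se = (sigma ** 2 / n + sigma ** 2 / n) ** 0.5   # standard error of the difference
    return abs(sum(a) / n - sum(b) / n) / se > z_crit

trials = 2000
type_1_rate = sum(rejects_null() for _ in range(trials)) / trials
print(type_1_rate)
```

The proportion of false rejections should come out close to the chosen significance level, which is what that level means.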

 Tests of statistical significance:

Parametric tests


Z test
It is used to test the significance of the difference of means for large
samples.
Pre-requisites-
The sample must be randomly selected, and the data must be
quantitative. The variable measured is assumed to follow a
normal distribution in the population. The sample size should be greater
than 30.
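
For large samples, the difference between two means is commonly tested with a z statistic: the observed difference divided by its standard error. The sketch below assumes that this is the test being described; the function name and the summary figures are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_test_two_means(mean1, sd1, n1, mean2, sd2, n2):
    """Return (z, two-sided p value) for the difference between two means."""
    se_diff = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)   # standard error of the difference
    z = (mean1 - mean2) / se_diff
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical summary figures for two groups of 50 (illustration only)
z, p = z_test_two_means(10.0, 2.0, 50, 9.0, 2.0, 50)
print(round(z, 2), round(p, 4))
```

With these figures the standard error is 0.4, so z = 1 / 0.4 = 2.5, which lies beyond 1.96 and would be called significant at the 5 per cent level.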


When the sample size is small, the t test is used to test the hypothesis. It
was designed by W.S. Gosset, whose pen name was Student, hence
it is called Student's t test. The t ratio is the ratio of the observed
difference between 2 means of small samples to the standard error of that
difference.
There are 2 types.
 unpaired t test
 paired t test [1]

paired t test is applied when there is a pair of data from single

element in an observation. Data collected before and after

intervention so that the same group acts as both case and control,

then the mean of both groups is compared to get the values.

Example: 2 BP measurements on the same person using different


The unpaired t test is used to compare the averages of 2
independent or unrelated groups to determine if there is
a difference between the two.
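
The two t ratios can be computed by hand as in the sketch below. It makes two assumptions that the text does not state: the unpaired version pools the two variances (equal-variance form), and the resulting t is meant to be compared with a tabulated critical value for the given degrees of freedom. The readings are invented.

```python
from math import sqrt
from statistics import mean, variance   # variance() divides by n - 1

def paired_t(before, after):
    """t = mean difference / standard error of the differences, df = n - 1."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    return mean(diffs) / sqrt(variance(diffs) / n), n - 1

def unpaired_t(group1, group2):
    """Pooled-variance t for two independent groups, df = n1 + n2 - 2."""
    n1, n2 = len(group1), len(group2)
    pooled = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    se_diff = sqrt(pooled * (1 / n1 + 1 / n2))
    return (mean(group1) - mean(group2)) / se_diff, n1 + n2 - 2

# Hypothetical systolic BP readings before and after an intervention
before = [120, 130, 125, 140, 135]
after = [115, 128, 120, 135, 130]
t_paired, df_paired = paired_t(before, after)

# Hypothetical scores for two unrelated groups
group_a = [1, 2, 3, 4, 5]
group_b = [3, 4, 5, 6, 7]
t_unpaired, df_unpaired = unpaired_t(group_a, group_b)
print(round(t_paired, 2), df_paired, round(t_unpaired, 2), df_unpaired)
```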


Analysis of co variance. Many research problems involve comparing

more than two groups. If the design includes only one independent

variable the technique is called one way ANOVA. If there are more

factors within each group two-way ANOVA is considered also known

as N way analysis of variance. In many experiments, the outcome of

a variable depends on the magnitude of the variable before

subjecting the experimental units for experimentation. As such, it

may be necessary to analyze the outcome values in relation to initial

values. In some other cases, the outcome of a particular variable may be
dependent on the outcome of another variable. In such cases it is

desired to analyze the significance of the effect of this variable on

the outcome of the experimental variable. This technique combines

features of analysis of variance and regression analysis.


In a study, an investigator wanted to study the effect of drugs A and B
on blood pressure. The subjects were randomly allocated into 4 groups.

o Those taking drug A alone.

o Those taking drug B alone.

o Taking both A and B

o Taking placebo.

The difference between pretreatment and post-treatment systolic blood
pressure is determined. Then the mean difference is calculated. The F
test is a kind of super t test that allows investigators to compare
more than two means simultaneously. The null hypothesis for the
F test is that the mean change in blood pressure will be the same in all
groups, indicating that all samples were from the same
population.


The ANOVA test uses two measures of variance:
1. between-group variance
2. within-group variance, which is based on the variation within each group.
F ratio = between-group variance / within-group variance

If the F ratio is fairly close to 1, the two estimates of variance are similar
and the null hypothesis that all of the means came from the same
underlying population is not rejected. If the ratio is much larger than
one, there must have been some group differences.
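
The F ratio for a one-way design can be computed directly from its definition. In the sketch below the function name and the three small groups are hypothetical; the between-group variance is the between-group sum of squares divided by (number of groups - 1), and the within-group variance is the within-group sum of squares divided by (total observations - number of groups).

```python
def one_way_f(groups):
    """Return the F ratio: between-group variance / within-group variance."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    ms_between = ss_between / (k - 1)          # between-group variance
    ms_within = ss_within / (n_total - k)      # within-group variance
    return ms_between / ms_within

# Hypothetical falls in systolic BP for three treatment groups (illustration only)
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_ratio = one_way_f(groups)
print(round(f_ratio, 2))
```

Here the between-group variance is 13 and the within-group variance is 1, so F = 13, far larger than 1.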


The two-way ANOVA compares the mean differences between groups

that have been split on two independent variables (called factors).

The primary purpose of a two-way ANOVA is to understand if there is

an interaction between the two independent variables on the

dependent variable. For example, you could use a two-way ANOVA to

understand whether there is an interaction between gender and

educational level on test anxiety amongst university students, where

gender (males/females) and education level

(undergraduate/postgraduate) are your independent variables, and

test anxiety is your dependent variable. Alternately, you may want to

determine whether there is an interaction between physical activity

level and gender on blood cholesterol concentration in children,

where physical activity (low/moderate/high) and gender

(male/female) are your independent variables, and cholesterol

concentration is your dependent variable. The interaction term in a

two-way ANOVA informs you whether the effect of one of your

independent variables on the dependent variable is the same for all

values of your other independent variable (and vice versa). For

example, is the effect of gender (male/female) on test anxiety
influenced by educational level (undergraduate/postgraduate)? [2]

Tukey test

 Once an ANOVA model results in the rejection of the null

hypothesis, the only conclusion we have is that not all means are

equal. However, we do not know which means are different.

Determine how many pairs of means there are -> for each pair of

means we have a pair of hypotheses -> repeat for all the pairs of the

means -> calculate the critical value.

Non parametric tests

Chi-square test

 It was developed by Karl Pearson. When data is measured in

terms of attributes or quality and is intended to test whether

the difference in the distribution of attributes in different

groups is due to sampling variation or not, Chi square test is

used. It is used to test the significance of difference between 2

proportions and can be used when there are more than 2

groups to be compared. e.g., If there are two groups, one who

have the habit of thumb sucking and the other who do not.

Occurrence of malocclusion
Group                                 Present   Absent   Total
Those who did not suck their thumb    10        40       50
Those who sucked their thumb          32        8        40
Total                                 42        48       90


Test the null hypothesis.

 To test whether there is an association between thumb sucking

and malocclusion the null hypothesis would be “there is no

association between thumb sucking and malocclusion.”

 Among those who did not suck their thumb Expected no with


 Expected no without malocclusion=40×0.53=21.2
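
The expected numbers and the chi-square value for this example can be worked through as below. One cell is an assumption: the count of thumb suckers with malocclusion (32) is inferred from the totals (42 - 10, or 40 - 8). Expected counts are computed from exact proportions, so they differ slightly from figures based on the rounded 0.53.

```python
# Observed counts: rows = did not suck thumb, sucked thumb;
# columns = malocclusion present, absent.
# The 32 is inferred from the totals (42 - 10), not read from the source.
observed = [[10, 40],
            [32, 8]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count = row total x column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi_square = sum((o - e) ** 2 / e
                 for obs_row, exp_row in zip(observed, expected)
                 for o, e in zip(obs_row, exp_row))
df = (len(observed) - 1) * (len(observed[0]) - 1)

CRITICAL_5_PERCENT_DF1 = 3.84      # tabulated chi-square value, df = 1, p = 0.05
print(round(chi_square, 2), df, chi_square > CRITICAL_5_PERCENT_DF1)
```

Under these assumptions chi-square is about 32.14 with 1 degree of freedom, well above 3.84, so the null hypothesis of no association would be rejected.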


 All the observations in the two samples are ranked numerically from
smallest to largest without regard to the groups. Then identify
the observations belonging to samples I and II. The sums of ranks for
samples I and II are determined separately. Take the difference of the two
sums, T = R1 - R2. Calculate the U value using the formula. If the p value is
less than or equal to 0.05, the null hypothesis, i.e., that the samples
have been drawn from the same population, is rejected. [4]
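
The ranking procedure above matches the Mann-Whitney U test, and the sketch below assumes that is the test intended. The formula for U is not given in the text; the standard one, U1 = n1·n2 + n1(n1 + 1)/2 - R1, is used here, tied observations share the average of their ranks, and the data are invented.

```python
def rank_sums_and_u(sample1, sample2):
    """Rank all observations together, then return (R1, R2, U)."""
    combined = sorted(sample1 + sample2)
    # Average rank for each distinct value, so tied observations share a rank
    rank_of = {}
    for value in set(combined):
        positions = [i + 1 for i, v in enumerate(combined) if v == value]
        rank_of[value] = sum(positions) / len(positions)

    r1 = sum(rank_of[v] for v in sample1)
    r2 = sum(rank_of[v] for v in sample2)
    n1, n2 = len(sample1), len(sample2)
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    return r1, r2, min(u1, u2)

# Hypothetical plaque scores for two small groups (illustration only)
r1, r2, u = rank_sums_and_u([1, 2, 3], [4, 5, 6])
print(r1, r2, u)
```

When the two samples do not overlap at all, as here, the smaller U is 0, the most extreme value possible.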


Meta-analysis is a quantitative approach for systematically combining the
results of previous research to arrive at conclusions about the body of research.
It collects data from individual studies. It identifies heterogeneity in effect
among multiple studies.


1. Define the research question and specific hypothesis.
2. Define the criteria for including and excluding studies.
3. Locate research studies.
4. Determine which studies are eligible for inclusion.
5. Classify and code important study characteristics (e.g., sample size, length of follow up etc.).
6. Select or translate results from each study using a common metric.
7. Aggregate findings across studies, generating weighted pooled estimates of effect size.
8. Evaluate the statistical homogeneity of pooled studies. [5]
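
The aggregation and homogeneity steps can be illustrated with a small sketch. The text does not say how studies are weighted; inverse-variance weighting (a fixed-effect pooled estimate) and Cochran's Q as the homogeneity statistic are assumed here, and the effect sizes and variances are invented.

```python
def pooled_estimate(effects, variances):
    """Inverse-variance weighted pooled effect and Cochran's Q."""
    weights = [1 / v for v in variances]            # precise studies weigh more
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    return pooled, q

# Hypothetical effect sizes and variances from three studies (illustration only)
effects = [0.2, 0.5, 0.3]
variances = [0.04, 0.01, 0.02]
pooled, q = pooled_estimate(effects, variances)
print(round(pooled, 3), round(q, 3), len(effects) - 1)
```

Q is compared with the chi-square distribution on (number of studies - 1) degrees of freedom; a large Q suggests the studies are heterogeneous.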

SPSS -Stands for Statistical Package for the social sciences. It is a

software package used for interactive statistical analysis. Computerized

new way of analysis.


1. Essentials of Public Health Dentistry- Soben Peter, 6th edition.
2. Kim, H.-Y., 2014. Analysis of variance (ANOVA) comparing means of more than two groups. Restor. Dent. Endod. 39, 74–77.
3. Abdi, H., Williams, L.J., 2010. Honestly significant difference (HSD) test. In: Salkind, N.J., Dougherty, D.M., Frey, B. (Eds.), Encyclopedia of Research Design. Sage, Thousand Oaks, CA, USA, pp. 583–585.
4. Mary L McHugh. The Chi-square test of independence. Biochemia Medica, 23(2):143–149, 2013.
5. Russo MW. How to review a meta-analysis. (N Y). 2007;3:637–642.
