BASIC STATISTICS (3685)
FAISAL SHAHZAD
PhD Scholar (China), [Link]. Statistics (Norway), [Link]. Statistics (BZU, Pak)
Certificate in Public Health (Sweden), Certificate in Epidemiology (Finland),
Certified Takaful Professional (Pak)
faisalisbest@[Link]
Unit – 1
Introduction
Introduction, Need to study statistics, Nature of Variability, Variance, Covariance and
Correlation
Introduction
Statistics is a field of study that deals with:
1- collecting, summarizing, analyzing and interpreting data;
2- drawing inferences about a body of data (the population)
when only a part of the data (a sample) is observed.
Statisticians also interpret the results and communicate them to
others.
Basic Statistics
It is the science which deals with development and application of the
most appropriate methods for the:
Collection of data.
Presentation of the collected data.
Analysis and interpretation of the results.
Making decisions on the basis of such analysis
Role of statisticians
To guide the design of an experiment or survey prior to data
collection
To analyze data using proper statistical procedures and
techniques
To present and interpret the results to researchers and
other decision makers
Organization of Data
• Data are the raw material of statistics.
• Data are categorized into two forms: primary and secondary
data.
• We may define data as figures. Figures result from the process of
counting or from taking a measurement.
For example: when a hospital administrator counts the number of
patients, it is counting, whereas when a nurse weighs a patient, it is
measurement.
Types of data
Constant
Variables
Sources of data
Records: routinely kept records, e.g. by hospitals
Published reports: published public reports
Surveys: a set of questions (comprehensive or sample)
Experiments: a particular question / interview
Sources of Data
• Routinely kept records: e.g. patient information can be obtained from hospitals
• Published reports: commercially available data banks, literature reviews
• Surveys: a set of specific questions.
For example: If the administrator of a clinic wishes to obtain information
regarding the mode of transportation used by patients to visit the clinic, then a
survey may be conducted among patients to obtain this information
• Experiments / Interviews: Frequently the data needed to answer a question
are available only as the result of an experiment.
For example: If a nurse wishes to know which of several strategies is best for
maximizing patient compliance, she might conduct an experiment in which the
different strategies of motivating compliance are tried with different patients.
A variable
It is a characteristic that takes on different values in different
persons, places, or things.
For example:
- heart rate,
- the heights of adult males,
- the weights of preschool children,
- the ages of patients seen in a dental clinic.
Variable, Observation, Values
Observations are the units upon which measurements are made (the rows of a data table).
Variables are the characteristics being measured (the columns).
Values are the realized measurements (the table cells).
Types of variables
Quantitative variables: continuous or discrete
Qualitative variables: nominal or ordinal
Types of variable
Quantitative Variables
A quantitative variable can be measured in the usual sense.
For example:
- the heights of adult males,
- the weights of preschool children,
- the ages of patients seen in a dental clinic.
Qualitative Variables
Many characteristics are not capable of being measured. Some of them
can be ordered or ranked.
For example:
- classification of people into socio-economic groups,
- social classes based on income, education, etc.
Quantitative variables types
A discrete variable
is characterized by gaps or interruptions in the values that it can assume.
For example:
- the number of daily admissions to a general hospital,
- the number of items produced in a factory daily.
A continuous variable
can assume any value within a specified relevant interval of values
assumed by the variable.
For example:
- height,
- weight,
- skull circumference.
No matter how close together the observed heights of two people, we
can find another person whose height falls somewhere in between.
Application of Statistics in Research
In order to provide a more concrete picture of what the course will
cover, let us take a brief look at a relatively recent article in the
American Journal of Public Health. The article shows why statistics
is needed: doing research requires understanding the basic concepts
and many of the basic tools of statistics.
Article for Review
The Strategy
The basic strategy is to first focus on the types of data, then
examine tables, graphs and frequency distributions,
types of measurement scales, etc. This is critical
background for understanding samples and populations,
estimates of population parameters and the fundamental
elements of hypothesis testing. We will then move to
hypothesis tests, e.g. z-tests and t-tests, and many of the other
statistical tools.
Type of statistical analysis
Descriptive
Techniques for collection, organization, summarization and
presentation of data,
e.g. mean, median, mode, standard deviation or variance, frequency
tables and/or graphical representation of data, etc.
Inferential
Techniques for making generalizations about characteristics
of a population based on a sample,
e.g. hypothesis testing, correlations, Z, chi-square or t-tests,
ANOVA, etc.
Parametric: use interval or ratio data
Non-parametric: use nominal or ordinal data
Population and Sample
Population:
A set which includes all measurements of interest to the researcher.
Also called the collection of all responses or measurements.
Sample:
A subset of the population.
Descriptive Statistics
Mean
• A measure of central tendency is a single value that attempts to
describe a set of data by identifying the central position within
that set of data. As such, measures of central tendency are
sometimes called measures of central location.
• The mean, median and mode are all valid measures of central
tendency, but under different conditions.
• Note: the mean can be badly affected by outliers, while the median is not.
The Population Mean:
$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$, which is usually unknown, so we use the
sample mean to estimate it.
The Sample Mean:
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Example:
Here is a random sample of size 10 of ages, where
$x_1 = 42$, $x_2 = 28$, $x_3 = 28$, $x_4 = 61$, $x_5 = 31$,
$x_6 = 23$, $x_7 = 50$, $x_8 = 34$, $x_9 = 32$, $x_{10} = 37$.
$\bar{x}$ = (42 + 28 + … + 37) / 10 = 36.6
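The sample mean of the ages above can be checked with a few lines of Python. This is only a minimal sketch; the names `ages` and `sample_mean` are illustrative and do not come from the slides.

```python
# Ages from the example above (random sample of size 10).
ages = [42, 28, 28, 61, 31, 23, 50, 34, 32, 37]


def sample_mean(values):
    """Return the arithmetic mean: the sum of the values divided by their count."""
    return sum(values) / len(values)


mean_age = sample_mean(ages)
print(mean_age)  # 36.6
```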
Median
• The median is the middle score for a set of data that has been
arranged in order of magnitude. The median is less affected by
outliers and skewed data.
• If the number of observations is odd, choose the middle one. If it
is even, choose the two middle observations and average them.
• Example: suppose we have the following data. Arrange them in order
first, then choose the middle observation (odd number of observations)
or average the middle two (even number of observations).
Raw data: 65 55 89 56 35 14 56 55 87 45 92
Ordered data: 14 35 45 55 55 56 56 65 87 89 92
With 11 observations, the median is the 6th ordered value, 56.
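The odd/even rule for the median can be sketched as follows, assuming plain Python lists. The function name `median` and the variable `scores` are illustrative only; dropping the last score is just a convenient way to show the even case.

```python
# The 11 scores from the example above, in their original (unsorted) order.
scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]


def median(values):
    """Sort the data, then take the middle value (odd count)
    or the average of the two middle values (even count)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median(scores))       # 56   (11 observations: the 6th ordered value)
print(median(scores[:-1]))  # 55.5 (10 observations: average of 55 and 56)
```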
Mode
• The mode is the most frequent score in our data set. On a
bar chart or histogram it corresponds to the highest bar.
You can, therefore, sometimes consider the mode as
being the most popular option.
• Example: in the previous example, 55 and 56 are both modes, since
each occurs twice. A data set can have 2 or 3 modes, or no mode at
all. You only need to find the most repeated value(s) in the data set.
65 55 89 56 35 14 56 55 87 45 92
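Finding the mode amounts to counting how often each value occurs. The sketch below is one possible way to do it; the helper name `modes` is hypothetical, and treating a data set with no repeated value as having no mode is a convention assumed here.

```python
from collections import Counter

scores = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]


def modes(values):
    """Return every value that shares the highest frequency.
    Assumed convention: if no value repeats, there is no mode."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(value for value, count in counts.items() if count == top)


print(modes(scores))  # [55, 56]
```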
Comparison of Mean, Median and Mode
Measures of Dispersion: Variance
• It measures dispersion as the scatter of the values about their mean.
a) Sample Variance ($S^2$):
• $S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$, where $\bar{x}$ is the sample mean
b) Population Variance ($\sigma^2$):
• $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$, where $\mu$ is the population mean
Measures of Dispersion: Variance
• Example:
• Data: 43, 66, 61, 64, 65, 38, 59, 57, 57, 50; $\bar{x}$ = 56
• Solution:
• $S^2$ = [(43-56)^2 + (66-56)^2 + … + (50-56)^2] / (10 - 1)
= 810/9 = 90
• For the population variance the divisor is 10, so the answer is
810/10 = 81, slightly smaller than the sample variance because of
the larger divisor.
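The two variance formulas differ only in the divisor, which a short script makes easy to check by hand. This is a sketch under the assumption that the data are held in a plain list; the function names are illustrative.

```python
# Data from the variance example above.
data = [43, 66, 61, 64, 65, 38, 59, 57, 57, 50]


def sum_of_squared_deviations(values):
    """Sum of (x_i - mean)^2 over all values."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)


def sample_variance(values):
    """Divide the squared deviations by n - 1."""
    return sum_of_squared_deviations(values) / (len(values) - 1)


def population_variance(values):
    """Divide the squared deviations by N."""
    return sum_of_squared_deviations(values) / len(values)


print(sum_of_squared_deviations(data))
print(sample_variance(data))
print(population_variance(data))
```

Python's standard `statistics` module offers `variance` and `pvariance` for the same two quantities.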
Measures of Dispersion: Standard Deviation
The Standard Deviation:
It shows the variation in data. If the data are close together,
the standard deviation will be small. If the data are spread
out, the standard deviation will be large.
The S.D. is the square root of the variance:
a) Sample Standard Deviation: $S = \sqrt{S^2}$
b) Population Standard Deviation: $\sigma = \sqrt{\sigma^2}$
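As a small illustration of the square-root relationship, the sketch below takes the variances from Python's standard `statistics` module and applies `math.sqrt`. Reusing the ten ages from the mean example is an arbitrary choice made here for illustration.

```python
import math
import statistics

# The ten ages from the sample-mean example (reused here for illustration).
ages = [42, 28, 28, 61, 31, 23, 50, 34, 32, 37]

# Sample S.D.: square root of the sample variance (divisor n - 1).
sample_sd = math.sqrt(statistics.variance(ages))

# Population S.D.: square root of the population variance (divisor N).
population_sd = math.sqrt(statistics.pvariance(ages))

print(sample_sd, population_sd)
```

The same values are available directly as `statistics.stdev(ages)` and `statistics.pstdev(ages)`.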
Covariance and Correlation
• These topics are covered in Unit 4.
Graphical presentation
Graphs drawn using Cartesian coordinates
• Line graph
• Frequency polygon
• Frequency curve
• Histogram
• Bar graph
• Scatter plot
Pie chart
Statistical maps
Line Graph
Year    MMR
1960    50
1970    45
1980    26
1990    15
2000    12
Figure (1): Maternal mortality rate of (country), 1960-2000 (x-axis: Year; y-axis: MMR/1000)
Frequency polygon
Age (years)   Males       Females     Mid-point of interval
20-29         3 (12%)     2 (10%)     (20+30) / 2 = 25
30-39         9 (36%)     6 (30%)     (30+40) / 2 = 35
40-49         7 (28%)     5 (25%)     (40+50) / 2 = 45
50-59         4 (16%)     3 (15%)     (50+60) / 2 = 55
60-69         2 (8%)      4 (20%)     (60+70) / 2 = 65
Total         25 (100%)   20 (100%)
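The percentages and mid-points plotted in a frequency polygon can be recomputed from the raw counts. The sketch below assumes the class limits and counts from the table above; the helper names are illustrative.

```python
# Class limits and counts from the table above.
age_groups = [(20, 29), (30, 39), (40, 49), (50, 59), (60, 69)]
males = [3, 9, 7, 4, 2]
females = [2, 6, 5, 3, 4]


def percentages(counts):
    """Express each class count as a percentage of the column total."""
    total = sum(counts)
    return [round(100 * count / total, 1) for count in counts]


def midpoints(groups):
    """Mid-point as used in the table: (lower limit + next lower limit) / 2,
    e.g. (20 + 30) / 2 = 25 for the class 20-29."""
    return [(low + high + 1) / 2 for low, high in groups]


print(midpoints(age_groups))
print(percentages(males))
print(percentages(females))
```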
Figure (2): Distribution of 45 patients at (place) , in (time) by age and sex (frequency polygon; x-axis: age, plotted at the interval mid-points 25 to 65; y-axis: %; lines: Males, Females)
Frequency curve
Figure: Distribution of a group of cholera patients by age (x-axis: age in years, classes 20- to 60-69; y-axis: frequency; curves: Female, Male)
Histogram
Age (years)   Frequency   %
25-           3           14.3
30-           5           23.8
40-           7           33.3
45-           4           19.0
60-65         2           9.5
Total         21          100
Figure (2): Distribution of 100 cholera patients at (place) , in (time) by age (x-axis: Age (years); y-axis: %)
Bar chart
Figure: bar chart of % (y-axis) by marital status (x-axis: Single, Married, Divorced, Widowed)
Bar chart
Figure: grouped bar chart of % (y-axis) by marital status (x-axis: Single, Married, Divorced, Widowed), with separate bars for Male and Female
Pie chart
Figure: pie chart with slices Translocation 79%, Inversion 18%, Deletion 3%
Doughnut chart
Figure: doughnut chart comparing Hospital A and Hospital B by condition (DM, IHD, Renal)
Unit 2
Basic Statistical Methods
Normal Distribution
• In this section, we will mainly focus on Normal distribution,
Standard Normal distribution, skewness and kurtosis, Hypothesis
testing, type-I and type-II error etc.
Normal Distribution
It is one of the most important probability distributions in statistics.
It is the limiting form of the binomial distribution as 'n' (the
number of trials) increases to a very large number for a fixed value of p.
The normal density is given by:
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, $-\infty < x < \infty$, $-\infty < \mu < \infty$, $\sigma > 0$
π, e: constants
µ: population mean
σ: population standard deviation
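The density formula can be evaluated directly. The sketch below is a plain transcription of the formula into Python; the function name `normal_pdf` and the sample arguments are illustrative assumptions.

```python
import math


def normal_pdf(x, mu, sigma):
    """Normal density f(x) with mean mu and standard deviation sigma > 0."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)


# Height of the curve at the mean for mu = 0, sigma = 1.
peak = normal_pdf(0.0, 0.0, 1.0)
print(round(peak, 4))  # 0.3989

# The curve is symmetric about the mean: equal heights at mu - d and mu + d.
print(normal_pdf(21.0, 25.0, 4.0) == normal_pdf(29.0, 25.0, 4.0))  # True
```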
Characteristics of Normal Distribution
• The normal distribution owes much of its importance to the central
limit theorem. In its most general form, under some conditions (which
include finite variance), the theorem states that averages of samples of
observations of random variables independently drawn from independent
distributions converge in distribution to the normal, that is, become
normally distributed when the number of observations is
sufficiently large.
• It is bell shaped
• The mean, median and mode are equal
• It is unimodal (i.e. it has only one mode)
• The curve is symmetrical about the mean, which is equivalent to
saying that its shape is the same on both sides of a vertical line
passing through the center.
Characteristics of Normal Distribution
• The curve is continuous. i.e. there are no gaps or holes. For each value
of x, there is a corresponding value of y.
• The curve never touches the x-axis. Theoretically, no matter how far in
either direction the curve extends, it never meets the x-axis but gets
increasingly closer.
• Total area under the normal distribution curve is equal to 1.00 or 100%.
• The area under the normal curve that lies within one standard deviation
of the mean is approx. 68%, within two standard deviations approx. 95%, and
within three standard deviations approx. 99.7%.
• The normal distribution is completely determined by the parameters µ
and σ.
• All odd-order moments about the mean are 0.
The normal distribution
depends on the two
parameters µ and σ.
µ determines the location of
the curve, while σ determines the scale of
the curve, i.e. the degree of
flatness or peakedness of
the curve. Note that:
1. P(µ - 1σ < x < µ + 1σ) ≈ 68%
2. P(µ - 2σ < x < µ + 2σ) ≈ 95%
3. P(µ - 3σ < x < µ + 3σ) ≈ 99.7%
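These three probabilities can be verified numerically. The sketch below relies on the standard identity P(|Z| < k) = erf(k / √2) for a standard normal Z, which is not stated on the slides; the function name `prob_within` is illustrative.

```python
import math


def prob_within(k):
    """P(mu - k*sigma < X < mu + k*sigma) for any normal X.
    The result does not depend on mu or sigma."""
    return math.erf(k / math.sqrt(2.0))


for k in (1, 2, 3):
    print(k, round(100 * prob_within(k), 2))
# 1 68.27
# 2 95.45
# 3 99.73
```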
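Application of Normal Dist. Through
68%-95%-99.7% rule
• A researcher measured the percent of body fat of 2000 women;
the resulting distribution has a mean of 25% fat and an S.D. of 4.0.
Therefore, the scores would be distributed approximately as follows:
about 68% of the women (roughly 1360) between 21% and 29% fat,
about 95% (roughly 1900) between 17% and 33%, and
about 99.7% (roughly 1994) between 13% and 37%.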
The Standard Normal distribution:
• Is a special case of the normal distribution with a mean of 0 and a
standard deviation of 1.
• The equation for the standard normal distribution is written as
$f(z) = \frac{1}{\sqrt{2\pi}} e^{-\frac{z^2}{2}}$, $-\infty < z < \infty$
It has the following characteristics:
1- It is symmetrical about 0.
2- The total area under the curve above the x-axis is 1
3- We can use a separate table to find the probabilities and areas.
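In place of a printed table, the areas can also be computed. The sketch below assumes the usual standardization z = (x - µ) / σ and the identity Φ(z) = ½(1 + erf(z / √2)), neither of which is stated on the slides; the function names are illustrative, and the body-fat figures (mean 25, S.D. 4) are reused from the earlier application.

```python
import math


def z_score(x, mu, sigma):
    """Convert a raw value to a standard normal score."""
    return (x - mu) / sigma


def standard_normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Proportion of women with body fat below 29% when mu = 25 and sigma = 4.
z = z_score(29.0, 25.0, 4.0)
print(z)                                 # 1.0
print(round(standard_normal_cdf(z), 4))  # 0.8413
```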
Tests for Skewness and Kurtosis
Skewness is a measure of symmetry, or the lack of
symmetry. A distribution, or data set is symmetric if it
looks the same to the left and right of the center point.
Tests for Skewness and Kurtosis
Kurtosis is a measure of whether the data are heavy-tailed
or light-tailed relative to a normal distribution.
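Both measures can be computed from central moments. The sketch below uses the simple moment-based formulas (skewness = m3 / m2^1.5, excess kurtosis = m4 / m2^2 - 3), which are an assumption here: statistical packages often apply small-sample adjustments, so their output can differ slightly. The data sets are made up for illustration.

```python
def central_moment(values, order):
    """Average of (x - mean)^order over all values."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** order for x in values) / len(values)


def skewness(values):
    """Moment-based skewness: 0 for symmetric data, > 0 for a long right tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


symmetric = [1, 2, 3, 4, 5]      # illustrative data
right_skewed = [1, 2, 3, 4, 10]  # illustrative data with one large value

print(skewness(symmetric))
print(round(skewness(right_skewed), 4))         # 1.1384
print(round(excess_kurtosis(right_skewed), 3))  # -0.212
```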
Rules for Skewness and Kurtosis
Rules for Skewness
▫ Skewness > 0: positively (right) skewed, i.e. Mode < Median < Mean
▫ Skewness < 0: negatively (left) skewed, i.e. Mean < Median < Mode
▫ Skewness = 0: symmetric (acceptable), i.e. Mean = Median = Mode
If the absolute value of the skewness statistic is no more than 3 times the
standard error of skewness, the skewness is acceptable; otherwise it is not.
In the example case, 3 × 0.150 = 0.450, which is far below |-1.493| = 1.493,
so the skewness is not acceptable. The same check applies to the other variables.
Rules for Kurtosis
▫ Different books use different scales: some say kurtosis between -3 and 3 is
acceptable (i.e. close to a normal curve), some say between -2 and +2, and some
treat any value > 0 or < 0 as a sign of kurtosis.
▫ If the absolute value of the kurtosis statistic is no more than 3 times the
standard error of kurtosis, the kurtosis is acceptable; otherwise it is not.
In the example case, 3 × 0.299 = 0.897, which is far below 3.934, so the
kurtosis is not acceptable. The same check applies to the other variables.
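The 3-times-standard-error rule can be written as a one-line check. The sketch below assumes one reading of the rule, namely that a statistic is acceptable when its absolute value does not exceed 3 times its standard error; the function name and the `multiplier` parameter are illustrative.

```python
def within_three_se(statistic, standard_error, multiplier=3.0):
    """Assumed reading of the rule: acceptable when
    |statistic| <= multiplier * standard error."""
    return abs(statistic) <= multiplier * standard_error


# Skewness example: statistic -1.493, standard error 0.150.
print(within_three_se(-1.493, 0.150))  # False

# Kurtosis example: statistic 3.934, standard error 0.299.
print(within_three_se(3.934, 0.299))   # False
```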
Thank you