
Statistics and

Biostatistics
Mrs. Khushbu K. Patel
Assistant professor
Shri Sarvajanik Pharmacy College
What is Statistics?
 Different authors have defined statistics differently. The best definition of statistics is given
by Croxton and Cowden according to whom statistics may be defined as the science, which
deals with collection, presentation, analysis and interpretation of numerical data.

 The science and art of dealing with variation in data through collection, classification, and
analysis in such a way as to obtain reliable results. —(John M. Last, A Dictionary of
Epidemiology )

 Branch of mathematics that deals with the collection, organization, and analysis of numerical
data and with such problems as experiment design and decision making. —(Microsoft
Encarta Premium 2009)
• A branch of mathematics taking and transforming numbers into useful information for decision makers.

• Methods for processing & analyzing numbers

• Methods for helping reduce the uncertainty inherent in decision making
What is biostatistics?
 It is the science which deals with development and application of
the most appropriate methods for the:
Collection of data.
Presentation of the collected data.
Analysis and interpretation of the results.
Making decisions on the basis of such analysis
 The methods used in dealing with statistics in the fields of medicine,
biology and public health.
Why study statistics?
 Decision Makers Use Statistics To:
 Present and describe data and information properly

• Draw conclusions about large groups of individuals or items, using information collected from subsets of those individuals or items.

 Improve processes.
Statistics

• Descriptive Statistics: methods for processing, summarizing, presenting and describing data
• Experimental Statistics: techniques for planning and conducting experiments
• Inferential Statistics: drawing conclusions and/or making decisions concerning a population based only on sample data
DATA
 Definition:-
 A set of values recorded on one or more observational units. Data are
raw materials of statistics.
 Data set : A collection of data is data set
 Data point : A single observation
• Raw data : Information before it is arranged and analysed

 Sources of data:-
 Experiments
 Surveys

 Records
• Example of raw data (blood pressure):

Systolic BP    Diastolic BP
120            80
135            90
125            85
140            95
138            86
Elements, Variables, and Observations
 The elements are the entities on which data are collected.

 A variable is a characteristic of interest for the elements.

• The set of measurements collected for a particular element is called an observation.

 The total number of data values in a data set is the number of elements
multiplied by the number of variables.
Data, Data Sets, Elements, Variables, and Observations

[Figure: example data set laid out as a table. Each row is an element (a company name); the columns are the variables Stock Exchange, Annual Sales ($M) and Earn/Share ($).]
Descriptive statistics

◦ Summarizing and describing the data

◦ Uses numerical and graphical summaries to characterize sample data
Descriptive Statistics

• Collect data
– e.g., Survey
• Present data
– e.g., Tables and graphs
• Characterize data
– e.g., Sample mean: $\bar{X} = \frac{\sum X_i}{n}$
Inferential Statistics
• Estimation
– e.g., Estimate the population
mean weight using the sample
mean weight
• Hypothesis testing
– e.g., Test the claim that the
population mean weight is 120
pounds

Drawing conclusions about a large group of individuals based on a subset of the large group.
Inferential statistics

It refers to the process of selecting and using a sample to draw inferences about the population from which the sample is drawn.

 Two forms of statistical inference


◦ Hypothesis testing
◦ Estimation
Basic Vocabulary of Statistics
• POPULATION : A population consists of all the items or individuals about which you want to draw a conclusion. Ex: People who live within a 25 km radius of the centre of the city.

• SAMPLE : A sample is the portion of a population selected for analysis. It has to be representative.

• PARAMETER : A parameter is a numerical measure that describes a characteristic of a population.

• STATISTIC : A statistic is a numerical measure that describes a characteristic of a sample.
Population vs. Sample

• Population: measures used to describe the population are called parameters
• Sample: measures computed from sample data are called statistics
Types of data

• Quantitative data (numerical)
  – Continuous: would take forever to count. Ex: time
  – Discrete: countable in a finite amount of time. Ex: counting the change in your pocket
• Qualitative data (categorical)
  – Nominal
  – Ordinal
Type of variables
 Categorical (qualitative) variables have values that can only be placed
into categories, such as “yes” and “no.”

 Numerical (quantitative) variables have values that represent quantities.


Qualitative Data
 Non Numerical
 Categorical
• No numbers are used to describe it
 Word, picture, image
 Ex. Do you smoke? Yes No
Quantitative Data

Numerical
• Non metric: Binary, Nominal, Ordinal
• Metric: Discrete, Continuous


REASONS FOR ASSIGNING NUMBERS
Numbers are usually assigned for two reasons:

• Numbers permit statistical analysis of the resulting data

• Numbers facilitate the communication of measurement rules and results
TYPES OF MEASUREMENT SCALES
Non Metric Scales
 Nominal: (Description)
 Ordinal: (Order)
Metric Scales
• Interval: (Distance)
• Ratio: (Origin)

[Figure: the four scales stacked from top to bottom: Ratio, Interval, Ordinal, Nominal]
Nominal
Notes
• Lowest level of measurement
• Discrete categories
• No natural order
• Categorical or dichotomous
• May be referred to as qualitative or categorical

Examples
• Gender: 0 = Male, 1 = Female
• Group membership: 1 = Experimental, 2 = Placebo, 3 = Routine
• Marital status, colour, religion, type of car etc.
Nominal
Nominal sounds like name

Notes
• Lowest level
• Classification of data
• Order is arbitrary
• Examples: gender, marital status, religion, types of car driven

Possible Measures
• Mode
• Modal percentage
• Range
• Frequency distribution
Ordinal
Notes
• Ordered categories
• Relative rankings
• Unknown distance between rankings
• Zero arbitrary

Examples
• Likert scales
• Socioeconomic status
• Size
• Ranking of favorite sports, class rankings, wellness rankings
Ordinal
The values in an ordinal scale simply express an order

Examples: customer satisfaction, movie ratings

Customer satisfaction: Are you
• Very satisfied
• Satisfied
• Neither satisfied nor dissatisfied
• Dissatisfied
• Very dissatisfied
Ordinal
Notes
• Order matters, but not the difference between values
• Unknown distance between rankings
• Relative rankings
• Examples: Likert scales, socioeconomic status, pain intensity, non numeric concepts

Possible Measures
• All nominal level tests
• Median
• Percentile
• Semi quartile range
• Rank order coefficients of correlation
Interval
Notes
• Ordered categories
• Equal distance between values
• An accepted unit of measurement
• Zero is arbitrary
Interval
Notes
• Ordered categories
• Equal distance
• Can measure differences
• Zero is arbitrary
• Examples: temperature (Celsius or Fahrenheit), elevation, time

Possible Measures
• All ordinal tests
• Mean
• Standard deviation
• Addition and subtraction
• Cannot multiply or divide
Ratio
Notes
• Most precise
• Ordered
• Exact value
• Equal intervals
• Natural zero
• When the variable equals zero it means there is none of that variable
• Not arbitrary zero

Examples
• Weight
• Height
• Pulse
• Blood pressure
• Time
• Degrees Kelvin
Ratio
Notes
• Precise, ordered, exact
• Equal intervals
• Natural zero
• Examples: weight, time, degrees Kelvin

Possible Measures
• All operations are possible
• Descriptive and inferential statistics
• Can make comparisons: an 8 kg baby is twice as heavy as a 4 kg baby
• Can add, subtract, multiply, divide
CHARACTERISTICS OF LEVEL OF MEASUREMENT
Characteristic       Nominal   Ordinal   Interval   Ratio
Labeled              Yes       Yes       Yes        Yes
Ordered              No        Yes       Yes        Yes
Known difference     No        No        Yes        Yes
Zero is arbitrary    N/A       Yes       Yes        No
Zero means none      N/A       No        No         Yes
LEVEL OF MEASUREMENT DECISION TREE

Ordered?
• No: Nominal
• Yes: Equally spaced?
  – No: Ordinal
  – Yes: Zero means none?
    ◦ No: Interval
    ◦ Yes: Ratio
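The decision tree can be written as a small function. This is only a sketch: the function name `level_of_measurement` and its three yes/no parameters are illustrative choices, not part of the slides.

```python
def level_of_measurement(ordered: bool,
                         equally_spaced: bool = False,
                         zero_means_none: bool = False) -> str:
    """Answer the three questions of the decision tree in order."""
    if not ordered:
        return "Nominal"
    if not equally_spaced:
        return "Ordinal"
    if not zero_means_none:
        return "Interval"
    return "Ratio"


# Examples taken from the scales described in this deck
print(level_of_measurement(ordered=False))                      # marital status
print(level_of_measurement(ordered=True))                       # pain scale
print(level_of_measurement(True, equally_spaced=True))          # temperature in Celsius
print(level_of_measurement(True, True, zero_means_none=True))   # weight
```

Each question is asked only if the previous answer was "yes", which mirrors how the tree is read from left to right.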
Scale: Nominal
• Number system: Unique definition of numbers (0, 1, 2, ..., 9)
• Example: Roll numbers of students, numbers assigned to basketball players
• Permissible statistics: Percentages, Mode, Binomial test, Chi-Square test

Scale: Ordinal
• Number system: Order numbers (0 < 1 < 2 < ... < 9)
• Example: Student's rank
• Permissible statistics: Percentiles, Median, Rank-order correlation, Two-way ANOVA

Scale: Interval
• Number system: Equality of differences (2 - 1 = 7 - 6)
• Example: Temperature
• Permissible statistics: Range, Mean, Standard deviation, Product moment correlation, t-test and F-test

Scale: Ratio
• Number system: Equality of ratios (5/10 = 3/6)
• Example: Weight, height, distance
• Permissible statistics: Geometric mean, Harmonic mean, Coefficient of variation
SOME STATISTICAL TESTS
Measure                  Nominal   Ordinal   Interval   Ratio
Mode                     Yes       Yes       Yes        Yes
Median                   No        Yes       Yes        Yes
Mean                     No        No        Yes        Yes
Frequency distribution   Yes       Yes       Yes        Yes
Range                    No        Yes       Yes        Yes
Add and subtract         No        No        Yes        Yes
Multiply and divide      No        No        No         Yes
Standard deviation       No        No        Yes        Yes
NOIR
Nominal
• Remember: Named classifications; mutually exclusive categories
• Example: Gender
• Central tendency: Mode
• Notes: No order; limited in descriptive ability

Ordinal
• Remember: Ordered or relative rankings; numbers are not equidistant; zero is arbitrary
• Example: Pain scale
• Central tendency: Mode, median
• Notes: Not necessarily equal intervals

Interval
• Remember: Rank ordering; approximately equal intervals; can have negative numbers
• Example: Exam marks
• Central tendency: Mode, median, mean
• Notes: Exact difference between numbers is known; zero is arbitrary

Ratio
• Remember: Rank ordering; equal intervals; absolute zero
• Example: Length, weight
• Central tendency: Mode, median, mean
• Notes: Zero means none
Methods of presentation of data
1 Tabular presentation
2 Graphical presentation
Purpose: To display data so that they can be readily understood.

Principle: Tables and graphs should contain enough information to be self-sufficient without reliance on material within the text of the document of which they are a part.

•Tables and graphs share some common features, but for any specific situation,
one is likely to be more suitable than the other.
Tabular Presentation
 Types of tables:-
1.list table:- for qualitative data, count the number of observations
( frequencies) in each category.
A table consisting of two columns, the first giving an identification of the
observational unit and the second giving the value of variable for that unit.
Example: number of patients in each hospital department

Department       Number of patients
Medicine         100
Surgery          88
ENT              54
Ophthalmology    30
Tabular Presentation
2. Frequency distribution table:- for qualitative and quantitative
data
Simple frequency distribution table:-
Tabular Presentation

 complex frequency distribution table

Smoking       Lung cancer positive   Lung cancer negative   Total
              No.    %               No.    %               No.    %
Smoker        15     65.2            8      34.8            23     100
Non smoker    5      13.5            32     86.5            37     100
Total         20     33.3            40     66.7            60     100
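The percentages in this table are row percentages: each count divided by its row total. A minimal sketch of that arithmetic follows; the dictionary layout and the helper name `row_percentages` are assumptions made for illustration.

```python
# Counts copied from the smoking / lung cancer table
counts = {
    "Smoker": {"positive": 15, "negative": 8},
    "Non smoker": {"positive": 5, "negative": 32},
}


def row_percentages(row):
    """Express each count as a percentage of its row total, to one decimal."""
    total = sum(row.values())
    return {outcome: round(100 * n / total, 1) for outcome, n in row.items()}


# Column totals give the "Total" row of the table
totals = {"positive": 0, "negative": 0}
for row in counts.values():
    for outcome, n in row.items():
        totals[outcome] += n

for label, row in {**counts, "Total": totals}.items():
    print(label, sum(row.values()), row_percentages(row))
```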
Graphical presentation

For quantitative, continuous or measured data
• Histogram
• Frequency polygon
• Frequency curve
• Line chart
• Scattered or dot diagram

For qualitative, discrete or counted data
• Bar diagram
• Pie or sector diagram
• Spot map
Bar diagram

• It represents the measured value (or %) by separated rectangles of constant width, with lengths proportional to the frequency

• Use:- discrete qualitative data

• Types:- simple, multiple, component

[Figure: horizontal bar chart titled "Conditions for Which Patients were referred for treatment". Vertical axis: Condition (Back and Neck, Arthritis, Anxiety, Skin, Digestive, Headache, Gynecologic, Respiratory, Circulatory, General, Blood, Endocrine). Horizontal axis: Number of Patients, 0 to 120.]
Bar diagram

Multiple bar chart:- Each observation has more than one value, represented by a group of bars.

Component bar chart:- Subdivision of a single bar to indicate the composition of the total, divided into sections according to their relative proportion.
Pie diagram
Consist of a circle whose area
represents the total frequency
(100%) which is divided into
segments.
Each segment represents a
proportional composition of the
total frequency
Histogram

It is very similar to the bar chart, with the difference that the rectangles or bars are adjacent (without gaps).
It is used for presenting continuous quantitative data.
Each bar represents a class; its height represents the frequency (number of cases) and its width represents the class interval.
Frequency polygon

Derived from a histogram by connecting the mid points of the tops of the rectangles in the histogram.
The line connecting the centers of the histogram rectangles is called the frequency polygon.
We can draw the polygon without the rectangles, which gives a simpler form of line graph.
Scattered diagram
It is useful to represent the
relationship between two
numeric measurements.
Each observation being
represented by a point
corresponding to its value on
each axis
Organizing Numerical Data: Frequency
Distribution
• The frequency distribution is a summary table in which the data are arranged into numerically ordered classes.

 You must give attention to selecting the appropriate number of class groupings
for the table, determining a suitable width of a class grouping, and establishing
the boundaries of each class grouping to avoid overlapping.

 The number of classes depends on the number of values in the data. With a
larger number of values, typically there are more classes. In general, a
frequency distribution should have at least 5 but no more than 15 classes.

 To determine the width of a class interval, you divide the range (Highest
value–Lowest value) of the data by the number of class groupings desired.
 Example: A manufacturer of insulation randomly selects 20 winter days and records
the daily high temperature

 24, 35, 17, 21, 24, 37, 26, 46, 58, 30, 32, 13, 12, 38, 41, 43, 44, 27, 53, 27
 Sort raw data in ascending order:12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32,
35, 37, 38, 41, 43, 44, 46, 53, 58
 Find range: 58 -12 = 46
 Select number of classes: 5 (usually between 5 and 15)
 Compute class interval (width): 10 (46/5 then round up)
 Determine class boundaries (limits):
 Class 1: 10 to less than 20
 Class 2: 20 to less than 30
 Class 3: 30 to less than 40
 Class 4: 40 to less than 50
 Class 5: 50 to less than 60
 Compute class midpoints: 15, 25, 35, 45, 55
 Count observations & assign to classes
 Data in ordered array:
 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Tabulating Numerical Data: Cumulative Frequency
 Data in ordered array:
 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53,
58
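A cumulative frequency is the running total of the class frequencies. The sketch below assumes the same five classes of width 10 used in the temperature example; the variable names are illustrative.

```python
from itertools import accumulate

# Ordered array from the temperature example
temps = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
         32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

lower_limits = [10, 20, 30, 40, 50]   # assumed class lower limits
width = 10                            # assumed class width

# Frequency of each class: values with lower <= x < lower + width
frequencies = [sum(1 for t in temps if lo <= t < lo + width) for lo in lower_limits]

# Running totals, and the same totals as a percentage of all observations
cumulative = list(accumulate(frequencies))
cumulative_pct = [100 * c / len(temps) for c in cumulative]

for lo, f, cf, pct in zip(lower_limits, frequencies, cumulative, cumulative_pct):
    print(f"{lo} to less than {lo + width}: frequency {f}, "
          f"cumulative {cf}, cumulative % {pct:.0f}")
```

The last cumulative frequency always equals the number of observations, which is a quick check that no value was missed.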
Why Use a Frequency Distribution?
• It condenses the raw data into a more useful form
• It allows for a quick visual interpretation of the data
• It enables the determination of the major characteristics of the data set
including where the data are concentrated / clustered
Frequency Distributions: Some Tips
 Different class boundaries may provide different pictures for
the same data (especially for smaller data sets)

 Shifts in data concentration may show up when different class


boundaries are chosen

 As the size of the data set increases, the impact of alterations


in the selection of class boundaries is greatly reduced

 When comparing two or more groups with different sample


sizes, you must use either a relative frequency or a
percentage distribution
• How to make a frequency distribution table:
https://www.statisticshowto.com/probability-and-statistics/descriptive-statistics/frequency-distribution-table/

• Generate a frequency distribution online:
https://www.socscistatistics.com/descriptive/frequencydistribution/default.aspx

• Practice work:
https://www.mathsisfun.com/data/frequency-distribution.html
Measures of central tendency

• The central tendency is the extent to which all the data values group around a typical or central value.
 The three most commonly used averages are:
• The arithmetic mean
• The Median
• The Mode
Measures of central tendency
1. Mean:-
◦ The arithmetic average of the variable x.

◦ It is the preferred measure for interval or ratio variables with relatively symmetric observations.

◦ It has good sampling stability (i.e., it varies the least from sample to sample), implying that it is better suited for making inferences about population parameters.

◦ It is affected by extreme values


Measures of Central Tendency: The Median
Median:-
The middle value (Q2, the 50th percentile) of the variable.
In an ordered array, the median is the “middle” number (50%
above, 50% below)
It is appropriate for ordinal measures and for interval or ratio
measures.

[Figure: two dot plots on an axis from 0 to 10; in both, Median = 3]

• Not affected by extreme values


Measures of Central Tendency: The Median
• The rank of the median is (n + 1)/2 if the number of observations is odd. If the number is even, the median lies midway between the values at ranks n/2 and (n/2) + 1.
• If the number of values is odd, the median is the middle number
• If the number of values is even, the median is the average of the two middle numbers

Note that (n + 1)/2 is not the value of the median, only the position of the median in the ranked data.
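The odd and even rules can be sketched in a few lines. The function name `median` is an illustrative choice, and the sample lists are made up except for the 20 temperatures, which come from the frequency distribution example.

```python
def median(values):
    """Median of ungrouped data using the rank rules for odd and even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: rank (n + 1)/2, which is zero-based index n // 2
        return ordered[mid]
    # Even n: average of the values at ranks n/2 and n/2 + 1
    return (ordered[mid - 1] + ordered[mid]) / 2


temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

print(median([1, 2, 3, 4, 5]))     # odd n
print(median([1, 2, 3, 4, 100]))   # an extreme value does not move the median
print(median(temps))               # even n: average of the 10th and 11th values
```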


Median for Grouped Data
Formula for Median is given by

$$\text{Median} = L + \frac{(n/2) - m}{f} \times c$$

Where
L = Lower limit of the median class
n = Total number of observations = $\sum f(x)$
m = Cumulative frequency preceding the median class
f = Frequency of the median class
c = Class interval of the median class
Median for Grouped Data Example
 Find the median for the following continuous frequency distribution:

Class 0-1 1-2 2-3 3-4 4-5 5-6


Frequency 1 4 8 7 3 2
Solution for the Example
Class    Frequency    Cumulative Frequency
0-1      1            1
1-2      4            5
2-3      8            13
3-4      7            20
4-5      3            23
5-6      2            25
Total    25

L = Lower limit of the median class
n = Total number of observations
m = Cumulative frequency preceding the median class
f = Frequency of the median class
c = Class interval of the median class

Here n/2 = 25/2 = 12.5, so the median class is 2-3, the first class whose cumulative frequency (13) reaches 12.5. Hence L = 2, m = 5, f = 8 and c = 1.

Substituting the relevant values in the formula, we have

$$\text{Median} = L + \frac{(n/2) - m}{f} \times c = 2 + \frac{(25/2) - 5}{8} \times 1 = 2.9375$$
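The same calculation can be sketched in code. The function name `grouped_median` and the (lower, upper, frequency) tuple layout are assumptions for illustration; the data are the six classes of the example.

```python
def grouped_median(classes):
    """Median for grouped data: L + ((n/2 - m) / f) * c.

    classes: list of (lower, upper, frequency) tuples in ascending order.
    """
    n = sum(f for _, _, f in classes)
    if n == 0:
        raise ValueError("no observations")
    half = n / 2
    m = 0  # cumulative frequency before the current class
    for lower, upper, f in classes:
        if m + f >= half:
            # First class whose cumulative frequency reaches n/2 is the median class
            return lower + (half - m) / f * (upper - lower)
        m += f


example = [(0, 1, 1), (1, 2, 4), (2, 3, 8), (3, 4, 7), (4, 5, 3), (5, 6, 2)]
print(grouped_median(example))
```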
Measures of Central Tendency: The Mode
3 Mode:-
◦ The most frequently occurring value in the data set.
◦ May not exist or may not be uniquely defined.
◦ It is the only measure of central tendency that can be used with
nominal variables, but it is also meaningful for quantitative variables
that are inherently discrete.

[Figure: two dot plots. On the axis from 0 to 14, Mode = 9; on the axis from 0 to 6, there is no mode.]
Mode for Grouped Data

$$\text{Mode} = L + \frac{d_1}{d_1 + d_2} \times c$$

Where
L = Lower limit of the modal class
$d_1 = f_1 - f_0$, $d_2 = f_1 - f_2$
$f_1$ = Frequency of the modal class
$f_0$ = Frequency preceding the modal class
$f_2$ = Frequency succeeding the modal class
c = Class interval of the modal class
Mode for Grouped Data Example
 Example: Find the mode for the following continuous frequency
distribution:

Class 0-1 1-2 2-3 3-4 4-5 5-6


Frequency 1 4 8 7 3 2
Solution for the Example

Class    Frequency
0-1      1
1-2      4
2-3      8
3-4      7
4-5      3
5-6      2
Total    25

The modal class is 2-3, so
L = 2
$d_1 = f_1 - f_0$ = 8 - 4 = 4
$d_2 = f_1 - f_2$ = 8 - 7 = 1
c = 1

$$\text{Mode} = L + \frac{d_1}{d_1 + d_2} \times c = 2 + \frac{4}{5} \times 1 = 2.8$$
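A short sketch of the grouped mode formula, using the same example data. The function name `grouped_mode` and the tuple layout are illustrative; treating a missing neighbouring class as frequency 0 is an assumption the slides do not discuss.

```python
def grouped_mode(classes):
    """Mode for grouped data: L + d1 / (d1 + d2) * c.

    classes: list of (lower, upper, frequency) tuples in ascending order.
    """
    freqs = [f for _, _, f in classes]
    i = freqs.index(max(freqs))          # modal class = highest frequency (first if tied)
    lower, upper, f1 = classes[i]
    f0 = freqs[i - 1] if i > 0 else 0                  # assumed 0 if no preceding class
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0     # assumed 0 if no succeeding class
    d1, d2 = f1 - f0, f1 - f2
    if d1 + d2 == 0:
        raise ValueError("formula undefined when d1 + d2 = 0")
    return lower + d1 / (d1 + d2) * (upper - lower)


example = [(0, 1, 1), (1, 2, 4), (2, 3, 8), (3, 4, 7), (4, 5, 3), (5, 6, 2)]
print(round(grouped_mode(example), 4))
```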
Measure of dispersion

Measures of variability depict how similar observations of a variable tend to be.
Variability of a nominal or ordinal variable is rarely summarized
numerically.
The measure of dispersion describes the degree of variations or dispersion
of the data around its central values: (dispersion = variation = spread =
scatter).
Range - R
Standard Deviation - SD
Coefficient of Variation -COV
Measures of Variation

Variation
• Range
• Variance
• Standard Deviation
• Coefficient of Variation

• Measures of variation give information on the spread or variability or dispersion of the data values.

[Figure: two distributions with the same center, different variation]
Measures of Variation: The Range
 Simplest measure of variation
 Difference between the largest and the smallest values:

Range = X largest – X smallest

Example:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

Range = 14 - 1 = 13
Measure of dispersion

 Range:-
It is the difference between the largest and smallest values.
It is the simplest measure of variation.
Disadvantage:- it is based only on two of the observations
and gives no idea of how the other observations are arranged
between these two.
Measures of Variation:
Why The Range Can Be Misleading
 Ignores the way in which data are distributed

[Figure: two dot plots on an axis from 7 to 12 with differently distributed values]
Range = 12 - 7 = 5 in both cases

 Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120

Range = 120 - 1 = 119


Measures of Variation: The Variance
• Average (approximately) of squared deviations of values from the mean

– Sample variance:

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$

Where $\bar{X}$ = arithmetic mean
n = sample size
$X_i$ = ith value of the variable X
Measures of Variation: The Standard Deviation
• Most commonly used measure of variation
• Shows variation about the mean
• Is the square root of the variance
• Has the same units as the original data

– Sample standard deviation:

$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$
Measures of Variation: The Standard Deviation
 Steps for Computing Standard Deviation
1. Compute the difference between each value and the mean.
2. Square each difference.
3. Add the squared differences.
4. Divide this total by n-1 to get the sample variance.
5. Take the square root of the sample variance to get the sample standard
deviation.
Measure of Standard Deviation

Uses:-
1. It summarizes the deviations of a large distribution from the mean in one figure, used as a unit of variation.
2. It indicates whether the deviation of an individual from the mean is due to chance (i.e. natural) or is real (due to some special reason).
3. It also helps in finding a suitable sample size for valid conclusions.

https://www.mathsisfun.com/data/standard-deviation.html
Measures of Variation: Sample Standard
Deviation
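Example
Sample Data ($X_i$): 10 12 14 15 17 18 18 24

n = 8, Mean = $\bar{X}$ = 16

$$S = \sqrt{\frac{(10 - \bar{X})^2 + (12 - \bar{X})^2 + (14 - \bar{X})^2 + \cdots + (24 - \bar{X})^2}{n - 1}}$$

$$= \sqrt{\frac{(10 - 16)^2 + (12 - 16)^2 + (14 - 16)^2 + \cdots + (24 - 16)^2}{8 - 1}} = \sqrt{\frac{130}{7}} = 4.3095$$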
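The five computing steps can be followed literally in code. This is a minimal sketch using the eight sample values from the example; the function name `sample_std` is illustrative.

```python
import math


def sample_std(values):
    """Sample standard deviation, following the five steps one by one."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    differences = [x - mean for x in values]     # 1. difference from the mean
    squared = [d ** 2 for d in differences]      # 2. square each difference
    total = sum(squared)                         # 3. add the squared differences
    variance = total / (n - 1)                   # 4. divide by n - 1 (sample variance)
    return math.sqrt(variance)                   # 5. square root of the variance


data = [10, 12, 14, 15, 17, 18, 18, 24]
print(round(sample_std(data), 4))
```

Python's standard library offers `statistics.stdev`, which uses the same n - 1 divisor and can serve as a cross-check.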
Standard Deviation (Sample) for Grouped Data
Frequency Distribution of Return on Investment of Mutual Funds

Return on Number of Mutual


Investment Funds
5-10 10
10-15 12
15-20 16
20-25 14
25-30 8
Total 60
Solution for the Example
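From the spreadsheet of Microsoft Excel in the previous slide, it is easy to see (with X taken as the midpoint of each class):

$$\text{Mean} = \bar{X} = \frac{\sum fX}{n} = \frac{1040}{60} = 17.333$$

$$\text{Standard Deviation} = S = \sqrt{\frac{\sum f(X - \bar{X})^2}{n - 1}} = \sqrt{\frac{2448.33}{59}} = 6.44$$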
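The totals 1040 and 2448.33 can be reproduced with a short script. This sketch assumes, as is usual for grouped data, that each class is represented by its midpoint; the function name `grouped_mean_std` is illustrative.

```python
import math


def grouped_mean_std(classes):
    """Mean and sample standard deviation for grouped data.

    classes: list of (lower, upper, frequency); each class is represented
    by its midpoint (an assumption of the grouped-data method).
    """
    n = sum(f for _, _, f in classes)
    midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]
    freqs = [f for _, _, f in classes]
    sum_fx = sum(f * x for f, x in zip(freqs, midpoints))
    mean = sum_fx / n
    sum_sq = sum(f * (x - mean) ** 2 for f, x in zip(freqs, midpoints))
    return mean, math.sqrt(sum_sq / (n - 1))


# Return on investment of mutual funds, from the frequency table
funds = [(5, 10, 10), (10, 15, 12), (15, 20, 16), (20, 25, 14), (25, 30, 8)]
mean, sd = grouped_mean_std(funds)
print(round(mean, 3), round(sd, 2))
```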
Assignment
Class Frequency
700-799 4
800-899 7
900 8
1000 10
1100 12
1200 17
1300 13
1400 10
1500 9
1600 7
1700 2
1800-1899 1

Find sample standard deviation S.D.


Measures of Variation: Comparing Standard
Deviations
• The coefficient of variation (CV) is a measure of relative variability.
• It is the ratio of the standard deviation to the mean (average).
• Always in percentage (%)
• Shows variation relative to mean
• Can be used to compare the variability of two or more sets of data measured in different units

$$CV = \left(\frac{S}{\bar{X}}\right) \times 100\%$$
Measure of dispersion
 Coefficient of variation:-
The coefficient of variation expresses the standard deviation as a
percentage of the sample mean.

C. V = SD / mean * 100

C.V is useful when, we are interested in the relative size of the


variability in the data.
Measures of Variation: Comparing Standard
Deviations
Which curve has higher SD?

B
Measures of Variation: Comparing Standard
Deviations
 The coefficient of variation (CV) is a measure of relative variability. It is the ratio of
the standard deviation to the mean (average).
Data A: Mean = 15.5, S = 3.338, CV = 21.53
Data B: Mean = 15.5, S = 0.926, CV = 5.97
Data C: Mean = 15.5, S = 4.570, CV = 29.48

[Figure: dot plots of Data A, B and C on a common axis from 11 to 21]
Measures of Variation: Comparing Coefficients
of Variation

• Drug A sale:
– Average price last year = $50
– Standard deviation = $5

$$CV_A = \left(\frac{S}{\bar{X}}\right) \times 100\% = \frac{\$5}{\$50} \times 100\% = 10\%$$

• Drug B sale:
– Average price last year = $100
– Standard deviation = $5

$$CV_B = \left(\frac{S}{\bar{X}}\right) \times 100\% = \frac{\$5}{\$100} \times 100\% = 5\%$$

Both drugs have the same standard deviation, but drug B is less variable relative to its price.
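The comparison can be reproduced with a one-line formula. This is a sketch; the function name `coefficient_of_variation` is illustrative and the inputs are the prices and standard deviations quoted for the two drugs.

```python
def coefficient_of_variation(sd, mean):
    """CV as a percentage: standard deviation relative to the mean."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100 * sd / mean


cv_a = coefficient_of_variation(sd=5, mean=50)    # Drug A
cv_b = coefficient_of_variation(sd=5, mean=100)   # Drug B
print(f"CV A = {cv_a:.0f}%, CV B = {cv_b:.0f}%")
```

Because the CV is unit free, the same function can compare data sets measured in different units.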
