
BIOSTATISTICS AND EPIDEMIOLOGY
• A population consists of all subjects (human or otherwise) that are being studied, e.g. 15 million Filipinos; children ages 6-10 who attend public school
Statistics - is a branch of mathematics working with data collection,
organization, analysis, interpretation, and presentation (Muhrey, 2008).
- is the science of conducting studies to collect, organize, summarize, analyze,
and draw conclusions from data (Bluman, 2012).
- common definition: process of data analysis, not just analyzing data but the
whole process of using scientific method. Includes research design, data
collection, organization, interpretation, and presentation of data.
- main goal: use data in answering questions and making decisions
Two Areas of Statistics:
• Mathematical Statistics – the area rooted in mathematics; involved in the development of statistical inference. The application of that statistical inference becomes applied statistics.
• Applied Statistics
Biostatistics - is the application of statistics to problems in the biological sciences, health, and medicine.
- is a branch of applied statistics for statistical methods, whether existing or new, applied to the medical sciences, biological sciences, health, and medicine
• A tool for decision making
A sample is a group of subjects selected from a population, e.g. 450 Filipino men
- a subset of a population
Hypothesis testing - decision-making process for evaluating claims about a population, based on information obtained from samples.
For example, a researcher may wish to know if a new drug will reduce the number of heart attacks in men over 70 years of age. For this study, two groups of men over 70 would be selected. One group would be given the drug, and the other would be given a placebo (a substance with no medical benefits or harm). Later, the number of heart attacks occurring in each group of men would be counted, a statistical test would be run, and a decision would be made about the effectiveness of the drug.
VARIABLE AND DATA
Example:
In your research problem, you want to know if there is an association between gender and intellectual level of secondary students studying in a public school
- What/who is the subject of your study? Secondary students in public school
- Next thing you must do is gather information about your subjects; this info will serve as your data.
• Research subjects – ones that provide you data
- What data would you want to obtain or measure? Must be in conjunction with your research problem
Variables and Types of Data
- For you to obtain the data, you must first identify the variable.
Variable -a characteristic or attribute that can assume a different value; e.g.
Gender, Intelligence Quotient
Once you identify your variables, you can now gather data.

Subject  Gender  IQ
1        Male    9
2        Female  8
3        Male    8
4        Female  7
5        Female  8
• Numbers represent subject (Subject 1-5)
• Each row represents the information you gathered from each of your subjects. This info is your data.
• This set of data is your data set
• A single piece of data is known as a datum
• Data - values (measurements or observations) that a variable can assume
• Data set - collection of data
Qualitative variables are variables that can be placed into distinct categories, according to some characteristic or attribute.
- more on nominal data in which mathematics cannot be applied
- "men" vs "women"
- "Patricia", "Louie"
- names, gender, eye color, flavors, brands, etc.
Quantitative Data
• Discrete variables assume values that can be counted.
- usually a whole number
- e.g. number of children in the family, number of students in a classroom, and number of pizza slices
• Continuous variables can assume an infinite number of values between any two specific values. They are obtained by measuring. They often include fractions and decimals.
- BMI, glucose level, hemoglobin level
USE OF DATA
1. Descriptive Statistics
2. Inferential Statistics
DESCRIPTIVE STATISTICS VS INFERENTIAL STATISTICS
DESCRIPTIVE - consists of the collection, organization, summarization, and presentation of data.
- describes the data
* Before you analyze the data, you first need to describe the data.
- main objective: make the data presentable and easy to understand.
INFERENTIAL - consists of generalizing from samples to populations, performing estimations and hypothesis tests, determining relationships among variables, and making predictions.
- analytical type of statistics
- Goal: to test a hypothesis in order to evaluate a claim
- utilizes tests for relationship, effect, difference (RED)
- Uses probability
LEVELS OR SCALES OF MEASUREMENT
• The nominal level of measurement classifies data into mutually exclusive (nonoverlapping) categories in which no order or ranking can be imposed on the data.
- includes types (types of cars, cellphones, gender)
- no order or rank, mutually exclusive categories (none is higher than another)
- yes or no, positive or negative
- pass or fail
• The ordinal level of measurement classifies data into categories that can be ranked; however, precise differences between the ranks do not exist.
- the order or ranking matters
- satisfied, moderately satisfied, very satisfied
- cancer stages
- high, moderate, low
- a-grade, b-grade, c-grade
- no precise measurement of the differences between ranks
• The interval level of measurement ranks data, and precise differences between units of measure do exist; however, there is no meaningful zero.
- with order and has equal differences because of a certain measurement
- no true zero, zero is arbitrary
- temperature, there is no such thing as no temperature
• The ratio level of measurement possesses all the characteristics of interval measurement, and there exists a true zero. In addition, true ratios exist when the same variable is measured on two different members of the population.
- there is a true zero
- weight, BMI, hemoglobin levels
- continuous data
- 250 kilometers
Categories of Data
• According to source
✓ Primary: data collected first-hand
- collected through interview, FGD or Focus Group Discussion, self-administered questionnaire, observation
✓ Secondary: have been previously collected, gathered, which may have been published for some other purposes
- RRL
• According to relationship
✓ Independent
✓ Dependent
• Use
✓ Nominal - classification, no order
- blood groups, patient ID number
✓ Ordinal - ranking; no absolute value but only order; discrete
✓ Interval - score/mark; no absolute zero
✓ Ratio - has absolute zero, continuous
Example of Secondary Data Sources
• Census - complete enumeration of population, best source of data
on population size and distribution according to age
• Vital events - civil status and deaths, morbidity, and mortality rate
• Reports of occurrence of notifiable diseases - surveillance/
• Logbooks – recorded file in which you could acquire secondary
data sources
Methods of Data Collection
• Documented sources
• Sample Survey
• Census Survey
• Physical Observation
• Interview
Qualities of a good statistical data
• Timeliness
• Completeness
- Completeness of coverage - geography and inclusion of target
- Completeness of accomplishing
• Accuracy - reflection of true situation
• Precision - repeatability
• Relevance
• Adequacy
Sampling Human Populations
• Act of studying or examining only a segment of the population representing
the whole
Advantages of Sampling
• Cheaper
• Faster
• Better quality
• More comprehensive data may be obtained
Uses of Sampling in Public Health
• Prevalence Survey - evaluating health status of a population
- how many patients have anemia?
• Risk Factors Investigation - identify risk factors
- Framingham Study, relationships: how high blood pressure and high blood cholesterol can be major risk factors for CVD
• Evaluating effectiveness of health measures - health programs
- effectiveness: is a contraceptive still effective in preventing pregnancy?
• Evaluating reliability and completeness of record systems
LECTURE 2: DATA COLLECTION AND SAMPLING
Research and Statistics
• Research is a problem-solving activity
• Research follows the scientific method of inquiry
• Research involves collection of data to answer scientific inquiry that will lead us into an informed decision or discoveries
- current or past research become a baseline for new or other
- studying only a segment of a population to represent the whole
Population: the entire pool from which statistical sample is drawn
Target Population: total group of individuals from which the sample might be drawn
- a group from which representation, information is desired and to which inferences will be made
Data Collection
Sampling Units: units which are chosen in selecting the sample
- Parilla family as a sample to get glucose level
Sampling frame: collection or list of all the sampling units
Elementary units or element: an object or a person on which a measurement is taken or an observation is made
- among the Parilla family, who will have blood extracted
2. Accidental/Haphazard sampling – the sample is made up of those who come to hand or who are available
3. Quota Sampling – samples of a fixed size are obtained from predetermined subdivisions of the population
- counterpart of stratified sampling
4. Convenience Sampling – study units that are easily accessible are selected as the sample
5. Snowball technique - for "hidden population"
- the sample is obtained by a process whereby an individual to be included is identified by a member who was previously included (referrals)
Criteria for Selecting Good Sampling Design
• Sample obtained should be representative
• Adequate
• Practical and feasible
• Economic and efficient
Sample Size Estimation
Slovin's Formula – used to calculate the sample size given the population size and a margin of error.
If a sample is taken from a population, a formula must be used to take into
account confidence levels and margin of error
Margin of error – statistic expressing the amount of random sampling error in
the result
- the larger the margin of error, the less confidence one should have that
result would reflect the result of a survey of the entire population
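Slovin's formula is commonly written as n = N / (1 + N e^2), where N is the population size and e is the margin of error. A minimal sketch of that computation follows; the function name and the choice to round up to the next whole subject are assumptions made for illustration, not something stated in the lecture.

```python
import math


def slovin_sample_size(population_size: int, margin_of_error: float) -> int:
    """Sample size from Slovin's formula: n = N / (1 + N * e^2)."""
    if population_size <= 0:
        raise ValueError("population_size must be positive")
    if not 0 < margin_of_error < 1:
        raise ValueError("margin_of_error must be between 0 and 1")
    n = population_size / (1 + population_size * margin_of_error ** 2)
    # Assumption: round up so the sample is never smaller than the formula asks for.
    return math.ceil(n)


# Hypothetical example: population of 1,000 with a 5% margin of error.
print(slovin_sample_size(1000, 0.05))
```

A smaller margin of error gives a larger required sample, which matches the note above that a larger margin of error means less confidence in the result.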

Probability or random sampling – each member of a population has a known, non-zero chance of being selected as a sample
- every element of a population has an equal chance of being included in the sample
- more chances of winning
Non-probability or non-random sampling – probability of each member of the population being selected as part of the sample cannot be determined.
- biased
- should not be used in inferential statistics; simple random sampling should be used instead
- does not make any claim to be representative of the population under study. Therefore, the generalizability of the result is limited.
Probability Random Sampling
• Simple random – most suitable for inferential statistics
- Toss-coin
- Computer assisted
- Random numbers
- Fish-bowl
• Systematic sampling – sample members from a larger population
are selected according to a random starting point, but with a fixed
periodic interval (sampling interval)
- calculated by dividing the population size by the desired sample
size (K=N/n)
- there is a system; you know who you are going to choose
• Stratified Random Sampling - inclusion of subgroups within the population e.g., Drug users, teen and adults - Male and Female
- the population is divided into non-overlapping groups or strata (many layers), along a relevant dimension such as gender, ethnicity, political affiliation, and so on. Then the researcher collects a random sample out of the population members from each stratum
• Cluster Sampling – the selection of groups of study units instead of the selection of the study units individually
- population is divided into clusters, and some are then chosen at random
• Multistage – involves more than one sampling method
- community-based type of research
Non-Probability Sampling Designs
1. Judgement/Purposive - representative sample of the population is selected based on an expert's judgement or pre-specified criteria.
- selected or pre-determined
- FGD
Considerations for Sample Size Estimation
1. Study Design
• Cluster > Random sampling
• Longitudinal > Cross-sectional
2. Magnitude
• Rare > common
3. Variability
• Heterogenous > homogenous sample
4. Level of precision desired
5. Data analysis plan
• Multivariate > univariate analysis
• Univariate – analysis of a single variable
• Multivariate – analysis of multiple variables, involves independent variables
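The systematic sampling rule described in this lecture (a random starting point, then every K-th member, with K = N/n) can be sketched as follows. The function name, the `seed` parameter, and the decision to floor K when N is not an exact multiple of n are assumptions for illustration.

```python
import random


def systematic_sample(population, sample_size, seed=None):
    """Pick every K-th member after a random start, where K = N // n."""
    population_size = len(population)
    if not 0 < sample_size <= population_size:
        raise ValueError("sample_size must be between 1 and the population size")
    interval = population_size // sample_size  # sampling interval K = N / n (floored)
    rng = random.Random(seed)
    start = rng.randrange(interval)  # random starting point within the first interval
    return [population[start + i * interval] for i in range(sample_size)]


# Hypothetical sampling frame: subjects numbered 1 to 100, desired sample of 10.
frame = list(range(1, 101))
sample = systematic_sample(frame, 10, seed=1)
print(sample)
```

Once the starting point is drawn, the rest of the sample is fully determined, which is the sense in which "you know who you are going to choose".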
LECTURE 3: DESCRIPTIVE STATISTICS
- methods concerned with collecting, describing, and analyzing a set of data without drawing conclusions (or inferences) about a larger group
- you cannot conclude anything yet because you are still describing
- no hypothesis testing
INFERENTIAL STATISTICS
- methods concerned with the analysis of a subset of data leading to predictions or inferences about the entire set of data.
- determine associations, either relationships or differences (RED)
- there is hypothesis testing and conclusion
- relationship: association
- influence: effect
- difference: comparison
DESCRIPTIVE STATISTICS DEFINITION
- main goal is to describe the data
- Consists of the collection, organization, summarization, and presentation of data
• Descriptive statistics are used by researchers to report on populations and samples
- prevalence and incidence
- involves percentages, frequencies
1. Tabular
2. Graphical
3. Numbers
a. Location
b. Variation
c. Distribution
Tables and Graphs – limited
Numbers – used for continuous data
MEASURES OF LOCATION
• A Measure of Location summarizes a data set by giving a "typical value" within the range of the data values that describes its location relative to the entire data set.
• Some Common Measures:
• Minimum, Maximum
• Central Tendency (MEAN, MEDIAN, MODE)
• Percentiles, Deciles, Quartiles
MINIMUM AND MAXIMUM
• Minimum is the smallest value in the data set, denoted as MIN.
• Maximum is the largest value in the data set, denoted as MAX.
• Measurements of Hemoglobin (g/dL)
• 7.5 (MIN)
• 8
• 11
• 14.9 (MAX)
MEASURES OF CENTRAL TENDENCY
A single value that is used to identify the "center" of the data
- it is thought of as a typical value of the distribution
- precise yet simple
- most representative value of the data
MEAN
• Most common measure of the center
• Also known as arithmetic average
- sum of the values divided by the total number of values of data
• Measurements of Hemoglobin (g/dL)
• 7.5
• 8
• 11
• 13.5
• 14.9
Mean = 10.98 or 11
- remember when finding mean of the population use the mu sign (μ)
- use the x-bar in finding mean of the sample
Parameter – measurement or observation obtained from the population
- denoted by Greek symbols
Statistic - measurement or observation obtained from the sample
- denoted by Roman letters
- presence of outliers, extremely far or abnormal values (14)
• subgroup means can be combined to come up with a group mean
• easily affected by extreme values
Example: you are given age data of 1-10, so the data must fall around these values
MEDIAN (MD)
• Divides the observations into two equal parts
- in finding, arrange data from lowest to highest value then select the middle
- If n is odd, the median is the middle number.
- If n is even, the median is the average of the 2 middle numbers.
• may not be an actual observation in the data set
• can be applied in at least ordinal level
• a positional measure; not affected by extreme values
MODE
- the value that occurs most often in the data set
• occurs most frequently; a nominal average
• computation of the mode for ungrouped or raw data
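The mean, median, and mode described in this lecture can be checked with Python's standard `statistics` module. This is only a sketch: the hemoglobin values are the ones listed in the notes, while the even-sized list and the blood-type list are hypothetical data added to show the even-n median rule and a mode on nominal data.

```python
import statistics

# Hemoglobin measurements (g/dL) taken from the notes.
hemoglobin = [7.5, 8, 11, 13.5, 14.9]

mean_hb = statistics.mean(hemoglobin)      # sum of the values / number of values
median_hb = statistics.median(hemoglobin)  # middle value of the sorted data (n is odd)

# Hypothetical even-sized data set: the median is the average of the 2 middle numbers.
median_even = statistics.median([7.5, 8, 11, 13.5])

# Hypothetical nominal data, where the mode is the only applicable average.
blood_types = ["O", "A", "O", "B", "AB", "O", "A"]
modes = statistics.multimode(blood_types)  # returns a list because the mode may not be unique

print(round(mean_hb, 2), median_hb, median_even, modes)
```

`multimode` is used instead of `mode` so that bimodal or multimodal data return every most-frequent value rather than a single one.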
• can be used for qualitative as well as quantitative data
• may not be unique
• not affected by extreme values
• may not exist
- unimodal, bimodal, multimodal, no mode
- modal class
• Use the mean when:
- sampling stability is desired
- other measures are to be computed
• Use the median when:
- the exact midpoint of the distribution is desired
- there are extreme observations
• Use the mode when:
- the "typical" value is desired
- the dataset is measured on a nominal scale
- same mean but different distribution of data
S – absolute dispersion
- the closer the values are to the mean, the smaller the SD and the lesser the variation of the data
- the further the data are from the mean, the higher the SD
• Absolute Measures of Dispersion:
• Range
• Inter-quartile Range (difference between third and first quartile)
• Variance (sigma^2 or SD^2)
• Standard Deviation
• Relative Measure of Dispersion:
• Coefficient of Variation (divide the SD by the mean)
- percentage reporting (x100)
RANGE
• The difference between the maximum and minimum value in a data set, i.e. R = MAX - MIN
• Example: Pulse rates of 15 male residents of a certain village
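The range, variance, standard deviation, and coefficient of variation named in this lecture can be sketched as below. The pulse-rate data for the range example did not survive in these notes, so the hemoglobin values from the mean example are reused. Using the sample formulas (n - 1 in the denominator) is an assumption; `statistics.pvariance` and `statistics.pstdev` would give the population versions.

```python
import statistics

# Hemoglobin measurements (g/dL) reused from the mean example in the notes.
hemoglobin = [7.5, 8, 11, 13.5, 14.9]

data_range = max(hemoglobin) - min(hemoglobin)  # R = MAX - MIN
variance = statistics.variance(hemoglobin)      # sample variance (assumes n - 1 denominator)
sd = statistics.stdev(hemoglobin)               # square root of the variance, same units as the data

# Coefficient of variation: SD divided by the mean, reported as a percentage.
cv_percent = sd / statistics.mean(hemoglobin) * 100

print(f"range={data_range:.1f} variance={variance:.3f} sd={sd:.3f} cv={cv_percent:.1f}%")
```

Because the CV is unit-free, it is the measure to reach for when comparing the spread of variables recorded in different units.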
PROPERTIES OF RANGE
• The larger the value of the range, the more dispersed the observations are.
• It is quick and easy to understand
• A rough measure of dispersion.
STANDARD DEVIATION (SD)
• most important measure of variation
• square root of Variance
• has the same units as the original data
PERCENTILE
- most common way to report relative standing of a number
- the percentage of individuals in the data set who are below where your particular number is located
EXAMPLE
• Suppose LJ was told that relative to the other scores on a certain test, his score was the 90th percentile.
• How to find the percentile? If K (number of scores) = 25
- arrange data in order
- (.9 x 25) = 22.5, rounded up to 23
- start counting from left to right until you reach the 23rd number, which is 98
43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72, 77, 78, 79, 85, 87, 88, 89, 93, 95, 96, 98, 99, 99
- LJ's score is 98 over 100
• This means that 90% of those who took the test had scores less than or equal to LJ's score, while 10% had scores higher than LJ's.
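The counting procedure in the percentile example can be sketched as a small function. Two assumptions are made here: the list of 25 scores contains the value 69 twice (so that the count matches K = 25 and the 23rd value is 98), and the position is always rounded up, which is the only case the notes walk through.

```python
import math


def percentile_value(values, percentile):
    """Value at the given percentile using the 'multiply, round up, count' rule."""
    data = sorted(values)  # arrange data in order
    position = math.ceil(percentile * len(data) / 100)  # e.g. 90 * 25 / 100 = 22.5 -> 23
    position = max(position, 1)  # guard against a position of 0 for very low percentiles
    return data[position - 1]  # count from the left; Python lists are 0-indexed


# Test scores from the example (assumed to include 69 twice, for 25 scores in total).
scores = [43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72, 77,
          78, 79, 85, 87, 88, 89, 93, 95, 96, 98, 99, 99]

lj_score = percentile_value(scores, 90)
print(lj_score)
```

Statistical packages often interpolate between neighboring values instead of rounding up, so their percentile results can differ slightly from this hand-counting rule.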
DECILES
• Divide an array into ten equal parts, each part having ten percent of the distribution of the data values, denoted by Dj.
• The 1st decile is the 10th percentile; the 2nd decile is the 20th percentile
- suppose you have 1-100 values
- once you have the data array, divide 100 by 10; every tenth value is a decile
QUARTILES
• Divide an array into four equal parts, each part having 25% of the distribution of the data values, denoted by Qj.
• The 1st quartile is the 25th percentile; the 2nd quartile is the 50th percentile, also the median; and the 3rd quartile is the 75th percentile.
- these quartiles, together with the minimum and maximum values, make up the box-and-whisker plot
REMARKS ON STANDARD DEVIATION
• If there is a large amount of variation, then on average, the data values will be far from the mean. Hence, the SD will be large.
• If there is only a small amount of variation, then on average, the data values will be close to the mean. Hence, the SD will be small.
• A measure of variation is a single value that is used to describe the spread of the distribution
• A measure of central tendency alone does not uniquely describe a distribution
• The SD is the most widely used measure of dispersion (Chebyshev's Inequality)
• It is based on all the items and is rigidly defined.
• It is used to test the reliability of measures calculated from samples.
• The standard deviation is sensitive to the presence of extreme values.
• It is not easy to calculate by hand (unlike the range).
SKEWNESS
A distribution is said to be symmetric about the mean if the distribution to the left of the mean is the "mirror image" of the distribution to the right of the mean. Likewise, a symmetric distribution has SK=0 since its mean is equal to its median and its mode.
- obtained using a histogram, a graphical presentation of continuous data
- positively skewed (skewed to the right), symmetric (normal distribution, unimodal), negatively skewed
- mean and median can be seen at center of distribution
- normal distribution is the standard for parametric tests
KURTOSIS
• Describes the extent of peakedness or flatness of the distribution of the data
• Measured by coefficient of kurtosis (K) computed as,
Zero – mesokurtic (normally distributed)
- you can do parametric tests
Platykurtic – far from mean
- nonparametric
Leptokurtic – very close to the mean
- nonparametric
TABLES AND GRAPHS
Types of descriptive statistics:
- frequency distribution
- relative frequency distributions
• Graphs
- bar chart (categorical) or histogram (continuous)
- stem and leaf plot
- frequency polygon
FREQUENCY DISTRIBUTION
- organizes data in tabular form
- two types: categorical and grouped
Absolute frequency – actual count, number of subjects obtained for a specific category
Relative frequency – divide the number of subjects in a specific category by the total number of subjects (you can present it in percentage form)
After obtaining the frequencies, present them in graphs
Categorical – either pie chart or bar chart
When using a bar graph, make sure the axis begins at zero; not starting at zero can lead to overestimation of the data
- how are you going to describe categorical data?
Histogram – graph to represent distribution
- continuous data (measurement between two variables exists)
- continuous vertical bars
Height – frequency or numbers of subjects obtained for each class
Width – difference between lower class boundary and upper-class boundary


Cumulative relative frequency – accumulation of values

- add previous values to the current value
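Absolute, relative, and cumulative relative frequencies can be tabulated in a few lines. This sketch uses the Gender column of the five-subject data set from Lecture 1; listing the categories in alphabetical order is an assumption made only so the running total has a fixed order.

```python
from collections import Counter
from itertools import accumulate

# Gender of the five subjects in the Lecture 1 data set.
gender = ["Male", "Female", "Male", "Female", "Female"]

absolute = Counter(gender)      # absolute frequency: actual count per category
total = sum(absolute.values())  # total number of subjects
categories = sorted(absolute)   # assumed display order (alphabetical)

# Relative frequency: count in a category divided by the total number of subjects.
relative = {c: absolute[c] / total for c in categories}

# Cumulative relative frequency: add the previous values to the current value.
cumulative = dict(zip(categories, accumulate(relative[c] for c in categories)))

for c in categories:
    print(f"{c:<8}{absolute[c]:>3}{relative[c]:>8.0%}{cumulative[c]:>8.0%}")
```

Cumulative frequencies are most meaningful when the categories or classes have a natural order, as in the grouped classes used for an ogive.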
Box plot - one-dimensional graph of numerical data that is based on the 5-number summary
Q1: 25th percentile
Q2: 50th percentile
Q3: 75th percentile
Interquartile range: Q3-Q1
Identify first the 5-number summary, create a line, put a scale
Locate the 5-number summary
Make a box between Q1 and Q3, make a line at the median for Q2
Whiskers: Min and Max values (outliers – outside)
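The 5-number summary behind the box plot can be computed as sketched below, again using the hemoglobin values from the notes. Quartile conventions differ between textbooks and software; this sketch assumes the default "exclusive" method of `statistics.quantiles`, so JASP or a hand calculation may give slightly different Q1 and Q3.

```python
import statistics


def five_number_summary(values):
    """Return MIN, Q1, median (Q2), Q3, MAX and the interquartile range."""
    data = sorted(values)
    q1, q2, q3 = statistics.quantiles(data, n=4)  # assumed quartile method: 'exclusive'
    return {
        "min": data[0],
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": data[-1],
        "iqr": q3 - q1,  # interquartile range: Q3 - Q1
    }


summary = five_number_summary([7.5, 8, 11, 13.5, 14.9])
print(summary)
```

The box is drawn from `q1` to `q3` with a line at `median`, and the whiskers extend toward `min` and `max`.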

Ogive – cumulative frequency graph (ascending graph)
X axis – category or class
Y axis – frequency
Larger values are grouped into categories (CLASS)
Lower class limit (10) and upper class limit (14)
Another graph to represent frequency distribution is the frequency polygon (similar to ogive)
Midpoint formula: add the upper and lower class limits then divide by 2 (x-axis), e.g. (10 + 14) / 2 = 12
Ogive – cumulative frequency
Frequency polygon – relative or absolute count

How to Navigate JASP

• In preparing the data set, arrange it so that each column is one variable (vertical) and each row is one respondent (horizontal).
• Unmerge cells containing variable names
• Recode data that are not numerical (nominal and ordinal data) – assign numbers to data
1. To make it easy in case you are given a large amount of data, press control F (Ctrl F) then 'Replace.'
2. Fill in
Example:
Find what: M
Replace with: 1
3. Click 'Match case' and 'Find entire cells only' to keep the system from changing data in other cells
4. Click replace all
• Excel file should be clean; delete unnecessary cells, then the file is ready for JASP
- better to duplicate the Excel file so that you can keep track of the coding
• Save as a .csv file, better locate it in a folder
• Open the JASP application
• Click menu, open file, go to computer
• Find the .csv file then open
• Observe the icons then change the levels of measurement if needed, or if JASP identified the data incorrectly
• Assign labels to recoded data (e.g., 1 with M)
• Select variables you want to test
• If you want to rearrange table, click 'transpose'
• Nominal and ordinal data are not best described using mean and standard deviation because, in terms of distribution, they are not normally distributed. So instead of mean and SD, we can choose a frequency table.
• If you want to use charts, you can find it in 'Distribution plots'
• If you want to split your data or compare, select the variables to analyze and, separately, the variable you want to split by.
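The recoding step above (replace M with 1, then save as .csv) is done by hand in Excel in this lecture. As an alternative technique, the same idea can be sketched in Python with the standard `csv` module. The column names, the code 2 for F, and the sample rows below are hypothetical; only the M to 1 mapping comes from the notes.

```python
import csv
import io

# Coding scheme: M -> 1 is from the notes; F -> 2 is an assumed code for illustration.
GENDER_CODES = {"M": 1, "F": 2}


def recode_rows(rows, column, codes):
    """Return copies of the rows with one column replaced by its numeric code."""
    recoded = []
    for row in rows:
        new_row = dict(row)
        # Exact, case-sensitive lookup, like 'Match case' with whole-cell matching.
        new_row[column] = codes[row[column]]
        recoded.append(new_row)
    return recoded


def to_csv_text(rows):
    """Write the rows as CSV text: one column per variable, one row per respondent."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical respondents laid out as the notes recommend.
respondents = [
    {"id": 1, "gender": "M", "iq": 9},
    {"id": 2, "gender": "F", "iq": 8},
    {"id": 3, "gender": "M", "iq": 8},
]

csv_text = to_csv_text(recode_rows(respondents, "gender", GENDER_CODES))
print(csv_text)
```

Keeping the original rows untouched and writing the recoded copy separately mirrors the advice to duplicate the Excel file so the coding can be tracked.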
