Slides chp01 Stats 20221

biometry – bio220
chapter 1
statistics and samples
1
outline
• data and statistics
• populations and samples
• parameter vs. sample estimate
• hypothesis testing
• good sampling
• sampling units and independence
• missing data and bias
• bias and accuracy
• sampling error and precision
• causes of sampling bias
• missing data
• random sampling strategies
• types of variables
• frequency & probability distributions
• observational vs experimental studies
2
biological data
3
biological data
• collect and analyze data, in order to:
• describe living systems, their evolution and
interactions,
• test hypotheses,
• develop models,
• make predictions
4
statistics
• collection,
• analysis,
• interpretation,
• effective communication & presentation of
data
• Taking uncertainty into account
• statistics + computer science → data science
5
statistics
• Origin of Species vs. a 2016 genetics paper
6
statistics
• biology today has become a highly
quantitative science, similar to physics
7
statistics
statistics can be used to:
• describe data
• make estimations (guess unknown quantities)
• test hypotheses
• make inferences or predictions
8
statistics
• estimations about populations we cannot
observe, made from samples,
– e.g. number of derived SARS-CoV-2 variants in
Ankara
• inferences about past events
– e.g. when did SARS-CoV-2 start spreading?
• predictions about future events
– e.g. how likely are new mutations next season?
9
sampling
10
https://www.sigmamagic.com/blogs/online-sample-size-calculators/
estimations about populations
from samples
• population – all individuals of interest, usually
theoretical
– in many instances, data cannot be collected from the full
population
– e.g. all RNA molecules in your throat, or all SARS-CoV-2 in
Ankara
• sample – individuals from which data are collected in a
study, empirical
– sample represents the population but with some uncertainty
11
parameter
• a quantity describing a population
– a proportion
– an average (e.g. mean)
– a measure of variation (e.g. standard deviation)
– a measure of a relationship (e.g. correlation
coefficient)
• parameters = unknown population
characteristics
12
parameter
• a quantity describing a population
– e.g. proportion of females among METU students
– e.g. average number of lice on METU dogs
– e.g. variance in height of Ankarans
• parameters estimated based on sample data
13
parameter vs. sample estimate
• parameter → the unknown truth
• sample estimate (sample statistic) →
approximation of the truth, subject to error
• one major goal in statistics: quantifying the
error in estimation and keeping it to a minimum
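A minimal sketch of the parameter vs. sample estimate distinction. The population here is simulated (made-up heights in cm), so its mean can be computed directly; in a real study the parameter stays unknown and only the estimate is available.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 simulated heights in cm (made-up numbers).
population = [rng.gauss(170, 10) for _ in range(10_000)]

# Parameter: describes the whole population (normally unknown).
parameter = statistics.mean(population)

# Sample estimate: computed from a random sample of 50 individuals.
sample = rng.sample(population, 50)
estimate = statistics.mean(sample)

print(f"population mean (parameter): {parameter:.2f}")
print(f"sample mean (estimate):      {estimate:.2f}")
print(f"estimation error:            {estimate - parameter:+.2f}")
```

The estimate lands near the parameter but does not equal it; the gap is the error that statistics tries to quantify.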
14
hypothesis testing
• in biology, we usually do not prove statements
– as opposed to mathematics
• we develop scientific models that explain
phenomena, using:
– theoretical work (synthesizing models based on
earlier established facts and evidence)
– hypothesis testing (comparing data with models)
which approaches should be used is an ongoing
discussion in philosophy of science
15
hypothesis testing
• the result of a hypothesis test depends on:
– the test method
– the data (sample size, variance, etc)
– how well the data fit either hypothesis
• inferential (classical): involves comparing null vs.
alternative models
16
classical hypothesis testing
• null hypothesis – usually the simpler one
– e.g. “no effect”, “no difference”, etc.
– e.g. “smoking does not affect lifespan”
• alternative hypothesis – may contain an
additional effect, or the effect that is assumed
– e.g. “smoking affects lifespan” or “smoking
shortens lifespan”
17
classical hypothesis testing
• if data are incompatible with null hypothesis
(rejection) → support for the alternative
hypothesis
• if data are compatible with null hypothesis (no
rejection) → no support for the alternative
hypothesis
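One way to check whether data are compatible with a null hypothesis of "no difference" is a permutation test. This is only a sketch of one possible method, not the one the slides prescribe: the lifespans are made up, `permutation_p_value` is a hypothetical helper, and the 0.05 cutoff is a common convention assumed here.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(group_a, group_b, n_perm=5000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_perm):
        # Under the null hypothesis the group labels are interchangeable.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add 1 to both counts so the p-value is never exactly 0.
    return (extreme + 1) / (n_perm + 1)

# Made-up lifespans in years, for illustration only.
smokers = [68, 71, 65, 70, 66, 69, 64, 72]
non_smokers = [78, 74, 80, 76, 79, 75, 81, 77]

p = permutation_p_value(smokers, non_smokers)
if p < 0.05:
    print(f"p = {p:.4f}: data incompatible with the null hypothesis (rejection)")
else:
    print(f"p = {p:.4f}: data compatible with the null hypothesis (no rejection)")
```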
18
sampling populations
• good samples vital for interpretation and
reproducibility
• what makes a sample a good sample?
19
cat fall example
20
cat fall example
conclusion of 1987 paper:
extreme height increases survival chance (perhaps affecting muscle tension)
21
how to collect good samples?
• good sample = units are independent (e.g. no
relatives)
• good sample = random sample, without bias
– deviates from population only by chance
• good sample = large sample size
– contains large amount of information
22
sampling units and independence
• “individual”, “unit”, “replicate”, “subject”
• units should be independent
• because regular statistical methods assume
observations are independent
23
• if sample contains non-independent units (= if
units are related)
• → they should be treated specially
• solution 1: use the average per related group
• solution 2: use special methods, where you
input info about related units (e.g. regression
methods)
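Solution 1 can be sketched as follows, assuming made-up measurements tagged with a hypothetical family ID. Each family is reduced to a single mean, so the analysis uses one independent value per related group.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical measurements: (family_id, body_mass). Siblings share a family.
observations = [
    ("fam1", 10.0), ("fam1", 12.0),
    ("fam2", 20.0),
    ("fam3", 14.0), ("fam3", 16.0), ("fam3", 18.0),
]

# Group the values by family.
by_family = defaultdict(list)
for family_id, value in observations:
    by_family[family_id].append(value)

# One average per related group: these means are the independent units.
family_means = {fam: mean(vals) for fam, vals in by_family.items()}
print(family_means)
```

The sample size for later analysis is then the number of families (3), not the number of raw measurements (6).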
24
• a unit may also contain multiple observations
from the same subject – e.g. repeated
measures
• biological replicates (from different individuals)
include both technical & biological variation
• technical replicates (from the same individual)
allow measuring technical variation → specially
treated in analysis
25
bias and accuracy
• missing observations:
• sampling might be non-random, and cause bias
• e.g. you wish to estimate influenza load in METU
students, and you sample students from the health
center
• e.g. you wish to estimate voting preferences of
Turkish citizens but you only conduct
questionnaires in Izmir
26
bias and accuracy
• bias: how much a sample systematically
deviates from the population
• accuracy: how unbiased sample estimates are
relative to the true population parameter
– accurate = unbiased
27
sampling error and precision
• sampling error – difference among sample
estimates (from each other)
• precision – similarity of sample estimates to
each other
– sampling error = imprecision
28
accuracy vs. precision
• yellow area = population parameter
• black points = sample estimates
29
30
sampling error and precision
• larger samples →
– less sampling error = higher precision
– important: larger samples cannot avoid bias
• if no bias exists, sample estimates will deviate
from the population parameter only due to
sampling error
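A small simulation can illustrate both points, under assumed conditions: a made-up population, and a deliberately biased scheme that can only pick units above the population median. `repeated_estimates` is a hypothetical helper that repeats the sampling many times.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population of 20,000 values (made-up numbers).
population = [rng.gauss(50, 10) for _ in range(20_000)]
true_mean = statistics.fmean(population)

def repeated_estimates(pool, sample_size, n_repeats=300):
    """Sample means from many independent random samples of the pool."""
    return [statistics.fmean(rng.sample(pool, sample_size))
            for _ in range(n_repeats)]

small = repeated_estimates(population, 10)   # small unbiased samples
large = repeated_estimates(population, 200)  # large unbiased samples

# Biased scheme: only units above the median can ever be sampled.
median = statistics.median(population)
upper_half = [x for x in population if x > median]
biased_large = repeated_estimates(upper_half, 200)

spread_small = statistics.stdev(small)  # sampling error, n = 10
spread_large = statistics.stdev(large)  # sampling error, n = 200
bias_large = statistics.fmean(biased_large) - true_mean

print(f"spread of estimates, n=10:  {spread_small:.2f}")
print(f"spread of estimates, n=200: {spread_large:.2f}")
print(f"bias of biased scheme, n=200: {bias_large:+.2f}")
```

The larger samples give estimates that cluster more tightly (higher precision), but the biased scheme stays off target however large the sample is.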
31
• two types of error: bias & sampling error
– which one is generally considered worse?
• goal of sampling - reduce both sampling error
and bias
– every unit has equal chance of being sampled =
population represented
32
how to collect good samples?
1. ensure independence of observations
– non-independence → most methods do not
work
2. random sampling = avoid bias
– every unit has equal chance of being sampled =
population represented
– avoid missing data effects
3. large sample size → reduce sampling error
33
causes of sampling bias
• sample of convenience (simple sampling, w/o
taking into account structure) → can cause
biased samples
• volunteer bias
• missing data
34
random sampling strategies
• how to take a random sample?
– how to randomly choose 100 individuals from METU?
• list full population, decide on sample size, use
a random number generator
– random number generator: computer algorithm
that produces pseudo-random numbers
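The three steps above can be sketched with Python's standard `random` module. The list of student IDs is hypothetical (a stand-in for a real, complete population list), and the seed value is arbitrary.

```python
import random

# Step 1: list the full population (hypothetical IDs, made up here).
student_ids = [f"student_{i:05d}" for i in range(1, 25_001)]

# Step 2: decide on the sample size.
sample_size = 100

# Step 3: use a (seeded) pseudo-random number generator.
rng = random.Random(2022)

# Sampling without replacement: every individual has the same chance of
# being chosen, and nobody is chosen twice.
sample = rng.sample(student_ids, sample_size)

print(sample[:5])
```

Fixing the seed makes the draw repeatable, which helps when the sampling procedure has to be reported.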
35
36
37
• if full population information (of all
individuals) unavailable → randomly choose
units
– e.g. regions, provinces, time periods, quadrats
• collect data within each unit
• have to deal with non-independence
– can treat each unit as an individual
– or use complex statistical models
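A sketch of this strategy, assuming a hypothetical field split into 40 quadrats with made-up plant heights. Quadrats are chosen at random, data are collected within each, and each quadrat mean is then treated as one individual data point.

```python
import random
from statistics import mean

rng = random.Random(7)

# Hypothetical field: 40 quadrats, each with 5 plant heights in cm (made up).
quadrats = {q: [rng.gauss(30 + q % 4, 3) for _ in range(5)] for q in range(40)}

# Randomly choose units (quadrats) instead of individual plants.
chosen = rng.sample(sorted(quadrats), 8)

# Plants within a quadrat are not independent, so summarize each quadrat
# by its mean and treat that mean as a single observation.
quadrat_means = [mean(quadrats[q]) for q in chosen]
estimate = mean(quadrat_means)

print(f"chosen quadrats: {sorted(chosen)}")
print(f"estimated mean height: {estimate:.1f} cm")
```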
38
• if sample turns out biased, what to do?
a) throw away data
b) repeat data collection to compensate for the
bias (e.g. observe other conditions)
c) describe how sampling is done, including the
bias
→ reproducibility most important
39
types of variables
• variable - an attribute that differs from subject to
subject
• random variable – the value depends on random
events, e.g. your height
• assigned variables: e.g. treated vs. control = not
random
• sample estimates (e.g. sample mean) – also
random → also variables
40
categorical variables
• we apply different methods to different types
of variables
• categorical variables - qualitative
• nominal (without order): e.g. species
– fruit fly, nematode, human
• ordinal (with order): e.g. developmental stage
– egg, larva, pupa, adult
41
numerical variables
• numerical variables - quantitative
• discrete (counts, in fixed units): e.g. number of
offspring per mom: 0, 1, 2, 3, ...
– e.g. 500, 1000, 1500
• continuous: e.g. height: 1.76584, 1.67450, …
42
types of variables
43
types of variables
• other types:
• logical (Boolean): TRUE/FALSE
• missing values: NA
44
explanatory vs. response variables
• many methods relate one variable to another
• explanatory vs response variables
– sometimes also called independent vs dependent variables
• drug/placebo treatment vs survived/died
• country of origin vs income
• height vs weight
– explanatory on x-axis, response on y-axis
– use explanatory to predict the response values
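As an illustration of using the explanatory variable to predict the response, here is a least-squares line fitted by hand. A straight line is only one possible model, the height and weight values are made up, and `predict_weight` is a hypothetical helper.

```python
# Made-up data: height (explanatory, cm) and weight (response, kg).
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 67, 74, 84]

n = len(heights)
mean_x = sum(heights) / n
mean_y = sum(weights) / n

# Least-squares slope and intercept for the line y = intercept + slope * x.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(heights, weights))
s_xx = sum((x - mean_x) ** 2 for x in heights)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

def predict_weight(height_cm):
    """Predicted response for a given value of the explanatory variable."""
    return intercept + slope * height_cm

print(f"slope = {slope:.3f} kg per cm, intercept = {intercept:.1f} kg")
print(f"predicted weight at 175 cm: {predict_weight(175):.1f} kg")
```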
45
explanatory vs. response variables
46
http://slideplayer.com/slide/6010052/
frequency & probability distributions
• frequency - number of times a value is observed in
a sample
– e.g. allele frequency
• frequency distribution - frequency of each
value of the variable
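A frequency distribution can be tabulated by counting each value. The sketch below uses a made-up sample of ten alleles.

```python
from collections import Counter

# Made-up sample of 10 alleles at one locus.
alleles = ["A", "A", "a", "A", "a", "A", "A", "a", "A", "A"]

# Frequency distribution: how many times each value occurs in the sample.
counts = Counter(alleles)

# Relative frequencies: counts divided by the sample size.
relative = {allele: n / len(alleles) for allele, n in counts.items()}

for allele in sorted(counts):
    print(f"{allele}: count = {counts[allele]}, relative = {relative[allele]:.2f}")
```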
47
48
• probability: proportion of times the outcome would
occur in a very long series of repetitions /
observations
49
• what is the probability of finding a bird with
beak <8 mm? or with <6 mm?
• we can answer by:
1- using the empirical (sample) frequency distribution
2- if we can assume beak size is normally
distributed, using the normal probability
distribution
– for other types of variables, like income, you can
use other theoretical distributions
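Both approaches can be sketched side by side. The beak sizes are made up for illustration, and fitting a normal distribution with the sample mean and standard deviation is an assumption that would need checking on real data.

```python
from statistics import NormalDist, mean, stdev

# Made-up beak sizes in mm (illustration only).
beaks_mm = [7.2, 8.1, 9.4, 10.3, 8.8, 9.9, 7.9, 10.8, 9.1, 8.5,
            11.2, 9.6, 8.3, 10.1, 9.0, 7.5, 9.8, 10.5, 8.9, 9.3]

def empirical_prob_below(values, threshold):
    """Approach 1: share of sampled values below the threshold."""
    return sum(v < threshold for v in values) / len(values)

# Approach 2: normal distribution with the sample mean and standard deviation.
fitted = NormalDist(mean(beaks_mm), stdev(beaks_mm))

p_emp_8 = empirical_prob_below(beaks_mm, 8.0)
p_norm_8 = fitted.cdf(8.0)
p_emp_6 = empirical_prob_below(beaks_mm, 6.0)
p_norm_6 = fitted.cdf(6.0)

print(f"P(beak < 8 mm): empirical = {p_emp_8:.3f}, normal = {p_norm_8:.3f}")
print(f"P(beak < 6 mm): empirical = {p_emp_6:.3f}, normal = {p_norm_6:.3f}")
```

No bird in this sample has a beak below 6 mm, so the empirical answer is 0, while the normal model still assigns a small positive probability to values outside the observed range.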
50
51
probability distribution
• a mathematical function describing the probabilities of
observing certain values of a variable, in the
population
– covers all possible values (sample space)
• theoretical probability distributions - can
approximate empirical frequency distributions
– e.g. normal distribution
52
probability distribution
• integral: probability of observing values in that
range
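To illustrate the integral, the sketch below approximates the area under a standard normal density between -1 and 1 with a midpoint rule and compares it with the difference of the cumulative distribution function. The choice of distribution and range is arbitrary, and `integrate_pdf` is a hypothetical helper.

```python
from statistics import NormalDist

dist = NormalDist(mu=0.0, sigma=1.0)  # standard normal, chosen for illustration

def integrate_pdf(distribution, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of the pdf from a to b."""
    width = (b - a) / steps
    total = sum(distribution.pdf(a + (i + 0.5) * width) for i in range(steps))
    return total * width

area = integrate_pdf(dist, -1.0, 1.0)
exact = dist.cdf(1.0) - dist.cdf(-1.0)

print(f"numerical integral of pdf over [-1, 1]: {area:.4f}")
print(f"cdf(1) - cdf(-1):                       {exact:.4f}")
```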
https://www.slideserve.com/brendan-austin/continuous-probability-distributions
53
observational vs. experimental studies
• observational study - groups are created in
nature
– smokers vs non-smokers
– rats vs mice
– can identify correlations
• experimental study (e.g. clinical trial) - you create
the groups
– assign subjects to treatments (drug vs placebo)
– (close to) full control over the result
54
observational vs. experimental studies
• observational → only identifies associations,
cannot identify causality
• experimental → can determine causality
• experimental studies in humans: drug trials,
psychological experiments
• with other species: model organisms, or
common garden experiments
55
“pioneers of modern statistics”
• many pioneers of modern statistics were motivated by the study of
evolution & heredity
• Galton, Pearson, Fisher
• created a vast number of the methods we use today + genetic
theory (esp. Fisher)
• all were eugenics supporters
– “humans are physically / mentally sick because of their genes →
remove those genes”
• Galton & Fisher believed poor people were poor because of
their genetics
• Read Interleaf I for more details
56
exercise
• please go through all questions at the end of
chapter 1.
• You can ask your questions at office hours.
57
