chapter 1
statistics and samples
1
outline
• data and statistics
• populations and samples
• parameter vs. sample estimate
• hypothesis testing
• good sampling
• sampling units and independence
• missing data and bias
• bias and accuracy
• sampling error and precision
• causes of sampling bias
• missing data
• random sampling strategies
• types of variables
• frequency & probability distributions
• observational vs experimental studies
2
biological data
• collect and analyze data, in order to:
• describe living systems, their evolution and
interactions,
• test hypotheses,
• develop models,
• make predictions
4
statistics
• collection,
• analysis,
• interpretation,
• effective communication & presentation of
data
• taking uncertainty into account
• statistics + computer science → data science
5
statistics
• Origin of Species vs. a 2016 genetics paper
6
statistics
• biology today has become a highly
quantitative science, similar to physics
7
statistics
statistics can be used to:
• describe data
• make estimations (guess unknown quantities)
• test hypotheses
• make inferences or predictions
8
statistics
• estimations about populations we cannot
observe, made from samples,
– e.g. number of derived SARS-CoV-2 variants in
Ankara
• inferences about past events
– e.g. when did SARS-CoV-2 start spreading?
• predictions about future events
– e.g. how likely are new mutations next season?
9
sampling
10
https://www.sigmamagic.com/blogs/online-sample-size-calculators/
estimations about populations
from samples
11
parameter
• a quantity describing a population
– a proportion
– an average (e.g. mean)
– a measure of variation (e.g. standard deviation)
– a measure of a relationship (e.g. correlation
coefficient)
• parameters = unknown population
characteristics
12
parameter
• a quantity describing a population
– e.g. proportion of females among METU students
– e.g. average number of lice on METU dogs
– e.g. variance in height of Ankarans
13
parameter vs. sample estimate
• parameter = the unknown truth
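A minimal sketch of the parameter vs. estimate distinction, assuming a simulated population in which (unlike in real studies) the parameter can be computed directly. The population of heights, its mean of 170 cm, and the sample size of 50 are hypothetical values chosen for illustration.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: heights (cm) of 100,000 individuals.
population = [rng.gauss(170, 10) for _ in range(100_000)]

# Parameter: computed from the whole population (normally unobservable).
parameter_mean = statistics.fmean(population)

# Estimate: the same quantity computed from a random sample of 50.
sample = rng.sample(population, 50)
estimate_mean = statistics.fmean(sample)

print(f"parameter (population mean): {parameter_mean:.2f}")
print(f"estimate (sample mean):      {estimate_mean:.2f}")
```

The estimate is close to the parameter but not equal to it; the gap is due to chance in which individuals were sampled.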
14
hypothesis testing
• in biology, we usually do not prove statements
– as opposed to mathematics
• we develop scientific models that explain
phenomena, using:
– theoretical work (synthesizing models based on
earlier established facts and evidence)
– hypothesis testing (comparing data with models)
which approaches should be used is an ongoing
discussion in philosophy of science
15
hypothesis testing
• the result of a hypothesis test depends on:
– the test method
– the data (sample size, variance, etc)
– how well the data fit each hypothesis
16
classical hypothesis testing
• null hypothesis – usually the simpler one
– e.g. “no effect”, “no difference”, etc.
– e.g. “smoking does not affect lifespan”
• alternative hypothesis – may contain an
additional effect, or the effect that is assumed
– e.g. “smoking affects lifespan” or “smoking
shortens lifespan”
17
classical hypothesis testing
• if data are incompatible with null hypothesis
(rejection) → support for the alternative
hypothesis
• if data are compatible with null hypothesis (no
rejection) → no support for the alternative
hypothesis
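The logic above can be sketched with an exact two-sided binomial test. This is only one of many possible tests, chosen here for illustration. The counts (14 "successes" out of 18), the null proportion of 0.5, and the 0.05 rejection threshold are assumptions, not values from the slides.

```python
from math import comb

def binomial_two_sided_p(k, n, p0=0.5):
    """Total null probability of all outcomes no more likely than k."""
    probs = [comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(n + 1)]
    observed = probs[k]
    # Small tolerance so ties are not lost to floating-point rounding.
    return sum(p for p in probs if p <= observed * (1 + 1e-9))

# Hypothetical data: 14 of 18 individuals show the trait.
# Null hypothesis: the true proportion is 0.5 ("no difference").
p_value = binomial_two_sided_p(k=14, n=18)

alpha = 0.05  # assumed threshold; a convention, not a law
reject_null = p_value < alpha

print(f"p-value = {p_value:.4f}, reject null: {reject_null}")
```

A small p-value means the data are hard to reconcile with the null hypothesis; a large one means the data give no support for the alternative, not that the null is proven.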
18
sampling populations
• good samples are vital for interpretation and
reproducibility
• what makes a sample a good sample?
19
cat fall example
20
how to collect good samples?
• good sample = units are independent (e.g. no
relatives)
• good sample = random sample, without bias
– deviates from population only by chance
• good sample = large sample size
– contains large amount of information
22
sampling units and independence
• “individual”, “unit”, “replicate”, “subject”
23
sampling units and independence
• if a sample contains non-independent units (= if
units are related), these units should be treated
specially
• solution 1: use the average per related group
• solution 2: use special methods, where you
input info about related units (e.g. regression
methods)
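Solution 1 can be sketched as follows, assuming relatedness is recorded as a group label. The family IDs and heights are hypothetical.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical measurements: (family_id, height_cm).
# Individuals with the same family_id are related, so not independent.
measurements = [
    ("fam1", 172.0), ("fam1", 175.0), ("fam1", 171.0),
    ("fam2", 160.0),
    ("fam3", 181.0), ("fam3", 179.0),
]

# Group the related observations together.
by_family = defaultdict(list)
for family_id, height in measurements:
    by_family[family_id].append(height)

# One averaged value per family: these are the independent units.
family_means = {fam: fmean(vals) for fam, vals in by_family.items()}

print(family_means)
print("effective sample size:", len(family_means))
```

Six measurements collapse to three independent units, so the sample size used in later analysis is 3, not 6.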
24
sampling units and independence
• a unit may also contain multiple observations
from the same subject – e.g. repeated
measures
• biological replicates (from different individuals)
include both technical & biological variation
• technical replicates (from the same individual)
allow measuring technical variation, and are
treated specially in the analysis
25
bias and accuracy
• missing observations:
• sampling might be non-random, and cause bias
27
sampling error and precision
• sampling error – chance difference between a sample
estimate and the population parameter; seen as
differences among estimates from repeated samples
• precision – similarity of sample estimates to
each other
– sampling error = imprecision
28
accuracy vs. precision
29
sampling error and precision
• larger samples
– less sampling error = higher precision
– important: larger samples cannot avoid bias
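A small simulation sketch of both points, assuming a hypothetical population of heights and a hypothetical biased sampling scheme that can only reach individuals taller than 165 cm. The spread of repeated estimates stands in for sampling error.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population of heights (cm).
population = [rng.gauss(170, 10) for _ in range(20_000)]
true_mean = statistics.fmean(population)

# Hypothetical biased scheme: only individuals above 165 cm are reachable.
tall_only = [h for h in population if h > 165]

def estimates(source, n, repeats=200):
    """Sample means from `repeats` independent samples of size n."""
    return [statistics.fmean(rng.sample(source, n)) for _ in range(repeats)]

results = {}
for n in (10, 100, 1000):
    random_est = estimates(population, n)
    biased_est = estimates(tall_only, n)
    results[n] = {
        # Spread of estimates around each other: sampling error.
        "spread": statistics.stdev(random_est),
        # Systematic offset of the biased scheme from the truth: bias.
        "bias": statistics.fmean(biased_est) - true_mean,
    }
    print(n, results[n])
```

The spread shrinks as n grows, while the offset of the biased scheme stays roughly the same at every sample size.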
31
accuracy vs. precision
• two types of error: bias & sampling error
– which one is generally considered worse?
32
how to collect good samples?
1. ensure independence of observations
– non-independence → most methods do not
work
2. random sampling = avoid bias
– every unit has an equal chance of being sampled =
population is represented
– avoid missing data effects
3. large sample size → reduced sampling error
33
causes of sampling bias
• sample of convenience (simple sampling, w/o
taking into account structure) can cause
biased samples
• volunteer bias
• missing data
34
random sampling strategies
• how to take a random sample?
– how to randomly choose 100 individuals from METU?
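One possible answer, sketched under the assumption that a complete list of individuals (a sampling frame) is available. The ID format and the population size of 25,000 are hypothetical.

```python
import random

rng = random.Random(2024)  # fixed seed so the draw can be reproduced

# Hypothetical sampling frame: one ID per individual in the population.
student_ids = [f"S{i:05d}" for i in range(1, 25_001)]

# Simple random sample without replacement:
# every individual has the same chance of being chosen.
chosen = rng.sample(student_ids, k=100)

print(chosen[:5])
```

Recording the seed and the sampling frame alongside the data makes the sampling step reproducible.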
35
random sampling strategies
• if full population information (of all
individuals) unavailable → randomly choose
units
– e.g. regions, provinces, time periods, quadrats
• collect data within each unit
• have to deal with non-independence
– can treat each unit as an individual
– or use complex statistical models
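A sketch of this strategy with the simpler option (treating each unit as an individual), assuming hypothetical quadrats in which plants from the same quadrat share a common effect and are therefore not independent. All numbers are illustrative.

```python
import random
from statistics import fmean

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Hypothetical field of 40 quadrats, 25 plant heights (cm) in each.
# Plants in one quadrat share a quadrat effect: they are not independent.
quadrats = {}
for q in range(40):
    quadrat_effect = rng.gauss(0, 5)
    quadrats[q] = [30 + quadrat_effect + rng.gauss(0, 2) for _ in range(25)]

# Step 1: randomly choose units (quadrats), not individual plants.
chosen_quadrats = rng.sample(sorted(quadrats), k=8)

# Step 2: collect data within each chosen unit.
# Step 3: treat each unit as one individual by using its mean.
quadrat_means = [fmean(quadrats[q]) for q in chosen_quadrats]

estimate = fmean(quadrat_means)
print(f"estimate from {len(quadrat_means)} quadrat means: {estimate:.2f}")
```

The sample size for later analysis is the number of quadrats (8), not the number of plants measured (200).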
38
random sampling strategies
• if sample turns out biased, what to do?
a) throw away data
b) repeat data collection to compensate for the
bias (e.g. observe other conditions)
c) describe how sampling is done, including the
bias
→ reproducibility is most important
39
types of variables
• variable - an attribute that differs from subject to
subject
• random variable – the value depends on random
events, e.g. your height
41
numerical variables
• numerical variables - quantitative
• discrete (counts, in fixed units): e.g. number of
offspring per mom, 0,1,2,3..
– e.g. 500,1000,1500
• continuous: e.g. height, 1.76584, 1.67450, …
42
types of variables
• other types:
• logical (Boolean): TRUE/FALSE
• missing values: NA
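A short sketch of how these variable types might be stored and how missing values can be handled explicitly. Python has no built-in NA, so `None` is used here as a stand-in; that choice and the data are assumptions for illustration.

```python
from statistics import fmean

# Hypothetical records; None stands in for a missing value (NA).
offspring = [0, 2, 3, None, 1]               # discrete numerical
height_m = [1.76, 1.67, None, 1.82, 1.70]    # continuous numerical
survived = [True, False, True, True, None]   # logical (Boolean)

def mean_skipping_missing(values):
    """Return (mean of observed values, number of missing values)."""
    observed = [v for v in values if v is not None]
    return fmean(observed), len(values) - len(observed)

mean_offspring, n_missing = mean_skipping_missing(offspring)
print(f"mean offspring = {mean_offspring}, missing = {n_missing}")
```

Reporting the number of missing values next to the summary keeps possible missing-data bias visible.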
44
explanatory vs. response variables
• many methods relate one variable to another
• explanatory vs response variables
– also sometimes called, independent vs dependent
• drug/placebo treatment vs survived/died
• country of origin vs income
• height vs weight
– explanatory on x-axis, response on y-axis
– use explanatory to predict the response values
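A minimal sketch of using an explanatory variable to predict a response, assuming a straight-line relationship fitted by least squares. This is one common choice, not the only one. The height and weight values are hypothetical.

```python
from statistics import fmean

# Hypothetical data: height (explanatory, x) and weight (response, y).
height_cm = [150, 160, 170, 180, 190]
weight_kg = [52, 58, 69, 77, 84]

mean_x = fmean(height_cm)
mean_y = fmean(weight_kg)

# Least-squares slope and intercept of the line y = intercept + slope * x.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(height_cm, weight_kg))
sxx = sum((x - mean_x) ** 2 for x in height_cm)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

def predict_weight(height):
    """Predicted response for a given value of the explanatory variable."""
    return intercept + slope * height

print(f"slope = {slope:.2f} kg per cm")
print(f"predicted weight at 175 cm = {predict_weight(175):.2f} kg")
```

The roles are not symmetric: swapping which variable is explanatory and which is the response gives a different fitted line.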
45
http://slideplayer.com/slide/6010052/
frequency & probability distributions
47
probability distribution
• a math function describing the probabilities of
observing certain values of a variable, in the
population
– covers all possible values (sample space)
• theoretical probability distributions - can
approximate empirical frequency distributions
– e.g. normal distribution
52
probability distribution
• integral: probability of observing values in that
range
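For a continuous variable, the probability of a range is the area under the density curve, which equals the difference of the cumulative distribution function at the two endpoints. A sketch assuming a normal distribution of heights with hypothetical mean 170 cm and standard deviation 10 cm:

```python
from statistics import NormalDist

# Hypothetical theoretical distribution for height (cm).
height = NormalDist(mu=170, sigma=10)

# P(160 < height < 180) = integral of the density from 160 to 180
#                       = CDF(180) - CDF(160)
p_range = height.cdf(180) - height.cdf(160)

print(f"P(160 < height < 180) = {p_range:.4f}")
```

The range here spans one standard deviation on each side of the mean, which covers about 68% of the probability under a normal distribution.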
https://www.slideserve.com/brendan-austin/continuous-probability-distributions
53
observational vs. experimental studies
55
“pioneers of modern statistics”
• many pioneers of modern statistics were motivated by studying
evolution & heredity
• Galton, Pearson, Fisher
• created a vast number of the methods we use today + genetic
theory (esp. Fisher)
• all were eugenics supporters
– “humans are physically / mentally sick because of their genes
→ remove those genes”
• Galton & Fisher believed poor people were poor because of
their genetics
• Read Interleaf I for more details
56
exercise
• please go through all questions at the end of
chapter 1.
• you can ask your questions at office hours.
57