
biometry – bio220

chapter 1
statistics and samples

1
outline
• data and statistics
• populations and samples
• parameter vs. sample estimate
• hypothesis testing
• good sampling
• sampling units and independence
• missing data and bias
• bias and accuracy
• sampling error and precision
• causes of sampling bias
• missing data
• random sampling strategies
• types of variables
• frequency & probability distributions
• observational vs experimental studies

2
biological data

3
biological data
• collect and analyze data, in order to:
• describe living systems, their evolution and
interactions,
• test hypotheses,
• develop models,
• make predictions

4
statistics
• collection,
• analysis,
• interpretation,
• effective communication & presentation of
data
• taking uncertainty into account
• statistics + computer science → data science

5
statistics
• Origin of Species vs. a 2016 genetics paper

6
statistics
• biology today has become a highly
quantitative science, similar to physics

7
statistics
statistics can be used to:
• describe data
• make estimations (guess unknown quantities)
• test hypotheses
• make inferences or predictions

8
statistics
• estimations about populations we cannot observe, made from samples
– e.g. number of derived SARS-CoV-2 variants in Ankara
• inferences about past events
– e.g. when did SARS-CoV-2 start spreading?
• predictions about future events
– e.g. how likely are new mutations next season?

9
sampling

10
https://www.sigmamagic.com/blogs/online-sample-size-calculators/
estimations about populations from samples

• population – all individuals of interest, usually theoretical
– in many instances, data cannot be collected from the full population
– e.g. all RNA molecules in your throat, or all SARS-CoV-2 in Ankara

• sample – individuals from which data are collected in a study, empirical
– sample represents the population but with some uncertainty

11
parameter
• a quantity describing a population
– a proportion
– an average (e.g. mean)
– a measure of variation (e.g. standard deviation)
– a measure of a relationship (e.g. correlation
coefficient)
• parameters = unknown population
characteristics

12
parameter
• a quantity describing a population
– e.g. proportion of females among METU students
– e.g. average number of lice on METU dogs
– e.g. variance in height of Ankarans

• parameters estimated based on sample data

13
parameter vs. sample estimate
• parameter → the unknown truth

• sample estimate (sample statistic) → approximation of the truth, subject to error

• one major goal in statistics: determining the error in estimation and keeping it to a minimum
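
The parameter vs. estimate distinction can be sketched with a small simulation. Everything here is hypothetical: the population is simulated (heights with an assumed mean of 1.70 m), so unlike in a real study the parameter can be computed and compared with the estimate.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: heights (m) of 100,000 individuals.
# In a real study we could not measure all of them.
population = [random.gauss(1.70, 0.10) for _ in range(100_000)]

# Parameter: the (normally unknown) population mean.
parameter = statistics.fmean(population)

# Sample: the 50 individuals we actually measure.
sample = random.sample(population, 50)

# Sample estimate (sample statistic): approximates the parameter, with error.
estimate = statistics.fmean(sample)
error = estimate - parameter

print(f"parameter = {parameter:.3f}, estimate = {estimate:.3f}, error = {error:+.3f}")
```

Changing the seed gives a different sample and a different error, which is the uncertainty the slide refers to.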

14
hypothesis testing
• in biology, we usually do not prove statements
– as opposed to mathematics
• we develop scientific models that explain
phenomena, using:
– theoretical work (synthesizing models based on
earlier established facts and evidence)
– hypothesis testing (comparing data with models)
• which approach should be used is an ongoing discussion in philosophy of science
15
hypothesis testing
• the result of a hypothesis test depends on:
– the test method
– the data (sample size, variance, etc.)
– how well the data fit each hypothesis

• inferential (classical): involves comparing null vs. alternative models

16
classical hypothesis testing
• null hypothesis – usually the simpler one
– e.g. “no effect”, “no difference”, etc.
– e.g. “smoking does not affect lifespan”
• alternative hypothesis – may contain an
additional effect, or the effect that is assumed
– e.g. “smoking affects lifespan” or “smoking
shortens lifespan”

17
classical hypothesis testing
• if data are incompatible with null hypothesis (rejection) → support for the alternative hypothesis
• if data are compatible with null hypothesis (no rejection) → no support for the alternative hypothesis
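
One way to check whether data are compatible with a "no difference" null hypothesis is a permutation test. The slides do not name a specific test, so this is a swapped-in illustration, and the lifespan values are made up. The function name `permutation_test` is hypothetical.

```python
import random
import statistics

def permutation_test(group_a, group_b, n_perm=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the fraction of random relabelings whose absolute mean
    difference is at least as large as the observed one (the p-value).
    """
    rng = random.Random(seed)
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # under the null, group labels are exchangeable
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_perm

# Made-up lifespans (years), for illustration only.
smokers = [68, 71, 65, 70, 66, 69, 64, 72]
non_smokers = [78, 74, 80, 76, 79, 75, 81, 77]

p_value = permutation_test(smokers, non_smokers)
print(f"p = {p_value:.4f}")
```

A small p-value means the data are hard to reconcile with the null hypothesis (rejection); a large one means no rejection, which is not proof that the null is true.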

18
sampling populations
• good samples vital for interpretation and
reproducibility
• what makes a sample a good sample?

19
cat fall example

20
cat fall example

conclusion of 1987 paper:
extreme height increases survival chance (perhaps affecting muscle tension)

21
how to collect good samples?
• good sample = units are independent (e.g. no
relatives)
• good sample = random sample, without bias
– deviates from population only by chance
• good sample = large sample size
– contains large amount of information

22
sampling units and independence
• “individual”, “unit”, “replicate”, “subject”

• units should be independent
• because regular statistical methods assume observations are independent

23
sampling units and independence
• if sample contains non-independent units (= if units are related)
• → these units should be treated specially
• solution 1: use the average per related group
• solution 2: use special methods, where you input info about related units (e.g. regression methods)
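
Solution 1 can be sketched as follows: collapse each related group to its mean and analyze the means. The family IDs and body masses are hypothetical.

```python
import statistics
from collections import defaultdict

# Hypothetical measurements: (family_id, body_mass_g). Siblings share a family.
measurements = [
    ("fam1", 21.0), ("fam1", 23.0), ("fam1", 22.0),
    ("fam2", 30.0),
    ("fam3", 25.0), ("fam3", 27.0),
]

by_family = defaultdict(list)
for family, value in measurements:
    by_family[family].append(value)

# One value per family: these means are the independent units for analysis.
family_means = {family: statistics.fmean(values) for family, values in by_family.items()}

print(family_means)
print("n (independent units) =", len(family_means))
```

Note that the effective sample size drops from 6 measurements to 3 independent units.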

24
sampling units and independence
• a unit may also contain multiple observations
from the same subject – e.g. repeated
measures
• biological replicates (from different individuals)
include both technical & biological variation
• technical replicates (from the same individual) allow measuring technical variation → specially treated in analysis

25
bias and accuracy
• missing observations:
• sampling might be non-random, and cause bias

• e.g. you wish to estimate influenza load in METU students, and you sample students from the health center
• e.g. you wish to estimate voting preferences of
Turkish citizens but you only conduct
questionnaires in Izmir
26
bias and accuracy
• bias: how much a sample systematically
deviates from the population

• accuracy: how unbiased sample estimates are relative to the true population parameter
– accurate = unbiased

27
sampling error and precision
• sampling error – chance difference between a sample estimate and the population parameter; seen as differences among sample estimates (from each other)
• precision – similarity of sample estimates to each other
– sampling error = imprecision

28
accuracy vs. precision

• yellow area = population parameter
• black points = sample estimates

29
accuracy vs. precision

30
sampling error and precision
• larger samples →
– less sampling error = higher precision
– important: larger samples cannot avoid bias

• if no bias exists, sample estimates will deviate from the population parameter only due to sampling error
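
A small simulation can illustrate both points: spread among estimates shrinks with sample size, while bias does not. The population and the biased sampling frame (only units above 45 can be sampled) are hypothetical choices made for this sketch.

```python
import random
import statistics

rng = random.Random(42)
# Hypothetical population with true mean near 50.
population = [rng.gauss(50, 10) for _ in range(20_000)]
true_mean = statistics.fmean(population)

# A biased sampling frame: only units above 45 can ever be sampled
# (analogous to sampling only students who visit the health center).
biased_frame = [x for x in population if x > 45]

def repeated_estimates(frame, n, repeats=300):
    """Sample means from `repeats` independent random samples of size n."""
    return [statistics.fmean(rng.sample(frame, n)) for _ in range(repeats)]

results = {}
for label, frame in [("random", population), ("biased", biased_frame)]:
    for n in (10, 1000):
        estimates = repeated_estimates(frame, n)
        bias = statistics.fmean(estimates) - true_mean  # systematic deviation
        spread = statistics.stdev(estimates)            # sampling error
        results[(label, n)] = (bias, spread)
        print(f"{label:6s} n={n:4d}  bias={bias:+.2f}  spread={spread:.2f}")
```

With the biased frame, the larger sample gives estimates that agree closely with each other but still miss the true mean: precise, not accurate.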

31
accuracy vs. precision
• two types of error: bias & sampling error
– which one is generally considered worse?

• goal of sampling - reduce both sampling error and bias
– every unit has equal chance of being sampled = population represented

32
how to collect good samples?
1. ensure independence of observations
– non-independence → most methods do not work
2. random sampling = avoid bias
– every unit has equal chance of being sampled = population represented
– avoid missing data effects
3. large sample size → reduce sampling error

33
causes of sampling bias
• sample of convenience (simple sampling, w/o taking into account structure) → can cause biased samples
• volunteer bias
• missing data

34
random sampling strategies
• how to take a random sample?
– how to randomly choose 100 indv. from METU?

• list full population, decide on sample size, use a random number generator
– random number generator: computer algorithm that produces pseudo-random numbers
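
The three steps above can be sketched with Python's standard `random` module. The ID format and the population size of 25,000 are hypothetical placeholders, not real METU figures.

```python
import random

# Hypothetical sampling frame: one ID per individual in the full population.
population_ids = [f"student_{i:05d}" for i in range(1, 25_001)]
sample_size = 100

rng = random.Random(2024)  # seeding makes the draw reproducible
sample_ids = rng.sample(population_ids, sample_size)  # sampling without replacement

print(sample_ids[:5])
```

Recording the seed along with the sampling frame lets someone else reproduce exactly the same draw.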

35
random sampling strategies

36
random sampling strategies

37
random sampling strategies
• if full population information (of all individuals) unavailable → randomly choose units
– e.g. regions, provinces, time periods, quadrats
• collect data within each unit
• have to deal with non-independence
– can treat each unit as an individual
– or use complex statistical models
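
A minimal sketch of this unit-based strategy, using the simpler option of treating each unit as an individual. The quadrats and counts are simulated, and the numbers (200 quadrats, 5 sub-plots, 12 chosen units) are arbitrary assumptions.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical field site divided into 200 quadrats; each quadrat holds
# plant counts for its 5 sub-plots (values are simulated, not real data).
quadrats = {q: [rng.randint(0, 20) for _ in range(5)] for q in range(200)}

# Step 1: randomly choose units (quadrats), not individuals.
chosen = rng.sample(sorted(quadrats), 12)

# Step 2: collect data within each chosen unit.
# Step 3: handle non-independence by reducing each unit to one value.
unit_means = [statistics.fmean(quadrats[q]) for q in chosen]

estimate = statistics.fmean(unit_means)
print(f"{len(unit_means)} units, estimated mean count = {estimate:.2f}")
```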

38
random sampling strategies
• if sample turns out biased, what to do?
a) throw away data
b) repeat data collection to compensate for the
bias (e.g. observe other conditions)
c) describe how sampling is done, including the bias
→ reproducibility is most important

39
types of variables
• variable - an attribute that differs from subject to subject
• random variable – the value depends on random events, e.g. your height

• assigned variables: e.g. treated vs. control = not random
• sample estimates (e.g. sample mean) – also random → also variables
40
categorical variables
• we apply different methods to different types
of variables
• categorical variables - qualitative
• nominal (without order): e.g. species
– fruit fly, nematode, human
• ordinal (with order): e.g. developmental stage
– egg, larva, pupa, adult

41
numerical variables
• numerical variables - quantitative
• discrete (counts, in fixed units): e.g. number of
offspring per mom, 0,1,2,3..
– e.g. 500,1000,1500
• continuous: e.g. height, 1.76584, 1.67450, …

42
types of variables

43
types of variables
• other types:
• logical (Boolean): TRUE/FALSE
• missing values: NA

44
explanatory vs. response variables
• many methods relate one variable to another
• explanatory vs response variables
– also sometimes called, independent vs dependent
• drug/placebo treatment vs survived/died
• country of origin vs income
• height vs weight
– explanatory on x-axis, response on y-axis
– use explanatory to predict the response values
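
Predicting a response from an explanatory variable can be sketched with a least-squares line. The slide does not specify a method, so ordinary least squares is an assumption here, and the height/weight values are made up.

```python
import statistics

# Hypothetical data: height (cm) is explanatory, weight (kg) is the response.
heights = [150.0, 160.0, 170.0, 180.0, 190.0]
weights = [52.0, 58.0, 66.0, 75.0, 84.0]

mean_x = statistics.fmean(heights)
mean_y = statistics.fmean(weights)

# Ordinary least-squares slope and intercept, computed from their definitions.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(heights, weights))
sxx = sum((x - mean_x) ** 2 for x in heights)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

def predict_weight(height_cm):
    """Predicted response for a given value of the explanatory variable."""
    return intercept + slope * height_cm

print(f"weight = {intercept:.1f} + {slope:.2f} * height")
print(f"predicted weight at 175 cm: {predict_weight(175.0):.1f} kg")
```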

45
explanatory vs. response variables

46
http://slideplayer.com/slide/6010052/
frequency & probability distributions

• frequency - number of times a value is observed in a sample
– e.g. allele frequency
• frequency distribution - frequency of each
value of the variable
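
A frequency distribution can be tabulated by counting each observed value. The allele observations below are hypothetical.

```python
from collections import Counter

# Hypothetical alleles observed in a sample of 10 gene copies.
alleles = ["A", "a", "A", "A", "a", "A", "a", "A", "A", "a"]

frequency = Counter(alleles)  # absolute frequency of each value
n = len(alleles)
relative = {allele: count / n for allele, count in frequency.items()}  # proportions

print(frequency)
print(relative)
```

The relative frequencies sum to 1; for alleles, these proportions are what is usually reported as the allele frequency.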

47
frequency & probability distributions

48
frequency & probability distributions

• probability: the proportion of times the outcome would occur in a very long series of repetitions / observations

49
frequency & probability distributions

• what is the probability of finding a bird with beak <8 mm? or with <6 mm?
• we can answer by:
1- using the empirical (sample) freq dist
2- if we can assume beak size is normally distributed, using the normal probability distribution
– for other types of variables, like income, you can use other theoretical distributions
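
Both approaches can be compared on the same data. The beak measurements below are hypothetical, and fitting a normal distribution is the assumption named in approach 2.

```python
import statistics

# Hypothetical beak depths (mm) from a sample of 20 birds.
beaks = [7.2, 8.9, 9.4, 10.1, 8.3, 9.8, 7.9, 10.6, 9.1, 8.7,
         11.2, 9.5, 8.1, 10.3, 9.9, 7.6, 9.0, 10.8, 8.5, 9.7]

def empirical_prob_below(values, threshold):
    """Approach 1: proportion of sampled values below the threshold."""
    return sum(v < threshold for v in values) / len(values)

# Approach 2: fit a normal distribution (an assumption!) and use its CDF.
fitted = statistics.NormalDist.from_samples(beaks)

for threshold in (8.0, 6.0):
    p_emp = empirical_prob_below(beaks, threshold)
    p_norm = fitted.cdf(threshold)
    print(f"P(beak < {threshold} mm): empirical = {p_emp:.3f}, normal = {p_norm:.4f}")
```

No bird in this sample has a beak under 6 mm, so the empirical answer is exactly 0, while the fitted normal distribution still assigns that range a small positive probability.
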
50
frequency & probability distributions

51
probability distribution
• a math function describing the probabilities of
observing certain values of a variable, in the
population
– covers all possible values (sample space)
• theoretical probability distributions - can
approximate empirical frequency distributions
– e.g. normal distribution

52
probability distribution
• integral: probability of observing values in that
range
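
Written out, this is the standard relation between a probability density and a probability. The symbols $f$ (density), $X$ (the variable), and $a$, $b$ (range limits) are notation introduced here, not taken from the slide.

```latex
P(a \le X \le b) = \int_a^b f(x)\,dx,
\qquad
\int_{-\infty}^{\infty} f(x)\,dx = 1
```

The second equation says the density covers all possible values, so the total probability is 1.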

https://www.slideserve.com/brendan-austin/continuous-probability-distributions
53
observational vs. experimental studies

• observational study - groups are created in nature
– smokers vs non-smokers
– rats vs mice
– can identify correlations
• experimental study - clinical trial - you create the groups
– assign subjects to treatments (drug vs placebo)
– (close to) full control over the result
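
Creating the groups in an experiment is commonly done by random assignment. The slide does not state how subjects are assigned, so randomization is an assumption in this sketch, and the subject IDs are hypothetical.

```python
import random

rng = random.Random(11)

# Hypothetical subject IDs.
subjects = [f"subject_{i:02d}" for i in range(1, 21)]

shuffled = subjects[:]  # copy so the original order is kept
rng.shuffle(shuffled)

# First half of the shuffled list gets the drug, second half the placebo.
half = len(shuffled) // 2
assignment = {s: "drug" for s in shuffled[:half]}
assignment.update({s: "placebo" for s in shuffled[half:]})

for s in subjects[:5]:
    print(s, "->", assignment[s])
```
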
54
observational vs. experimental studies

• observational → only identifies associations, cannot identify causality
• experimental → can determine causality
• experimental studies in humans: drug trials,
psychological experiments
• with other species: model organisms, or
common garden experiments

55
“pioneers of modern statistics”
• many pioneers of modern statistics were motivated by the study of evolution & heredity
• Galton, Pearson, Fisher
• created a vast number of methods we use today + genetic theory (esp. Fisher)
• all were eugenics supporters
– “humans are physically / mentally sick because of their genes → remove those genes”
• Galton & Fisher believed poor people were poor because of
their genetics
• Read Interleaf I for more details

56
exercise
• please go through all questions at the end of chapter 1.
• you can ask your questions at office hours.

57
