
Week 1: Introduction to statistics

MA5820
Statistical methods for data science

Dr Olivia Rowley

You’re not normal!

jcu.edu.au
Housekeeping

Assessments

Session Overview
1. Introduction: What is statistics?
2. The statistical process
3. Types of statistical experiment
I. Experimental vs observational studies
II. Pros and cons
4. Error reduction
5. Data description
i. Measures of central tendency
ii. Measures of data spread
6. Review

What is statistics?
“Statistics is the practice of collecting and analysing numerical data”

In reality statistics is:


- Question formation
- Experimental design
- Data collection
- Data collation
- Data visualisation
- Data analysis
- Result communication

A statistical example: Lizard Stats
Aim: Investigate lizards on island X

1) Form our statistical question


On Island X, does lizard colour and/or leg length change
in response to terrain type?

2) Define our population
All lizards on Island X
Populations vary based on the question you are asking

What would the population be if we were asking:

i. On the Great Barrier Reef (GBR), does lizard colour and/or leg length change
in response to terrain type?
ii. Does lizard colour and/or leg length change in response
to terrain type?

3) Define our sample
150 Lizards
(50 – rock, 50 – beach & 50 treetops)

What do you think?
i. How many terrain types do you see?
ii. Should we sample them all?
iii. What do you think our overall sample size should be and
why? How should we split this between our terrain
types?

4) Define our variable
Lizard colour & Lizard leg length

What are the variables for our lizard question and what type
of data are they?

Question:
A vet wants to test the effectiveness of a new flea treatment drug
on house cats. They give 14 cats two treatments of the drug 3
days apart and measure how many times they scratch in a set 1
hour period. Another 14 cats were given a placebo treatment.

What is the population of this investigation?


What is the sample of this investigation?
Can you tell me what the variable is?

Do you think this is a representative sample?


A statistical example: Lizard Stats
Study types:
5) Define our study type
Observational:
No experimental units (no treatment imposed)
The researchers go out and measure the leg length of
the necessary sample of lizards in each terrain type and
record lizard colour
Experimental:
Implements experimental units (treatment imposed)
Take 400 ‘new’ lizards and confine 50 to each terrain
type. Then, after ‘a period’ of time, take the necessary
measurements.

Some questions lend themselves to a particular study type!

Study types: Other examples
Analysing the success of people who quit smoking:
- Observational: find X number of people who are trying to quit smoking and record their
methods and rates of success.
- Experimental: find X number of people who are trying to quit smoking; give n of them
patches, n of them gum and n of them no intervention. Record
their methods and rates of success.

Question:
A researcher wants to investigate the drivers of jellyfish
occurrence in Cairns, Australia. Each day she goes out and samples
jellyfish for one hour. She also records wind direction, rainfall and
salinity with the aim to mathematically model the association
between jellyfish occurrence and these environmental variables.

What type of study is this?


What are her variables?
Could this study be experimental/observational?

Question:
The researcher decided catching jellyfish is too hard. Instead, she
decides to measure the growth rates of some she has in captivity.
She does this by feeding each jellyfish either a high, medium or
low density of food. She measures the bell width of each jellyfish
at the start and end of a 14-day period to track growth.

What type of study is this?


What are her variable/s?
Could this study be experimental/observational?

A statistical example: Lizard Stats
Study types pros and cons:
Observational:
No experimental units (no treatment imposed)
- Less intrusive and can produce more ‘natural’
results
- Ethics ☺
Experimental:
Implements experimental units (treatment imposed)
- Often considered the ‘gold standard’ for producing
reliable results
- Simple experiments with tight controls can produce
pretty conclusive results.

A statistical example: Lizard Stats
Study types pros and cons:
Observational:
No experimental units (no treatment imposed)
- Results are often open for dispute as cause and
effect is not always clear
- But sometimes the only option
Experimental:
Implements experimental units (treatment imposed)
- Can be expensive and time-consuming
- Ethics ☹

Error reduction
All statistical investigations are susceptible to error. The key is being aware of the
sources of error and either acknowledging or, better still, compensating for them.

1) Replication
Having multiple data points.
Terrain type   Leg length   Colour
Tree           28           Green
Tree           25           Green
Tree           10           Green
Tree           29           Green
Beach          10           Yellow
Beach          15           Yellow
Beach          11           Yellow
Beach          8            Yellow

Beware of pseudoreplication!!!
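As a rough illustration of why replication matters, the sketch below groups the leg lengths from the replication table by terrain type and averages them. The variable names are illustrative only and the units are not stated on the slide.

```python
from statistics import mean

# Replicate measurements copied from the replication table:
# (terrain type, leg length, colour)
records = [
    ("Tree", 28, "Green"), ("Tree", 25, "Green"),
    ("Tree", 10, "Green"), ("Tree", 29, "Green"),
    ("Beach", 10, "Yellow"), ("Beach", 15, "Yellow"),
    ("Beach", 11, "Yellow"), ("Beach", 8, "Yellow"),
]

# Group the leg lengths by terrain type.
by_terrain = {}
for terrain, leg_length, _colour in records:
    by_terrain.setdefault(terrain, []).append(leg_length)

# With several replicates per terrain, one unusual lizard (the 10 in "Tree")
# shifts the group mean but does not define it.
group_means = {terrain: mean(values) for terrain, values in by_terrain.items()}
print(group_means)  # Tree: 23, Beach: 11
```

Had only the third tree lizard been measured, the tree and beach groups would have looked identical.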

2) Balance
Having an equal number of samples in each group

Terrain type   n    Mean leg length (cm)
Tree           50   34
Beach          50   8
Rock           1    9
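A minimal sketch of a balance check, using the group sizes from the table above. The helper name `is_balanced` is hypothetical, not part of any library.

```python
# Group sizes copied from the balance table.
group_sizes = {"Tree": 50, "Beach": 50, "Rock": 1}

def is_balanced(sizes):
    """Return True when every group has the same number of samples."""
    return len(set(sizes.values())) == 1

print(is_balanced(group_sizes))  # False: Rock has 1 sample, the others have 50

# Identify the group with the fewest samples.
smallest = min(group_sizes, key=group_sizes.get)
print(smallest, group_sizes[smallest])  # Rock 1
```

The Rock mean of 9 cm rests on a single lizard, so it is far less reliable than the Tree and Beach means.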

3) Blocks
Grouping samples or subjects in response to
confounding variables.
How do lizard treatments affect insect abundance on the beach?
- High, medium, and low lizard abundance
- Naturally, insects are more abundant next to the forest edge
- But we want to understand the effect of lizards without spatial
variation in insects affecting the results.

Treatment   Per block   Total
High        n = 10      n = 50
Medium      n = 10      n = 50
Low         n = 10      n = 50
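One way to picture the blocked design is the sketch below. It assumes 5 blocks (inferred from 50 total / 10 per block) and that each block holds 30 hypothetical "plots" randomly assigned to treatments; neither detail is stated on the slide.

```python
import random
from collections import Counter

TREATMENTS = ["High", "Medium", "Low"]  # lizard abundance levels
PER_BLOCK = 10                          # replicates per treatment per block
N_BLOCKS = 5                            # assumption: 50 total / 10 per block

def assign_block(rng):
    """Randomly order the treatments across the 30 plots of one block."""
    plots = TREATMENTS * PER_BLOCK
    rng.shuffle(plots)
    return plots

rng = random.Random(1)  # fixed seed so the sketch is repeatable
design = {block: assign_block(rng) for block in range(1, N_BLOCKS + 1)}

# Every block holds each treatment equally often, so a block-level
# difference (e.g. distance from the forest edge) affects all treatments alike.
totals = Counter(t for plots in design.values() for t in plots)
print(totals)
```

Because each treatment appears 10 times in every block, differences between blocks cannot be mistaken for a treatment effect.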
4) Covariates
Capturing baseline characteristics about experimental
subjects.
For lizards, what factors other than terrain type may affect leg length and colour?
Question: Replication or pseudoreplication
1) A researcher wants to test how fertiliser affects tree height.
They measure 50 trees in a given area and analyse the results.
2) A PE teacher wants to measure how well their students are
responding to fitness training and so takes a sample of 19
students measuring how fast they can run 100m before and
after a six-week program.
3) A doctor wants to test the effect of iron tablets on patients’
blood iron levels. He takes 100 ml of blood from 5 patients,
splitting each patient's blood into two 50 ml samples. He then
measures the iron levels in all 10 samples.
Question: Balanced or not balanced?
1) A car manufacturer testing the longevity of tyre types tested
50 cars with type A tyres, 50 with type B tyres and 30 with type
C tyres.
2) A consumer wanting the most marshmallows in their cereal
opens 5 boxes of brand A, 5 boxes of brand B and 5 boxes of
brand C and averages the marshmallow coverage in each brand.
3) Doctor testing a drug – 100 patients get the drug and 100
patients get a placebo. Within the drug group, 25 get a very low
dose, 50 get a medium dose and 25 get a high dose.

Question: Covariates
A doctor wants to measure the effectiveness of a weight drug on
their patients. To represent the population, the doctor decides
to sample 100 patients from a viable pool.

What covariates should the doctor measure to enable the
best result for this statistical investigation?

Data description – Central tendency
1) Mode
Value or category that occurs the most often

Table 1. Lizard colour vs frequency
Colour   Frequency
Green    80
Brown    280
Beige    30
Black    10
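A short sketch of finding the mode from the Table 1 frequencies. The dictionary name is illustrative.

```python
# Frequencies copied from Table 1.
colour_frequency = {"Green": 80, "Brown": 280, "Beige": 30, "Black": 10}

# The mode is the category with the highest frequency.
mode_colour = max(colour_frequency, key=colour_frequency.get)
print(mode_colour)  # Brown
```

The mode is the only measure of central tendency here that works for categorical data such as colour.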

2) Median
Midpoint of a sample

[Figure: frequency of leg length for tree lizards (cm)]

3) Mean
Equal to the sum of the numbers contained in the
sample, divided by the number of observations in the
sample.

Table 2. Lizard location vs leg length
Location   Leg length (cm)
Tree       44
Tree       32
Tree       40
Tree       38
Mean = 38.5 cm
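As a quick check of the mean, and an example of the median for the same values, the sketch below uses the four Table 2 leg lengths (44, 32, 40, 38) with Python's standard `statistics` module.

```python
from statistics import mean, median

# Tree lizard leg lengths (cm) from Table 2.
leg_lengths_cm = [44, 32, 40, 38]

# Mean: sum of the values divided by the number of values.
print(sum(leg_lengths_cm) / len(leg_lengths_cm))  # 38.5
print(mean(leg_lengths_cm))                       # 38.5

# Median: with an even number of values, the average of the two middle ones.
print(median(leg_lengths_cm))                     # 39.0
```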

Data description – Spread
Measures of spread describe how similar or varied the set of observed values
are for a particular variable

Range: difference between the lowest and highest measure

Table 2. Lizard location vs leg length
Location   Leg length (cm)
Tree       44
Tree       32
Tree       41
Tree       38
Range = 12 cm
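The range can be reproduced in one line, using the four leg lengths listed on this slide:

```python
# Tree lizard leg lengths (cm) as listed on this slide.
leg_lengths_cm = [44, 32, 41, 38]

# Range: highest value minus lowest value.
value_range = max(leg_lengths_cm) - min(leg_lengths_cm)
print(value_range)  # 12
```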

Quartiles divide a sample of data into four groups of roughly the
same size.

[Figure: tree lizard leg length (cm), with Q1 = 25.75, Q2 = 28.35 and Q3 = 29.7]

Five-number summary: consists of the smallest observation, the first quartile, the median, the
third quartile, and the largest observation
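A minimal sketch of computing quartiles and these five values follows. The data are hypothetical (the slide's raw measurements are not given), and the "inclusive" quartile method is one of several conventions, so other software may report slightly different quartiles.

```python
from statistics import median, quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))

# Hypothetical leg lengths (cm), for illustration only.
leg_lengths_cm = [24.1, 25.5, 26.0, 27.8, 28.4, 28.9, 29.5, 30.2, 31.0]

summary = five_number_summary(leg_lengths_cm)
print(summary)
```

These five values are exactly what a box plot draws.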

[Figure: leg length (cm) for beach and tree lizards]

Variance: a statistical measurement of the spread between numbers in a data
set. More specifically, variance measures how far, on average, each number in the
set is from the mean (average), using squared distances
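The calculation can be sketched step by step with the Table 2 leg lengths (44, 32, 40, 38). This assumes the sample variance (dividing by n - 1), which the slide does not specify.

```python
from statistics import mean, variance

# Tree lizard leg lengths (cm) from Table 2.
leg_lengths_cm = [44, 32, 40, 38]
m = mean(leg_lengths_cm)  # 38.5

# Square each value's distance from the mean, then divide the total
# by n - 1 (assumption: sample variance rather than population variance).
squared_deviations = [(x - m) ** 2 for x in leg_lengths_cm]
sample_variance = sum(squared_deviations) / (len(leg_lengths_cm) - 1)
print(sample_variance)  # 25.0

# The standard library gives the same result.
print(variance(leg_lengths_cm))
```

Note that the result is in squared units (cm²), not cm.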

Standard deviation: a statistic that measures the dispersion of a dataset relative to its
mean, calculated as the square root of the variance (same unit
as the dataset)
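For the Table 2 leg lengths (44, 32, 40, 38), a minimal sketch of the standard deviation follows, again assuming the sample (n - 1) form:

```python
from math import sqrt
from statistics import stdev, variance

# Tree lizard leg lengths (cm) from Table 2.
leg_lengths_cm = [44, 32, 40, 38]

# Standard deviation is the square root of the (sample) variance.
sd = sqrt(variance(leg_lengths_cm))
print(sd)                     # 5.0
print(stdev(leg_lengths_cm))  # 5.0
```

Unlike the variance, the result is back in the original unit (cm).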

Coefficient of variation: a 'percentage spread' (the standard deviation as a percentage of the mean), which allows us to compare the spread of two
different distributions, even if they don’t share a common unit
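The unit-free property can be sketched by expressing the same Table 2 leg lengths in cm and in mm. The helper name `coefficient_of_variation` is illustrative, and the sample standard deviation is assumed.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return stdev(values) / mean(values) * 100

# Tree lizard leg lengths from Table 2, in two different units.
leg_lengths_cm = [44, 32, 40, 38]
leg_lengths_mm = [x * 10 for x in leg_lengths_cm]

print(round(coefficient_of_variation(leg_lengths_cm), 1))  # 13.0
print(round(coefficient_of_variation(leg_lengths_mm), 1))  # 13.0
```

The standard deviation changes from 5 cm to 50 mm, but the percentage spread stays the same.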

z-score: tells us how many standard deviations an observation is away
from the mean
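A minimal sketch using the Table 2 leg lengths (44, 32, 40, 38), assuming the sample standard deviation is used as the yardstick:

```python
from statistics import mean, stdev

# Tree lizard leg lengths (cm) from Table 2.
leg_lengths_cm = [44, 32, 40, 38]
m = mean(leg_lengths_cm)    # 38.5
sd = stdev(leg_lengths_cm)  # 5.0

def z_score(x):
    """Number of standard deviations that x lies from the mean."""
    return (x - m) / sd

print(round(z_score(44), 2))  # 1.1  (above the mean)
print(round(z_score(32), 2))  # -1.3 (below the mean)
```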

Standard error: measures how far the sample mean of the data is likely to be from the
true population mean
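Assuming the usual standard error of the mean (sample standard deviation divided by the square root of the sample size), the Table 2 leg lengths give:

```python
from math import sqrt
from statistics import stdev

# Tree lizard leg lengths (cm) from Table 2.
leg_lengths_cm = [44, 32, 40, 38]

# Standard error of the mean: sd / sqrt(n).
standard_error = stdev(leg_lengths_cm) / sqrt(len(leg_lengths_cm))
print(standard_error)  # 2.5
```

Because n sits under a square root, quadrupling the sample size only halves the standard error.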

Session Review
1. Introduction: What is statistics?
2. The statistical process
3. Types of statistical experiment
I. Experimental vs observational studies
II. Pros and cons
4. Error reduction
5. Data description
i. Measures of central tendency
ii. Measures of data spread
6. Review
