Colm O’Dushlaine
Overview
Descriptive Statistics & Graphical Presentation of
Data
Statistical Inference
Hypothesis Tests & Confidence Intervals
T-tests (Paired/Two-sample)
Regression (SLR & Multiple Regression)
ANOVA/ANCOVA
Intended as an overview. Slides will be provided after the
lectures
What’s in the lectures?...
Lecture 1
Descriptive Statistics and Graphical
Presentation of Data
1. Terminology
2. Frequency Distributions/Histograms
3. Measures of data location
4. Measures of data spread
5. Box-plots
6. Scatter-plots
7. Clustering (Multivariate Data)
Lecture 2
Statistical Inference
Lecture 3
Sample Inferences
1. Two-Sample Inferences
Paired t-test
Two-sample t-test
2. Inferences for more than two samples
One-way ANOVA
Two-way ANOVA
Interactions in Two-way ANOVA
3. DataDesk demo
Lecture 4
1. Regression
2. Correlation
3. Multiple Regression
4. ANCOVA
5. Normality Checks
6. Non-parametrics
7. Sample Size Calculations
8. Useful tools and websites
FIRST, A REALLY USEFUL SITE
Explanations of outputs
Videos with commentary
Help with deciding what test
to use with what data
1. Terminology
Populations & Samples
Population: the complete set of individuals,
objects or scores of interest.
Often too large to sample in its entirety
It may be real or hypothetical (e.g. the results from an
experiment repeated ad infinitum)
Variables
Variables are the quantities measured in a
sample. They may be classified as:
Quantitative i.e. numerical
Continuous (e.g. pH of a sample, patient
cholesterol levels)
Discrete (e.g. number of bacteria colonies in a
culture)
Categorical
Nominal (e.g. gender, blood group)
Ordinal (ranked e.g. mild, moderate or severe
illness). Often ordinal variables are re-coded to be
quantitative.
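As a minimal sketch of re-coding an ordinal variable to a quantitative one, the Python snippet below maps the severity levels named above to ranks 1 to 3. The patient list and the names `severity_order`, `severity_code` and `patients` are hypothetical, chosen only for illustration.

```python
# Ordered levels of the ordinal variable, from least to most severe
severity_order = ["mild", "moderate", "severe"]

# Map each level to its rank: mild -> 1, moderate -> 2, severe -> 3
severity_code = {level: rank for rank, level in enumerate(severity_order, start=1)}

# Hypothetical observations for four patients
patients = ["moderate", "mild", "severe", "mild"]

# Re-coded, quantitative version of the same observations
coded = [severity_code[level] for level in patients]
print(coded)
```

The ranks preserve the ordering of the categories, but the spacing between them (1, 2, 3) is a modelling assumption rather than something the data guarantee.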
2. Frequency Distributions
Example – Serum CK
Serum CK data for 36 male volunteers
[Figure: frequency and relative-frequency histograms of the serum CK data. Annotations mark the left tail and the (skewed) right tail; one annotation on the relative-frequency plot reads "i.e. 42%".]
The Mean
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Example
The mean is
$$\bar{x} = \frac{151 + 124 + 132 + 170 + 146 + 124 + 113}{7} = \frac{960}{7} = 137.14$$
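The calculation of the mean for the seven readings in this example can be checked with a few lines of Python. This is only a sketch; the variable names `readings` and `mean` are illustrative.

```python
# The seven readings from the example
readings = [151, 124, 132, 170, 146, 124, 113]

# Sample mean: sum of the observations divided by the sample size n
n = len(readings)
mean = sum(readings) / n

print(f"n = {n}, sum = {sum(readings)}, mean = {mean:.2f}")
```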
The Median and Mode
Example 1 – n is odd
Example 2 – n is even
The median is halfway between the middle two readings, i.e.
(274 + 292) / 2 = 283.
Two men have the same cholesterol level, so the mode is 274.
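A short sketch of the median and mode using Python's standard `statistics` module. The full cholesterol data are not listed in these slides, so the odd-n case below reuses the seven blood pressure readings from the mean example, and the even-n case uses only the two middle cholesterol readings quoted above.

```python
import statistics

# n odd: the median is the middle value of the sorted readings
readings = [151, 124, 132, 170, 146, 124, 113]
median_odd = statistics.median(readings)

# The mode is the most frequent value (124 occurs twice)
mode = statistics.mode(readings)

# n even: the median is halfway between the two middle readings
middle_two = [274, 292]
median_even = statistics.median(middle_two)

print(sorted(readings))
print(median_odd, mode, median_even)
```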
Mean versus Median
Large sample values tend to inflate the mean. This will happen if
the histogram of the data is right-skewed.
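To illustrate how a single large value inflates the mean but not the median, the sketch below takes the seven blood pressure readings and replaces the largest one (170) with a hypothetical extreme value of 270. The value 270 is an assumption made purely for illustration.

```python
import statistics

readings = [151, 124, 132, 170, 146, 124, 113]

# Hypothetical: swap the largest reading (170) for an extreme value (270)
with_extreme = [270 if x == 170 else x for x in readings]

mean_before = statistics.mean(readings)
mean_after = statistics.mean(with_extreme)
median_before = statistics.median(readings)
median_after = statistics.median(with_extreme)

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before} -> {median_after}")
```

The mean moves upward with the extreme value while the median stays where it was, which is why the median is often preferred for right-skewed data.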
4. Measures of Dispersion
Range
The sample range is the difference
between the largest and smallest
observations in the sample
Easy to calculate
Blood pressure example: min = 113 and
max = 170, so the range = 57 mmHg
Useful for “best” or “worst” case scenarios
Sensitive to extreme values
Sample Variance
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
Standard Deviation
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
Example
Data          Deviation     Deviation²
151             13.86         192.02
124            -13.14         172.73
132             -5.14          26.45
170             32.86        1079.59
146              8.86          78.45
124            -13.14         172.73
113            -24.14         582.88
Sum = 960.0   Sum = 0.00    Sum = 2304.86
$\bar{x} = 137.14$
Example (contd.)
$$\sum_{i=1}^{7} (x_i - \bar{x})^2 = 2304.86$$
Therefore,
$$s = \sqrt{\frac{2304.86}{7 - 1}} = 19.6$$
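The worked example can be reproduced step by step in Python. This sketch builds the deviations and squared deviations by hand and then compares the result with `statistics.stdev`, which also uses the n − 1 divisor. Variable names are illustrative.

```python
import statistics

readings = [151, 124, 132, 170, 146, 124, 113]
n = len(readings)
mean = sum(readings) / n

# Deviation of each reading from the mean, and its square
deviations = [x - mean for x in readings]
squared = [d ** 2 for d in deviations]

# Sample variance divides the sum of squares by n - 1
sum_of_squares = sum(squared)
variance = sum_of_squares / (n - 1)
sd = variance ** 0.5

for x, d, d2 in zip(readings, deviations, squared):
    print(f"{x:5d} {d:8.2f} {d2:9.2f}")
print(f"sum of squares = {sum_of_squares:.2f}, s = {sd:.1f}")

# Cross-check against the standard library
sd_library = statistics.stdev(readings)
```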
Coefficient of Variation
The coefficient of variation (CV) or relative
standard deviation (RSD) is the sample standard
deviation expressed as a percentage of the mean,
i.e.
$$CV = \frac{s}{\bar{x}} \times 100\%$$
The CV is not affected by multiplicative changes in
scale
Consequently, it is a useful way of comparing the
dispersion of variables measured on different scales
Example
$$CV = \frac{19.6}{137.1} \times 100\% = 14.3\%$$
i.e., the standard deviation is 14.3% as large as
the mean.
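A brief sketch of the CV calculation, including a check that a multiplicative change of scale leaves it unchanged. The conversion from mmHg to kPa (1 mmHg ≈ 0.133322 kPa) is used here only as an example of rescaling; the helper name `coefficient_of_variation` is illustrative.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100

readings_mmhg = [151, 124, 132, 170, 146, 124, 113]

# Same readings on a different scale (multiplicative change only)
MMHG_TO_KPA = 0.133322
readings_kpa = [x * MMHG_TO_KPA for x in readings_mmhg]

cv_mmhg = coefficient_of_variation(readings_mmhg)
cv_kpa = coefficient_of_variation(readings_kpa)

print(f"CV (mmHg) = {cv_mmhg:.1f}%, CV (kPa) = {cv_kpa:.1f}%")
```

Both the standard deviation and the mean are multiplied by the same factor, so the ratio is unaffected. An additive shift (for example Celsius to Fahrenheit) would change the CV.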
Inter-quartile range
Example
Q1 Q3
60% of slides complete!
5. Box-plots
Example 1
Example 1: Box-plot
Example 2: Box-plots of intensities from 11
gene expression arrays
[Figure: box-plots of array intensities, with outliers marked.]
6. Scatter-plot
Example 1: Age versus Systolic Blood
Pressure in a Clinical Trial
Example 2: Up-regulation/Down-regulation of gene
expression across an array (Control Cy5 versus Disease
Cy3)
Example of a Scatter-plot matrix (multiple
pair-wise plots)
Other graphical representations
Multivariate Data
7. Clustering
Aim: Find groups of samples or variables sharing
similarity
Clustering
Clustering can be applied to rows or columns of a data set
(matrix) i.e. to the samples or variables
UPGMA
Contrived Example
5 genes of interest on 3 replicate arrays/gels
             Array1   Array2   Array3
p53             9        3        7
mdm2           10        2        9
bcl2            1        9        4
cyclinE         6        5        5
caspase 8       1       10        3
Example
                       p53    mdm2    cyclin E    {caspase-8 & bcl-2}
p53                    0      2.5     4.12        10.9
mdm2                          0       6.4         9.1
cyclin E                              0           6.9
{caspase-8 & bcl-2}                               0
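The first UPGMA step for the contrived example can be sketched in Python, assuming Euclidean distance between expression profiles and average linkage (the assumption that reproduces most entries of the distance matrix). The closest pair is merged, then the distance from each remaining gene to the new cluster is the mean of its distances to the cluster members. With these assumptions the p53 and cyclin E entries come out as 10.9 and 6.9, while the mdm2 entry comes out near 12.95 rather than the 9.1 shown, so that value is worth double-checking against the original slide.

```python
from itertools import combinations
from math import dist

# Expression values on the three replicate arrays (from the contrived example)
expression = {
    "p53": (9, 3, 7),
    "mdm2": (10, 2, 9),
    "bcl2": (1, 9, 4),
    "cyclinE": (6, 5, 5),
    "caspase-8": (1, 10, 3),
}

# Pairwise Euclidean distances between gene profiles
pairwise = {
    frozenset((a, b)): dist(expression[a], expression[b])
    for a, b in combinations(expression, 2)
}

# Step 1: the two closest genes form the first cluster
closest = min(pairwise, key=pairwise.get)

def average_linkage(gene, cluster):
    """Mean distance from one gene to every member of a cluster (UPGMA)."""
    return sum(pairwise[frozenset((gene, member))] for member in cluster) / len(cluster)

# Step 2: distances from the remaining genes to the new cluster
print("first cluster:", sorted(closest))
for gene in expression:
    if gene not in closest:
        print(gene, round(average_linkage(gene, closest), 2))
```

Repeating these two steps on the reduced matrix until one cluster remains yields the dendrogram.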
Example (contd)
Example of a gene expression dendrogram
Variety of approaches to clustering
• Clustering techniques
– agglomerative: start with every element in its own cluster, and
iteratively join clusters together
– divisive: start with one cluster and iteratively divide it into
smaller clusters
• Distance Metrics
– Euclidean (as-the-crow-flies)
– Manhattan
– Minkowski (a whole class of metrics)
– Correlation (similarity in profiles: called similarity metrics)
• Linkage Rules
– average: Use the mean distance between cluster members
– single: Use the minimum distance (gives loose clusters)
– complete: Use the maximum distance (gives tight clusters)
– median: Use the median distance
– centroid: Use the distance between the “average” member of
each cluster
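The distance metrics and linkage rules listed above can be compared on a small example. The sketch below uses gene profiles from the contrived example earlier in the lecture; the function names (`manhattan`, `minkowski`, `linkage_distance`) and the choice of clusters are illustrative assumptions, not part of any particular package.

```python
from math import dist  # Euclidean distance

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p):
    """General class: p = 1 gives Manhattan, p = 2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def linkage_distance(cluster_a, cluster_b, rule):
    """Distance between two clusters under a given linkage rule."""
    distances = [dist(a, b) for a in cluster_a for b in cluster_b]
    if rule == "single":      # minimum pairwise distance
        return min(distances)
    if rule == "complete":    # maximum pairwise distance
        return max(distances)
    if rule == "average":     # mean pairwise distance
        return sum(distances) / len(distances)
    raise ValueError(f"unknown linkage rule: {rule}")

p53, mdm2 = (9, 3, 7), (10, 2, 9)
bcl2, caspase8 = (1, 9, 4), (1, 10, 3)

print("Euclidean:", round(dist(p53, mdm2), 2))
print("Manhattan:", manhattan(p53, mdm2))
print("Minkowski (p=3):", round(minkowski(p53, mdm2, 3), 2))

# Illustrative clusters: {p53, mdm2} versus {bcl2, caspase-8}
for rule in ("single", "average", "complete"):
    print(rule, round(linkage_distance([p53, mdm2], [bcl2, caspase8], rule), 2))
```

Single linkage never exceeds average linkage, which never exceeds complete linkage, which is one way to see why the former tends to give looser clusters and the latter tighter ones.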
Clustering Summary