Statistics
Colm ODushlaine
Overview
Descriptive Statistics & Graphical Presentation of
Data
Statistical Inference
Hypothesis Tests & Confidence Intervals
T-tests (Paired/Two-sample)
Regression (SLR & Multiple Regression)
ANOVA/ANCOVA
Intended as an overview. Slides will be provided after the lectures.
What's in the lectures?
Lecture 1: Descriptive Statistics and Graphical Presentation
1. Terminology of Data
2. Frequency Distributions/Histograms
3. Measures of data location
4. Measures of data spread
5. Box-plots
6. Scatter-plots
7. Clustering (Multivariate Data)
Lecture 2: Statistical Inference
Lecture 3: Sample Inferences
1. Two-Sample Inferences
Paired t-test
Two-sample t-test
2. Inferences for more than two samples
One-way ANOVA
Two-way ANOVA
Interactions in Two-way ANOVA
3. DataDesk demo
Lecture 4
1. Regression
2. Correlation
3. Multiple Regression
4. ANCOVA
5. Normality Checks
6. Non-parametrics
7. Sample Size Calculations
8. Useful tools and websites
FIRST, A REALLY USEFUL SITE
Explanations of outputs
Videos with commentary
Help with deciding what test
to use with what data
1. Terminology
Populations & Samples
Population: the complete set of individuals,
objects or scores of interest.
Often too large to sample in its entirety
It may be real or hypothetical (e.g. the results from an
experiment repeated ad infinitum)
Variables
2. Frequency Distributions
Serum CK Data for 36 male
volunteers
[Figure: histogram of the serum CK data, with Frequency and Relative Frequency axes. Labels mark the left tail and the right (skewed) tail; an annotation reads "i.e. 42%".]
The Mean
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
Example
The mean is
\[ \bar{x} = \frac{151 + 124 + 132 + 170 + 146 + 124 + 113}{7} = 137.14 \]
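The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `sample_mean` is introduced here for illustration, and the readings are taken to be the blood pressure values (mmHg) used in the later Range example.

```python
# Readings from the worked example (assumed to be blood pressures in mmHg).
readings = [151, 124, 132, 170, 146, 124, 113]


def sample_mean(values):
    """Return the arithmetic mean: the sum of the values divided by n."""
    return sum(values) / len(values)


mean_bp = sample_mean(readings)
print(f"n = {len(readings)}, sum = {sum(readings)}")  # n = 7, sum = 960
print(f"mean = {mean_bp:.2f}")                        # mean = 137.14
```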
The Median and Mode
Example 1: n is odd
Example 2: n is even
The Median is halfway between the middle two readings, i.e.
(274 + 292)/2 = 283.
Two men have the same cholesterol level, so the Mode is 274.
Mean versus Median
Large sample values tend to inflate the mean. This will happen if
the histogram of the data is right-skewed.
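A small sketch of this effect, using Python's standard `statistics` module. The seven readings are the values from the mean example; the extra reading of 300 is a hypothetical extreme value added only to create a right skew.

```python
import statistics

readings = [151, 124, 132, 170, 146, 124, 113]
# Hypothetical extreme reading, added only to illustrate a right-skewed sample.
skewed = readings + [300]

mean_before = statistics.mean(readings)
mean_after = statistics.mean(skewed)
median_before = statistics.median(readings)
median_after = statistics.median(skewed)
mode_before = statistics.mode(readings)

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")  # 137.14 -> 157.50
print(f"median: {median_before} -> {median_after}")      # 132 -> 139.0
print(f"mode of the original readings: {mode_before}")   # 124
```

The single large value moves the mean by about 20 units but the median by only 7, which is why the median is often preferred for skewed data.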
4. Measures of Dispersion
Range
the sample Range is the difference
between the largest and smallest
observations in the sample
easy to calculate;
Blood pressure example: min=113 and
max=170, so the range=57 mmHg
useful for best or worst case scenarios
sensitive to extreme values
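As a quick illustration of both the calculation and the sensitivity to extreme values, here is a minimal sketch. The helper `sample_range` is a name chosen here, and the reading of 300 is a hypothetical outlier, not part of the original data.

```python
readings = [151, 124, 132, 170, 146, 124, 113]


def sample_range(values):
    """Difference between the largest and smallest observations."""
    return max(values) - min(values)


bp_range = sample_range(readings)
# Hypothetical extreme reading, used only to show how a single value changes the range.
range_with_extreme = sample_range(readings + [300])

print(bp_range)            # 57
print(range_with_extreme)  # 187
```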
Sample Variance
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]
Standard Deviation
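\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]
Example
\[ \sum_{i=1}^{n} (x_i - \bar{x})^2 = 2304.86 \]
Therefore,
\[ s = \sqrt{\frac{2304.86}{7 - 1}} = 19.6 \]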
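The same calculation can be reproduced step by step in Python. This sketch assumes the seven readings from the mean example; the variable names are illustrative only.

```python
import math

readings = [151, 124, 132, 170, 146, 124, 113]

n = len(readings)
mean_bp = sum(readings) / n

# Sum of squared deviations from the mean (numerator of the sample variance).
sum_sq_dev = sum((x - mean_bp) ** 2 for x in readings)

# Dividing by n - 1 rather than n gives the sample variance.
variance = sum_sq_dev / (n - 1)
sd = math.sqrt(variance)

print(f"sum of squared deviations = {sum_sq_dev:.2f}")  # 2304.86
print(f"sample variance = {variance:.2f}")              # 384.14
print(f"standard deviation = {sd:.1f}")                 # 19.6
```

The standard library functions `statistics.variance` and `statistics.stdev` use the same n - 1 divisor and should give the same values.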
Coefficient of Variation
The coefficient of variation (CV) or relative
standard deviation (RSD) is the sample standard
deviation expressed as a percentage of the mean,
i.e.
\[ \mathrm{CV} = \frac{s}{\bar{x}} \times 100\% \]
The CV is not affected by multiplicative changes in scale.
Consequently, it is a useful way of comparing the
dispersion of variables measured on different scales.
Example
\[ \mathrm{CV} = \frac{19.6}{137.1} \times 100\% = 14.3\% \]
i.e., the standard deviation is 14.3% as large as
the mean.
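A short sketch of the CV calculation, which also shows that a multiplicative change of scale leaves it unchanged. The function name is chosen here for illustration, and the conversion from mmHg to kPa (1 mmHg = 0.133322 kPa) is used only as an example of rescaling, assuming the readings are blood pressures in mmHg.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return 100 * statistics.stdev(values) / statistics.mean(values)


readings_mmhg = [151, 124, 132, 170, 146, 124, 113]
# Same readings on a different scale (multiplicative change only).
readings_kpa = [x * 0.133322 for x in readings_mmhg]

cv_mmhg = coefficient_of_variation(readings_mmhg)
cv_kpa = coefficient_of_variation(readings_kpa)

print(f"CV in mmHg: {cv_mmhg:.1f}%")  # 14.3%
print(f"CV in kPa:  {cv_kpa:.1f}%")   # 14.3%
```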
Inter-quartile range
The Median divides a distribution into two halves.
Example
[Figure: example with the quartiles Q1 and Q3 marked]
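Quartiles and the inter-quartile range can be sketched with `statistics.quantiles` (Python 3.8 or later). Two assumptions apply: the data are the seven readings from the mean example rather than the data shown on this slide, and the quartiles follow Python's default "exclusive" method. Other software may use slightly different quartile conventions and give slightly different values.

```python
import statistics

readings = [151, 124, 132, 170, 146, 124, 113]

# n=4 splits the data into quarters and returns the three cut points.
q1, q2, q3 = statistics.quantiles(readings, n=4)
iqr = q3 - q1

print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}")  # Q1 = 124.0, median = 132.0, Q3 = 151.0
print(f"IQR = {iqr}")                          # IQR = 27.0
```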
60% of slides complete!
5. Box-plots
Example 1
Example 1: Box-plot
Example 2: Box-plots of intensities from 11 gene expression arrays
[Figure: box-plots for the 11 arrays; intensity axis with ticks at 8, 10, 12 and 14; outliers marked]
6. Scatter-plot
Example 2: Up-regulation/Down-regulation of gene expression across an
array (Control Cy5 versus Disease Cy3)
Example of a Scatter-plot matrix
(multiple pair-wise plots)
Other graphical representations
Multivariate Data
UPGMA
Unweighted Pair-Group Method Average
Most commonly used clustering method
Procedure:
1. Each observation forms its own cluster.
2. The two clusters with minimum distance are grouped into a single
cluster representing a new observation: take their average.
3. Repeat step 2 until all data points form a single cluster.
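The three steps above can be sketched as follows. This follows the procedure as described here, literally: the merged cluster is replaced by the plain average of the two items being merged, and distances are Euclidean. Library implementations of average linkage instead average all pairwise distances between cluster members, so results can differ once clusters have unequal sizes. The function names and the toy points are hypothetical.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cluster_by_averaging(points):
    """Follow the three steps literally; return the merges in order."""
    clusters = dict(points)  # step 1: every observation is its own cluster
    merges = []
    while len(clusters) > 1:  # step 3: repeat until one cluster remains
        # Step 2: find the pair of clusters with minimum distance.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: euclidean(clusters[pair[0]], clusters[pair[1]]),
        )
        distance = euclidean(clusters[a], clusters[b])
        pa, pb = clusters.pop(a), clusters.pop(b)
        # The merged cluster becomes a new observation: the average of the two.
        clusters[f"({a},{b})"] = tuple((x + y) / 2 for x, y in zip(pa, pb))
        merges.append((a, b, distance))
    return merges


# Hypothetical toy points, not taken from the slides.
toy_points = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (5.0, 0.0), "D": (5.0, 1.5)}
merges = cluster_by_averaging(toy_points)
for a, b, distance in merges:
    print(f"merge {a} + {b} at distance {distance:.2f}")
```

The order of the merges and the distances at which they happen are exactly the information a dendrogram displays.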
Contrived Example
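5 genes of interest on 3 replicate arrays/gels

            Array1   Array2   Array3
p53         9        3        7
mdm2        10       2        9
bcl2        1        9        4
cyclinE     6        5        5
caspase-8   1        10       3

The distance between two genes x and y is the Euclidean distance across the three arrays:
\[ d_{xy} = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + (x_3 - y_3)^2} \]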
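The pairwise distances for the contrived data can be computed with a short sketch like the one below. The dictionary layout and names are choices made here; the values are those of the five genes on the three arrays, and the distance is the Euclidean distance.

```python
import math
from itertools import combinations

# Expression values on Array1, Array2, Array3 for each gene.
expression = {
    "p53": (9, 3, 7),
    "mdm2": (10, 2, 9),
    "bcl2": (1, 9, 4),
    "cyclinE": (6, 5, 5),
    "caspase-8": (1, 10, 3),
}


def euclidean(x, y):
    """Square root of the summed squared differences across the arrays."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


distances = {
    (a, b): euclidean(expression[a], expression[b])
    for a, b in combinations(expression, 2)
}

# List the pairs from closest to furthest apart.
for (a, b), d in sorted(distances.items(), key=lambda item: item[1]):
    print(f"{a:10s} {b:10s} {d:6.2f}")

closest_pair = min(distances, key=distances.get)
print("closest pair:", closest_pair)  # ('bcl2', 'caspase-8')
```

The closest pair, bcl2 and caspase-8, is the first to be grouped into a single cluster.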
Example
                        p53    mdm2   cyclin E   {caspase-8 & bcl-2}
p53                     0      2.5    4.12       10.9
mdm2                           0      6.4        12.9
cyclin E                              0          6.9
{caspase-8 & bcl-2}                              0
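The distance matrix after the first merge can be recomputed with the sketch below. It assumes, following the procedure described for UPGMA in these slides, that the merged cluster {caspase-8 & bcl-2} is represented by the average of the two genes' values, and that distances are Euclidean. The names are illustrative.

```python
import math
from itertools import combinations


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


bcl2 = (1, 9, 4)
caspase8 = (1, 10, 3)
# Assumption: the merged cluster is the element-wise average of its two members.
merged = tuple((a + b) / 2 for a, b in zip(bcl2, caspase8))

clusters = {
    "p53": (9, 3, 7),
    "mdm2": (10, 2, 9),
    "cyclin E": (6, 5, 5),
    "{caspase-8 & bcl-2}": merged,
}

reduced = {
    (a, b): euclidean(clusters[a], clusters[b])
    for a, b in combinations(clusters, 2)
}

for (a, b), d in reduced.items():
    print(f"{a:10s} {b:20s} {d:6.2f}")
```

The smallest entry in this reduced matrix identifies the next pair of clusters to merge.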
Example (contd)
Example of a gene expression dendrogram
Variety of approaches to clustering
Clustering techniques
agglomerative: start with every element in its own cluster, and
iteratively join clusters together
divisive: start with one cluster and iteratively divide it into
smaller clusters
Distance Metrics
Euclidean (as-the-crow-flies)
Manhattan
Minkowski (a whole class of metrics)
Correlation (similarity in profiles: called similarity metrics)
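The metrics listed above can be sketched in a few lines. Euclidean and Manhattan distance are the Minkowski distance with p = 2 and p = 1 respectively. For the correlation metric, this sketch assumes the common convention of using 1 minus the Pearson correlation as a distance; other conventions exist. The two profiles are the p53 and mdm2 values from the contrived example.

```python
import math


def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length profiles."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


def euclidean(x, y):
    return minkowski(x, y, 2)  # as-the-crow-flies


def manhattan(x, y):
    return minkowski(x, y, 1)  # sum of absolute differences


def correlation_distance(x, y):
    """Assumed convention: 1 - Pearson correlation (0 means identical shape)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1 - cov / (norm_x * norm_y)


p53 = (9, 3, 7)
mdm2 = (10, 2, 9)

print(f"Euclidean:   {euclidean(p53, mdm2):.3f}")
print(f"Manhattan:   {manhattan(p53, mdm2):.3f}")
print(f"Correlation: {correlation_distance(p53, mdm2):.3f}")
```

The correlation distance depends only on the shape of the profile: a profile and a rescaled, shifted copy of it are at distance 0 even when their Euclidean distance is large.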
Linkage Rules
average: Use the mean distance between cluster members
single: Use the minimum distance (gives loose clusters)
complete: Use the maximum distance (gives tight clusters)
median: Use the median distance
centroid: Use the distance between the average member of
each cluster
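To make the linkage rules concrete, the sketch below computes the distance between two small clusters under four of the rules. The function `linkage_distance` and its rule names are illustrative, Euclidean distance is assumed, and the median rule is left out of the sketch. The clusters reuse gene profiles from the contrived example.

```python
import math
from itertools import product
from statistics import mean


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def linkage_distance(cluster_a, cluster_b, rule):
    """Distance between two clusters (lists of points) under a linkage rule."""
    pairwise = [euclidean(p, q) for p, q in product(cluster_a, cluster_b)]
    if rule == "single":      # minimum distance between members
        return min(pairwise)
    if rule == "complete":    # maximum distance between members
        return max(pairwise)
    if rule == "average":     # mean distance between members
        return mean(pairwise)
    if rule == "centroid":    # distance between the average members
        centroid_a = [mean(coords) for coords in zip(*cluster_a)]
        centroid_b = [mean(coords) for coords in zip(*cluster_b)]
        return euclidean(centroid_a, centroid_b)
    raise ValueError(f"unknown linkage rule: {rule}")


cluster_1 = [(9, 3, 7), (10, 2, 9)]   # p53, mdm2
cluster_2 = [(1, 9, 4), (1, 10, 3)]   # bcl2, caspase-8

results = {
    rule: linkage_distance(cluster_1, cluster_2, rule)
    for rule in ("single", "average", "complete", "centroid")
}
for rule, d in results.items():
    print(f"{rule:9s} {d:.2f}")
```

Single linkage always gives the smallest of the pairwise-based distances and complete linkage the largest, with average linkage in between.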
Clustering Summary