
Introduction to Statistics

Colm O’Dushlaine

Neuropsychiatric Genetics, TCD


codushlaine@gmail.com

1
Overview
 Descriptive Statistics & Graphical Presentation of
Data
 Statistical Inference
 Hypothesis Tests & Confidence Intervals
 T-tests (Paired/Two-sample)
 Regression (SLR & Multiple Regression)
 ANOVA/ANCOVA
 Intended as an overview. Slides will be provided after the lectures
 What’s in the lectures?...
2
Lecture 1
Descriptive Statistics and Graphical
Presentation of Data
1. Terminology
2. Frequency Distributions/Histograms
3. Measures of data location
4. Measures of data spread
5. Box-plots
6. Scatter-plots
7. Clustering (Multivariate Data)

3
Lecture 2
Statistical Inference

1. Distributions & Densities


2. Normal Distribution
3. Sampling Distribution & Central Limit Theorem
4. Hypothesis Tests
5. P-values
6. Confidence Intervals
7. Two-Sample Inferences
8. Paired Data

4
Lecture 3
Sample Inferences

1. Two-Sample Inferences
 Paired t-test
 Two-sample t-test
2. Inferences for more than two samples
 One-way ANOVA
 Two-way ANOVA
 Interactions in Two-way ANOVA
3. DataDesk demo

5
Lecture 4

1. Regression
2. Correlation
3. Multiple Regression
4. ANCOVA
5. Normality Checks
6. Non-parametrics
7. Sample Size Calculations
8. Useful tools and websites
6
FIRST, A REALLY USEFUL SITE
Explanations of outputs
Videos with commentary
Help with deciding what test
to use with what data

7
1. Terminology
Populations & Samples
 Population: the complete set of individuals,
objects or scores of interest.
 Often too large to sample in its entirety
 It may be real or hypothetical (e.g. the results from an
experiment repeated ad infinitum)

 Sample: A subset of the population.


 A sample may be classified as random (each member
has equal chance of being selected from a population)
or convenience (what’s available).
 Random selection attempts to ensure the sample is
representative of the population.

8
Variables
 Variables are the quantities measured in a sample. They may be classified as:
 Quantitative i.e. numerical
 Continuous (e.g. pH of a sample, patient
cholesterol levels)
 Discrete (e.g. number of bacteria colonies in a
culture)
 Categorical
 Nominal (e.g. gender, blood group)
 Ordinal (ranked e.g. mild, moderate or severe
illness). Often ordinal variables are re-coded to be
quantitative.

9
Variables

 Variables can be further classified as:


 Dependent/Response. Variable of primary interest
(e.g. blood pressure in an antihypertensive drug trial).
Not controlled by the experimenter.
 Independent/Predictor
 called a Factor when controlled by experimenter. It
is often nominal (e.g. treatment)
 Covariate when not controlled.

 If the value of a variable cannot be predicted in advance then the variable is referred to as a random variable
10
Parameters & Statistics
 Parameters: Quantities that describe a population characteristic. They are usually unknown, and we wish to make statistical inferences about them. (Not to be confused with perimeters.)

 Descriptive Statistics: Quantities and techniques used to describe a sample characteristic or illustrate the sample data, e.g. mean, standard deviation, box-plot

11
2. Frequency Distributions

 An (Empirical) Frequency Distribution or Histogram for a continuous variable presents the counts of observations grouped within pre-specified classes or groups

 A Relative Frequency Distribution presents the corresponding proportions of observations within the classes

 A Barchart presents the frequencies for a categorical variable

12
Example – Serum CK

 Blood samples were taken from 36 male volunteers as part of a study to determine the natural variation in CK concentration.

 The serum CK concentrations, measured in U/l, are as follows:

13
Serum CK Data for 36 male volunteers

121 82 100 151 68 58
95 145 64 201 101 163
84 57 139 60 78 94
119 104 110 113 118 203
62 83 67 93 92 110
25 123 70 48 95 42
14
Relative Frequency Table
Serum CK (U/l)   Frequency   Relative Frequency   Cumulative Rel. Frequency
20-39 1 0.028 0.028
40-59 4 0.111 0.139
60-79 7 0.194 0.333
80-99 8 0.222 0.555
100-119 8 0.222 0.777
120-139 3 0.083 0.860
140-159 2 0.056 0.916
160-179 1 0.028 0.944
180-199 0 0.000 0.944
200-219 2 0.056 1.000
Total 36 1.000
15
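As a check on the table above, here is a minimal Python sketch (standard library only; variable names are illustrative) that tallies the 36 CK values into the same 20-U/l classes:

```python
# Tally the serum CK values into 20-U/l classes and print the frequency,
# relative frequency and cumulative relative frequency of each class.
ck = [121, 82, 100, 151, 68, 58, 95, 145, 64, 201, 101, 163,
      84, 57, 139, 60, 78, 94, 119, 104, 110, 113, 118, 203,
      62, 83, 67, 93, 92, 110, 25, 123, 70, 48, 95, 42]

n = len(ck)
cumulative = 0.0
for lower in range(20, 220, 20):          # classes 20-39, 40-59, ..., 200-219
    upper = lower + 19
    freq = sum(lower <= x <= upper for x in ck)
    rel = freq / n
    cumulative += rel
    print(f"{lower}-{upper}: {freq:2d}  {rel:.3f}  {cumulative:.3f}")
```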
Frequency Distribution
[Figure: histogram of CK concentration (U/l) for the 36 volunteers, with classes from 20 to 220 U/l and an accompanying quantile summary (minimum, 2.5%, 10%, 25% quartile, median, 75% quartile, 90%, 97.5%, maximum).]
16
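A histogram like the one on this slide can be drawn with matplotlib (assumed available), using the same class boundaries:

```python
# Histogram of the CK data with 20-U/l bins from 20 to 220.
import matplotlib.pyplot as plt

ck = [121, 82, 100, 151, 68, 58, 95, 145, 64, 201, 101, 163,
      84, 57, 139, 60, 78, 94, 119, 104, 110, 113, 118, 203,
      62, 83, 67, 93, 92, 110, 25, 123, 70, 48, 95, 42]

plt.hist(ck, bins=range(20, 240, 20), edgecolor="black")
plt.xlabel("CK concentration (U/l)")
plt.ylabel("Frequency")
plt.show()
```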
Relative Frequency Distribution
[Figure: relative frequency distribution of CK concentration (U/l), classes from 20 to 220 U/l, with the quantile summary alongside. The shaded area is the percentage of males with CK values between 60 and 100 U/l, i.e. 42%. The mode and the left and right tails are marked; the long right tail shows the distribution is skewed to the right.]
17


3. Measures of Central Tendency
(Location)
Measures of location indicate where on the number
line the data are to be found. Common measures of
location are:

(i) the Arithmetic Mean,
(ii) the Median, and
(iii) the Mode

18
The Mean

 Let x1, x2, x3, …, xn be the realised values of a random variable X, from a sample of size n. The sample arithmetic mean is defined as:

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

19
Example

Example 2: The systolic blood pressures of seven middle-aged men were as follows:
151, 124, 132, 170, 146, 124 and 113.

The mean is

\bar{x} = \frac{151 + 124 + 132 + 170 + 146 + 124 + 113}{7} = 137.14

20
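A quick numerical check of this calculation, as a Python sketch using only the standard library:

```python
# Sample mean of the seven systolic blood pressure readings.
from statistics import mean

bp = [151, 124, 132, 170, 146, 124, 113]
print(round(mean(bp), 2))   # 137.14
```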
The Median and Mode

 If the sample data are arranged in increasing order, the median is
(i) the middle value if n is an odd number, or
(ii) midway between the two middle values if n is an even number
 The mode is the most commonly occurring value.

21
Example 1 – n is odd

The reordered systolic blood pressure data seen earlier are:

113, 124, 124, 132, 146, 151, and 170.

The Median is the middle value of the ordered data, i.e. 132.

Two individuals have systolic blood pressure = 124 mm Hg, so the Mode is 124.

22
Example 2 – n is even

Six men with high cholesterol participated in a study to investigate the effects of diet on cholesterol level. At the beginning of the study, their cholesterol levels (mg/dL) were as follows:
366, 327, 274, 292, 274 and 230.

Rearranged in numerical order:
230, 274, 274, 292, 327 and 366.

The Median is halfway between the middle two readings, i.e. (274 + 292) / 2 = 283.

Two men have the same cholesterol level, so the Mode is 274.

23
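Both examples can be verified with the standard library's statistics module; a minimal sketch:

```python
# Median and mode for the odd-n (blood pressure) and even-n (cholesterol) examples.
from statistics import median, mode

bp = [113, 124, 124, 132, 146, 151, 170]      # n = 7 (odd)
chol = [366, 327, 274, 292, 274, 230]         # n = 6 (even)

print(median(bp), mode(bp))       # 132 124
print(median(chol), mode(chol))   # 283.0 274
```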
Mean versus Median

 Large sample values tend to inflate the mean. This will happen if
the histogram of the data is right-skewed.

 The median is not influenced by large sample values and is a better measure of centrality if the distribution is skewed.

 Note: if mean = median = mode then the data are said to be symmetrical.

 e.g. In the CK measurement study, the sample mean = 98.28 and the median = 94.5, i.e. the mean is larger than the median, indicating that the mean is inflated by the two large data values 201 and 203.

24
4. Measures of Dispersion

 Measures of dispersion characterise how spread out the distribution is, i.e. how variable the data are.
 Commonly used measures of dispersion
include:
1. Range
2. Variance & Standard deviation
3. Coefficient of Variation (or relative standard
deviation)
4. Inter-quartile range

25
Range
 the sample Range is the difference
between the largest and smallest
observations in the sample
 easy to calculate;
 Blood pressure example: min=113 and
max=170, so the range=57 mmHg
 useful for “best” or “worst” case scenarios (+)
 sensitive to extreme values (−)

26
Sample Variance

 The sample variance, s², is essentially the arithmetic mean of the squared deviations from the sample mean (note the divisor n − 1 rather than n):

s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}

27
Standard Deviation

 The sample standard deviation, s, is the square-root of the variance:

s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}

 s has the advantage of being in the same units as the original variable x

28
Example
Data   Deviation   Deviation²
151 13.86 192.02
124 -13.14 172.73
132 -5.14 26.45
170 32.86 1079.59
146 8.86 78.45
124 -13.14 172.73
113 -24.14 582.88
Sum = 960.0 Sum = 0.00 Sum = 2304.86
\bar{x} = 137.14
29
Example (contd.)

\sum_{i=1}^{n} (x_i - \bar{x})^2 = 2304.86

Therefore,

s = \sqrt{\frac{2304.86}{7 - 1}} = 19.6
30
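The same result can be obtained with the statistics module, which uses the n − 1 divisor shown above; a minimal sketch:

```python
# Sample variance and standard deviation of the blood pressure data.
from statistics import variance, stdev

bp = [151, 124, 132, 170, 146, 124, 113]
print(round(variance(bp), 2))   # 384.14  (= 2304.86 / 6)
print(round(stdev(bp), 1))      # 19.6
```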
Coefficient of Variation
 The coefficient of variation (CV) or relative standard deviation (RSD) is the sample standard deviation expressed as a percentage of the mean, i.e.

CV = \left( \frac{s}{\bar{x}} \right) \times 100\%
 The CV is not affected by multiplicative changes in
scale
 Consequently, a useful way of comparing the
dispersion of variables measured on different scales

31
Example

The CV of the blood pressure data is:

CV = \left( \frac{19.6}{137.1} \right) \times 100\% = 14.3\%
i.e., the standard deviation is 14.3% as large as
the mean.

32
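And the corresponding check in Python, continuing the earlier sketch:

```python
# Coefficient of variation (relative standard deviation) of the blood pressure data.
from statistics import mean, stdev

bp = [151, 124, 132, 170, 146, 124, 113]
cv = stdev(bp) / mean(bp) * 100
print(round(cv, 1))   # 14.3 (%)
```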
Inter-quartile range

 The Median divides a distribution into two halves.

 The first and third quartiles (denoted Q1 and Q3) are defined as follows:
 25% of the data lie below Q1 (and 75% lie above Q1),
 25% of the data lie above Q3 (and 75% lie below Q3)

 The inter-quartile range (IQR) is the difference between the first and third quartiles, i.e.
IQR = Q3 − Q1

33
Example

The ordered blood pressure data are:

113 124 124 132 146 151 170

Q1 = 124 and Q3 = 151, so the Inter-Quartile Range (IQR) is 151 − 124 = 27

34
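Quartile conventions differ between software packages; the sketch below (a hypothetical helper, standard library only) uses the rule implied by these slides, taking Q1 and Q3 as the medians of the lower and upper halves of the ordered data:

```python
# Quartiles as the medians of the lower and upper halves of the ordered data.
from statistics import median

def quartiles(data):
    x = sorted(data)
    n = len(x)
    lower_half = x[:n // 2]          # values below the median
    upper_half = x[(n + 1) // 2:]    # values above the median
    return median(lower_half), median(upper_half)

bp = [151, 124, 132, 170, 146, 124, 113]
q1, q3 = quartiles(bp)
print(q1, q3, q3 - q1)   # 124 151 27
```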
60% of slides complete!

35
5. Box-plots

 A box-plot is a visual description of the distribution based on
 Minimum
 Q1
 Median
 Q3
 Maximum
 Useful for comparing large sets of data

36
Example 1

The pulse rates of 12 individuals arranged in increasing order are:
62, 64, 68, 70, 70, 74, 74, 76, 76, 78, 78, 80

Q1 = (68 + 70) / 2 = 69, Q3 = (76 + 78) / 2 = 77

IQR = (77 − 69) = 8

37
Example 1: Box-plot

38
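A box-plot like this can be drawn with matplotlib (assumed available). Note that matplotlib computes quartiles by linear interpolation, so the box hinges may differ slightly from the hand calculation above:

```python
# Box-plot of the 12 pulse rates, with whiskers at 1.5 * IQR (the default).
import matplotlib.pyplot as plt

pulse = [62, 64, 68, 70, 70, 74, 74, 76, 76, 78, 78, 80]

plt.boxplot(pulse, whis=1.5)
plt.ylabel("Pulse rate")
plt.title("Box-plot of 12 pulse rates")
plt.show()
```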
Example 2: Box-plots of intensities from 11
gene expression arrays

[Figure: side-by-side box-plots of intensities (y-axis roughly 8 to 14) for the 11 arrays; labelled arrays include AG_04659_AS.cel, AG_11745_AS.cel, KB_5828_AS.cel and KB_8840_AS.cel.]


39
Outliers

 An outlier is an observation which does not appear to belong with the other data
 Outliers can arise because of a measurement
or recording error or because of equipment
failure during an experiment, etc.
 An outlier might be indicative of a sub-
population, e.g. an abnormally low or high
value in a medical test could indicate presence
of an illness in the patient.
40
Outlier Boxplot

 Re-define the upper and lower limits of the boxplot (the whisker lines) as:
Lower limit = Q1 − 1.5 × IQR, and
Upper limit = Q3 + 1.5 × IQR

 Note that the whisker lines may not extend all the way to these limits
 If a data point is < lower limit or > upper limit, the data point is considered to be an outlier (this rule is applied to the CK data in the sketch below).
41
Example – CK data

[Figure: outlier box-plot of the CK data; the extreme high values are flagged as outliers]

42
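As a sketch of how the outliers in this plot can be found, here is the 1.5 × IQR rule applied to the CK data (quartiles again taken as the medians of the two halves, so the exact limits depend on the quartile convention used):

```python
# Flag CK values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
from statistics import median

ck = [121, 82, 100, 151, 68, 58, 95, 145, 64, 201, 101, 163,
      84, 57, 139, 60, 78, 94, 119, 104, 110, 113, 118, 203,
      62, 83, 67, 93, 92, 110, 25, 123, 70, 48, 95, 42]

x = sorted(ck)
n = len(x)
q1 = median(x[:n // 2])
q3 = median(x[(n + 1) // 2:])
iqr = q3 - q1

lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in x if v < lower or v > upper]
print(lower, upper, outliers)   # -9.0 195.0 [201, 203]
```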
6. Scatter-plot

 Displays the relationship between two continuous variables

 Useful in the early stage of analysis when exploring data and determining if a linear regression analysis is appropriate

 May show outliers in your data

43
Example 1: Age versus Systolic Blood
Pressure in a Clinical Trial

44
Example 2: Up-regulation/Down-regulation of gene
expression across an array (Control Cy5 versus Disease
Cy3)

45
Example of a Scatter-plot matrix (multiple
pair-wise plots)

46
Other graphical representations

 Dot-Plots, Stem-and-leaf plots
 Not visually appealing
 Pie-chart
 Visually appealing, but hard to compare two datasets. Best
for 3 to 7 categories. A total must be specified.
 Violin-plots
 =boxplot+smooth density
 Nice visual of data shape

47
Multivariate Data

 Clustering is useful for visualising multivariate data and uncovering patterns, often reducing its complexity

 Clustering is especially useful for high-dimensional data (p >> n): hundreds or perhaps thousands of variables

 Obvious areas of application are gel electrophoresis and microarray experiments, where the variables are protein abundances or gene expression ratios

48
7. Clustering
 Aim: Find groups of samples or variables sharing similarity

 Clustering requires a definition of distance between objects, quantifying a notion of (dis)similarity
 Points are grouped on the basis of minimum distance apart (distance measures)

 Once a pair are grouped, they are combined into a single point (using a linkage method), e.g. by taking their average. The process is then repeated.

49
Clustering
 Clustering can be applied to rows or columns of a data set
(matrix) i.e. to the samples or variables

 A tree can be constructed with branch lengths proportional to the distances between linked clusters, called a Dendrogram

 Clustering is an example of unsupervised learning: no use is made of sample annotations, i.e. treatment groups, diagnosis groups

50
UPGMA

 Unweighted Pair-Group Method with Arithmetic Mean (UPGMA)
 Most commonly used clustering method
 Procedure:
 1. Each observation forms its own cluster
 2. The two clusters with minimum distance are merged into a single cluster representing a new observation (take their average)
 3. Repeat step 2 until all data points form a single cluster

51
Contrived Example
 5 genes of interest on 3 replicate arrays/gels

            Array1  Array2  Array3
p53            9       3       7
mdm2          10       2       9
bcl2           1       9       4
cyclinE        6       5       5
caspase 8      1      10       3

 Calculate the Euclidean distance between each pair of genes, e.g.

d(p53, mdm2) = \sqrt{(9 - 10)^2 + (3 - 2)^2 + (7 - 9)^2} \approx 2.5

52
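A one-line check of this distance calculation in Python:

```python
# Euclidean distance between the p53 and mdm2 expression profiles.
p53 = (9, 3, 7)
mdm2 = (10, 2, 9)
d = sum((a - b) ** 2 for a, b in zip(p53, mdm2)) ** 0.5
print(round(d, 2))   # 2.45 (rounded to 2.5 on the slide)
```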
Example

 Construct a distance matrix of all pair-wise distances

            p53    mdm2   bcl2   cyclinE  caspase 8
p53          0     2.5    10.44   4.12    11.75
mdm2         -     0      12.5    6.4     13.93
bcl2         -     -      0       6.48     1.41
cyclinE      -     -      -       0        7.35
caspase 8    -     -      -       -        0

 Cluster the 2 genes with the smallest distance
 Take their average & re-calculate distances to the other genes (see the sketch below)

53
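The whole distance-matrix-and-merge procedure can be reproduced with SciPy (assumed available), which implements UPGMA as average linkage. This is a sketch rather than the tool used to produce the slides, so the printed distances may differ slightly from the rounded values shown above:

```python
# Pair-wise Euclidean distances and UPGMA (average-linkage) clustering
# of the five gene expression profiles.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram

genes = ["p53", "mdm2", "bcl2", "cyclinE", "caspase8"]
X = np.array([[9, 3, 7],
              [10, 2, 9],
              [1, 9, 4],
              [6, 5, 5],
              [1, 10, 3]])

d = pdist(X, metric="euclidean")        # condensed vector of pair-wise distances
print(squareform(d).round(2))           # full 5 x 5 distance matrix

Z = linkage(d, method="average")        # UPGMA merging steps
dendrogram(Z, labels=genes)             # draws the tree (requires matplotlib)
```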
After merging {caspase-8 & bcl-2}:

                      p53    mdm2   cyclin E   {caspase-8 & bcl-2}
p53                    0     2.5     4.12        10.9
mdm2                          0      6.4          9.1
cyclin E                             0            6.9
{caspase-8 & bcl-2}                               0

After then merging {p53 & mdm2}:

                      {p53 & mdm2}   cyclin E   {caspase-8 & bcl-2}
{p53 & mdm2}            0              3.7         9.2
cyclin E                               0           6.9
{caspase-8 & bcl-2}                                0

54
Example (contd)

…and the final cluster:

55
Example of a gene expression dendrogram

56
Variety of approaches to clustering
• Clustering techniques
– agglomerative -start with every element in its own cluster, and
iteratively join clusters together
– divisive - start with one cluster and iteratively divide it into
smaller clusters
• Distance Metrics
– Euclidean (as-the-crow-flies)
– Manhattan
– Minkowski (a whole class of metrics)
– Correlation (similarity in profiles: called similarity metrics)
• Linkage Rules
– average: Use the mean distance between cluster members
– single: Use the minimum distance (gives loose clusters)
– complete: Use the maximum distance (gives tight clusters)
– median: Use the median distance
– centroid: Use the distance between the “average” members of each cluster
57
Clustering Summary

 The clusters & tree topology often depend strongly on the distance measure and linkage method used

 It is recommended to try two distance metrics, such as Euclidean and a correlation metric

 A clustering algorithm will always yield clusters, whether the data are organised in clusters or not!

58
