
Dr. Qin Xu, 27/09/2013

Descriptive Statistics

• Levels of Data
• Measures of Central Tendency
• Measures of Dispersion
• Distribution

A. Levels of Data

• Nominal scale: Numbers are used simply to identify groups or categories. The numeric units used bear no meaningful relationship to one another.
• Ordinal scale: Numbers not only identify groups or categories but also stand in a certain order relative to one another; however, the differences among the groups/categories (i.e., the intervals) are unequal or unknown.
• Interval scale: Numerical measurement on a scale in which all the intervals are equal.


B. Measures of Central Tendency


• Mode: The score in a set of scores that occurs most frequently. It is possible for a sample to be bimodal or multi-modal. The mode should be used as the measure of central tendency for nominal data.
• Median: The middle score of a set of scores placed in size order. It has the advantage over other measures of central tendency of not being influenced by the values of extreme scores. It should be used when an interval-level distribution is skewed and when data are ordinal in nature.
• Mean: The sum of all scores divided by the number of scores. It has the advantage of taking all the scores into consideration but is influenced by extreme scores. It should be used with interval or ratio levels of measurement.
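
As a quick illustration (not part of the original slides), the three measures can be computed on a small made-up set of scores with Python's standard library:

  from statistics import mean, median, multimode

  # Hypothetical set of scores (illustrative only)
  scores = [3, 5, 5, 6, 7, 8, 21]

  print("Mode:", multimode(scores))        # most frequent score(s) -> [5]
  print("Median:", median(scores))         # middle score -> 6
  print("Mean:", round(mean(scores), 2))   # pulled up by the extreme score 21 -> 7.86

Note how the single extreme score (21) pulls the mean above the median, which is why the median is preferred for skewed distributions.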

C. Measures of Dispersion
• Range = (highest – lowest), used with mean

• Interquartile Range = (75th percentile score – 25th percentile score), used with median

• Variance (s²) = (sum of squares) ⁄ (number of observations) = [∑(x − x̄)²] ⁄ n

• Standard Deviation (s or stdev) = √variance (used with mean) = √{[∑(x − x̄)²] ⁄ (n − 1)}

Q: Why sum of squares? Why (n − 1)?
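
A minimal numpy sketch (illustrative, not from the slides) of these dispersion measures; note that np.var and np.std divide by n by default, and ddof=1 switches to the (n − 1) form:

  import numpy as np

  x = np.array([3, 5, 5, 6, 7, 8, 21], dtype=float)   # same hypothetical scores as above

  data_range = x.max() - x.min()                       # highest - lowest
  iqr = np.percentile(x, 75) - np.percentile(x, 25)    # interquartile range
  var_n = np.var(x)            # sum of squares / n           (divides by n)
  sd_n1 = np.std(x, ddof=1)    # sqrt(sum of squares / (n-1)) (divides by n - 1)

  print(data_range, iqr, var_n, sd_n1)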


D. Distribution
• Frequency Distribution is often used to describe the distribution of nominal and ordinal data in the form of a histogram or a bar chart.
• Skewed Distribution is a non-symmetrical distribution of data in which the mean is distorted by the existence of extreme scores that lie to one side of the median.
• Normal Distribution is a theoretical distribution (followed by many random factors) that is symmetrical and bell shaped, where
  mean = median = mode
  68% of data lie within ±1 stdev
  95% of data lie within ±2 stdev
  99.7% of data lie within ±3 stdev
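
These percentages can be verified with scipy's normal distribution (a quick check, not part of the original slides):

  from scipy.stats import norm

  for k in (1, 2, 3):
      p = norm.cdf(k) - norm.cdf(-k)             # probability within +/- k standard deviations
      print(f"within +/-{k} stdev: {p:.1%}")     # ~68.3%, 95.4%, 99.7%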


Inferential Statistics

• Introduction
• Samples and Populations
• Sampling Error
• Hypothesis testing
• Type I and Type II Errors

A. Introduction

The purpose of statistical inference is to obtain information about a population from information contained in a sample.

A hypothesis is a testable statement, usually derived from theory or observation, which predicts the nature of the relationship between variables, or tests for differences among sets of data (observed values of variables) that can be attributed to the influence of particular variables.


B. Samples and Populations


Population refers to all possible objects of a particular type.
A sample is a subset of the population.

Sources of sampling error:
  population variance (s²)
  sample size (n)
  sampling error is a function of s² ⁄ n

There is always a difference between two sample means, even though they come from the same population.


Central limit theorem

For repeated sampling with sample sizes n > 20, the distribution of mean differences is almost always normal, regardless of the population distribution.

Standard error (SE)

The standard deviation of the sampling distribution of the differences is called the standard error, and it decreases as the sample size increases. It indicates the amount of error in the estimation of the population parameters from the sample data.

SE = √(variance ⁄ n) = s ⁄ √n
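
A small simulation (illustrative only, assuming numpy is available) showing that the standard deviation of repeated sample means approaches s ⁄ √n even for a non-normal population:

  import numpy as np

  rng = np.random.default_rng(0)
  population = rng.exponential(scale=2.0, size=100_000)   # deliberately non-normal population
  n = 25

  # Draw many samples and record their means
  sample_means = [rng.choice(population, size=n).mean() for _ in range(2000)]

  print("SD of sample means:", np.std(sample_means))             # empirical standard error
  print("s / sqrt(n)       :", population.std() / np.sqrt(n))    # theoretical standard error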

C. Hypothesis Testing
Hypothesis testing is the statistical inference of rejecting or accepting the “by chance” explanation of an observed difference between samples.

Significance Level (α) is the chosen probability level at which the chance explanation is rejected.

Null Hypothesis (H0) assumes the differences between samples arise purely from chance fluctuations in the samples’ scores.
  H0: There is no difference/relationship between/among...


Alternative Hypothesis (H1) assumes the differences between samples are caused, at least in part, either by differences between populations or by independent variables.

A two-tailed hypothesis does not state the direction of the difference:
  H1: there is a difference/relationship …

A one-tailed hypothesis states the direction of the difference:
  H1: A is more/less than B; or
      A is positively/negatively related to B

• Two-tailed or One-tailed
When α = 0.05 there is a 5% risk of rejecting H0 wrongly, but a 95% probability of a correct decision.
[Diagram: rejection regions. One-tailed test: accept H0 over the central 95%, reject H0 in a single 5% tail. Two-tailed test: accept H0 over the central 95%, reject H0 in two 2.5% tails.]
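
The corresponding critical z-values can be obtained from the normal distribution (a sketch, not from the slides):

  from scipy.stats import norm

  alpha = 0.05
  z_one_tailed = norm.ppf(1 - alpha)        # 1.645: all 5% in one tail
  z_two_tailed = norm.ppf(1 - alpha / 2)    # 1.960: 2.5% in each tail

  print(z_one_tailed, z_two_tailed)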


D. Type I and Type II Errors


Type I Error is the mistake of rejecting the null hypothesis when it is in fact true.
To reduce the chance of a Type I error, the chosen significance level should be reduced, e.g., from α = 0.05 to α = 0.01. However, this would increase the number of occasions of a
Type II Error, in which the null hypothesis is accepted when it is in fact incorrect.

[Diagram: as the significance level is lowered from 100% towards 0% (with 5% marked), the probability (α) of committing a Type I error decreases while the probability (β) of committing a Type II error increases; equivalently, the probability that an accepted H0 is incorrect rises from 0% towards 100%.]

Consider: the impact of drinks on physical performance
  Two drinks: water, Lucozade
  Resources: 40 day-passes for the Sports Centre, £35


Choosing an Appropriate Test


• One-sample
It is used when an existing population mean or a previous
sample mean is known. It examines whether a collected
sample belongs to the known population or comes from the
same population as the existing sample.
• Paired-samples
- repeated measure design (using the same sample for different
treatments)
- matched pair design (using 2 samples of closely matched
profiles)
- crossover design (using 2 samples undergoing different
treatments alternately)
• Independent-samples
- randomised control trials (RCT, randomly assign samples to
treatments)
- comparison group (cross-sectional) design (obtaining
independent samples at a fixed time point)


Simple Parametric Tests


(one or two sample)

• Assumptions for Parametric Tests


• Testing Assumptions on Samples
• Choosing an Appropriate Test

A. Assumptions for Parametric Tests

A parametric test is a statistical test used to examine for significant differences between means.

It assumes that
• Samples come from normally distributed populations
• Samples to be compared have the same variance
• The dependent variable is measured on an interval scale.


B. Testing Assumptions on Samples


1. Test for normal distribution

- Using descriptive statistics
  skewness: z = statistic ÷ std. error, with −1.96 < z < 1.96
  kurtosis: z = statistic ÷ std. error, with −1.96 < z < 1.96

- Employing a test of normality (SPSS)
  Kolmogorov-Smirnov Test, n ≥ 50
  Shapiro-Wilk Test, n < 50

- [Visual inspection using
  Histogram
  Boxplot
  Normal Probability Plot (normal Q-Q plot)]

(A Python sketch of these checks follows this list.)
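
A rough equivalent of these checks in Python (illustrative; scipy's skewtest/kurtosistest and normality tests differ slightly in detail from SPSS):

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(1)
  x = rng.normal(loc=50, scale=10, size=40)   # hypothetical sample, n < 50

  # z-style checks on skewness and kurtosis (acceptably normal if -1.96 < z < 1.96)
  z_skew, p_skew = stats.skewtest(x)
  z_kurt, p_kurt = stats.kurtosistest(x)

  # Formal normality tests
  w, p_sw = stats.shapiro(x)                                          # Shapiro-Wilk, small samples
  d, p_ks = stats.kstest(x, 'norm', args=(x.mean(), x.std(ddof=1)))   # Kolmogorov-Smirnov

  print(z_skew, z_kurt, p_sw, p_ks)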

2. Test for Equal Variance

There are at least 56 such tests. In SPSS, Levene’s Test is used. It is ONLY applied when there are samples from two or more independent sources.
It employs H0: there is equal variance, and an
F-test, where p > 0.05 assumes equal variance

  F = (between-groups variance) ⁄ (within-group variance)
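
In Python, scipy provides Levene's test directly (a sketch with made-up samples; center='mean' matches the original Levene statistic):

  from scipy import stats

  group_a = [12.1, 11.8, 13.0, 12.4, 11.9, 12.7]   # hypothetical independent samples
  group_b = [11.5, 14.2, 10.9, 13.8, 12.0, 14.6]

  stat, p = stats.levene(group_a, group_b, center='mean')
  print(f"Levene W = {stat:.3f}, p = {p:.3f}")     # p > 0.05 -> assume equal variances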

3. Data Levels
There is no statistical test to assess this
- Nominal data: in name only
- Ordinal data: ranked
- Interval data: continuous scale with equal intervals


Non-parametric Tests

• Comparison with Parametric tests


• Choice of Tests
• Chi-square and Related Tests

A. Comparison with Parametric Tests


Advantages
No assumption of normal distribution
No assumption of data at the interval level
Used for ordinal and nominal data
Used for interval data that fail to satisfy the conditions for parametric tests

Disadvantages
Less powerful than parametric tests: a significant difference is detected only when the sample deviates sufficiently from the population mean/median. A larger sample is required to be able to reject H0 when it should be rejected (i.e., to avoid a Type II error).
Not as informative as parametric tests: significance is calculated on either the ranks or the frequencies of the data rather than their values.


B. Choice of Tests
• Wilcoxon Test
Comparing the medians of two matched samples in repeated-measure, matched-pair and crossover designs. It assumes (academically) that samples come from a population that is symmetrically, but not necessarily normally, distributed.
(The differences of the paired data are ranked, and the sums of the +ve and –ve ranks are used as the statistics.)
• Mann-Whitney Test
Comparing the medians of two independent samples in RCT and cross-sectional designs. It assumes (academically) that the two samples have similar distributions, i.e., it is not for comparing a +ve skewed sample distribution with a –ve skewed one.
(The data from the two samples are ranked together, and the sums of each sample’s ranks are used as the statistics.)
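
Both tests are available in scipy (a sketch with entirely hypothetical data):

  from scipy import stats

  # Repeated-measure design: same participants before and after a treatment (hypothetical)
  before = [22, 25, 19, 30, 27, 24, 28, 21]
  after  = [24, 29, 20, 33, 29, 28, 30, 25]
  w, p_w = stats.wilcoxon(before, after)
  print(f"Wilcoxon: W = {w}, p = {p_w:.3f}")

  # Independent samples (e.g., RCT, hypothetical)
  group1 = [14, 18, 12, 20, 16, 15]
  group2 = [19, 23, 17, 25, 21, 22]
  u, p_u = stats.mannwhitneyu(group1, group2, alternative='two-sided')
  print(f"Mann-Whitney: U = {u}, p = {p_u:.3f}")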

C. Frequency Analysis
The chi-square (χ²) test is one of the best-known frequency analysis tests; there are several variations of it that are applied to different types of data. These tests are used when the main data come from counting occurrences of events/choices.

homogeneity test: testing whether the observed frequencies are non-random

            red    yellow   blue   green
  UK        334    256      328    282
  (Random   300    300      300    300)

or goodness-of-fit test: testing whether a set of observed frequencies is in agreement with the expected frequencies.

            red    yellow   blue   green
  Japan     548    832      410    679
  UK        334    256      328    282
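
Using the UK colour counts above, the chi-square statistic can be computed with scipy (a sketch; the expected counts of 300 each follow the slide):

  from scipy.stats import chisquare

  observed = [334, 256, 328, 282]   # UK counts: red, yellow, blue, green
  expected = [300, 300, 300, 300]   # equal ("random") expectation from the slide

  chi2, p = chisquare(f_obs=observed, f_exp=expected)
  print(f"chi-square = {chi2:.2f}, p = {p:.4f}")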


Chi-Square and Related Tests (Cont.)

Cross-tabulation (an association or contingency table) is used to assess whether the number or frequency observed is influenced by the association of two independent variables. The observed frequencies are compared with expected frequencies generated by multiplying the row and column subtotals and dividing by the grand total.

Example
Does family income influence voters’ choice in a general election?
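
A contingency-table version of the test, with entirely hypothetical income × vote counts, can be run with scipy.stats.chi2_contingency, which builds the expected frequencies exactly as described (row total × column total ⁄ grand total):

  import numpy as np
  from scipy.stats import chi2_contingency

  # Rows: income bands (low, middle, high); columns: parties A, B, C (made-up counts)
  table = np.array([[120,  90,  40],
                    [100, 110,  60],
                    [ 60, 100,  80]])

  chi2, p, dof, expected = chi2_contingency(table)
  print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p:.4f}")
  print("expected frequencies:\n", np.round(expected, 1))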

Correlation and Regression

• Introduction

• Correlation Coefficient

• Partial Correlation

• Regression

• Multiple regression


A. Introduction
Correlation expresses the extent to which two variables vary together (i.e., co-vary). It may be
- a measurement of a direct cause-effect relationship
- a measurement of the co-change of two variables due to the presence of a 3rd variable

Positive correlation is when an increase in one variable is accompanied by an increase in the other.
Negative correlation is when an increase in one variable is accompanied by a decrease in the other.


B. Correlation Coefficient
Correlation coefficient (r) is a statistical index of the degree to
which two variables are related.

[Scatter plots: r = 1 (perfect +ve correlation), r = 0 (no correlation), r = −1 (perfect −ve correlation)]

The Strength of a Correlation

  Value of r      Strength of correlation
  0.00 to 0.19    Very weak
  0.20 to 0.39    Weak
  0.40 to 0.69    Modest
  0.70 to 0.89    Strong
  0.90 to 1.00    Very strong

The Coefficient of Determination (R²) provides a meaningful measurement of the proportion of the variability in one variable that is explained/accounted for by the variability in the other.

For r = 0.9, R² = 0.9 × 0.9 × 100% = 81%:
  81% of the changes in one variable may be explained by the influence of the other.

For r = 0.3, R² = 0.3 × 0.3 × 100% = 9%:
  only 9% of the changes in one variable may be explained by the influence of the other.

In the above examples, the correlation coefficient is only 3 times greater in case 1, but the coefficient of determination is 9 times greater, which better reflects the strength of the relationship in case 1.
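
In Python, r and R² can be obtained with scipy (a sketch using invented measurements):

  import numpy as np
  from scipy.stats import pearsonr

  # Hypothetical paired measurements
  x = np.array([2.0, 3.1, 4.2, 5.0, 6.3, 7.1, 8.4])
  y = np.array([1.8, 2.9, 3.5, 5.2, 5.9, 7.4, 8.1])

  r, p = pearsonr(x, y)
  print(f"r = {r:.3f}, R^2 = {r**2:.3f}, p = {p:.4f}")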


C. Overlapping Variance
Partial correlation is when the relationship between two variables is examined with the exclusion of another predictor.

[Venn diagram: overlapping variance (regions 1–4) among Age, Cochlear function and Central processing in relation to Hearing loss]

D. Regression Analysis
Correlation analysis indicates the relationship between two variables.
Regression analysis quantifies the magnitude and direction of the relationship between two variables.
The regression equation is the mathematical description of the relationship, which may be used to predict/estimate the value of one variable from a measurement of the other, i.e., y = a + bx.
The line of least squares is the method adopted for fitting the regression line, whereby the sum of the squared vertical distances of all points from the line is minimised.
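
A least-squares fit of y = a + bx with scipy (a sketch, reusing the hypothetical x and y from the correlation example):

  from scipy.stats import linregress

  x = [2.0, 3.1, 4.2, 5.0, 6.3, 7.1, 8.4]   # same made-up data as before
  y = [1.8, 2.9, 3.5, 5.2, 5.9, 7.4, 8.1]

  fit = linregress(x, y)
  print(f"y = {fit.intercept:.2f} + {fit.slope:.2f}x,  r = {fit.rvalue:.3f}, R^2 = {fit.rvalue**2:.3f}")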


[Scatter plot with fitted regression line Y = b0 + b1X, where a (b0) is the intercept and the slope b1 = (change in y) ⁄ (change in x)]

E. Multiple Regression
Simple regression examines the relationship between two variables.
Multiple regression examines the influence of many independent variables (predictors) on the measured dependent variable.

[Diagram: overlapping variance (regions 1–5) among Education, Previous experience and Beginning salary in predicting Final salary]
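
A minimal multiple-regression sketch using ordinary least squares in numpy (all predictor and salary figures are invented):

  import numpy as np

  # Hypothetical predictors: years of education, years of previous experience
  education  = np.array([12, 16, 14, 18, 16, 12, 20, 14], dtype=float)
  experience = np.array([ 1,  3,  5,  2,  8,  4,  6, 10], dtype=float)
  final_salary = np.array([22, 30, 28, 33, 36, 26, 40, 34], dtype=float)   # dependent variable (in £1000s)

  # Design matrix with an intercept column
  X = np.column_stack([np.ones(len(education)), education, experience])
  coef, *_ = np.linalg.lstsq(X, final_salary, rcond=None)

  b0, b_edu, b_exp = coef
  print(f"final_salary ≈ {b0:.1f} + {b_edu:.2f}*education + {b_exp:.2f}*experience")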


Tests for Three or More Samples

• Introduction

• Non-parametric Tests for Multiple Samples

• ANOVA (Analysis of Variance)

A. Introduction
Why not do repeated two-sample tests?
- increased probability of committing a Type I error
- increased probability of committing a Type II error if a reduced significance level is used

Basis of multiple-sample analysis:
Difference = (inter-individual variability + intra-individual variability + random fluctuations) + treatment variability

for independent samples:
  DIFFERENCE = [(inter-) + (intra-) + (random) + (treatment)] ⁄ [(inter-) + (intra-) + (random)]

for related (or matched) samples:
  DIFFERENCE = [(intra-) + (random) + (treatment)] ⁄ [(intra-) + (random)]


B. Non-parametric Tests for Multiple Samples

C. ANOVA
ANOVA is a parametric technique for ≥ 3 samples and it has the same assumptions as the t-test:

- All data sets are normally distributed
- Interval data level
- Equal variance of independent samples (Levene’s test)
- Sphericity of related samples (equal variance of the differences between any pair of samples, Mauchly’s test)

ANOVA is a very versatile technique, used to explore various ways of analysing data from complicated research designs.
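
A one-way ANOVA for three independent groups can be run with scipy (a sketch with invented scores):

  from scipy import stats

  # Hypothetical scores for three independent groups (e.g., three treatments)
  group_a = [23, 25, 21, 27, 24, 26]
  group_b = [30, 28, 33, 29, 31, 32]
  group_c = [26, 27, 24, 29, 25, 28]

  f_stat, p = stats.f_oneway(group_a, group_b, group_c)
  print(f"F = {f_stat:.2f}, p = {p:.4f}")   # a significant F -> follow up with post-hoc comparisons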


A Bonferroni adjustment is necessary when, for example, comparisons have to be carried out for all possible pairs.
Multiple comparisons within the same test increase the risk of a Type I error (rejecting H0 when it is true).

Simple adjustment
  for n comparisons, apply αadj = α ⁄ n
  so for A, B & C, αadj = 0.05 ⁄ 3 = 0.0167

Bonferroni adjustment
  for n comparisons, αadj = 1 − ⁿ√(1 − α)
  so for A, B & C, αadj = 1 − ³√(1 − 0.05) = 1 − 0.9831 = 0.0169
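
The two adjustments in Python (a few lines of arithmetic; the exponent form 1 − (1 − α)^(1/n) is sometimes also called the Šidák correction):

  alpha, n = 0.05, 3   # three pairwise comparisons among A, B and C

  simple_adj = alpha / n                      # 0.0167
  exponent_adj = 1 - (1 - alpha) ** (1 / n)   # ~0.0170

  print(f"alpha/n = {simple_adj:.4f}, 1-(1-alpha)^(1/n) = {exponent_adj:.4f}")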

Corrections are built in with ANOVA in SPSS:

A post-hoc test is often carried out when a significant difference is indicated by an ANOVA test. Comparisons are made between all possible combinations, and corrections are made to avoid increasing the risk of committing Type I errors.

Twelve such tests are available in SPSS; they differ in how cautious they are in making corrections. The more cautious/conservative a test, the more difficult it is to reject H0.
Examples

  Least significant difference (LSD)   Same as multiple t-tests   No corrections made
  Duncan’s multiple range test         Least conservative
  Tukey’s HSD test                     In between                 Pair-wise comparison
  Scheffe’s test                       Most conservative          Compare all groups


Are They Related?
Correlation, Regression, ANOVA

SS: sum of squares
MS: mean sum of squares = SS ⁄ degrees of freedom

F = MSM ⁄ MSR

What is Regression based on?

SST: When the mean is used to predict values, the difference between the data and the mean is the total sum of squares.

SSR: When a model is used, it will not match the data perfectly, so there is a residual sum of squares.

SSM: The improvement of prediction due to the model (in comparison with the mean) gives the model sum of squares.

SSM = SST − SSR

R² = SSM ⁄ SST
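
A short numpy sketch (hypothetical data) computing SST, SSR and SSM for a simple linear model and confirming R² = SSM ⁄ SST:

  import numpy as np
  from scipy.stats import linregress

  x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
  y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])   # made-up observations

  fit = linregress(x, y)
  y_hat = fit.intercept + fit.slope * x

  sst = np.sum((y - y.mean()) ** 2)   # total sum of squares (prediction by the mean)
  ssr = np.sum((y - y_hat) ** 2)      # residual sum of squares (prediction by the model)
  ssm = sst - ssr                     # model sum of squares

  print(f"R^2 = SSM/SST = {ssm / sst:.3f}  (r^2 from the fit: {fit.rvalue**2:.3f})")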


How is correlation measured?

Variance = (sum of squares) ⁄ (degrees of freedom)
  s² = [∑(xᵢ − x̄)²] ⁄ (n − 1)

Standard deviation = √variance = s

Covariance(X, Y) = [(deviation of X)(deviation of Y)] ⁄ (degrees of freedom)
  cov(x, y) = [∑(xᵢ − x̄)(yᵢ − ȳ)] ⁄ (n − 1)

Correlation coefficient = standardised covariance
  r = cov(x, y) ⁄ [(sx)(sy)] = [∑(xᵢ − x̄)(yᵢ − ȳ)] ⁄ [(n − 1)(sx)(sy)]

R² = SSM ⁄ SST = r × r
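
These formulas can be checked directly in numpy (a sketch, reusing the made-up x and y above):

  import numpy as np

  x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
  y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])
  n = len(x)

  cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)   # covariance by the formula
  r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))                 # standardised covariance

  print(f"cov = {cov_xy:.3f}, r = {r:.3f}, check: {np.corrcoef(x, y)[0, 1]:.3f}")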

What happens in ANOVA?

Mean squares (MS) = sum of squares ⁄ degrees of freedom

F = MSM ⁄ MSR

MSM: mean variance due to the experimental manipulation (e.g., 3 dose levels)
MSR: mean variance caused by confounding variables (within-sample differences)
