
Nonparametric Statistical

Methods

Presented by
Guo Cheng, Ning Liu, Faiza Khan, Zhenyu Zhang, Du Huang,
Christopher Porcaro, Hongtao Zhao, Wei Huang

1
Introduction
Definition
 Nonparametric (rank-based) methods are used when the form of the
population distribution from which the data are sampled is unknown.
 They are often used when the sample size is small.
 They are used when the data are measured on an ordinal
scale and only their ranks are meaningful.

3
Outline
 1. Sign Test
 2. Wilcoxon Signed Rank Test
 3. Inferences for Two Independent Samples
 4. Inferences for Several Independent Samples
 5. Friedman Test
 6. Spearman’s Rank Correlation
 7. Kendall’s Rank Correlation Coefficient

4
1. Sign Test

5
Parameter of interest: Median
The median is used as the parameter of interest because,
for skewed distributions, it is a better measure of the
center of the data than the mean.

6
Hypothesis test
H0: µ = µ0 vs. Ha: µ > µ0, where µ0 is a specified
value and µ is the unknown population median.

7
Testing Procedure
 Step 1: Given a random sample x1, x2, …, xn from a
population with unknown median µ, count the
number of xi’s that exceed µ0.
 Denote this count by s+.
 Let s- = n - s+, the number of xi’s that fall below µ0
(assuming no xi equals µ0).
 Step 2: Reject H0 if s+ is large or, equivalently, if s- is small.
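A minimal Python sketch of the two steps above (the slides use SAS; Python is used here only for illustration). It assumes the standard result that, under H0, S+ follows a Binomial(n, 1/2) distribution, and it drops observations equal to µ0. The function name `sign_test` and the sample values are hypothetical and do not come from the slides.

```python
from math import comb

def sign_test(xs, mu0):
    """Upper-tailed sign test of H0: median = mu0 vs Ha: median > mu0."""
    # Observations equal to mu0 carry no sign information and are dropped.
    kept = [x for x in xs if x != mu0]
    n = len(kept)
    s_plus = sum(1 for x in kept if x > mu0)   # Step 1: count values above mu0
    s_minus = n - s_plus
    # Step 2: under H0, S+ ~ Binomial(n, 1/2); p-value = P(S+ >= s_plus).
    p_value = sum(comb(n, k) for k in range(s_plus, n + 1)) / 2 ** n
    return s_plus, s_minus, p_value

# Hypothetical readings, tested against mu0 = 200.
sample = [201.1, 203.4, 199.2, 202.5, 204.0, 198.7, 201.9, 202.2, 200.6, 203.0]
s_plus, s_minus, p_value = sign_test(sample, 200)
print(s_plus, s_minus, p_value)
```

A small p-value corresponds to "s+ is large" in Step 2; doubling the one-sided value gives a two-sided p-value for this symmetric null distribution.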

8
How to reject H0?
 To determine how large s+ must be in order to
reject H0, we need to find out the distribution of
the corresponding random variable S+.
 Xi: random variable corresponding to the observed
values xi
 S-: random variable corresponding to s-

9
Distribution of S+ and S-

10
Calculating P-value

11
Rejection criteria

12
Large sample z-test

13
Confidence Interval

14
Example

15
SAS code

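/* Thermostat temperatures; the remaining data lines are not shown on the slide */
DATA thermostat;
  INPUT temp;
  DATALINES;
202.2
203.4

;
/* LOCCOUNT counts observations above and below MU0; MU0= sets the null location */
PROC UNIVARIATE DATA=thermostat LOCCOUNT MU0=200;
  VAR temp;
RUN;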
16
SAS Output
Basic Statistical Measures
     Location                     Variability
Mean      201.7700      Std Deviation          2.41019
Median    201.7500      Variance               5.80900
Mode      .             Range                  8.30000
                        Interquartile Range    2.90000

Tests for Location: Mu0=200
Test           -Statistic-       -----p Value------
Student's t    t   2.322323      Pr > |t|     0.0453
Sign           M   3             Pr >= |M|    0.1094
Signed Rank    S   19.5          Pr >= |S|    0.048

17
2. Wilcoxon signed rank test

18
Inventor

Frank Wilcoxon (2 September 1892 in County Cork, Ireland – 18 November
1965, Tallahassee, Florida, USA) was a chemist and statistician, known for
the development of several statistical tests.

19
What is it used for?
 Two related samples
 Matched samples
 Repeated measurements on a single sample
Hypothesis

21
Testing procedure
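
The procedure itself is not reproduced in the slide text, so the following is only a sketch of the usual one-sample signed rank computation. It assumes differences from µ0 are ranked by absolute value, ties receive average ranks, and zero differences are dropped. The helper names and the sample values are hypothetical. The exact p-value enumerates all 2^n sign patterns, which is practical only for small n.

```python
from itertools import product

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def signed_rank_test(xs, mu0):
    """Upper-tailed signed rank test of H0: median = mu0."""
    diffs = [x - mu0 for x in xs if x != mu0]
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(ranks) - w_plus
    # Under H0 each rank is equally likely to carry a + or - sign.
    n = len(diffs)
    hits = sum(
        1
        for signs in product((0, 1), repeat=n)
        if sum(r for r, s in zip(ranks, signs) if s) >= w_plus
    )
    return w_plus, w_minus, hits / 2 ** n

# Hypothetical readings, tested against mu0 = 200.
sample = [201.2, 199.6, 202.2, 200.9, 199.7, 201.7]
w_plus, w_minus, p_value = signed_rank_test(sample, 200)
print(w_plus, w_minus, p_value)
```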

22
Example

23
SAS code
/* Same thermostat data; the remaining data lines are not shown on the slide */
DATA thermo;
  INPUT temp;
  DATALINES;
202.2
203.4

;
PROC UNIVARIATE DATA=thermo LOCCOUNT MU0=200;
  TITLE "Wilcoxon signed rank test: the thermostat";
  VAR temp;
RUN;

24

SAS outputs (selected results)


Basic Statistical Measures
     Location                     Variability
Mean      201.7700      Std Deviation          2.41019
Median    201.7500      Variance               5.80900
Mode      .             Range                  8.30000
                        Interquartile Range    2.90000

Tests for Location: Mu0=200
Test           -Statistic-       -----p Value------
Student's t    t   2.322323      Pr > |t|     0.0453
Sign           M   3             Pr >= |M|    0.1094
Signed Rank    S   19.5          Pr >= |S|    0.048
25
Large sample approximation

26
Derive E(x) & Var(x)

27
Rejection region:

28
3. Inferences for Two
Independent Samples

29
Hypothesis
Definition

31
Definition

32
Wilcoxon rank sum test

33
Mann-Whitney U test

34
Between two tests

35
Advantages

36
For large samples

37
For large samples

38
Treatment of ties

39
Example
 To test whether two classes taught by the same teacher
have the same grades, we randomly pick 7 students
from Class A and 9 from Class B. Their scores are as follows:

 A: 8.50 9.48 8.65 8.16 8.83 7.76 8.63

 B: 8.27 8.20 8.25 8.14 9.00 8.10 7.20 8.32 7.70

40
Example

Score  7.20  7.70  7.76  8.10  8.14  8.16  8.20  8.25
Class  B     B     A     B     B     A     B     B
Rank   1     2     3     4     5     6     7     8

Score  8.27  8.32  8.50  8.63  8.65  8.83  9.00  9.48
Class  B     B     A     A     A     A     B     A
Rank   9     10    11    12    13    14    15    16
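
The pooled ranking above can be reproduced with a short Python sketch (Python is used only for illustration; the deck's own analysis is in SAS). It assumes no tied scores, which holds for these data, and uses the usual large-sample mean n1(N+1)/2 and variance n1·n2·(N+1)/12 with a continuity correction of 0.5 toward the mean. The function name `rank_sum_test` is hypothetical.

```python
from math import sqrt

class_a = [8.50, 9.48, 8.65, 8.16, 8.83, 7.76, 8.63]
class_b = [8.27, 8.20, 8.25, 8.14, 9.00, 8.10, 7.20, 8.32, 7.70]

def rank_sum_test(first, second):
    """Wilcoxon rank sum of `first` within the pooled sample (no ties assumed)."""
    pooled = sorted(first + second)
    assert len(set(pooled)) == len(pooled), "this sketch does not handle ties"
    rank_of = {value: pos + 1 for pos, value in enumerate(pooled)}
    w1 = sum(rank_of[value] for value in first)
    n1, n2 = len(first), len(second)
    total = n1 + n2
    mean = n1 * (total + 1) / 2                 # E(W1) under H0
    sd = sqrt(n1 * n2 * (total + 1) / 12)       # SD(W1) under H0
    # Continuity correction: move the statistic 0.5 toward its mean.
    z = (w1 - mean - 0.5) / sd if w1 > mean else (w1 - mean + 0.5) / sd
    u1 = w1 - n1 * (n1 + 1) / 2                 # equivalent Mann-Whitney U
    return w1, mean, sd, z, u1

w_a, mean_a, sd_a, z_a, u_a = rank_sum_test(class_a, class_b)
print(w_a, mean_a, round(sd_a, 6), round(z_a, 4), u_a)
```

For Class A this gives a rank sum of 75 with expected value 59.5, which can be compared with the SAS output for the same data.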

41
Example

42
Example

43
SAS code
Data exam;
Input group $ score @@;
Datalines;
A 8.50 A 9.48 A 8.65 A 8.16 A 8.83 A 7.76 A 8.63
B 8.27 B 8.20 B 8.25 B 8.14 B 9.00 B 8.10 B 7.20
B 8.32 B 7.70
;

44
SAS code
/* WILCOXON requests the rank sum test; EXACT WILCOXON adds exact p-values */
Proc npar1way data=exam wilcoxon;
Var score;
Class group;
Exact wilcoxon;
Run;

45
Output
Wilcoxon Scores (Rank Sums) for Variable score
Classified by Variable group

group   N   Sum of Scores   Expected Under H0   Std Dev Under H0   Mean Score
A       7       75.0             59.50              9.447222        10.714286
B       9       61.0             76.50              9.447222         6.777778

46
Output
Wilcoxon Two-Sample Test
Statistic (S)                   75.0000

Normal Approximation
Z                                1.5878
One-Sided Pr > Z                 0.0562
Two-Sided Pr > |Z|               0.1123

t Approximation
One-Sided Pr > Z                 0.0666
Two-Sided Pr > |Z|               0.1332

Exact Test
One-Sided Pr >= S                0.0571
Two-Sided Pr >= |S - Mean|       0.1142

Z includes a continuity correction of 0.5.

47
Output

48
4. Inferences for Several
Independent Samples

49
Introduction
 If the data are normally distributed and the population
standard deviations are equal, we can test for a difference
among several populations using the one-way ANOVA F test.

50
When to use Kruskal-Wallis test?
 But what happens when our data are not normal?
 This is when we use the nonparametric Kruskal-Wallis test
to compare more than two populations, as long as the data
come from a continuous distribution.
 The idea of the kw rank test is to rank all the data from
the groups together and then apply one-way ANOVA to the
ranks rather than to the original data.
51
Kruskal-Wallis Test (kw Test)
 A nonparametric method for testing whether
samples originate from the same distribution.
 Used for comparing more than two samples
that are independent.

52
Kruskal-Wallis Test: History
 William Henry Kruskal
 October 10th, 1919 – April 21st, 2005
 Obtained Bachelor's and Master's degrees in Mathematics at
Harvard University and received his Ph.D. from Columbia
University in 1955.
 Wilson Allen Wallis
 November 5th, 1912 – October 12th, 1998
 Undergraduate work at the University of Minnesota and
graduate work at the University of Chicago in 1933.

53
Kruskal-Wallis Test: Steps
1. Create Hypothesis:
Null Hypothesis (Ho): The samples come from identical
populations
Alternative Hypothesis (Ha): At least one sample comes from
a different population

54
Kruskal-Wallis Test: Steps
2. Rank all the data together. The lowest number gets the
lowest rank and so on. Tied values get the average
of the ranks they would have obtained if they
weren’t tied.

3. Add up the ranks within each sample. Label these sums
L1, L2, L3, and L4.

55
Kruskal-Wallis Test: Steps
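4. Find Test Statistic:

kw = \frac{12}{n(n+1)} \sum_{i} \frac{L_i^2}{n_i} - 3(n+1)

n = total number of observations in all samples
n_i = number of observations in sample i
Li = total rank of each sample
kw = test statistic

5. Reject Ho if kw is greater than the chi-square table value with (number of samples - 1) degrees of freedom.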
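
The steps above can be sketched in Python (for illustration only; the deck's own analysis is in SAS). The sketch uses the teaching-method scores from this section's SAS example. It assumes the standard statistic 12/(n(n+1)) · Σ Li²/ni − 3(n+1), and it additionally applies the usual tie correction, dividing by 1 − Σ(t³ − t)/(n³ − n). That correction is not described on the slides. The function names are hypothetical.

```python
scores = {
    "case":     [14.59, 23.44, 25.43, 18.15, 20.82, 14.06, 14.26],
    "formula":  [20.27, 26.84, 14.71, 22.34, 19.49, 24.92, 20.20],
    "equation": [27.82, 24.92, 28.68, 23.32, 32.85, 33.90, 23.42],
    "unitary":  [33.16, 26.93, 30.43, 36.43, 37.04, 29.76, 33.88],
}

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def kruskal_wallis(groups):
    pooled = [v for g in groups.values() for v in g]
    ranks = average_ranks(pooled)                      # step 2: rank everything
    n = len(pooled)
    rank_sums, start = {}, 0
    for name, g in groups.items():                     # step 3: rank sums L_i
        rank_sums[name] = sum(ranks[start:start + len(g)])
        start += len(g)
    kw = 12 / (n * (n + 1)) * sum(
        rank_sums[name] ** 2 / len(groups[name]) for name in groups
    ) - 3 * (n + 1)                                    # step 4: test statistic
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    return rank_sums, kw, kw / correction

rank_sums, kw, kw_tie_corrected = kruskal_wallis(scores)
print(rank_sums, round(kw, 4), round(kw_tie_corrected, 4))
```

The statistic is then compared with the chi-square critical value on 3 degrees of freedom (four samples minus one), as in step 5.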

56
Kruskal-Wallis Test: Example
 An experiment was done to compare four different
ways of teaching a concept to a class of students.
In this experiment, 28 tenth-grade classes were
randomly assigned to the four methods (7 classes
per method). A 45-question test was given to each
class. The average test scores of the classes are
given in the following table. Apply the Kruskal-Wallis
test to the test scores data set.

57
Kruskal-Wallis Test: Example

[Tables: given data; ranks of the data values]

58
Kruskal-Wallis Test: Example

59
Kruskal-Wallis Test: Example

60
SAS Input
/* With more than two classes, the WILCOXON option also prints the Kruskal-Wallis test */
data test;
input methodname $ scores;
cards;
case 14.59
case 23.44
case 25.43
case 18.15
case 20.82
case 14.06
case 14.26
formula 20.27
formula 26.84
formula 14.71
formula 22.34
formula 19.49
formula 24.92
formula 20.20
equation 27.82
equation 24.92
equation 28.68
equation 23.32
equation 32.85
equation 33.90
equation 23.42
unitary 33.16
unitary 26.93
unitary 30.43
unitary 36.43
unitary 37.04
unitary 29.76
unitary 33.88
;
proc npar1way data=test wilcoxon;
class methodname;
var scores;
run;
61
SAS Output
Wilcoxon Scores (Rank Sums) for Variable scores
Classified by Variable methodname

methodname   N   Sum of Scores   Expected Under H0   Std Dev Under H0   Mean Score
case         7        49.00           101.50            18.845498        7.000000
formula      7        66.50           101.50            18.845498        9.500000
equation     7       125.50           101.50            18.845498       17.928571
unitary      7       165.00           101.50            18.845498       23.571429

Average scores were used for ties.

Kruskal-Wallis Test
Chi-Square          18.1390
DF                        3
Pr > Chi-Square      0.0004

62
5. Friedman Test

63
Introduction
 The Friedman test is a distribution-free, rank-based
test for comparing treatments. It is named after the
Nobel Laureate economist Milton Friedman, who
proposed it.
 The Friedman test is a version of the
repeated-measures ANOVA that can be
performed on ordinal (ranked) data.
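
A minimal Python sketch of the computation, under assumptions not shown in the slide text: observations are ranked within each block, the rank sums R_j are taken per treatment, and the statistic 12/(a·b·(a+1)) · Σ R_j² − 3·b·(a+1) is compared with a chi-square distribution on a − 1 degrees of freedom (a treatments, b blocks, no tie correction). The data and function names are hypothetical.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def friedman_statistic(blocks):
    """blocks: one row per block, one column per treatment."""
    b = len(blocks)
    a = len(blocks[0])
    rank_sums = [0.0] * a
    for row in blocks:
        # Rank the treatments separately inside each block.
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    stat = 12 * sum(r * r for r in rank_sums) / (a * b * (a + 1)) - 3 * b * (a + 1)
    return rank_sums, stat

# Hypothetical responses: 3 blocks (rows) by 4 treatments (columns).
blocks = [
    [1.2, 2.3, 3.1, 4.0],
    [1.5, 2.1, 3.9, 3.3],
    [0.9, 2.8, 2.5, 4.4],
]
rank_sums, stat = friedman_statistic(blocks)
print(rank_sums, stat)
```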

64
Steps in the Friedman test

65
Steps in the Friedman test

66
Example
Suppose we have 8 treatments arranged in 3 blocks, and
α = 0.025.

67
Define Null and Alternative
Hypothesis
 H0: There is no difference among the 8 treatments
 Ha: There is a difference among the 8 treatments

68
Rank Sum

69
Friedman Test

70
Conclusion

71
6. Spearman’s Rank
Correlation Coefficient

72
Introduction
 From Pearson to Spearman
 Spearman’s Rank Correlation Coefficient
 Large-Sample Approximation
 Hypothesis Test
 Examples

73
From Pearson to Spearman
 Pearson’s
 Measures only the degree of linear association
 Based on the assumption of bivariate normality of the two variables
 Spearman’s
 Takes into account only the ranks
 Measures the degree of monotone association
 Inferences on the rank correlation coefficient are distribution-free
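
The contrast between linear and monotone association can be illustrated with a small Python sketch. It assumes Spearman's coefficient is computed as Pearson's coefficient applied to the ranks, with no ties. The data are hypothetical: y = x³ is perfectly monotone in x but not linear.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """1-based ranks, assuming no tied values."""
    position = {v: i + 1 for i, v in enumerate(sorted(values))}
    return [position[v] for v in values]

def spearman(x, y):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]          # monotone but nonlinear in x
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(round(r_pearson, 4), r_spearman)
```

Pearson's coefficient falls short of 1 because the relationship is curved, while the rank correlation equals 1 because the ordering of y matches the ordering of x exactly.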
74
From Pearson to Spearman

75
From Pearson to Spearman
Charles Edward Spearman (10 Sept. 1863 – 17 Sept. 1945)
 As a psychologist
① General factor of intelligence
② The nature and causes of variations in human
 As a statistician
① Rank correlation
② Two-way analysis
③ Correlation coefficient

76
Spearman’s Rank Correlation Coefficient

77
Spearman’s Rank Correlation
Coefficient

78
Large sample approximation

79
Hypothesis testing

80
Example
Table 5.1 Wine Consumption and Heart Disease Deaths

81
Example

82
Example
Table 5.2 Ranks of Wine Consumption and Heart Disease Deaths

83
Example

84
Example

85
7. Kendall’s Rank Correlation
Coefficient

86
Kendall’s Tau
 It is a coefficient used to measure the association
between two sets of ranked data observed in pairs.
 Named after the British statistician Maurice Kendall,
who developed it in 1938.
 Ranges from -1.0 to 1.0
 Tau-a (with no ties) and Tau-b (with ties)

87
Formula for Tau-a

88
Concordant and Discordant

89
Example 1 Kendall’s tau-a
 Raw data for 11 students in 2 exams:
Exam 1 Exam 2
85 85
98 95
90 80
83 75
57 70
63 65
77 73
99 93
80 79
96 88
69 74
90
Ranks of exam results
Exam 1 (x)   Exam 2 (y)   c    d
1            2            9    1
2            1            9    0
3            4            7    1
4            3            7    0
5            6            5    1
6            5            5    0
7            8            3    1
8            7            3    0
9            9            2    0
10           11           0    1
11           10           0    0
                          C=50 D=5
91
Calculation for τ̂

92
Steps for calculating τ̂
1. Sort the data by x in ascending order and pair each x rank with its y rank
2. For each y, count the later y values that are larger (c) and smaller (d)
3. Sum the counts to get C and D
4. Use the formula to calculate τ̂
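
The steps above can be sketched in Python using the exam scores from Example 1 (Python is used only for illustration). The sketch assumes the usual tau-a definition, (C − D) divided by the number of pairs n(n − 1)/2, and counts concordant and discordant pairs directly from the raw scores, which gives the same totals as working from ranks. The function name is hypothetical.

```python
exam1 = [85, 98, 90, 83, 57, 63, 77, 99, 80, 96, 69]
exam2 = [85, 95, 80, 75, 70, 65, 73, 93, 79, 88, 74]

def concordance_counts(x, y):
    """Return (C, D): numbers of concordant and discordant pairs."""
    c = d = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            if product > 0:      # both variables order the pair the same way
                c += 1
            elif product < 0:    # the two variables disagree on the order
                d += 1
            # product == 0 means a tie; it counts toward neither total
    return c, d

C, D = concordance_counts(exam1, exam2)
n = len(exam1)
tau_a = (C - D) / (n * (n - 1) / 2)
print(C, D, round(tau_a, 5))
```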

93
Formula for tau-b(with ties)

94
Example 2 Kendall’s tau-b
Wine Consumption and heart disease deaths data
i    Country        xi    yi    c    d
1    Ireland        0.7   300   0    18
2    Iceland        0.8   211   3    11
3    Norway         0.8   227   2    13
4    Finland        0.8   297   0    15
5    U.S.           1.2   199   5    9
6    U.K.           1.3   285   0    13
7    Sweden         1.6   207   3    9
8    Netherlands    1.8   167   5    5
9    N.Z.           1.9   266   0    10
10   Canada         2.4   191   2    7
11   Australia      2.5   211   1    7
12   Germany        2.7   172   1    6
13   Belgium        2.9   131   1    4
14   Denmark        2.9   220   0    5
15   Austria        3.9   167   0    4
16   Switzerland    5.8   115   0    3
17   Spain          6.5   86    1    1
18   Italy          7.9   107   0    1
19   France         9.1   71    0    0
                                C=24 D=141
95
Calculation for tau-b
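
A Python sketch of the calculation for the wine data. It assumes the standard tau-b definition, (C − D) / sqrt((N − Tx)(N − Ty)), where N = n(n − 1)/2 is the number of pairs and Tx, Ty are the numbers of pairs tied on x and on y. A pair tied on either variable counts toward neither C nor D, so the Belgium–Denmark pair (both x = 2.9) is treated as tied. The function name is hypothetical.

```python
from math import sqrt

# Wine consumption (x) and heart disease deaths (y), in the table's row order.
wine = [0.7, 0.8, 0.8, 0.8, 1.2, 1.3, 1.6, 1.8, 1.9, 2.4,
        2.5, 2.7, 2.9, 2.9, 3.9, 5.8, 6.5, 7.9, 9.1]
deaths = [300, 211, 227, 297, 199, 285, 207, 167, 266, 191,
          211, 172, 131, 220, 167, 115, 86, 107, 71]

def kendall_tau_b(x, y):
    """Return (C, D, tau_b), with tied pairs excluded from C and D."""
    c = d = tied_x = tied_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0:
                tied_x += 1
            if dy == 0:
                tied_y += 1
            if dx * dy > 0:
                c += 1
            elif dx * dy < 0:
                d += 1
    pairs = n * (n - 1) / 2
    tau_b = (c - d) / sqrt((pairs - tied_x) * (pairs - tied_y))
    return c, d, tau_b

C, D, tau_b = kendall_tau_b(wine, deaths)
print(C, D, round(tau_b, 4))
```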

96
Hypothesis Test for τ

97
Hypothesis test results

98
Hypothesis test results

99
100
Example 1 extension

101
102
103
SAS Code
Data exams;
Input exam1 exam2;
Datalines;
85 85
98 95

;
Run;
Proc corr data=exams kendall;
Var exam1 exam2;
Run;
104
SAS output
The CORR Procedure
2 Variables: exam1 exam2

Simple Statistics
Variable   N    Mean       Std Dev    Median     Minimum    Maximum
exam1      11   81.54545   14.13056   83.00000   57.00000   99.00000
exam2      11   79.72727    9.58218   79.00000   65.00000   95.00000

Kendall Tau b Correlation Coefficients, N = 11
Prob > |tau| under H0: Tau=0
          exam1      exam2
exam1     1.00000    0.81818
                     0.0005
exam2     0.81818    1.00000
          0.0005
105
8. Conclusion

106
Summary
 Nonparametric tests are useful when little is known
about the population distribution.
 In particular, when the distribution is not normal and the
t-test is not appropriate, nonparametric methods provide
an alternative.
 The median is a better measure of central
tendency than the mean for skewed, non-normal populations.
 The data can be ordinal, and the sample size is
often small.
107
Summary
In summary, we have briefly introduced some of the most
common methods in our presentation, including:
Sign test

Wilcoxon rank sum test and signed rank test

Kruskal-Wallis Test

Friedman Test

Spearman’s Rank Correlation

Kendall’s Rank Correlation Coefficient

108
Questions

109
The End.

Thank You!

110
