
The Kruskal-Wallis test is a nonparametric method of testing the hypothesis that several populations have the same continuous distribution versus the alternative that measurements tend to be higher in one or more of the populations. To apply the test, we obtain independent random samples of sizes n1, n2, ..., nm from m populations. Assume that there are N observations in all. We rank all N observations and let Ri be the sum of the ranks of the ni observations in the ith sample. The Kruskal-Wallis statistic is H = [12 / (N (N + 1))] * Sum[ Ri^2 / ni , i, 1, m ] - 3(N + 1). When the sample sizes are large and all m populations have the same continuous distribution, H has an approximate chi-square distribution with m - 1 degrees of freedom. When H is large, giving a small right-tail probability (p-value), we reject the null hypothesis that all populations have the same distribution.

Kruskal-Wallis test. The Kruskal-Wallis test is a non-parametric alternative to one-way (between-groups) ANOVA. It is used to compare three or more samples, and it tests the null hypothesis that the different samples in the comparison were drawn from the same distribution or from distributions with the same median. Thus, the interpretation of the Kruskal-Wallis test is basically similar to that of the parametric one-way ANOVA, except that it is based on ranks rather than means. For more details, see Siegel & Castellan, 1988. See also, Nonparametric Statistics.
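The H statistic defined above can be computed directly from the formula. Here is a minimal pure-Python sketch (the function and helper names are my own; the document's own worked examples use SAS):

```python
from itertools import chain

def rank_all(values):
    """Rank the pooled values from 1..N, averaging the ranks of ties."""
    s = sorted(values)
    ranks = {}
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        # tied values occupy ranks i+1 .. j; give each their mean
        ranks[s[i]] = (i + 1 + j) / 2
        i = j
    return ranks

def kruskal_wallis_h(samples):
    """H = 12/(N(N+1)) * sum(Ri^2 / ni) - 3(N+1), as in the text."""
    pooled = list(chain.from_iterable(samples))
    N = len(pooled)
    r = rank_all(pooled)
    total = 0.0
    for sample in samples:
        Ri = sum(r[v] for v in sample)   # rank sum for this sample
        total += Ri ** 2 / len(sample)
    return 12.0 / (N * (N + 1)) * total - 3 * (N + 1)

# Three small illustrative samples with no ties: ranks are 1..9,
# so the rank sums are 6, 15 and 24, giving H = 7.2.
h = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])  # 7.2
```

When samples do not overlap at all, as here, H approaches its maximum for the given sample sizes, which is why this toy example gives a large value.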

Friedman's Test
Friedman's Test is a nonparametric alternative to two-way analysis of variance. The hypothesis is:
H0: the means of all the samples are equal
H1: the mean of at least one of the samples is different
Consider the example where three treatments are evaluated on four different patients:

Therapy          Andrew  Belinda  Chris  Dave
Relaxed          110     140      100    130
Normal           115     150      105    135
High Intensity   117     155      100    135

The test involves:

• impose ranks on each of the columns. If values are equal, average the ranks they would have got if they were slightly different:

Therapy          Andrew  Belinda  Chris  Dave
Relaxed          1       1        1.5    1
Normal           2       2        3      2.5
High Intensity   3       3        1.5    2.5

• calculate the Fr statistic using the formula:

Fr = [12 / (J I (I + 1))] * Sum[ Ri^2 , i, 1, I ] - 3 J (I + 1)

Where:
I    Number of samples (treatments)
J    Number of blocks
Ri   sum of the ranks in row 'i'

This gives a value for Fr of 4.625. The value of Fr has an approximately chi-square distribution with I - 1 degrees of freedom.
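The ranking and plug-in arithmetic for the therapy example can be checked with a short pure-Python sketch (names are my own; note that the plain Friedman formula, with no correction for ties, gives exactly 4.625 here):

```python
def rank_row(row):
    """Rank one block's values, averaging the ranks of ties."""
    out = []
    for v in row:
        less = sum(1 for w in row if w < v)
        equal = sum(1 for w in row if w == v)
        # tied values occupy ranks less+1 .. less+equal; average them
        out.append(less + (equal + 1) / 2)
    return out

# One row per patient (block); columns are Relaxed, Normal, High Intensity.
blocks = [
    [110, 115, 117],  # Andrew
    [140, 150, 155],  # Belinda
    [100, 105, 100],  # Chris
    [130, 135, 135],  # Dave
]
I, J = 3, len(blocks)                     # I treatments, J blocks
ranks = [rank_row(b) for b in blocks]     # rank within each patient
R = [sum(col) for col in zip(*ranks)]     # treatment rank sums: [4.5, 9.5, 10.0]
Fr = 12 / (J * I * (I + 1)) * sum(r * r for r in R) - 3 * J * (I + 1)  # 4.625
```

The rank sums 4.5, 9.5 and 10.0 match the ranks table above, and Fr is then compared against a chi-square distribution with I - 1 = 2 degrees of freedom.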

Kruskal-Wallis Test
This is a distribution-free alternative to ANOVA. It will compare several samples and test the hypothesis:
H0: the means of all the samples are equal
H1: the mean of at least one of the samples is different
The test involves:

• sort the combined results into size order and allocate ranks
• form a table that contains the ranks, instead of the values
• calculate the test statistic 'K' using the formula:

K = [12 / (N (N + 1))] * Sum[ Ri^2 / J , i, 1, I ] - 3(N + 1)

Where:
Ri   sum of the ranks in row 'i'
J    number of values in row 'i'
N    the total number of values
I    the number of rows

'K' has an approximately chi-square distribution with I - 1 degrees of freedom.
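Once K is in hand, its p-value is the right-tail chi-square probability. For the common three-group case (I = 3, so 2 degrees of freedom) that tail has a closed form, exp(-K/2), which allows a stdlib-only sketch (the K = 7.2 value below is hypothetical):

```python
import math

def chi2_sf_2df(x):
    """Right-tail probability P(chi2 >= x) for a chi-square
    distribution with exactly 2 degrees of freedom."""
    return math.exp(-x / 2)

# Suppose K = 7.2 was computed for I = 3 groups (df = 2):
p = chi2_sf_2df(7.2)   # about 0.027, so reject H0 at the 5% level
```

For other degrees of freedom there is no such elementary closed form, and one would use tables or a statistics library instead.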

Levene's Test
The ANOVA method relies on the assumption that the variance is the same for all the samples. Levene's test is:
H0: the variance is the same for all the samples
H1: the variance of at least one of the samples is different

The test uses the statistic 'W' where:

W = [(N - k) / (k - 1)] * Sum[ Ni (zbar_i - zbar)^2 , i ] / Sum[ (z_ij - zbar_i)^2 , i, j ]

with z_ij = | y_ij - median_i |

Where:
N         total number of values
Ni        number of values in row 'i'
k         number of levels
median_i  median of level 'i'
zbar      the average of all the z_ij values
zbar_i    the average of the z_ij values in row 'i'
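A minimal sketch of the W statistic, assuming the median-centred (Brown-Forsythe) variant implied by the "median of level 'i'" symbol above (function name is my own):

```python
from statistics import median, mean

def levene_w(groups):
    """Levene's W with group medians: z_ij = |y_ij - median_i|."""
    k = len(groups)
    z = [[abs(y - median(g)) for y in g] for g in groups]
    N = sum(len(g) for g in groups)
    zbar = mean(v for row in z for v in row)     # grand mean of the z values
    zbar_i = [mean(row) for row in z]            # per-group means of z
    num = sum(len(g) * (zi - zbar) ** 2 for g, zi in zip(groups, zbar_i))
    den = sum((v - zi) ** 2 for row, zi in zip(z, zbar_i) for v in row)
    return (N - k) / (k - 1) * num / den

# Two small groups whose spreads differ: W works out to exactly 2.4 here.
w = levene_w([[1, 2, 3, 4], [2, 4, 6, 8]])
```

W is then referred to an F distribution with k - 1 and N - k degrees of freedom.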

Mann Whitney Test
This is the nonparametric version of the two sample t-test; it compares the means of two samples. It tests the hypothesis:
H0: the means are equal
H1: the mean of sample 'm' is [less than/greater than/not equal to] the mean of sample 'n'
It is similar in concept to the Wilcoxon Signed Rank Test and is also known as the Wilcoxon Rank Sum Test. The test involves:

• sort the combined results into size order and allocate ranks
• find the sum 'w' of the ranks of the sample with the smallest number of values (if they both have the same number of values, select either). This is one of the critical values.

For small data sets the critical value is found from published tables. Because the sum of the ranks must be an integer, it is not usually possible to find the exact critical value. For larger data sets the z-statistic can be found from:

z = (w - µ1) / σ,  where  µ1 = m(m + n + 1) / 2  and  σ = sqrt( m n (m + n + 1) / 12 )

m is the number of values in the sample with the least number of values
n is the number of values in the other sample
µ1 is associated with the sample containing the smallest number of values
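The rank sum w and its normal approximation can be sketched in a few lines of pure Python, assuming the usual µ1 = m(m + n + 1)/2 and σ = sqrt(mn(m + n + 1)/12) given above (function name is my own):

```python
import math

def rank_sum_test(sample_m, sample_n):
    """Return w (rank sum of the smaller sample) and the z-statistic
    z = (w - mu1) / sigma from the normal approximation."""
    if len(sample_m) > len(sample_n):            # make sample_m the smaller
        sample_m, sample_n = sample_n, sample_m
    m, n = len(sample_m), len(sample_n)
    pooled = sorted(sample_m + sample_n)

    def rank(v):                                 # pooled rank, ties averaged
        less = sum(1 for x in pooled if x < v)
        equal = sum(1 for x in pooled if x == v)
        return less + (equal + 1) / 2

    w = sum(rank(v) for v in sample_m)
    mu1 = m * (m + n + 1) / 2
    sigma = math.sqrt(m * n * (m + n + 1) / 12)
    return w, (w - mu1) / sigma

# Completely separated toy samples: w = 1 + 2 + 3 = 6, mu1 = 12.
w, z = rank_sum_test([1, 2, 3], [4, 5, 6, 7])
```

A strongly negative z here reflects that the smaller sample holds all the lowest ranks.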

Median Test
See Mood's Median Test

Mood's Median Test
The Mood's Median Test is a nonparametric equivalent of ANOVA. It is an alternative to the Kruskal-Wallis test. The hypothesis is:
H0: the medians of all the samples are equal
H1: the median of at least one of the samples is different
The test involves:

• find the median of the combined data set
• find the number of values in each sample greater than the median and form a contingency table:

                                    A    B    C    Total
Greater than the median
Less than or equal to the median
Total

• find the expected value for each cell:

E = (row total * column total) / N

• find the chi-square value from:

chi-square = Sum[ (O - E)^2 / E ]  over all cells
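The steps above (grand median, counts, expected values, chi-square) fit in a short pure-Python sketch (function name is my own):

```python
from statistics import median

def moods_median(samples):
    """Mood's median test statistic: counts above / at-or-below the
    grand median, expected counts E = row total * column total / N,
    and chi-square = sum (O - E)^2 / E over all cells."""
    grand = median([v for s in samples for v in s])
    above = [sum(1 for v in s if v > grand) for s in samples]
    below = [len(s) - a for s, a in zip(samples, above)]
    N = sum(len(s) for s in samples)
    chi2 = 0.0
    for row in (above, below):
        row_total = sum(row)
        for obs, s in zip(row, samples):
            e = row_total * len(s) / N       # expected count for this cell
            chi2 += (obs - e) ** 2 / e
    return chi2

# Two non-overlapping samples: every cell expects 2 but observes 0 or 4,
# so chi-square = 4 * (2^2 / 2) = 8.
stat = moods_median([[1, 2, 3, 4], [5, 6, 7, 8]])
```

The statistic is referred to a chi-square distribution with (number of samples - 1) degrees of freedom.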

Parametric Hypothesis Test
A parametric test makes assumptions about the underlying distribution of the population from which the sample is drawn. Typically the assumption is that the population conforms to a normal distribution.

Sign Scores Test
Alternative name for Mood's Median Test


Wilcoxon Rank Sum Test
See the Mann Whitney Test


Wilcoxon Signed Rank Test


This is the nonparametric equivalent of the one sample t-test. It tests the hypothesis:
H0: the mean equals zero
H1: the mean [is less than/greater than/not equal to] zero
The test involves:

• sort the values into order of their absolute magnitude (ignoring the signs) and allocate ranks
• calculate the sum of the ranks of the data values that are positive (S+)

An example would be:

Rank   Data Value
1      -1.0
2      -2.0
3       2.5
4       3.0
5       3.0
6       3.5

If the mean is non-zero the positive values will be clustered at one end or other of the ordered data set and S+ and S- will be very different. For the example, S+ = 3 + 4 + 5 + 6 = 18. For small data sets the critical value is found from published tables. Because the sum of the ranks must be an integer, it is not usually possible to find the exact critical value.

For larger data sets the z-statistic can be found from:

z = (S- - µ) / σ,  where  µ = n(n + 1) / 4  and  σ = sqrt( n(n + 1)(2n + 1) / 24 )

This gives the upper tail test (the mean is less than zero). For the lower tail test, use the sum S+, and for the two tailed test use both, remembering to use the table value twice. The p-value can be calculated using normal distribution tables.
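The worked example above can be replayed in pure Python. Averaging tied ranks (4.5 and 4.5 for the two values of 3.0, instead of the table's 4 and 5) gives the same S+ = 18; the µ and σ are those of the normal approximation just given:

```python
import math

data = [-1.0, -2.0, 2.5, 3.0, 3.0, 3.5]   # the worked example above

abs_sorted = sorted(abs(v) for v in data)

def rank(v):
    """Rank by absolute magnitude, averaging tied ranks."""
    less = sum(1 for x in abs_sorted if x < abs(v))
    equal = sum(1 for x in abs_sorted if x == abs(v))
    return less + (equal + 1) / 2

s_plus = sum(rank(v) for v in data if v > 0)   # 3 + 4.5 + 4.5 + 6 = 18
n = len(data)
mu = n * (n + 1) / 4                           # expected rank sum: 10.5
sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
z = (s_plus - mu) / sigma
```

For a data set this small one would of course use the exact tables rather than z, but the arithmetic illustrates the approximation.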

What statistical analysis should I use?
The following table shows general guidelines for choosing a statistical analysis. We emphasize that these are general guidelines and should not be construed as hard and fast rules. Usually your data could be analyzed in multiple ways, each of which could yield legitimate answers. The table below covers a number of common analyses and helps you choose among them based on the number of dependent variables (sometimes referred to as outcome variables), the nature of your independent variables (sometimes referred to as predictors). You also want to consider the nature of your dependent variable, namely whether it is an interval variable, ordinal or categorical variable, and whether it is normally distributed (see What is the difference between categorical, ordinal and interval variables? for more information on this). The table then shows one or more statistical tests commonly used given these types of variables (but not necessarily the only type of test that could be used) and links showing how to do such tests using SAS, Stata and SPSS.

Number of Dependent Variables / Nature of Independent Variables / Nature of Dependent Variable(s) / Test(s)

One dependent variable:

• 0 IVs (1 population)
  interval & normal: one-sample t-test
  ordinal or interval: one-sample median
  categorical (2 categories): binomial test
  categorical: Chi-square goodness-of-fit

• 1 IV with 2 levels (independent groups)
  interval & normal: 2 independent sample t-test
  ordinal or interval: Wilcoxon-Mann Whitney test
  categorical: Chi-square test; Fisher's exact test

• 1 IV with 2 or more levels (independent groups)
  interval & normal: one-way ANOVA
  ordinal or interval: Kruskal Wallis
  categorical: Chi-square test

• 1 IV with 2 levels (dependent/matched groups)
  interval & normal: paired t-test
  ordinal or interval: Wilcoxon signed ranks test
  categorical: McNemar

• 1 IV with 2 or more levels (dependent/matched groups)
  interval & normal: one-way repeated measures ANOVA
  ordinal or interval: Friedman test
  categorical: repeated measures logistic regression

• 2 or more IVs (independent groups)
  interval & normal: factorial ANOVA
  ordinal or interval: ???
  categorical: factorial logistic regression

• 1 interval IV
  interval & normal: correlation; simple linear regression
  ordinal or interval: non-parametric correlation
  categorical: simple logistic regression

• 1 or more interval IVs and/or 1 or more categorical IVs
  interval & normal: multiple regression; analysis of covariance
  categorical: multiple logistic regression; discriminant analysis

Two or more dependent variables:

• 1 IV with 2 or more levels (independent groups)
  interval & normal: one-way MANOVA

• 2 or more IVs
  interval & normal: multivariate multiple linear regression

• 0 IVs
  interval & normal: factor analysis

Two sets of 2 or more dependent variables:

• 0 IVs
  interval & normal: canonical correlation

Nonparametric Statistics

I shall compare the Wilcoxon rank-sum statistic with the independent samples t-test to illustrate the differences between typical nonparametric tests and their parametric “equivalents.”

Independent Samples t-test
  H0: µ1 = µ2
  Assumptions: normal populations; homogeneity of variance (but not for separate variances test)

Wilcoxon Rank-Sum Test
  H0: Population 1 = Population 2
  Assumptions: none for the general test, but often assume equal shapes and equal dispersions

Both tests are appropriate for determining whether or not there is a significant association between a dichotomous variable and a continuous variable with independent samples data. Note that with the independent samples t test the null hypothesis focuses on the population means. If you have used the general form of the nonparametric hypothesis (without assuming that the populations have equal shapes and equal dispersions), rejection of that null hypothesis simply means that you are confident that the two populations differ on one or more of location, shape, or dispersion. If, however, we are willing to assume that the two populations have identical shapes and dispersions, then we can interpret rejection of the nonparametric null hypothesis as indicating that the populations differ in location. With these equal shapes and dispersions assumptions the nonparametric test is quite similar to the parametric test.

In many ways the nonparametric tests we shall study are little more than parametric tests on rank-transformed data. The nonparametric tests we shall study are especially sensitive to differences in medians. If your data indicate that the populations are not normally distributed, then a nonparametric test may be a good alternative, especially if the populations do appear to be of the same non-normal shape. If, however, the populations are approximately normal but heterogeneous in variance, I would recommend a separate variances t-test over a nonparametric test. If you cannot assume equal dispersions with the nonparametric test, then you cannot interpret rejection of the nonparametric null hypothesis as due solely to differences in location.

Conducting the Wilcoxon Rank-Sum Test
Rank the data from lowest to highest. If you have tied scores, assign all of them the mean of the ranks for which they are tied. Find the sum of the ranks for each group. If n1 = n2, then the test statistic, WS, is the smaller of the two sums of ranks. Go to the table (starts on page 689 of the 6th edition of Howell) and obtain the one-tailed (lower-tailed) p. For a two-tailed test (nondirectional hypotheses), double the p. If n1 ≠ n2, obtain both WS and WS′: WS is the sum of the ranks for the group with the smaller n, and WS′ = 2W̄ − WS (2W̄ is given in the rightmost column in the table), the sum of the ranks that would have been obtained for the smaller group if we had ranked from high to low rather than low to high. The test statistic is the smaller of WS and WS′. If you have a directional hypothesis, to reject the null hypothesis not only must the one-tailed p be less than or equal to the criterion, but also the mean rank for the sample predicted (in H1) to come from the population with the smaller median must be less than the mean rank in the other sample (otherwise the exact p = one minus the p that would have been obtained were the direction correctly predicted). If you have large sample sizes, you can use the normal approximation procedures explained on pages 651-653 of Howell. Computer programs generally do use such an approximation, but they may also make a correction for continuity (reducing the absolute value of the numerator by .5) and they may obtain the probability from a t-distribution rather than from a z-distribution. Please note that the rank-sum statistic is essentially identical to the (better known to psychologists) Mann-Whitney U statistic, but the Wilcoxon is easier to compute. If someone insists on having U, you can always transform your W to U (see pages 653-654 in Howell).
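The WS/WS′ bookkeeping can be checked against the birthweight example discussed later in this document (n1 = 8, n2 = 10, WS = 100, WS′ = 52), assuming the tabled quantity is 2W̄ = n1(n1 + n2 + 1):

```python
# Birthweight example (Howell p. 652): 8 first-trimester and 10
# third-trimester babies; the smaller group's rank sum is WS = 100.
n1, n2 = 8, 10
WS = 100
two_wbar = n1 * (n1 + n2 + 1)   # assumed tabled value 2W-bar: 8 * 19 = 152
WS_prime = two_wbar - WS        # 52
test_stat = min(WS, WS_prime)   # the W = 52 reported in the summary
```

The recovered 52 matches the W = 52 reported for this data set, which is consistent with the assumed form of 2W̄.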
Here is a summary statement for the problem on page 652 of Howell (I obtained an exact p from SAS rather than using a normal approximation): A Wilcoxon rank-sum test indicated that babies whose mothers started prenatal care in the first trimester weighed significantly more (N = 8, M = 3259 g, Mdn = 3015 g, SD = 692 g) than did those whose mothers started prenatal care in the third trimester (N = 10, M = 2576 g, Mdn = 2769 g, SD = 757 g), W = 52, p = .034.

Power of the Wilcoxon Rank-Sum Test

You already know (from having read Gaito's 1980 article in Psychological Bulletin) that the majority of statisticians reject the notion that parametric tests require interval data and thus ordinal data need be analyzed with nonparametric methods. There are more recent simulation studies that also lead one to the conclusion that scale of measurement (interval versus ordinal) should not be considered when choosing between parametric and nonparametric procedures (see the references on page 57 of Nanna & Sawilowsky, 1998). There are, however, other factors that could lead one to prefer nonparametric analysis with certain types of ordinal data. Nanna and Sawilowsky (1998, Psychological Methods, 3, 55-67) addressed the issue of Likert scale data. Such data typically violate the normality assumption and often the homogeneity of variance assumption made when conducting

traditional parametric analysis. Although many have demonstrated that the parametric methods are so robust to these violations that this is not usually a serious problem with respect to holding alpha at its stated level (but can be, as you know from reading Bradley's articles in the Bulletin of the Psychonomic Society), one should also consider the power characteristics of parametric versus nonparametric procedures. While it is generally agreed that parametric procedures are a little more powerful than nonparametric procedures when the assumptions of the parametric procedures are met, what about the case of data for which those assumptions are not met, for example, the typical Likert scale data? Nanna and Sawilowsky demonstrated that with typical Likert scale data, the Wilcoxon rank sum test has a considerable power advantage over the parametric t test. The Wilcoxon procedure had a power advantage with both small and large samples, with the advantage actually increasing with sample size.

Wilcoxon’s Signed-Ranks Test

This test is appropriate for matched pairs data, that is, for testing the significance of the relationship between a dichotomous variable and a continuous variable with related samples. It does assume that the difference scores are rankable, which is certain if the original data are interval scale. The parametric equivalent is the correlated t-test, and another nonparametric alternative is the binomial sign test. To conduct this test you compute a difference score for each pair, rank the absolute values of the difference scores, and then obtain two sums of ranks: the sum of the ranks of the difference scores which were positive and the sum of the ranks of the difference scores which were negative. The test statistic, T, is the smaller of these two sums for a nondirectional test (for a directional test it is the sum which you predicted would be smaller).

Difference scores of zero are usually discarded from the analysis (prior to ranking), but it should be recognized that this biases the test against the null hypothesis. A more conservative procedure would be to rank the zero difference scores and count them as being included in the sum which would otherwise be the smaller sum of ranks. Refer to the table that starts on page 683 of Howell to get the exact one-tailed (lower-tailed) p, doubling it for a nondirectional test. Normal approximation procedures are illustrated on pages 656-658 of Howell. Again, computer software may use a correction for continuity and may use t rather than z. Here is an example summary statement using the data on page 657 of Howell: A Wilcoxon signed-ranks test indicated that participants who were injected with glucose had significantly better recall (M = 7.62, Mdn = 8.5, SD = 3.69) than did subjects who were injected with saccharine (M = 5.81, Mdn = 6, SD = 2.86), T(N = 16) = 14.5, p = .004.

Kruskal-Wallis ANOVA

This test is appropriate to test the significance of the association between a categorical variable (k ≥ 2 groups) and a continuous variable when the data are from independent samples. Although it could be used with 2 groups, the Wilcoxon rank-sum test would usually be used with two groups. To conduct this test you rank the data from low to high and for each group

obtain the sum of ranks. These sums of ranks are substituted into the formula on page 654 of Howell. The test statistic is H, and the p is obtained as an upper-tailed area under a chi-square distribution on k − 1 degrees of freedom. Do note that this one-tailed p is appropriately used for a nondirectional test. If you had a directional test (for example, predicting that Population 1 < Population 2 < Population 3), and the medians were ordered as predicted, you would divide that one-tailed p by k! before comparing it to the criterion. The null hypothesis here is: Population 1 = Population 2 = ......... = Population k. If you reject that null hypothesis you probably will still want to make “pairwise comparisons,” such as group 1 versus group 2, group 1 versus group 3, group 2 versus group 3, etc. This topic is addressed in detail in Chapter 12 of Howell. One may need to be concerned about inflating the “familywise alpha,” the probability of making one or more Type I errors in a family of c comparisons. If k = 3, one can control this familywise error rate by using Fisher’s procedure (also known as “a protected test”): conduct the omnibus test (the Kruskal-Wallis) with the promise not to make any pairwise comparisons unless that omnibus test is significant. If the omnibus test is not significant, you stop. If the omnibus test is significant, then you are free to make the three pairwise comparisons with Wilcoxon’s rank-sum test. If k > 3 Fisher’s procedure does not adequately control the familywise alpha. One fairly conservative procedure is the Bonferroni procedure. With this procedure one uses an adjusted criterion of significance, α′pc = αfw / c. This procedure does not require that you first conduct the omnibus test, and should you first conduct the omnibus test, you may make the Bonferroni comparisons whether or not that omnibus test is significant.
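The Bonferroni bookkeeping is simple arithmetic; here is a sketch for four groups (the group names and p values are hypothetical, purely for illustration):

```python
from itertools import combinations

groups = ["g1", "g2", "g3", "g4"]        # k = 4 hypothetical groups
pairs = list(combinations(groups, 2))    # the 6 pairwise comparisons
alpha_fw = 0.05                          # desired familywise alpha
alpha_pc = alpha_fw / len(pairs)         # per-comparison criterion, 0.05/6

# Each comparison's exact p (from a rank-sum test, say) is declared
# significant only if it is at or below the adjusted criterion.
p_values = {("g1", "g2"): 0.003, ("g1", "g3"): 0.020}   # illustrative only
significant = [pair for pair, p in p_values.items() if p <= alpha_pc]
```

Here only the g1-g2 comparison (p = .003) survives the adjusted criterion of about .0083; g1-g3 (p = .020) would have been "significant" at .05 but is not after adjustment.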
Suppose that k = 4 and you wish to make all 6 pairwise comparisons (1-2, 1-3, 1-4, 2-3, 2-4, 3-4) with a maximum familywise alpha of .05. Your adjusted criterion is .05 divided by 6, .0083. For each pairwise comparison you obtain an exact p, and if that exact p is less than or equal to the adjusted criterion, you declare that difference to be significant. Do note that the cost of such a procedure is a great reduction in power (you are trading an increased risk of Type II error for a reduced risk of Type I error). Here is a summary statement for the problem on page 660 of Howell: Kruskal-Wallis ANOVA indicated that type of drug significantly affected the number of problems solved, H(2, N = 19) = 10.36, p = .006. Pairwise comparisons made with Wilcoxon’s rank-sum test revealed that ......... Basic descriptive statistics (means, medians, standard deviations, sample sizes) would be presented in a table.

Friedman’s ANOVA

This test is appropriate to test the significance of the association between a categorical variable (k ≥ 2) and a continuous variable with randomized blocks data (related samples). While Friedman’s test could be employed with k = 2, usually Wilcoxon’s signed-ranks test would be employed if there were only two groups. Subjects have been matched (blocked) on some variable or variables thought to be correlated with the continuous variable of primary interest. Within each block the continuous variable scores

are ranked. Within each condition (level of the categorical variable) you sum the ranks and substitute in the formula on page 661 of Howell. As with the Kruskal-Wallis, obtain p from chi-square on k − 1 degrees of freedom, using an upper-tailed p for nondirectional hypotheses, adjusting it with k! for directional hypotheses. Pairwise comparisons could be accomplished employing Wilcoxon signed-ranks tests, with Fisher’s or Bonferroni’s procedure to guard against inflated familywise alpha. Friedman’s ANOVA is closely related to Kendall’s coefficient of concordance. For the example on page 661 of Howell, the Friedman test asks whether the rankings are the same for the three levels of visual aids. Kendall’s coefficient of concordance, W = χF² / (N(k − 1)), would measure the extent to which the blocks agree in their rankings. Here is a sample summary statement for the problem on page 661 of Howell: Friedman’s ANOVA indicated that judgments of the quality of the lectures were significantly affected by the number of visual aids employed, χF²(2, n = 17) = 10.94, p = .004. Pairwise comparisons with Wilcoxon signed-ranks tests indicated that ....................... Basic descriptive statistics would be presented in a table.

Power

It is commonly opined that the primary disadvantage of the nonparametric procedures is that they have less power than does the corresponding parametric test. The reduction in power is not, however, great, and if the assumptions of the parametric test are violated, then the nonparametric test may be more powerful.

Everything You Ever Wanted to Know About Six But Were Afraid to Ask

You may have noticed that the numbers 2, 3, 4, 6, 12, and 24 commonly appear as constants in the formulas for nonparametric test statistics. This results from the fact that the sum of the integers from 1 to n is equal to n(n + 1) / 2.

Effect Size Estimation

As you know, the American Psychological Association now emphasizes the reporting of effect size estimates.
Since the unit of measure for most criterion variables used in psychological research is arbitrary, standardized effect size estimates, such as Hedges’ g, η², and ω², are popular. What is one to use when the analysis has been done with nonparametric methods? This query is addressed in the document “A Call for Greater Use of Nonparametric Statistics,” pages 13-15. The authors (Leech & Onwuegbuzie) note that researchers who employ nonparametric analysis generally either do not report effect size estimates or report parametric effect size estimates such as g. It is, however, known that these effect size estimates are adversely affected by departures from normality and heterogeneity of variances, so they may not be

well advised for use with the sort of data which generally motivates a researcher to employ nonparametric analysis. There are a few nonparametric effect size estimates (see Leech & Onwuegbuzie), but they are not well-known and they are not available in the typical statistical software package. You can find SAS code for computing two nonparametric effect size estimates in the document “Robust Effect Size Estimates and Meta-Analytic Tests of Homogeneity” (Hogarty & Kromrey, SAS Users Group International Conference, Indianapolis, April, 2000).

Using SAS to Compute Nonparametric Statistics
Run the program Nonpar.sas from my SAS programs page. Print the output and the program file. The first analysis is a Wilcoxon Rank Sum Test, using the birthweight data also used by Howell (page 652) to illustrate this procedure. SAS gives us the sum of scores for each group. That sum for the smaller group is the statistic which Howell calls WS (100). Note that SAS does not report the WS′ statistic (52). SAS does report both a normal approximation (z = 2.088, p = .037) and an exact (not approximated) p = .034. The z differs slightly from that reported by Howell because SAS employs a correction for continuity (reducing by .5 the absolute value of the numerator of the z ratio).

The next analysis is a Wilcoxon Matched Pairs Signed-Ranks Test using the data from page 657 of Howell. Glucose-Saccharine difference scores are computed and then fed to Proc Univariate. Among the many other statistics reported with Proc Univariate, there is the Wilcoxon Signed-Ranks Test. For the data employed here, you will see that SAS reports “S = 53.5, p = .004.” S, the signed-rank statistic, is the absolute value of T − n(n + 1)/4, where T is the sum of the positive ranks or the negative ranks. S is the difference between the expected and the obtained sums of ranks. You know that the sum of the ranks from 1 to n is n(n + 1)/2. Under the null hypothesis, you expect the sum of the positive ranks to equal the sum of the negative ranks, so you expect each of those sums of ranks to be half of n(n + 1)/2. For the data we analyzed here, the sum of the ranks 1 through 16 = 136, and half of that is 68. The observed sum of positive ranks is 121.5, and the observed sum of negative ranks is 14.5. The difference between 68 and 14.5 (or between 121.5 and 68) is 53.5, the value of S reported by SAS. To get T from S, just subtract the absolute value of S from the expected value for the sum of ranks, that is, T = n(n + 1)/4 − |S|. Alternatively, just report S instead of T and be prepared to explain what S is to the ignorant psychologists who review your manuscript.
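The S-to-T arithmetic for the glucose/saccharine data is worth seeing as plain numbers:

```python
n = 16                       # pairs of scores in the example
expected = n * (n + 1) / 4   # 68: expected sum of positive (or negative) ranks
S = 53.5                     # the signed-rank statistic reported by SAS
T = expected - abs(S)        # 14.5: the smaller sum of ranks, Howell's T

# The observed sums straddle the expectation symmetrically:
sum_positive = expected + S  # 121.5
sum_negative = expected - S  # 14.5
```

All four numbers (68, 53.5, 121.5, 14.5) match the values quoted in the paragraph above.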

If you needed to conduct several signed-ranks tests, you might not want to produce all of the output that you get by default with Proc Univariate. See my program WilcoxonSignedRanks.sas on my SAS programs page to see how to get just the statistics you want and nothing else. Note that a Binomial Sign Test is also included in the output of Proc Univariate. SAS reports “M = 5, p = .0213.” M is the difference between the expected number of negative signs and the obtained number of negative signs. Since we have 16 pairs of scores, we expect, under the null, to get 8 negative signs. We got 3 negative signs, so M = 8 − 3 = 5. The p here is the probability of getting an event as or more unusual than 3 successes on 16 binomial trials when the probability of a success on each trial is .5. Another way to get this probability with SAS is:

data p; p = 2*PROBBNML(.5, 16, 3); proc print; run;
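The same binomial probability can be computed from first principles in a few lines of pure Python (an illustrative alternative to the PROBBNML call above):

```python
from math import comb

n, successes = 16, 3     # 3 negative difference scores out of 16 pairs

# P(X <= 3) for Binomial(n = 16, p = 0.5): sum the binomial coefficients
# for 0..3 successes and divide by 2^16 equally likely sign patterns.
p_one_tail = sum(comb(n, k) for k in range(successes + 1)) / 2 ** n
p_two_tail = 2 * p_one_tail   # doubled for the two-tailed sign test
```

This reproduces the p = .0213 reported by SAS (697/65536 ≈ .01064, doubled).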

Next is a Kruskal-Wallis ANOVA, using Howell’s data on effect of stimulants and depressants on problem solving (page 660). Do note that the sums and means reported by SAS are for the ranked data. Following the overall test, I conducted pairwise comparisons with Wilcoxon Rank Sum tests. Note how I used the subsetting IF statement to create the three subsets necessary to do the pairwise comparisons.

The last analysis is Friedman’s Rank Test for Correlated Samples, using Howell’s data on the effect of visual aids on rated quality of lectures (page 661). Note that I first had to use Proc Rank to create a data set with ranked data. Proc Freq then provides the Friedman statistic as a Cochran-Mantel-Haenszel Statistic. One might want to follow the overall analysis with pairwise comparisons, but I have not done so here. I have also provided an alternative rank analysis for the data just analyzed with the Friedman procedure. Note that I simply conducted a factorial ANOVA on the rank data, treating the blocking variable as a second independent variable. One advantage of this approach is that it makes it easy to get the pairwise comparisons -- just include the LSMEANS command with the PDIFF option. The output from LSMEANS includes the mean ranks and a matrix of p values for tests comparing each group’s mean rank with each other group’s mean rank.