Cross Tab for Practice



Spring 2008

Lab Activities # 6

Today we will:

- talk about inter-rater agreement and reliability

- create contingency tables (remember?) to use in estimating kappas

- estimate kappa using SPSS and calculate it by hand

- calculate percent agreement

Before we dig into this week’s dataset, let’s talk about concepts.

Are inter-rater agreement and inter-rater reliability the same thing? Why or why not?

Agreement: "Extent to which the different judges tend to make exactly the same judgments about the rated subject" (Tinsley & Weiss, 1975, p. 359). E.g., different judges assign exactly the same numerical value to the same subject.

Reliability: "Degree to which the ratings of different judges are proportional when expressed as deviations from their means" (Tinsley & Weiss, 1975, p. 359). E.g., different judges have the same general rank ordering of subjects, but don't necessarily assign the same numerical value to each subject.
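To make the distinction concrete, here is a tiny Python sketch (a made-up example, not the lab data): two raters who never give the same number can still be perfectly reliable.

```python
import numpy as np

# Toy illustration: rater B is always exactly one point higher than rater A.
rater_a = np.array([1, 2, 3, 4])
rater_b = rater_a + 1

agreement = np.mean(rater_a == rater_b)            # 0.0 -> zero exact agreement
reliability = np.corrcoef(rater_a, rater_b)[0, 1]  # 1.0 -> perfectly proportional
print(agreement, reliability)
```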

*Note: not a ton of time needs to be spent on this issue, but it's important to understand that inter-rater agreement and reliability are NOT the same thing. Also, a general warning that a lot of very smart people have messed this up by confusing the terms (e.g., Schmidt & Hunter, 1989; Kozlowski & Hattrup, 1992).


DATAFILE: engagementratings.sav

Here we have two different raters rating the level of engagement on a reading activity for 91 children. Each rater makes one rating per child, and there are four levels of engagement: low, moderate-low, moderate-high, and high. What you see in the dataset is a summary of the observations. So the frequency associated with 1-1, for example, represents the number of children that rater1 and rater2 both rated as "low"; 1-2 represents the number of children who received a rating of "low" from rater1 and a rating of "moderate-low" from rater2. The engR1 variable contains the ratings for rater1 and engR2 contains the ratings for rater2. If inter-rater agreement is high, where would you expect the majority of the frequencies to be? (In 1-1, 2-2, 3-3, and 4-4.)

Data → Weight Cases. Select Weight Cases By and move freq over to the open space.

Now we will analyze the weighted data.

Analyze → Descriptive Statistics → Crosstabs. (This should be familiar to you from 521.)

Move engR1 to Rows and engR2 to Columns.

Click Statistics, select Chi Square, Phi and V, and Kappa. Click Continue.

Click Cells, select Observed and Expected.

Click Continue and OK.
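If you want to double-check the SPSS output outside of SPSS, here is a minimal Python sketch (pandas assumed; the layout of the .sav file is a hypothetical reconstruction) that expands the weighted summary back to one row per child and rebuilds the same contingency table. The cell frequencies are copied from the output below.

```python
import pandas as pd

# Hypothetical reconstruction of engagementratings.sav: one row per
# rater1/rater2 combination (1 = low ... 4 = high), with freq giving
# how many of the 91 children fall in that cell (the Weight Cases variable).
data = pd.DataFrame({
    "engR1": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4],
    "engR2": [1, 2, 3, 4] * 4,
    "freq":  [7, 7, 2, 3, 2, 8, 3, 7, 1, 5, 4, 9, 2, 8, 9, 14],
})

# Expand the summary back to one row per child, then cross-tabulate:
expanded = data.loc[data.index.repeat(data["freq"])]
print(pd.crosstab(expanded["engR1"], expanded["engR2"]))
```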

Crosstabs

Case Processing Summary

                      Cases
                      Valid          Missing        Total
                      N    Percent   N    Percent   N    Percent
engR1 * engR2         91   100.0%    0    .0%       91   100.0%

engR1 * engR2 Crosstabulation

                                   engR2
                                   low    mod-low  mod-high  high   Total
engR1  low       Count             7      7        2         3      19
                 Expected Count    2.5    5.8      3.8       6.9    19.0
       mod-low   Count             2      8        3         7      20
                 Expected Count    2.6    6.2      4.0       7.3    20.0
       mod-high  Count             1      5        4         9      19
                 Expected Count    2.5    5.8      3.8       6.9    19.0
       high      Count             2      8        9         14     33
                 Expected Count    4.4    10.2     6.5       12.0   33.0
Total            Count             12     28       18        33     91
                 Expected Count    12.0   28.0     18.0      33.0   91.0


This table provides the expected, observed, and marginal counts (we could also get the residuals), which we can use to compute kappa. Note that the diagonal counts represent rater agreement (these are the 1-1, 2-2, etc. from the dataset).

Chi-Square Tests

                              Value     df   Asymp. Sig. (2-sided)
Pearson Chi-Square            16.955a    9   .049
Likelihood Ratio              15.486     9   .078
Linear-by-Linear Association  10.014     1   .002
N of Valid Cases              91

a. 7 cells (43.8%) have expected count less than 5. The minimum expected count is 2.51.

Chi-square: this statistic tells us that there does appear to be dependency somewhere in the data. This is a good thing, as we want the ratings of the two judges to be related.

Linear-by-linear association: this tells us that as one rater's ratings increase, the other rater's ratings also increase. (Note that this is only useful if the categories used in the ratings are at least ordinal in nature, ideally interval or ratio, and the rows and columns of the contingency table are in an increasing or decreasing order. If either is not true, both the chi-square and linear-by-linear statistics are meaningless.) This is also a good thing, as it indicates that the ratings are likely more similar than dissimilar. A negative linear-by-linear association would not be a good thing.
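For reference, the same chi-square test can be reproduced in Python with scipy; this sketch simply feeds in the observed counts from the crosstab above.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the 4x4 crosstab above.
observed = np.array([[7, 7, 2, 3],
                     [2, 8, 3, 7],
                     [1, 5, 4, 9],
                     [2, 8, 9, 14]])

chi2, p, df, expected = chi2_contingency(observed)
print(chi2, df, p)        # ~16.955, 9, ~.049 -- matches the SPSS table
print(expected.round(1))  # the Expected Count rows from the crosstab
```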

Symmetric Measures

                                   Value  Asymp. Std. Error(a)  Approx. T(b)  Approx. Sig.
Nominal by Nominal    Phi          .432                                       .049
                      Cramer's V   .249                                       .049
Measure of Agreement  Kappa        .129   .069                  2.114         .035
N of Valid Cases      91

a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.

Phi & Cramer's V: You tell me: which of these statistics should I focus on? Given that our table is larger than 2×2, we'll interpret Cramer's V: the relationship between rater1's ratings and rater2's ratings is .25, a small relationship.

Kappa (agreement correcting for chance) is .13, which is less than desirable. However, our kappa does differ significantly from 0, p < .04. The discrepancy between magnitude (.13) and significance (p < .04) is likely due to sample size. A significant kappa is a necessary but not sufficient condition in the context of displaying inter-rater agreement.
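To see where the .129 comes from, here is a small Python function (numpy assumed) that applies the standard kappa formula, κ = (P_obs − P_exp) / (1 − P_exp), to the contingency table above.

```python
import numpy as np

def kappa_from_table(observed):
    """Cohen's kappa from a square rater1-by-rater2 contingency table."""
    observed = np.asarray(observed, dtype=float)
    n = observed.sum()
    p_obs = np.trace(observed) / n    # proportion on the agreement diagonal
    row = observed.sum(axis=1) / n    # rater 1 marginal proportions
    col = observed.sum(axis=0) / n    # rater 2 marginal proportions
    p_exp = (row * col).sum()         # agreement expected by chance alone
    return (p_obs - p_exp) / (1 - p_exp)

table = [[7, 7, 2, 3], [2, 8, 3, 7], [1, 5, 4, 9], [2, 8, 9, 14]]
print(round(kappa_from_table(table), 3))  # 0.129, matching SPSS
```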

-So, can anyone tell me why kappa doesn't look so great? What's causing the problem (i.e., lack of agreement)?

Avoiding small kappas: Ideally, before you've put a ton of time and resources into your study, you have piloted your scales (particularly if they are behavioral observation scales). If there are problems with the rating scale, they will likely emerge during the pilot study (i.e., you'll see that kappa stinks). However, the problem may also be in the way you train your raters. Debriefing with the raters is a good way to get to the root cause.

Dealing with small kappas: In some sense, you can think of a small kappa in the same way that you would think of a really low alpha (i.e., .4). It's a bummer, and you don't want it. It limits the confidence you can have in the validity of the construct in the same way that a low alpha does: if you're not at least measuring "something" consistently, how can you say that the "something" you're measuring is in fact what you intended to measure?

That said, some people will try to reduce the number of rating categories in order to improve kappa. Reducing the number of categories to improve inter-rater agreement is a judgment call, and you will likely need to do quite a bit of justifying so that your reviewers, peers, advisors, committee members, etc., don't yell at you. This decision becomes especially shady because it is known that, all else equal, kappa is higher when the number of categories is smaller. In short, I wouldn't advocate this approach.

DATAFILE: engagementratingsB.sav

This dataset contains the data from two different raters who rated the same 91 children as in the first example. Here, though, raters were simply asked to indicate whether the level of engagement on the activity was high or low. Let's analyze the agreement between these two new raters using a different rating scale.

Data → Weight Cases. Select Weight Cases By and move freq over to the open space. Now we will analyze the weighted data following the same steps as above. Analyze → Descriptive Statistics → Crosstabs. Move engR1 to Rows and engR2 to Columns.

Statistics: select Chi Square, Phi and V, and Kappa. Click Continue.

Cells: select Observed and Expected. Click Continue and OK.
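As before, a minimal Python sketch (hypothetical reconstruction of the .sav layout) can rebuild this 2×2 table from the weighted summary for checking purposes.

```python
import pandas as pd

# Hypothetical reconstruction of engagementratingsB.sav (1 = low, 2 = high),
# again with freq as the Weight Cases variable.
data = pd.DataFrame({
    "engR1": [1, 1, 2, 2],
    "engR2": [1, 2, 1, 2],
    "freq":  [24, 15, 16, 36],
})
expanded = data.loc[data.index.repeat(data["freq"])]
print(pd.crosstab(expanded["engR1"], expanded["engR2"]))
```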

Crosstabs

Case Processing Summary

                      Cases
                      Valid          Missing        Total
                      N    Percent   N    Percent   N    Percent
engR1 * engR2         91   100.0%    0    .0%       91   100.0%

engR1 * engR2 Crosstabulation

                               engR2
                               low    high   Total
engR1  low    Count            24     15     39
              Expected Count   17.1   21.9   39.0
       high   Count            16     36     52
              Expected Count   22.9   29.1   52.0
Total         Count            40     51     91
              Expected Count   40.0   51.0   91.0

This table provides the expected, observed, and marginal counts (we could also get the residuals…), which we can use to compute kappa. Note that the diagonal counts represent rater agreement.

To calculate % agreement:

#Agree / total = 60/91 = .66

66% rater agreement on engagement level during the activity.
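The same arithmetic in a few lines of Python, using the observed counts from the 2×2 table:

```python
# Percent agreement = count on the agreement diagonal / total N.
table = [[24, 15],
         [16, 36]]
n = sum(sum(row) for row in table)
agree = table[0][0] + table[1][1]
print(agree, n, round(agree / n, 2))  # 60, 91, 0.66
```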

Chi-Square Tests

                              Value    df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                            (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square            8.565b    1   .003
Continuity Correction(a)      7.361     1   .007
Likelihood Ratio              8.657     1   .003
Fisher's Exact Test                                       .005         .003
Linear-by-Linear Association  8.471     1   .004
N of Valid Cases              91

a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 17.14.

Linear-by-linear association: as one rater's ratings increase, the other rater's ratings also increase. Again, this is a good thing.


Symmetric Measures

                                   Value  Asymp. Std. Error(a)  Approx. T(b)  Approx. Sig.
Nominal by Nominal    Phi          .307                                       .003
                      Cramer's V   .307                                       .003
Measure of Agreement  Kappa        .307   .100                  2.927         .003
N of Valid Cases      91

a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.

Phi: .31, p = .003; the ratings of rater one are moderately related to the ratings of rater two.

Kappa (agreement correcting for chance) is .31: still less than desirable, but better than in the other example. Kappa does differ significantly from 0, p < .01. There is 31% agreement between rater one and rater two after correcting for chance. Note how much this differs from the simple percent agreement statistic of .66!

Low-low Cell

- Probability of a low rating by rater 1 = 39/91 = .43
- Probability of a low rating by rater 2 = 40/91 = .44

If the two raters are operating independently of one another, the probability that they would both report low = .43 × .44 ≈ .19. Thus, for 91 cases, .43 × .44 × 91 ≈ 17.22 low-low ratings would be expected purely by chance. (The exact value, 39 × 40 / 91 = 17.14, is what SPSS reports; the small difference is rounding.)

Expected frequency for the low-low cell: 17.22

High-high Cell

- Probability of a high rating by rater 1 = 52/91 = .57
- Probability of a high rating by rater 2 = 51/91 = .56

If the two raters are operating independently of one another, the probability that they would both report high = .57 × .56 ≈ .32. Thus, for 91 cases, 52 × 51 / 91 = 29.14 high-high ratings would be expected purely by chance.

Expected frequency for the high-high cell: 29.14

Now we can plug these values, along with the observed frequencies, into the tables below. (Note: of course, you can get the expected frequencies directly from SPSS, rather than calculating them by hand.)
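The by-hand logic above generalizes to every cell: expected count = row total × column total / N. A short numpy sketch, using the observed 2×2 counts:

```python
import numpy as np

observed = np.array([[24., 15.],
                     [16., 36.]])
n = observed.sum()

# Expected count for each cell = (row total * column total) / n,
# the same "independence" logic as the by-hand calculation above.
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
print(expected.round(2))  # diagonal: 17.14 and 29.14
```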


Expected frequencies (agreement diagonal only):

           r2 low   r2 high
r1 low     17.22
r1 high             29.14

Observed frequencies (agreement diagonal only):

           r2 low   r2 high
r1 low     24
r1 high             36

Adding down each agreement diagonal: expected = 46.36, observed = 60.

κ = (P_obs − P_exp) / (1 − P_exp)
  = (60/91 − 46.36/91) / (1 − 46.36/91) ≈ .31

There is ~30% agreement between rater one and rater two after correcting for chance (basically the same as what we found using SPSS, within rounding error).
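The same arithmetic in Python, using the exact expected counts rather than the hand-rounded ones, lands on the SPSS value:

```python
# Kappa from the 2x2 agreement diagonal, exact expected counts.
p_obs = 60 / 91
p_exp = (17.14 + 29.14) / 91
kappa = (p_obs - p_exp) / (1 - p_exp)
print(round(kappa, 3))  # 0.307 -- the SPSS value
```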

(24/15) / (36/16) = .71

This ratio compares the conditional odds of agreement in the two categories: the odds of the raters agreeing on low engagement (24/15 = 1.60) are .71 times the odds of the raters agreeing on high engagement (36/16 = 2.25). This might indicate that it is easier for the raters to recognize high-engagement behaviors.
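The same comparison as a two-line check:

```python
# Conditional odds of agreement, from the 2x2 observed counts.
odds_low  = 24 / 15   # odds rater 2 also says "low"  when rater 1 says "low"
odds_high = 36 / 16   # odds rater 2 also says "high" when rater 1 says "high"
print(round(odds_low / odds_high, 2))  # 0.71
```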
