
The Chi Square Test

By SDK, AIM

Chi sq: The test of the goodness of fit



The Chi Square Test (χ²) is used to determine how well theoretical distributions (Normal, Binomial, Poisson, etc.) fit empirical distributions (those obtained from samples). Pearson developed this test in 1900 to check the goodness of fit of distributions.

Consider
A particular sample: a set of possible events E1, E2, ..., Ek that are observed to occur with frequencies o1, o2, ..., ok (called observed frequencies). As per the rules of probability, these events are expected to occur with frequencies e1, e2, ..., ek (called expected frequencies).

Event                 E1    E2    ...   Ek
Observed frequency    o1    o2    ...   ok
Expected frequency    e1    e2    ...   ek

Example
If we toss a fair coin 100 times, we may expect 50 heads and 50 tails. However, the observed results may not match these expected frequencies exactly.

The χ² variable gives a measure of the disparity existing between the observed and the expected frequencies.

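The χ² Variable

$$\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$$

$$N = \text{total frequency} = \sum_{i=1}^{k} o_i = \sum_{i=1}^{k} e_i$$

Thus

$$\chi^2 = \sum_{i=1}^{k} \frac{o_i^2 - 2 o_i e_i + e_i^2}{e_i} = \sum_{i=1}^{k} \frac{o_i^2}{e_i} - 2N + N = \sum_{i=1}^{k} \frac{o_i^2}{e_i} - N$$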
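The identity above can be checked numerically. The following is a minimal Python sketch; the function names and the 46/54 split of 100 coin tosses are assumptions made for illustration, not data from the slides.

```python
def chi_square(observed, expected):
    """Definition: sum of (o_i - e_i)^2 / e_i over all k events."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_shortcut(observed, expected):
    """Shortcut form: sum of o_i^2 / e_i, minus N (needs sum(o) == sum(e))."""
    n_total = sum(observed)
    return sum(o ** 2 / e for o, e in zip(observed, expected)) - n_total


# Hypothetical outcome of 100 tosses of a fair coin (assumed for illustration).
observed = [46, 54]
expected = [50, 50]

direct = chi_square(observed, expected)
shortcut = chi_square_shortcut(observed, expected)
print(direct, shortcut)   # both forms agree up to floating-point rounding
```

The shortcut form only holds when the observed and expected frequencies add up to the same total N.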

Example
I assume that the marks of a class are normally distributed. However, when the examination takes place, I find that the class has performed better than expected.

Marks      Observed frequency   Expected frequency   (oi-ei)²/ei
0-20               2                    5                 1.8
21-40              5                   10                 2.5
41-60             18                   30                 4.8
61-80             23                   10                16.9
81-100            12                    5                 9.8
Total             60                   60                35.8 = χ²
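The last column of the marks table can be reproduced with a few lines of Python. This is a minimal sketch; the variable names are illustrative assumptions.

```python
marks = ["0-20", "21-40", "41-60", "61-80", "81-100"]
observed = [2, 5, 18, 23, 12]
expected = [5, 10, 30, 10, 5]

# One (o - e)^2 / e term per class of marks.
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_sq = sum(contributions)

for label, c in zip(marks, contributions):
    print(f"{label:>7}: {c:.1f}")
print(f"chi-square = {chi_sq:.1f}")
```

The 61-80 class alone contributes 16.9 of the total, which is where the better-than-expected performance shows up.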

Concept
If χ² = 0: observed and theoretical distributions fit exactly.
If χ² > 0: observed and theoretical distributions differ.

Also, χ² ≥ 0, as it is a sum of squares, and the larger the χ², the greater the difference between the two distributions.

The probability function of χ²

Calculation of degrees of freedom (df)
Population parameters are known: m = 0, so df = k - 1
Population parameters are not known and have to be estimated from sample statistics: df = k - 1 - m, where m = no of population parameters estimated
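The rule for the degrees of freedom can be written as a one-line function. This is a minimal sketch; the function name and the case with two estimated parameters are assumptions for illustration.

```python
def degrees_of_freedom(k, m=0):
    """df = k - 1 - m for k classes and m parameters estimated from the sample."""
    df = k - 1 - m
    if df < 1:
        raise ValueError("need at least one degree of freedom")
    return df


print(degrees_of_freedom(5))       # parameters known: 5 - 1 = 4
print(degrees_of_freedom(5, m=2))  # hypothetical case, two parameters estimated: 2
```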

The Curve
[Figure: χ² distribution curve (vertical axis Y) with the acceptance area to the left of the table value of χ² and the rejection area to its right. Example at α = 0.05 and df = 5 - 1 = 4.]

Steps of the Chi Square Test

1. Define H0 and H1
2. List the observed frequencies
3. Calculate the expected frequencies if the data follow a theoretical distribution
4. Compute χ²
5. Accept H0 if computed χ² < critical χ²
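The steps above can be sketched as a small Python function. The dictionary holds standard table values of χ² at the 0.05 level for df 1 to 7. The function name, the dictionary name, and the choice of df = 5 - 1 = 4 for the marks data are assumptions for illustration.

```python
# Standard table values of chi-square at the 0.05 level of significance, keyed by df.
CRITICAL_05 = {1: 3.8415, 2: 5.9915, 3: 7.8147, 4: 9.4877,
               5: 11.0705, 6: 12.5916, 7: 14.0671}


def goodness_of_fit(observed, expected, m=0):
    """Return (chi_sq, df, accept_h0) for H0: the data follow the expected distribution."""
    chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1 - m
    return chi_sq, df, chi_sq < CRITICAL_05[df]


# Marks example: observed vs expected class frequencies, df assumed to be 5 - 1 = 4.
chi_sq, df, accept = goodness_of_fit([2, 5, 18, 23, 12], [5, 10, 30, 10, 5])
print(chi_sq, df, accept)
```

Under these assumptions the computed χ² for the marks data is far above the table value at df = 4, so H0 would not be accepted.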

Example
In 200 tosses of a coin, 115 heads and 85 tails are observed. Is the coin fair?
H0: Coin is fair
Here there are 2 events:
E1: Outcome is head
E2: Outcome is tail

Event          E1      E2
Observed       115     85
Expected       100     100
(o-e)²/e       2.25    2.25

Computed χ² = 2.25 + 2.25 = 4.5
Critical χ² at df = 1 and level of significance 0.05 = 3.8415

Thus H0 is rejected. Coin is not fair.
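For df = 1 the rejection area can be checked without a table, because a χ² variable with 1 degree of freedom is the square of a standard normal variable, so the area to the right of x is erfc(√(x/2)). The sketch below uses only Python's standard library; the function name is an assumption for illustration.

```python
import math


def area_right_of_df1(x):
    """Area under the chi-square curve (df = 1) to the right of x."""
    return math.erfc(math.sqrt(x / 2))


observed = [115, 85]
expected = [100, 100]
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(chi_sq)                                   # 4.5
print(round(area_right_of_df1(3.8415), 3))      # 0.05, the rejection area
print(area_right_of_df1(chi_sq) < 0.05)         # True: 4.5 lies in the rejection area
```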

The expected and actual sales of a TV company are given below

Actual Sales fo   Expected Sales fe   (fo-fe)²/fe
57                59                  0.068
69                76                  0.645
51                55                  0.291
83                75                  0.853
44                39                  0.641
48                53                  0.472
35                30                  0.833
37                48                  2.521
                  Computed χ²         6.324

Note that df = 7

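Thus
The computed χ² of 6.324 is less than the critical χ² of 14.067 at df = 7 and α = 0.05, so H0: Observed and expected frequencies are similar is not rejected. The data do not show that actual sales differ from expectation.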

Example
A company requires that college seniors who are seeking placement be interviewed by 3 different executives. Each executive gives the candidate either a +ve or a -ve rating. For staffing purposes, the HR director thinks that the HR process can be approximated by a binomial distribution. Is he right?

Data given:
No of +ve ratings        0     1     2     3
No of candidates fo     18    47    24    11

Here n = 3, r = 0, 1, 2, 3

No of +ve ratings r   nCr   Prob(r)   Expected fe = 100 x Prob(r)   Observed fo   (fo-fe)²/fe
0                     1     0.216     21.6                          18            0.6000
1                     3     0.432     43.2                          47            0.3343
2                     3     0.288     28.8                          24            0.8000
3                     1     0.064      6.4                          11            3.3063
Total                                 100                           100           5.0405 = computed χ²

k = 4, df = 3
Critical χ² at α = 0.05 = 7.8147

Thus

H0: "Binomial distribution with p = 0.4 is a good description of the interview process" is accepted, since the computed χ² of 5.0405 is less than the critical χ² of 7.8147.
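The expected frequencies and the computed χ² for the interview example can be reproduced as follows. This is a minimal Python sketch (Python 3.8+ for math.comb); the helper name binomial_expected is an assumption for illustration.

```python
import math


def binomial_expected(n, p, total):
    """Expected frequencies total * nCr * p^r * (1-p)^(n-r) for r = 0..n."""
    return [total * math.comb(n, r) * p ** r * (1 - p) ** (n - r)
            for r in range(n + 1)]


observed = [18, 47, 24, 11]              # candidates with 0, 1, 2, 3 +ve ratings
expected = binomial_expected(3, 0.4, sum(observed))
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print([round(e, 1) for e in expected])   # [21.6, 43.2, 28.8, 6.4]
print(round(chi_sq, 4))
print(chi_sq < 7.8147)                   # True: H0 is accepted at df = 3
```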

Example
A salesman for a paper company has 5 accounts to visit per day. It is suggested that the variable, sales, may be described by the binomial distribution, with the probability of selling to each account being 0.45. Given the following frequency distribution of the sales per day, can we conclude that the data do in fact follow the suggested distribution? (α = 0.05)

Number of sales, r                  0     1     2     3     4     5
Observed frequency of sales, fo     8    30    52    33    14     3

H0: Data follow the Binomial distribution

Number of sales, r   5Cr   Expected frequency fe   Observed frequency fo   (fo-fe)²/fe
0                    1       7                       8                     0.1292
1                    5      29                      30                     0.0479
2                    10     47                      52                     0.4951
3                    10     39                      33                     0.8101
4                    5      16                      14                     0.2024
5                    1       3                       3                     0.0672
Total                      140                     140                     1.7519 = computed χ²

Critical χ² at α = 0.05 and df = 6 - 1 = 5 is 11.070 > computed χ². Accept H0.
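A minimal Python sketch of the computation for this example follows. It keeps the expected frequencies unrounded, which appears to be how the (fo-fe)²/fe column was obtained, since the rounded values 7, 29, 47, ... do not reproduce it exactly. The variable names are illustrative assumptions.

```python
import math

n, p = 5, 0.45
observed = [8, 30, 52, 33, 14, 3]        # days with 0..5 sales
total = sum(observed)                    # 140

# Unrounded expected frequencies from the binomial distribution.
expected = [total * math.comb(n, r) * p ** r * (1 - p) ** (n - r)
            for r in range(n + 1)]
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_sq = sum(contributions)

for r, (e, c) in enumerate(zip(expected, contributions)):
    print(r, round(e), round(c, 4))
print("computed chi-square:", round(chi_sq, 4))
```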

Chi sq as a test of Independence

Note that:
The tests of significance are all based on the assumption that the population is normally distributed. However, it is not always possible to assume a distribution pattern for the population being sampled.

If we classify a population into several categories with respect to two attributes (e.g. age, job preference), we can use the Chi Square Test to determine if the two attributes are independent of each other.

Example
In 4 regions, National Health Company samples its employees' attitudes towards job performance reviews. Respondents are given a choice between the present method of 2 reviews a year and the proposed method of quarterly reviews.

Also,
1. pN is the proportion of employees from the north who prefer the present plan
2. pE is the proportion of employees from the east who prefer the present plan
3. pS is the proportion of employees from the south who prefer the present plan
4. pW is the proportion of employees from the west who prefer the present plan

H0: pN = pE = pS = pW

Contingency table

                                     North   South   East   West   Total
Number who prefer present method       68      75     57     79     279
Number who prefer new method           32      45     33     31     141
Total                                 100     120     90    110     420

Combined proportion of employees preferring the present method = 279/420 = 0.6643
Thus combined proportion of employees preferring the new method = 1 - 0.6643 = 0.3357
Thus,
1. 0.6643 = Estimate of population proportion who prefer the current method
2. 0.3357 = Estimate of population proportion who prefer the new method
Multiply each estimate by the total number of employees sampled in each region to get the expected number

Contingency table

Observed values
                                     North   South   East   West   Total
Number who prefer present method       68      75     57     79     279
Number who prefer new method           32      45     33     31     141
Total                                 100     120     90    110     420

Expected values
                                     North   South   East   West   Total
Number who prefer present method       66      80     60     73     279
Number who prefer new method           34      40     30     37     141
Total                                 100     120     90    110     420

Degrees of freedom = (4-1)(2-1) = 3

Thus H0 is accepted
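The computed χ² for this example can be obtained from the observed table by following the pooled-proportion method described above. This is a minimal Python sketch; the variable names are assumptions, and the comparison uses 7.8147, the table value for df = 3 at α = 0.05 quoted earlier in these slides (the level of significance for this example is assumed to be 0.05).

```python
present = [68, 75, 57, 79]     # prefer present method: North, South, East, West
new = [32, 45, 33, 31]         # prefer new method

region_totals = [a + b for a, b in zip(present, new)]   # 100, 120, 90, 110
p_present = sum(present) / sum(region_totals)           # 279 / 420
p_new = 1 - p_present

# Expected number = pooled proportion x number sampled in the region.
exp_present = [p_present * t for t in region_totals]
exp_new = [p_new * t for t in region_totals]

chi_sq = sum((o - e) ** 2 / e
             for o, e in zip(present + new, exp_present + exp_new))
df = (len(region_totals) - 1) * (2 - 1)

print(round(p_present, 4), [round(e) for e in exp_present])
print(round(chi_sq, 2), df)
print(chi_sq < 7.8147)         # True: H0 is accepted
```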

Consider a contingency table

Criteria   C1     C2     C3     Total
R1         O11    O12    O13    R1
R2         O21    O22    O23    R2
Total      C1     C2     C3     n

$H_0$: $R_i$ is independent of $C_j$, or $P(R_i \cap C_j) = P(R_i)\,P(C_j)$
But $P(R_i \cap C_j) = E_{ij}/n$
Also $P(R_i) = R_i/n$ and $P(C_j) = C_j/n$
Thus $E_{ij}/n = P(R_i \cap C_j) = P(R_i)\,P(C_j) = (R_i/n)(C_j/n)$
Thus $E_{ij} = R_i C_j / n$
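The expected frequency of each cell is its row total times its column total, divided by the grand total n. The following is a minimal Python sketch of that rule; the function name and the 2 x 3 table of counts are hypothetical, chosen only for illustration.

```python
def expected_counts(observed):
    """E_ij = R_i * C_j / n for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Hypothetical 2 x 3 table of observed counts (assumed for illustration).
observed = [[10, 20, 30],
            [20, 10, 10]]
expected = expected_counts(observed)
print(expected)   # [[18.0, 18.0, 24.0], [12.0, 12.0, 16.0]]
```

The expected table keeps the same row and column totals as the observed table, which is why only (rows - 1)(columns - 1) cells are free to vary.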

Example
In order to study the profits and losses of firms by industry, a random sample of 100 firms is selected, and for each firm in the sample, we record whether the company made money or lost money, and whether the firm is a service company. The data are summarized in a 2x2 contingency table. Is the possibility of making a profit independent of type of industry?

An insurance company's data regarding claims, gathered by studying three different age groups with a sample size of 100 each, is given below

H0: Claim is not related to age

Contingency Table (Observed Values)
            Age group
            25 and under   Over 25 and under 50   50 and over   Total
Claim            40                 35                 60         135
No claim         60                 65                 40         165
Total           100                100                100         300

Contingency Table (Expected Values)
            Age group
            25 and under   Over 25 and under 50   50 and over   Total
Claim            45                 45                 45         135
No claim         55                 55                 55         165
Total           100                100                100         300

fo     fe     (fo-fe)²/fe
40     45     0.56
35     45     2.22
60     45     5.00
60     55     0.45
65     55     1.82
40     55     4.09

Computed χ²              14.14
df                       2
Level of significance    0.05
Tabulated χ²             5.99

Thus reject H0
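The whole test of independence for the claims data can be sketched in Python as below. The function name independence_test is an assumption for illustration; the expected frequencies use the row total times column total over n rule.

```python
def independence_test(table):
    """Return (chi_sq, df) for a contingency table given as a list of rows."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    chi_sq = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = rows[i] * cols[j] / n
            chi_sq += (observed - expected) ** 2 / expected
    return chi_sq, (len(rows) - 1) * (len(cols) - 1)


claims = [[40, 35, 60],    # Claim, by age group
          [60, 65, 40]]    # No claim, by age group
chi_sq, df = independence_test(claims)
print(round(chi_sq, 2), df)   # 14.14 2
print(chi_sq > 5.99)          # True: reject H0
```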

THE MEDIAN TEST

Example
An economist wants to test the null hypothesis that median family incomes in three rural areas are approximately equal. For simplicity, an equal sample size of 10 in each population was chosen. The family incomes are shown below

Family incomes ($1000 per year)
Region A   Region B   Region C
22         31         28
29         37         42
36         26         21
40         25         47
35         20         18
50         43         23
38         27         51
25         41         16
62         57         30
16         32         48

Median of the combined sample of 30 incomes = 31.5

Contingency Table (Observed Values)
                                      Region A   Region B   Region C   Total
No of incomes less than median            4          5          6        15
No of incomes greater than median         6          5          4        15
Total                                    10         10         10        30

Contingency Table (Expected Values)
Expected value = (10 x 15)/30 = 5
                                      Region A   Region B   Region C   Total
No of incomes less than median            5          5          5        15
No of incomes greater than median         5          5          5        15
Total                                    10         10         10        30

fo     fe     (fo-fe)²/fe
4      5      0.2
5      5      0
6      5      0.2
6      5      0.2
5      5      0
4      5      0.2

Computed χ²              0.8
df                       2
Level of significance    0.05
Tabulated χ²             5.99

Thus accept H0
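The median test can be sketched end to end in Python using the standard library. The assignment of incomes to regions below assumes the flattened data are read row by row as (Region A, Region B, Region C), which reproduces the counts 4, 5, 6 in the observed table; the variable names are illustrative.

```python
import statistics

incomes = {
    "A": [22, 29, 36, 40, 35, 50, 38, 25, 62, 16],
    "B": [31, 37, 26, 25, 20, 43, 27, 41, 57, 32],
    "C": [28, 42, 21, 47, 18, 23, 51, 16, 30, 48],
}

pooled = [x for values in incomes.values() for x in values]
median = statistics.median(pooled)                      # 31.5

# Count incomes below and above the pooled median in each region.
below = [sum(x < median for x in v) for v in incomes.values()]
above = [sum(x > median for x in v) for v in incomes.values()]

n = len(pooled)
observed, expected = [], []
for counts in (below, above):
    for region_values, count in zip(incomes.values(), counts):
        observed.append(count)
        # Row total x column total / n, e.g. (15 x 10) / 30 = 5.
        expected.append(sum(counts) * len(region_values) / n)

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(median, below, above, round(chi_sq, 2))
```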
