By SDK, AIM
The chi-square (χ²) test is used to determine how well theoretical distributions (Normal, Binomial, Poisson, etc.) fit empirical distributions (those obtained from samples). Pearson developed this test in 1900 to check the goodness of fit of distributions.
Consider
A particular sample: a set of possible events E1, E2, ..., Ek that are observed to occur with frequencies o1, o2, ..., ok (called observed frequencies). As per the rules of probability, these events are expected to occur with frequencies e1, e2, ..., ek (called expected frequencies).

Event   Observed frequency   Expected frequency
E1      o1                   e1
E2      o2                   e2
...     ...                  ...
Ek      ok                   ek
Example
If we toss a fair coin 100 times, we may expect 50 heads and 50 tails. However, the observed results may not match these expected frequencies exactly.
The χ² variable gives a measure of the disparity between the observed and the expected frequencies.
The χ² Variable
$$\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$$
N = total frequency:
$$N = \sum_{i=1}^{k} o_i = \sum_{i=1}^{k} e_i$$
Thus
$$\chi^2 = \sum_{i=1}^{k} \frac{o_i^2 - 2 o_i e_i + e_i^2}{e_i} = \sum_{i=1}^{k} \frac{o_i^2}{e_i} - 2N + N = \sum_{i=1}^{k} \frac{o_i^2}{e_i} - N$$
Example
I assume that the marks of a class are normally distributed. However, when the examination takes place, I realize that the class has performed better than expected.

Marks     Observed frequency   Expected frequency
0-20      2                    5
21-40     5                    10
41-60     18                   30
61-80     23                   10
81-100    12                   5
Total     60                   60
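As a rough illustration (not part of the original slides), the χ² variable for the marks table can be computed both from the definition and from the shortcut form Σ(o²/e) − N. The function and variable names below are assumptions made for this sketch.

```python
# Observed and expected frequencies from the marks example
observed = [2, 5, 18, 23, 12]
expected = [5, 10, 30, 10, 5]

def chi_square(obs, exp):
    """Sum of (o - e)^2 / e over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

def chi_square_shortcut(obs, exp):
    """Equivalent form sum(o^2 / e) - N; valid only when sum(obs) == sum(exp)."""
    n = sum(obs)
    return sum(o * o / e for o, e in zip(obs, exp)) - n

chi2_marks = chi_square(observed, expected)
chi2_marks_shortcut = chi_square_shortcut(observed, expected)
print(round(chi2_marks, 2), round(chi2_marks_shortcut, 2))
```

Both forms give 35.8 for this table. The shortcut relies on the observed and expected totals being equal (60 here).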
Concept
If χ² = 0: observed and theoretical distributions fit exactly.
If χ² > 0: observed and theoretical distributions differ.
Also, χ² ≥ 0, as it is a sum of squares, and the larger the χ², the greater the difference between the two distributions.
Calculation of degrees of freedom (df)
Population parameters are known: m = 0, so df = k − 1
Population parameters are not known and have to be estimated from sample statistics: df = k − 1 − m, where m = number of population parameters estimated
The χ² Curve
[Figure: χ² curve with the acceptance area and the rejection area separated by the table value of χ²; the computed χ² is compared with this table value.]
Example
In 200 tosses of a coin, 115 heads and 85 tails are observed. Is the coin fair?
H0: Coin is fair
Here there are 2 events:
E1: Outcome is head
E2: Outcome is tail

Event   Observed   Expected   (o-e)²/e
E1      115        100        2.25
E2      85         100        2.25

Computed χ² = 4.5
Critical χ² at df = 1, level of significance = 0.05: 3.8415
Since 4.5 > 3.8415, H0 is rejected at the 0.05 level.
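The coin example can be checked with a few lines of Python. This sketch also computes an upper-tail probability, which the slides do not do. It uses the fact that a χ² variable with 1 degree of freedom is the square of a standard normal variable, so the identity below holds for df = 1 only. The names are assumptions made for illustration.

```python
import math

observed = [115, 85]      # heads, tails
expected = [100, 100]     # fair coin, 200 tosses

chi2_coin = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def p_value_df1(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))

p_coin = p_value_df1(chi2_coin)
CRITICAL_DF1_05 = 3.8415  # table value quoted in the example
reject_h0 = chi2_coin > CRITICAL_DF1_05
print(chi2_coin, round(p_coin, 4), reject_h0)
```

The tail probability comes out at roughly 0.034, which is below 0.05 and so agrees with comparing 4.5 against the table value.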
The expected and actual sales of a TV company are given below

Actual Sales fo   Expected Sales fe   (fo-fe)²/fe
57                59                  0.068
69                76                  0.645
51                55                  0.291
83                75                  0.853
44                39                  0.641
48                53                  0.472
35                30                  0.833
37                48                  2.521
                  χ² =                6.324

Note that df = 7
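Thus
H0: Observed and expected frequencies are similar. The computed χ² of 6.324 is below the table value of 14.067 for df = 7 at the 0.05 level of significance (the level used in the other examples), so H0 is not rejected: the data do not show that actual sales differ from expectation.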
Example
A company requires that college seniors who are seeking placement be interviewed by 3 different executives. Each executive gives the candidate either a +ve or a -ve rating. For staffing purposes, the HR director thinks that the HR process can be approximated by a binomial distribution. Is he right?
Data given:

No of +ve ratings      0    1    2    3
No of candidates fo    18   47   24   11

Here n = 3, r = 0, 1, 2, 3

No of +ve ratings r   nCr   Prob(r)   fe = 100 × Prob(r)   fo    (fo-fe)²/fe
0                     1     0.216     21.6                 18    0.6000
1                     3     0.432     43.2                 47    0.3343
2                     3     0.288     28.8                 24    0.8000
3                     1     0.064     6.4                  11    3.3063
Total                                 100                  100   5.0405

Computed χ² = 5.0405
k = 4, df = 3
Critical χ² at α = 0.05: 7.8147
Thus
H0: A binomial distribution with p = 0.4 is a good description of the interview process. Since 5.0405 < 7.8147, H0 is accepted.
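The expected frequencies in the interview example come from the binomial probabilities nCr · p^r · (1 − p)^(n − r) multiplied by the number of candidates. A minimal sketch of that calculation follows, assuming p = 0.4 as stated in H0. The helper name `binomial_pmf` is hypothetical, and `math.comb` needs Python 3.8 or later.

```python
from math import comb

def binomial_pmf(r, n, p):
    """P(exactly r successes in n independent trials with success probability p)."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

n_interviews, p_positive = 3, 0.4
observed = [18, 47, 24, 11]           # candidates with 0, 1, 2, 3 +ve ratings
total = sum(observed)                 # 100 candidates

expected = [total * binomial_pmf(r, n_interviews, p_positive)
            for r in range(n_interviews + 1)]
chi2_interview = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

df = len(observed) - 1                # df = k - 1 as in the example (p treated as given)
CRITICAL_DF3_05 = 7.8147              # table value quoted in the example
accept_h0 = chi2_interview < CRITICAL_DF3_05
print([round(e, 1) for e in expected], round(chi2_interview, 4), accept_h0)
```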
Example
A salesman for a paper company has 5 accounts to visit per day. It is suggested that the variable, sales, may be described by the binomial distribution, with the probability of selling each account being 0.45. Given the following frequency distribution of the sales per day, can we conclude that the data do in fact follow the suggested distribution? (α = 0.05)

Sales per day   5Cr   Expected frequency of sales, fe   Observed frequency of sales, fo   (fo-fe)²/fe
0               1     7                                 8                                 0.1292
1               5     29                                30                                0.0479
2               10    47                                52                                0.4951
3               10    39                                33                                0.8101
4               5     16                                14                                0.2024
5               1     3                                 3                                 0.0672
Total                 140                               140                               1.7519

Computed χ² = 1.7519
With 6 classes and no parameters estimated, df = 6 − 1 = 5. The critical χ² at α = 0.05 and df = 5 is 11.070, which is greater than the computed χ² of 1.7519. Accept H0.
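The χ² terms in the salesman example appear to have been computed from unrounded expected frequencies, while the table displays fe rounded to whole numbers. The following sketch compares the two; it is only an illustration and the names are assumptions.

```python
from math import comb

n_accounts, p_sale = 5, 0.45
observed = [8, 30, 52, 33, 14, 3]     # days with 0..5 sales
total_days = sum(observed)            # 140

# Unrounded expected frequencies from the binomial probabilities
expected = [total_days * comb(n_accounts, r) * p_sale**r * (1 - p_sale) ** (n_accounts - r)
            for r in range(n_accounts + 1)]

def chi_square(obs, exp):
    """Sum of (o - e)^2 / e over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

chi2_unrounded = chi_square(observed, expected)
chi2_rounded = chi_square(observed, [round(e) for e in expected])
print(round(chi2_unrounded, 4), round(chi2_rounded, 4))
```

The unrounded frequencies reproduce 1.7519, while the rounded ones give about 1.88. Either value is far below the table value, so the conclusion is the same.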
Note that :
The tests of significance are all based on the assumption that the population is normally distributed. However it is not always possible to assume the underlying distribution pattern for the sampling done
If we classify a population into several categories with respect to two attributes (e.g.: age, job preference), we can use the Chi Sq Test to determine if the two attributes are independent of each other
Example
In 4 regions, National Health Company samples its employees' attitudes towards job performance reviews. Respondents are given a choice between the present method of 2 reviews a year and the proposed method of quarterly reviews.
Also,
1. pN is the proportion of employees from the north who prefer the present plan
2. pE is the proportion of employees from the east who prefer the present plan
3. pS is the proportion of employees from the south who prefer the present plan
4. pW is the proportion of employees from the west who prefer the present plan
H0: pN = pE = pS = pW
Contingency table

                                    North   South   East   West   Total
Number who prefer present method    68      75      57     79     279
Number who prefer new method        32      45      33     31     141
Total                               100     120     90     110    420
Combined proportion of employees preferring the present method = 279/420 = 0.6643
Thus combined proportion of employees preferring the new method = 1 − 0.6643 = 0.3357
Thus,
1. 0.6643 = Estimate of population proportion who prefer the current method
2. 0.3357 = Estimate of population proportion who prefer the new method
Multiply each estimate by the total number of employees sampled in each region to get the expected number for that region.
Contingency table

Observed values
                                    North   South   East   West   Total
Number who prefer present method    68      75      57     79     279
Number who prefer new method        32      45      33     31     141
Total                               100     120     90     110    420

Expected values
                                    North   South   East   West   Total
Number who prefer present method    66      80      60     73     279
Number who prefer new method        34      40      30     37     141
Total                               100     120     90     110    420
Degrees of freedom = (4 − 1)(2 − 1) = 3
Thus H0 is accepted: the regional proportions preferring the present method do not differ significantly.
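The survey example does not show the computed χ². A minimal sketch of the calculation, following the pooled-proportion steps described above, might look like this. The names are assumptions, and the expected counts are left unrounded.

```python
regions = ["North", "South", "East", "West"]
prefer_present = [68, 75, 57, 79]
prefer_new = [32, 45, 33, 31]

region_totals = [a + b for a, b in zip(prefer_present, prefer_new)]
grand_total = sum(region_totals)                       # 420
p_present = sum(prefer_present) / grand_total          # pooled estimate, about 0.6643
p_new = 1 - p_present

# Expected counts: pooled proportion times each region's sample size (not rounded)
exp_present = [p_present * t for t in region_totals]
exp_new = [p_new * t for t in region_totals]

chi2_survey = sum((o - e) ** 2 / e
                  for o, e in zip(prefer_present + prefer_new, exp_present + exp_new))
df_survey = (len(regions) - 1) * (2 - 1)
print(round(p_present, 4), round(chi2_survey, 2), df_survey)
```

With unrounded expected counts the statistic is about 2.76; using the whole-number expected values from the table gives about 3.03. Both are well below 7.8147, the 0.05 table value for df = 3 quoted in an earlier example.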
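        C1    C2    C3    Total
R1      O11   O12   O13   R1
R2      O21   O22   O23   R2
Total   C1    C2    C3    n

Here Ri and Cj also denote the row and column totals, and n is the grand total.
H0: Ri is independent of Cj, or
$$P(R_i \cap C_j) = P(R_i)\,P(C_j)$$
But $P(R_i \cap C_j) = E_{ij}/n$. Also $P(R_i) = R_i/n$ and $P(C_j) = C_j/n$. Thus
$$\frac{E_{ij}}{n} = \frac{R_i}{n} \cdot \frac{C_j}{n}$$
Thus
$$E_{ij} = \frac{R_i C_j}{n}$$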
Example
In order to study the profits and losses of firms by industry, a random sample of 100 firms is selected, and for each firm in the sample, we record whether the company made money or lost money, and whether the firm is a service company. The data are summarized in a 2x2 contingency table. Is the possibility of making a profit independent of the type of industry?
An insurance company's data regarding claims, gathered by studying three different age groups of sample size 100 each, is given below

Contingency Table (Observed Values)
             Age group
             25 and under   Over 25 and under 50   50 and over   Total
Claim        40             35                     60            135
No claim     60             65                     40            165
Total        100            100                    100           300

Contingency Table (Expected Values)
             Age group
             25 and under   Over 25 and under 50   50 and over   Total
Claim        45             45                     45            135
No claim     55             55                     55            165
Total        100            100                    100           300

fo    fe    (fo-fe)²/fe
40    45    0.56
35    45    2.22
60    45    5.00
60    55    0.45
65    55    1.82
40    55    4.09

Computed χ² = 14.14
Table value of χ² at the 0.05 level of significance = 5.99
Thus reject H0
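For a table of any size, the expected values follow from the row and column totals as (row total × column total) / n. The sketch below applies that rule to the claims data; the function names are hypothetical and chosen only for this illustration.

```python
def expected_counts(table):
    """Expected count E_ij = (row total i) * (column total j) / n under independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for an r x c table."""
    expected = expected_counts(table)
    chi2 = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

claims = [[40, 35, 60],    # claim
          [60, 65, 40]]    # no claim
chi2_claims, df_claims = chi_square_independence(claims)
print(expected_counts(claims), round(chi2_claims, 2), df_claims)
```

The same two functions should work unchanged for the 2 x 4 survey table or any other r x c table of observed counts, provided no row or column total is zero.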
Example
An economist wants to test the null hypothesis that median family incomes in three rural areas are approximately equal. For simplicity, an equal sample size of 10 in each population was chosen. The family incomes are shown below.

Family incomes ($1000 per year)
Region A   Region B   Region C
22         31         28
29         37         42
36         26         21
40         25         47
35         20         18
50         43         23
38         27         51
25         41         16
62         57         30
16         32         48

Median of all 30 incomes = 31.5

Contingency Table (Observed Values)
                                     Region A   Region B   Region C   Total
No of incomes less than median       4          5          6          15
No of incomes greater than median    6          5          4          15
Total                                10         10         10         30

Contingency Table (Expected Values)
Expected value = (10 × 15)/30 = 5
                                     Region A   Region B   Region C   Total
No of incomes less than median       5          5          5          15
No of incomes greater than median    5          5          5          15
Total                                10         10         10         30

fo    fe    (fo-fe)²/fe
4     5     0.2
5     5     0
6     5     0.2
6     5     0.2
5     5     0
4     5     0.2

Computed χ² = 0.8
df = 2
Level of significance = 0.05
Tabulated χ² = 5.99
Thus accept H0
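The steps of this median test (pool the samples, find the overall median, count incomes below and above it in each region, then apply χ² to the resulting table) can be sketched as follows. The variable names are assumptions made for this illustration.

```python
from statistics import median

incomes = {
    "A": [22, 29, 36, 40, 35, 50, 38, 25, 62, 16],
    "B": [31, 37, 26, 25, 20, 43, 27, 41, 57, 32],
    "C": [28, 42, 21, 47, 18, 23, 51, 16, 30, 48],
}

# Median of the pooled sample (all 30 incomes)
pooled_median = median(x for values in incomes.values() for x in values)

# Counts below and above the pooled median, one entry per region
below = [sum(x < pooled_median for x in v) for v in incomes.values()]
above = [sum(x > pooled_median for x in v) for v in incomes.values()]

table = [below, above]
row_totals = [sum(r) for r in table]
col_totals = [sum(c) for c in zip(*table)]
n = sum(row_totals)

# Chi-square with expected counts (row total * column total) / n
chi2_median = sum((table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
                  / (row_totals[i] * col_totals[j] / n)
                  for i in range(2) for j in range(3))
df_median = (2 - 1) * (3 - 1)
print(pooled_median, below, above, round(chi2_median, 2), df_median)
```

No income in this data set equals the pooled median of 31.5, so every observation falls into one of the two rows. Data with ties at the median would need an explicit rule for where to count them, which this sketch does not handle.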