
1. INTRODUCTION
Categorical data arise in a myriad of situations, and methods for analysing such data are highly relevant today, when statistics has an indispensable role to play. Statistical methods for categorical data were late in gaining sophistication, unlike those for continuous data, which reached maturity early in the 20th century.
1.1 CATEGORICAL RESPONSE DATA
A ‘categorical variable’ has a measurement scale consisting of a set of categories. For example, ‘political philosophy’ may be measured as Liberal, Moderate or Conservative.
Categorical scales are all-pervasive as exhibited in the following examples:
Field: variable or phenomenon (categories)
1. Cultural Activities: Music preference (Carnatic, Hindustani, Light Music, Western, Folk)
2. Zoology: Food preference of alligators (Fish, Invertebrate, Reptile)
3. Education: Course preference of students at UG level (Arts, Science, Engineering, Medicine, Fine Arts/Media, Law)
4. Medical Science: Outcome of head injury (Mild pain in head, Acute pain in head, Severe injury requiring hospitalization, Critical injury requiring intensive care, Coma, Death)
5. Marketing: Consumer preference for a product (Brand A, Brand B, Brand C)
6. Engineering (Quality Control): Quality of a product (Conforming to standards, Non-conforming but usable, Non-conforming & useless)
7. Management: Employee satisfaction (Dissatisfied, Somewhat satisfied, Well satisfied, Happy)
8. Spiritual activities: Religious affiliation (Hindu, Christian, Muslim, Jain, Buddhist)
9. Behavioural Sciences: Type of mental illness (Schizophrenia, Depression, Neurosis)
10. Transportation: Mode of transport (Bus, Train, Car, Motorcycle, Cycle)

1.1.1 Categorical Variables – Nominal and Ordinal Scales
Categorical variables have two primary types of scales: Nominal & Ordinal.
Variables having categories without a natural ordering are called Nominal.
Variables in Examples 1, 2, 3, 5, 8, 9, 10 in the table above are Nominal.
For a nominal variable, the order of listing the categories is irrelevant and the statistical analysis
does not depend on that ordering.
Variables with categories that have an ordering are called Ordinal.
Variables in Examples 4, 6, 7 in the table above are Ordinal.
Ordinal variables have ordered categories, but the distances between categories are unknown. In Example 4 above, though an injury categorized as ‘Critical Injury’ is worse than ‘Severe Injury’, no numerical value describes how much worse it is [of course, medical personnel may have important parameters whose measurement leads to the categorization as ‘Severe’, ‘Critical’, and so on].
1.1.2 Numerical Variables – Interval and Ratio Scales
Numerical variables are those that have numerical distances between two values. There are two
types: Interval & Ratio Scale.
Numerical variables whose measurements are based on an ‘arbitrary’ origin (that is, one for which ‘zero’ is not a natural zero but is chosen for convenience) are known as Interval variables. For example, ‘temperature (in centigrade)’ is an interval variable because the ‘zero’ value of temperature is ‘chosen’ as the temperature at which water freezes. Similarly, the time of day (railway time) is measured by taking midnight as the ‘origin’. For interval variables, the ratio of two values does not have an interpretation.
Numerical variables that have a ‘natural’ zero and for which the ‘ratio’ of two values has a valid interpretation are known as Ratio-Scale variables. For example, income, number of years of education, weight, distance, lifetime of an electronic product, etc., are in ratio scale.
It may however be noted that the difference between Interval and Ratio scales is sometimes blurred, and we use the term ‘Interval variables’ for both types.
The way that a variable is measured determines its classification. For example, ‘Education’ is
only nominal if measured as government school, aided school & private school. ‘Education’
becomes ordinal if measured as primary, secondary, higher-secondary, bachelor’s degree,
master’s degree, doctorate. It is a ratio-scale variable if measured as ‘Number of years of
education’.
Note: The hierarchy of measurement is: Nominal < Ordinal < Numerical.
Methods suited for one type can be used for higher types. For instance, methods for nominal variables can be used for ordinal variables by ignoring the ordering of categories. However, the reverse is not possible, since nominal categories cannot be assigned orderings meaningfully. We also note that numerical variables with a small number of distinct values, and numerical variables categorized as low, medium and high, can be treated as ordinal.

1.1.3 Discrete - Continuous Distinction
Numerical variables are classified as discrete or continuous according to the number of values they can take. Actual measurements occur in a discrete manner due to limitations of our instruments. In practice, the discrete-continuous classification distinguishes between variables that take many values and variables that take relatively few. For instance, discrete variables having a large number of values (e.g., examination scores, numbers of inhabitants of towns) are treated as continuous responses.

1.2 DISTRIBUTIONS RELEVANT TO CATEGORICAL DATA


Inferential data analysis requires ‘reasonable’ assumptions about the random mechanism that generates the data. For continuous responses, it is well known that the normal distribution plays a central role. For categorical responses, the following are relevant:
Hypergeometric Distribution
• Binary responses measured in a sample drawn without replacement from a finite [small]
population.
Binomial Distribution
• Binary responses from ‘independent’ trials conducted under ‘identical’ conditions.
• Binary responses measured in a sample drawn with replacement from a finite population
• Binary responses measured in a sample drawn without replacement from a relatively
‘large’ population.
Multinomial Distribution
• Responses with more than two possible categories of outcomes obtained from
independent trials under identical conditions
Poisson Distribution
• Count data that do not result from a fixed number of trials
• Count of events that occur randomly over time or space (i.e. when outcomes in ‘non-
overlapping’ intervals of time or ‘disjoint’ regions are independent).
• Binomial count, when the number of trials is ‘sufficiently large’ and the probability of
‘success’ in each trial is ‘sufficiently’ small.
Overdispersion
• Count data that exhibit variability exceeding the ones governed by Binomial or Poisson
laws.
• In a ‘Binomial-like’ situation, when there is no assurance of a ‘constant’ probability of
success in different trials [ i.e., the success probability varies due to various factors; in
other words, success probability itself is ‘random’ (Bayesian) ]
• In a ‘Poisson-like’ situation, when variance can ‘exceed’ the mean, the negative binomial
distribution is relevant.

Connection between Poisson and Multinomial distributions
Let Y1, Y2, . . ., Yk be independent Poisson variables with means λ1, λ2, . . ., λk. By the additive property, the total Σ Yi follows Poisson(λ), where λ = Σ λi. Now consider the conditional distribution of (Y1, Y2, . . ., Yk) given Σ Yi = n:
P[Y1 = n1, Y2 = n2, . . ., Yk = nk | Σ Yi = n]
= P[Y1 = n1, Y2 = n2, . . ., Yk = nk, Σ Yi = n] / P[Σ Yi = n]
= P[Y1 = n1, Y2 = n2, . . ., Yk−1 = nk−1, Yk = n − n1 − · · · − nk−1] / P[Σ Yi = n]
= { Π_i exp(−λi) λi^ni / ni! } / { exp(−λ) λ^n / n! }
= (n! / Π_i ni!) Π_i (λi/λ)^ni, where nk = n − n1 − n2 − · · · − nk−1,
which is the p.m.f. of a Multinomial distribution with parameters (n, λ1/λ, λ2/λ, . . ., λk−1/λ).
Analyses using multinomial law usually have the same parameter estimates as those of analyses
under the Poisson law, due to the similarity in the likelihood functions.
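This connection is easy to check by simulation; a minimal sketch, with λ values of my own choosing (not from the text):

    import numpy as np

    rng = np.random.default_rng(0)
    lam = np.array([2.0, 3.0, 5.0])                 # lambda_1, ..., lambda_k
    n_target = 10                                   # condition on the total

    draws = rng.poisson(lam, size=(200_000, lam.size))
    cond = draws[draws.sum(axis=1) == n_target]     # keep draws with sum Y_i = n

    print(cond.mean(axis=0))                        # simulated conditional means
    print(n_target * lam / lam.sum())               # multinomial means n * lambda_i / lambda
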
Exercises:
1) Each of 100 multiple-choice questions on an exam has four possible answers, one of which is correct. For each question, a student guesses by selecting an answer randomly.
Q.a] Specify the distribution of the student’s number of correct answers
Answer: Since the student answers at random, the prob. of giving a correct answer for any question is 1/4. Also, because of the way he answers, his answers to different questions are independent of each other. Thus, each trial is Bernoulli with success probability 1/4 and there are 100 trials. Thus, the number of correct answers follows B(100, 1/4).
Q. b] Find the mean and s.d. of that distribution. Would it be surprising if the student made at
least 50 correct responses? Why?
Answer: Mean = np = 100 × 1/4 = 25; Var = np(1 − p) = 100 × 1/4 × 3/4 = 18.75, so s.d. = √18.75 ≈ 4.33.
Denote the number of correct answers as X.
The prob. of at least 50 correct answers is P(X ≥ 50) = 0.000000021, which is extremely small. Thus the event “X ≥ 50” is a very improbable event, and it would be really surprising if this event happened.
Q. c] Specify the distn of (n1, n2, n3, n4) where nj is the no. of times the student picked choice ‘j’.
Answer: The distn of (n1, n2, n3) is Multinomial (100, 1/4, 1/4, 1/4).
This is because, each trial can result in one of four choices with equal probability 1/4. Thus, we
get Multinomial distn.
[Also, note that n4 is not considered because, specification of the other three automatically
specifies the fourth one].

Q. d] Find E(nj), Var(nj), Cov(nj, nk), Corr(nj, nk)
Answer: E(nj) = 100 x 1/4 = 25, j = 1,2,3,4.
Var(nj) = 100 x 1/4 x 3/4 = 18.75
Cov(nj, nk) = – 100 x 1/4 x 1/4 = – 6.25
Corr(nj, nk) = − 6.25 / √(18.75 × 18.75) = − 1/3
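These values can be verified with scipy; a short sketch (only standard scipy.stats calls):

    from scipy import stats

    n, p = 100, 0.25
    print(n * p, n * p * (1 - p))        # mean 25, variance 18.75
    print(stats.binom.sf(49, n, p))      # P(X >= 50) = 1 - P(X <= 49) ~ 2.1e-8
    print(-n * p * p)                    # Cov(n_j, n_k) = -n p_j p_k = -6.25
    print(-p / (1 - p))                  # Corr(n_j, n_k) = -6.25/18.75 = -1/3
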
______________________________________________________________________________
2) The British author Graham Greene underwent a period of mental depression and, during that time, played a ‘game’ which consisted of putting a bullet in one of the six chambers of a pistol, spinning the chambers to select one at random, and then firing the pistol once at his head.
Q. a] Greene played this game six times and was lucky that none of them resulted in the bullet firing. Find the prob. of this outcome.
Answer: The prob. of the bullet firing in any single trial is 1/6 and of not firing is 5/6.
In six independent trials, the prob. of no firing is (5/6)^6 ≈ 0.335.
Q. b] Suppose that he had kept playing the game until the bullet fired. Let ‘Y’ denote the number of the game on which it fires. What is the p.m.f. of Y?
Answer: Y can take the values y = 1, 2, . . . (unbounded).
The event [Y = y] happens iff the first (y − 1) trials fail to fire and the yth trial fires.
So, P[Y = y] = (5/6)^(y−1) (1/6), y = 1, 2, . . ., which is the p.m.f. of the Geometric distribution.
______________________________________________________________________________
3) Suppose that P(Yi = 1) = π, i = 1, 2, . . ., n. Let Y = Σ_{i=1}^n Yi.
Q. a] When {Yi} have pairwise correlation ρ > 0, show that Var(Y) > n π(1 − π), overdispersion relative to the Binomial.
Answer: Var(Y) = Var(Σ Yi) = Σ_i Var(Yi) + 2 Σ_{i<j} Cov(Yi, Yj)
= n π(1 − π) + n(n − 1) ρ π(1 − π) > n π(1 − π),
since each Cov(Yi, Yj) = ρ π(1 − π) and there are n(n − 1)/2 pairs with i < j.


Q. b] Suppose that heterogeneity exists: P(Yi = 1 | π) = π for all i, the Yi’s are independent given π, but π is a r.v. with density function g(·) on [0, 1] having mean ρ and positive variance. Show that Var(Y) > n ρ(1 − ρ).
Answer: We have E(Yi | π) = π, Var(Yi | π) = π(1 − π). So, E(Y | π) = nπ, Var(Y | π) = nπ(1 − π).
We also have E(π) = ρ and V(π) > 0.
Now, V(Y) = V[E(Y | π)] + E[V(Y | π)] = V(nπ) + E[nπ(1 − π)]
= n² V(π) + n { E(π) − E(π²) }
= n² V(π) + n { E(π) − V(π) − [E(π)]² }
= n² V(π) + n { ρ − V(π) − ρ² }
= n ρ(1 − ρ) + (n² − n) V(π) > n ρ(1 − ρ), since (n² − n) V(π) > 0.

Q. c] Suppose that P(Yi = 1 | πi) = πi, i = 1, 2, . . ., n, where {πi} are independent from g(·). Explain why Y has a B(n, ρ) distribution unconditionally, but not conditionally on the {πi}.
Answer: When we condition on all the πi’s, Y is the sum of independent but not identically distributed Bernoulli variables. In fact, E(Y | π1, . . ., πn) = π1 + π2 + · · · + πn and
V(Y | π1, . . ., πn) = V(Y1 | π1, . . ., πn) + V(Y2 | π1, . . ., πn) + · · · + V(Yn | π1, . . ., πn)
= V(Y1 | π1) + V(Y2 | π2) + · · · + V(Yn | πn)
[since Yi depends on πi alone and not on the other πj’s]
= π1(1 − π1) + π2(1 − π2) + · · · + πn(1 − πn).
We can easily see that the conditional mean and conditional variance are not of the form np and np(1 − p). Thus, clearly, the conditional distribution of Y cannot be Binomial.
Now, we consider the unconditional distribution. The unconditional probability is
P(Yi = 1) = ∫₀¹ P(Yi = 1 | πi) g(πi) dπi = ∫₀¹ πi g(πi) dπi = E(πi) = ρ, which is a number in (0, 1).

Thus, unconditionally, each Yi is a r.v. having a B(1, ρ) distribution. So, Y is the sum of independent &
identically distributed Bernoulli r.v.’s and therefore has B(n, ρ) distribution.
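The overdispersion in part (b) can be seen in a quick simulation; a minimal sketch assuming, for illustration only, π ~ Beta(2, 3), so that ρ = E(π) = 0.4:

    import numpy as np

    rng = np.random.default_rng(1)
    n, a, b = 20, 2.0, 3.0
    rho = a / (a + b)                     # E(pi) = 0.4

    pi = rng.beta(a, b, size=100_000)     # one shared pi per replication, as in (b)
    y = rng.binomial(n, pi)               # Y | pi ~ Binomial(n, pi)

    print(y.var())                        # exceeds the binomial benchmark below
    print(n * rho * (1 - rho))            # n * rho * (1 - rho) = 4.8
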

1.3 INFERENCE FOR CATEGORICAL DATA


Maximum likelihood estimation is generally recommended due to some desirable features. It is
known that MLE is the parameter value under which the observed data have the highest
probability (likelihood) of occurrence. Under weak regularity conditions, such as the parameter
space having fixed dimension with the true value falling in its interior, MLEs have desirable
properties:
• They have large-sample normal distributions
• They are asymptotically efficient, producing large-sample standard errors not exceeding
those from other estimation methods.
Let β = (β1, β2, . . ., βp)' denote the (vector) parameter and let L(β) be the log-likelihood function. For many models, L(β) has a concave shape and the MLE β̂ is the point at which the derivative of L(β) equals zero.
Let I = ((i_jk)) be the ‘Information Matrix’. Its (j, k)th element is i_jk = − E[ ∂²L(β) / ∂βj ∂βk ].
[Note: The diagonal elements of the above matrix represent the ‘curvature’ of the log-likelihood, and the greater the curvature, the better, since large curvature implies that the log-likelihood drops quickly as β moves away from β̂.]
Let VC(β̂) = ((Cov(β̂j, β̂k))) be the variance-covariance matrix of β̂.

Under the usual regularity conditions, VC(β̂) = I⁻¹. Clearly, the greater the curvature, the smaller the standard errors of the estimates.
1.3.1 Wald Test
Suppose we wish to test Ho: β = βo versus H1: β ≠ βo, where β is a real parameter. The test statistic ZW = (β̂ − βo) / SE(β̂) has an approximate standard normal distribution under Ho. We refer to the standard normal tail probabilities (two-sided) to get the p-value. Equivalently, we can use ZW², which has a chi-square distribution with 1 d.f. under Ho, and refer to the right-tail probabilities of χ²(1) to get the p-value.
Now consider Ho: β = βo versus H1: β ≠ βo, where β is a vector parameter. The multiparameter extension of the Wald test uses the test statistic W = (β̂ − βo)' [VC(β̂)]⁻¹ (β̂ − βo), which has a Chi-Square(r) distribution under Ho, where r = rank of VC(β̂) = number of non-redundant parameters in β.

1.3.2 Likelihood Ratio Test (Wilks)


This test is applicable to testing any Ho versus any H1. Let ℓ0 be the maximum of the likelihood under Ho and ℓ1 be the maximum of the likelihood in the full parameter space (Ho ∪ H1). It may be noted that ℓ0 ≤ ℓ1. The ratio Λ = ℓ0 / ℓ1 is known as Wilks’ Lambda.
Under Ho, for large sample size, −2 log Λ [or −2(L0 − L1), where L0 = log ℓ0 and L1 = log ℓ1] follows a Chi-square distribution with (d − do) degrees of freedom, where ‘d’ is the dimension of the full parameter space while ‘do’ is the dimension under Ho. The right-tail probabilities of χ²(d − do) give the p-values. We denote −2 log Λ as G² and call it the LR statistic.

1.3.3 Score Test (R. A. Fisher & C. R. Rao)


Consider the test for Ho: β = βo versus H1: β ≠ βo, where β is a real parameter. The score test is based on the ‘slope’ and the ‘expected curvature’ of the log-likelihood function.
Denote u(βo) = ∂L(β)/∂β [slope of the log-likelihood, evaluated at βo] and
i(βo) = − E[ ∂²L(β)/∂β² ] = Information evaluated at βo.
The score statistic is ZS = u(βo) / [i(βo)]^½, which has an approximate standard normal distribution under Ho. The equivalent chi-squared form is [u(βo)]² / i(βo).
Now consider Ho: β = βo versus H1: β ≠ βo, where β is a vector parameter. Denote
u(βo) = ( ∂L(β)/∂β1, ∂L(β)/∂β2, . . ., ∂L(β)/∂βp )' [evaluated at βo] and
I(βo) = ((i_jk)) = Information Matrix evaluated at βo.
The multiparameter extension of the score test uses the statistic [u(βo)]' [I(βo)]⁻¹ [u(βo)], which follows an approximate χ²(p) distribution under Ho.
Note: Of the three tests given above, the Likelihood Ratio Test uses the most information [evaluating the (log-)likelihood at both βo and β̂]. As the sample size becomes large, the three tests have certain asymptotic equivalences. For moderate-sized samples, the LR test is usually more reliable than the Wald test.

1.3.4 Confidence Intervals


A confidence interval results from inverting any test. For instance, a 95% confidence interval for β is the set of βo’s for which the test of Ho: β = βo has a p-value exceeding 0.05 (i.e., is not rejected).
The Wald CI is the set of βo for which |β̂ − βo| / SE(β̂) < zα/2,
i.e., β̂ ± zα/2 SE(β̂), where zα/2 is the upper α/2 critical point of the N(0, 1) distribution.
The LR-based CI is the set of βo for which −2 log Λ < χ²α(1), where χ²α(1) is the upper α critical point of the χ²(1) distribution.

Note: When ˆ has a normal distribution, the log-likelihood function has a parabolic shape. For
small samples with categorical data, ˆ may be far from normality and the log-likelihood
function may be far from parabolic shape. This can also happen with moderate to large samples
when a model contains many parameters. In such cases, inference based on asymptotic normality
does not perform well. Inference can instead use an exact small-sample distribution.
The Wald CI is most commonly used since it is simple to construct using the MLEs and
standard errors reported in software. For small to moderate sample sizes, the LR based CI is
preferable. Note that, for a linear regression model, with a normal response variable, all the three
approaches provide nearly identical results.

1.4 INFERENCE FOR BINOMIAL PARAMETER


Consider ‘n’ independent Bernoulli trials, each with success probability ‘π’, and let ‘y’ be the number of successes. The MLE of π is π̂ = y/n.
1.4.1 Tests for Ho: π = πo
Wald Test
We note that Var(π̂) = π(1 − π)/n and SE(π̂) = √[π̂(1 − π̂)/n]. The Wald statistic is
ZW = (π̂ − πo) / √[π̂(1 − π̂)/n]  - - - (1.4.1)
LR Test
The log-likelihood function is L = c + y log π + (n − y) log(1 − π), so
L0 = c + y log πo + (n − y) log(1 − πo), L1 = c + y log π̂ + (n − y) log(1 − π̂).
The LR statistic is
−2 log Λ = 2 [ y log( y / (nπo) ) + (n − y) log( (n − y) / (n − nπo) ) ]  - - - (1.4.2)
which follows χ²(1).

Note that y = observed no. of successes and nπo = expected no. of successes under Ho, while n − y = observed no. of failures and n − nπo = expected no. of failures under Ho.
The LR statistic is thus 2 Σ observed × log( observed / fitted(under Ho) ). It compares observed success and failure counts to the fitted (null hypothesis) counts.

Score Test
y n y y  n y n y
We have u (π) = ∂L(π) / ∂π =  = & ∂2 L(π) / ∂π 2
=–  so
 1   (1   )  2
(1   ) 2
n n  n n
thati(π) = +  = and so
 (1   ) 2
 (1   )
y  n 0 ˆ   0
ZS = u (πo) / [ i(πo) ] ½ = = - - - (1.4.3)
n  0 (1   0 )  0 (1   0 ) / n

Note: A significance test merely indicates whether a particular value of the parameter is
plausible. But, by using ‘confidence intervals’ we determine a range of plausible values.
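Before turning to intervals, the three statistics (1.4.1)-(1.4.3) can be collected in one small sketch; the function name and the illustrative numbers (y = 60, n = 100, πo = 0.5) are mine, not from the text:

    import numpy as np
    from scipy import stats

    def binom_tests(y, n, pi0):
        # Wald, score and LR statistics for Ho: pi = pi0 (sketch)
        pi_hat = y / n
        z_wald = (pi_hat - pi0) / np.sqrt(pi_hat * (1 - pi_hat) / n)   # (1.4.1)
        z_score = (pi_hat - pi0) / np.sqrt(pi0 * (1 - pi0) / n)        # (1.4.3)
        g2 = 2 * (y * np.log(y / (n * pi0))
                  + (n - y) * np.log((n - y) / (n - n * pi0)))         # (1.4.2)
        return z_wald, z_score, g2

    zw, zs, g2 = binom_tests(y=60, n=100, pi0=0.5)
    print(zw, zs, g2)                                            # ~2.04, 2.00, 4.03
    print(2 * stats.norm.sf(abs(zs)), stats.chi2.sf(g2, df=1))   # p-values
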

1.4.2 Confidence intervals for π


A value πo is accepted (plausible) by the Wald test if −zα/2 < (π̂ − πo)/√[π̂(1 − π̂)/n] < zα/2.
Thus, the 100(1 − α)% Wald CI for π is π̂ ± zα/2 √[π̂(1 − π̂)/n]  - - - (1.4.4)
Historically, this was the CI used, but it has been found to perform poorly unless ‘n’ is very large.

The LR-based CI is computationally complex. It is the set of πo values for which the LR test has a p-value exceeding α. Equivalently, it is the set of plausible πo values for which
2 [ y log( y / (nπo) ) + (n − y) log( (n − y) / (n − nπo) ) ] < χ²α(1).
In other words, it is the set of plausible πo values for which 2(log-likelihood) drops by less than χ²α(1) from its value at the ML estimate π̂.
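In practice this interval is found numerically. A grid-search sketch (function name mine; assumes 0 < y < n):

    import numpy as np
    from scipy import stats

    def lr_ci(y, n, alpha=0.05):
        # LR-based CI for pi: scan candidate pi_0 values, keep those not rejected
        grid = np.linspace(1e-6, 1 - 1e-6, 200_000)
        g2 = 2 * (y * np.log(y / (n * grid))
                  + (n - y) * np.log((n - y) / (n - n * grid)))
        keep = grid[g2 < stats.chi2.ppf(1 - alpha, df=1)]
        return keep.min(), keep.max()

    print(lr_ci(y=60, n=100))
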

The Score CI is the set of πo values for which |ZS| < zα/2. The end-points of the CI are the ‘πo’ solutions to the equations
(π̂ − πo) / √[πo(1 − πo)/n] = ± zα/2.
Squaring, we get n(π̂ − πo)² = πo(1 − πo) z²α/2, which is a quadratic equation in πo, namely
(n + z²α/2) πo² − (2nπ̂ + z²α/2) πo + nπ̂² = 0.
The solutions to this equation are
πo = [ (2nπ̂ + z²α/2) ± √( (2nπ̂ + z²α/2)² − 4nπ̂²(n + z²α/2) ) ] / [ 2(n + z²α/2) ],
which reduces to
πo = π̂ [ n/(n + z²α/2) ] + (1/2) [ z²α/2/(n + z²α/2) ]
     ± zα/2 √( [π̂(1 − π̂)/n*] [ n/(n + z²α/2) ] + [1/(4n*)] [ z²α/2/(n + z²α/2) ] )  - - - (1.4.5)
where n* = n + z²α/2 is an adjusted sample size.
It is seen that the mid-point of this interval is a weighted average of π̂ and ½, where the weight for π̂ increases as ‘n’ increases. Similarly, inside the square root we have the same weighted average of π̂(1 − π̂)/n*, the [adjusted] estimated variance of the sample proportion, and of 1/(4n*), the [adjusted] variance of the sample proportion when the true proportion is ½.
Generally, the Score CI performs much better than the Wald CI.
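Equation (1.4.5) is the Wilson interval; a closed-form sketch (function name mine), checked against the survey data used in the first exercise below:

    import numpy as np
    from scipy import stats

    def score_ci(y, n, alpha=0.05):
        # Score (Wilson) CI for pi from equation (1.4.5) (sketch)
        z = stats.norm.ppf(1 - alpha / 2)
        pi_hat = y / n
        centre = (y + z**2 / 2) / (n + z**2)   # weighted average of pi_hat and 1/2
        half = (z / (n + z**2)) * np.sqrt(n * pi_hat * (1 - pi_hat) + z**2 / 4)
        return centre - half, centre + half

    print(score_ci(y=842, n=1824))             # approximately (0.4388, 0.4845)
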

Exercises:
1) In a survey, a question requiring a yes/no (support/oppose) response was addressed to a sample of 1824 persons to find out the support for a new proposal; 842 said ‘yes’. Let π be the population proportion who would reply ‘yes’ (support). Find the p-value for testing Ho: π = 0.5 using the score test and construct a 95% CI for π. Interpret the results.
Solution: We have n = 1824 and observed y = 842, so π̂ = 842/1824 = 0.4616.
The score statistic is ZS = (π̂ − πo)/√[πo(1 − πo)/n] = (0.4616 − 0.5)/√(0.5 × 0.5/1824) = −3.28, with a p-value of 0.00104.
[The chi-squared form is ZS² = 10.7584, with the same p-value of 0.00104.] Ho is strongly rejected.
Now, for the 95% CI, we have α = 0.05 and zα/2 = 1.96. Substituting these quantities in equation (1.4.5), we get the 95% CI for π as (0.4388, 0.4845).
We first see that the estimated proportion of support is 0.4616, below the hypothesized value 0.5. Further, the 95% CI does not contain the value 0.5 and lies entirely below it. With 95% confidence, we can say that only about 44% to 48.5% of the population support the new proposal.

2) In an experiment on chlorophyll inheritance in maize, among 1103 seedlings of self-fertilized heterozygous green plants, 854 seedlings were green and 249 were yellow. Theory predicts that the ratio of green to yellow is 3:1. Test the hypothesis that 3:1 is the true ratio. Report the p-value and interpret.
Solution: Taking ‘green’ as success, we have to test Ho: π = 3/4. We have n = 1103 and π̂ = 854/1103 = 0.774.
The Wald statistic is ZW = (π̂ − πo)/√[π̂(1 − π̂)/n] = 1.906, with a p-value of 0.0567.
The LR statistic is −2 log Λ = 3.539; the p-value from χ²(1) is 0.0599.
The score statistic is ZS = (π̂ − πo)/√[πo(1 − πo)/n] = 1.841, with a p-value of 0.0656.
None of the three tests shows evidence against Ho, so we accept Ho: the data are consistent with the 3:1 ratio.

1.5 INFERENCE FOR MULTINOMIAL PARAMETERS


Consider ‘n’ independent multinomial trials, each trial leading to one of ‘c’ categories of outcomes, with probability πj for the jth category. Let nj be the number of jth-category outcomes, so that n1 + · · · + nc = n. The (kernel of the) log-likelihood is L = Σ_{j=1}^c nj log πj, where πc = 1 − π1 − · · · − πc−1, and the MLEs are π̂j = nj/n.

1.5.1 Tests for Ho: πj = πjo, j = 1, 2, . . ., c (where Σ πjo = 1)


Score Test - Pearson Statistic for testing a specified multinomial
When Ho is true, the expected frequencies are n̂j = n πjo. Pearson’s statistic is
X² = Σ_{j=1}^c (nj − n̂j)² / n̂j,
which has an approximate χ²(c − 1) distribution. Ho is rejected for large values of X², i.e., for small ‘p’ values.
Note that the ‘p’ value = P[ χ²(c − 1) > observed value of X² ].
Special Case, Binomial (c = 2): We denote π1 = π, so π2 = 1 − π. Thus n1 = y, n̂1 = nπo and n2 = n − y, n̂2 = n − nπo, and π̂ = y/n. Then
Pearson’s X² = (y − nπo)²/(nπo) + (y − nπo)²/[n(1 − πo)] = (y − nπo)²/[nπo(1 − πo)],
which is the same as the chi-square form of the score test [refer Section 1.4.1; it is seen that Pearson’s X² = ZS²].
[For a general value of ‘c’ also, we can show that Pearson’s test is the same as the score test.]

LR Test
The (maximum of the) log-likelihood under Ho is L0 = Σ_{j=1}^c nj log πjo, and the maximum log-likelihood in the full parameter space is L1 = Σ_{j=1}^c nj log(nj/n). The dimension of the full parameter space is (c − 1) and that under Ho is 0 [because the parameters are fully specified in Ho].
The LR statistic is G² = −2(L0 − L1) = 2 Σ_{j=1}^c nj log( nj / (nπjo) ), which follows χ²(c − 1) for large sample size.
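Both X² and G² are straightforward to compute; a sketch using scipy, with illustrative counts and null probabilities of my own choosing:

    import numpy as np
    from scipy import stats

    obs = np.array([30, 60, 10])          # observed counts n_j (illustrative)
    pi0 = np.array([0.25, 0.50, 0.25])    # fully specified H0 probabilities
    exp = obs.sum() * pi0                 # expected frequencies n * pi_j0

    x2, p_x2 = stats.chisquare(obs, f_exp=exp)    # Pearson X^2, df = c - 1
    g2 = 2 * np.sum(obs * np.log(obs / exp))      # LR statistic G^2
    p_g2 = stats.chi2.sf(g2, df=len(obs) - 1)
    print(x2, p_x2, g2, p_g2)
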

Testing with Estimated Expected Frequencies
Pearson’s X² compares a sample distribution to a hypothetical one {πjo}. In some applications, the πjo are functions of a smaller set of parameters, say πjo(θ), where dim(θ) = p < c − 1. ML estimates of θ determine ML estimates of πjo(θ) and hence ML estimates of the expected frequencies in X². The degrees of freedom reduce to (c − 1) − p.
An Application: A sample of 156 calves born in a locality was classified according to whether they caught pneumonia within 60 days of birth. Calves that got infected were also classified according to whether they got a secondary infection within two weeks after the first infection cleared up. Calves that did not get a primary infection cannot get a secondary infection; that is, no observations can fall in the category “No” primary infection & “Yes” secondary infection. This combination is called a structural zero.
The data are summarized below:
                          Secondary Infection
                            Yes      No
 Primary       Yes           30      63
 Infection     No           ---      63

A goal was to test whether the probability of primary infection was the same as the conditional
probability of secondary infection given that the calf got the primary infection. Denoting
π11 = Prob. of Yes(for primary infection) & Yes(for secondary infection)
π12 = Prob. of Yes (for primary infection) & No (for secondary infection)
π21 = Prob. of No (for primary infection) & Yes (for secondary infection) = 0 [impossible]
π22 = Prob. of No (for primary infection) & No (for secondary infection)
Thus, the testing problem may be restated as Ho: π11 + π12 = π11 / (π11 + π12), i.e., π11 = (π11 + π12)².

Denote π11 + π12 = π (prob. of primary infection). Thus, under Ho, the probabilities for the yes-yes, yes-no and no-no combinations are π², π(1 − π) and 1 − π. [It may be noted that Ho does not specify any particular numerical value for the πij’s but only gives a reparametrization in terms of a single parameter π.] Let n11, n12, n22 be the numbers of observed yes-yes, yes-no, no-no combinations. The (kernel of the) log-likelihood is
L = n11 log(π²) + n12 log( π(1 − π) ) + n22 log(1 − π).
Differentiating w.r.t. π gives the likelihood equation
2n11/π + n12/π − n12/(1 − π) − n22/(1 − π) = 0.
The MLE is π̂ = (2n11 + n12) / (2n11 + 2n12 + n22).

The expected frequencies are n̂11 = n π̂², n̂12 = n (π̂ − π̂²), n̂22 = n (1 − π̂).
From the given data, we find π̂ = 0.494. The estimated expected frequencies are given in the following table:
                          Secondary Infection
                            Yes      No
 Primary       Yes         38.1    39.0
 Infection     No           ---    78.9
Pearson’s statistic is
X² = (30 − 38.1)²/38.1 + (63 − 39)²/39 + (63 − 78.9)²/78.9 ≈ 19.7.
The smaller parameter space has dimension p = 1 (there is only one parameter, π). The degrees of freedom are (c − 1) − 1 = (3 − 1) − 1 = 1. The ‘p’ value for the observed value 19.7 from χ²(1) is 0.00009. This shows that Ho is strongly rejected.
Thus, it is concluded that the conditional probability of secondary infection, given that primary infection occurred, is ‘significantly different’ from the probability of primary infection. In fact, the expected number (if Ho were true) of calves that would not get a secondary infection among those with a primary infection is only 39, whereas the actual number is 63. This leads us to conclude that primary infection greatly reduces the probability of a secondary infection (immunity is created by the primary infection).
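The numbers in this application are easy to verify; a short sketch (variable names mine):

    import numpy as np
    from scipy import stats

    n11, n12, n22 = 30, 63, 63
    n = n11 + n12 + n22                                     # 156 calves
    pi_hat = (2 * n11 + n12) / (2 * n11 + 2 * n12 + n22)    # MLE of pi, ~0.494

    obs = np.array([n11, n12, n22])
    exp = n * np.array([pi_hat**2, pi_hat * (1 - pi_hat), 1 - pi_hat])

    x2 = np.sum((obs - exp)**2 / exp)                       # ~19.7
    print(pi_hat, exp, x2, stats.chi2.sf(x2, df=1))         # p ~ 0.00009
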
1.6 INFERENCE FOR POISSON PARAMETER
Let Y1, Y2, . . ., Yn be independent Poisson(λ) variables. The MLE is λ̂ = Ȳ.
1.6.1 Tests for Ho: λ = λo
Wald Test
We note that Var(λ̂) = λ/n and SE(λ̂) = √(λ̂/n). The Wald statistic is
ZW = (λ̂ − λo) / √(λ̂/n)  - - - (1.6.1)
LR Test
The log-likelihood function is L = c − nλ + (Σ_{i=1}^n Yi) log λ, so
L0 = c − nλo + (Σ Yi) log λo, L1 = c − nλ̂ + (Σ Yi) log λ̂.
The LR statistic is
−2 log Λ = 2n(λo − λ̂) + 2 (Σ Yi) [ log λ̂ − log λo ]  - - - (1.6.2)
which follows χ²(1).

Score Test

We have u(λ) = ∂L(λ)/∂λ = −n + (Σ Yi)/λ = n(Ȳ − λ)/λ and ∂²L(λ)/∂λ² = −(Σ Yi)/λ², so that
i(λo) = − E[ ∂²L(λ)/∂λ² ] = n/λo.
Thus, ZS = u(λo) / [i(λo)]^½ = √n (Ȳ − λo) / √λo  - - - (1.6.3)
1.6.2 Confidence intervals for λ
A value λo is accepted (plausible) by the Wald test if −zα/2 < (λ̂ − λo)/√(λ̂/n) < zα/2.
Thus, the 100(1 − α)% Wald CI for λ is λ̂ ± zα/2 √(λ̂/n)  - - - (1.6.4)

The LR-based CI is computationally complex. It is the set of λo values for which the LR test has a p-value exceeding α. Equivalently, it is the set of plausible λo values for which
2n(λo − λ̂) + 2 (Σ Yi) [ log λ̂ − log λo ] < χ²α(1).
In other words, it is the set of plausible λo values for which 2(log-likelihood) drops by less than χ²α(1) from its value at the ML estimate λ̂.

The Score CI is the set of λo values for which |ZS| < zα/2. The end-points of the CI are the ‘λo’ solutions to the equation
√n (Ȳ − λo) / √λo = ± zα/2.
Squaring, we get n(Ȳ − λo)² = λo z²α/2, which is a quadratic equation in λo, namely
n λo² − (2nȲ + z²α/2) λo + nȲ² = 0.
The solutions of this equation are
λo = Ȳ + z²α/2/(2n) ± zα/2 √( Ȳ/n + (zα/2/(2n))² )
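These end-points can be computed directly; a closed-form sketch (function name and data mine):

    import numpy as np
    from scipy import stats

    def poisson_score_ci(y, alpha=0.05):
        # Score CI for lambda from the quadratic above (sketch)
        y = np.asarray(y, dtype=float)
        n, ybar = y.size, y.mean()
        z = stats.norm.ppf(1 - alpha / 2)
        centre = ybar + z**2 / (2 * n)
        half = z * np.sqrt(ybar / n + (z / (2 * n))**2)
        return centre - half, centre + half

    print(poisson_score_ci([3, 5, 2, 4, 6, 1, 3]))
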

*******
