Categorical Data

Categorical data consist of counts of observations falling into specified classes.
We can distinguish between various types of categorical data:
• Binary, characterizing the presence or absence of a property;
• Unordered multicategorical (also called "nominal");
• Ordered multicategorical (also called "ordinal");
• Whole numbers.
We represent categorical data in the form of a contingency table.

DOMAINS AND LIMITATIONS
Variables that are essentially continuous can also be presented as categorical variables. One example is "age": although age is a continuous variable, ages can be grouped into classes, so the variable can still be presented as categorical data.

EXAMPLES
In a public opinion survey on approval or disapproval of a new law, the votes cast can be either "yes" or "no". We can represent the results in the form of a contingency table:

          Yes     No
Votes     8546    5455

If we divide up the employees of a business into professions (and at least three professions are represented), the data we obtain are unordered multicategorical data (there is no natural ordering of the professions).
In contrast, if we are interested in the number of people that have achieved various levels of education, there will probably be a natural ordering of the categories: "primary", "secondary" and then "university". Such data are therefore an example of ordered multicategorical data.
Finally, if we group employees into categories based on the size of each employee's family (that is, the number of family members), we obtain categorical data where the categories are whole numbers.

FURTHER READING
Analysis of categorical data
Binary data
Category
Data
Dichotomous variable
Qualitative categorical variable
Random variable

REFERENCES
See analysis of categorical data.

Category

A category represents a set of people or objects that have a common characteristic.
If we want to study the people in a population, we can sort them into "natural" categories, by gender (men and women) for example, or into categories defined by other criteria, such as vocation (managers, secretaries, farmers, ...).

FURTHER READING
Binary data
Categorical data
Dichotomous variable
Population
Random variable
Variable


Cauchy Distribution

A random variable X follows a Cauchy distribution if its density function is of the form:

    f(x) = \frac{1}{\pi\theta} \left[ 1 + \left( \frac{x - \alpha}{\theta} \right)^2 \right]^{-1} , \qquad \theta > 0 .

The parameters α and θ are the location and dispersion parameters, respectively.
The Cauchy distribution is symmetric about x = α, which represents the median. The first quartile and the third quartile are given by α ± θ.
The Cauchy distribution is a continuous probability distribution.

[Figure: Cauchy distribution, θ = 1, α = 0]

MATHEMATICAL ASPECTS
The expected value E[X] and the variance Var(X) do not exist.
If α = 0 and θ = 1, the Cauchy distribution is identical to the Student distribution with one degree of freedom.

DOMAINS AND LIMITATIONS
Its importance in physics is mainly due to the fact that it is the solution to the differential equation describing forced resonance.

FURTHER READING
Continuous probability distribution
Student distribution

REFERENCES
Cauchy, A.L.: Sur les résultats moyens d'observations de même nature, et sur les résultats les plus probables. C.R. Acad. Sci. 37, 198–206 (1853)

Causal Epidemiology

The aim of causal epidemiology is to identify how cause is related to effect with regard to human health.
In other words, it is the study of the causes of illness, and involves attempting to find statistical evidence for a causal relationship or an association between the illness and the factor proposed to cause the illness.

HISTORY
See epidemiology.

MATHEMATICAL ASPECTS
See cause and effect in epidemiology, odds and odds ratio, relative risk, attributable risk, avoidable risk, incidence rate, prevalence rate.

DOMAINS AND LIMITATIONS
Studies of the relationship between tobacco smoking and the development of lung cancer, and of the relationship between HIV and AIDS, are examples of causal epidemiology. Research into a causal relation is often very complex, requiring many studies and the incorporation and combination of various data sets, from biological and animal experiments to clinical trials.
While causes cannot always be identified precisely, a knowledge of the risk factors associated with an illness, and therefore of the groups of people at risk, allows us to intervene with preventative measures that could preserve health.

EXAMPLES
As an example, we can investigate the relationship between smoking and the development of lung cancer. Consider a study of 2000 subjects: 1000 smokers and 1000 nonsmokers. The age distributions and male/female proportions are identical in both groups. Let us analyze a summary of data obtained over many years. The results are presented in the following table:

                                         Smokers   Nonsmokers
Number of cases of lung cancer (/1000)   50        10
Proportion of people in this group
that contracted lung cancer              5%        1%

If we compare the proportion of smokers that contract lung cancer to the proportion of nonsmokers that do, we get 5%/1% = 5, so we can conclude that the risk of developing lung cancer is five times higher in smokers than in nonsmokers. We generally also evaluate the significance of this result by computing a p-value. Suppose that the chi-square test gave p < 0.001. It is normally accepted that if the p-value is smaller than 0.05, then the results obtained are statistically significant.
Suppose that we perform the same study, but the size of each group (smokers and nonsmokers) is 100 instead of 1000, and we observe the same proportions of people that contract lung cancer:

                                        Smokers   Nonsmokers
Number of cases of lung cancer (/100)   5         1
Proportion of individuals in this
group that contracted lung cancer       5%        1%

If we perform the same statistical test, the p-value is found to be 0.212. Since this is greater than 0.05, we cannot draw any solid conclusion about the existence of a significant statistical relation between smoking and lung cancer. This illustrates that, in order to obtain a statistically significant difference between the results for different populations in epidemiological studies, it is usually necessary to study large samples.
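The effect of sample size on the p-value in this example can be reproduced with a standard chi-square test on the two 2×2 tables (cases versus non-cases for smokers and nonsmokers). The sketch below is an illustration only, not part of the original entry; it assumes Python with scipy available, and scipy.stats.chi2_contingency applies Yates' continuity correction to 2×2 tables by default, which gives p-values close to those quoted above.

# Hedged illustration: chi-square test of the smoking example at two sample sizes.
from scipy.stats import chi2_contingency

# Rows: smokers, nonsmokers; columns: lung cancer cases, no lung cancer.
large_groups = [[50, 950], [10, 990]]   # 1000 subjects per group
small_groups = [[5, 95], [1, 99]]       # 100 subjects per group

for name, table in [("n = 1000 per group", large_groups),
                    ("n = 100 per group", small_groups)]:
    chi2, p, dof, expected = chi2_contingency(table)
    print(f"{name}: chi2 = {chi2:.2f}, p = {p:.3f}")
# Approximate output: p < 0.001 for the large study and p close to 0.21 for the
# small one, so only the large study is significant at the 5% level.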

FURTHER READING
See epidemiology.

REFERENCES
See epidemiology.


Cause and Effect in Epidemiology

In epidemiology, the "cause" is an agent (microbial germs, polluted water, smoking, etc.) that modifies health, and the "effect" describes the way that health is changed by the agent. The agent is often potentially pathogenic (in which case it is known as a "risk factor").
The effect is therefore effectively a risk comparison. We can define two different types of risk in this context:
• The absolute effect of a cause expresses the increase in the risk, or the additional number of cases of illness, that results or could result from exposure to this cause. It is measured by the attributable risk and its derivatives.
• The relative effect of a cause expresses the strength of the association between the causal agent and the illness.
A cause that produces an effect by itself is called sufficient.

HISTORY
The terms "cause" and "effect" were defined at the birth of epidemiology, which occurred in the seventeenth century.

MATHEMATICAL ASPECTS
Formally, we have:

    Absolute effect = risk for the exposed − risk for the unexposed .

The absolute effect expresses the excess risk, or cases of illness, that result (or could result) from exposure to the cause.

    Relative effect = (risk for the exposed) / (risk for the unexposed) .

The relative effect expresses the strength of the association between the illness and the cause. It is measured using the relative risk and the odds ratio.

DOMAINS AND LIMITATIONS
Strictly speaking, the strength of an association between a particular factor and an illness is not enough to establish a causal relationship between them. We also need to consider:
• The "temporality criterion" (we must be sure that the supposed cause precedes the effect); and
• Fundamental and experimental research elements that allow us to be sure that the supposed causal factor is not actually a "confusion factor" (which is a factor that is not causal, but is statistically related to the unidentified real causal factor).
Two types of causality correspond to these relative and absolute effects. Relative causality is independent of the clinical or public health impact of the effect; it generally does not allow us to prejudge the clinical or public health impact of the associated effect. It can be strong when the risk for the exposed is very high or when the risk for the unexposed is very low.
Absolute causality expresses the clinical or public health impact of the associated effect, and therefore enables us to answer the question: if we had suppressed the cause, what level of impact on the population (in terms of cases of illness) would have been avoided? If the patient had stopped smoking, what would the reduction in his risk of developing lung cancer or having a myocardial infarction be?
A risk factor associated with a high relative effect, but which concerns only a small number of individuals, will cause fewer illnesses and deaths than a risk factor that is associated with a smaller relative effect but where many more individuals are exposed. It is therefore clear that the importance of a causal relation varies depending on whether we are considering relative or absolute causality.
We should also make an important point here about causal interactions. There can be many causes for the same illness. While all of these causes contribute to the same result, they can also interact. The main consequence of this causal interaction is that we cannot prejudge the effect of simultaneous exposure to causes A and B (denoted A+B) based on what we know about the effect of exposure to only A or only B. In contrast to the case of independent causes, we must estimate the joint effect, and not restrict ourselves to isolated analyses of the interacting causes.

EXAMPLES
The relative effect and the absolute effect are subject to different interpretations, as the following example shows.
Suppose we have two populations P1 and P2, each comprising 100000 individuals. In population P1, the risk of contracting a given illness is 0.2% for the exposed and 0.1% for the unexposed. In population P2, the risk for the exposed is 20% and that for the unexposed is 10%, as shown in the following table:

Popu-    Risk for the       Risk for the
lation   exposed (%) A      unexposed (%) B
P1       0.2                0.1
P2       20                 10

Popu-    Relative effect   Absolute effect     Avoidable cases
lation   A/B               C = A − B (%)       (C% of 100000)
P1       2.0               0.1                 100
P2       2.0               10                  10000

The relative effect is the same for populations P1 and P2 (the ratio of the risk for the exposed to the risk for the unexposed is 2), but the impact of the same prevention measures would be very different in the two populations, because the absolute effect is a hundred times larger in P2: the number of potentially avoidable cases is therefore 100 in population P1 and 10000 in population P2.
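As a quick check of these definitions, the following sketch (not part of the original entry; plain Python, no external libraries) computes the relative effect, the absolute effect and the number of potentially avoidable cases for the two populations above.

# Hedged illustration: relative effect, absolute effect and avoidable cases
# for the two hypothetical populations P1 and P2 described above.
def effects(risk_exposed, risk_unexposed, population_size):
    relative = risk_exposed / risk_unexposed      # strength of the association
    absolute = risk_exposed - risk_unexposed      # excess risk due to exposure
    avoidable = absolute * population_size        # cases avoided if the cause is removed
    return relative, absolute, avoidable

for name, exposed, unexposed in [("P1", 0.002, 0.001), ("P2", 0.20, 0.10)]:
    rel, abs_, cases = effects(exposed, unexposed, 100_000)
    print(f"{name}: relative effect = {rel:.1f}, "
          f"absolute effect = {abs_:.3f}, avoidable cases = {cases:.0f}")
# P1 and P2 share the same relative effect (2.0), but the numbers of avoidable
# cases differ by a factor of 100 (100 versus 10000).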
Now consider the incidence rate of lung cancer in a population of individuals who smoke 35 or more cigarettes per day: 3.15/1000/year. While this rate may seem small, it masks the fact that there is a strong relative effect (the risk is 45 times bigger for smokers than for nonsmokers), due to the fact that lung cancer is very rare in nonsmokers (the incidence rate for nonsmokers is 0.07/1000/year).

FURTHER READING
Attributable risk
Avoidable risk
Incidence rate
Odds and odds ratio
Prevalence rate
Relative risk
REFERENCES
Lilienfeld, A.M., Lilienfeld, D.E.: Foundations of Epidemiology, 2nd edn. Clarendon, Oxford (1980)
MacMahon, B., Pugh, T.F.: Epidemiology: Principles and Methods. Little Brown, Boston, MA (1970)
Morabia, A.: Epidemiologie Causale. Editions Médecine et Hygiène, Geneva (1996)
Rothmann, J.K.: Epidemiology. An Introduction. Oxford University Press (2002)


Census

A census is an operation that consists of observing all of the individuals in a population. The word census can refer to a population census, in other words a population count, but it can also refer to inquiries (called "exhaustive" inquiries) where we retrieve information about a population by observing all of the individuals in the population.
Clearly, such inquiries will be very expensive for very large populations. That is why exhaustive inquiries are rarely performed; sampling, which consists of observing only a portion of the population (called the sample), is usually preferred instead.

HISTORY
Censuses originated with the great civilizations of antiquity, when the large areas of empires and the complexity associated with governing them required knowledge of the populations involved.
Among the most ancient civilizations, it is known that censuses were performed in Sumeria (between 5000 and 2000 BC), where the people involved recorded lists of men and goods on clay tablets in cuneiform characters.
Censuses were also completed in Mesopotamia (about 3000 BC), as well as in ancient Egypt from the first dynasty onwards; these censuses were performed for military and fiscal objectives. Under Amasis II, everybody had to (at the risk of death) declare their profession and source(s) of revenue.
The situation in Israel was more complex: censuses were sometimes compulsory and sometimes forbidden due to the Old Testament. This influence on Christian civilization lasted quite some time; in the Middle Ages, St. Augustin and St. Ambroise were still condemning censuses.
In China, censuses have been performed since at least 200 BC, in different forms and for different purposes. Hecht, J. (1987) reported the main censuses:
1. Han Dynasty (200 years BC to 200 years AD): population censuses were related to the system of conscription.
2. Three Kingdoms Period to Five Dynasties (221–959 AD): related to the system of territorial distribution.
3. Song and Yuan Dynasties (960–1368 AD): censuses were performed for fiscal purposes.
4. Ming Dynasty (1368–1644 AD): "yellow registers" were established for ten-year censuses. They listed the name, profession, gender and age of every person.
5. Qing Dynasty (since 1644 AD): censuses were performed in order to survey population migration.
In Japan during the Middle Ages, different types of census have been used. The first census was probably performed under the rule of Emperor Sujin (86 BC).
Finally, in India, a political and economic science treatise entitled "Arthasastra" (profit treaty) gives some information about the use of an extremely detailed record. This treatise was written by Kautilya, Prime Minister in the reign of Chandragupta Maurya (313–289 BC).
Another very important civilization, the Incans, also used censuses. They used a statistical system called "quipos". Each quipo was both an instrument and a registry of information. Formed from a series of cords, the colors, combinations and knots on the cords had precise meanings. The quipos were passed to specially initiated guards that gathered together all of these statistics.
In Europe, the ancient Greeks and Romans also practiced censuses. Aristotle reported that the Greeks donated a measure of wheat per birth and a measure of barley per death to the goddess Athéna. In Rome, the first census was performed at the behest of King Servius Tullius (578–534 BC) in order to monitor revenues, and consequently raise taxes.
Later, depending on the country, censuses were practiced with different frequencies and on different scales.
In 786, Charlemagne ordered a count of all of his subjects over twelve years old; population counts were also initiated in Italy in the twelfth century; many cities performed censuses of their inhabitants in the fifteenth century, including Nuremberg (in 1449) and Strasbourg (in 1470). In the sixteenth century, France initiated marital status registers.
The seventeenth century saw the development of three different schools of thought: a German school associated with descriptive statistics, a French school associated with census ideology and methodology, and an English school that led to modern statistics.
In the history of censuses in Europe, one country occupies a special place. In 1665, Sweden initiated registers of parishioners that were maintained by pastors; in 1668, a decree made it obligatory to be counted in these registers, and instead of being religious, the registers became administrative. 1736 saw the appearance of another decree, stating that the governor of each province had to report any changes in the population of the province to parliament. Swedish population statistics were officially recognized on the 3rd of February 1748 with the creation of the "Tabellverket" (administrative tables). The first summary for all Swedish provinces, realized in 1749, can be considered to be the first proper census in Europe, and the 11th of November 1756 marked the creation of the "Tabellkommissionen," the first official Division of Statistics. Since 1762, these tables of figures have been maintained by the Academy of Sciences.
Initially, the Swedish censuses were organized annually (1749–1751), and then every three years (1754–1772), but since 1775 they have been conducted every five years.
At the end of the eighteenth century an official institute for censuses was created in France (in 1791). In 1787 the principle of the census was registered in the Constitutional Charter of the USA (C. C. USA).
See also official statistics.

FURTHER READING
Data collection
Demography
Official statistics
Population
Sample
Survey

REFERENCES
Hecht, J.: L'idée du dénombrement jusqu'à la Révolution. Pour une histoire de la statistique, tome 1, pp. 21–82. Economica/INSEE (1978)

Central Limit Theorem

The central limit theorem is a fundamental theorem of statistics. In its simplest form, it states that the sum of a sufficiently large number of independent identically distributed random variables approximately follows a normal distribution.

HISTORY
The central limit theorem was first established within the framework of the binomial distribution by Moivre, Abraham de (1733). Laplace, Pierre Simon de (1810) formulated the proof of the theorem. Poisson, Siméon Denis (1824) also worked on this theorem, and Chebyshev, Pafnutii Lvovich (1890–1891) gave a rigorous demonstration of it at the end of the nineteenth century.
At the beginning of the twentieth century, the Russian mathematician Liapounov, Aleksandr Mikhailovich (1901) created the generally recognized form of the central limit theorem using characteristic functions. Markov, Andrei Andreevich (1908) also worked on it and was the first to generalize the theorem to the case of dependent variables.
According to Le Cam, L. (1986), the qualifier "central" was given to it by George Polyà (1920) due to the essential role that it plays in probability theory.

MATHEMATICAL ASPECTS
Let X1, X2, ..., Xn be n independent random variables that are identically distributed (with any distribution) with mean μ and finite variance σ².
We define the sum Sn = X1 + X2 + ... + Xn and form the ratio

    \frac{S_n - n\mu}{\sigma\sqrt{n}} ,

where nμ and σ√n represent the mean and the standard deviation of Sn, respectively.
The central limit theorem establishes that the distribution of this ratio tends to the standard normal distribution when n tends to infinity. This means that:

    P\left( \frac{S_n - n\mu}{\sigma\sqrt{n}} \le x \right) \xrightarrow[n \to +\infty]{} \Phi(x) ,

where Φ(x) is the distribution function of the standard normal distribution, expressed by:

    \Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{t^2}{2} \right) dt , \qquad -\infty < x < \infty .
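A quick way to see the theorem at work is to simulate the standardized sum for a non-normal distribution and compare its distribution function with Φ. The sketch below is an illustration only (it assumes Python with numpy and scipy available); the exponential distribution and the evaluation point x = 1 are arbitrary choices made here, not part of the original entry.

# Hedged illustration: the standardized sum of n i.i.d. exponential variables
# is compared with the standard normal distribution function Phi.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, replications = 50, 100_000
mu, sigma = 1.0, 1.0                      # mean and std. dev. of an Exp(1) variable

samples = rng.exponential(scale=1.0, size=(replications, n))
standardized = (samples.sum(axis=1) - n * mu) / (sigma * np.sqrt(n))

x = 1.0
print("empirical P(ratio <= 1):", np.mean(standardized <= x))  # close to ...
print("Phi(1):                 ", norm.cdf(x))                  # ... 0.8413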

DOMAINS AND LIMITATIONS
The central limit theorem provides a simple method of approximately calculating probabilities related to sums of random variables.
Besides its interest in relation to sampling theory, where sums and means play an important role, the central limit theorem is used to obtain normal approximations to distributions that arise as sums of identically distributed variables. With the help of the central limit theorem we can, for example, use the normal distribution to approximate the binomial distribution, the Poisson distribution, the gamma distribution, the chi-square distribution, the Student distribution, the hypergeometric distribution, the Fisher distribution and the lognormal distribution.

EXAMPLES
In a large batch of electrical items, the probability of choosing a defective item equals p = 1/8. What is the probability that more than 4 defective items are chosen when 25 items are selected?
Let the dichotomous variable X be the result of a trial:

    X = 1 if the selected item is defective; X = 0 if it is not.

The random variable X follows a Bernoulli distribution with parameter p. Consequently, the sum Sn = X1 + X2 + ... + Xn follows a binomial distribution with mean np and variance np(1 − p), which, following the central limit theorem, can be approximated by a normal distribution with mean μ = np and variance σ² = np(1 − p).
We evaluate these values:

    μ = n·p = 25 · (1/8) = 3.125
    σ² = n·p·(1 − p) = 25 · (1/8) · (1 − 1/8) = 2.734 .

We then calculate P(Sn > 4) in two different ways:
1. With the binomial distribution: from the binomial table, the probability P(Sn ≤ 4) = 0.8047. The probability P(Sn > 4) is then:

    P(Sn > 4) = 1 − P(Sn ≤ 4) = 0.1953 .

2. With the normal approximation (obtained from the central limit theorem): in order to account for the discrete character of the random variable Sn, we must make a continuity correction; that is, we calculate the probability that Sn is greater than 4 + 1/2 = 4.5.
We have:

    z = \frac{S_n - np}{\sqrt{np(1-p)}} = \frac{4.5 - 3.125}{1.654} = 0.832 .

From the normal table, we obtain the probability:

    P(Z > z) = P(Z > 0.832) = 1 − P(Z ≤ 0.832) = 1 − 0.7967 = 0.2033 .
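The two numbers above can be checked directly. The following sketch is an illustration only, not part of the original entry; it assumes scipy is available and computes the exact binomial tail probability and the continuity-corrected normal approximation.

# Hedged illustration: exact binomial probability versus the normal
# approximation with continuity correction, as in the example above.
from math import sqrt
from scipy.stats import binom, norm

n, p = 25, 1 / 8
mu, var = n * p, n * p * (1 - p)

exact = binom.sf(4, n, p)                       # P(Sn > 4) = 1 - P(Sn <= 4), ~0.195
approx = norm.sf(4.5, loc=mu, scale=sqrt(var))  # continuity-corrected approximation, ~0.203

print(f"exact binomial:       {exact:.4f}")
print(f"normal approximation: {approx:.4f}")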

FURTHER READING
Binomial distribution
Chi-square distribution
Convergence
Convergence theorem
Fisher distribution
Gamma distribution
Hypergeometric distribution
Law of large numbers
Lognormal distribution
Normal distribution
Poisson distribution
Probability
Probability distribution
Student distribution

REFERENCES
Laplace, P.S. de: Mémoire sur les approximations des formules qui sont fonctions de très grands nombres et sur leur application aux probabilités. Mémoires de l'Académie Royale des Sciences de Paris, 10. Reproduced in: Œuvres de Laplace 12, 301–347 (1810)
Le Cam, L.: The Central Limit Theorem around 1935. Stat. Sci. 1, 78–96 (1986)
Liapounov, A.M.: Sur une proposition de la théorie des probabilités. Bulletin de l'Académie Impériale des Sciences de St.-Pétersbourg 8, 1–24 (1900)
Markov, A.A.: Extension des théorèmes limites du calcul des probabilités aux sommes des quantités liées en chaîne. Mem. Acad. Sci. St. Petersburg 8, 365–397 (1908)
Moivre, A. de: Approximatio ad summam terminorum binomii (a + b)^n, in seriem expansi. Supplementum II to Miscellanae Analytica, pp. 1–7 (1733). Photographically reprinted in a rare pamphlet on Moivre and some of his discoveries. Published by Archibald, R.C. Isis 8, 671–683 (1926)
Poisson, S.D.: Sur la probabilité des résultats moyens des observations. Connaissance des temps pour l'an 1827, pp. 273–302 (1824)
Polyà, G.: Ueber den zentralen Grenzwertsatz der Wahrscheinlichkeitsrechnung und das Momentproblem. Mathematische Zeitschrift 8, 171–181 (1920)
Tchebychev, P.L.: Sur deux théorèmes relatifs aux probabilités. Acta Math. 14, 305–315 (1890–1891)


Chebyshev, Pafnutii Lvovich

Chebyshev, Pafnutii Lvovich (1821–1894) began studying at Moscow University in 1837, where he was influenced by Zernov, Nikolai Efimovich (the first Russian to obtain a doctorate in mathematical sciences) and Brashman, Nikolai Dmitrievich. After gaining his degree he could not find any teaching work in Moscow, so he went to St. Petersbourg, where he organized conferences on algebra and probability theory. In 1859, he took the probability course given by Buniakovsky, Viktor Yakovlevich at St. Petersbourg University.
His name lives on through the Chebyshev inequality (also known as the Bienaymé–Chebyshev inequality), which he proved. This was published in French just after Bienaymé, Irénée-Jules had an article published on the same topic in the Journal de Mathématiques Pures et Appliquées (also called the Journal of Liouville).
He initiated rigorous work into establishing a general version of the central limit theorem and is considered to be the founder of the mathematical school of St. Petersbourg.

Some principal works and articles of Chebyshev, Pafnutii Lvovich:

1845   An Essay on Elementary Analysis of the Theory of Probabilities (thesis). Crelle's Journal.
1867   Preuve de l'inégalité de Tchebychev. J. Math. Pure. Appl., 12, 177–184.

FURTHER READING
Central limit theorem

REFERENCES
Heyde, C.E., Seneta, E.: I.J. Bienaymé. Statistical Theory Anticipated. Springer, Berlin Heidelberg New York (1977)


Chebyshev's Inequality

See law of large numbers.

Chi-Square Distance

Consider a frequency table with n rows and p columns; it is possible to calculate row profiles and column profiles. Let us then plot the n or p points corresponding to these profiles. We can define distances between these points. The Euclidean distance between the components of the profiles, on which a weighting is defined (each term has a weight that is the inverse of its frequency), is called the chi-square distance. The name of the distance is derived from the fact that the mathematical expression defining the distance is identical to that encountered in the elaboration of the chi-square goodness of fit test.

MATHEMATICAL ASPECTS
Let fij be the frequency of the ith row and jth column in a frequency table with n rows and p columns. The chi-square distance between two rows i and i′ is given by the formula:

    d(i, i') = \sqrt{ \sum_{j=1}^{p} \left( \frac{f_{ij}}{f_{i.}} - \frac{f_{i'j}}{f_{i'.}} \right)^2 \cdot \frac{1}{f_{.j}} } ,

where
f_{i.} is the sum of the components of the ith row;
f_{.j} is the sum of the components of the jth column;
[f_{ij}/f_{i.}] is the ith row profile, for j = 1, 2, ..., p.
Likewise, the distance between two columns j and j′ is given by:

    d(j, j') = \sqrt{ \sum_{i=1}^{n} \left( \frac{f_{ij}}{f_{.j}} - \frac{f_{ij'}}{f_{.j'}} \right)^2 \cdot \frac{1}{f_{i.}} } ,

where [f_{ij}/f_{.j}] is the jth column profile, for i = 1, ..., n.

DOMAINS AND LIMITATIONS
The chi-square distance incorporates a weight that is inversely proportional to the total of each row (or column), which increases the importance of small deviations in the rows (or columns) that have a small sum relative to those with a larger sum.
The chi-square distance has the property of distributional equivalence, meaning that it ensures that the distances between rows and columns are invariant when two columns (or two rows) with identical profiles are aggregated.

EXAMPLES
Consider a contingency table charting how satisfied employees working for three different businesses are. Let us establish a distance table using the chi-square distance.
Values for the studied variable X can fall into one of three categories:
• X1: high satisfaction;
• X2: medium satisfaction;
• X3: low satisfaction.
The observations collected from samples of individuals from the three businesses are given below:

        Business 1   Business 2   Business 3   Total
X1      20           55           30           105
X2      18           40           15           73
X3      12           5            5            22
Total   50           100          50           200

The relative frequency table is obtained by dividing all of the elements of the table by 200, the total number of observations:

        Business 1   Business 2   Business 3   Total
X1      0.1          0.275        0.15         0.525
X2      0.09         0.2          0.075        0.365
X3      0.06         0.025        0.025        0.11
Total   0.25         0.5          0.25         1

We can calculate the difference in employee satisfaction between the three businesses. The column profile matrix is given below:

        Business 1   Business 2   Business 3   Total
X1      0.4          0.55         0.6          1.55
X2      0.36         0.4          0.3          1.06
X3      0.24         0.05         0.1          0.39
Total   1            1            1            3

This allows us to calculate the distances between the different columns:

    d²(1, 2) = (1/0.525)·(0.4 − 0.55)² + (1/0.365)·(0.36 − 0.4)² + (1/0.11)·(0.24 − 0.05)²
             = 0.375423
    d(1, 2)  = 0.613

We can calculate d(1, 3) and d(2, 3) in a similar way. The distances obtained are summarized in the following distance table:

             Business 1   Business 2   Business 3
Business 1   0            0.613        0.514
Business 2   0.613        0            0.234
Business 3   0.514        0.234        0

We can also calculate the distances between the rows, in other words the difference in the distribution of employee satisfaction; to do this we need the row profile table:

        Business 1   Business 2   Business 3   Total
X1      0.19         0.524        0.286        1
X2      0.246        0.548        0.206        1
X3      0.546        0.227        0.227        1
Total   0.982        1.299        0.719        3

This allows us to calculate the distances between the different rows:

    d²(1, 2) = (1/0.25)·(0.19 − 0.246)² + (1/0.5)·(0.524 − 0.548)² + (1/0.25)·(0.286 − 0.206)²
             = 0.039296
    d(1, 2)  = 0.198

We can calculate d(1, 3) and d(2, 3) in a similar way. The differences between the degrees of employee satisfaction are finally summarized in the following distance table:

      X1      X2      X3
X1    0       0.198   0.835
X2    0.198   0       0.754
X3    0.835   0.754   0
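The distances in these tables follow directly from the formulas in the MATHEMATICAL ASPECTS section. The sketch below is an illustration, not part of the original entry; it assumes Python with numpy available and recomputes the row and column chi-square distances for this contingency table.

# Hedged illustration: chi-square distances between rows and between columns
# of the employee-satisfaction contingency table used in the example.
import numpy as np

counts = np.array([[20, 55, 30],
                   [18, 40, 15],
                   [12, 5, 5]], dtype=float)
f = counts / counts.sum()          # relative frequencies f_ij
fi = f.sum(axis=1)                 # row sums f_i.
fj = f.sum(axis=0)                 # column sums f_.j

def chi2_dist_rows(i, k):
    diff = f[i] / fi[i] - f[k] / fi[k]          # difference of row profiles
    return np.sqrt(np.sum(diff ** 2 / fj))

def chi2_dist_cols(j, k):
    diff = f[:, j] / fj[j] - f[:, k] / fj[k]    # difference of column profiles
    return np.sqrt(np.sum(diff ** 2 / fi))

print(round(chi2_dist_rows(0, 1), 3))   # ~0.198 (X1 vs X2)
print(round(chi2_dist_cols(0, 1), 3))   # ~0.613 (Business 1 vs Business 2)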

FURTHER READING
Contingency table
Distance
Distance table
Frequency


Chi-square Distribution

A random variable X follows a chi-square distribution with n degrees of freedom if its density function is:

    f(x) = \frac{x^{\frac{n}{2}-1} \exp\left(-\frac{x}{2}\right)}{2^{\frac{n}{2}} \, \Gamma\left(\frac{n}{2}\right)} , \qquad x \ge 0 ,

where Γ is the gamma function (see gamma distribution).

[Figure: χ² distribution, ν = 12]

The chi-square distribution is a continuous probability distribution.

HISTORY
According to Sheynin (1977), the chi-square distribution was discovered by Ernst Karl Abbe in 1863. Maxwell obtained it for three degrees of freedom a few years before (1860), and Boltzmann discovered the general case in 1881.
However, according to Lancaster (1966), Bienaymé obtained the chi-square distribution in 1838 as the limit of the discrete random variable

    \sum_{i=1}^{k} \frac{(N_i - np_i)^2}{np_i} ,

if (N1, N2, ..., Nk) follow a joint multinomial distribution with parameters n, p1, p2, ..., pk.
Ellis demonstrated in 1844 that the sum of k random variables distributed according to a chi-square distribution with two degrees of freedom follows a chi-square distribution with 2k degrees of freedom. The general result was demonstrated in 1852 by Bienaymé.
The works of Pearson, Karl are very important in this field. In 1900 he used the chi-square distribution to approximate the chi-square statistic used in different tests based on contingency tables.

MATHEMATICAL ASPECTS
The chi-square distribution appears in the theory of random variables distributed according to a normal distribution. Specifically, it is the distribution of the sum of squares of normal, centered and reduced random variables (with a mean equal to 0 and a variance equal to 1).
Consider Z1, Z2, ..., Zn, n independent, standard normal random variables. Their sum of squares

    X = Z_1^2 + Z_2^2 + \ldots + Z_n^2 = \sum_{i=1}^{n} Z_i^2

is a random variable distributed according to a chi-square distribution with n degrees of freedom.
The expected value of the chi-square distribution is given by:

    E[X] = n .

The variance is equal to:

    Var(X) = 2n .

The chi-square distribution is related to other continuous probability distributions:
• The chi-square distribution is a particular case of the gamma distribution.
• If two independent random variables X1 and X2 follow chi-square distributions with, respectively, n1 and n2 degrees of freedom, then the random variable

    Y = \frac{X_1 / n_1}{X_2 / n_2}

follows a Fisher distribution with n1 and n2 degrees of freedom.
• When the number of degrees of freedom n tends towards infinity, the chi-square distribution tends (relatively slowly) towards a normal distribution.
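The characterization as a sum of squared standard normal variables is easy to check numerically. The sketch below is an illustration only (it assumes numpy and scipy are available); it compares the simulated mean and variance with n and 2n, and with scipy's chi2 distribution.

# Hedged illustration: a chi-square variable with n degrees of freedom as the
# sum of squares of n independent standard normal variables.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(42)
n, replications = 12, 200_000

z = rng.standard_normal(size=(replications, n))
x = (z ** 2).sum(axis=1)                  # simulated chi-square(n) sample

print("simulated mean:", x.mean(), " theoretical:", n)       # E[X] = n
print("simulated var: ", x.var(),  " theoretical:", 2 * n)    # Var(X) = 2n
print("scipy chi2 mean/var:", chi2.mean(n), chi2.var(n))      # same values from scipy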

DOMAINS AND LIMITATIONS
The chi-square distribution is used in many approaches to hypothesis testing, the most important being the goodness of fit test, which involves comparing the observed frequencies and the hypothetical frequencies of specific classes.
It is also used for comparisons between the observed variance and the hypothetical variance of normally distributed samples, and to test the independence of two variables.

FURTHER READING
Chi-square goodness of fit test
Chi-square table
Chi-square test
Continuous probability distribution
Fisher distribution
Gamma distribution
Normal distribution

REFERENCES
Lancaster, H.O.: Forerunners of the Pearson chi-square. Aust. J. Stat. 8, 117–126 (1966)
Pearson, K.: On the criterion, that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. In: Karl Pearson's Early Statistical Papers. Cambridge University Press, pp. 339–357 (1948). First published in 1900 in Philos. Mag. (5th Ser) 50, 157–175
Sheynin, O.B.: On the history of some statistical laws of distribution. In: Kendall, M., Plackett, R.L. (eds.) Studies in the History of Statistics and Probability, vol. II. Griffin, London (1977)


Chi-square Goodness of Fit Test

The chi-square goodness of fit test is, along with the Kolmogorov–Smirnov test, one of the most commonly used goodness of fit tests.
This test aims to determine whether it is possible to approximate an observed distribution by a particular probability distribution (normal distribution, Poisson distribution, etc.).

HISTORY
The chi-square goodness of fit test is the oldest and most well-known of the goodness of fit tests. It was first presented in 1900 by Pearson, Karl.

MATHEMATICAL ASPECTS
Let X1, ..., Xn be a sample of n observations. The steps used to perform the chi-square goodness of fit test are as follows:
1. State the hypothesis. The null hypothesis will take the following form:

    H0 : F = F0 ,

where F0 is the presumed distribution function of the distribution.
2. Distribute the observations into k disjoint classes:

    [a_{i-1}, a_i] .

We denote the number of observations contained in the ith class, i = 1, ..., k, by ni.
3. Calculate the theoretical probabilities for every class on the basis of the presumed distribution function F0:

    p_i = F_0(a_i) - F_0(a_{i-1}) , \qquad i = 1, \ldots, k .
4. Obtain the expected frequencies for every class:

    e_i = n \cdot p_i , \qquad i = 1, \ldots, k ,

where n is the size of the sample.
5. Calculate the χ² (chi-square) statistic:

    \chi^2 = \sum_{i=1}^{k} \frac{(n_i - e_i)^2}{e_i} .

If H0 is true, the χ² statistic follows a chi-square distribution with v degrees of freedom, where:

    v = k − 1 − (number of estimated parameters) .

For example, when testing the goodness of fit to a normal distribution, the number of degrees of freedom equals:
• k − 1 if the mean μ and the standard deviation σ of the population are known;
• k − 2 if one of μ or σ is unknown and will be estimated in order to proceed with the test;
• k − 3 if both parameters μ and σ are unknown and both are estimated from the corresponding values of the sample.
6. Reject H0 if the deviation between the observed and expected frequencies is large; that is, if

    \chi^2 > \chi^2_{v,\alpha} ,

where χ²_{v,α} is the value given in the chi-square table for a particular significance level α.

DOMAINS AND LIMITATIONS
To apply the chi-square goodness of fit test, it is important that n is big enough and that the expected frequencies ei are not too small. We normally state that the expected frequencies must be greater than 5, except for extreme classes, where they can be smaller than 5 but greater than 1. If this constraint is not satisfied, we must regroup the classes in order to satisfy this rule.

EXAMPLES
Goodness of Fit to the Binomial Distribution
We throw a coin four times and count the number of times that "heads" appears. This experiment is performed 160 times. The observed frequencies are as follows:

Number of "heads" xi   Number of experiments (ni)
0                      17
1                      52
2                      54
3                      31
4                      6
Total                  160

1. If the experiment was performed correctly and the coin is not forged, the distribution of the number of "heads" obtained should follow the binomial distribution. We therefore state the null hypothesis that the observed distribution can be approximated by the binomial distribution, and we proceed with a goodness of fit test in order to determine whether this hypothesis can be accepted or not.
2. In this example, the different numbers of "heads" that can be obtained per experiment (0, 1, 2, 3 and 4) are each considered to be a class.
3. The random variable X (number of "heads" obtained after four throws of a coin) follows a binomial distribution if

    P(X = x) = C_n^x \cdot p^x \cdot q^{n-x} ,

where:
n is the number of independent trials = 4;
p is the probability of a success ("heads") = 0.5;
q is the probability of a failure ("tails") = 0.5;
C_n^x is the number of combinations of x objects from n.
We then have the following theoretical probabilities for four throws:

    P(X = 0) = 1/16
    P(X = 1) = 4/16
    P(X = 2) = 6/16
    P(X = 3) = 4/16
    P(X = 4) = 1/16

4. After the experiment has been performed 160 times, the expected frequency for each possible value of X is given by:

    e_i = 160 \cdot P(X = x_i) .

We obtain the following table:

Number of      Observed         Expected
"heads" xi     frequency (ni)   frequency (ei)
0              17               10
1              52               40
2              54               60
3              31               40
4              6                10
Total          160              160

5. The χ² (chi-square) statistic is then:

    \chi^2 = \sum_{i=1}^{k} \frac{(n_i - e_i)^2}{e_i} ,

where k is the number of possible values of X:

    \chi^2 = \frac{(17 - 10)^2}{10} + \ldots + \frac{(6 - 10)^2}{10} = 12.725 .

6. Choosing a significance level α of 5%, we find that the value of χ²_{v,α} for k − 1 = 4 degrees of freedom is:

    \chi^2_{4, 0.05} = 9.488 .

Since the calculated value of χ² is greater than the value obtained from the table, we reject the null hypothesis and conclude that the binomial distribution does not give a good approximation to our observed distribution. We can then conclude that the coin was probably forged, or that it was not correctly thrown.
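This computation can be reproduced with a single library call. The sketch below is an illustration only, not part of the original entry; it assumes scipy is available and applies scipy.stats.chisquare to the observed and expected frequencies above.

# Hedged illustration: chi-square goodness of fit for the coin-throwing example.
from scipy.stats import chisquare, chi2

observed = [17, 52, 54, 31, 6]
expected = [10, 40, 60, 40, 10]          # 160 * P(X = x) under the binomial model

stat, p_value = chisquare(f_obs=observed, f_exp=expected)   # dof = k - 1 = 4
print(f"chi2 = {stat:.3f}, p-value = {p_value:.4f}")        # chi2 = 12.725, p ~ 0.013
print("critical value at 5%:", round(chi2.ppf(0.95, df=4), 3))  # ~9.488

# p < 0.05 (equivalently, 12.725 > 9.488), so the null hypothesis is rejected,
# as in the text.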

Goodness of Fit to the Normal Distribution
The diameters of cables produced by a factory were studied. A frequency table of the observed distribution of diameters is given below:

Cable diameter (in mm)   Observed frequency ni
19.70–19.80              5
19.80–19.90              12
19.90–20.00              35
20.00–20.10              42
20.10–20.20              28
20.20–20.30              14
20.30–20.40              4
Total                    140
1. We perform a goodness of fit test for a normal distribution. The null hypothesis is therefore that the observed distribution can be approximated by a normal distribution.
2. The previous table shows the observed diameters divided up into classes.
3. If the random variable X (cable diameter) follows the normal distribution, the random variable

    Z = \frac{X - \mu}{\sigma}

follows a standard normal distribution. The mean μ and the standard deviation σ of the population are unknown and are estimated using the mean x̄ and the standard deviation S of the sample:

    \bar{x} = \frac{\sum_{i=1}^{7} \delta_i \cdot n_i}{n} = \frac{2806.40}{140} = 20.05 ,

    S = \sqrt{ \frac{\sum_{i=1}^{7} n_i \cdot (\delta_i - \bar{x})^2}{n - 1} } = \sqrt{ \frac{2.477}{139} } = 0.134 ,

where the δi are the centers of the classes (in this example, the mean diameter of a class; for i = 1: δ1 = 19.75) and n is the total number of observations.
We can then calculate the theoretical probabilities associated with each class. The detailed calculations for the first two classes are:

    p_1 = P(X \le 19.8) = P\left( Z \le \frac{19.8 - \bar{x}}{S} \right) = P(Z \le -1.835) = 1 - P(Z \le 1.835)

    p_2 = P(19.8 \le X \le 19.9) = P\left( \frac{19.8 - \bar{x}}{S} \le Z \le \frac{19.9 - \bar{x}}{S} \right)
        = P(-1.835 \le Z \le -1.089) = P(Z \le 1.835) - P(Z \le 1.089)

These probabilities can be found by consulting the normal table. We get:

    p1 = P(X ≤ 19.8)          = 0.03325
    p2 = P(19.8 ≤ X ≤ 19.9)   = 0.10476
    p3 = P(19.9 ≤ X ≤ 20.0)   = 0.22767
    p4 = P(20.0 ≤ X ≤ 20.1)   = 0.29083
    p5 = P(20.1 ≤ X ≤ 20.2)   = 0.21825
    p6 = P(20.2 ≤ X ≤ 20.3)   = 0.09622
    p7 = P(X > 20.3)          = 0.02902

4. The expected frequencies for the classes are then given by:

    e_i = n \cdot p_i ,

which yields the following table:

Cable diameter   Observed        Expected
(in mm)          frequency ni    frequency ei
19.70–19.80      5               4.655
19.80–19.90      12              14.666
19.90–20.00      35              31.874
20.00–20.10      42              40.716
20.10–20.20      28              30.555
20.20–20.30      14              13.471
20.30–20.40      4               4.063
Total            140             140

5. The χ² (chi-square) statistic is then:

    \chi^2 = \sum_{i=1}^{k} \frac{(n_i - e_i)^2}{e_i} ,

where k = 7 is the number of classes:

    \chi^2 = \frac{(5 - 4.655)^2}{4.655} + \frac{(12 - 14.666)^2}{14.666} + \ldots + \frac{(4 - 4.063)^2}{4.063} = 1.0927 .

6. Choosing a significance level α = 5%, we find that the value of χ²_{v,α} with k − 3 = 7 − 3 = 4 degrees of freedom in the chi-square table is:

    \chi^2_{4, 0.05} = 9.49 .

Since the value calculated for χ² is smaller than the value obtained from the chi-square table, we do not reject the null hypothesis, and we conclude that the difference between the observed distribution and the normal distribution is not significant at a significance level of 5%.
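For this example too, the class probabilities and the test statistic can be recomputed. The sketch below is an illustration only, not part of the original entry; it assumes scipy is available and re-estimates x̄ and S from the grouped data, so the resulting figures are close to, but not exactly, those given above.

# Hedged illustration: normal goodness of fit for the cable-diameter data,
# recomputing the class probabilities from the estimated mean and standard deviation.
import numpy as np
from scipy.stats import norm, chi2

edges = np.array([19.7, 19.8, 19.9, 20.0, 20.1, 20.2, 20.3, 20.4])
observed = np.array([5, 12, 35, 42, 28, 14, 4])
n = observed.sum()                                # 140
centers = (edges[:-1] + edges[1:]) / 2            # class centers 19.75, ..., 20.35

mean = (centers * observed).sum() / n             # ~20.046
s = np.sqrt((observed * (centers - mean) ** 2).sum() / (n - 1))   # ~0.134

cdf = norm.cdf(edges, loc=mean, scale=s)
probs = np.diff(cdf)
probs[0] += cdf[0]                                # fold the open lower tail into the first class
probs[-1] += 1 - cdf[-1]                          # fold the open upper tail into the last class
expected = n * probs

stat = ((observed - expected) ** 2 / expected).sum()
print(round(stat, 2), "vs critical value", round(chi2.ppf(0.95, df=len(observed) - 3), 2))
# roughly 1.1 < 9.49, so the normal model is not rejected, as in the text.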

FURTHER READING
Chi-square distribution
Chi-square table
Goodness of fit test
Hypothesis testing
Kolmogorov–Smirnov test

REFERENCE
Pearson, K.: On the criterion, that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. In: Karl Pearson's Early Statistical Papers. Cambridge University Press, pp. 339–357 (1948). First published in 1900 in Philos. Mag. (5th Ser) 50, 157–175


Chi-Square Table

The chi-square table gives the values obtained from the distribution function of a random variable that follows a chi-square distribution.

HISTORY
One of the first chi-square tables was published in 1902, by Elderton. It contains distribution function values that are given to six decimal places.
In 1922, Pearson, Karl reported a table of values for the incomplete gamma function, down to seven decimals.

MATHEMATICAL ASPECTS
Let X be a random variable that follows a chi-square distribution with v degrees of freedom. The density function of the random variable X is given by:

    f(t) = \frac{t^{\frac{v}{2}-1} \exp\left(-\frac{t}{2}\right)}{2^{\frac{v}{2}} \, \Gamma\left(\frac{v}{2}\right)} , \qquad t \ge 0 ,

where Γ represents the gamma function (see gamma distribution).
The distribution function of the random variable X is defined by:

    F(x) = P(X \le x) = \int_0^x f(t) \, dt .

The chi-square table gives the values of the distribution function F(x) for different values of v.
We often use the chi-square table in the opposite way, to find the value of x that corresponds to a given probability. We generally denote by χ²_{v,α} the value of the random variable X for which

    P(X \le \chi^2_{v,\alpha}) = 1 - \alpha .

Note: the notation χ² is read "chi-square".

EXAMPLES
See Appendix F.
The chi-square table allows us, for a given number of degrees of freedom v, to determine:
1. The value of the distribution function F(x), given x.
2. The value of χ²_{v,α}, given the probability P(X ≤ χ²_{v,α}).
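In software, both uses of the table correspond to the distribution function and its inverse (the quantile function). The sketch below is an illustration only, not part of the original entry; it assumes scipy is available.

# Hedged illustration: reading the "chi-square table" from scipy instead.
from scipy.stats import chi2

v = 4                                   # degrees of freedom

# 1. Distribution function value F(x) for a given x.
print(chi2.cdf(9.488, df=v))            # ~0.95

# 2. The value chi2_{v,alpha} such that P(X <= chi2_{v,alpha}) = 1 - alpha.
alpha = 0.05
print(chi2.ppf(1 - alpha, df=v))        # ~9.488, the value used in the examples above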
• The chi-square test is used to test for
FURTHER READING homogeneity of the variances calculated
 Chi-square distribution for many samples drawn from a normally
 Chi-square goodness of fit test distributed population.
 Chi-square test
 Chi-square test of independence HISTORY
 Statistical table In 1937, Bartlett, M.S. proposed a method of
testing the homogeneity of the variance for
REFERENCES many samples drawn from a normally dis-
Elderton, W.P.: Tables for testing the good- tributed population.
ness of fit of theory to observation. See also chi-square test of independence
Biometrika 1, 155–163 (1902) and chi-square goodness of fit test.
Pearson, K.: Tables of the Incomplete -
MATHEMATICAL ASPECTS
function. H.M. Stationery Office (Cam-
The mathematical aspects of the chi-square
bridge University Press, Cambridge since
test of independence and those of the chi-
1934), London (1922)
square goodness of fit test are dealt with in
their corresponding entries.
Chi-Square Test The chi-square test used to check whether
an unknown variance takes a particular con-
There are a number of chi-square tests, all of stant value is the following:
which involve comparing the test results to Let (x1 , . . . , xn ) be a random sample com-
thevaluesfrom thechi-square distribution. ing from a normally distributed population
The most well-known of these tests are intro- of unknown mean μ and of unknown vari-
duced below: ance σ 2 .
• The chi-square test of independence is We have good reason to believe that the vari-
used to determine whether two qualita- ance of the population equals a presumed
tive categorical variables associated with value σ 2 . The hypotheses for each case are
0
a sample are independent. described below.
• The chi-square goodness of fit test is
A: Two-sided case:
used to determinewhether thedistribution
observed for a sample can be approxi- H0 : σ 2 = σ02
mated by a theoretical distribution. We H1 : σ 2 = σ02
78 Chi-Square Test

B: One-sided test: reject the null hypothesis H0 for the alterna-


tive hypothesis H1 .
H0 : σ 2 ≤ σ02
Other chi-square tests are proposed in the
H1 : σ 2 > σ02 work of Ostle, B. (1963).
C: One-sided test:
H0 : σ 2 ≥ σ02 DOMAINS AND LIMITATIONS
χ 2 (chi-square) statistic must be calculated
H1 : σ 2 < σ02
using absolute frequencies and not relative
We then determine the statistic of the given ones.
chi-square test using: Note that the chi-square test can beunreliable
n for small samples, especially when some of
(xi − x̄)2 the estimated frequencies are small (< 5).
i=1 This issue can often be resolved by group-
χ =
2
.
σ02 ing categories together, if such grouping is
sensible.
This statistic is, under H0 , chi-square dis-
tributed with n − 1 degrees of freedom. In
other words, we look for the value of χn−1,α
2 EXAMPLES
in the chi-square table, and we then com- Consider a batch of items produced by
pare that value to the calculated value χ 2 . a machine. They can be divided up into
The decision rules depend on the case, and classes depending on their diameters (in
are as follows. mm), as in the following table:

Case A Diameter (mm) xi Number of items ni


59.5 2
If χ 2 ≥ χn−1,α
2
1
or if χ 2 ≤ χn−1,1−α
2
2
we 59.6 6
reject the null hypothesis H0 for the alter- 59.7 7
native hypothesis H1 , where we have split 59.8 15
the significance level α into α1 and α2 such 59.9 17
that α1 +α2 = α. Otherwise we do not reject 60.0 25
the null hypothesis H0 . 60.1 15
60.2 7
Case B 60.3 3
If χ2< χn−1,α
2 we do not reject the null 60.4 1
60.5 2
hypothesis H0 . If χ 2 ≥ χn−1,α
2 we reject the
null hypothesis H0 for the alternative hypoth- Total N = 100

esis H1 .
We have a random sample drawn from
a normally distributed population where the
Case C
mean and variance are not known. The ven-
If χ 2 > χn−1,1−α
2 we do not reject the dor of these items would like the variance σ 2
null hypothesis H0 . If χ 2 ≤ χn−1,1−α
2 we to be smaller than or equal to 0.05. We test
We test the following hypotheses:

    null hypothesis H0 : σ² ≤ 0.05
    alternative hypothesis H1 : σ² > 0.05 .

In this case we use the one-tailed hypothesis test.
We start by calculating the mean for the sample:

    \bar{x} = \frac{\sum_{i=1}^{11} n_i \cdot x_i}{N} = \frac{5995}{100} = 59.95 .

We can then calculate the χ² statistic:

    \chi^2 = \frac{\sum_{i=1}^{11} n_i \cdot (x_i - \bar{x})^2}{\sigma_0^2} = \frac{2 \cdot (-0.45)^2 + \ldots + 2 \cdot (0.55)^2}{0.05} = \frac{3.97}{0.05} = 79.4 .

Using a significance level of α = 5%, we then find the value of χ²_{99,0.05} (= 123.2) in the chi-square table.
As χ² = 79.4 < χ²_{99,0.05}, we do not reject the null hypothesis, which means that the vendor should be happy to sell these items, since the variance of their diameters is not significantly greater than the desired value.
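The same test can be run in a few lines. The sketch below is an illustration only, not part of the original entry; it assumes numpy and scipy are available and reproduces the statistic 79.4 and the critical value 123.2 quoted above.

# Hedged illustration: chi-square test for a variance on the item-diameter data.
import numpy as np
from scipy.stats import chi2

diameters = np.array([59.5, 59.6, 59.7, 59.8, 59.9, 60.0, 60.1, 60.2, 60.3, 60.4, 60.5])
counts = np.array([2, 6, 7, 15, 17, 25, 15, 7, 3, 1, 2])
sigma0_sq = 0.05                                   # presumed upper bound on the variance

n = counts.sum()                                   # 100 observations
mean = (diameters * counts).sum() / n              # 59.95
stat = (counts * (diameters - mean) ** 2).sum() / sigma0_sq   # 79.4

critical = chi2.ppf(0.95, df=n - 1)                # ~123.2 for 99 degrees of freedom
print(round(stat, 1), "<", round(critical, 1), "-> H0 is not rejected")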

FURTHER READING
Chi-square distribution
Chi-square goodness of fit test
Chi-square table
Chi-square test of independence

REFERENCES
Bartlett, M.S.: Some examples of statistical methods of research in agriculture and applied biology. J. Roy. Stat. Soc. (Suppl.) 4, 137–183 (1937)
Ostle, B.: Statistics in Research: Basic Concepts and Techniques for Research Workers. Iowa State College Press, Ames, IA (1954)


Chi-square Test of Independence

The chi-square test of independence aims to determine whether two variables associated with a sample are independent or not. The variables studied are qualitative categorical variables.
The chi-square test of independence is performed using a contingency table.

HISTORY
The first contingency tables were used only for enumeration. However, encouraged by the work of Quetelet, Adolphe (1849), statisticians began to take an interest in the associations between the variables used in the tables. For example, Pearson, Karl (1900) performed fundamental work on contingency tables.
Yule, George Udny (1900) proposed a somewhat different approach to the study of contingency tables to Pearson's, which led to a disagreement between them. Pearson also argued with Fisher, Ronald Aylmer about the number of degrees of freedom to use in the chi-square test of independence. Everyone used different numbers until Fisher, R.A. (1922) was eventually proved to be correct.

MATHEMATICAL ASPECTS
Consider two qualitative categorical variables X and Y. We have a sample containing n observations of these variables.
These observations can be presented in a contingency table. We denote the observed frequency of category i of variable X and category j of variable Y by nij.

                           Categories of variable Y
                           Y1     ...    Yc     Total
Categories     X1          n11    ...    n1c    n1.
of             ...         ...    ...    ...    ...
variable X     Xr          nr1    ...    nrc    nr.
               Total       n.1    ...    n.c    n..

The hypotheses to be tested are:

    Null hypothesis H0: the two variables are independent,
    Alternative hypothesis H1: the two variables are not independent.

Steps Involved in the Test
1. Compute the expected frequencies, denoted by eij, for each cell of the contingency table under the independence hypothesis:

    e_{ij} = \frac{n_{i.} \cdot n_{.j}}{n_{..}} , \qquad n_{i.} = \sum_{k=1}^{c} n_{ik} \quad \text{and} \quad n_{.j} = \sum_{k=1}^{r} n_{kj} ,

where c represents the number of columns of the contingency table (the number of categories of variable Y) and r the number of rows (the number of categories of variable X).
2. Calculate the value of the χ² (chi-square) statistic, which is a measure of the deviation of the observed frequencies nij from the expected frequencies eij:

    \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(n_{ij} - e_{ij})^2}{e_{ij}} .

3. Choose the significance level α to be used in the test and compare the calculated value of χ² with the value obtained from the chi-square table, χ²_{v,α}. The number of degrees of freedom corresponds to the number of cells in the table that can take arbitrary values, the values taken by the other cells being imposed on them by the row and column totals. So, the number of degrees of freedom is given by:

    v = (r − 1)(c − 1) .

4. If the calculated χ² is smaller than the χ²_{v,α} from the table, we do not reject the null hypothesis: the two variables can be considered to be independent.
However, if the calculated χ² is greater than the χ²_{v,α} from the table, we reject the null hypothesis in favor of the alternative hypothesis. We can then conclude that the two variables are not independent.

DOMAINS AND LIMITATIONS
Certain conditions must be fulfilled in order to be able to apply the chi-square test of independence:
1. The sample, which contains n observations, must be a random sample;
2. Each individual observation can only appear in one category for each variable. In other words, each individual observation can only appear in one row and one column of the contingency table.
Note that the chi-square test of independence is not very reliable for small samples, especially when the expected frequencies are small (< 5). To avoid this issue we can group categories together, but only when the groups obtained are sensible.

EXAMPLES
We want to determine whether the proportion of smokers is independent of gender. The two variables to be studied are categorical and qualitative and contain two categories each:
• Variable "gender": M or F;
• Variable "smoking status": "smokes" or "does not smoke".
The hypotheses are then:

    H0: the chance of being a smoker is independent of gender;
    H1: the chance of being a smoker is not independent of gender.

The contingency table obtained from a sample of 100 individuals (n = 100) is shown below:

                    Smoking status
                    "smokes"   "does not smoke"   Total
Gender     M        21         44                 65
           F        10         25                 35
Total               31         69                 100

We denote the observed frequencies by nij (i = 1, 2; j = 1, 2).
We then estimate all of the frequencies in the table under the hypothesis that the two variables are independent of each other. We denote these expected frequencies by eij:

    e_{ij} = \frac{n_{i.} \cdot n_{.j}}{n_{..}} .

We therefore obtain:

    e11 = 65 · 31 / 100 = 20.15
    e12 = 65 · 69 / 100 = 44.85
    e21 = 35 · 31 / 100 = 10.85
    e22 = 35 · 69 / 100 = 24.15 .

The expected frequency table is given below:

                    Smoking status
                    "smokes"   "does not smoke"   Total
Gender     M        20.15      44.85              65
           F        10.85      24.15              35
Total               31         69                 100

If the null hypothesis H0 is true, the statistic

    \chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{(n_{ij} - e_{ij})^2}{e_{ij}}

is chi-square-distributed with (r − 1)(c − 1) = (2 − 1)(2 − 1) = 1 degree of freedom, and

    \chi^2 = 0.036 + 0.016 + 0.066 + 0.030 = 0.148 .

If a significance level of 5% is selected, the value of χ²_{1,0.05} is 3.84, from the chi-square table.
Since the calculated value of χ² is smaller than the value found in the chi-square table, we do not reject the null hypothesis and we conclude that the two variables studied are independent.
ni. · n.j independent.
eij = .
n..

We therefore obtain: FURTHER READING


 Chi-square distribution
65 · 31
e11 = = 20.15  Chi-square table
100
 Contingency table
65 · 69
e12 = = 44.85  Test of independence
100
82 Classification

REFERENCE ancient Greeks and the Romans all devel-


Fisher, R.A.: On the interpretation of χ 2 oped multiple typologies for human beings.
from contingency tables, and the calcula- The oldest comes from Galen (129–199
tion of P. J. Roy. Stat. Soc. Ser. A 85, 87–94 A.D.).
(1922) Later on, the concept of classification spread
Pearson, K.: On the criterion, that a given to the fields of biology and zoology; the
system of deviations from the probable in works of Linné (1707–1778) should be men-
the case of a correlated system of vari- tioned in this regard.
ables is such that it can be reasonably sup- The first ideas regarding actual methods of
posed to have arisen from random sam- cluster analysis are attributed to Adanson
pling. In: Karl Pearson’s Early Statisti- (eighteenth century). Zubin (1938), Tryon
cal Papers. Cambridge University Press, (1939) and Thorndike (1953) also attempted
pp. 339–357. First published in 1900 in to develop some methods, but the true devel-
Philos. Mag. (5th Ser) 50, 157–175 (1948) opment of classification methods coincides
with the advent of the computer.
Quetelet, A.: Letters addressed to H.R.H. the
Grand Duke of Saxe Coburg and Gotha,on
MATHEMATICAL ASPECTS
the Theory of Probabilities as Applied to
Classification methods can be divided into
the Moral and Political Sciences. (French
two large categories, one based on prob-
translation by Downs, Olinthus Gregory).
abilities and the other not.
Charles & Edwin Layton, London (1849)
The first category contains, for example, dis-
Yule, G.U.: On the association of attributes in criminating analysis. The second category
statistics: with illustration from the mate- can be further subdivided into two groups.
rial of the childhood society.Philos. Trans. The first group contains what are known as
Roy. Soc. Lond. Ser. A 194, 257–319 optimal classification methods. In the second
(1900) group, we can distinguish between several
subtypes of classification method:
• Partition methods that consist of distribut-
Classification ing n objects among g groups in such
a way that each object exclusively belongs
Classification is the grouping together of
to just one group. The number of groups g
similar objects. If each object is charac-
is fixed beforehand, and the partition
terized by p variables, classification can
applied most closely satisfies the classi-
be performed according to rational criteria.
fication criteria.
Depending on the criteria used, an object
• Partition methods incorporating infring-
could potentially belong to several classes.
ing classes, where an object is allowed to
belong to several groups simultaneously.
HISTORY • Hierarchical classification methods,
Classifying the residents of a locality or where the structure of the data at dif-
a country according to their sex and oth- ferent levels of classification is taken into
er physical characteristics is an activity that account. We can then take account for the
dates back to ancient times. The Hindus, the relationships that exist between the dif-
Cluster Analysis 83

ferent groups created during the partition. managers, workers, and so on, and one can
Two different kinds of hierarchical tech- calculate average salaries, average frequen-
niques exist: agglomerating techniques cy of health problems, and so on, for each
and dividing techniques. class.
Agglomerating methods start with sepa- In the second case, classification will lead to C
rated objects, meaning that the n objects a prediction and then to an action. For exam-
are initially distributed into n groups. Two ple, if the foxes in a particular region exhibit
groups are agglomerated in each subse- apathetic behavior and excessive salivation,
quent step until there is only one group we can conclude that there is a new rabies
left. epidemic. This should then prompt a vaccine
In contrast, dividing methods start with all campaign.
of the objects grouped together, meaning
that all of the n objects are in one single FURTHER READING
group to start with. New groups are creat-  Clusteranalysis
ed at each subsequent step until there are  Complete linkage method
n groups.  Data analysis
• Geometric classification, in which the
objects are depicted on a scatter plot and REFERENCES
then grouped according to position on the Everitt, B.S.: Cluster Analysis. Halstead,
plot. In a graphical representation, the London (1974)
proximities of the objects to each other in
Gordon, A.D.: Classification. Methods for
the graphic correspond to the similarities
the Exploratory Analysis of Multivariate
between the objects.
Data. Chapman & Hall, London (1981)
The first three types of classification are gen-
erally grouped together under the term clus- Kaufman, L., Rousseeuw, P.J.: Finding
ter analysis. What they have in common Groups in Data: An Introduction to Clus-
is the fact that the objects to be classified ter Analysis. Wiley, New York (1990)
must present a certain amount of structure Thorndike, R.L.: Who belongs in a family?
that allows us to measure the degree of sim- Psychometrika 18, 267–276 (1953)
ilarity between the objects. Tryon, R.C.: Cluster Analysis. McGraw-
Each type of classification contains a multi- Hill, New York (1939)
tudeof methodsthatallowusto createclasses
of similar objects. Zubin, J.: A technique for measuring like-
mindedness. J. Abnorm. Social Psychol.
DOMAINS AND LIMITATIONS 33, 508–516 (1938)
Classification can be used in two cases:
• Description cases;
• Prediction cases.
Cluster Analysis
In the first case, the classification is done Clustering is the partitioning of a data set
on the basis of some generally accepted into subsets or clusters, so that the degree of
standard characteristics. For example, pro- association is strong between members of the
fessions can be classified into freelance, same cluster and weak between members of
84 Cluster Analysis

different clusters according to some defined until all of the objects are gathered into the
distance measure. same class.
Several methods of performing cluster Note that the distance between the group
analysis exist: formed from aggregated elements and the
• Partitional clustering other elements can be defined in different
• Hierarchical clustering. ways, leading to different methods. Exam-
ples include the single link method and the
complete linkage method.
HISTORY
The single link method is a hierarchical
See classification and data analysis.
classification method that uses the Euclidean
distance to establish a distance table, and
MATHEMATICAL ASPECTS the distance between two classes is given by
To carry out cluster analysis on a set of the Euclidean distance between the two clos-
n objects, we need to define a distance est elements (the minimum distance).
between the objects (or more generally In the complete linkage method, the dis-
a measure of the similarity between the tance between two classes is given by the
objects) that need to be classified. The exis- Euclidean distance between the two ele-
tence of some kind of structure within the ments furthest away (the maximum dis-
set of objects is assumed. tance).
To carry out a hierarchical classification of Given that the only difference between these
a set E of objects {x1 , x2 , . . . , xn }, it is neces- two methods is that the distance between two
sary to define a distance associated with E classes is either the minimum and the max-
that can be used to obtain a distance table imum distance, only the single link method
between the objects of E. Similarly, a dis- will be considered here.
tance must also be defined for any subsets For a set E = {X1 , X2 , . . . , Xn }, the distance
of E. table for the elements of E is then estab-
One approach to hierarchical clustering is to lished.
use the agglomerating method. It can be sum- Since this table is symmetric and null along
marized in the following algorithm: its diagonal, only one half of the table is con-
1. Locate the pair of objects (xi , xj ) which sidered:
have the smallest distance between each d(X1 , X2 ) d(X1 , X3 ) . . . d(X1 , Xn )
other. d(X2 , X3 ) . . . d(X2 , Xn )
2. Aggregate the pair of objects (xi , xj ) into ..
a single element α and re-establish a new . ...
distance table. This is achieved by sup- d(Xn−1 , Xn )
pressing the lines and columns associat- where d(Xi , Xj ) is the Euclidean distance
ed with xi and xj and replacing them with between Xi and Xj for i < j, where the values
a line and a column associated with α. The of i and j are between 1 and n.
new distance table will have a line and The algorithm for the single link method is
a column less than the previous table. as follows:
3. Repeat these two operations until the • Search for the minimum d(Xi , Xj ) for
desired number of classes are obtained or i < j;
Cluster Analysis 85

• The elements Xi and Xj are aggregated into DOMAINS AND LIMITATIONS


a new group Ck = Xi ∪ Xj ; The choice of the distance between the
• The set E is then partitioned into group formed of aggregated elements and
the other elements can be operated in several
{X1 }, . . . , {Xi−1}, {Xi , Xj }, {Xi+1}, . . . , C
ways, according to the method that is used,
{Xj−1}, {Xj+1 }, . . . , {Xn} ; as for example in the single link method and
• The distance table is then recreated in the complete linkage method.
without the lines and columns associ-
ated with Xi and Xj , and with a line
and a column representing the distances EXAMPLES
between Xm and Ck , m = 1, 2, . . . , n m = Let us illustrate how the single link method
i and m = j, given by: of cluster analysis can be applied to the
examination grades obtained by five students
d(Ck , Xm ) = min{d(Xi , Xm ); d(Xj, Xm )}. each studying four courses: English, French,
maths and physics.
The algorithm is repeated until the desired
We want to divide these five students into two
number of groups is attained or until there
groups using the single link method.
is only one group containing all of the ele-
The grades obtained in the examinations,
ments.
which range from 1 to 6, are summarized in
In the general case, the distance between two
the following table:
groups is given by:

d(Ck , Cm ) = min{d(Xi , Xj ) with Xi English French Maths Physics


belonging to Ck and Xj to Cm } , Alain 5.0 3.5 4.0 4.5
Jean 5.5 4.0 5.0 4.5
The formula quoted previously applies to the Marc 4.5 4.5 4.0 3.5
particular case when the groups are com- Paul 4.0 5.5 3.5 4.0
posed, respectively, of two elements and one Pierre 4.0 4.5 3.0 3.5
single element.
This series of agglomerations can be repre-
We then work out the Euclidian distances
sented by a dendrogram, where the abscis-
between the students and use them to create
sa shows the distance separating the objects.
a distance table:
Note that we could find more than one pair
when we search for the pair of closest ele-
ments. In this case, the pair that is select- Alain Jean Marc Paul Pierre

ed for aggregation in the first step does not Alain 0 1.22 1.5 2.35 2

influence later steps (provided the algorithm Jean 1.22 0 1.8 2.65 2.74

does not finish at this step), because the oth- Marc 1.5 1.8 0 1.32 1.12

er pair of closest elements will be aggre- Paul 2.35 2.65 1.32 0 1.22

gated in the following step. The aggrega- Pierre 2 2.74 1.12 1.22 0

tion order is not shown on the dendrogram


because it reports the distance that separates By only considering the upper part of this
two grouped objects. symmetric table, we obtain:
86 Cluster Analysis

Jean Marc Paul Pierre of Marc and Pierre on the one side and Paul
Alain 1.22 1.5 2.35 2 on the other side (in other words, two pairs
Jean 1.8 2.65 2.74 exhibit the minimum distance); let us choose
Marc 1.32 1.12 to regroup Alain and Jean first. The other
Paul 1.22 pair will be aggregated in the next step. We
rebuild the distance table and obtain:
The minimum distance is 1.12, between
Marc and Pierre; we therefore form the first d({Alain,Jean}, {Marc,Pierre})
group from these two students. We then cal- = min{d(Alain,{Marc,Pierre}) ,
culate the new distances.
d(Jean,{Marc,Pierre})}
For example, we calculate the new dis-
tance between Marc and Pierre on one side = min{1.5; 1.8} = 1.5
and Alain on the other by taking the mini-
as well as:
mum distance between Marc and Alain and
the minimum distance between Pierre and d({Alain,Jean},Paul)
Alain: = min{d(Alain,Paul);d(Jean,Paul)}
d({Marc,Pierre},Alain) = min{2.35; 2.65} = 2.35 .
= min{d(Marc,Alain);d(Pierre,Alain)}
This gives the following distance table:
= min{1.5; 2} = 1.5 ,

also Marc and Pierre Paul


Alain and Jean 1.5 2.35
d({Marc,Pierre},Jean) Marc and Pierre 1.22
= min{d(Marc,Jean);d(Pierre,Jean)}
= min{1.8; 2.74} = 1.8 , Notice that Paul must now be integrated in
the group formed from Marc and Pierre, and
and the new distance will be:
d({Marc,Pierre},Paul) d({(Marc,Pierre),Paul},{Alain,Jean})
= min{d(Marc,Paul);d(Pierre,Paul)} = min{d({Marc,Pierre},{Alain,Jean}) ,
= min{1.32; 1.22} = 1.22 . d(Paul,{Alain,Jean})}
The new distance table takes the following = min{1.5; 2.35} = 1.5
form:
which gives the following distance table:
Jean Marc and Paul
Pierre Alain and Jean
Alain 1.22 1.5 2.35 Marc, Pierre and Paul 1.5
Jean 1.8 2.65
Marc and Pierre We finally obtain two groups:
1.22
{Alain and Jean} and {Marc, Pierre and
The minimum distance is now 1.22, between Paul} which are separated by a distance of
Alain and Jean and also between the group 1.5.
Cluster Sampling 87

The following dendrogram illustrates the


Cluster Sampling
successive aggregations:
In cluster sampling, the first step is to divide
the population into subsets called clusters.
Each cluster consists of individuals that are C
supposed to be representative of the popula-
tion.
Cluster sampling then involves choosing
a random sample of clusters and then observ-
ing all of the individuals that belong to each
of them.
FURTHER READING
 Classification HISTORY
 Complete linkage method See sampling.
 Dendrogram
 Distance MATHEMATICAL ASPECTS
 Distance table Cluster sampling is the process of random-
ly extracting representative sets (known as
clusters) from a larger population of units
REFERENCES
and then applying a questionnaire to all of
Celeux, G., Diday, E., Govaert, G., Lecheval-
the units in the clusters. The clusters often
lier, Y., Ralambondrainy, H.: Classifi-
consist of geographical units, like city dis-
cation automatique des données—aspects
tricts. In this case, the method involves divid-
statistiques et informatiques. Dunod,
ing a city into districts, and then selecting the
Paris (1989)
districts to be included in the sample. Finally,
Everitt, B.S.: Cluster Analysis. Halstead, all of the people or households in the chosen
London (1974) district are questioned.
Gordon, A.D.: Classification. Methods for There are two principal reasons to perform
the Exploratory Analysis of Multivariate cluster sampling. In many inquiries, there is
Data. Chapman & Hall, London (1981) no complete and reliable list of the popu-
Jambu, M., Lebeaux, M.O.: Classification lation units on which to base the sampling,
automatique pour l’analyse de données. or it may be that it is too expensive to cre-
Dunod, Paris (1978) ate such a list. For example, in many coun-
tries, including industrialized ones, it is rare
Kaufman, L., Rousseeuw, P.J.: Finding
to have complete and up-to-date lists of all of
Groups in Data: An Introduction to Clus-
the members of the population, households
ter Analysis. Wiley, New York (1990)
or rural estates. In this situation, sampling
Lerman, L.C.: Classification et analyse ordi- can be achieved in a geographical manner:
nale des données. Dunod, Paris (1981) each urban region is divided up into districts
Tomassone, R., Daudin, J.J., Danzart, M., and each rural region into rural estates. The
Masson, J.P.: Discrimination et classe- districts and the agricultural areas are con-
ment. Masson, Paris (1988) sidered to be clusters and we use the com-
88 Coefficient of Determination

plete list of clusters because we do not have icant than the increase in sampling variance
a complete and up-to-date list of all popu- that will result.
lation units. Therefore, we sample a requi-
site number of clusters from the list and then EXAMPLES
question all of the units in the selected clus- Consider N, the size of the population of
ter. town X. We want to study the distribution of
“Age” for town X without performing a cen-
DOMAINS AND LIMITATIONS sus. The population is divided into G parts.
The advantage of cluster sampling is that it Simple random sampling is performed
is not necessary to have a complete, up-to- amongst these G parts and we obtain g parts.
date list of all of the units of the population The final sample will be composed of all the
to perform analysis. individuals in the g selected parts.
For example, in many countries, there are no
updated lists of people or housing. The costs FURTHER READING
of creating such lists are often prohibitive.  Sampling
It is therefore easier to analyze subsets of the  Simple random sampling
population (known as clusters).
In general, cluster sampling provides esti- REFERENCES
mations that are not as precise as simple Hansen, M.H., Hurwitz, W.N., Madow,
random sampling, but this drop in accuracy M.G.: Sample Survey Methods and The-
is easily offset by the far lower cost of cluster ory. Vol. I. Methods and Applications.
sampling. Vol. II. Theory. Chapman & Hall, Lon-
In order to perform cluster sampling as effi- don (1953)
ciently as possible:
• The clusters should not be too big, and
there should be a large enough number of Coefficient
clusters, of Determination
• cluster sizes should be as uniform as pos-
The coefficient of determination, denoted
sible;
R2 , is the quotient of the explained varia-
• The individuals belonging to each clus-
tion (sum of squares due to regression) to the
ter must be as heterogenous as possible
total variation (total sum of squares total SS
with respect to the parameter being ob-
(TSS)) in a model of simple or multiple lin-
served.
ear regression:
Another reason to use cluster sampling is
cost. Even when a complete and up-to-date Explained variation
R2 = .
list of all population units exists, it may be Total variation
preferable to use cluster sampling from an It equals the square of the correlation coeffi-
economic point of view, since it is complet- cient, and it can take values between 0 and 1.
ed faster, involves fewer workers and min- It is often expressed as a percentage.
imizes transport costs. It is therefore more
appropriate to use cluster sampling if the HISTORY
money saved by doing so is far more signif- See correlation coefficient.
Coefficient of Determination 89

MATHEMATICAL ASPECTS These concepts and the relationships


Consider the following model for multiple between them are presented in the following
regression: graph:

Yi = β0 + β1 Xi1 + · · · + βp Xip + εi C
for i = 1, . . . , n, where

Yi are the dependent variables,


Xij (i = 1, . . . , n, j = 1, . . . , p) are the inde-
pendent variables,
εi are the random nonobservable error
terms,
βj (j = 1, . . . , p) are the parameters to be
estimated. Using these concepts, we can define R2 ,
which is the determination coefficient. It
Estimating the parameters β0 , β1 , . . . , βp
measures the proportion of variation in vari-
yields the estimation
able Y, which is described by the regression
Ŷi = β̂0 + β̂1 Xi1 + · · · + β̂p Xip . equation as:
n
(Ŷi − Ȳ)2
The coefficient of determination allows us to
REGSS i=1
measure the quality of fit of the regression R2 = = n
TSS 
equation to the measured values. (Yi − Ȳ)2
To determine the quality of the fit of the i=1
regression equation, consider the gap If the regression function is to be used to
between the observed value and the esti- make predictions about subsequent observa-
mated value for each observation of the tions, it is preferable to have a high value
sample. This gap (or residual) can also be of R2 , because the higher the value of R2 , the
expressed in the following way: smaller the unexplained variation.


n 
n
(Yi − Ȳ)2 = (Yi − Ŷi )2 EXAMPLES
i=1 i=1 The following table gives values for the

n Gross National Product (GNP) and the
+ (Ŷi − Ȳ)2 demand for domestic products covering the
i=1 1969–1980 period for a particular country.
TSS = RSS + REGSS
Year GNP Demand for dom-
where X estic products Y
1969 50 6
TSS is the total sum of squares,
1970 52 8
RSS the residual sum of squares and
1971 55 9
REGSS the sum of the squares of the
1972 59 10
regression.
90 Coefficient of Determination

Year GNP Demand for dom- We can calculate the mean using:
X estic products Y
1973 57 8 
12
Yi
1974 58 10
i=1
1975 62 12 Ȳ = = 9.833 .
n
1976 65 9
Yi Ŷi (Yi − Ȳ )2 (Ŷi − Ȳ )2
1977 68 11
6 7.260 14.694 6.622
1978 69 10
8 7.712 3.361 4.500
1979 70 11
9 8.390 0.694 2.083
1980 72 14 10 9.294 0.028 0.291
8 8.842 3.361 0.983
We will try to estimate the demand for small 10 9.068 0.028 0.586
goods as a function of GNP according to the 12 9.972 4.694 0.019
model 9 10.650 0.694 0.667
11 11.328 1.361 2.234
Yi = a + b · Xi + εi , i = 1, . . . , 12 . 10 11.554 0.028 2.961
11 11.780 1.361 3.789
Estimating the parameters a and b by the 14 12.232 17.361 4.754
least squares method yields the following Total 47.667 30.489
estimators:
We therefore obtain:

12
(Xi − X̄)(Yi − Ȳ) 30.489
R2 = = 0.6396
i=1 47.667
b̂ = = 0.226 ,

12
or, in percent:
(Xi − X̄)2
i=1 R2 = 63.96% .
â = Ȳ − b̂ · X̄ = −4.047 .
We can therefore conclude that, according to
The estimated line is written the model chosen, 63.96% of the variation in
the demand for small goods is explained by
Ŷ = −4.047 + 0.226 · X . the variation in the GNP.
Obviously the value of R2 cannot exceed
The quality of the fit of the measured points 100%. While 63.96% is relatively high, it is
to the regression line is given by the deter- not close enough to 100% to rule out trying
mination coefficient: to modify the model further.
This analysis also shows that other variables
12
apart from the GNP should be taken into
(Ŷi − Ȳ)2 account when determining the function cor-
REGSS i=1
R =
2
= . responding to the demand for small goods,
TSS 12
since the GNP only partially explains the
(Yi − Ȳ) 2

i=1
variation.
Coefficient of Kurtosis 91

FURTHER READING where x̄ is the arithmetic mean, n is the total


 Correlation coefficient number of observations and S4 is the stan-
 Multiple linear regression dard deviation of the sample raised to the
 Regression analysis fourth power.
 Simple linear regression For the case where a random vari- C
able X takes values xi with frequencies
fi , i = 1, 2, . . . , h, the centered fourth-order
Coefficient of Kurtosis moment of the sample is given by the for-
The coefficient of kurtosis is used to mea- mula:
1 
h
sure the peakness or flatness of a curve. It
m4 = · fi · (xi − x̄)4 .
is based on the moments of the distribution. n
i=1
This coefficient is one of the measures of
kurtosis. DOMAINS AND LIMITATIONS
For a normal distribution, the coefficient of
HISTORY kurtosis is equal to 3. Therefore a curve will
See coefficient of skewness. be called platikurtic (meaning flatter than the
normal distribution) if it has a kurtosis coef-
MATHEMATICAL ASPECTS ficient smaller than 3. It will be leptokur-
The coefficient of kurtosis (β2 ) is based tic (meaning sharper than the normal distri-
on the centered fourth-order moment of bution) if β2 is greater than 3.
a distribution which is equal to: Let us now prove that the coefficient of kurto-
  sis is equal to 3 for the normal distribution.
μ4 = E (X − μ)4 . We know that β2 = μ 4
, meaning that the
σ4
centered fourth-order moment is divided by
In order to obtain a coefficient of kurtosis
the standard deviation raised to the fourth
that is independent of the units of measu-
power. It can be proved that the centered sth
rement, the fourth-order moment is divided
order moment, denoted μs , satisfies the fol-
by the standard deviation of the popula-
lowing relation for a normal distribution:
tion σ raised to the fourth power. The coef-
ficient of kurtosis then becomes equal to: μs = (s − 1)σ 2 · μs−2 .
μ4 This formula is a recursive formula which
β2 = .
σ4 expresses higher order moments as a func-
For a sample (x1 , x2 , . . . , xn ), the estimator tion of lower order moments.
of this coefficient is denoted by b2. It is equal Given that μ0 = 1 (the zero-order moment
to: of any random variable is equal to 1, since it
m4
b2 = 4 , is the expected value of this variable raised
S to the power zero) and that μ1 = 0 (the cen-
where m4 is the centered fourth-order tered first-order moment is zero for any ran-
moment of the sample, given by: dom variable), we have:
1 
n
μ2 = σ 2
m4 = · (xi − x̄)4 ,
n μ3 = 0
i=1
92 Coefficient of Skewness

μ4 = 3σ 4 Since S = 33.88, the coefficient of kurtosis


etc. . is equal to:
The coefficient of kurtosis is then equal to:
75 (281020067.84)
1
μ4 3σ 4 β2 = = 2.84 .
β2 = 4 = 4 = 3 . (33.88)4
σ σ
Since β2 is smaller than 3, we can conclude
EXAMPLES that the distribution of the daily turnover in
We want to calculate the kurtosis of the distri- 75 bakeries is platikurtic, meaning that it is
bution of daily turnover for 75 bakeries. Let flatter than the normal distribution.
us calculate the coefficient of kurtosis β2
using the following data: FURTHER READING
Table categorizing the daily turnovers of 75  Measure
of kurtosis
bakeries  Measure of shape
Turnover Frequencies
215–235 4 REFERENCES
235–255 6 See coefficient of skewness β1 de Pearson.
255–275 13
275–295 22
295–315 15 Coefficient of Skewness
315–335 6 The coefficient of skewness measures the
335–355 5 skewness of a distribution. It is based on the
355–375 4 notion of the moment of the distribution.
The fourth-order moment of the sample is This coefficient is one of the measures of
given by: skewness.

1
h
m4 = fi (xi − x̄)4 , HISTORY
n
i=1 Between the end of the nineteenth centu-
where n = 75, x̄ = 290.60 and xi is the cen- ry and the beginning of the twentieth cen-
ter of class interval i. We can summarize the tury, Pearson, Karl studied large sets of
calculations in the following table: data which sometimes deviated significant-
ly from normality and exhibited consider-
xi xi − x̄ fi fi (xi − x̄ )4
able skewness.
225 −65.60 4 74075629.16
He first used the following coefficient as
245 −45.60 6 25942428.06
a measure of skewness:
265 −25.60 13 5583457.48
285 −5.60 22 21635.89 x̄ − mode
skewness = ,
305 14.40 15 644972.54 S
325 34.40 6 8402045.34
where x̄ represents thearithmetic mean and
345 54.40 5 43789058.05
S the standard deviation.
365 74.40 4 122560841.32
This measure is equal to zero if the data are
281020067.84 distributed symmetrically.
Coefficient of Skewness 93

He discovered empirically that for a mod- third power. The coefficient obtained, desig-

erately asymmetric distribution (the gamma nated by β1 , is equal to:
distribution):  μ3
β1 = 3 .
σ
Mo − x̄ ≈ 3 · (Md − x̄) , C
The estimator of this coefficient, calculat-
where Mo and Md denote the mode and ed for a sample (x1 , x2 , . . . xn ), is denoted by

the median of data set. By substituting this b1 . It is equal to:
expression into the previous coefficient, the  m3
following alternative formula is obtained: b1 = 3 ,
S
3 · (x̄ − Md ) where m3 is the centered third-order
skewness = .
S moment of the sample, given by:
Following this, Pearson, K. (1894,1895) 1 
n
introduced a coefficient of skewness, known m3 = · (xi − x̄)3 .
n
as the β1 coefficient, based on calculations i=1

of the centered moments. This coefficient Here x̄ is the arithmetic mean, n is the total
is more difficult to calculate but it is more number of observations and S3 is the stan-
descriptive and better adapted to large num- dard deviation raised to the third power.
bers of observations. For the case where a random variable X
Pearson, K. also created the coefficient of takes values xi with frequencies fi , i =
kurtosis (β2 ), which is used to measure the 1, 2, . . . , h, the centered third-order moment
oblateness of a curve. This coefficient is also of the sample is given by the formula:
based on the moments of the distribution
1 
h
being studied.
m3 = · fi · (xi − x̄)3 .
Tables giving the limit values of the coeffi- n
i=1
cients β1 and β2 can be found in the works
of Pearson and Hartley (1966, 1972). If the If the coefficient is positive, the distribution
sample estimates a fall outside the limit for spreads to the the right. If it is negative, the
β1 , β2 , we conclude that the population is distribution expands to the left. If it is close to
significantly curved or skewed. zero, the distribution is approximately sym-
metric.
If the sample is taken from a normal pop-
MATHEMATICAL ASPECTS √
ulation, the statistic b1 roughly follows
The skewness coefficient is based on the cen-
a normal distribution with  a mean of 0 and
tered third-order moment of the distribution
a standard deviation of 6n . If the size of
in question, which is equal to:
the sample n is bigger than 150, the nor-
 
μ3 = E (X − μ)3 . mal table can be used to test the skewness
hypothesis.
To obtain a coefficient of skewness that is
independent of the measuring unit, the third- DOMAINS AND LIMITATIONS
order moment is divided by the standard This coefficient (similar to the other mea-
deviation of the population σ raised to the sures of skewness) is only of interest if it
94 Coefficient of Skewness

can be used to compare the shapes of two or Since S = 33.88, the coefficient of skewness
more distributions. is equal to:

1
EXAMPLES  (821222.41) 10949.632
Suppose that we want to compare the shapes b1 = 75 =
(33.88)3 38889.307
of the daily turnover distributions obtained
for 75 bakeries for two different years. We = 0.282 .
then calculate the skewness coefficient in
For year 2, n = 75 and x̄ = 265.27. The
both cases.
calculations are summarized in the following
The data are categorized in the table below:
table:
Turnover Frequencies Frequencies
for year 1 for year 2 xi xi − x̄ fi fi (xi − x̄ )3
225 −40.27 25 −1632213.81
215–235 4 25
245 −20.27 15 −124864.28
235–255 6 15
265 −0.27 9 −0.17
255–275 13 9
285 19.73 8 61473.98
275–295 22 8
305 39.73 6 376371.09
295–315 15 6 325 59.73 5 1065663.90
315–335 6 5 345 79.73 4 2027588.19
335–355 5 4 365 99.73 3 2976063.94
355–375 4 3 4750082.83

The third-order moment of the sample is Since S = 42.01, the coefficient of skewness
given by: is equal to:
1 
h
m3 = · fi · (xi − x̄)3 . 1
n  (4750082.83) 63334.438
i=1 b1 = 75 =
(42.01) 3 74140.933
For year 1, n = 75, x̄ = 290.60 and xi is the
= 0.854 .
center of each class interval i. The calcula-
tions are summarized in the following table: The coefficient of skewness for year 1 is

close to zero ( b1 = 0.282), so the daily
xi xi − x̄ fi fi (xi − x̄ )3
turnover distribution for the 75 bakeries for
225 −65.60 4 −1129201.664
year 1 is very close to being a symmetrical
245 −45.60 6 −568912.896
distribution. For year 2, the skewness coef-
265 −25.60 13 −218103.808
ficient is higher; this means that the distri-
285 −5.60 22 −3863.652
bution spreads towards the right.
305 14.40 15 44789.760
325 34.40 6 244245.504
345 54.40 5 804945.920 FURTHER READING
365 74.40 4 1647323.136  Measure of shape
821222.410  Measure of skewness
Coefficient of Variation 95

REFERENCES in other words:


Pearson, E.S., Hartley, H.O.: Biometrika S
Tables for Statisticians, vols. I and II. CV = · 100

Cambridge University Press, Cambridge for a sample, where:
(1966,1972) C
S is the standard deviation of the sample,
Pearson, K.: Contributions to the mathe-
and
matical theory of evolution. I. In: Karl
x̄ is the arithmetic mean of the sample,
Pearson’s Early Statistical Papers. Cam-
bridge University Press, Cambridge, or: σ
pp. 1–40 (1948). First published in 1894 CV = · 100
μ
as: On the dissection of asymmetrical fre-
for a population, where
quency curves. Philos. Trans. Roy. Soc.
Lond. Ser. A 185, 71–110 σ is the standard deviation of the popula-
tion, and
Pearson, K.: Contributions to the mathe-
μ is the mean of the population.
matical theory of evolution. II: Skew
variation in homogeneous material. In: This coefficient is independent of the unit of
Karl Pearson’s Early Statistical Papers. measurement used for the variable.
Cambridge University Press, Cambridge,
pp. 41–112 (1948). First published in EXAMPLES
1895 in Philos. Trans. Roy. Soc. Lond. Let us study the salary distributions for two
Ser. A 186, 343–414 companies from two different countries.
According to a survey, the arithmetic
means and the standard deviations of the
Coefficient of Variation salaries are as follows:

The coefficient of variation is a measure of Company A:


relative dispersion. It describes the stan- x̄ = 2500 CHF ,
dard deviation as a percentage of the arith- S = 200 CHF ,
metic mean.
200
This coefficient can be used to compare the CV = · 100 = 8% .
2500
dispersions of quantitative variables that
are not expressed in the same units (for exam- Company B:
ple, when comparing the salaries in differ- x̄ = 1000 CHF ,
ent countries, given in different currencies), S = 50 CHF ,
or the dispersions of variables that have very
50
different means. CV = · 100 = 5% .
1000
The standard deviation represents 8% of
MATHEMATICAL ASPECTS the arithmetic mean for company A, and
The coefficient of variation CV is defined as 5% for company B. The salary distribution
the ratio of the standard deviation to the is a bit more homogeneous for company B
arithmetic mean for a set of observations; than for company A.
96 Collinearity

FURTHER READING MATHEMATICAL ASPECTS


 Arithmetic mean In the case of a matrix of explanatory vari-
 Measure of dispersion ables X, collinearity means that one of the
 Standard deviation columns of X is (approximately) a linear
combination of the other columns. This
implies that X X is almost singular. Con-
REFERENCES
sequently, the estimator obtained
by the
Johnson, N.L., Leone, F.C.: Statistics and
least squares method β  = X X −1 X Y
experimental design in engineering and
is obtained by inverting an almost singu-
the physical sciences. Wiley, New York
lar matrix, which causes its components
(1964)
to become unstable. The ridge regression
technique was created in order to deal with
these collinearity problems.
Collinearity A collinear relation between more than two
variables will not always be the result of
Variables are known to be mathematically
observing the pairwise correlations between
collinear if one of them is a linear com-
the variables. A better indication of the pres-
bination of the other variables. They are
ence of a collinearity problem is provided by
known as statistically collinear if one of
variance inflation factors, VIF. The variance
them is approximately a linear combination
inflation factor of an explanatory variable Xj
of other variables. In the case of a regres-
is defined by:
sion model where the explanatory variables
are strongly correlated to each other, we say 1
VIFj = ,
that there is collinearity (or multicollineari- 1 − Rj2
ty) between the explanatory variables. In the 2
first case, it is simply impossible to define where Rj is the coefficient of determina-
least squares estimators, and in the second tion of the model

case, these estimators can exhibit consider- Xj = β0 + βk X k + ε .
able variance. k =j

ThecoefficientVIF takesvaluesof between 1


HISTORY and ∞. If the Xj are mathematically collinear
The term “collinearity” was first used in with other variables, we get Rj2 = 1 and
mathematics at the beginning of the twen- VIFj = ∞. On the other hand, if the Xj are
tieth century, due to the rediscovery of the reciprocally independent, we have Rj2 = 0
theorem of Pappus of Alexandria (a third- and VIFj = 1. In practice, we consider that
century mathematician). Let A, B, C be three there is a real problem with collinearity when
points on a line and A’, B’, C’ be three points VIFj is greater then 100, which corresponds
on a different line. If we relate the pairs to a Rj2 that is greater then 0.99.
using AB’.A’B, CA’.AC’ and BC’.B’C, their
intersections will occur in a line; in other DOMAINS AND LIMITATIONS
words, the three intersection points will be Inverting a singular matrix, similar to invert-
collinear. ing 0, is not a valid operation. Using the
Collinearity 97

same principle, inverting an almost singu- Por- Ingre- Ingre- Ingre- Ingre- Heat
lar matrix is similar to inverting a very small tion dient dient dient dient

number. Some of the elements of the matrix i 1 X1 2 X2 3 X3 4 X4 Y


must therefore be very big. Consequently, 3 11 56 8 20 104.3
when the explanatory variables are collinear, 4 11 31 8 47 87.6
C
some elements of the matrix (X X)−1 of β  5 7 54 6 33 95.9
will probably be very large. This is why 6 11 55 9 22 109.2
collinearity leads to unstable regression esti-
7 3 71 17 6 102.7
mators. Aside from this problem, collinear-
8 1 31 22 44 72.5
ity also results in a calculation problem; it is
9 2 54 18 22 93.1
difficult to precisely calculate the inverse of
an almost singular matrix. 10 21 48 4 26 115.9
11 1 40 23 34 83.9
EXAMPLES 12 11 66 9 12 113.3
Thirteen portions of cement are examined 13 10 68 8 12 109.4
in the following example. Each portion con- Source: Birkes & Dodge (1993)
tains four ingredients, as described in the
table. The goal of the experiment is to deter-
We start with a simple linear regression.
mine how the quantities X1 , X2 , X3 and X4 ,
The model used for the linear regression
corresponding to the quantities of these four
is:
ingredients, affect the quantity Y of heat giv-
en out as the cement hardens. Y = β0
Yi quantity of heat given out during the + β1 X 1 + β2 X 2 + β3 X 3 + β4 X 4
hardening of the ith portion (in joules); +ε.
Xi1 quantity of ingredient 1 (tricalcium alu-
minate) in the ith portion; We obtain the following results:
Xi2 quantity of ingredient 2 (tricalcium sil-
icate) in the ith portion; Variable β̂i S.d. tc

Xi3 quantity of the ingredient 3 (tetracalci- Constant 62.41 70.07 0.89

um aluminoferrite) in the ith portion; X1 1.5511 0.7448 2.08


X2 0.5102 0.7238 0.70
Xi4 quantity of ingredient 4 (dicalcium sil-
X3 0.1019 0.7547 0.14
icate) in the ith portion.
X4 −0.1441 0.7091 −0.20

Table: Heat given out by the cement portions


during hardening We can see that only coefficient X1 is signif-
icantly greater then zero; in the other cases
Por- Ingre- Ingre- Ingre- Ingre- Heat
tion dient dient dient dient
tc < t(α/2,n−2) (value taken from the Student
table) for a significance level of α = 0.1.
i 1 X1 2 X2 3 X3 4 X4 Y
Moreover, the standard deviations of the esti-
1 7 26 6 60 78.5 3 and β
2 , β 4 are greater
mated coefficients β
2 1 29 12 52 74.3
then the coefficients themselves. When the
98 Combination

variables are strongly correlated, it is known FURTHER READING


that the effect of one can mask the effect of  Multiple linear regression
another. Because of this, the coefficients can  Ridge regression
appear to be insignificantly different from
zero. REFERENCES
To verify the presence of multicollinearity Birkes, D., Dodge, Y.: Alternative Methods
for acoupleof variables,wecalculatethecor- of Regression. Wiley, New York (1993)
relation matrix.
Bock, R.D.: Multivariate Statistical Meth-
ods in Behavioral Research. McGraw-
X1 X2 X3 X4
Hill, New York (1975)
X1 1 0.229 −0.824 0.245
X2 0.229 1 −0.139 −0.973
X3 −0.824 −0.139 1 0.030
X4 0.245 −0.973 0.030 1 Combination
A combination is an un-ordened collection
We note that a strong negative correlation of unique elements or objects.
(−0.973) exists between X2 and X4 . Look- A k-combination is a subset with k elements.
ing at the data, we can see the reason for The number of k-combinations from a set of
that. Aside from portions 1 to 5, the total n elements is the number of arrangements.
quantity of silicates (X2 + X4 ) is almost
constant across the portions, and is approx-
HISTORY
imately 77; therefore, X4 is approximately
See combinatory analysis.
77 − X2 . this situation does not allow us
to distinguish between the individual effects
MATHEMATICAL ASPECTS
of X2 and those of X4 . For example, we see
that the four largest values of X4 (60, 52, 47 1. Combination without repetition
and 44) correspond to values of Y smaller Combination without repetition describe
then the mean heat 95.4. Therefore, at first the situation where each object drawn is
not placed back for the next drawing.Each
sight it seems that large quantities of ingre-
object can therefore only occur once in
dient 4 will lead to small amounts of heat.
each group.
However, we also note that the four largest
The number of combination without rep-
values of X4 correspond to the four small-
etition of k objects among n is given by:
est values of X2 , giving a negative correla-
tion between X2 and X4 . This suggests that

n n!
ingredient 4 taken alone has a small effect Cn =
k
= .
k k! · (n − k)!
on variable Y, and that the small quanti-
ties of 2 taken alone can explain the small 2. Combination with repetitions
amount of heat emitted. Hence, the linear Combination with repetitions (or with
dependence between two explanatory vari- remittance) are used when each drawn
ables (X2 and X4 ) makes it more complicat- object is placed back for the next draw-
ed to see the effect of each variable alone on ing. Each object can then occur r times in
the response variable Y. each group, r = 0, 1, . . . , k.
Combinatory Analysis 99

The number of combinations with repeti- FURTHER READING


tions of k objects among n is given by:  Arrangement


 Binomial distribution
n+k−1 (n + k − 1)!  Combinatory analysis
Kn =
k
= . C
k k! · (n − 1)!  Permutation

EXAMPLES
1. Combinations without repetition
Combinatory Analysis
Consider the situation where we must Combinatory analysis refers to a group of
choose a committee of three people from techniques that can be used to determine the
an assembly of eight people. How many number of elements in a particular set with-
different committees could potentially be out having to count them one-by-one.
picked from this assembly, if each person The elements in question could be the results
can only be selected once in each group? from a scientific experiment or the different
Here we need to calculate the number potential outcomes of a random event.
of possible combinations of three people Three particular concepts are important in
from eight: combinatory analysis:
• Permutations;
n! 8!
Cnk = = • Combinations;
k! · (n − k)! 3! · (8 − 3)!
• Arrangements.
40320
= = 56
6 · 120
HISTORY
Therefore, it is possible to form 56 dif- Combinatory analysis has interested mathe-
ferent committees containing three peo- maticians for centuries. According to Takacs
ple from an assembly of eight people. (1982), such analysis dates back to ancient
2. Combinations with repetitions Consider Greece. However, the Hindus, the Per-
an urn containing six numbered balls. We sians (including the poet and mathematician
carry out four successive drawings, and Khayyâm, Omar) and (especially) the Chi-
place the drawn ball back into the urn after nese also studied such problems. A 3000
each drawing. How many different com- year-old Chinese book “I Ching” describes
binations could occur from this drawing? the possible arrangements of a set of n
In this case we want to find the number elements, where n ≤ 6. In 1303, Chu, Shih-
of combinations with repetition (because chieh published a work entitled “Ssu Yuan
each drawn ball is placed back in the urn Yü Chien” (Precious mirror of the four
before the next drawing). We obtain elements). The cover of the book depict-
(n + k − 1)! 9! sa triangle that shows the combinations of
Knk = = k elements taken from a set of size n where
k! · (n − 1)! 4! · (6 − 1)!
0 ≤ k ≤ n. This arithmetic triangle was also
362880
= = 126 explored by several European mathemati-
24 · 120
cians such as Stifel, Tartaglia and Hérigone,
different combinations. and especially Pascal, who wrote the “Trai-
100 Compatibility

té du triangle arithmétique” (Treatise of the  Combination


arithmetic triangle) in 1654 (although it was  Permutation
not published until after his death in 1665).
Another document on combinations was REFERENCES
published in 1617 by Puteanus, Erycius MacMahon, P.A.: Combinatory Analysis,
called “Erycii Puteani Pretatis Thaumata vols. I and II. Cambridge University Press,
in Bernardi Bauhusii è Societate Jesu Pro- Cambridge (1915–1916)
teum Parthenium”. However, combinatory
Stigler, S.: The History of Statistics, the
analysis only revealed its true power with
Measurement of Uncertainty Before
the works of Fermat (1601–1665) and Pas-
1900. Belknap, London (1986)
cal (1623–1662). The term “combinatory
analysis” was introduced by Leibniz (1646– Takács, L.: Combinatorics. In: Kotz, S.,
1716) in 1666. In his work “Dissertatio de Johnson, N.L. (eds.) Encyclopedia of Sta-
Arte Combinatoria,” he systematically stud- tistical Sciences, vol. 2. Wiley, New York
ied problems related to arrangements, per- (1982)
mutations and combinations.
Other works in this field should be mentioned
here, such as those of Wallis, J. (1616–1703), Compatibility
reported in “The Doctrine of Permutations Two events are said to be compatible if the
and Combinations” (an essential and funda- occurrence of the first event does not pre-
mental part of the “Doctrines of Chances”), vent the occurrence of the second (or in other
or those of Bernoulli, J., Moivre, A. de, Car- words, if the intersection between the two
dano, G. (1501–1576), and Galileo (1564– events is not null):
1642).
In the second half of the nineteenth centu- P(A ∩ B) = 0 .
ry, Cayley (1829–1895) solved some prob-
lems related to this type of analysis via graph- We can represent two compatible events A
ics that he called “trees”. Finally, we should and B schematically in the following way:
also mention the important work of MacMa-
hon (1854–1929), “Combinatory Analysis”
(1915–1916).

EXAMPLES
See arrangement, combination and per-
Two events A and B are incompatible (or
mutation.
mutually exclusive) if the occurrence of A
prevents the occurrence of B, or vice versa.
FURTHER READING We can represent this in the following way:
 Arithmetic triangle
 Arrangement
 Binomial
 Binomial distribution
Compatibility 101

This means that the probability that these The events A and B are compatible, because
two events happen at the same time is zero: it is possible to draw both a heart and a queen
at the same time (the queen of hearts). There-
A ∩ B = φ −→ P(A ∩ B) = 0 . fore, the intersection between A and B is the
queen of hearts. The probability of this event C
MATHEMATICAL ASPECTS is given by:
If two events A and B are compatible, the
probability that at least one of the events 1
P(A ∩ B) = .
occurs can be obtained using the following 52
formula:
The probability of the union of the two
P(A ∪ B) = P(A) + P(B) − P(A ∩ B) . events A and B (drawing either a heart or
a queen) is then equal to:
On the other hand, if the two events A and B
are incompatible, the probability that the P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
two events A and B happen at the same time 13 4 1
= + −
is zero: 52 52 52
P(A ∩ B) = 0 . 4
= .
13
The probability that at least one of the events
occurs can be obtained by simply adding the On the other hand, the events A and C
individual probabilities of A and B: are incompatible, because a card cannot be
both a heart and a club! The intersection
P(A ∪ B) = P(A) + P(B) . between A and C is an empty set.

EXAMPLES A∩C =φ.


Consider a random experiment that
involves drawing a card from a pack of The probability of the union of the two
52 cards. We are interested in the three fol- events A and C (drawing a heart or a club) is
lowing events: simply given by the sum of the probabilities
of each event:
A = “draw a heart”
B = “draw a queen” P(A ∪ C) = P(A) + P(C)
C = “draw a club” . =
13 13
+
52 52
The probabilities associated with each of 1
= .
these events are: 2
13
P(A) =
52 FURTHER READING
4
P(B) = Event
52
13 Independence
P(C) = . Probability
52
102 Complementary

Complementary Complete Linkage Method


Consider the sample space for a random The complete linkage method is a hierarchi-
experiment. cal classification method where the distance
For any event A, an element of , we can between two classes is defined as the greatest
determine a new event B that contains all of distance that could be obtained if we select
the elements of the sample space that are one element from each class and measure
not included in A. the distance between these elements. In oth-
This event B is called the complement of A er words, it is the distance between the most
with respect to and is obtained by the nega- distant elements from each class.
tion of A. For example, the distance used to construct
the distance table is the Euclidian distance.
MATHEMATICAL ASPECTS Using the complete linkage method, the dis-
Consider an event A, which is an element of tance between two classes is given by the
the sample space . Euclidian distance between the most distant
The compliment of A with respect to is elements (the maximum distance).:
denoted Ā. It is given by the negation of A:
Ā = − A MATHEMATICAL ASPECTS
= {w ∈ ; w ∈
/ A} . See cluster analysis.

EXAMPLES FURTHER READING


Consider a random experiment that con-  Classification
sists of flipping a coin three times.  Cluster analysis
The sample space of this experiment is  Dendrogram
 Distance
= {TTT, TTH, THT, THH,
 Distance table
HTT, HTH, HHT, HHH} .
Consider the event
Completely
A = “Heads (H) occurs twice”
Randomized Design
= {THH, HTH, HHT} .
A completely randomized design is a type of
The compliment of A with respect to is experimental design where the experimental
equal to units are randomly assigned to the different
Ā = {TTT, TTH, THT, HTT, HHH} treatments.
= “Heads (H) does not occur twice” . It is used when the experimental units are
believed to be “uniform;” that is, when there
is no uncontrolled factor in the experiment.
FURTHER READING
 Event
 Random experiment HISTORY
 Sample space See design of experiments.
Composite Index Number 103

EXAMPLES Dow-Jones index are all examples of com-


We want to test five different drugs based on posite index numbers.
aspirin. To do this, we randomly distribute The aim of using composite index numbers
the five types of drug to 40 patients. Denoting is to summarize all of the simple index num-
the five drugs by A, B, C, D and E, we obtain bers contained in a complex number (a value C
the following random distribution: formed from a set of simple values) in just
one index.
A is attributed to 10 people; The most commonly used composite index
B is attributed to 12 people; numbers are:
• The Laspeyres index;
C is attributed to 4 people;
• The Paasche index;
D is attributed to 7 people; • The Fisher index.
E is attributed to 7 people.
HISTORY
We have then a completely randomized See index number.
design where the treatments (drugs) are ran-
domly attributed to the experimental units
MATHEMATICAL ASPECTS
(patients), and each patient receives only one
There are several methods of creating com-
treatment. We also assume that the patients
posite index numbers.
are “uniform:” that there are no differences
To illustrate these methods, let us use a sce-
between them. Moreover, we assume that
nario where a price index is determined for
there is no uncontrolled factor that inter-
the current period n with respect to a refer-
venes during the treatment.
ence period 0.
In this example, the completely randomized
1. Index number of the arithmetic means
design is a factorial experiment that uses
(the sum method):
only one factor: the aspirin. The five types
of aspirin are different levels of the factor. Pn
In/0 = · 100 ,
P0

FURTHER READING where Pn is the sum of the prices of the

 Design of experiments items at the current period, and P0 is
 Experiment the sum of the prices of the items at the
reference period.
2. Arithmetic mean of simple index num-
Composite bers:

Index Number 1  Pn
In/0 = · · 100 ,
N P0
Composite index numbers allow us to mea-
sure, with a single number, the relative varia- where N is the number of goods consid-
tions within a group of variables upon mov- ered and PPn0 is the simple index number
ing from one situation to another. of each item.
The consumer price index, the wholesale In these two methods, each item has
price index, the employment index and the the same importance. This is a situation
104 Conditional Probability

which often does not correspond to real- According to this method, the price index
ity. has increased by 405.9% (505.9 − 100)
3. Index number of weighted arithmetic between the reference year and the current
means (the weighted sum method): year.
The general formula for an index number 2. Arithmetic mean of simple index num-
calculated by the weighted sum method is bers:
as follows:

1  Pn
Pn · Q I n/0 = · · 100
In/0 = · 100 . N P0
P0 · Q

1 1.20 1.10 2.00


= · + + · 100
Choosing the quantity Q for each item 3 0.20 0.15 0.50
considered could prove problematic here: = 577.8 .
Q must be the same for both the numera-
tor and the denominator when calculating This method gives a slightly different
a price index. result from the previous one, since we
In the Laspeyres index, the value of Q obtain an increase of 477.8% (577.8 −
corresponding to the reference year is 100) in the price index between the ref-
used. In the Paasche index, the value of Q erence year and the current year.
for the current year is used. Other statisti- 3. Index number of weighted arithmetic
cians have proposed using the value of Q means (the weighted sum method):
for a given year.
Pn · Q
In/0 = · 100 .
EXAMPLES P0 · Q
Consider the following table indicating This method is used in conjunction with
the(fictitious) prices of three consumer the Laspeyres index or the Paasche
goods in the reference year (1970) and their index.
current prices.

Price (francs) FURTHER READING


Goods 1970 (P0 ) Now (Pn )  Fisher index
Milk 0.20 1.20  Index number
Bread 0.15 1.10  Laspeyres index
Butter 0.50 2.00  Paasche index
 Simple index number
Using these numbers, we now examine the
three main methods of constructing compos-
ite index numbers.
1. Index number of arithmetic means (the
Conditional Probability
sum method): The probability of an event given that another
event is known to have occured.
Pn
In/0 = · 100 The conditional probability is denoted
P0
4.30 P(A|B), which is read as the “probability
= · 100 = 505.9 . of A conditioned by B.”
0.85
Conditional Probability 105

HISTORY ory. This importance is mainly due to the fol-


The concept of independence dominated lowing points:
probability theory until the beginning of 1. We are often interested in calculating the
the twentieth century. In 1933, Kolmogorov, probability of an event when some infor-
Andrei Nikolaievich introduced the concept mation about the result is already known. C
of conditional probability; this concept now In this case, the probability required is
plays an essential role in theoretical and a conditional probability.
applied probability and statistics. 2. Even when partial information on the
result is not known, conditional probabi-
MATHEMATICAL ASPECTS lities can be useful when calculating the
Consider a random experiment for which probabilities required.
we know the sample space . Consider two
events A and B from this space. The probabi- EXAMPLES
lity of A, P(A) depends on the set of possible Consider a group of 100 cars distributed
events in the experiment ( ). according to two criteria, comfort and speed.
We will make the following distinctions:

fast
a car can be ,
slow

comfortable
a car can be .
uncomfortable

A partition of the 100 cars based on these cri-


teria is provided in the following table:
Now consider that we have supplementary
information concerning the experiment: that fast slow total
the event B has occurred. The probability of comfortable 40 10 50
the event A occurring will then be a function uncomfortable 20 30 50
of the space B rather than a function of . The total 60 40 100
probability of A conditioned by B is calculat-
Consider the following events:
ed as follows:
P(A ∩ B) A = “a fast car is chosen”
P(A|B) = .
P(B) and B = “a comfortable car is chosen.”
If A and B are two incompatible events,
the intersection between A and B is an The probability of these two events are:
empty space. We will then have P(A|B) = P(A) = 0.6 ,
P(B|A) = 0.
P(B) = 0.5 .

DOMAINS AND LIMITATIONS The probability of choosing a fast car is then


The concept of conditional probability is one of 0.6, and that of choosing a comfortable car
of the most important ones in probability the- is 0.5.
106 Confidence Interval

Now imagine that we are given supplemen- the concept of the confidence interval. Bow-
tary information: a fast car was chosen. ley presented his first confidence interval cal-
What, then, is the probability that this car is culations to the Royal Statistical Society in
also comfortable? 1906.
We calculate the probability of B knowing
that A has occurred, or the conditional proba- MATHEMATICAL ASPECTS
bility of B depending on A: In order to construct a confidence interval
We find in the table that that contains the true value of the param-
P(A ∩ B) = 0.4 . eter θ with a given probability, an equation
P(A ∩ B) 0.4 of the following form must be solved:
⇒ P(B|A) = = = 0.667 .
P(A) 0.6 P(Li ≤ θ ≤ Ls ) = 1 − α ,
The probability that the car is comfortable,
where
given that we know that it is fast, is there-
fore 23 . θ is the parameter to be estimated,
Li is the lower limit of the interval,
FURTHER READING Ls is the upper limit of the interval and
 Event 1 − α is the given probability, called the
 Probability confidence level of the interval.
 Random experiment
The probability α measures the error risk of
 Sample space
the interval, meaning the probability that the
interval does not contain the true value of the
REFERENCES
parameter θ .
Kolmogorov, A.N.: Grundbegriffe der
In order to solve this equation, a function
Wahrscheinlichkeitsrechnung. Springer,
f (t, θ ) must be defined where t is an estima-
Berlin Heidelberg New York (1933)
tor of θ , for which the probability distri-
Kolmogorov, A.N.: Foundations of the The- bution is known.
ory of Probability. Chelsea Publishing Defining this interval for f (t, θ ) involves
Company, New York (1956) writing the equation:

P(k1 ≤ f (t, θ ) ≤ k2 ) = 1 − α ,
Confidence Interval
where the constants k1 and k2 are given by
A confidence interval is any interval con- the probability distribution of the function
structed around an estimator that has a par- f (t, θ ). Generally, the error risk α is divid-
ticular probability of containing the true ed into two equal parts at α2 distributed on
value of the corresponding parameter of each side of the distribution of f (t, θ ). If, for
a population. example, the function f (t, θ ) follows a cen-
tered and reduced normal distribution, the
HISTORY constantsk1 and k2 willbesymmetricand can
According to Desrosières, A. (1988), Bow- be represented by −z α2 and +z α2 , as shown
ley, A.L. was one of the first to beinterested in in the following figure.
[Figure: standard normal density of f(t, θ), with the central area 1 − α between −z_{α/2} and +z_{α/2}]

Once the constants k1 and k2 have been found, the parameter θ must be isolated in the equation given above. The confidence interval for θ is found in this way for the confidence level 1 − α.

DOMAINS AND LIMITATIONS
One should be very careful when interpreting a confidence interval. If, for a confidence level of 95%, we find a confidence interval for a mean μ where the lower and upper limits are k1 and k2 respectively, we can conclude the following (for example):
"On the basis of the studied sample, we can affirm that it is probable that the mean of the population can be found in the interval established."
It would not be correct to conclude that there is a 95% chance of finding the mean of the population in the interval. Indeed, since μ and the limits k1 and k2 of the interval are constants, the interval may or may not contain μ. However, if the statistician has the ability to repeat the experiment (which consists of drawing a sample from the population) several times, 95% of the intervals obtained will contain the true value of μ.

EXAMPLES
A business that fabricates lightbulbs wants to test the average lifespan of its lightbulbs. The distribution of the random variable X, which represents the lifespan in hours, is a normal distribution with mean μ and standard deviation σ = 30.
In order to estimate μ, the business burns out n = 25 lightbulbs. It obtains an average lifespan of x̄ = 860 hours. It wants to establish a confidence interval around the estimator x̄ at a confidence level of 0.95. Therefore, the first step is to obtain a function f(t, θ) = f(x̄, μ) with a known distribution. Here we use:
f(t, θ) = f(x̄, μ) = (x̄ − μ) / (σ/√n) ,
which follows a centered and reduced normal distribution. The equation P(k1 ≤ f(t, θ) ≤ k2) = 1 − α becomes:
P(−z0.025 ≤ (x̄ − μ)/(σ/√n) ≤ z0.025) = 0.95 ,
because the error risk α has been divided into two equal parts of α/2 = 0.025.
The table for the centered and reduced normal distribution, the normal table, gives z0.025 = 1.96. Therefore:
P(−1.96 ≤ (x̄ − μ)/(σ/√n) ≤ 1.96) = 0.95 .
To obtain the confidence interval for μ at a confidence level of 0.95, μ must be isolated in the equation above:
P(−1.96 ≤ (x̄ − μ)/(σ/√n) ≤ 1.96) = 0.95
P(−1.96 · σ/√n ≤ x̄ − μ ≤ 1.96 · σ/√n) = 0.95
P(x̄ − 1.96 · σ/√n ≤ μ ≤ x̄ + 1.96 · σ/√n) = 0.95 .
By replacing x̄, σ and n with their respective values, we obtain:
P(860 − 1.96 · 30/√25 ≤ μ ≤ 860 + 1.96 · 30/√25) = 0.95
P(848.24 ≤ μ ≤ 871.76) = 0.95 .
The confidence interval for μ at the confidence level 0.95 is therefore:
[848.24, 871.76] .
This means that we can affirm with a probability of 95% that this interval contains the true value of the parameter μ that corresponds to the average lifespan of the lightbulbs.
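The interval can also be reproduced numerically. The following short Python sketch (an illustration only; the figures x̄ = 860, σ = 30 and n = 25 are taken from the example above) computes the same bounds:

```python
from math import sqrt
from scipy.stats import norm

x_bar, sigma, n = 860.0, 30.0, 25      # sample mean, known population std. deviation, sample size
alpha = 0.05                           # error risk; confidence level 1 - alpha = 0.95

z = norm.ppf(1 - alpha / 2)            # z_{alpha/2} = 1.96 from the normal table
margin = z * sigma / sqrt(n)           # half-length of the interval

lower, upper = x_bar - margin, x_bar + margin
print(f"[{lower:.2f}, {upper:.2f}]")   # [848.24, 871.76]
```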
FURTHER READING
Confidence level
Estimation
Estimator

REFERENCES
Bowley, A.L.: Presidential address to the economic section of the British Association. J. Roy. Stat. Soc. 69, 540–558 (1906)
Desrosières, A.: La partie pour le tout: comment généraliser? La préhistoire de la contrainte de représentativité. Estimation et sondages. Cinq contributions à l'histoire de la statistique. Economica, Paris (1988)

Confidence Level

The confidence level is the probability that the confidence interval constructed around an estimator contains the true value of the corresponding parameter of the population.
We designate the confidence level by (1 − α), where α corresponds to the risk of error; that is, to the probability that the confidence interval does not contain the true value of the parameter.

HISTORY
The first example of a confidence interval appears in the work of Laplace (1812). According to Desrosières, A. (1988), Bowley, A.L. was one of the first to become interested in the concept of the confidence interval.
See hypothesis testing.

MATHEMATICAL ASPECTS
Let θ be a parameter associated with a population. θ is to be estimated and T is its estimator from a random sample. We evaluate the precision of T as the estimator of θ by constructing a confidence interval around the estimator, which is often interpreted as an error margin.
In order to construct this confidence interval, we generally proceed in the following manner. From the distribution of the estimator T, we determine an interval that is likely to contain the true value of the parameter. Let us denote this interval by (T − ε, T + ε) and the probability of the true value of the parameter being in this interval as (1 − α). We can then say that the error margin ε is related to α by the probability:
P(T − ε ≤ θ ≤ T + ε) = 1 − α .
The level of probability associated with an interval of estimation is called the confidence level or the confidence degree. The interval T − ε ≤ θ ≤ T + ε is called the confidence interval for θ at the confidence level 1 − α. Let us use, for example, α = 5%,
which will give the confidence interval of the parameter θ to a probability level of 95%. This means that, if we use T as an estimator of θ, then the interval indicated will on average contain the true value of the parameter 95 times out of 100 samplings, and it will not contain it 5 times.
The quantity ε of the confidence interval corresponds to half of the length of the interval. This parameter therefore gives us an idea of the error margin for the estimator. For a given confidence level 1 − α, the smaller the confidence interval, the more efficient the estimator.

DOMAINS AND LIMITATIONS
The most commonly used confidence levels are 90%, 95% and 99%. However, if necessary, other levels can be used instead. Although we would like to use the highest confidence level in order to maximize the probability that the confidence interval contains the true value of the parameter, if we increase the confidence level, the interval increases as well. Therefore, what we gain in terms of confidence is lost in terms of precision, so we have to find a compromise.
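This compromise can be illustrated numerically. The sketch below is only an illustration, with assumed values σ = 30 and n = 25; it shows how the error margin ε = z_{α/2} · σ/√n grows as the confidence level is raised:

```python
from math import sqrt
from scipy.stats import norm

sigma, n = 30.0, 25  # assumed known standard deviation and sample size

for level in (0.90, 0.95, 0.99):
    z = norm.ppf(1 - (1 - level) / 2)      # quantile z_{alpha/2}
    epsilon = z * sigma / sqrt(n)          # error margin (half-length of the interval)
    print(f"{level:.0%}: z = {z:.3f}, margin = {epsilon:.2f}")

# The margin increases from about 9.87 (90%) to about 15.45 (99%):
# higher confidence is paid for with a wider, less precise interval.
```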
EXAMPLES
A company that produces lightbulbs wants to study the mean lifetime of its bulbs. The distribution of the random variable X that represents the lifetime in hours is a normal distribution of mean μ and standard deviation σ = 30.
In order to estimate μ, the company burns out a random sample of n = 25 bulbs. The company obtains an average bulb lifetime of x̄ = 860 hours. It then wants to construct a 95% confidence interval for μ around the estimator x̄.
The standard deviation σ of the population is known; the value of ε is z_{α/2} · σ_X̄. The value of z_{α/2} is obtained from the normal table, and it depends on the probability attributed to the parameter α. We then deduce the confidence interval of the estimator of μ at the probability level 1 − α:
X̄ − z_{α/2} σ_X̄ ≤ μ ≤ X̄ + z_{α/2} σ_X̄ .
From the hypothesis that the bulb lifetime X follows a normal distribution with mean μ and standard deviation σ = 30, we deduce that the expression
(x̄ − μ) / (σ/√n)
follows a standard normal distribution. From this we obtain:
P(−z0.025 ≤ (x̄ − μ)/(σ/√n) ≤ z0.025) = 0.95 ,
where the risk of error α is divided into two parts that both equal α/2 = 0.025.
The table of the standard normal distribution (the normal table) gives z0.025 = 1.96. We then have:
P(−1.96 ≤ (x̄ − μ)/(σ/√n) ≤ 1.96) = 0.95 .
To get the confidence interval for μ, we must isolate μ in this equation via the following transformations:
P(−1.96 · σ/√n ≤ x̄ − μ ≤ 1.96 · σ/√n) = 0.95 ,
P(x̄ − 1.96 · σ/√n ≤ μ ≤ x̄ + 1.96 · σ/√n) = 0.95 .
Substituting x̄, σ and n for their respective values, we evaluate the confidence interval:
860 − 1.96 · 30/√25 ≤ μ ≤ 860 + 1.96 · 30/√25
848.24 ≤ μ ≤ 871.76 .
We can affirm with a confidence level of 95% that this interval contains the true value of the parameter μ, which corresponds to the mean bulb lifetime.

FURTHER READING
Confidence interval
Estimation
Estimator
Hypothesis testing

Contingency Table

A contingency table is a crossed table containing various attributes of a population or an observed sample. Contingency table analysis consists of discovering and studying the relations (if they exist) between these attributes.
A contingency table can be a two-dimensional table with r lines and c columns relating to two qualitative categorical variables possessing, respectively, r and c categories. It can also be multidimensional when the number of qualitative variables is greater than two: if, for example, the elements of a population or a sample are characterized by three attributes, the associated contingency table has the dimensions I×J×K, where I represents the number of categories defining the first attribute, J the number of categories of the second attribute and K the number of categories of the third attribute.

HISTORY
The term "contingency", used in relation to a crossed table of categorical data, seems to have originated with Pearson, Karl (1904), who used the term "contingency" to mean a measure of the total deviation relative to independence.
See also chi-square test of independence.

MATHEMATICAL ASPECTS
If we consider a two-dimensional table containing entries for two qualitative categorical variables X and Y that have, respectively, r and c categories, the contingency table is:

                                Categories of the variable Y
                                Y1    ...   Yc     Total
Categories      X1              n11   ...   n1c    n1.
of the          ...             ...   ...   ...    ...
variable X      Xr              nr1   ...   nrc    nr.
Total                           n.1   ...   n.c    n..

where
nij represents the observed frequency for category i of variable X and category j of variable Y;
ni. represents the sum of the frequencies observed for category i of variable X,
n.j represents the sum of the observed frequencies for category j of variable Y, and
n.. indicates the total number of observations.
In the case of a multidimensional table, the elements of the table are denoted by nijk, representing the observed frequency for category i of variable X, category j of variable Y and category k of variable Z.

DOMAINS AND LIMITATIONS
The independence of two categorical qualitative variables represented in the
contingency table can be assessed by performing a chi-square test of independence.
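As an illustration of this point, the sketch below runs a chi-square test of independence on a small, hypothetical 2×3 contingency table (the counts are invented for the example):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = categories of X, columns = categories of Y.
observed = np.array([[20, 30, 25],
                     [30, 20, 25]])

chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"chi2 = {chi2:.3f}, degrees of freedom = {dof}, p-value = {p_value:.3f}")
print("expected frequencies under independence:")
print(expected.round(2))
# A small p-value leads to rejecting the hypothesis that X and Y are independent.
```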
FURTHER READING
Chi-square test of independence
Frequency distribution

REFERENCES
Fienberg, S.E.: The Analysis of Cross-Classified Categorical Data, 2nd edn. MIT Press, Cambridge, MA (1980)
Pearson, K.: On the theory of contingency and its relation to association and normal correlation. Drapers' Company Research Memoirs, Biometric Ser. I., pp. 1–35 (1904)

Continuous Distribution Function

The distribution function of a continuous random variable is defined to be the probability that the random variable takes a value less than or equal to a real number.

HISTORY
See probability.

MATHEMATICAL ASPECTS
The function defined by
F(b) = P(X ≤ b) = ∫_{−∞}^{b} f(x) dx
is called the distribution function of a continuous random variable. In other words, the density function f is the derivative of the continuous distribution function.

Properties of the Continuous Distribution Function
1. F(x) is a continually increasing function for all x;
2. F takes its values in the interval [0, 1];
3. lim_{b→−∞} F(b) = 0;
4. lim_{b→∞} F(b) = 1;
5. F(x) is a continuous and differentiable function.
This distribution function can be graphically represented on a system of axes. The different values of the random variable X are plotted on the abscissa and the corresponding values of F(x) on the ordinate.
[Figure: graph of a continuous distribution function F(x)]

DOMAINS AND LIMITATIONS
The probability that the continuous random variable X takes a value in the interval ]a, b] for all a < b, meaning P(a < X ≤ b), is equal to F(b) − F(a), where F is the distribution function of the random variable X.
Demonstration: The event {X ≤ b} can be written as the union of two mutually exclusive events, {X ≤ a} and {a < X ≤ b}:
{X ≤ b} = {X ≤ a} ∪ {a < X ≤ b} .
By finding the probability on each side of the equation, we obtain:
P(X ≤ b) = P({X ≤ a} ∪ {a < X ≤ b}) = P(X ≤ a) + P(a < X ≤ b) .
The sum of probabilities results from the fact that the two events are exclusive. By subtracting P(X ≤ a) on each side, we have:
P(a < X ≤ b) = P(X ≤ b) − P(X ≤ a) .
Finally, from the definition of the distribution function, we obtain:
P(a < X ≤ b) = F(b) − F(a) .

EXAMPLES
Consider a continuous random variable X for which the density function is given by:
f(x) = 1 if 0 < x < 1, and f(x) = 0 if not.
The probability that X takes a value in the interval [a, b], with 0 < a and b < 1, is as follows:
P(a ≤ X ≤ b) = ∫_a^b f(x) dx = ∫_a^b 1 dx = [x]_a^b = b − a .
Therefore, for 0 < x < 1 the distribution function is:
F(x) = P(X ≤ x) = P(0 ≤ X ≤ x) = x .
This function is presented in the following figure:
[Figure: graph of F(x) = x on the interval ]0, 1[]
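The probabilities of this example can be checked numerically. The following sketch is only an illustration: it integrates the density f(x) = 1 on ]0, 1[ and compares the result with F(b) − F(a):

```python
from scipy.integrate import quad

def f(x):
    """Density of the example: 1 on the interval ]0, 1[, 0 elsewhere."""
    return 1.0 if 0 < x < 1 else 0.0

def F(x):
    """Distribution function F(x) = P(X <= x), obtained by integrating f.
    The density is zero below 0, so integrating from 0 is enough."""
    return quad(f, 0.0, x)[0]

a, b = 0.2, 0.7
print(quad(f, a, b)[0])     # P(a <= X <= b) = b - a = 0.5
print(F(b) - F(a))          # same value via F(b) - F(a)
```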
FURTHER READING
Density function
Probability
Random experiment
Random variable
Value

Continuous Probability Distribution

Every random variable has a corresponding frequency distribution. For a continuous random variable, this distribution is continuous too.
A continuous probability distribution is a model that represents the frequency distribution of a continuous variable in the best way.

MATHEMATICAL ASPECTS
The probability distribution of a continuous random variable X is given by its density function f(x) or its distribution function F(x).
It can generally be characterized by its expected value:
E[X] = ∫_D x · f(x) dx = μ
and its variance:
Var(X) = ∫_D (x − μ)² · f(x) dx = E[(X − μ)²] = E[X²] − E[X]² ,
where D represents the interval covering the range of values that X can take.
One essential property of a continuous random variable is that the probability that it will take a specific numerical value is zero, whereas the probability that it will take a value over an interval (finite or infinite) is usually nonzero.

DOMAINS AND LIMITATIONS
The most famous continuous probability distribution is the normal distribution. Continuous probability distributions are often used to approximate discrete probability distributions. They are used in model construction just as much as they are used when applying statistical techniques.

FURTHER READING
Beta distribution
Cauchy distribution
Chi-square distribution
Continuous distribution function
Density function
Discrete probability distribution
Expected value
Exponential distribution
Fisher distribution
Gamma distribution
Laplace distribution
Lognormal distribution
Normal distribution
Probability
Probability distribution
Random variable
Student distribution
Uniform distribution
Variance of a random variable

REFERENCES
Johnson, N.L., Kotz, S.: Distributions in Statistics: Continuous Univariate Distributions, vols. 1 and 2. Wiley, New York (1970)

Contrast

In analysis of variance, a contrast is a linear combination of the observations or factor levels or treatments in a factorial experiment, where the sum of the coefficients is zero.

HISTORY
According to Scheffé, H. (1953), Tukey, J.W. (1949 and 1951) was the first to propose a method of simultaneously estimating all of the contrasts.

MATHEMATICAL ASPECTS
Consider T1, T2, . . . , Tk, which are the sums of n1, n2, . . . , nk observations. The linear function
cj = c1j · T1 + c2j · T2 + · · · + ckj · Tk
is a contrast if and only if
Σ_{i=1}^{k} ni · cij = 0 .
If each ni = n, meaning that each Ti is the sum of the same number of observations, the condition is reduced to:
Σ_{i=1}^{k} cij = 0 .

DOMAINS AND LIMITATIONS
In most experiments involving several treatments, it is interesting for the experimenter to make comparisons between the different treatments. The statistician uses contrasts to carry out this type of comparison.
EXAMPLES
When an analysis of variance is carried out for a three-level factor, some contrasts of interest are:
c1 = T1 − T2
c2 = T1 − T3
c3 = T2 − T3
c4 = T1 − 2 · T2 + T3 .
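A quick numerical check of the defining condition (equal group sizes, so the coefficients of each contrast must sum to zero) might look like the following sketch; the treatment totals T1, T2, T3 are invented for the illustration:

```python
import numpy as np

# Hypothetical treatment totals T1, T2, T3, each based on the same number n of observations.
T = np.array([42.0, 55.0, 47.0])

contrasts = {
    "c1 = T1 - T2":        np.array([1, -1, 0]),
    "c2 = T1 - T3":        np.array([1, 0, -1]),
    "c3 = T2 - T3":        np.array([0, 1, -1]),
    "c4 = T1 - 2*T2 + T3": np.array([1, -2, 1]),
}

for name, coeffs in contrasts.items():
    # Each set of coefficients must sum to zero to define a contrast.
    assert coeffs.sum() == 0, "not a contrast: coefficients must sum to zero"
    print(f"{name}: value = {coeffs @ T}")
```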
FURTHER READING
Analysis of variance
Experiment
Factorial experiment

REFERENCES
Ostle, B.: Statistics in Research: Basic Concepts and Techniques for Research Workers. Iowa State College Press, Ames, IA (1954)
Scheffé, H.: A method for judging all contrasts in the analysis of variance. Biometrika 40, 87–104 (1953)
Tukey, J.W.: Comparing individual means in the analysis of variance. Biometrics 5, 99–114 (1949)
Tukey, J.W.: Quick and Dirty Methods in Statistics. Part II: Simple Analyses for Standard Designs. Quality Control Conference Papers 1951. American Society for Quality Control, New York, pp. 189–197 (1951)

Convergence

In statistics, the term "convergence" is related to probability theory. This statistical convergence is often termed stochastic convergence in order to distinguish it from classical convergence.

MATHEMATICAL ASPECTS
Different types of stochastic convergence can be defined. Let {Xn}, n ∈ N, be a set of random variables. The most important types of stochastic convergence are:
1. {Xn} converges in distribution to a random variable X if
lim_{n→∞} F_Xn(z) = F_X(z) for all z ,
where F_Xn and F_X are the distribution functions of Xn and X, respectively. This convergence is simply the point convergence (well known in mathematics) of the set of the distribution functions of the Xn.
2. {Xn} converges in probability to a random variable X if
lim_{n→∞} P(|Xn − X| > ε) = 0 , for every ε > 0 .
3. {Xn} exhibits almost sure convergence to a random variable X if
P({w | lim_{n→∞} Xn(w) = X(w)}) = 1 .
4. Suppose that all elements of Xn have a finite expectancy. The set {Xn} converges in mean square to X if
lim_{n→∞} E[(Xn − X)²] = 0 .
Note that:
• Almost sure convergence and mean square convergence both imply convergence in probability;
• Convergence in probability (weak convergence) implies convergence in distribution.
EXAMPLES
Let Xi be independent random variables uniformly distributed over [0, 1]. We define the following set of random variables from the Xi:
Zn = n · min_{i=1,...,n} Xi .
We can show that the set {Zn} converges in distribution to an exponential distribution Z with a parameter of 1 as follows:
1 − F_Zn(t) = P(Zn > t)
 = P(min_{i=1,...,n} Xi > t/n)
 = P(X1 > t/n and X2 > t/n and . . . and Xn > t/n)
 = Π_{i=1}^{n} P(Xi > t/n) = (1 − t/n)^n ,
using the independence of the Xi. Now, lim_{n→∞} (1 − t/n)^n = exp(−t), so that:
lim_{n→∞} F_Zn(t) = lim_{n→∞} P(Zn ≤ t) = 1 − exp(−t) = F_Z(t) .
Finally, let Sn be the number of successes obtained during n Bernoulli trials with a probability of success p. Bernoulli's theorem tells us that Sn/n converges in probability to a "random" variable that takes the value p with probability 1.
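The limiting result for Zn can also be checked by simulation. The sketch below is only an illustration (the sample size and number of replications are arbitrary); it compares the empirical distribution function of Zn with that of the exponential limit:

```python
import numpy as np

rng = np.random.default_rng(0)
n, replications = 200, 100_000

# Z_n = n * min(X_1, ..., X_n) with the X_i uniform on [0, 1].
samples = rng.uniform(0.0, 1.0, size=(replications, n))
z_n = n * samples.min(axis=1)

for t in (0.5, 1.0, 2.0):
    empirical = (z_n <= t).mean()      # empirical P(Z_n <= t)
    limit = 1.0 - np.exp(-t)           # F_Z(t) of the exponential distribution with parameter 1
    print(f"t = {t}: empirical = {empirical:.4f}, 1 - exp(-t) = {limit:.4f}")
```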
FURTHER READING
Bernoulli's theorem
Central limit theorem
Convergence theorem
De Moivre–Laplace theorem
Law of large numbers
Probability
Random variable
Stochastic process

REFERENCES
Le Cam, L.M., Yang, C.L.: Asymptotics in Statistics: Some Basic Concepts. Springer, Berlin Heidelberg New York (1990)
Staudte, R.G., Sheater, S.J.: Robust Estimation and Testing. Wiley, New York (1990)

Convergence Theorem

The convergence theorem leads to the most important theoretical results in probability theory. Among them, we find the law of large numbers and the central limit theorem.

EXAMPLES
The central limit theorem and the law of large numbers are both convergence theorems.
The law of large numbers states that the mean of a sum of identically distributed random variables converges to their common mathematical expectation. On the other hand, the central limit theorem states that the distribution of the sum of a sufficiently large number of random variables tends to approximate the normal distribution.

FURTHER READING
Central limit theorem
Law of large numbers

Correlation Coefficient

The simple correlation coefficient is a measure of the strength of the linear relation between two random variables.
The correlation coefficient can take values that occur in the interval [−1, 1]. The two extreme values of this interval represent a perfectly linear relation between the variables, "positive" in the first case and "negative" in the other. The value 0 (zero) implies the absence of a linear relation.
The correlation coefficient presented here is also called the Bravais–Pearson correlation coefficient.

HISTORY
The concept of correlation originated in the 1880s with the works of Galton, F. In his autobiography Memories of My Life (1908), he writes that he thought of this concept during a walk in the grounds of Naworth Castle, when a rain shower forced him to find shelter.
According to Stigler, S.M. (1989), Porter, T.M. (1986) was carrying out historical research when he found a forgotten article written by Galton in 1890 in The North American Review, under the title "Kinship and correlation". In this article, which he published right after its discovery, Galton (1890) explained the nature and the importance of the concept of correlation.
This discovery was related to previous works of the mathematician, notably those on heredity and linear regression. Galton had been interested in this field of study since 1860. He published a work entitled "Natural inheritance" (1889), which was the starting point for his thoughts on correlation.
In 1888, in an article sent to the Royal Statistical Society entitled "Co-relations and their measurement chiefly from anthropometric data," Galton used the term "correlation" for the first time, although he was still alternating between the terms "co-relation" and "correlation" and he spoke of a "co-relation index." On the other hand, he invoked the concept of a negative correlation. According to Stigler (1989), Galton only appeared to suggest that correlation was a positive relationship.
Pearson, Karl wrote in 1920 that correlation had been discovered by Galton, whose work "Natural inheritance" (1889) pushed him to study this concept too, along with two other researchers, Weldon and Edgeworth. Pearson and Edgeworth then developed the theory of correlation.
Weldon thought the correlation coefficient should be called the "Galton function." However, Edgeworth replaced Galton's term "co-relation index" and Weldon's term "Galton function" by the term "correlation coefficient."
According to Mudholkar (1982), Pearson, K. systemized the analysis of correlation and established a theory of correlation for three variables. Researchers from University College, most notably his assistant Yule, G.U., were also interested in developing multiple correlation.
Spearman published the first study on rank correlation in 1904.
Among the works that were carried out in this field, it is worth highlighting those of Yule, who in an article entitled "Why do we sometimes get nonsense-correlations between time-series" (1926) discussed the problem of correlation analysis interpretation. Finally, correlation robustness was investigated by Mosteller and Tukey (1977).

MATHEMATICAL ASPECTS
Simple Linear Correlation Coefficient
Simple linear correlation is the term used to describe a linear dependence between two quantitative variables X and Y (see simple linear regression).
If X and Y are random variables that follow an unknown joint distribution, then the simple linear correlation coefficient is equal to the covariance between X and Y divided by the product of their standard deviations:
ρ = Cov(X, Y) / (σX · σY) .
Here Cov(X, Y) is the measured covariance between X and Y; σX and σY are the respective standard deviations of X and Y.
Given a sample of size n, (X1, Y1), (X2, Y2), . . . , (Xn, Yn), from the joint distribution, the quantity
r = Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) / √( Σ_{i=1}^{n} (Xi − X̄)² · Σ_{i=1}^{n} (Yi − Ȳ)² )
is an estimation of ρ; it is the sampling correlation.
If we denote (Xi − X̄) by xi and (Yi − Ȳ) by yi, we can write this equation as:
r = Σ_{i=1}^{n} xi yi / √( Σ_{i=1}^{n} xi² · Σ_{i=1}^{n} yi² ) .

Test of Hypothesis
To test the null hypothesis
H0: ρ = 0
against the alternative hypothesis
H1: ρ ≠ 0 ,
we calculate the statistic t:
t = r / Sr ,
where Sr is the standard deviation of the estimator r:
Sr = √( (1 − r²) / (n − 2) ) .
Under H0, the statistic t follows a Student distribution with n − 2 degrees of freedom. For a given significance level α, H0 is rejected if |t| ≥ t_{α/2, n−2}; the value of t_{α/2, n−2} is the critical value of the test given in the Student table.

Multiple Correlation Coefficient
Also known as the coefficient of determination, denoted by R², this determines whether the hyperplane estimated from a multiple linear regression is correctly adjusted to the data points.
The value of the multiple determination coefficient R² is equal to:
R² = Explained variation / Total variation = Σ_{i=1}^{n} (Ŷi − Ȳ)² / Σ_{i=1}^{n} (Yi − Ȳ)² .
It corresponds to the square of the multiple correlation coefficient. Notice that
0 ≤ R² ≤ 1 .
In the case of simple linear regression, the following relation can be derived:
r = sign(β̂1) · √R² ,
where β̂1 is the estimator of the regression coefficient β1, and it is given by:
β̂1 = Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) / Σ_{i=1}^{n} (Xi − X̄)² .

DOMAINS AND LIMITATIONS
If there is a linear relation between two variables, the correlation coefficient is equal to 1 or −1.
A positive relation (+) means that the two variables vary in the same direction. If the individuals obtain high scores in the first variable (for example the independent variable), they will have a tendency to obtain high scores in the second variable (the dependent variable). The opposite is also true.
A negative relation (−) means that the individuals that obtain high scores in the first variable will have a tendency to obtain low scores in the second one, and vice versa.
Note that if the variables are independent the correlation coefficient is equal to zero. The reciprocal conclusion is not necessarily true.
The fact that two or more variables are related in a statistical way is not sufficient to conclude that a cause and effect relation exists. The existence of a statistical correlation is not a proof of causality.
Statistics provides numerous correlation coefficients. The choice of which to use for a particular set of data depends on different factors, such as:
• The type of scale used to express the variable;
• The nature of the underlying distribution (continuous or discrete);
• The characteristics of the distribution of the scores (linear or nonlinear).

EXAMPLES
The data for two variables X and Y are shown in the table below:

No  X    Y    x = X − X̄  y = Y − Ȳ   xy      x²      y²
1   174  64   −1.5        −1.3         1.95    2.25    1.69
2   175  59   −0.5        −6.3         3.15    0.25    39.69
3   180  64    4.5        −1.3        −5.85   20.25    1.69
4   168  62   −7.5        −3.3        24.75   56.25   10.89
5   175  51   −0.5       −14.3         7.15    0.25  204.49
6   170  60   −5.5        −5.3        29.15   30.25   28.09
7   170  68   −5.5         2.7       −14.85   30.25    7.29
8   178  63    2.5        −2.3        −5.75    6.25    5.29
9   187  92   11.5        26.7       307.05  132.25  712.89
10  178  70    2.5         4.7        11.75    6.25   22.09
Total 1755 653                        358.5   284.5  1034.1

X̄ = 175.5 and Ȳ = 65.3 .
We now perform the necessary calculations to obtain the correlation coefficient between the two variables. Applying the formula gives:
r = Σ xi yi / √( Σ xi² · Σ yi² ) = 358.5 / √(284.5 · 1034.1) = 0.66 .

Test of Hypothesis
We can calculate the estimated standard deviation of r:
Sr = √( (1 − r²) / (n − 2) ) = √( 0.56 / (10 − 2) ) = 0.266 .
Calculating the statistic t gives:
t = (r − 0) / Sr = 0.66 / 0.266 = 2.485 .
If we choose a significance level α of 5%, the value from the Student table, t0.025,8, is equal to 2.306.
Since |t| = 2.485 > t0.025,8 = 2.306, the null hypothesis
H0: ρ = 0
is rejected.
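The computation above can be replayed on the same ten pairs of observations. The following sketch is only an illustration; it reproduces r, Sr and the statistic t:

```python
import numpy as np
from scipy.stats import t as student_t

# Data of the example: n = 10 pairs of observations.
X = np.array([174, 175, 180, 168, 175, 170, 170, 178, 187, 178], dtype=float)
Y = np.array([64, 59, 64, 62, 51, 60, 68, 63, 92, 70], dtype=float)

x, y = X - X.mean(), Y - Y.mean()
r = (x * y).sum() / np.sqrt((x**2).sum() * (y**2).sum())   # sampling correlation

n = len(X)
s_r = np.sqrt((1 - r**2) / (n - 2))                        # standard deviation of r
t_stat = r / s_r

critical = student_t.ppf(1 - 0.05 / 2, df=n - 2)           # t_{alpha/2, n-2}
print(f"r = {r:.2f}, t = {t_stat:.3f}, critical value = {critical:.3f}")
# r = 0.66 and t exceeds 2.306, so H0 (rho = 0) is rejected at the 5% level.
```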
FURTHER READING
Coefficient of determination
Covariance
Dependence
Kendall rank correlation coefficient
Multiple linear regression
Regression analysis
Simple linear regression
Spearman rank correlation coefficient

REFERENCES
Galton, F.: Co-relations and their measurement, chiefly from anthropological data. Proc. Roy. Soc. Lond. 45, 135–145 (1888)
Galton, F.: Natural Inheritance. Macmillan, London (1889)
Galton, F.: Kinship and correlation. North Am. Rev. 150, 419–431 (1890)
Galton, F.: Memories of My Life. Methuen, London (1908)
Mosteller, F., Tukey, J.W.: Data Analysis and Regression: A Second Course in Statistics. Addison-Wesley, Reading, MA (1977)
Mudholkar, G.S.: Multiple correlation coefficient. In: Kotz, S., Johnson, N.L. (eds.) Encyclopedia of Statistical Sciences, vol. 5. Wiley, New York (1982)
Pearson, K.: Studies in the history of statistics and probability. Biometrika 13, 25–45 (1920). Reprinted in: Pearson, E.S., Kendall, M.G. (eds.) Studies in the History of Statistics and Probability, vol. I. Griffin, London
Porter, T.M.: The Rise of Statistical Thinking, 1820–1900. Princeton University Press, Princeton, NJ (1986)
Stigler, S.: Francis Galton's account of the invention of correlation. Stat. Sci. 4, 73–79 (1989)
Yule, G.U.: On the theory of correlation. J. Roy. Stat. Soc. 60, 812–854 (1897)
Yule, G.U.: Why do we sometimes get nonsense-correlations between time-series? A study in sampling and the nature of time-series. J. Roy. Stat. Soc. (2) 89, 1–64 (1926)

Correspondence Analysis

Correspondence analysis is a data analysis technique that is used to describe contingency tables (or crossed tables). This analysis takes the form of a graphical representation of the associations and the "correspondence" between rows and columns.

HISTORY
The theoretical principles of correspondence analysis date back to the works of Hartley, H.O. (1935) (published under his original name Hirschfeld) and of Fisher, R.A. (1940) on contingency tables. They were first presented in the framework of inferential statistics.
The term "correspondence analysis" first appeared in the autumn of 1962, and the first presentation of this method that referred to
this term was given by Benzécri, J.P. in the by dividing each element in this row (col-
winter of 1963. In 1976 the works of Ben- umn) by the sum of the elements in the line
zécri, J.P., which retraced twelve years of his (column).
laboratory work, were published, and since The line profile of row i is obtained by
then the algebraic and geometrical proper- dividing each term of row i by ni. , which
ties of this descriptive analytical tool have is the sum of the observed frequencies in
become more widely known and used. the row.
The table of row profiles is constructed
MATHEMATICAL ASPECTS by replacing each row of the contingency
Consider a contingency table relating to two table with its profile:
categorial qualitative variables X and Y
Y1 Y2 ... Yc Total
that have, respectively, r and c categories: n11 n12 n1c
X1 n1. n1. ... n1. 1
n21 n22 n2c
Y1 Y2 ... Yc Total X2 n2. n2. ... n2. 1
X1 n11 n12 ... n1c n1. ... ... ... ... ... ...
nr1 nr2 nrc
X2 n21 n22 ... n2c n2. Xr nr. nr. ... nr. 1
... ... ... ... ... ... Total n.1 n.2 ... n.c r
Xr nr1 nr2 ... nrc nr.
Total n.1 n.2 ... n.c n.. It is also common to multiply each ele-
ment of the table by 100 in order to convert
where the terms into percentages and to make the
sum of terms in each row 100%.
nij represents the frequency that category i
The column profile matrix is constructed
of variable X and category j of variable
in a similar way, but this time each col-
Y is observed,
umn of the contingency table is replaced
ni. represents the sum of the observed fre- with its profile: the column profile of col-
quencies for category i of variable X, umn j is obtained by dividing each term of
n.j represents the sum of the observed fre- column j by n.j , which is the sum of fre-
quencies for category j of variable Y, quencies observed for the category cor-
responding to this column.
n.. represents the total number of observa-
tions.
Y1 Y2 ... Yc Total
n1.
n11 n12 n1c
We will assume that r ≥ c; if not we take X1 n.1 n.2 ... n.c
n2.
n21 n22 n2c
the transpose of the initial table and use X2 n.1 n.2 ... n.c
this transpose as the new contingency table. ... ... ... ... ... ...
nr.
nr1 nr2 nrc
The correspondence analysis of a contingen- Xr n.1 n.2 ... n.c
cy table with more lines than columns, is per- Total 1 1 ... 1 c
formed as follows:
1. Tables of row profiles XI and column pro- The tables of row profiles and column pro-
files XJ are constructed.. files correspond to a transformation of the
For a fixed line (column), the line (col- contingency table that is used to make
umn) profile is the line (column) obtained the rows and columns comparable.
2. Determine the inertia matrix V. If we want to know, for example, the num-
This is done in the following way: ber of eigenvalues and therefore the fac-
• The weighted mean of the r column torial axes that explain at least 3/4 of the
coordinates is calculated: total inertia, we sum the explained iner-
 ni. nij
r tia from each of the eigenvalues until we C
n.j
gj = · = , j = 1, . . . , c . obtain 75%.
n.. ni. n..
i=1 We then calculate the main axes of inertia
• The c obtained values gj are written from these factorial axes.
r times in the rows of a matrix G; 5. The main axes of inertia, denoted ul , are
• The diagonal matrix DI is constructed then given by:
with diagonal elements of nni... ; 
ul = M −1 · vl ,
• Finally, the inertia matrix V is calcu-
lated using the following formula: meaning that its jth component is:

%
V = (XI − G) · DI · (XI − G) . n.j
ujl = · vjl .
n..
3. Using the matrix M, which consists of nn.j..
terms on its diagonal and zero terms else- 6. We then calculate the main components,
where, we determine the matrix C: denoted by yk , which are the orthogonal
√ √ projections of the row coordinates on the
C = M·V· M. main axes of inertia: the ith coordinate of
4. Find the eigenvalues (denoted kl ) and the lth main component takes the follow-
the eigenvectors (denoted v ) of this ing value:
l
matrix C. yil = xi · M · ul ,
The c eigenvalues kc , kc−1 , . . . , k1 (writ-
ten in decreasing order) are the iner- meaning that
tia. The corresponding eigenvectors are c %
nij n..
called the factorial axes (or axes of iner- y il = · vjl
ni. n.j
j=1
tia).
For each eigenvalue, we calculate the cor- is the coordinate of row i on the lth axis.
responding inertia explained by the facto- 7. After the main components yl (of the
rial axis. For example, the first factorial row coordinates) have been calculated,
axis explains: we determine the main components of the
100 · k1 column coordinates (denoted by zl ) using
(in %) of inertia. the yl , thanks to the transaction formulae:
c
kl
1  nij
r
l=1 zjl = √ · yil ,
kl i=1 n.j
In the same way, the two first factorial
axes explain: for j = 1, 2, . . . , c;
1  nij
c
100 · (k1 + k2 )
(in %) of inertia. yil = √ · zjl ,
 c kl i=1 ni.
kl
l=1 for i = 1, 2, . . . , r .
zjl is the coordinate of the column j on DOMAINS AND LIMITATIONS


the lth factorial axis. When simultaneously representing the row
8. The simultaneous representation of the coordinate and the column coordinate, it
row coordinates and the column coordi- is not advisable to interpret eventual prox-
nates on a scatter plot with two factori- imities crossed between lines and columns
al axes, gives a signification to the axis because the two points are not in the same
depending on the points it is related to. initial space. On the other hand it is interest-
The quality of the representation is related ing to interpret the position of a row coor-
to the proportion of the inertia explained dinate by comparing it to the set of column
by the two main axes used. The closer the coordinate (and vice versa).
explained inertia is to 1, the better the
quality.
EXAMPLES
Consider the example of a company that
wants to find out how healthy its staff. One
of the subjects covered in the question-
naire concerns the number of medical visits
(dentists included) per year. Three possible
answers are proposed:
• Between 0 and 6 visits;
• Between 7 and 12 visits;
9. It can be useful to insert additional point- • More than 12 visits per year.
rows or column coordinates (illustrative The staff questioned are distributed into five
variables) along with the active variables categories:
used for the correspondence analysis. • Managerial staff members over 40;
Consider an extra row-coordinate: • Managerial staff members under 40;
(ns1, ns2 , ns3, . . . , nsc ) • Employees over 40;
• Employees under 40;
of profile-row • Office personnel.

ns1 ns2 nsc The results are reported in the following con-
, ,..., .
ns. ns. ns. tingency table, to which margins have been
The lth main component of the extra row- added:
dot is given by the second formula of
point 7): Number of visits
per year
1 
c
nsj
yil = √ · zjl , 0 to 6 7 to 12 > 12 Total
kl ns.
j=1 Managerial
staff members 5 7 3 15
for i = 1, 2, . . . , r .
> 40
We proceed the same way for a column Managerial
coordinate, but apply the first formula of staff members 5 5 2 12
point 7 instead. < 40
Number of visits • We start by calculating the weighted


per year mean of the five line-dots:
0 to 6 7 to 12 > 12 Total

5
ni. nij n.j
Employees gj = · =
> 40
2 12 6 20
i=1
n.. ni. n.. C
Employees
24 18 12 54 for j = 1, 2 and 3;
< 40
Office • We then write these three values five
12 8 4 24
personnel times in the matrix G, as shown below:
⎡ ⎤
Total 48 50 27 125 0.384 0.400 0.216
⎢ 0.384 0.400 0.216 ⎥
⎢ ⎥
⎢ ⎥
Using these data we will describe the eight G = ⎢ 0.384 0.400 0.216 ⎥ .
⎢ ⎥
main steps that should be followed to obtain ⎣ 0.384 0.400 0.216 ⎦
a graphical representation of employee 0.384 0.400 0.216
health via correspondence analysis.
• We then construct a diagonal matrix
1. We first determine the table of line profiles
DI containing the nni... ;
matrix XI , by dividing each element by the
• Finally, we calculate the matrix of
sum of the elements of the line in which
inertia V, as given by the following
it is located:
formula:
0.333 0.467 0.2 1 V = (XI − G) · DI · (XI − G) ,
0.417 0.417 0.167 1
which, in this case, gives:
0.1 0.6 0.3 1
⎡ ⎤
0.444 0.333 0.222 1 0.0175 −0.0127 −0.0048
0.5 0.333 0.167 1 V = ⎣ −0.0127 0.0097 0.0029 ⎦ .
1.794 2.15 1.056 5 −0.0048 0.0029 0.0019
3. We define a third-order square matrix M
and the table of column profiles by divid-
that contains nn...j terms on its diagonal and
ing each element by the sum of the ele-
zero terms everywhere else:
ments in the corresponding column: ⎡ ⎤
2.604 0 0
0.104 0.14 0.111 0.355
M=⎣ 0 2.5 0 ⎦.
0.104 0.1 0.074 0.278
0 0 4.630

0.042 0.24 0.222 0.504 The square root of M, denoted M, is
0.5 0.36 0.444 1.304 obtained by taking the square root of each
0.25 0.16 0.148 0.558 diagonal element of M. Using this new
1 1 1 3 matrix, we determine
√ √
C = M·V · M
2. We then calculate the matrix of inertia V ⎡ ⎤
for the frequencies (given by nni... for val- 0.0455 −0.0323 −0.0167
ues of i from 1 to 5) by proceeding in the C = ⎣ −0.0323 0.0243 0.0100 ⎦ .
following way: −0.0167 0.0100 0.0087
4. The eigenvalues of C are obtained by We find:


diagonalizing the matrix. Arranging ⎡ ⎤
these values in decreasing order gives: 0.4838
u1 = ⎣ −0.3527 ⎦ and
k1 = 0.0746 −0.1311
⎡ ⎤
k2 = 0.0039 0.0500
u2 = ⎣ 0.3401 ⎦ .
k3 = 0 .
−0.3901

The explained inertia is determined for 6. We then calculate the main components
each of these values. For example: by projecting the rows onto the main
axes of inertia. Constructing an auxiliary
0.0746
k1 : = 95.03% , matrix U formed from the two vectors u1
0.0746 + 0.0039 and u2 , we define:
0.0039
k2 : = 4.97% . Y = XI · M · U
0.0746 + 0.0039 ⎡ ⎤
−0.1129 0.0790
⎢ 0.0564 ⎥
For the last one, k3 , the explained inertia ⎢ 0.1075 ⎥
⎢ ⎥
is zero. = ⎢ −0.5851 0.0186 ⎥.
⎢ ⎥
The first two factorial axes, associated ⎣ 0.1311 −0.0600 ⎦
with the eigenvalues k1 and k2, explain all 0.2349 0.0475
of the inertia. Since the third eigenvalue
is zero it is not necessary to calculate the We can see the coordinates of the five rows
eigenvector that is associated with it. We written horizontally in this matrix Y; for
focus on calculating the first two normal- example, the first column indicates the
ized eigenvectors then: components related to the first factorial
axis and the second indicates the compo-
⎡ ⎤
0.7807 nents related to the second axis.
v1 = ⎣ −0.5576 ⎦ and 7. We then use the coordinates of the rows to
−0.2821 find those of the columns via the following
⎡ ⎤ transition formulae written as matrices:
0.0807
v2 = ⎣ 0.5377 ⎦ .
Z = K · Y  · XJ ,
−0.8393
where:
5. We then calculate the main axes of inertia
by: K is the second-order diagonal matrix

that has 1/ ki terms on its diagonal

ui = M −1 · vi , for i = 1 and 2 , and zero terms elsewhere;
Y  is the transpose of the matrix contain-

where M −1 is obtained ing the coordinates of the rows, and;
√ by inverting the
diagonal elements of M. XJ is the column profile matrix.
We obtain the matrix: employees under 40 (Y4 ) and the office per-
  sonnel (Y5 ). We can verify from the table of
0.3442 −0.2409 −0.1658
Z= , line profiles that the percentages for these
0.0081 0.0531 −0.1128 two categories are indeed very similar.
Similarly, the proximity between two C
where each column contains the compo-
columns (representing two categories relat-
nents of one of the three column coor-
ed to the number of medical visits) indicates
dinates; for example, the first line corre-
similar distributions of people within the
sponds to each coordinate on the first fac-
business for these categories. This can be
torial axis and the second line to each
seen for the modalities Z2 (from 7 to 12
coordinate on the second axis.
medical visits per year) and Z3 (more than
We can verify the transition formula that
12 visits per year).
gives Y from Z:
If we consider the rows and the columns
Y = XI · Z  · K simultaneously (and not separately as we did
previously), it becomes possible to identi-
using the same notation as seen previous- fy similarities between categories for cer-
ly. tain modalities. For example, the employ-
8. We can now represent the five categories ees under 40 (Y4 ) and the office personnel
of people questioned and the three cate- (Y5 ) seem to have the same behavior towards
gories of answers proposed on the same health: high proportions of them (0.44 and
factorial plot: 0.5 respectively) go to the doctor less than
6 times per year (Z1 ).
In conclusion, axis 1 is confronted on one
side with the categories indicating an aver-
age or high number of visits (Z2 and Z3 )—the
employees or the managerial staff members
over 40 (Y1 and Y3 )—and on the other side
with the modalities associated with low num-
bers of visits (Z1): Y2 , Y4 and Y5 . The first fac-
tor can then be interpreted as the importance
of medical control according to age.
We can study this factorial plot at three dif-
ferent levels of analysis, depending: FURTHER READING
• The set of categories for the people ques-  Contingency table
tioned;  Data analysis
• The set of modalities for the medical vis-  Eigenvalue
its;  Eigenvector
• Both at the same time.  Factorial axis
In the first case, close proximity between  Graphical representation
two rows(between two categoriesof person-  Inertia matrix
nel) signifies similar medical visit profiles.  Matrix
On the factorial plot, this is the case for the  Scatterplot
REFERENCES
Benzécri, J.P.: L'Analyse des données. Vol. 1: La Taxinomie. Vol. 2: L'Analyse factorielle des correspondances. Dunod, Paris (1976)
Benzécri, J.P.: Histoire et préhistoire de l'analyse des données, les cahiers de l'analyse des données 1, no. 1–4. Dunod, Paris (1976)
Fisher, R.A.: The precision of discriminant functions. Ann. Eugen. (London) 10, 422–429 (1940)
Greenacre, M.: Theory and Applications of Correspondence Analysis. Academic, London (1984)
Hirschfeld, H.O.: A connection between correlation and contingency. Proc. Camb. Philos. Soc. 31, 520–524 (1935)
Lebart, L., Salem, A.: Analyse Statistique des Données Textuelles. Dunod, Paris (1988)
Saporta, G.: Probabilité, analyse de données et statistiques. Technip, Paris (1990)

Covariance

The covariance between two random variables X and Y is the measure of how much the two random variables vary together.
If X and Y are independent random variables, the covariance of X and Y is zero. The converse, however, is not true.

MATHEMATICAL ASPECTS
Consider X and Y, two random variables defined in the same sample space Ω. The covariance of X and Y, denoted by Cov(X, Y), is defined by
Cov(X, Y) = E[(X − E[X])(Y − E[Y])] ,
where E[.] is the expected value.
Developing the right side of the equation gives:
Cov(X, Y) = E[XY − E[X]Y − XE[Y] + E[X]E[Y]]
 = E[XY] − E[X]E[Y] − E[X]E[Y] + E[X]E[Y]
 = E[XY] − E[X]E[Y] .

Properties of Covariance
Consider X, Y and Z, which are random variables defined in the same sample space Ω, and a, b, c and d, which are constants. We find that:
1. Cov(X, Y) = Cov(Y, X)
2. Cov(X, c) = 0
3. Cov(aX + bY, Z) = a Cov(X, Z) + b Cov(Y, Z)
4. Cov(X, cY + dZ) = c Cov(X, Y) + d Cov(X, Z)
5. Cov(aX + b, cY + d) = ac Cov(X, Y).

Consequences of the Definition
1. If X and Y are independent random variables, Cov(X, Y) = 0. In fact E[XY] = E[X]E[Y], meaning that:
Cov(X, Y) = E[XY] − E[X]E[Y] = 0 .
The reverse is not generally true: Cov(X, Y) = 0 does not necessarily imply that X and Y are independent.
2. Cov(X, X) = Var(X), where Var(X) represents the variance of X. In fact:
Cov(X, X) = E[XX] − E[X]E[X] = E[X²] − (E[X])² = Var(X) .
DOMAINS AND LIMITATIONS
Consider two random variables X and Y, and their sum X + Y. We then have:
E[X + Y] = E[X] + E[Y] and
Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y) .
We now show these results for discrete variables. If Pji = P(X = xi, Y = yj) we have:
E[X + Y] = Σ_i Σ_j (xi + yj) Pji
 = Σ_i Σ_j xi Pji + Σ_i Σ_j yj Pji
 = Σ_i xi (Σ_j Pji) + Σ_j yj (Σ_i Pji)
 = Σ_i xi Pi + Σ_j yj Pj
 = E[X] + E[Y] .
Moreover:
Var(X + Y) = E[(X + Y)²] − (E[X + Y])²
 = E[X²] + 2E[XY] + E[Y²] − (E[X + Y])²
 = E[X²] + 2E[XY] + E[Y²] − (E[X] + E[Y])²
 = E[X²] + 2E[XY] + E[Y²] − (E[X])² − 2E[X]E[Y] − (E[Y])²
 = Var(X) + Var(Y) + 2(E[XY] − E[X]E[Y])
 = Var(X) + Var(Y) + 2 Cov(X, Y) .
These results can be generalized for n random variables X1, X2, . . . , Xn, with Xi having an expected value equal to E(Xi) and a variance equal to Var(Xi). We then have:
E[X1 + X2 + · · · + Xn] = E[X1] + E[X2] + · · · + E[Xn] = Σ_{i=1}^{n} E[Xi] .
Var(X1 + X2 + · · · + Xn) = Var(X1) + Var(X2) + · · · + Var(Xn)
 + 2[Cov(X1, X2) + · · · + Cov(X1, Xn) + Cov(X2, X3) + · · · + Cov(X2, Xn) + · · · + Cov(Xn−1, Xn)]
 = Σ_{i=1}^{n} Var(Xi) + 2 Σ_{i=1}^{n−1} Σ_{j>i} Cov(Xi, Xj) .

EXAMPLES
Consider two psychological tests carried out in succession. Each subject receives a grade X of between 0 and 3 for the first test and a grade Y of between 0 and 2 for the second test. Given that the probabilities of X being equal to 0, 1, 2 and 3 are respectively 0.16, 0.3, 0.41 and 0.13, and that the probabilities of Y being equal to 0, 1 and 2 are respectively 0.55, 0.32 and 0.13, we have:
E[X] = 0 · 0.16 + 1 · 0.3 + 2 · 0.41 + 3 · 0.13 = 1.51
E[Y] = 0 · 0.55 + 1 · 0.32 + 2 · 0.13 = 0.58
E[XY] = 0 · 0 · 0.16 · 0.55 + 0 · 1 · 0.16 · 0.32 + . . . + 3 · 2 · 0.13 · 0.13 = 0.88 .
We can then calculate the covariance of X and Y:
Cov(X, Y) = E[XY] − E[X]E[Y] = 0.88 − (1.51 · 0.58) = 0.00428 .
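This little computation can be scripted directly. The sketch below is only an illustration; it rebuilds E[X], E[Y], E[XY] and Cov(X, Y) from the marginal probabilities used above. Note that the example takes the joint probabilities as the products of the marginals, so when E[XY] is computed without rounding it equals E[X] · E[Y] and the covariance is exactly zero; the small value 0.00428 above comes from rounding E[XY] to 0.88.

```python
import numpy as np

p_x = np.array([0.16, 0.30, 0.41, 0.13])   # P(X = 0), ..., P(X = 3)
p_y = np.array([0.55, 0.32, 0.13])         # P(Y = 0), P(Y = 1), P(Y = 2)
x_values = np.arange(4)
y_values = np.arange(3)

e_x = (x_values * p_x).sum()               # E[X] = 1.51
e_y = (y_values * p_y).sum()               # E[Y] = 0.58

# Joint probabilities P(X = x, Y = y) taken as the product of the marginals, as in the example.
joint = np.outer(p_x, p_y)
e_xy = (np.outer(x_values, y_values) * joint).sum()

cov = e_xy - e_x * e_y
print(f"E[X] = {e_x:.2f}, E[Y] = {e_y:.2f}, E[XY] = {e_xy:.4f}, Cov(X, Y) = {cov:.5f}")
```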
FURTHER READING
Correlation coefficient
Expected value
Random variable
Variance of a random variable

Covariance Analysis

Covariance analysis is a method used to estimate and test the effects of treatments. It checks whether there is a significant difference between the means of several treatments by taking into account the observed values of the variable before the treatment.
Covariance analysis is a precise way of performing treatment comparisons because it involves adjusting the response variable Y to a concomitant variable X, which corresponds to the values observed before the treatment.

HISTORY
Covariance analysis dates back to 1930. It was first developed by Fisher, R.A. (1932). After that, other authors applied covariance analysis to agricultural and medical problems. For example, Bartlett, M.S. (1937) applied covariance analysis to his studies on cotton cultivation in Egypt and on milk yields from cows in winter.
DeLury, D.B. (1948) used covariance analysis to compare the effects of different medications (atropine, quinidine, atrophine) on rat muscles.

MATHEMATICAL ASPECTS
We consider here a covariance analysis of a completely randomized design implying one factor. The linear model that we will consider is the following:
Yij = μ + τi + βXij + εij ,  i = 1, 2, . . . , t ,  j = 1, 2, . . . , ni
where
Yij represents observation j receiving treatment i,
μ is the general mean common to all treatments,
τi is the actual effect of treatment i on the observation,
Xij is the value of the concomitant variable, and
εij is the experimental error in observation Yij.

Calculations
In order to calculate the F ratio that will help us to determine whether there is a significant difference between treatments, we need to work out sums of squares and sums of products. Therefore, if X̄i. and Ȳi. are respectively the means of the X values and the Y values for treatment i, and if X̄.. and Ȳ.. are respectively the means of all the values of X and Y, we obtain the formulae given below.
1. The total sum of squares for X:
S_XX = Σ_{i=1}^{t} Σ_{j=1}^{ni} (Xij − X̄..)² .
2. The total sum of squares for Y (S_YY) is calculated in the same way as S_XX, but X is replaced by Y.
3. The total sum of products of X and Y:
S_XY = Σ_{i=1}^{t} Σ_{j=1}^{ni} (Xij − X̄..)(Yij − Ȳ..) .
4. The sum of squares of the treatments for X:
T_XX = Σ_{i=1}^{t} Σ_{j=1}^{ni} (X̄i. − X̄..)² .
5. The sum of squares of the treatments for Y (T_YY) is calculated in the same way as T_XX, but X is replaced by Y.
6. The sum of the products of the treatments of X and Y:
T_XY = Σ_{i=1}^{t} Σ_{j=1}^{ni} (X̄i. − X̄..)(Ȳi. − Ȳ..) .
7. The sum of squares of the errors for X:
E_XX = Σ_{i=1}^{t} Σ_{j=1}^{ni} (Xij − X̄i.)² .
8. The sum of the squares of the errors for Y (E_YY) is calculated in the same way as E_XX, but X is replaced by Y.
9. The sum of products of the errors of X and Y:
E_XY = Σ_{i=1}^{t} Σ_{j=1}^{ni} (Xij − X̄i.)(Yij − Ȳi.) .
Substituting in appropriate values and calculating these formulae corresponds to an analysis of variance for each of X, Y and XY.
The degrees of freedom associated with these different formulae are as follows:
1. For the total sum: Σ_{i=1}^{t} ni − 1.
2. For the sum of the treatments: t − 1.
3. For the sum of the errors: Σ_{i=1}^{t} ni − t.
Adjustment of the variable Y to the concomitant variable X yields two new sums of squares:
1. The adjusted total sum of squares:
SStot = S_YY − S_XY² / S_XX .
2. The adjusted sum of the squares of the errors:
SSerr = E_YY − E_XY² / E_XX .
The new degrees of freedom for these two sums are:
1. Σ_{i=1}^{t} ni − 2;
2. Σ_{i=1}^{t} ni − t − 1, where a degree of freedom is subtracted due to the adjustment.
The third adjusted sum of squares, the adjusted sum of the squares of the treatments, is given by:
SStr = SStot − SSerr .
This has the same number of degrees of freedom, t − 1, as before.

Covariance Analysis Table
We now have all of the elements needed to establish the covariance analysis table. The sums of squares divided by the number of degrees of freedom gives the means of the squares.

Source of     Degrees of       Sum of squares and of products
variation     freedom          Σx²     Σxy     Σy²
Treatments    t − 1            T_XX    T_XY    T_YY
Errors        Σ ni − t         E_XX    E_XY    E_YY
Total         Σ ni − 1         S_XX    S_XY    S_YY
Note: the numbers in the Σx² and Σy² columns cannot be negative; on the other hand, the numbers in the Σxy column can be negative.

Adjustment
Source of     Degrees of       Sum of      Mean of
variation     freedom          squares     squares
Treatments    t − 1            SStr        MCtr
Errors        Σ ni − t − 1     SSerr       MCerr
Total         Σ ni − 2         TSS

F Ratio: Testing the Treatments
The F ratio, used to test the null hypothesis that there is no significant difference between the means of the treatments once adjusted to the variable X, is given by:
F = MCtr / MCerr .
The ratio follows a Fisher distribution with t − 1 and Σ_{i=1}^{t} ni − t − 1 degrees of freedom. The null hypothesis
H0: τ1 = τ2 = . . . = τt
will be rejected at the significance level α if the F ratio is superior or equal to the value of the Fisher table, in other words if:
F ≥ F_{t−1, Σ ni − t − 1, α} .
It is clear that we assume that the β coefficient is different from zero when performing covariance analysis. If this is not the case, a simple analysis of variance is sufficient.

Test Concerning the β Slope
So, we would like to know whether there is a significant effect of the concomitant variable before the application of the treatment. To test this hypothesis, we will assume the null hypothesis to be
H0: β = 0
and the alternative hypothesis to be
H1: β ≠ 0 .
The F ratio can be established:
F = (E_XY² / E_XX) / MCerr .
It follows a Fisher distribution with 1 and Σ_{i=1}^{t} ni − t − 1 degrees of freedom. The null hypothesis will be rejected at the significance level α if the F ratio is superior to or equal to the value in the Fisher table, in other words if:
F ≥ F_{1, Σ ni − t − 1, α} .
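A compact numerical sketch of these calculations is given below. It is only an illustration; the three small groups of (X, Y) values are invented. It builds the sums of squares and products defined above and forms the two F ratios:

```python
import numpy as np

# Hypothetical data: t = 3 treatments, 4 observations each, as (X, Y) pairs.
groups = [
    (np.array([10.0, 12.0, 11.0, 13.0]), np.array([20.0, 24.0, 22.0, 27.0])),
    (np.array([ 9.0, 11.0, 10.0, 12.0]), np.array([25.0, 28.0, 27.0, 30.0])),
    (np.array([11.0, 13.0, 12.0, 14.0]), np.array([30.0, 33.0, 32.0, 36.0])),
]

X = np.concatenate([g[0] for g in groups])
Y = np.concatenate([g[1] for g in groups])
N, t = len(X), len(groups)

# Total and error sums of squares and products.
S_xx, S_yy = ((X - X.mean())**2).sum(), ((Y - Y.mean())**2).sum()
S_xy = ((X - X.mean()) * (Y - Y.mean())).sum()
E_xx = sum(((x - x.mean())**2).sum() for x, _ in groups)
E_yy = sum(((y - y.mean())**2).sum() for _, y in groups)
E_xy = sum(((x - x.mean()) * (y - y.mean())).sum() for x, y in groups)

# Adjusted sums of squares and mean squares.
SS_tot = S_yy - S_xy**2 / S_xx
SS_err = E_yy - E_xy**2 / E_xx
SS_tr = SS_tot - SS_err
MC_tr = SS_tr / (t - 1)
MC_err = SS_err / (N - t - 1)

F_treatments = MC_tr / MC_err            # tests H0: tau_1 = ... = tau_t
F_slope = (E_xy**2 / E_xx) / MC_err      # tests H0: beta = 0
print(f"F (treatments) = {F_treatments:.3f}, F (slope) = {F_slope:.3f}")
```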
DOMAINS AND LIMITATIONS
The basic hypotheses that need to be constructed before initiating a covariance analysis are the same as those used for an analysis of variance or a regression analysis. These are the hypotheses of normality, homogeneity (homoscedasticity) of variances and independence.
In covariance analysis, as in an analysis of variance, the null hypothesis stipulates that the independent samples come from different populations that have identical means. Moreover, since there are always conditions associated with any statistical technique, those that apply to covariance analysis are as follows:
1. The population distributions must be approximately normal, if not completely normal.
= (32 − 29.8667)2 + . . .
2. The populations from which the samples
are taken must have the same variance + (32 − 29.8667)2
σ 2 , meaning: = 453.73 .
σ12 = σ22 = . . . = σk2 , 2. The total sum of squares for Y: C
where k is the number of populations to
be compared.  3  5
SYY = (Yij − Ȳ.. )2
3. The samples must be chosen random- i=1 j=1
ly and all of the samples must be inde-
= (167 − 167.4667)2 + . . .
pendent.
We must also add a basic hypothesis specif- + (162 − 167.4667)2
ic to covariance analysis, which is that the = 3885.73 .
treatments that were carried out must not
influence the values of the concomitant vari- 3. The total sum of the products of X and Y:
able X.
3  5

EXAMPLES S XY = (Xij − X̄.. )(Yij − Ȳ.. )


i=1 j=1
Consider an experiment consisting of com-
paring the effects of three different diets on = (32 − 29.8667)
a population of cows. · (167 − 167.4667) + . . .
The data are presented in the form of a table, + (32 − 29.8667)
which includes the three different diets, each
· (162 − 167.4667)
of which was administered to five porch. The
initial weights are denoted by the concomi- = 158.93 .
tant variable X (in kg), and the gains in weight
(after treatment) are denoted by Y: 4. The sum of the squares of the treatments
for X:
Diets

1 2 3 
3 
5
TXX = (X̄i. − X̄.. )2
X Y X Y X Y
i=1 j=1
32 167 26 182 36 158
29 172 33 171 34 191
= 5(28.2 − 29.8667)2
22 132 22 173 37 140 + 5(26.2 − 29.8667)2
23 158 28 163 37 192 + 5(35.2 − 29.8667)2
35 169 22 182 32 162
= 223.33 .
We first calculate the various sums of squares
and products: 5. The sum of the squares of the treatments
1. The total sum of squares for X: for Y:


3 
5 
3 
5
SXX = (Xij − X̄.. )2 TYY = (Ȳi. − Ȳ.. )2
i=1 j=1 i=1 j=1
= 5(159.6 − 167.4667)2
The degrees of freedom associated with
+ 5(174.2 − 167.4667)2 these different calculations are as follows:
+ 5(168.6 − 167.4667)2 1. For the total sums:
= 542.53 . 
3
ni − 1 = 15 − 1 = 14 .
6. The sum of the products of the treatments
i=1
of X and Y:
2. For the sums of treatments:

3 
5
TXY = (X̄i. − X̄.. )(Ȳi. − Ȳ.. ) t−1=3−1=2.
i=1 j=1

= 5(28.2 − 29.8667) 3. For the sums of errors:

· (159.6 − 167.4667) + . . . 
3
ni − t = 15 − 3 = 12 .
+ 5(35.2 − 29.8667) i=1
· (168.6 − 167.4667)
Adjusting the variable Y to the concomitant
= −27.67 . variable X yields two new sums of squares:
7. The sum of the squares of the errors for X: 1. The total adjusted sum of squares:
2
SXY

3 
5
SStot = SYY −
EXX = (Xij − X̄i. )2 SXX
i=1 j=1 158.932
= 3885.73 −
= (32 − 28.2) + . . . + (32 − 35.2)
2 2 453.73
= 230.40 . = 3830.06 .

8. The sum of the squares of the errors for Y: 2. The adjusted sum of the squares of the
errors:
 3  5
EYY = (Yij − Ȳi. )2 E2
SSerr = EYY − XY
i=1 j=1 EXX
= (167 − 159.6) + . . .
2 186.602
= 3343.20 −
230.40
+ (162 − 168.6)2
= 3192.07 .
= 3343.20 .
The new degrees of freedom for these two
9. The sum of the products of the errors of X sums are:
and Y:
1. 3i=1 ni − 2 = 15 − 2 = 13;

 3  5 2. 3i=1 ni − t − 1 = 15 − 3 − 1 = 11.
EXY = (Xij − X̄i. )(Yij − Ȳi. ) The adjusted sum of the squares of the treat-
i=1 j=1 ments is given by:
= (32 − 28.2)(167 − 159.6) + . . .
SStr = SStot − SSerr
+ (32 − 35.2)(162 − 168.6)
= 3830.06 − 3192.07
= 186.60 .
= 637.99 .
Covariation 133

This has the same number of degrees of freedom as before:

t − 1 = 3 − 1 = 2 .

We now have all of the elements required in order to establish a covariance analysis table. The sums of squares divided by the degrees of freedom give the mean squares.

Source of     Degrees of   Sum of squares and of products
variation     freedom      Σx²        Σxy        Σy²
Treatments    2            223.33     −27.67     542.53
Errors        12           230.40     186.60     3343.20
Total         14           453.73     158.93     3885.73

Adjustment
Source of     Degrees of   Sum of      Mean of
variation     freedom      squares     squares
Treatments    2            637.99      318.995
Errors        11           3192.07     290.188
Total         13           3830.06
The F ratio, which is used to test the null hypothesis that there is no significant difference between the means of the treatments once the response Y has been adjusted for the concomitant variable X, is given by:

F = MCtr / MCerr = 318.995 / 290.188 = 1.099 .

If we choose a significance level of α = 0.05, the value of F in the Fisher table is equal to:

F(2, 11, 0.05) = 3.98 .

Since F < F(2, 11, 0.05), we cannot reject the null hypothesis, and so we conclude that there is no significant difference between the responses to the three diets once the variable Y is adjusted for the initial weight X.
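The computations of this example can be checked with a short script. The following is a minimal sketch in Python (not part of the original entry; the variable names are mine) that reproduces the adjusted sums of squares and the F ratio from the raw data.

```python
# Minimal sketch, plain Python only; reproduces the one-way covariance
# analysis above (diet = treatment, X = initial weight, Y = weight gain).
x = {1: [32, 29, 22, 23, 35],        # initial weights X for diets 1-3
     2: [26, 33, 22, 28, 22],
     3: [36, 34, 37, 37, 32]}
y = {1: [167, 172, 132, 158, 169],   # weight gains Y for diets 1-3
     2: [182, 171, 173, 163, 182],
     3: [158, 191, 140, 192, 162]}

n = sum(len(v) for v in x.values())          # 15 observations
t = len(x)                                   # 3 treatments
xbar = sum(sum(v) for v in x.values()) / n   # grand mean of X
ybar = sum(sum(v) for v in y.values()) / n   # grand mean of Y
mean = lambda v: sum(v) / len(v)

def cp(u, ubar, v, vbar):
    # sum of cross-products of deviations from the given means
    return sum((a - ubar) * (b - vbar) for a, b in zip(u, v))

Sxx = sum(cp(x[i], xbar, x[i], xbar) for i in x)              # 453.73
Syy = sum(cp(y[i], ybar, y[i], ybar) for i in y)              # 3885.73
Sxy = sum(cp(x[i], xbar, y[i], ybar) for i in x)              # 158.93
Exx = sum(cp(x[i], mean(x[i]), x[i], mean(x[i])) for i in x)  # 230.40
Eyy = sum(cp(y[i], mean(y[i]), y[i], mean(y[i])) for i in y)  # 3343.20
Exy = sum(cp(x[i], mean(x[i]), y[i], mean(y[i])) for i in x)  # 186.60

SS_tot = Syy - Sxy**2 / Sxx          # 3830.06, with n - 2 = 13 df
SS_err = Eyy - Exy**2 / Exx          # 3192.07, with n - t - 1 = 11 df
SS_tr = SS_tot - SS_err              # 637.99, with t - 1 = 2 df
F = (SS_tr / (t - 1)) / (SS_err / (n - t - 1))
print(round(F, 3))                   # 1.099 < F(2,11;0.05) = 3.98
```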
FURTHER READING
 Analysis of variance
 Design of experiments
 Missing data analysis
 Regression analysis

REFERENCES
Bartlett, M.S.: Some examples of statistical methods of research in agriculture and applied biology. J. Roy. Stat. Soc. (Suppl.) 4, 137–183 (1937)
DeLury, D.B.: The analysis of covariance. Biometrics 4, 153–170 (1948)
Fisher, R.A.: Statistical Methods for Research Workers. Oliver & Boyd, Edinburgh (1925)
Huitema, B.E.: The Analysis of Covariance and Alternatives. Wiley, New York (1980)
Wildt, A.R., Ahtola, O.: Analysis of Covariance (Sage University Papers Series on Quantitative Applications in the Social Sciences, Paper 12). Sage, Thousand Oaks, CA (1978)

Covariation
It is often interesting, particularly in economics, to compare two time series. Since we wish to measure the level of dependence between two variables, this is somewhat reminiscent of the concept of correlation. However, in this case, since the time series are bound to a third variable, time, computing the correlation coefficient would only lead to an artificial relation.
Indeed, if two time series xt and yt are considered which represent completely independent phenomena and are linear functions of time:

xt = a · t + b ,
yt = c · t + d ,

where a, b, c and d are constants, it is always possible to eliminate the time factor t between the two equations and to obtain a functional relation of the type y = e · x + f. This relation suggests that there is a linear dependence between the two time series, which is not the case.
Therefore, measuring the correlation between the evolutions of two phenomena over time does not imply the existence of a link between them. The term covariation is therefore used instead of correlation, and this dependence is measured using a covariation coefficient. We can distinguish between:
• The linear covariation coefficient;
• The tendency covariation coefficient.

HISTORY
See correlation coefficient and time series.

MATHEMATICAL ASPECTS
In order to compare two time series yt and xt, the first step is to attempt to represent them on the same graphic. However, visual comparison is generally difficult. The following change of variables is performed:

Yt = (yt − ȳ) / Sy   and   Xt = (xt − x̄) / Sx ,

which are the centered and reduced variables, where Sy and Sx are the standard deviations of the respective time series.
We can distinguish between the following covariation coefficients:
• The linear covariation coefficient
  The form of this expression is identical to the one for the correlation coefficient r, but here the calculation does not carry the same meaning: the goal is to detect the possible existence of a relation between variations that are themselves related to time, and to measure its order of magnitude:

  C = Σ_{t=1}^{n} (xt − x̄)(yt − ȳ) / √( Σ_{t=1}^{n} (xt − x̄)² · Σ_{t=1}^{n} (yt − ȳ)² ) .

  This yields values between −1 and +1. If it is close to ±1, there is a linear relation between the time evolutions of the two variables. Notice that:

  C = Σ_{t=1}^{n} Xt · Yt / n ,

  where n is the number of observations, while Yt and Xt are the centered and reduced series obtained by the change of variable above.
• The tendency covariation coefficient
  The influence exerted by the means is eliminated by calculating:

  K = Σ_{t=1}^{n} (xt − Txt)(yt − Tyt) / √( Σ_{t=1}^{n} (xt − Txt)² · Σ_{t=1}^{n} (yt − Tyt)² ) .

  The means x̄ and ȳ have simply been replaced with the values of the secular trends Txt and Tyt of each time series. The tendency covariation coefficient K also takes values between −1 and +1, and the closer it gets to ±1, the stronger the covariation between the time series.
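As a sketch of how these two coefficients can be evaluated in practice, the following Python fragment (not from the original entry; the function names and the straight-line trend used for K are my own assumptions) implements both definitions.

```python
# Hedged sketch: linear and tendency covariation coefficients as defined above.
import numpy as np

def linear_covariation(x, y):
    # C: deviations are taken from the means of the two series.
    x, y = np.asarray(x, float), np.asarray(y, float)
    dx, dy = x - x.mean(), y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum()))

def tendency_covariation(x, y):
    # K: deviations are taken from secular trends Tx_t and Ty_t; a fitted
    # straight line is assumed here, but any trend estimate could be used.
    x, y = np.asarray(x, float), np.asarray(y, float)
    t = np.arange(len(x))
    dx = x - np.polyval(np.polyfit(t, x, 1), t)
    dy = y - np.polyval(np.polyfit(t, y, 1), t)
    return float((dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum()))
```

Both functions return a value between −1 and +1, in line with the definitions above.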
DOMAINS AND LIMITATIONS
There are many examples of the need to compare two time series in economics: for example, when comparing the evolution of the price of a product to the evolution of the quantity of the product on the market, or the evolution of the national revenue to the evolution of real estate transactions. It is important to know whether there is some kind of dependence between the two phenomena that evolve over time: this is the goal of measuring the covariation.
Visually comparing two time series is an important operation, but this is often a difficult task because:
• The data undergoing comparison may come from very different domains and present very different orders of magnitude, so it is preferable to study the deviations from the mean.
• The peaks and troughs of two time series may have very different amplitudes; it is then preferable to homogenize the dispersions by linking the variations back to the standard deviation of the time series.
Visual comparison is simplified if we consider the centered and reduced variables obtained via the following variable changes:

Yt = (yt − ȳ) / Sy   and   Xt = (xt − x̄) / Sx .

Also, in a similar way to the correlation coefficient, nonlinear relations can exist between two variables that give a C value that is close to zero. It is therefore important to be cautious during interpretation.
The tendency covariation coefficient is preferentially used when the relation between the time series is not linear.

EXAMPLES
Let us study the covariation between two time series. The variable xt represents the annual production of an agricultural product; the variable yt is its average annual price per unit in constant euros.

t    xt     yt
1    320    5.3
2    660    3.2
3    300    2.2
4    190    3.4
5    320    2.7
6    240    3.5
7    360    2.0
8    170    2.5

Σ_{t=1}^{8} xt = 2560 ,   Σ_{t=1}^{8} yt = 24.8 ,

giving x̄ = 320 and ȳ = 3.1.

xt − x̄   (xt − x̄)²   yt − ȳ   (yt − ȳ)²
0        0           2.2      4.84
340      115600      0.1      0.01
−20      400         −0.9     0.81
−130     16900       0.3      0.09
0        0           −0.4     0.16
−80      6400        0.4      0.16
40       1600        −1.1     1.21
−150     22500       −0.6     0.36

Σ_{t=1}^{8} (xt − x̄)² = 163400 ,   Σ_{t=1}^{8} (yt − ȳ)² = 7.64 ,

giving σx = 142.9 and σy = 0.98.
The centered and reduced values Xt and Yt are then calculated.
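The standardization and the two coefficients computed in the rest of this example can be reproduced with a few lines of Python (a sketch, not part of the original entry; small differences from the printed values come from rounding in the tables).

```python
import numpy as np

x = np.array([320, 660, 300, 190, 320, 240, 360, 170], float)  # production x_t
y = np.array([5.3, 3.2, 2.2, 3.4, 2.7, 3.5, 2.0, 2.5])         # price y_t
n = len(x)

X = (x - x.mean()) / x.std()    # centered and reduced series (population std)
Y = (y - y.mean()) / y.std()

C = (X * Y).sum() / n                 # ≈ 0.02: no same-year covariation
C_lag = (X[1:] * Y[:-1]).sum() / n    # ≈ 0.96: strong covariation with a one-year shift
print(round(C, 3), round(C_lag, 3))
```

Note that, following the convention used in this example, the lagged sum over t = 2, . . . , 8 is still divided by n = 8.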
t    Xt       Yt       Xt · Yt   Xt · Yt−1
1    0.00     2.25     0.00      –
2    2.38     0.10     0.24      5.36
3    −0.14    −0.92    0.13      −0.01
4    −0.91    0.31     −0.28     0.84
5    0.00     −0.41    0.00      0.00
6    −0.56    0.41     −0.23     0.23
7    0.28     −1.13    −0.32     0.11
8    −1.05    −0.61    0.64      1.18

The linear covariation coefficient is then calculated:

C = Σ_{t=1}^{8} Xt · Yt / 8 = 0.18 / 8 = 0.0225 .

If the observations Xt are compared with Yt−1, meaning that this year's production is compared with the previous year's price, we obtain:

C = Σ_{t=2}^{8} Xt · Yt−1 / 8 = 7.71 / 8 = 0.964 .

The linear covariation coefficient for a shift of one year is very strong: there is a strong (positive) covariation with a shift of one year between the two variables.

FURTHER READING
 Correlation coefficient
 Moving average
 Secular trend
 Standard deviation
 Time series

REFERENCES
Kendall, M.G.: Time Series. Griffin, London (1973)
Py, B.: Statistique Déscriptive. Economica, Paris (1987)

Cox, David R.
Cox, David R. was born in 1924 in Birmingham, England. He studied mathematics at the University of Cambridge and obtained his doctorate in applied mathematics at the University of Leeds in 1949. From 1966 to 1988 he was a professor of statistics at Imperial College London, and then from 1988 to 1994 he taught at Nuffield College, Oxford.
Cox, David is an eminent statistician. He was knighted by Queen Elizabeth II in 1982 in gratitude for his contributions to statistical science, and has been named Doctor Honoris Causa by many universities in England and elsewhere. From 1981 to 1983 he was President of the Royal Statistical Society; he was also President of the Bernoulli Society from 1972 to 1983 and President of the International Statistical Institute from 1995 to 1997.
Due to the variety of subjects that he has studied and developed, Professor Cox has had a profound impact on his field. He was named Doctor Honoris Causa of the University of Neuchâtel in 1992.
Cox, Sir David is the author or coauthor of more than 250 articles and 16 books, and between 1966 and 1991 he was the editor of Biometrika.

Some principal works and articles of Cox, Sir David:

1964 (and Box, G.E.P.) An analysis of transformations (with discussion). J. Roy. Stat. Soc. Ser. B 26, 211–243.
1970 The Analysis of Binary Data. Methuen, London.
1973 (and Hinkley, D.V.) Theoretical Statistics. Chapman & Hall, London.
1974 (and Atkinson, A.C.) Planning experiments for discriminating between models. J. Roy. Stat. Soc. Ser. B 36, 321–348.
1978 (and Hinkley, D.V.) Problems and Solutions in Theoretical Statistics. Chapman & Hall, London.
1981 (and Snell, E.J.) Applied Statistics: Principles and Examples. Chapman & Hall.
1981 Theory and general principles in statistics. J. Roy. Stat. Soc. Ser. A 144, 289–297.
2000 (and Reid, N.) The Theory of the Design of Experiments. Chapman & Hall, London.

Cox, Gertrude Mary
Cox, Gertrude Mary was born in 1900, near Dayton, Iowa, USA. Her ambition was to help people, and so she initially studied a social sciences course for two years. She then worked in an orphanage for young boys in Montana for two years. In order to become a director of the orphanage, she decided to continue her education at Iowa State College, from which she graduated in 1929. To pay for her studies, Cox, Gertrude worked with Snedecor, George Waddel, her professor, in his statistical laboratory, which led to her becoming interested in statistics. After graduating, she started studying for a doctorate in psychology. In 1933, before she had finished her doctorate, Snedecor, George, then the director of the Iowa State Statistical Laboratory, convinced her to become his assistant, which she agreed to, although she remained in the field of psychology because she worked on the evaluation of statistical tests in psychology and the analysis of psychological data.
On 1st November 1940, she became the director of the Department of Experimental Statistics of the State of North Carolina. In 1945, the General Education Board gave her permission to create an institute of statistics at the University of North Carolina, with a department of mathematical statistics at Chapel Hill.
She was a founder member of the Journal of the International Biometric Society in 1947, and she was a director of it from 1947 to 1955 and president of it from 1968 to 1969. She was president of the American Statistical Association (ASA) in 1956. She died in 1978.

Principal work of Cox, Gertrude M.:

1957 (and Cochran, W.) Experimental Designs, 2nd edn. Wiley, New York

Cp Criterion
The Cp criterion is a model selection criterion in linear regression. For a linear regression model with p parameters, including any constant term in the model, a rule of thumb is to select a model in which the value of Cp is close to the number of terms in the model.

HISTORY
Introduced by Mallows, Colin L. in 1964, the model selection criterion Cp has been used ever since as a criterion for evaluating the goodness of fit of a regression model.

MATHEMATICAL ASPECTS
Let Yi = β0 + Σ_{j=1}^{p−1} Xji βj + εi, i = 1, . . . , n, be a multiple linear regression model. Denote the mean square error of the fitted value ŷi as MSE(ŷi). The criterion introduced in this section can be used to choose the model with the minimal sum of mean square errors:

Σ_{i=1}^{n} MSE(ŷi) = Σ_{i=1}^{n} E[(ŷi − yi)²] − σ² Σ_{i=1}^{n} (1 − 2hii)
                    = E(RSS) − σ² (n − 2p) ,

where
n is the number of observations,
p is the number of estimated parameters,
hii are the diagonal elements of the hat matrix,
ŷi is the estimator of yi, and RSS is the residual sum of squares.
Recall the following property of the hii:

Σ hii = p .

Define the coefficient

Γp = Σ MSE(ŷi) / σ²
   = [E(RSS) − σ² (n − 2p)] / σ²
   = E(RSS) / σ² − n + 2p .

If the model is correct, we must have:

E(RSS) = (n − p) σ² ,

which implies

Γp = p .

In practice, we estimate Γp by

Cp = RSS / σ̂² − n + 2p ,

where σ̂² is an estimator of σ². Here we estimate σ² using the s² of the full model. For this full model, we actually obtain Cp = p, which is not an interesting result. However, for all of the other models, where we use only a subset of the explanatory variables, the coefficient Cp can have values that are different from p. From the models that incorporate only a subset of the explanatory variables, we then choose those for which the value of Cp is the closest to p.
If we have k explanatory variables, we can also define the coefficient Cp for a model that incorporates a subset X1, . . . , Xp−1, p ≤ k, of the k explanatory variables in the following manner:

Cp = (n − k) · RSS(X1, . . . , Xp−1) / RSS(X1, . . . , Xk) − n + 2p ,

where RSS(X1, . . . , Xp−1) is the sum of the squares of the residuals related to the model with p − 1 explanatory variables, and RSS(X1, . . . , Xk) is the sum of the squares of the residuals related to the model with all k explanatory variables. Out of two models that incorporate p − 1 explanatory variables, we choose, according to the criterion, the one for which the value of the coefficient Cp is the closest to p.
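Before turning to the example below, here is a hedged sketch of how Cp could be computed for every subset of explanatory variables. It follows the subset formula just given (so σ² is implicitly estimated by RSS(X1, . . . , Xk)/(n − k) from the full model); the function names are my own.

```python
# Hedged sketch: Mallows' Cp for every subset of explanatory variables,
# using the subset formula given above.
from itertools import combinations
import numpy as np

def rss(X, y):
    # Residual sum of squares of a least-squares fit with a constant term.
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

def mallows_cp(X_full, y):
    n, k = X_full.shape
    rss_full = rss(X_full, y)
    table = {}
    for m in range(1, k + 1):
        for subset in combinations(range(k), m):
            p = m + 1                                   # parameters incl. the constant
            table[subset] = (n - k) * rss(X_full[:, subset], y) / rss_full - n + 2 * p
    return table

# Rule of thumb: retain the subset whose Cp is closest to its own p (= m + 1).
```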

DOMAINS AND LIMITATIONS
The Cp criterion is used when selecting variables. When used with the R² criterion, this criterion can tell us about the goodness of fit of the chosen model. The underlying idea of such a procedure is the following: instead of trying to explain one variable using all of the available explanatory variables, it is sometimes possible to determine an underlying model with a subset of these variables, and the explanatory power of this model is almost the same as that of the model containing all of the explanatory variables. Another reason for this is that the collinearity of the explanatory variables tends to decrease estimator precision, and so it can be useful to delete some superfluous variables.

EXAMPLES
Consider some data related to the Chicago fires of 1975. We denote the variable corresponding to the logarithm of the number of fires per 1000 households per district i of Chicago in 1975 by Y, and the variables corresponding to the proportion of households constructed before 1940, to the number of thefts and to the median revenue per district i by X1, X2 and X3, respectively.
Since this set contains three explanatory variables, we have 2³ = 8 possible models. We divide the eight possible equations into four sets:
1. Set A contains the only equation without explanatory variables:
   Y = β0 + ε .
2. Set B contains three equations with one explanatory variable:
   Y = β0 + β1 X1 + ε
   Y = β0 + β2 X2 + ε
   Y = β0 + β3 X3 + ε .
3. Set C contains three equations with two explanatory variables:
   Y = β0 + β1 X1 + β2 X2 + ε
   Y = β0 + β1 X1 + β3 X3 + ε
   Y = β0 + β2 X2 + β3 X3 + ε .
4. Set D contains one equation with three explanatory variables:
   Y = β0 + β1 X1 + β2 X2 + β3 X3 + ε .

District    X1       X2     X3       Y
i           xi1      xi2    xi3      yi
1           0.604    29     11.744   1.825
2           0.765    44     9.323    2.251
3           0.735    36     9.948    2.351
4           0.669    37     10.656   2.041
5           0.814    53     9.730    2.152
6           0.526    68     8.231    3.529
7           0.426    75     21.480   2.398
8           0.785    18     11.104   1.932
9           0.901    31     10.694   1.988
10          0.898    25     9.631    2.715
11          0.827    34     7.995    3.371
12          0.402    14     13.722   0.788
13          0.279    11     16.250   1.740
14          0.077    11     13.686   0.693
15          0.638    22     12.405   0.916
16          0.512    17     12.198   1.099
17          0.851    27     11.600   1.686
18          0.444    9      12.765   0.788
19          0.842    29     11.084   1.974
20          0.898    30     10.510   2.715
21          0.727    40     9.784    2.803
22          0.729    32     7.342    2.912
23          0.631    41     6.565    3.589
24          0.830    147    7.459    3.681
25          0.783    22     8.014    2.918
26          0.790    29     8.177    3.148
27          0.480    46     8.212    2.501
28          0.715    23     11.230   1.723
29          0.731    4      8.330    3.082
30          0.650    31     5.583    3.073
31          0.754    39     8.564    2.197
32          0.208    15     12.102   1.281
33          0.618    32     11.876   1.609
34          0.781    27     9.742    3.353

District    X1       X2     X3       Y
i           xi1      xi2    xi3      yi
35          0.686    32     7.520    2.856
36          0.734    34     7.388    2.425
37          0.020    17     13.842   1.224
38          0.570    46     11.040   2.477
39          0.559    42     10.332   2.351
40          0.675    43     10.908   2.370
41          0.580    34     11.156   2.380
42          0.152    19     13.323   1.569
43          0.408    25     12.960   2.342
44          0.578    28     11.260   2.747
45          0.114    3      10.080   1.946
46          0.492    23     11.428   1.960
47          0.466    27     13.731   1.589
Source: Andrews and Herzberg (1985)

We denote the resulting models in the following way: 1 for X1, 2 for X2, 3 for X3, 12 for X1 and X2, 13 for X1 and X3, 23 for X2 and X3, and 123 for the full model. The following table represents the results obtained for the Cp criterion for each model:

Model    Cp
1        32.356
2        29.937
3        18.354
12       18.936
13       14.817
23       3.681
123      4.000

If we consider a model from set B (containing one explanatory variable), the number of estimated parameters is p = 2, and none of the Cp values for the three models approaches 2. If we now consider a model from set C (with two explanatory variables), p equals 3, and the Cp of model 23 approaches this. Finally, for the complete model we find that Cp = 4, which is also the number of estimated parameters, but this is not an interesting result, as previously explained. Therefore, the most reasonable choice for the model appears to be:

Y = β0 + β2 X2 + β3 X3 + ε .

FURTHER READING
 Coefficient of determination
 Collinearity
 Hat matrix
 Mean squared error
 Regression analysis

REFERENCES
Andrews, D.F., Herzberg, A.M.: Data: A Collection of Problems from Many Fields for Students and Research Workers. Springer, Berlin Heidelberg New York (1985)
Mallows, C.L.: Choosing variables in a linear regression: A graphical aid. Presented at the Central Regional Meeting of the Institute of Mathematical Statistics, Manhattan, KS, 7–9 May 1964 (1964)
Mallows, C.L.: Some comments on Cp. Technometrics 15, 661–675 (1973)

Cramér, Harald
Cramér, Harald (1893–1985) entered the University of Stockholm in 1912 in order to study chemistry and mathematics; he became a student of Mittag-Leffler and Riesz, Marcel. In 1919, Cramér was named assistant professor at the University of Stockholm. At the same time, he worked as an actuary for an insurance company, Svenska Life Assurance, which allowed him to study probability and statistics.
His main work in actuarial mathematics is Collective Risk Theory. In 1929, he was asked to create a new department in Stockholm, and he became the first Swedish professor of actuarial and statistical mathematics. At the end of the Second World War he wrote his principal work, Mathematical Methods of Statistics, which was published for the first time in 1945 and was recently (in 1999) republished.

Some principal works and articles of Cramér, Harald:

1946 Mathematical Methods of Statistics. Princeton University Press, Princeton, NJ.
1946 Collective Risk Theory: A Survey of the Theory from the Point of View of the Theory of Stochastic Processes. Skandia Jubilee Volume, Stockholm.

Criterion of Total Mean Squared Error
The criterion of total mean squared error is a way of comparing estimations of the parameters of a biased or unbiased model.

MATHEMATICAL ASPECTS
Let β̂ = (β̂1, . . . , β̂p−1) be a vector of estimators for the parameters of a regression model. We define the total mean squared error, TMSE, of the vector β̂ of estimators as being the sum of the mean squared errors (MSE) of its components.
We recall that

MSE(β̂j) = E[(β̂j − βj)²] = V(β̂j) + [E(β̂j) − βj]² ,

where E(.) and V(.) are the usual symbols used for the expected value and the variance. We define the total mean squared error as:

TMSE(β̂) = Σ_{j=1}^{p−1} MSE(β̂j)
         = Σ_{j=1}^{p−1} E[(β̂j − βj)²]
         = Σ_{j=1}^{p−1} { V(β̂j) + [E(β̂j) − βj]² }
         = Trace(V) + Σ_{j=1}^{p−1} [E(β̂j) − βj]² ,

where V is the variance–covariance matrix of β̂.

DOMAINS AND LIMITATIONS
Unfortunately, when we want to calculate the total mean squared error

TMSE(β̂) = Σ_{j=1}^{p−1} E[(β̂j − βj)²]

of a vector of estimators β̂ = (β̂1, . . . , β̂p−1) for the parameters of the model

Yi = β0 + β1 Xi1s + . . . + βp−1 Xip−1s + εi ,

we need to know the values of the model parameters βj, which are obviously unknown for all real data. The notation Xjs means that the data for the jth explanatory variable were standardized (see standardized data).
Therefore, in order to estimate these TMSE we generate data from a structurally similar
model. Using a generator of pseudo-random numbers, we can simulate all of the data for the artificial model, analyze it with different regression models (such as simple regression or ridge regression), and calculate what we call the total squared error, TSE, of the vectors of the estimators β̂ related to each method:

TSE(β̂) = Σ_{j=1}^{p−1} (β̂j − βj)² .

We repeat this operation 100 times, ensuring that the 100 data sets are pseudo-independent. For each model, the average of the 100 TSE values gives a good estimation of the TMSE. Note that some statisticians prefer the model obtained by selecting variables, due to its simplicity. On the other hand, other statisticians prefer the ridge method because it uses all of the available information.

EXAMPLES
We can generally compare the following regression methods: linear regression by least squares, ridge regression, or the variable selection method.
In the following example, thirteen portions of cement have been examined. Each portion is composed of four ingredients, given in the table. The aim is to determine how the quantities xi1, xi2, xi3 and xi4 of these four ingredients influence the quantity yi, the heat given out due to the hardening of the cement.

Heat given out by the cement
Portion         Ingredient                     Heat
i        xi1    xi2    xi3    xi4              yi
1        7      26     6      60               78.5
2        1      29     15     52               74.3
3        11     56     8      20               104.3
4        11     31     8      47               87.6
5        7      52     6      33               95.9
6        11     55     9      22               109.2
7        3      71     17     6                102.7
8        1      31     22     44               72.5
9        2      54     18     22               93.1
10       21     47     4      26               115.9
11       1      40     23     34               83.9
12       11     66     9      12               113.3
13       10     68     8      12               109.4

yi   quantity of heat given out due to the hardening of the ith portion (in joules);
xi1  quantity of ingredient 1 (tricalcium aluminate) in the ith portion;
xi2  quantity of ingredient 2 (tricalcium silicate) in the ith portion;
xi3  quantity of ingredient 3 (tetracalcium alumino-ferrite) in the ith portion;
xi4  quantity of ingredient 4 (dicalcium silicate) in the ith portion.

In this paragraph we will compare the estimators obtained by least squares (LS) regression with those obtained by ridge regression (R) and those obtained with the variable selection method (SV) via the total mean squared error TMSE. The three estimation vectors obtained from each method are:

ŶLS = 95.4 + 9.12 X1s + 7.94 X2s + 0.65 X3s − 2.41 X4s
ŶR  = 95.4 + 7.64 X1s + 4.67 X2s − 0.91 X3s − 5.84 X4s
ŶSV = 95.4 + 8.64 X1s + 10.3 X2s

We note here that all of the estimations were obtained from standardized explanatory variables.
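A rough sketch of the simulation described in the DOMAINS AND LIMITATIONS section above is given below, for the least-squares estimator only (not part of the original entry). It assumes that Xs holds the 13 × 4 matrix of standardized ingredient quantities; the constants 95.4, 9.12, . . . , −2.41 and σ = 2.446 are those used later in this example, and the ridge and variable-selection fits would be plugged into the same loop.

```python
import numpy as np

rng = np.random.default_rng(0)
beta_true = np.array([9.12, 7.94, 0.65, -2.41])   # coefficients of the generating equation
sigma, n_rep = 2.446, 100

def tse_least_squares(Xs, y_sim):
    # Total squared error of the least-squares estimate for one simulated sample.
    A = np.column_stack([np.ones(len(y_sim)), Xs])
    beta_hat = np.linalg.lstsq(A, y_sim, rcond=None)[0][1:]   # drop the intercept
    return float(((beta_hat - beta_true) ** 2).sum())

def estimate_tmse_ls(Xs):
    tses = []
    for _ in range(n_rep):
        y_sim = 95.4 + Xs @ beta_true + rng.normal(0, sigma, size=len(Xs))
        tses.append(tse_least_squares(Xs, y_sim))
    return np.mean(tses)   # the average of the 100 TSE values estimates the TMSE
```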
We compare these three estimations using the total mean squared errors of the three vectors:

β̂LS = (β̂LS1, β̂LS2, β̂LS3, β̂LS4)
β̂R  = (β̂R1, β̂R2, β̂R3, β̂R4)
β̂SV = (β̂SV1, β̂SV2, β̂SV3, β̂SV4) .

Here the subscript LS corresponds to the method of least squares, R to the ridge method and SV to the variable selection method, respectively. For this latter method, the estimations for the coefficients of the unselected variables in the model are considered to be zero. In our case we have:

β̂LS = (9.12, 7.94, 0.65, −2.41)
β̂R  = (7.64, 4.67, −0.91, −5.84)
β̂SV = (8.64, 10.3, 0, 0) .

We have chosen to approximate the underlying process that results in the cement data by the following least squares equation:

YiLS = 95.4 + 9.12 Xi1s + 7.94 Xi2s + 0.65 Xi3s − 2.41 Xi4s + εi .

The procedure consists of generating 13 random error terms ε1, . . . , ε13 100 times, based on a normal distribution with mean 0 and standard deviation 2.446 (recall that 2.446 is the least squares estimator of σ for the cement data). We then calculate Y1LS, . . . , Y13LS using the Xijs values in the data table. In this way, we generate 100 samples Y1LS, . . . , Y13LS from the 100 samples ε1, . . . , ε13.
The three methods are applied to each of these 100 samples Y1LS, . . . , Y13LS (always using the same values for Xijs), which yields 100 estimators β̂LS, β̂R and β̂SV. Note that these three methods are applied to these 100 samples without any influence from the results from the equations

ŶiLS = 95.4 + 9.12 Xi1s + 7.94 Xi2s + 0.65 Xi3s − 2.41 Xi4s ,
ŶiR  = 95.4 + 7.64 Xi1s + 4.67 Xi2s − 0.91 Xi3s − 5.84 Xi4s and
ŶiSV = 95.4 + 8.64 Xi1s + 10.3 Xi2s

obtained for the original sample. Despite the fact that the variable selection method chose the variables Xi1 and Xi2 in ŶiSV = 95.4 + 8.64 Xi1s + 10.3 Xi2s, it is possible that, for one of these 100 samples, the method selects instead the variables Xi2 and Xi3, or only Xi3, or any other subset of the four available variables. In the same way, despite the fact that the equation ŶiR = 95.4 + 7.64 Xi1s + 4.67 Xi2s − 0.91 Xi3s − 5.84 Xi4s was obtained with k = 0.157, the value of k is recalculated for each of these 100 samples (for more details refer to the ridge regression example).
From these 100 estimations of β̂LS, β̂SV and β̂R, we can calculate 100 TSE values for each method, which we label as:

TSELS = (β̂LS1 − 9.12)² + (β̂LS2 − 7.94)² + (β̂LS3 − 0.65)² + (β̂LS4 + 2.41)²
TSER  = (β̂R1 − 9.12)² + (β̂R2 − 7.94)² + (β̂R3 − 0.65)² + (β̂R4 + 2.41)²
TSESV = (β̂SV1 − 9.12)² + (β̂SV2 − 7.94)² + (β̂SV3 − 0.65)² + (β̂SV4 + 2.41)² .

The means of the 100 values of TSELS, TSER and TSESV are the estimations TMSELS, TMSER and TMSESV: the TMSE estimations for the three methods considered.
After this simulation was performed, the following estimations were obtained:

TMSELS = 270 ,
TMSER  = 75 ,
TMSESV = 166 .

These give the following differences:

TMSELS − TMSESV = 104 ,
TMSELS − TMSER  = 195 ,
TMSESV − TMSER  = 91 .

Since the standard deviations of the 100 observed differences are respectively 350, 290 and 280, we can calculate the approximate 95% confidence intervals for the differences between the TMSEs of two methods:

TMSELS − TMSESV = 104 ± 2 · 350 / √100 ,
TMSELS − TMSER  = 195 ± 2 · 290 / √100 ,
TMSESV − TMSER  = 91 ± 2 · 280 / √100 .

We get

34 < TMSELS − TMSESV < 174 ,
137 < TMSELS − TMSER < 253 ,
35 < TMSESV − TMSER < 147 .

We can therefore conclude (at least for the particular model used to generate the simulated data, and taking into account our aim of obtaining a small TMSE) that the ridge method is the best of the methods considered, followed by the variable selection procedure.

FURTHER READING
 Bias
 Expected value
 Hat matrix
 Mean squared error
 Ridge regression
 Standardized data
 Variance
 Variance–covariance matrix
 Weighted least-squares method

REFERENCES
Box, G.E.P., Draper, N.R.: A basis for the selection of response surface design. J. Am. Stat. Assoc. 54, 622–654 (1959)
Dodge, Y.: Analyse de régression appliquée. Dunod, Paris (1999)

Critical Value
In hypothesis testing, the critical value is the limit value at which we take the decision to reject the null hypothesis H0, for a given significance level.

HISTORY
The concept of a critical value was introduced by Neyman, Jerzy and Pearson, Egon Sharpe in 1928.

MATHEMATICAL ASPECTS
The critical value depends on the type of test used (two-sided test, or one-sided test on the right or the left), the probability distribution and the significance level α.
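In practice, the critical values that would otherwise be read from printed statistical tables can be obtained from a statistical library. The short Python sketch below is my own illustration (not part of the original entry) and uses scipy.stats.

```python
from scipy import stats

alpha = 0.05
z_two_sided = stats.norm.ppf(1 - alpha / 2)      # 1.96, two-sided test on a normal statistic
z_right = stats.norm.ppf(1 - alpha)              # 1.645, one-sided test on the right
t_crit = stats.t.ppf(1 - alpha / 2, df=11)       # Student table, 11 degrees of freedom
f_crit = stats.f.ppf(1 - alpha, dfn=2, dfd=11)   # Fisher table: F(2,11;0.05) ≈ 3.98
chi2_crit = stats.chi2.ppf(1 - alpha, df=4)      # chi-square table, 4 degrees of freedom
print(round(z_two_sided, 2), round(f_crit, 2))
```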
DOMAINS AND LIMITATIONS
The critical value is determined from the probability distribution of the statistic associated with the test. It is found by consulting the statistical table corresponding to this probability distribution (normal table, Student table, Fisher table, chi-square table, etc.).

EXAMPLES
A company produces steel cables. Using a sample of size n = 100, it wants to verify whether the diameters of the cables conform closely enough to the required diameter of 0.9 cm.
The standard deviation σ of the population is known and equals 0.05 cm.
In this case, hypothesis testing involves a two-sided test. The hypotheses are the following:

null hypothesis H0: μ = 0.9
alternative hypothesis H1: μ ≠ 0.9 .

At a significance level of α = 5%, by looking at the normal table we find that the critical value z(α/2) equals 1.96.

FURTHER READING
 Confidence interval
 Hypothesis testing
 Significance level
 Statistical table

REFERENCES
Neyman, J., Pearson, E.S.: On the use and interpretation of certain test criteria for purposes of statistical inference, Parts I and II. Biometrika 20A, 175–240, 263–294 (1928)

Cyclical Fluctuation
Cyclical fluctuations are oscillations that occur over long periods about the secular trend line or curve of a time series.

HISTORY
See time series.

MATHEMATICAL ASPECTS
Consider Yt, a time series given by its components; Yt can be written as:
• Yt = Tt · St · Ct · It (multiplicative model), or
• Yt = Tt + St + Ct + It (additive model),
where
Yt is the data at time t;
Tt is the secular trend at time t;
St is the seasonal variation at time t;
Ct is the cyclical fluctuation at time t; and
It is the irregular variation at time t.
The first step when investigating a time series is always to determine the secular trend Tt, and then to determine the seasonal variation St. It is then possible to adjust the initial data of the time series Yt according to these two components:
• Yt / (St · Tt) = Ct · It (multiplicative model);
• Yt − St − Tt = Ct + It (additive model).
To bring out the cyclical fluctuations, a weighted moving average taken over only a few months is then used: moving averages smooth out the irregular variations It while preserving the cyclical fluctuations Ct.
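The smoothing just described can be written compactly; the following Python sketch (my own function, not part of the original entry) applies a five-term weighted moving average to a series of adjusted values Ct · It, using the weights introduced in the example further on.

```python
def cyclical_fluctuation(x, weights=(-0.1, 0.3, 0.6, 0.3, -0.1)):
    # x[t] holds the adjusted values C_t * I_t; the weights sum to 1,
    # so no normalization is needed.
    half = len(weights) // 2
    return [sum(w * x[t + j] for j, w in zip(range(-half, half + 1), weights))
            for t in range(half, len(x) - half)]

# Example, using the first months of the 1952 series given below:
# cyclical_fluctuation([99.9, 100.4, 100.2, 99.0, 98.1])[0]  ->  approx. 100.1
```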
The choice of a weighted moving average allows us to give more weight to the central values compared to the extreme values, in order to reproduce the cyclical fluctuations more accurately. Therefore, large weights are given to the central values and small weights to the extreme values.
For example, for a moving average taken over an interval of five months, the weights −0.1, 0.3, 0.6, 0.3 and −0.1 can be used; since their sum is 1, there is no need for normalization.
If the values of Ct · It (resulting from the adjustments performed with respect to the secular trend and to the seasonal variations) are denoted by Xt, the value of the cyclical fluctuation for the month t is determined by:

Ct = −0.1 · Xt−2 + 0.3 · Xt−1 + 0.6 · Xt + 0.3 · Xt+1 − 0.1 · Xt+2 .

DOMAINS AND LIMITATIONS
Estimating cyclical fluctuations allows us to:
• Determine the maxima or minima that a time series can attain.
• Perform short- or medium-term forecasting.
• Identify the cyclical components.
The limitations and advantages of the use of weighted moving averages when evaluating cyclical fluctuations are the following:
• Weighted moving averages can smooth a curve with cyclical fluctuations while still retaining most of the original fluctuation, because they preserve the amplitudes of the cycles in an accurate way.
• The use of an odd number of months to establish the moving average facilitates better centering of the values obtained.
• It is difficult to study the cyclical fluctuations of a time series because the cycles usually vary in length and amplitude. This is due to the presence of a multitude of factors, where the effects of these factors can change from one cycle to the other. None of the models used to explain and predict such fluctuations has been found to be completely satisfactory.

EXAMPLES
Let us establish a moving average taken over five months of data adjusted according to the secular trend and seasonal variations. Let us also use the weights −0.1, 0.3, 0.6, 0.3, −0.1; since their sum is equal to 1, there is no need for normalization.
Let Xi be the adjusted values of Ci · Ii:

Ci = −0.1 · Xi−2 + 0.3 · Xi−1 + 0.6 · Xi + 0.3 · Xi+1 − 0.1 · Xi+2 .

The table below shows the electrical power consumed by street lights every month, in millions of kilowatt hours, during the years 1952 and 1953. The data have been adjusted according to the secular trend and seasonal variations.

Year    Month   Data Xi   Moving average for 5 months Ci
1952    J       99.9
        F       100.4
        M       100.2     100.1
        A       99.0      99.1
        M       98.1      98.4
        J       99.0      98.7
        J       98.5      98.3
        A       97.8      98.3
        S       100.3     99.9
        O       101.1     101.3
        N       101.2     101.1
        D       100.4     100.7
Year    Month   Data Xi   Moving average for 5 months Ci
1953    J       100.6     100.3
        F       100.1     100.4
        M       100.5     100.1
        A       99.2      99.5
        M       98.9      98.7
        J       98.2      98.2
        J       98.4      98.5
        A       99.7      99.5
        S       100.4     100.5
        O       101.0     101.0
        N       101.1
        D       101.1

No significant cyclical effects appear in these data, although non-significant effects do not mean no effects: the beginning of an economic cycle is often sought, but this only appears every 20 years.

FURTHER READING
 Forecasting
 Irregular variation
 Moving average
 Seasonal variation
 Secular trend
 Time series

REFERENCES
Box, G.E.P., Jenkins, G.M.: Time Series Analysis: Forecasting and Control (Series in Time Series Analysis). Holden Day, San Francisco (1970)
Wold, H.O. (ed.): Bibliography on Time Series and Stochastic Processes. Oliver & Boyd, Edinburgh (1965)