        Yes    No
Votes   8546   5455

REFERENCES
See analysis of categorical data.
Category
A category represents a set of people or objects that have a common characteristic.
If we want to study the people in a population, we can sort them into “natural” categories, by gender (men and women) for example, or into categories defined by other criteria, such as vocation (managers, secretaries, farmers, . . .).

FURTHER READING
Binary data
Categorical data
Dichotomous variable
Population
Random variable
Variable

Cauchy Distribution

[Figure: Cauchy distribution, θ = 1, α = 0]

MATHEMATICAL ASPECTS
The expected value E[X] and the variance Var(X) do not exist.
If α = 0 and θ = 1, the Cauchy distribution is identical to the Student distribution with one degree of freedom.

DOMAINS AND LIMITATIONS
Its importance in physics is mainly due to the fact that it is the solution to the differential equation describing forced resonance.
MATHEMATICAL ASPECTS
See cause and effect in epidemiology, odds and odds ratio, relative risk, attributable risk, avoidable risk, incidence rate, prevalence rate.

DOMAINS AND LIMITATIONS
Studies of the relationship between tobacco smoking and the development of lung cancers, and of the relationship between HIV and AIDS, are examples of causal epidemiology. Research into a causal relation is often very complex, requiring many studies and the incorporation and combination of various data sets, from biological and animal experiments to clinical trials.
While causes cannot always be identified precisely, a knowledge of the risk factors associated with an illness, and therefore of the groups of people at risk, allows us to intervene with preventative measures that could preserve health.

EXAMPLES
As an example, we can investigate the relationship between smoking and the development of lung cancer. Consider a study of 2000 subjects: 1000 smokers and 1000 nonsmokers. The age distributions and male/female proportions are identical in both groups. Let us analyze a summary of data obtained over many years. The results are presented in the following table:

                                  Smokers   Nonsmokers
Number of cases of
lung cancer (/1000)               50        10
Proportion of individuals
in this group that
contracted lung cancer            5%        1%

If we compare the proportion of smokers that contract lung cancer to the proportion of nonsmokers that do, we get 5%/1% = 5, so we can conclude that the risk of developing lung cancer is five times higher in smokers than in nonsmokers. We generally also evaluate the significance of this result by computing the p-value. Suppose that the chi-square test gave p < 0.001. It is normally accepted that if the p-value is smaller than 0.05, then the results obtained are statistically significant.
Suppose that we perform the same study, but the size of each group (smokers and nonsmokers) is 100 instead of 1000, and we observe the same proportions of people that contract lung cancer:

                                  Smokers   Nonsmokers
Number of cases of
lung cancer (/100)                5         1
Proportion of individuals
in this group that
contracted lung cancer            5%        1%

If we perform the same statistical test, the p-value is found to be 0.212. Since this is greater than 0.05, we cannot draw any solid conclusion about the existence of a statistically significant relation between smoking and lung cancer. This illustrates that, in order to obtain a statistically significant result, the sample must be sufficiently large.
… of an extremely detailed record. This treatise was written by Kautilya, Prime Minister in the reign of Chandragupta Maurya (313–289 BC).
Another very important civilization, the Incans, also used censuses. They used a statistics system called “quipos”. Each quipo was both an instrument and a registry of information. Formed from a series of cords, the colors, combinations and knots on the cords had precise meanings. The quipos were passed to specially initiated guards that gathered together all of these statistics.
In Europe, the ancient Greeks and Romans also practiced censuses. Aristotle reported that the Greeks donated a measure of wheat per birth and a measure of barley per death to the goddess Athéna. In Rome, the first census was performed at the behest of King Servius Tullius (578–534 BC) in order to monitor revenues, and consequently raise taxes.
Later, depending on the country, censuses were practiced with different frequencies and on different scales. In 786, Charlemagne ordered a count of all of his subjects over twelve years old; population counts were also initiated in Italy in the twelfth century; many cities performed censuses of their inhabitants in the fifteenth century, including Nuremberg (in 1449) and Strasbourg (in 1470). In the sixteenth century, France initiated marital status registers.
The seventeenth century saw the development of three different schools of thought: a German school associated with descriptive statistics, a French school associated with census ideology and methodology, and an English school that led to modern statistics.
In the history of censuses in Europe, there is a country that occupies a special place. In 1665, Sweden initiated registers of parishioners that were maintained by pastors; in 1668, a decree made it obligatory to be counted in these registers, and instead of being religious, the registers became administrative. 1736 saw the appearance of another decree, stating that the governor of each province had to report any changes in the population of the province to parliament. Swedish population statistics were officially recognized on the 3rd of February 1748 due to the creation of the “Tabellverket” (administrative tables). The first summary for all Swedish provinces, realized in 1749, can be considered to be the first proper census in Europe, and the 11th of November 1756 marked the creation of the “Tabellkommissionen,” the first official Division of Statistics. Since 1762, these tables of figures have been maintained by the Academy of Sciences.
Initially, the Swedish censuses were organized annually (1749–1751), and then every three years (1754–1772), but since 1775 they have been conducted every five years.
At the end of the eighteenth century, an official institute for censuses was created in France (in 1791). In 1787, the principle of the census was registered in the Constitutional Charter of the USA (C. C. USA).
See also official statistics.

FURTHER READING
Data collection
Demography
Official statistics
Population
Sample
Survey

REFERENCES
Hecht, J.: L’idée du dénombrement jusqu’à la Révolution. Pour une histoire de la statistique, tome 1, pp. 21–82. Economica/INSEE (1978)
Central Limit Theorem

The central limit theorem is a fundamental theorem of statistics. In its simplest form, it prescribes that the sum of a sufficiently large number of independent, identically distributed random variables approximately follows a normal distribution.

HISTORY
The central limit theorem was first established within the framework of the binomial distribution by Moivre, Abraham de (1733). Laplace, Pierre Simon de (1810) formulated the proof of the theorem. Poisson, Siméon Denis (1824) also worked on this theorem, and Chebyshev, Pafnutii Lvovich (1890–1891) gave a rigorous demonstration of it in the middle of the nineteenth century.
At the beginning of the twentieth century, the Russian mathematician Liapounov, Aleksandr Mikhailovich (1901) created the generally recognized form of the central limit theorem by introducing its characteristic functions. Markov, Andrei Andreevich (1908) also worked on it and was the first to generalize the theorem to the case of dependent variables.
According to Le Cam, L. (1986), the qualifier “central” was given to it by George Polyà (1920) due to the essential role that it plays in probability theory.

MATHEMATICAL ASPECTS
Let X1, X2, ..., Xn be n independent random variables that are identically distributed (with any distribution) with a mean μ and a finite variance σ².
We define the sum Sn = X1 + X2 + ... + Xn and we establish the ratio

$$\frac{S_n - n\mu}{\sigma\sqrt{n}},$$

where nμ and σ√n represent the mean and the standard deviation of Sn, respectively.
The central limit theorem establishes that the distribution of this ratio tends to the standard normal distribution when n tends to infinity. This means that:

$$P\left(\frac{S_n - n\mu}{\sigma\sqrt{n}} \le x\right) \xrightarrow[n \to +\infty]{} \Phi(x),$$

where Φ(x) is the distribution function of the standard normal distribution, expressed by:

$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} \exp\left(-\frac{t^2}{2}\right) dt, \qquad -\infty < x < \infty.$$

DOMAINS AND LIMITATIONS
The central limit theorem provides a simple method of approximately calculating probabilities related to the sums of random variables.
Besides its interest in relation to the sampling theorem, where sums and means play an important role, the central limit theorem is used to approximate, by the normal distribution, distributions derived from summing identical distributions. We can, for example, with the help of the central limit theorem, use the normal distribution to approximate the binomial distribution, the Poisson distribution, the gamma distribution, the chi-square distribution, the Student distribution, the hypergeometric distribution, the Fisher distribution and the lognormal distribution.
EXAMPLES
In a large batch of electrical items, the probability of choosing a defective item equals p = 1/8. What is the probability that 4 defective items are chosen when 25 items are selected?
Let X, the dichotomous variable, be the result of a trial:

$$X = \begin{cases} 1 & \text{if the selected item is defective;} \\ 0 & \text{if it is not.} \end{cases}$$

The number of defective items among the 25 selected, Sn = X1 + X2 + ... + X25, follows a binomial distribution, which the central limit theorem allows us to approximate by a normal distribution (with a continuity correction at 4.5). We have:

$$z = \frac{S_n - np}{\sqrt{np(1-p)}} = \frac{4.5 - 3.125}{1.654} = 0.832.$$

From the normal table, we obtain the probability:

P(Z > z) = P(Z > 0.832) …
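As a quick check of the computation above (this sketch is not part of the original entry; Python with NumPy/SciPy is assumed):

```python
# Normal approximation to the binomial, matching the example above.
from scipy.stats import binom, norm
import math

n, p = 25, 1/8
mu = n * p                         # 3.125
sigma = math.sqrt(n * p * (1-p))   # ≈ 1.654

# Tail beyond the continuity-corrected cutoff 4.5, as in the text:
z = (4.5 - mu) / sigma             # ≈ 0.832
approx = 1 - norm.cdf(z)           # P(Z > 0.832) ≈ 0.203

# Exact binomial tail for comparison
exact = 1 - binom.cdf(4, n, p)
print(f"z = {z:.3f}, approx = {approx:.3f}, exact = {exact:.3f}")
```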
REFERENCES
Liapounov, A.M.: Sur une proposition de la théorie des probabilités. Bulletin de l’Academie Imperiale des Sciences de St.-Petersbourg 8, 1–24 (1900)
Markov, A.A.: Extension des théorèmes limites du calcul des probabilités aux sommes des quantités liées en chaîne. Mem. Acad. Sci. St. Petersburg 8, 365–397 (1908)
Moivre, A. de: Approximatio ad summam terminorum binomii (a + b)^n, in seriem expansi. Supplementum II to Miscellanae Analytica, pp. 1–7 (1733). Photographically reprinted in a rare pamphlet on Moivre and some of his discoveries. Published by Archibald, R.C. Isis 8, 671–683 (1926)
Poisson, S.D.: Sur la probabilité des résultats moyens des observations. Connaissance des temps pour l’an 1827, pp. 273–302 (1824)
Polyà, G.: Ueber den zentralen Grenzwertsatz der Wahrscheinlichkeitsrechnung und das Momentproblem. Mathematische Zeitschrift 8, 171–181 (1920)
Tchebychev, P.L.: Sur deux théorèmes relatifs aux probabilités. Acta Math. 14, 305–315 (1890–1891)

Chebyshev, Pafnutii Lvovich

Chebyshev, Pafnutii Lvovich (1821–1894) began studying at Moscow University in 1837, where he was influenced by Zernov, Nikolai Efimovich (the first Russian to get a doctorate in mathematical sciences) and Brashman, Nikolai Dmitrievich. After gaining his degree he could not find any teaching work in Moscow, so he went to St. Petersbourg, where he organized conferences on algebra and probability theory. In 1859, he took over the probability course given by Buniakovsky, Viktor Yakovlevich at St. Petersbourg University.
His name lives on through the Chebyshev inequality (also known as the Bienaymé–Chebyshev inequality), which he proved. This was published in French just after Bienaymé, Irénée-Jules had an article published on the same topic in the Journal de Mathématiques Pures et Appliquées (also called the Journal de Liouville). He initiated rigorous work into establishing a general version of the central limit theorem and is considered to be the founder of the mathematical school of St. Petersbourg.

Some principal works and articles of Chebyshev, Pafnutii Lvovich:
1845 An Essay on Elementary Analysis of the Theory of Probabilities (thesis)
… Crelle’s Journal.
1867 Preuve de l’inégalité de Tchebychev. J. Math. Pure. Appl. 12, 177–184.

FURTHER READING
Central limit theorem

REFERENCES
Heyde, C.E., Seneta, E.: I.J. Bienaymé. Statistical Theory Anticipated. Springer, Berlin Heidelberg New York (1977)

Chebyshev’s Inequality

See law of large numbers.

Chi-Square Distance

Consider a frequency table with n rows and p columns; it is possible to calculate row profiles and column profiles. Let us then plot the n or p points from each profile. We can define distances between these points. The Euclidean distance between the components of the profiles, on which a weighting is defined (each term has a weight that is the inverse of its frequency), is called the chi-square distance. The name of the distance is derived from the fact that the mathematical expression defining the distance is identical to that encountered in the elaboration of the chi-square goodness of fit test.

MATHEMATICAL ASPECTS
Let fij be the frequency of the ith row and jth column in a frequency table with n rows and p columns. The chi-square distance between two rows i and i′ is given by the formula:

$$d(i, i') = \sqrt{\sum_{j=1}^{p} \frac{1}{f_{.j}} \left( \frac{f_{ij}}{f_{i.}} - \frac{f_{i'j}}{f_{i'.}} \right)^2},$$

where
fi. is the sum of the components of the ith row;
f.j is the sum of the components of the jth column;
fij/fi., for j = 1, 2, ..., p, is the ith row profile.
Likewise, the distance between two columns j and j′ is given by:

$$d(j, j') = \sqrt{\sum_{i=1}^{n} \frac{1}{f_{i.}} \left( \frac{f_{ij}}{f_{.j}} - \frac{f_{ij'}}{f_{.j'}} \right)^2},$$

where fij/f.j, for i = 1, ..., n, is the jth column profile.

DOMAINS AND LIMITATIONS
The chi-square distance incorporates a weight that is inversely proportional to the total of each row (or column), which increases the importance of small deviations in the rows (or columns) that have a small sum with respect to those with larger sums.
The chi-square distance has the property of distributional equivalence, meaning that it ensures that the distances between rows and columns are invariant when two columns (or two rows) with identical profiles are aggregated.

EXAMPLES
Consider a contingency table charting how satisfied employees working for three different businesses are. Let us establish a distance table using the chi-square distance.
Values for the studied variable X can fall into one of three categories:
• X1: high satisfaction;
• X2: medium satisfaction;
• X3: low satisfaction.
The observations collected from samples of individuals from the three businesses are given below:

        Business 1  Business 2  Business 3  Total
X1      20          55          30          105
X2      18          40          15          73
X3      12          5           5           22
Total   50          100         50          200

The relative frequency table is obtained by dividing all of the elements of the table by 200, the total number of observations:

        Business 1  Business 2  Business 3  Total
X1      0.1         0.275       0.15        0.525
X2      0.09        0.2         0.075       0.365
X3      0.06        0.025       0.025       0.11
Total   0.25        0.5         0.25        1
We can calculate the difference in employee satisfaction between the three businesses. The column profile matrix is given below:

        Business 1  Business 2  Business 3  Total
X1      0.4         0.55        0.6         1.55
X2      0.36        0.4         0.3         1.06
X3      0.24        0.05        0.1         0.39
Total   1           1           1           3

This allows us to calculate the distances between the different columns:

$$d^2(1,2) = \frac{1}{0.525}(0.4 - 0.55)^2 + \frac{1}{0.365}(0.36 - 0.4)^2 + \frac{1}{0.11}(0.24 - 0.05)^2 = 0.375423,$$

d(1, 2) = 0.613.

We can calculate d(1, 3) and d(2, 3) in a similar way. The distances obtained are summarized in the following distance table:

             Business 1  Business 2  Business 3
Business 1   0           0.613       0.514
Business 2   0.613       0           0.234
Business 3   0.514       0.234       0

The row profiles (each row divided by its total) are used in the same manner; for example, the profile of X1 is (0.19, 0.524, 0.286) and that of X2 is (0.246, 0.548, 0.206). This allows us to calculate the distances between the different rows:

$$d^2(1,2) = \frac{1}{0.25}(0.19 - 0.246)^2 + \frac{1}{0.5}(0.524 - 0.548)^2 + \frac{1}{0.25}(0.286 - 0.206)^2 = 0.039296,$$

d(1, 2) = 0.198.

We can calculate d(1, 3) and d(2, 3) in a similar way. The differences between the degrees of employee satisfaction are finally summarized in the following distance table:

     X1     X2     X3
X1   0      0.198  0.835
X2   0.198  0      0.754
X3   0.835  0.754  0

FURTHER READING
Contingency table
Distance
Distance table
Frequency
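As an illustration (not part of the original entry), the distances of the example above can be checked with a short Python sketch; NumPy is assumed, and small rounding differences are expected because the text works with rounded profiles:

```python
# Chi-square distances between rows and columns of a frequency table.
import numpy as np

F = np.array([[20, 55, 30],
              [18, 40, 15],
              [12,  5,  5]]) / 200.0   # relative frequencies

row_tot = F.sum(axis=1)   # f_i.
col_tot = F.sum(axis=0)   # f_.j

def chi2_row_distance(i, k):
    prof_i, prof_k = F[i] / row_tot[i], F[k] / row_tot[k]
    return np.sqrt(np.sum((prof_i - prof_k) ** 2 / col_tot))

def chi2_col_distance(j, l):
    prof_j, prof_l = F[:, j] / col_tot[j], F[:, l] / col_tot[l]
    return np.sqrt(np.sum((prof_j - prof_l) ** 2 / row_tot))

print(round(chi2_col_distance(0, 1), 3))  # ≈ 0.613
print(round(chi2_row_distance(0, 1), 3))  # ≈ 0.199 (0.198 in the text)
```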
Chi-Square Distribution

The chi-square distribution is a continuous probability distribution.

[Figure: χ² distribution, ν = 12]

HISTORY
According to Sheynin (1977), the chi-square distribution was discovered by Ernst Karl Abbe in 1863. Maxwell obtained it for three degrees of freedom a few years before (1860), and Boltzman discovered the general case in 1881.
However, according to Lancaster (1966), Bienaymé obtained the chi-square distribution in 1838 as the limit of the discrete random variable

$$\sum_{i=1}^{k} \frac{(n_i - np_i)^2}{np_i},$$

if (N1, N2, ..., Nk) follow a joint multinomial distribution with parameters n, p1, p2, ..., pk.
Ellis demonstrated in 1844 that the sum of k random variables distributed according to a chi-square distribution with two degrees of freedom follows a chi-square distribution with 2k degrees of freedom. The general result was demonstrated in 1852 by Bienaymé.
The works of Pearson, Karl are very important in this field. In 1900 he used the chi-square distribution to approximate the chi-square statistic used in different tests based on contingency tables.

MATHEMATICAL ASPECTS
The chi-square distribution appears in the theory of random variables distributed according to a normal distribution. In this context, it is the distribution of the sum of squares of normal, centered and reduced random variables (with a mean equal to 0 and a variance equal to 1).
Consider Z1, Z2, ..., Zn, n independent, standard normal random variables. Their sum of squares

$$X = Z_1^2 + Z_2^2 + \ldots + Z_n^2 = \sum_{i=1}^{n} Z_i^2$$

is a random variable distributed according to a chi-square distribution with n degrees of freedom.
The expected value of the chi-square distribution is given by:

E[X] = n.

The variance is equal to:

Var(X) = 2n.

The chi-square distribution is related to other continuous probability distributions:
• The chi-square distribution is a particular case of the gamma distribution.
• If two random variables X1 and X2 follow a chi-square distribution with, respectively, n1 and n2 degrees of freedom, then the random variable

Y = (X1/n1) / (X2/n2)

follows a Fisher distribution with n1 and n2 degrees of freedom.
• When the number of degrees of freedom n tends towards infinity, the chi-square distribution tends (relatively slowly) towards a normal distribution.
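A minimal simulation sketch of the defining property above (not part of the original entry; Python with NumPy assumed):

```python
# A sum of n squared standard normal variables should have
# mean n and variance 2n.
import numpy as np

rng = np.random.default_rng(seed=0)
n = 12                                   # degrees of freedom, as in the figure
X = (rng.standard_normal((100_000, n)) ** 2).sum(axis=1)

print(X.mean())   # ≈ 12
print(X.var())    # ≈ 24
```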
4. Obtain the expected frequencies for every class:

e_i = n · p_i, i = 1, ..., k,

where n is the size of the sample.
5. Calculate the χ² (chi-square) statistic:

$$\chi^2 = \sum_{i=1}^{k} \frac{(n_i - e_i)^2}{e_i}.$$

If H0 is true, the χ² statistic follows a chi-square distribution with v degrees of freedom, where …
The test requires that the estimated frequencies ei are not too small: we normally state that the estimated frequencies must be greater than 5, except for extreme classes, where they can be smaller than 5 but greater than 1. If this constraint is not satisfied, we must regroup the classes in order to satisfy this rule.

EXAMPLES
Goodness of Fit to the Binomial Distribution
We throw a coin four times and count the number of times that “heads” appears. …

REFERENCES
Pearson, K.: On the criterion, that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. In: Karl Pearson’s Early Statistical Papers. Cambridge University Press, pp. 339–357 (1948). First published in 1900 in Philos. Mag. (5th Ser) 50, 157–175

… the random variable X for which

P(X ≤ χ²_{v,α}) = 1 − α.

Note: the notation χ² is read “chi-square”.

Case C
If χ² > χ²_{n−1,1−α}, we do not reject the null hypothesis H0. If χ² ≤ χ²_{n−1,1−α}, we reject H0 in favor of the alternative hypothesis H1.
We have a random sample drawn from a normally distributed population where the mean and variance are not known. The vendor of these items would like the variance σ² to be smaller than or equal to 0.05. We test …
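The goodness-of-fit steps above can be illustrated with a short Python sketch (SciPy assumed). The observed frequencies of the original coin-toss example are not reproduced in this extract, so hypothetical simulated data are used instead:

```python
# Chi-square goodness of fit to Binomial(4, 1/2): steps 4 and 5 above.
import numpy as np
from scipy.stats import binom, chisquare

rng = np.random.default_rng(seed=1)
n = 200                                  # sample size (assumption)
heads = rng.binomial(4, 0.5, size=n)     # "four coin tosses", n repetitions
observed = np.bincount(heads, minlength=5)

# Step 4: expected frequencies e_i = n * p_i under Binomial(4, 1/2)
expected = n * binom.pmf(np.arange(5), 4, 0.5)

# Step 5: the chi-square statistic and its p-value
stat, pvalue = chisquare(observed, expected)
print(stat, pvalue)
```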
REFERENCES
Bartlett, M.S.: Some examples of statistical methods of research in agriculture and applied biology. J. Roy. Stat. Soc. (Suppl.) 4, 137–183 (1937)

Chi-Square Test of Independence

MATHEMATICAL ASPECTS
Consider two qualitative categorical variables X and Y. We have a sample containing n observations of these variables. The hypotheses to be tested are:

H0: the two variables are independent;
H1: the two variables are not independent.

Steps Involved in the Test
1. Compute the expected frequencies, denoted by eij, for each case in the contingency table under the independence hypothesis:

$$e_{ij} = \frac{n_{i.} \cdot n_{.j}}{n_{..}}, \qquad n_{i.} = \sum_{k=1}^{c} n_{ik}, \qquad n_{.j} = \sum_{k=1}^{r} n_{kj},$$

where c represents the number of columns (or the number of categories of variable X in the contingency table) and r the number of rows (or the number of categories of variable Y).
2. Calculate the value of the χ² (chi-square) statistic, which is really a measure of the deviation of the observed frequencies nij from the expected frequencies eij:

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(n_{ij} - e_{ij})^2}{e_{ij}}.$$

4. If the calculated χ² is smaller than the χ²_{v,α} value, we do not reject the null hypothesis; otherwise, we reject the null hypothesis for the alternative hypothesis. We can then conclude that the two variables are not independent.

DOMAINS AND LIMITATIONS
Certain conditions must be fulfilled in order to be able to apply the chi-square test of independence:
1. The sample, which contains n observations, must be a random sample;
2. Each individual observation can only appear in one category for each variable. In other words, each individual observation can only appear in one line and one column of the contingency table.
Note that the chi-square test of independence is not very reliable for small samples, especially when the estimated frequencies are small (< 5). To avoid this issue we can group categories together, but only when the groups obtained are sensible.
EXAMPLES
We want to determine whether the proportion of smokers is independent of gender. The two variables to be studied are categorical and qualitative and contain two categories each:
• Variable “gender”: M or F;
• Variable “smoking status”: “smokes” or “does not smoke.”
The hypotheses are then:

H0: the chance of being a smoker is independent of gender;
H1: the chance of being a smoker is not independent of gender.

The contingency table obtained from a sample of 100 individuals (n = 100) is shown below:

                      Smoking status
            “smokes”  “does not smoke”  Total
Gender  M   21        44                65
        F   10        25                35
Total       31        69                100

The expected frequencies are, for example:

e21 = (35 · 31)/100 = 10.85,
e22 = (35 · 69)/100 = 24.15.

The estimated frequency table is given below:

                      Smoking status
            “smokes”  “does not smoke”  Total
Gender  M   20.15     44.85             65
        F   10.85     24.15             35
Total       31        69                100

If the null hypothesis H0 is true, the statistic

$$\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{(n_{ij} - e_{ij})^2}{e_{ij}}$$

is chi-square-distributed with (r − 1)(c − 1) = (2 − 1)(2 − 1) = 1 degree of freedom, and

χ² = 0.036 + 0.016 + 0.066 + 0.030 = 0.148.
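A short check of this example (not part of the original entry; SciPy assumed):

```python
# Chi-square test of independence on the gender/smoking table above.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[21, 44],
                     [10, 25]])   # rows: M, F; columns: smokes, does not smoke

# correction=False gives the plain (uncorrected) chi-square statistic
stat, pvalue, dof, expected = chi2_contingency(observed, correction=False)
print(stat)      # ≈ 0.148
print(dof)       # 1
print(expected)  # [[20.15, 44.85], [10.85, 24.15]]
```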
… ferent groups created during the partition.
Two different kinds of hierarchical techniques exist: agglomerating techniques and dividing techniques. Agglomerating methods start with separated objects, meaning that the n objects are initially distributed into n groups. Two groups are agglomerated in each subsequent step until there is only one group left. In contrast, dividing methods start with all of the objects grouped together, meaning that all of the n objects are in one single group to start with. New groups are created at each subsequent step until there are n groups.
• Geometric classification, in which the objects are depicted on a scatter plot and then grouped according to position on the plot. In a graphical representation, the proximities of the objects to each other in the graphic correspond to the similarities between the objects.
The first three types of classification are generally grouped together under the term cluster analysis. What they have in common is the fact that the objects to be classified must present a certain amount of structure that allows us to measure the degree of similarity between the objects.
Each type of classification contains a multitude of methods that allow us to create classes of similar objects.

DOMAINS AND LIMITATIONS
Classification can be used in two cases:
• Description cases;
• Prediction cases.
In the first case, the classification is done on the basis of some generally accepted standard characteristics. For example, professions can be classified into freelance, managers, workers, and so on, and one can calculate average salaries, average frequency of health problems, and so on, for each class.
In the second case, classification will lead to a prediction and then to an action. For example, if the foxes in a particular region exhibit apathetic behavior and excessive salivation, we can conclude that there is a new rabies epidemic. This should then prompt a vaccine campaign.

FURTHER READING
Cluster analysis
Complete linkage method
Data analysis

REFERENCES
Everitt, B.S.: Cluster Analysis. Halstead, London (1974)
Gordon, A.D.: Classification. Methods for the Exploratory Analysis of Multivariate Data. Chapman & Hall, London (1981)
Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York (1990)
Thorndike, R.L.: Who belongs in a family? Psychometrika 18, 267–276 (1953)
Tryon, R.C.: Cluster Analysis. McGraw-Hill, New York (1939)
Zubin, J.: A technique for measuring like-mindedness. J. Abnorm. Social Psychol. 33, 508–516 (1938)

Cluster Analysis

Clustering is the partitioning of a data set into subsets or clusters, so that the degree of association is strong between members of the same cluster and weak between members of
different clusters, according to some defined distance measure.
Several methods of performing cluster analysis exist:
• Partitional clustering;
• Hierarchical clustering.

HISTORY
See classification and data analysis.

MATHEMATICAL ASPECTS
To carry out cluster analysis on a set of n objects, we need to define a distance between the objects (or, more generally, a measure of the similarity between the objects) that need to be classified. The existence of some kind of structure within the set of objects is assumed.
To carry out a hierarchical classification of a set E of objects {x1, x2, ..., xn}, it is necessary to define a distance associated with E that can be used to obtain a distance table between the objects of E. Similarly, a distance must also be defined for any subsets of E.
One approach to hierarchical clustering is to use the agglomerating method. It can be summarized in the following algorithm:
1. Locate the pair of objects (xi, xj) which have the smallest distance between each other.
2. Aggregate the pair of objects (xi, xj) into a single element α and re-establish a new distance table. This is achieved by suppressing the lines and columns associated with xi and xj and replacing them with a line and a column associated with α. The new distance table will have one line and one column less than the previous table.
3. Repeat these two operations until the desired number of classes is obtained or until all of the objects are gathered into the same class.
Note that the distance between the group formed from aggregated elements and the other elements can be defined in different ways, leading to different methods. Examples include the single link method and the complete linkage method.
The single link method is a hierarchical classification method that uses the Euclidean distance to establish a distance table, and the distance between two classes is given by the Euclidean distance between the two closest elements (the minimum distance). In the complete linkage method, the distance between two classes is given by the Euclidean distance between the two elements furthest away (the maximum distance). Given that the only difference between these two methods is whether the distance between two classes is the minimum or the maximum distance, only the single link method will be considered here.
For a set E = {X1, X2, ..., Xn}, the distance table for the elements of E is then established. Since this table is symmetric and null along its diagonal, only one half of the table is considered:

d(X1, X2)  d(X1, X3)  ...  d(X1, Xn)
           d(X2, X3)  ...  d(X2, Xn)
                      ...
                           d(Xn−1, Xn)

where d(Xi, Xj) is the Euclidean distance between Xi and Xj for i < j, where the values of i and j are between 1 and n.
The algorithm for the single link method is as follows:
• Search for the minimum d(Xi, Xj) for i < j;
The pair selected for aggregation in the first step does not influence later steps (provided the algorithm does not finish at this step), because the other pair of closest elements will be aggregated in the following step.
Consider the following distance table for five students:

         Alain  Jean  Marc  Paul  Pierre
Alain    0      1.22  1.5   2.35  2
Jean     1.22   0     1.8   2.65  2.74
Marc     1.5    1.8   0     1.32  1.12
Paul     2.35   2.65  1.32  0     1.22
Pierre   2      2.74  1.12  1.22  0

Only one half of the table needs to be considered:

         Jean   Marc  Paul  Pierre
Alain    1.22   1.5   2.35  2
Jean            1.8   2.65  2.74
Marc                  1.32  1.12
Paul                        1.22

The minimum distance is 1.12, between Marc and Pierre; we therefore form the first group from these two students. We then calculate the new distances. For example, we calculate the new distance between Marc and Pierre on one side and Alain on the other by taking the minimum of the distance between Marc and Alain and the distance between Pierre and Alain:

d({Marc, Pierre}, Alain) = min{d(Marc, Alain); d(Pierre, Alain)} = min{1.5; 2} = 1.5.

In the resulting table, the minimum distance (1.22) is attained both by the pair Alain and Jean and by the group of Marc and Pierre on the one side and Paul on the other side (in other words, two pairs exhibit the minimum distance); let us choose to regroup Alain and Jean first. The other pair will be aggregated in the next step. We rebuild the distance table and obtain:

d({Alain, Jean}, {Marc, Pierre}) = min{d(Alain, {Marc, Pierre}); d(Jean, {Marc, Pierre})} = min{1.5; 1.8} = 1.5

as well as:

d({Alain, Jean}, Paul) = min{d(Alain, Paul); d(Jean, Paul)} = min{2.35; 2.65} = 2.35.

This gives the following distance table: …
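The single link procedure on the five students can be reproduced with a short Python sketch (SciPy assumed; this is an illustration, not part of the original entry):

```python
# Single linkage clustering on the distance table above.
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

names = ["Alain", "Jean", "Marc", "Paul", "Pierre"]
D = np.array([
    [0,    1.22, 1.5,  2.35, 2   ],
    [1.22, 0,    1.8,  2.65, 2.74],
    [1.5,  1.8,  0,    1.32, 1.12],
    [2.35, 2.65, 1.32, 0,    1.22],
    [2,    2.74, 1.12, 1.22, 0   ],
])

# "single" linkage = minimum distance between clusters
Z = linkage(squareform(D), method="single")
print(Z)  # first merge: Marc and Pierre at distance 1.12
```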
… plete list of clusters because we do not have a complete and up-to-date list of all population units. Therefore, we sample a requisite number of clusters from the list and then question all of the units in the selected clusters.

DOMAINS AND LIMITATIONS
The advantage of cluster sampling is that it is not necessary to have a complete, up-to-date list of all of the units of the population to perform analysis. For example, in many countries there are no updated lists of people or housing. The costs of creating such lists are often prohibitive. It is therefore easier to analyze subsets of the population (known as clusters).
In general, cluster sampling provides estimations that are not as precise as simple random sampling, but this drop in accuracy is easily offset by the far lower cost of cluster sampling. In order to perform cluster sampling as efficiently as possible:
• The clusters should not be too big, and there should be a large enough number of clusters;
• Cluster sizes should be as uniform as possible;
• The individuals belonging to each cluster must be as heterogeneous as possible with respect to the parameter being observed.
Another reason to use cluster sampling is cost. Even when a complete and up-to-date list of all population units exists, it may be preferable to use cluster sampling from an economic point of view, since it is completed faster, involves fewer workers and minimizes transport costs. It is therefore more appropriate to use cluster sampling if the money saved by doing so is far more significant than the increase in sampling variance that will result.

EXAMPLES
Consider N, the size of the population of town X. We want to study the distribution of “Age” for town X without performing a census. The population is divided into G parts. Simple random sampling is performed amongst these G parts and we obtain g parts. The final sample will be composed of all the individuals in the g selected parts.

FURTHER READING
Sampling
Simple random sampling

REFERENCES
Hansen, M.H., Hurwitz, W.N., Madow, M.G.: Sample Survey Methods and Theory. Vol. I. Methods and Applications. Vol. II. Theory. Chapman & Hall, London (1953)

Coefficient of Determination

The coefficient of determination, denoted R², is the quotient of the explained variation (the sum of squares due to regression) to the total variation (the total sum of squares, TSS) in a model of simple or multiple linear regression:

R² = Explained variation / Total variation.

It equals the square of the correlation coefficient, and it can take values between 0 and 1. It is often expressed as a percentage.

HISTORY
See correlation coefficient.
MATHEMATICAL ASPECTS
Consider the linear regression model

$$Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip} + \varepsilon_i, \qquad i = 1, \ldots, n.$$

The total variation decomposes as

$$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 + \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2,$$

that is, TSS = RSS + REGSS, where TSS is the total sum of squares, RSS the residual sum of squares and REGSS the sum of squares of the regression.

EXAMPLES
The following table gives values for the Gross National Product (GNP) and the demand for domestic products covering the 1969–1980 period for a particular country.

Year   GNP X   Demand for domestic products Y
1969   50      6
1970   52      8
1971   55      9
1972   59      10
1973   57      8
1974   58      10
1975   62      12
1976   65      9
1977   68      11
1978   69      10
1979   70      11
1980   72      14

We will try to estimate the demand for domestic products as a function of GNP according to the model

Yi = a + b · Xi + εi, i = 1, ..., 12.

Estimating the parameters a and b by the least squares method yields the following estimators:

$$\hat{b} = \frac{\sum_{i=1}^{12} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{12} (X_i - \bar{X})^2} = 0.226, \qquad \hat{a} = \bar{Y} - \hat{b} \cdot \bar{X} = -4.047.$$

The estimated line is written

Ŷ = −4.047 + 0.226 · X.

The quality of the fit of the measured points to the regression line is given by the coefficient of determination:

$$R^2 = \frac{\text{REGSS}}{\text{TSS}} = \frac{\sum_{i=1}^{12} (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^{12} (Y_i - \bar{Y})^2}.$$

We can calculate the mean using:

$$\bar{Y} = \frac{\sum_{i=1}^{12} Y_i}{n} = 9.833.$$

Yi    Ŷi      (Yi − Ȳ)²   (Ŷi − Ȳ)²
6     7.260   14.694      6.622
8     7.712   3.361       4.500
9     8.390   0.694       2.083
10    9.294   0.028       0.291
8     8.842   3.361       0.983
10    9.068   0.028       0.586
12    9.972   4.694       0.019
9     10.650  0.694       0.667
11    11.328  1.361       2.234
10    11.554  0.028       2.961
11    11.780  1.361       3.789
14    12.232  17.361      4.754
Total         47.667      30.489

We therefore obtain:

R² = 30.489/47.667 = 0.6396,

or, in percent: R² = 63.96%.
We can therefore conclude that, according to the model chosen, 63.96% of the variation in the demand for domestic products is explained by the variation in the GNP.
Obviously the value of R² cannot exceed 100%. While 63.96% is relatively high, it is not close enough to 100% to rule out trying to modify the model further. This analysis also shows that other variables apart from the GNP should be taken into account when determining the function corresponding to the demand for domestic products, since the GNP only partially explains the variation.
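The fit above can be verified with a few lines of Python (NumPy assumed; not part of the original entry):

```python
# Least squares estimates and R² for the GNP/demand data above.
import numpy as np

X = np.array([50, 52, 55, 59, 57, 58, 62, 65, 68, 69, 70, 72])
Y = np.array([6, 8, 9, 10, 8, 10, 12, 9, 11, 10, 11, 14])

b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean())**2)
a = Y.mean() - b * X.mean()
Y_hat = a + b * X

R2 = np.sum((Y_hat - Y.mean())**2) / np.sum((Y - Y.mean())**2)
print(round(b, 3), round(a, 3), round(R2, 4))  # ≈ 0.226, -4.047, 0.6396
```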
The centered fourth-order moment of the sample is given by:

$$m_4 = \frac{1}{n} \sum_{i=1}^{h} f_i (x_i - \bar{x})^4,$$

where n = 75, x̄ = 290.60 and xi is the center of class interval i. We can summarize the calculations in the following table:

xi    xi − x̄   fi   fi(xi − x̄)⁴
225   −65.60   4    74075629.16
245   −45.60   6    25942428.06
265   −25.60   13   5583457.48
285   −5.60    22   21635.89
305   14.40    15   644972.54
325   34.40    6    8402045.34
345   54.40    5    43789058.05
365   74.40    4    122560841.32
                    281020067.84

Coefficient of Skewness

HISTORY
Between the end of the nineteenth century and the beginning of the twentieth century, Pearson, Karl studied large sets of data which sometimes deviated significantly from normality and exhibited considerable skewness.
He first used the following coefficient as a measure of skewness:

skewness = (x̄ − mode)/S,

where x̄ represents the arithmetic mean and S the standard deviation. This measure is equal to zero if the data are distributed symmetrically.
He discovered empirically that for a moderately asymmetric distribution (the gamma distribution):

Mo − x̄ ≈ 3 · (Md − x̄),

where Mo and Md denote the mode and the median of the data set. By substituting this expression into the previous coefficient, the following alternative formula is obtained:

skewness = 3 · (x̄ − Md)/S.

Following this, Pearson, K. (1894, 1895) introduced a coefficient of skewness, known as the β1 coefficient, based on calculations of the centered moments. This coefficient is more difficult to calculate but it is more descriptive and better adapted to large numbers of observations.
Pearson, K. also created the coefficient of kurtosis (β2), which is used to measure the oblateness of a curve. This coefficient is also based on the moments of the distribution being studied.
Tables giving the limit values of the coefficients β1 and β2 can be found in the works of Pearson and Hartley (1966, 1972). If the sample estimates fall outside the limits for β1 and β2, we conclude that the population is significantly curved or skewed.

MATHEMATICAL ASPECTS
The skewness coefficient is based on the centered third-order moment of the distribution in question, which is equal to:

μ3 = E[(X − μ)³].

To obtain a coefficient of skewness that is independent of the measuring unit, the third-order moment is divided by the standard deviation of the population σ raised to the third power. The coefficient obtained, designated by √β1, is equal to:

$$\sqrt{\beta_1} = \frac{\mu_3}{\sigma^3}.$$

The estimator of this coefficient, calculated for a sample (x1, x2, ..., xn), is denoted by √b1. It is equal to:

$$\sqrt{b_1} = \frac{m_3}{S^3},$$

where m3 is the centered third-order moment of the sample, given by:

$$m_3 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^3.$$

Here x̄ is the arithmetic mean, n is the total number of observations and S³ is the standard deviation raised to the third power.
For the case where a random variable X takes values xi with frequencies fi, i = 1, 2, ..., h, the centered third-order moment of the sample is given by the formula:

$$m_3 = \frac{1}{n} \sum_{i=1}^{h} f_i (x_i - \bar{x})^3.$$

If the coefficient is positive, the distribution spreads to the right. If it is negative, the distribution expands to the left. If it is close to zero, the distribution is approximately symmetric.
If the sample is taken from a normal population, the statistic √b1 roughly follows a normal distribution with a mean of 0 and a standard deviation of √(6/n). If the size of the sample n is bigger than 150, the normal table can be used to test the skewness hypothesis.

DOMAINS AND LIMITATIONS
This coefficient (similar to the other measures of skewness) is only of interest if it
can be used to compare the shapes of two or more distributions.

EXAMPLES
Suppose that we want to compare the shapes of the daily turnover distributions obtained for 75 bakeries for two different years. We then calculate the skewness coefficient in both cases. The data are categorized in the table below:

Turnover   Frequencies   Frequencies
           for year 1    for year 2
215–235    4             25
235–255    6             15
255–275    13            9
275–295    22            8
295–315    15            6
315–335    6             5
335–355    5             4
355–375    4             3

The third-order moment of the sample is given by:

$$m_3 = \frac{1}{n} \sum_{i=1}^{h} f_i (x_i - \bar{x})^3.$$

For year 1, n = 75, x̄ = 290.60 and xi is the center of each class interval i. The calculations are summarized in the following table:

xi    xi − x̄   fi   fi(xi − x̄)³
225   −65.60   4    −1129201.664
245   −45.60   6    −568912.896
265   −25.60   13   −218103.808
285   −5.60    22   −3863.652
305   14.40    15   44789.760
325   34.40    6    244245.504
345   54.40    5    804945.920
365   74.40    4    1647323.136
                    821222.410

Since S = 33.88, the coefficient of skewness is equal to:

$$\sqrt{b_1} = \frac{\tfrac{1}{75}(821222.41)}{(33.88)^3} = \frac{10949.632}{38889.307} = 0.282.$$

For year 2, n = 75 and x̄ = 265.27. The calculations are summarized in the following table:

xi    xi − x̄   fi   fi(xi − x̄)³
225   −40.27   25   −1632213.81
245   −20.27   15   −124864.28
265   −0.27    9    −0.17
285   19.73    8    61473.98
305   39.73    6    376371.09
325   59.73    5    1065663.90
345   79.73    4    2027588.19
365   99.73    3    2976063.94
                    4750082.83

Since S = 42.01, the coefficient of skewness is equal to:

$$\sqrt{b_1} = \frac{\tfrac{1}{75}(4750082.83)}{(42.01)^3} = \frac{63334.438}{74140.933} = 0.854.$$

The coefficient of skewness for year 1 is close to zero (√b1 = 0.282), so the daily turnover distribution for the 75 bakeries for year 1 is very close to being a symmetrical distribution. For year 2, the skewness coefficient is higher; this means that the distribution spreads towards the right.

FURTHER READING
Measure of shape
Measure of skewness
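The year 1 coefficient can be recomputed from the grouped data with a short Python sketch (NumPy assumed; not part of the original entry):

```python
# Skewness coefficient sqrt(b1) = m3 / S^3 for grouped data (year 1).
import numpy as np

centers = np.array([225, 245, 265, 285, 305, 325, 345, 365])
freq = np.array([4, 6, 13, 22, 15, 6, 5, 4])   # year 1 frequencies
n = freq.sum()                                  # 75

xbar = (freq * centers).sum() / n               # ≈ 290.60
m3 = (freq * (centers - xbar)**3).sum() / n
S = np.sqrt((freq * (centers - xbar)**2).sum() / n)  # ≈ 33.88

print(round(m3 / S**3, 3))  # ≈ 0.282
```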
… same principle, inverting an almost singular matrix is similar to inverting a very small number.

[Table: Portion — Ingredient 1 — Ingredient 2 — Ingredient 3 — Ingredient 4 — Heat]

… ingredient 4 taken alone has a small effect on variable Y, and the small quantities of ingredient 2 taken alone can explain the small amount of heat emitted. Hence, the linear dependence between two explanatory variables (X2 and X4) makes it more complicated to see the effect of each variable alone on the response variable Y.

Combination

1. Combinations without repetition:

$$C_n^k = \binom{n}{k} = \frac{n!}{k! \cdot (n-k)!}.$$

2. Combinations with repetitions:
Combinations with repetitions (or with replacement) are used when each drawn object is placed back for the next drawing. Each object can then occur r times in each group, r = 0, 1, ..., k.
$$K_n^k = \binom{n+k-1}{k} = \frac{(n+k-1)!}{k! \cdot (n-1)!}.$$

EXAMPLES
1. Combinations without repetition:
Consider the situation where we must choose a committee of three people from an assembly of eight people. How many different committees could potentially be picked from this assembly, if each person can only be selected once in each group? Here we need to calculate the number of possible combinations of three people from eight:

$$C_8^3 = \frac{8!}{3! \cdot (8-3)!} = \frac{40320}{6 \cdot 120} = 56.$$

Therefore, it is possible to form 56 different committees containing three people from an assembly of eight people.
2. Combinations with repetitions:
Consider an urn containing six numbered balls. We carry out four successive drawings, and place the drawn ball back into the urn after each drawing. How many different combinations could occur from this drawing? In this case we want to find the number of combinations with repetition (because each drawn ball is placed back in the urn before the next drawing). We obtain

$$K_6^4 = \frac{(6+4-1)!}{4! \cdot (6-1)!} = \frac{362880}{24 \cdot 120} = 126$$

different combinations.

FURTHER READING
Binomial distribution
Combinatory analysis
Permutation

Combinatory Analysis

Combinatory analysis refers to a group of techniques that can be used to determine the number of elements in a particular set without having to count them one-by-one. The elements in question could be the results from a scientific experiment or the different potential outcomes of a random event.
Three particular concepts are important in combinatory analysis:
• Permutations;
• Combinations;
• Arrangements.

HISTORY
Combinatory analysis has interested mathematicians for centuries. According to Takacs (1982), such analysis dates back to ancient Greece. However, the Hindus, the Persians (including the poet and mathematician Khayyâm, Omar) and (especially) the Chinese also studied such problems. A 3000-year-old Chinese book, the “I Ching”, describes the possible arrangements of a set of n elements, where n ≤ 6. In 1303, Chu, Shih-chieh published a work entitled “Ssu Yuan Yü Chien” (Precious mirror of the four elements). The cover of the book depicts a triangle that shows the combinations of k elements taken from a set of size n, where 0 ≤ k ≤ n. This arithmetic triangle was also explored by several European mathematicians such as Stifel, Tartaglia and Hérigone, and especially Pascal, who wrote the “Traité du triangle arithmétique”.
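Both counts above can be checked directly in Python (standard library only; this sketch is not part of the original entry):

```python
# Combination counts with and without repetition.
from math import comb

# Without repetition: committees of 3 chosen from 8 people
print(comb(8, 3))          # 56

# With repetition: 4 draws with replacement from 6 balls
n, k = 6, 4
print(comb(n + k - 1, k))  # 126
```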
EXAMPLES
See arrangement, combination and permutation.

FURTHER READING
Arithmetic triangle
Arrangement
Binomial
Binomial distribution

Compatibility

Two events A and B are incompatible (or mutually exclusive) if the occurrence of A prevents the occurrence of B, or vice versa. This means that the probability that these two events happen at the same time is zero:

A ∩ B = ∅ −→ P(A ∩ B) = 0.

MATHEMATICAL ASPECTS
If two events A and B are compatible, the probability that at least one of the events occurs can be obtained using the following formula:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

On the other hand, if the two events A and B are incompatible, the probability that the two events A and B happen at the same time is zero:

P(A ∩ B) = 0.

The probability that at least one of the events occurs can then be obtained by simply adding the individual probabilities of A and B:

P(A ∪ B) = P(A) + P(B).

EXAMPLES
Consider drawing a card at random from a deck of 52 cards, and let A be the event “a heart is drawn”, B the event “a queen is drawn” and C the event “a club is drawn”.
The events A and B are compatible, because it is possible to draw both a heart and a queen at the same time (the queen of hearts). Therefore, the intersection between A and B is the queen of hearts. The probability of this event is given by:

P(A ∩ B) = 1/52.

The probability of the union of the two events A and B (drawing either a heart or a queen) is then equal to:

P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 13/52 + 4/52 − 1/52 = 4/13.

On the other hand, the events A and C are incompatible, because a card cannot be both a heart and a club! The intersection between A and C is an empty set.
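The card probabilities above can be confirmed by enumerating the deck (pure Python; an illustration, not part of the original entry):

```python
# Enumerating a 52-card deck to check the union and intersection probabilities.
from fractions import Fraction

suits = ["hearts", "clubs", "diamonds", "spades"]
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
deck = [(r, s) for s in suits for r in ranks]

A = {c for c in deck if c[1] == "hearts"}   # a heart is drawn
B = {c for c in deck if c[0] == "Q"}        # a queen is drawn
C = {c for c in deck if c[1] == "clubs"}    # a club is drawn

print(Fraction(len(A & B), 52))   # 1/52  (the queen of hearts)
print(Fraction(len(A | B), 52))   # 4/13
print(len(A & C))                 # 0: A and C are incompatible
```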
Composite Index Number

Composite index numbers allow us to measure, with a single number, the relative variations within a group of variables upon moving from one situation to another. The consumer price index, the wholesale price index, the employment index and the …

2. Arithmetic mean of simple index numbers:

$$I_{n/0} = \frac{1}{N} \sum \frac{P_n}{P_0} \cdot 100,$$

where N is the number of goods considered and Pn/P0 is the simple index number of each item. In these two methods, each item has the same importance. This is a situation which often does not correspond to reality.
3. Index number of weighted arithmetic means (the weighted sum method): the general formula for an index number calculated by the weighted sum method is as follows:

$$I_{n/0} = \frac{\sum P_n \cdot Q}{\sum P_0 \cdot Q} \cdot 100.$$

According to this method, the price index has increased by 405.9% (505.9 − 100) between the reference year and the current year.
Now imagine that we are given supplementary information: a fast car was chosen. What, then, is the probability that this car is also comfortable?
We calculate the probability of B knowing that A has occurred, or the conditional probability of B given A. We find in the table that P(A ∩ B) = 0.4, so

$$P(B|A) = \frac{P(A \cap B)}{P(A)} = \frac{0.4}{0.6} = 0.667.$$

The probability that the car is comfortable, given that we know that it is fast, is therefore 2/3.

FURTHER READING
Event
Probability
Random experiment
Sample space

REFERENCES
Kolmogorov, A.N.: Grundbegriffe der Wahrscheinlichkeitsrechnung. Springer, Berlin Heidelberg New York (1933)
Kolmogorov, A.N.: Foundations of the Theory of Probability. Chelsea Publishing Company, New York (1956)

Confidence Interval

A confidence interval is any interval constructed around an estimator that has a particular probability of containing the true value of the corresponding parameter of a population.

HISTORY
According to Desrosières, A. (1988), Bowley, A.L. was one of the first to be interested in the concept of the confidence interval. Bowley presented his first confidence interval calculations to the Royal Statistical Society in 1906.

MATHEMATICAL ASPECTS
In order to construct a confidence interval that contains the true value of the parameter θ with a given probability, an equation of the following form must be solved:

P(Li ≤ θ ≤ Ls) = 1 − α,

where
θ is the parameter to be estimated,
Li is the lower limit of the interval,
Ls is the upper limit of the interval and
1 − α is the given probability, called the confidence level of the interval.
The probability α measures the error risk of the interval, meaning the probability that the interval does not contain the true value of the parameter θ.
In order to solve this equation, a function f(t, θ) must be defined, where t is an estimator of θ, for which the probability distribution is known. Defining this interval for f(t, θ) involves writing the equation:

P(k1 ≤ f(t, θ) ≤ k2) = 1 − α,

where the constants k1 and k2 are given by the probability distribution of the function f(t, θ). Generally, the error risk α is divided into two equal parts α/2, distributed on each side of the distribution of f(t, θ). If, for example, the function f(t, θ) follows a centered and reduced normal distribution, the constants k1 and k2 will be symmetric and can be represented by −z_{α/2} and +z_{α/2}.
EXAMPLES
A business that fabricates lightbulbs wants to test the average lifespan of its lightbulbs. The distribution of the random variable X, which represents the lifespan in hours, is a normal distribution with mean μ and standard deviation σ = 30. A sample of n = 25 bulbs gives a mean lifespan of x̄ = 860 hours. We have:

$$P\left(-1.96 \frac{\sigma}{\sqrt{n}} \le \bar{x} - \mu \le 1.96 \frac{\sigma}{\sqrt{n}}\right) = 0.95,$$

$$P\left(\bar{x} - 1.96 \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + 1.96 \frac{\sigma}{\sqrt{n}}\right) = 0.95.$$

By replacing x̄, σ and n with their respective values, we obtain:

$$P\left(860 - 1.96 \frac{30}{\sqrt{25}} \le \mu \le 860 + 1.96 \frac{30}{\sqrt{25}}\right) = 0.95,$$

P(848.24 ≤ μ ≤ 871.76) = 0.95.

The confidence interval for μ at the confidence level 0.95 is therefore:

[848.24, 871.76].

This means that we can affirm with a probability of 95% that this interval contains the true value of the parameter μ that corresponds to the average lifespan of the lightbulbs.

FURTHER READING
Confidence level
Estimation
Estimator

REFERENCES
Bowley, A.L.: Presidential address to the economic section of the British Association. J. Roy. Stat. Soc. 69, 540–558 (1906)
Desrosières, A.: La partie pour le tout: comment généraliser? La préhistoire de la contrainte de représentativité. Estimation et sondages. Cinq contributions à l’histoire de la statistique. Economica, Paris (1988)

Confidence Level

The confidence level is the probability that the confidence interval constructed around an estimator contains the true value of the corresponding parameter of the population. We designate the confidence level by (1 − α), where α corresponds to the risk of error; that is, to the probability that the confidence interval does not contain the true value of the parameter.

HISTORY
The first example of a confidence interval appears in the work of Laplace (1812). According to Desrosières, A. (1988), Bowley, A.L. was one of the first to become interested in the concept of the confidence interval.
See hypothesis testing.

MATHEMATICAL ASPECTS
Let θ be a parameter associated with a population. θ is to be estimated and T is its estimator from a random sample. We evaluate the precision of T as the estimator of θ by constructing a confidence interval around the estimator, which is often interpreted as an error margin.
In order to construct this confidence interval, we generally proceed in the following manner. From the distribution of the estimator T, we determine an interval that is likely to contain the true value of the parameter. Let us denote this interval by (T − ε, T + ε) and the probability of the true value of the parameter being in this interval as (1 − α). We can then say that the error margin ε is related to α by the probability:

P(T − ε ≤ θ ≤ T + ε) = 1 − α.

The level of probability associated with an interval of estimation is called the confidence level or the confidence degree. The interval T − ε ≤ θ ≤ T + ε is called the confidence interval for θ at the confidence level 1 − α. Let us use, for example, α = 5%,
which will give the confidence interval of the parameter θ at a probability level of 95%. This means that, if we use T as an estimator of θ, then the interval indicated will on average contain the true value of the parameter 95 times out of 100 samplings, and it will not contain it 5 times.
The quantity ε of the confidence interval corresponds to half of the length of the interval. This parameter therefore gives us an idea of the error margin for the estimator. For a given confidence level 1 − α, the smaller the confidence interval, the more efficient the estimator.

DOMAINS AND LIMITATIONS
The most commonly used confidence levels are 90%, 95% and 99%. However, if necessary, other levels can be used instead. Although we would like to use the highest confidence level in order to maximize the probability that the confidence interval contains the true value of the parameter, if we increase the confidence level, the interval increases as well. Therefore, what we gain in terms of confidence is lost in terms of precision, so we have to find a compromise.

EXAMPLES
A company that produces lightbulbs wants to study the mean lifetime of its bulbs. The distribution of the random variable X that represents the lifetime in hours is a normal distribution with mean μ and standard deviation σ = 30; a sample gives a mean lifetime of x̄ = 860 hours. The company then wants to construct a 95% confidence interval for μ around the estimator x̄.
The standard deviation σ of the population is known; the value of ε is z_{α/2} · σ_X̄. The value of z_{α/2} is obtained from the normal table, and it depends on the probability attributed to the parameter α. We then deduce the confidence interval of the estimator of μ at the probability level 1 − α:

X̄ − z_{α/2} σ_X̄ ≤ μ ≤ X̄ + z_{α/2} σ_X̄.

From the hypothesis that the bulb lifetime X follows a normal distribution with mean μ and standard deviation σ = 30, we deduce that the expression

$$\frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$

follows a standard normal distribution. From this we obtain:

$$P\left(-z_{0.025} \le \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \le z_{0.025}\right) = 0.95,$$

where the risk of error α is divided into two parts that both equal α/2 = 0.025.
The table of the standard normal distribution (the normal table) gives z_{0.025} = 1.96. We then have:

$$P\left(-1.96 \le \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \le 1.96\right) = 0.95.$$

To get the confidence interval for μ, we must isolate μ in this equation via the following transformations:

$$P\left(\bar{x} - 1.96 \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + 1.96 \frac{\sigma}{\sqrt{n}}\right) = 0.95.$$
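The lightbulb interval can be recomputed in a few lines of Python (SciPy assumed; an illustration, not part of the original entry):

```python
# 95% confidence interval for the mean, known sigma.
from scipy.stats import norm
import math

xbar, sigma, n = 860, 30, 25
z = norm.ppf(1 - 0.05 / 2)          # ≈ 1.96 for a 95% confidence level

eps = z * sigma / math.sqrt(n)      # error margin
print(xbar - eps, xbar + eps)       # ≈ 848.24, 871.76
```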
Continuous Distribution Function

The distribution function of a continuous random variable is defined to be the probability that the random variable takes a value less than or equal to a real number.

HISTORY
See probability.

MATHEMATICAL ASPECTS
The function defined by

$$F(b) = P(X \le b) = \int_{-\infty}^{b} f(x)\, dx$$

is called the distribution function of a continuous random variable. In other words, the density function f is the derivative of the continuous distribution function.

DOMAINS AND LIMITATIONS
The probability that the continuous random variable X takes a value in the interval ]a, b] for all a < b, meaning P(a < X ≤ b), is equal to F(b) − F(a), where F is the distribution function of the random variable X.
Demonstration: The event {X ≤ b} can be written as the union of two mutually exclusive events, {X ≤ a} and {a < X ≤ b}:

{X ≤ b} = {X ≤ a} ∪ {a < X ≤ b}.

By finding the probability on each side of the equation, we obtain:

P(X ≤ b) = P({X ≤ a} ∪ {a < X ≤ b}) = P(X ≤ a) + P(a < X ≤ b),

hence P(a < X ≤ b) = F(b) − F(a).
Continuous Probability Distribution

… Var(X) = E[X²] − E[X]²,

where D represents the interval covering the range of values that X can take.
One essential property of a continuous random variable is that the probability that it will take a specific numerical value is zero, whereas the probability that it will take a value over an interval (finite or infinite) is usually nonzero.

DOMAINS AND LIMITATIONS
The most famous continuous probability distribution is the normal distribution. Continuous probability distributions are often used to approximate discrete probability distributions. They are used in model construction just as much as they are used when applying statistical techniques.

FURTHER READING
Beta distribution
Cauchy distribution
Chi-square distribution
Continuous distribution function
Density function
Discrete probability distribution
Expected value
Exponential distribution
Fisher distribution
Gamma distribution
Laplace distribution
Lognormal distribution
Normal distribution
Probability
Probability distribution
Random variable
Student distribution
Uniform distribution
Variance of a random variable

REFERENCES
Johnson, N.L., Kotz, S.: Distributions in Statistics: Continuous Univariate Distributions, vols. 1 and 2. Wiley, New York (1970)

Contrast

In analysis of variance, a contrast is a linear combination of the observations or factor levels or treatments in a factorial experiment, where the sum of the coefficients is zero.

HISTORY
According to Scheffé, H. (1953), Tukey, J.W. (1949 and 1951) was the first to propose a method of simultaneously estimating all of the contrasts.

MATHEMATICAL ASPECTS
Consider T1, T2, ..., Tk, which are the sums of n1, n2, ..., nk observations. The linear function

cj = c1j · T1 + c2j · T2 + · · · + ckj · Tk

is a contrast if and only if

$$\sum_{i=1}^{k} n_i \cdot c_{ij} = 0.$$

If each ni = n, meaning that Ti is the sum of the same number of observations, the condition reduces to:

$$\sum_{i=1}^{k} c_{ij} = 0.$$

DOMAINS AND LIMITATIONS
In most experiments involving several treatments, it is interesting for the experimenter to make comparisons between the different treatments. The statistician uses contrasts to carry out this type of comparison.
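A minimal sketch of a contrast (hypothetical treatment totals; NumPy assumed; not part of the original entry):

```python
# A contrast comparing treatment 1 with the average of treatments 2 and 3,
# assuming equal group sizes n_i.
import numpy as np

T = np.array([42.0, 35.0, 31.0])     # treatment totals T1, T2, T3 (hypothetical)
c = np.array([1.0, -0.5, -0.5])      # coefficients; they sum to zero

assert abs(c.sum()) < 1e-12          # the defining condition for equal n_i
contrast_value = (c * T).sum()
print(contrast_value)                # 42 - (35 + 31)/2 = 9.0
```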
114 Convergence
REFERENCES
Scheffé, H.: A method for judging all contrasts in the analysis of variance. Biometrika 40, 87–104 (1953)
Tukey, J.W.: Comparing individual means in the analysis of variance. Biometrics 5, 99–114 (1949)
Tukey, J.W.: Quick and Dirty Methods in Statistics. Part II: Simple Analyses for Standard Designs. Quality Control Conference Papers 1951. American Society for Quality Control, New York, pp. 189–197 (1951)

Convergence
In statistics, the term "convergence" is related to probability theory. This statistical convergence is often termed stochastic convergence in order to distinguish it from classical convergence.

MATHEMATICAL ASPECTS
Let {Xn}, n ∈ ℕ, be a sequence of random variables.
1. {Xn}, n ∈ ℕ, converges in distribution to a random variable X if the distribution function of Xn tends to that of X at every point where the latter is continuous.
2. {Xn}, n ∈ ℕ, converges in probability to a random variable X if:

  lim_{n→∞} P(|Xn − X| > ε) = 0  for every ε > 0 .

3. {Xn}, n ∈ ℕ, exhibits almost sure convergence to a random variable X if:

  P( w | lim_{n→∞} Xn(w) = X(w) ) = 1 .

4. Suppose that all elements of Xn have a finite expectancy. The set {Xn}, n ∈ ℕ, converges in mean square to X if:

  lim_{n→∞} E[(Xn − X)²] = 0 .

Note that:
• Almost sure convergence and mean square convergence both imply convergence in probability;
• Convergence in probability (weak convergence) implies convergence in distribution.
EXAMPLES
Let Xi be independent random variables uniformly distributed over [0, 1]. We define the following set of random variables from the Xi:

  Zn = n · min_{i=1,...,n} Xi .

We can show that the set {Zn}, n ∈ ℕ, converges in distribution to an exponential distribution Z with a parameter of 1 as follows:

  P(Zn > t) = P( min_{i=1,...,n} Xi > t/n )
            = P( X1 > t/n and X2 > t/n and ... and Xn > t/n )
            = Π_{i=1}^{n} P(Xi > t/n)   (by independence)
            = (1 − t/n)^n .

Now, lim_{n→∞} (1 − t/n)^n = exp(−t), so that:

  lim_{n→∞} F_{Zn}(t) = lim_{n→∞} P(Zn ≤ t) = 1 − exp(−t) = F_Z(t) .

Finally, let Sn be the number of successes obtained during n Bernoulli trials with a probability of success p. Bernoulli's theorem tells us that Sn/n converges in probability to a "random" variable that takes the value p with probability 1.
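This convergence in distribution can also be illustrated by simulation; the following sketch (not part of the original entry) compares the empirical distribution function of Zn with F_Z(t) = 1 − exp(−t):

  import numpy as np
  from scipy.stats import expon

  rng = np.random.default_rng(0)
  n, n_rep = 1000, 5000

  # n_rep realizations of Z_n = n * min(X_1, ..., X_n), X_i uniform on [0, 1]
  z = n * rng.uniform(size=(n_rep, n)).min(axis=1)

  for t in (0.5, 1.0, 2.0):
      print(t, (z <= t).mean(), expon.cdf(t))   # empirical vs. exact F_Z(t)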
FURTHER READING
Bernoulli's theorem
Central limit theorem
Convergence theorem
De Moivre–Laplace theorem
Law of large numbers
Probability
Random variable
Stochastic process

REFERENCES
Le Cam, L.M., Yang, G.L.: Asymptotics in Statistics: Some Basic Concepts. Springer, Berlin Heidelberg New York (1990)
Staudte, R.G., Sheather, S.J.: Robust Estimation and Testing. Wiley, New York (1990)

Convergence Theorem
The convergence theorem leads to the most important theoretical results in probability theory. Among them, we find the law of large numbers and the central limit theorem.

EXAMPLES
The central limit theorem and the law of large numbers are both convergence theorems.
The law of large numbers states that the mean of a sum of identically distributed random variables converges to their common mathematical expectation.
On the other hand, the central limit theorem states that the distribution of the sum of a sufficiently large number of random variables tends to approximate the normal distribution.

FURTHER READING
Central limit theorem
Law of large numbers

Correlation Coefficient
The simple correlation coefficient is a measure of the strength of the linear relation between two random variables.
The correlation coefficient can take values that occur in the interval [−1; 1]. The two extreme values of this interval represent a perfectly linear relation between the variables, "positive" in the first case and "negative" in the other. The value 0 (zero) implies the absence of a linear relation.
The correlation coefficient presented here is also called the Bravais–Pearson correlation coefficient.

HISTORY
The concept of correlation originated in the 1880s with the works of Galton, F. In his autobiography Memories of My Life (1908), he writes that he thought of this concept during a walk in the grounds of Naworth Castle, when a rain shower forced him to find shelter.
According to Stigler, S.M. (1989), Porter, T.M. (1986) was carrying out historical research when he found a forgotten article written by Galton in 1890 in The North American Review, under the title "Kinship and correlation". In this article, published right after the discovery, Galton explained the nature and the importance of the concept of correlation.
This discovery was related to previous works of the mathematician, notably those on heredity and linear regression. Galton had been interested in this field of study since 1860. He published a work entitled "Natural inheritance" (1889), which was the starting point for his thoughts on correlation.
In 1888, in an article sent to the Royal Statistical Society entitled "Co-relations and their measurement, chiefly from anthropometric data," Galton used the term "correlation" for the first time, although he was still alternating between the terms "co-relation" and "correlation", and he spoke of a "co-relation index." On the other hand, he invoked the concept of a negative correlation. According to Stigler (1989), Galton only appeared to suggest that correlation was a positive relationship.
Pearson, Karl wrote in 1920 that correlation had been discovered by Galton, whose work "Natural inheritance" (1889) pushed him to study this concept too, along with two other researchers, Weldon and Edgeworth. Pearson and Edgeworth then developed the theory of correlation.
Weldon thought the correlation coefficient should be called the "Galton function." However, Edgeworth replaced Galton's term "co-relation index" and Weldon's term "Galton function" with the term "correlation coefficient."
According to Mudholkar (1982), Pearson, K. systemized the analysis of correlation and established a theory of correlation for three variables. Researchers from University College, most notably his assistant Yule, G.U., were also interested in developing multiple correlation.
Spearman published the first study on rank correlation in 1904.
Among the works that were carried out in this field, it is worth highlighting those of Yule, who in an article entitled "Why do we sometimes get nonsense-correlations between time-series" (1926) discussed the problem of interpreting correlation analyses. Finally, correlation robustness was investigated by Mosteller and Tukey (1977).

MATHEMATICAL ASPECTS
Simple Linear Correlation Coefficient
Simple linear correlation is the term used to describe a linear dependence between two quantitative variables X and Y (see simple linear regression).
The quantity

  r = Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) / √( Σ_{i=1}^{n} (Xi − X̄)² · Σ_{i=1}^{n} (Yi − Ȳ)² )

is an estimation of ρ; it is the sampling correlation coefficient.
If we denote (Xi − X̄) by xi and (Yi − Ȳ) by yi, we can write this equation as:

  r = Σ_{i=1}^{n} xi·yi / √( Σ_{i=1}^{n} xi² · Σ_{i=1}^{n} yi² ) .

Test of Hypothesis
To test the null hypothesis

  H0 : ρ = 0

against the alternative hypothesis

  H1 : ρ ≠ 0 ,

we calculate the statistic t = (r − 0)/Sr, where Sr = √( (1 − r²)/(n − 2) ) is the estimated standard deviation of r; under H0 this statistic follows a Student distribution with n − 2 degrees of freedom (see the example below).

Multiple Correlation Coefficient
Known as the coefficient of determination, and denoted by R², this coefficient determines whether the hyperplane estimated from a multiple linear regression is correctly adjusted to the data points.
The value of the multiple determination coefficient R² is equal to:

  R² = Explained variation / Total variation = Σ_{i=1}^{n} (Ŷi − Ȳ)² / Σ_{i=1}^{n} (Yi − Ȳ)² .

It corresponds to the square of the multiple correlation coefficient. Notice that

  0 ≤ R² ≤ 1 .

In the case of simple linear regression, the following relation can be derived:

  r = sign(β̂1) · √R² ,

where β̂1 is the estimator of the regression coefficient β1, and it is given by:

  β̂1 = Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) / Σ_{i=1}^{n} (Xi − X̄)² .
DOMAINS AND LIMITATIONS
If there is a perfect linear relation between two variables, the correlation coefficient is equal to 1 or −1.
A positive relation (+) means that the two variables vary in the same direction. If the individuals obtain high scores in the first variable (for example the independent variable), they will have a tendency to obtain high scores in the second variable (the dependent variable). The opposite is also true.
A negative relation (−) means that the individuals that obtain high scores in the first variable will have a tendency to obtain low scores in the second one, and vice versa.
Note that if the variables are independent, the correlation coefficient is equal to zero. The reciprocal conclusion is not necessarily true.
The fact that two or more variables are related in a statistical way is not sufficient to conclude that a cause and effect relation exists. The existence of a statistical correlation is not a proof of causality.
Statistics provides numerous correlation coefficients. The choice of which to use for a particular set of data depends on different factors, such as:
• The type of scale used to express the variable;
• The nature of the underlying distribution (continuous or discrete);
• The characteristics of the distribution of the scores (linear or nonlinear).

EXAMPLES
The data for two variables X and Y are shown in the table below:

  No of                x =      y =
  order   X      Y     X − X̄    Y − Ȳ      xy       x²       y²
   1     174     64    −1.5     −1.3      1.95     2.25      1.69
   2     175     59    −0.5     −6.3      3.15     0.25     39.69
   3     180     64     4.5     −1.3     −5.85    20.25      1.69
   4     168     62    −7.5     −3.3     24.75    56.25     10.89
   5     175     51    −0.5    −14.3      7.15     0.25    204.49
   6     170     60    −5.5     −5.3     29.15    30.25     28.09
   7     170     68    −5.5      2.7    −14.85    30.25      7.29
   8     178     63     2.5     −2.3     −5.75     6.25      5.29
   9     187     92    11.5     26.7    307.05   132.25    712.89
  10     178     70     2.5      4.7     11.75     6.25     22.09
  Total 1755    653                     358.50   284.50   1034.10

  X̄ = 175.5 and Ȳ = 65.3 .

We now perform the necessary calculations to obtain the correlation coefficient between the two variables. Applying the formula gives:

  r = Σ_{i=1}^{n} xi·yi / √( Σ_{i=1}^{n} xi² · Σ_{i=1}^{n} yi² ) = 358.5 / √(284.5 · 1034.1) = 0.66 .

Test of Hypothesis
We can calculate the estimated standard deviation of r:

  Sr = √( (1 − r²)/(n − 2) ) = √( 0.56/(10 − 2) ) = 0.266 .
Calculating the statistic t gives:

  t = (r − 0)/Sr = 0.66/0.266 = 2.485 .

If we choose a significance level α of 5%, the value from the Student table, t_{0.025,8}, is equal to 2.306.
Since |t| = 2.485 > t_{0.025,8} = 2.306, the null hypothesis

  H0 : ρ = 0

is rejected.
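The whole example can be reproduced in a few lines; a minimal sketch (not part of the original entry):

  import numpy as np
  from scipy.stats import t as student_t

  X = np.array([174., 175, 180, 168, 175, 170, 170, 178, 187, 178])
  Y = np.array([64., 59, 64, 62, 51, 60, 68, 63, 92, 70])

  x, y = X - X.mean(), Y - Y.mean()
  r = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())  # ~0.66

  n = len(X)
  s_r = np.sqrt((1 - r ** 2) / (n - 2))       # estimated std. deviation of r
  t_stat = r / s_r                            # ~2.49
  t_crit = student_t.ppf(0.975, df=n - 2)     # ~2.306 for alpha = 5%
  print(abs(t_stat) > t_crit)                 # True: H0 is rejected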
FURTHER READING
Coefficient of determination
Covariance
Dependence
Kendall rank correlation coefficient
Multiple linear regression
Regression analysis
Simple linear regression
Spearman rank correlation coefficient

REFERENCES
Pearson, K.: Studies in the history of statistics and probability. Biometrika 13, 25–45 (1920). Reprinted in: Pearson, E.S., Kendall, M.G. (eds.) Studies in the History of Statistics and Probability, vol. I. Griffin, London
Porter, T.M.: The Rise of Statistical Thinking, 1820–1900. Princeton University Press, Princeton, NJ (1986)
Stigler, S.: Francis Galton's account of the invention of correlation. Stat. Sci. 4, 73–79 (1989)
Yule, G.U.: On the theory of correlation. J. Roy. Stat. Soc. 60, 812–854 (1897)
Yule, G.U.: Why do we sometimes get nonsense-correlations between time-series? A study in sampling and the nature of time-series. J. Roy. Stat. Soc. (2) 89, 1–64 (1926)
Correspondence Analysis
This term was given by Benzécri, J.P. in the winter of 1963. In 1976 the works of Benzécri, J.P., which retraced twelve years of his laboratory work, were published, and since then the algebraic and geometrical properties of this descriptive analytical tool have become more widely known and used.

MATHEMATICAL ASPECTS
Consider a contingency table relating to two categorical qualitative variables X and Y that have, respectively, r and c categories:

          Y1     Y2    ...   Yc     Total
  X1      n11    n12   ...   n1c    n1.
  X2      n21    n22   ...   n2c    n2.
  ...     ...    ...   ...   ...    ...
  Xr      nr1    nr2   ...   nrc    nr.
  Total   n.1    n.2   ...   n.c    n..

where
nij represents the frequency that category i of variable X and category j of variable Y is observed,
ni. represents the sum of the observed frequencies for category i of variable X,
n.j represents the sum of the observed frequencies for category j of variable Y,
n.. represents the total number of observations.
We will assume that r ≥ c; if not, we take the transpose of the initial table and use this transpose as the new contingency table. The correspondence analysis of a contingency table with more rows than columns is performed as follows:
1. Tables of row profiles XI and column profiles XJ are constructed.
For a fixed row (column), the row (column) profile is the row (column) obtained by dividing each element in this row (column) by the sum of the elements in the row (column).
The row profile of row i is obtained by dividing each term of row i by ni., which is the sum of the observed frequencies in the row.
The table of row profiles is constructed by replacing each row of the contingency table with its profile:

          Y1        Y2        ...   Yc        Total
  X1      n11/n1.   n12/n1.   ...   n1c/n1.   1
  X2      n21/n2.   n22/n2.   ...   n2c/n2.   1
  ...     ...       ...       ...   ...       ...
  Xr      nr1/nr.   nr2/nr.   ...   nrc/nr.   1
  Total   n.1       n.2       ...   n.c       r

It is also common to multiply each element of the table by 100 in order to convert the terms into percentages and to make the sum of terms in each row 100%.
The column profile matrix is constructed in a similar way, but this time each column of the contingency table is replaced with its profile: the column profile of column j is obtained by dividing each term of column j by n.j, which is the sum of the frequencies observed for the category corresponding to this column.

          Y1        Y2        ...   Yc        Total
  X1      n11/n.1   n12/n.2   ...   n1c/n.c   n1.
  X2      n21/n.1   n22/n.2   ...   n2c/n.c   n2.
  ...     ...       ...       ...   ...       ...
  Xr      nr1/n.1   nr2/n.2   ...   nrc/n.c   nr.
  Total   1         1         ...   1         c

The tables of row profiles and column profiles correspond to a transformation of the contingency table that is used to make the rows and columns comparable.
2. Determine the inertia matrix V.
This is done in the following way:
• The weighted mean of the r column coordinates is calculated:

  gj = Σ_{i=1}^{r} (ni./n..) · (nij/ni.) = n.j/n.. ,  j = 1, ..., c .

• The c obtained values gj are written r times in the rows of a matrix G;
• The diagonal matrix DI is constructed with diagonal elements ni./n..;
• Finally, the inertia matrix V is calculated using the following formula:

  V = (XI − G)ᵀ · DI · (XI − G) .

3. Using the matrix M, which consists of n../n.j terms on its diagonal and zero terms elsewhere, we determine the matrix C:

  C = √M · V · √M .

4. Find the eigenvalues (denoted kl) and the eigenvectors (denoted vl) of this matrix C.
The c eigenvalues kc, kc−1, ..., k1 (written in decreasing order) are the inertias. The corresponding eigenvectors are called the factorial axes (or axes of inertia).
For each eigenvalue, we calculate the corresponding inertia explained by the factorial axis. For example, the first factorial axis explains:

  100 · k1 / Σ_{l=1}^{c} kl   (in %) of the inertia.

In the same way, the first two factorial axes explain:

  100 · (k1 + k2) / Σ_{l=1}^{c} kl   (in %) of the inertia.

If we want to know, for example, the number of eigenvalues and therefore the factorial axes that explain at least 3/4 of the total inertia, we sum the explained inertia from each of the eigenvalues until we obtain 75%.
We then calculate the main axes of inertia from these factorial axes.
5. The main axes of inertia, denoted ul, are then given by:

  ul = √M⁻¹ · vl ,

meaning that its jth component is:

  ujl = √(n.j/n..) · vjl .

6. We then calculate the main components, denoted by yl, which are the orthogonal projections of the row coordinates on the main axes of inertia: the ith coordinate of the lth main component takes the following value:

  yil = xi · M · ul ,

meaning that

  yil = Σ_{j=1}^{c} (nij/ni.) · √(n../n.j) · vjl

is the coordinate of row i on the lth axis.
7. After the main components yl (of the row coordinates) have been calculated, we determine the main components of the column coordinates (denoted by zl) using the yl, thanks to the transition formulae:

  zjl = (1/√kl) · Σ_{i=1}^{r} (nij/n.j) · yil ,  for j = 1, 2, ..., c ;

  yil = (1/√kl) · Σ_{j=1}^{c} (nij/ni.) · zjl ,  for i = 1, 2, ..., r .
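Steps 1 to 7 map directly onto matrix operations; the following compact sketch (illustrative values, not from the original text; the third row of the table is invented so that r ≥ c holds) runs the whole procedure with NumPy:

  import numpy as np

  N = np.array([[5., 7., 3.],      # contingency table (r >= c)
                [5., 5., 2.],
                [9., 4., 2.]])     # third row invented for illustration

  n = N.sum(); ni = N.sum(axis=1); nj = N.sum(axis=0)
  XI = N / ni[:, None]             # row profiles
  XJ = N / nj[None, :]             # column profiles

  g = nj / n                       # weighted means g_j
  DI = np.diag(ni / n)
  V = (XI - g).T @ DI @ (XI - g)   # inertia matrix

  sqM = np.diag(np.sqrt(n / nj))
  k, v = np.linalg.eigh(sqM @ V @ sqM)      # eigenvalues, factorial axes
  idx = np.argsort(k)[::-1]; k, v = k[idx], v[:, idx]

  U = np.diag(np.sqrt(nj / n)) @ v          # main axes of inertia
  Y = XI @ np.diag(n / nj) @ U              # main components of the rows
  Z = np.diag(1 / np.sqrt(k[:2])) @ Y[:, :2].T @ XJ   # transition formula

  print(k / k.sum())               # explained inertia per axis
  print(Y[:, :2]); print(Z)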
The profile of a supplementary (extra) row s is given by:

  ns1/ns. , ns2/ns. , ..., nsc/ns. .

The lth main component of the extra row point is given by the second formula of point 7:

  ysl = (1/√kl) · Σ_{j=1}^{c} (nsj/ns.) · zjl .

We proceed the same way for a column coordinate, but apply the first formula of point 7 instead.

EXAMPLES
In this example, several categories of personnel in a business were asked how many times per year they visit the doctor. The results are reported in the following contingency table, to which margins have been added:

                                    Number of visits per year
                                    0 to 6   7 to 12   > 12   Total
  Managerial staff members > 40        5        7        3      15
  Managerial staff members < 40        5        5        2      12
The explained inertia is determined for each of these values. For example:

  k1 : 0.0746/(0.0746 + 0.0039) = 95.03% ,
  k2 : 0.0039/(0.0746 + 0.0039) = 4.97% .

For the last one, k3, the explained inertia is zero.
The first two factorial axes, associated with the eigenvalues k1 and k2, explain all of the inertia. Since the third eigenvalue is zero, it is not necessary to calculate the eigenvector that is associated with it. We therefore focus on calculating the first two normalized eigenvectors:

  v1 = [ 0.7807, −0.5576, −0.2821 ]ᵀ  and  v2 = [ 0.0807, 0.5377, −0.8393 ]ᵀ .

5. We then calculate the main axes of inertia by:

  ui = √M⁻¹ · vi ,  for i = 1 and 2 ,

where √M⁻¹ is obtained by inverting the diagonal elements of √M.
6. We then calculate the main components by projecting the rows onto the main axes of inertia. Constructing an auxiliary matrix U formed from the two vectors u1 and u2, we define:

  Y = XI · M · U =
      [ −0.1129    0.0790 ]
      [  0.1075    0.0564 ]
      [ −0.5851    0.0186 ]
      [  0.1311   −0.0600 ]
      [  0.2349    0.0475 ]

We can see the coordinates of the five rows written horizontally in this matrix Y; for example, the first column indicates the components related to the first factorial axis and the second indicates the components related to the second axis.
7. We then use the coordinates of the rows to find those of the columns via the following transition formula written in matrix form:

  Z = K · Yᵀ · XJ ,

where:
K is the second-order diagonal matrix that has 1/√ki terms on its diagonal and zero terms elsewhere;
Yᵀ is the transpose of the matrix containing the coordinates of the rows, and;
XJ is the column profile matrix.
We obtain the matrix:

  Z = [ 0.3442   −0.2409   −0.1658 ]
      [ 0.0081    0.0531   −0.1128 ]

where each column contains the components of one of the three column coordinates; for example, the first line corresponds to each coordinate on the first factorial axis and the second line to each coordinate on the second axis.
We can verify the transition formula that gives Y from Z:

  Y = XI · Zᵀ · K ,

using the same notation as seen previously.
8. We can now represent the five categories of people questioned and the three categories of answers proposed on the same factorial plot.
We can study this factorial plot at three different levels of analysis, depending on:
• The set of categories for the people questioned;
• The set of modalities for the medical visits;
• Both at the same time.
In the first case, close proximity between two rows (between two categories of personnel) signifies similar medical visit profiles. On the factorial plot, this is the case for the employees under 40 (Y4) and the office personnel (Y5). We can verify from the table of row profiles that the percentages for these two categories are indeed very similar.
Similarly, the proximity between two columns (representing two categories related to the number of medical visits) indicates similar distributions of people within the business for these categories. This can be seen for the modalities Z2 (from 7 to 12 medical visits per year) and Z3 (more than 12 visits per year).
If we consider the rows and the columns simultaneously (and not separately as we did previously), it becomes possible to identify similarities between categories for certain modalities. For example, the employees under 40 (Y4) and the office personnel (Y5) seem to have the same behavior towards health: high proportions of them (0.44 and 0.5 respectively) go to the doctor less than 6 times per year (Z1).
In conclusion, axis 1 contrasts the categories indicating an average or high number of visits (Z2 and Z3), together with the employees or the managerial staff members over 40 (Y1 and Y3), with the modalities associated with low numbers of visits (Z1): Y2, Y4 and Y5. The first factor can then be interpreted as the importance of medical control according to age.

FURTHER READING
Contingency table
Data analysis
Eigenvalue
Eigenvector
Factorial axis
Graphical representation
Inertia matrix
Matrix
Scatterplot
3. The total sum of products of X and Y:

  SXY = Σ_{i=1}^{t} Σ_{j=1}^{ni} (Xij − X̄..)(Yij − Ȳ..) .

4. The sum of squares of the treatments for X:

  TXX = Σ_{i=1}^{t} Σ_{j=1}^{ni} (X̄i. − X̄..)² .

5. The sum of squares of the treatments for Y (TYY) is calculated in the same way as TXX, but X is replaced by Y.
6. The sum of the products of the treatments of X and Y:

  TXY = Σ_{i=1}^{t} Σ_{j=1}^{ni} (X̄i. − X̄..)(Ȳi. − Ȳ..) .

7. The sum of squares of the errors for X:

  EXX = Σ_{i=1}^{t} Σ_{j=1}^{ni} (Xij − X̄i.)² .

8. The sum of the squares of the errors for Y (EYY) is calculated in the same way as EXX, but X is replaced by Y.
9. The sum of products of the errors of X and Y:

  EXY = Σ_{i=1}^{t} Σ_{j=1}^{ni} (Xij − X̄i.)(Yij − Ȳi.) .

Substituting in appropriate values and calculating these formulae corresponds to an analysis of variance for each of X, Y and XY.
The degrees of freedom associated with these different formulae are as follows:
1. For the total sum: Σ_{i=1}^{t} ni − 1.
2. For the sum of the treatments: t − 1.
3. For the sum of the errors: Σ_{i=1}^{t} ni − t.
Adjustment of the variable Y to the concomitant variable X yields two new sums of squares:
1. The adjusted total sum of squares:

  SStot = SYY − SXY²/SXX .

2. The adjusted sum of the squares of the errors:

  SSerr = EYY − EXY²/EXX .

The new degrees of freedom for these two sums are:
1. Σ_{i=1}^{t} ni − 2;
2. Σ_{i=1}^{t} ni − t − 1, where a degree of freedom is subtracted due to the adjustment.
The third adjusted sum of squares, the adjusted sum of the squares of the treatments, is given by:

  SStr = SStot − SSerr .

This has the same number of degrees of freedom, t − 1, as before.

Covariance Analysis Table
We now have all of the elements needed to establish the covariance analysis table. The sums of squares divided by the number of degrees of freedom give the mean squares.

  Source of     Degrees of            Sum of squares and of products
  variation     freedom               Σ x²    Σ xy    Σ y²
  Treatments    t − 1                 TXX     TXY     TYY
  Errors        Σ_{i=1}^{t} ni − t    EXX     EXY     EYY
  Total         Σ_{i=1}^{t} ni − 1    SXX     SXY     SYY
Note: the numbers in the Σ x² and Σ y² columns cannot be negative; on the other hand, the numbers in the Σ xy column can be negative.

Adjustment
The F ratio is given by:

  F = MCtr / MCerr .

This ratio follows a Fisher distribution with t − 1 and Σ_{i=1}^{t} ni − t − 1 degrees of freedom. The null hypothesis

  H0 : τ1 = τ2 = ... = τt

will be rejected at the significance level α if the F ratio is superior or equal to the value of the Fisher table, in other words if:

  F ≥ F_{t−1, Σ ni −t−1, α} .

It is clear that we assume that the β coefficient is different from zero when performing covariance analysis. If this is not the case, a simple analysis of variance is sufficient.

Test Concerning the β Slope
We would like to know whether there is a significant effect of the concomitant variable before the application of the treatment. To test this hypothesis, we take the null hypothesis to be

  H0 : β = 0 ,

which is rejected at the significance level α if:

  F ≥ F_{1, Σ ni −t−1, α} .

DOMAINS AND LIMITATIONS
The basic hypotheses that need to be constructed before initiating a covariance analysis are the same as those used for an analysis of variance or a regression analysis. These are the hypotheses of normality, homogeneity of variances (homoscedasticity), and independence.
In covariance analysis, as in an analysis of variance, the null hypothesis stipulates that the independent samples come from different populations that have identical means.
Moreover, since there are always conditions associated with any statistical technique, those that apply to covariance analysis are as follows:
1. The population distributions must be approximately normal, if not completely normal.
2. The populations from which the samples are taken must have the same variance σ², meaning:

  σ1² = σ2² = ... = σk² ,

where k is the number of populations to be compared.
3. The samples must be chosen randomly and all of the samples must be independent.
We must also add a basic hypothesis specific to covariance analysis, which is that the treatments that were carried out must not influence the values of the concomitant variable X.

EXAMPLES
Consider the following data, where X is the concomitant variable and Y the response, for three treatments of five observations each:

       1           2           3
   X     Y     X     Y     X     Y
  32    167   26    182   36    158
  29    172   33    171   34    191
  22    132   22    173   37    140
  23    158   28    163   37    192
  35    169   22    182   32    162

We first calculate the various sums of squares and products:
1. The total sum of squares for X:

  SXX = Σ_{i=1}^{3} Σ_{j=1}^{5} (Xij − X̄..)²
      = (32 − 29.8667)² + ... + (32 − 29.8667)²
      = 453.73 .

2. The total sum of squares for Y:

  SYY = Σ_{i=1}^{3} Σ_{j=1}^{5} (Yij − Ȳ..)²
      = (167 − 167.4667)² + ... + (162 − 167.4667)²
      = 3885.73 .

3. The total sum of the products of X and Y:

  SXY = Σ_{i=1}^{3} Σ_{j=1}^{5} (Xij − X̄..)(Yij − Ȳ..) = 158.93 .

4. The sum of the squares of the treatments for X:

  TXX = Σ_{i=1}^{3} Σ_{j=1}^{5} (X̄i. − X̄..)²
      = 5(28.2 − 29.8667)² + 5(26.2 − 29.8667)² + 5(35.2 − 29.8667)²
      = 223.33 .

5. The sum of the squares of the treatments for Y:

  TYY = Σ_{i=1}^{3} Σ_{j=1}^{5} (Ȳi. − Ȳ..)²
      = 5(159.6 − 167.4667)² + 5(174.2 − 167.4667)² + 5(168.6 − 167.4667)²
      = 542.53 .

6. The sum of the products of the treatments of X and Y:

  TXY = Σ_{i=1}^{3} Σ_{j=1}^{5} (X̄i. − X̄..)(Ȳi. − Ȳ..)
      = 5(28.2 − 29.8667)·(159.6 − 167.4667) + ... + 5(35.2 − 29.8667)·(168.6 − 167.4667)
      = −27.67 .

7. The sum of the squares of the errors for X:

  EXX = Σ_{i=1}^{3} Σ_{j=1}^{5} (Xij − X̄i.)²
      = (32 − 28.2)² + ... + (32 − 35.2)²
      = 230.40 .

8. The sum of the squares of the errors for Y:

  EYY = Σ_{i=1}^{3} Σ_{j=1}^{5} (Yij − Ȳi.)²
      = (167 − 159.6)² + ... + (162 − 168.6)²
      = 3343.20 .

9. The sum of the products of the errors of X and Y:

  EXY = Σ_{i=1}^{3} Σ_{j=1}^{5} (Xij − X̄i.)(Yij − Ȳi.)
      = (32 − 28.2)(167 − 159.6) + ... + (32 − 35.2)(162 − 168.6)
      = 186.60 .

The degrees of freedom associated with these different calculations are as follows:
1. For the total sums: Σ_{i=1}^{3} ni − 1 = 15 − 1 = 14.
2. For the sums of the treatments: t − 1 = 3 − 1 = 2.
3. For the sums of the errors: Σ_{i=1}^{3} ni − t = 15 − 3 = 12.
Adjusting the variable Y to the concomitant variable X yields two new sums of squares:
1. The adjusted total sum of squares:

  SStot = SYY − SXY²/SXX = 3885.73 − 158.93²/453.73 = 3830.06 .

2. The adjusted sum of the squares of the errors:

  SSerr = EYY − EXY²/EXX = 3343.20 − 186.60²/230.40 = 3192.07 .

The new degrees of freedom for these two sums are:
1. Σ_{i=1}^{3} ni − 2 = 15 − 2 = 13;
2. Σ_{i=1}^{3} ni − t − 1 = 15 − 3 − 1 = 11.
The adjusted sum of the squares of the treatments is given by:

  SStr = SStot − SSerr = 3830.06 − 3192.07 = 637.99 .
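All of these quantities can be reproduced directly from the data; a minimal sketch (not part of the original text):

  import numpy as np

  X = [np.array([32., 29, 22, 23, 35]),
       np.array([26., 33, 22, 28, 22]),
       np.array([36., 34, 37, 37, 32])]
  Y = [np.array([167., 172, 132, 158, 169]),
       np.array([182., 171, 173, 163, 182]),
       np.array([158., 191, 140, 192, 162])]

  x_all, y_all = np.concatenate(X), np.concatenate(Y)

  SXX = ((x_all - x_all.mean()) ** 2).sum()                      # 453.73
  SYY = ((y_all - y_all.mean()) ** 2).sum()                      # 3885.73
  SXY = ((x_all - x_all.mean()) * (y_all - y_all.mean())).sum()  # 158.93

  EXX = sum(((xi - xi.mean()) ** 2).sum() for xi in X)           # 230.40
  EYY = sum(((yi - yi.mean()) ** 2).sum() for yi in Y)           # 3343.20
  EXY = sum(((xi - xi.mean()) * (yi - yi.mean())).sum()
            for xi, yi in zip(X, Y))                             # 186.60

  SS_tot = SYY - SXY ** 2 / SXX                                  # 3830.06
  SS_err = EYY - EXY ** 2 / EXX                                  # 3192.07
  print(SS_tot, SS_err, SS_tot - SS_err)                         # SS_tr = 637.99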
Covariation
This relation states that there is a linear dependence between the two time series, which is not the case. Therefore, measuring the correlation between the evolutions of two phenomena over time does not imply the existence of a link between them. The term covariation is therefore used instead of correlation, and this dependence is measured using a covariation coefficient. We can distinguish between:
• The linear covariation coefficient;
• The tendency covariation coefficient.

HISTORY
See correlation coefficient and time series.

MATHEMATICAL ASPECTS
In order to compare two time series yt and xt, the first step is to attempt to represent them on the same graphic.
However, visual comparison is generally difficult. The following change of variables is performed:

  Yt = (yt − ȳ)/Sy  and  Xt = (xt − x̄)/Sx ,

which are the centered and reduced variables, where Sy and Sx are the standard deviations of the respective time series.
We can distinguish between the following covariation coefficients:
• The linear covariation coefficient
  The form of this expression is identical to the one for the correlation coefficient r:

  C = Σ_{t=1}^{n} (xt − x̄)(yt − ȳ) / √( Σ_{t=1}^{n} (xt − x̄)² · Σ_{t=1}^{n} (yt − ȳ)² ) .

  This yields values of between −1 and +1. If it is close to ±1, there is a linear relation between the time evolutions of the two variables.
  Notice that:

  C = Σ_{t=1}^{n} Xt·Yt / n .

  Here n is the number of observations, while Yt and Xt are the centered and reduced series obtained by the change of variables above.
• The tendency covariation coefficient
  The influence exerted by the means is eliminated by calculating:

  K = Σ_{t=1}^{n} (xt − Txt)(yt − Tyt) / √( Σ_{t=1}^{n} (xt − Txt)² · Σ_{t=1}^{n} (yt − Tyt)² ) .

  The means x̄ and ȳ have simply been replaced with the values of the secular trends Txt and Tyt of each time series.
  The tendency covariation coefficient K also takes values between −1 and +1, and the closer it gets to ±1, the stronger the covariation between the time series.
The centered and reduced series Xt and Yt, together with the products Xt·Yt and the lagged products Xt·Yt−1, are as follows:

    Xt       Yt      Xt·Yt    Xt·Yt−1
   0.00     2.25     0.00       —
   2.38     0.10     0.24      5.36
  −0.14    −0.92     0.13     −0.01
  −0.91     0.31    −0.28      0.84
   0.00    −0.41     0.00      0.00
  −0.56     0.41    −0.23      0.23
   0.28    −1.13    −0.32      0.11
  −1.05    −0.61     0.64      1.18

The linear covariation coefficient is then calculated:

  C = Σ_{t=1}^{8} Xt·Yt / 8 = 0.18/8 = 0.0225 .

If the observations Xt are compared with Yt−1, meaning that the production this year is compared with that of the previous year, we obtain:

  C = Σ_{t=2}^{8} Xt·Yt−1 / 8 = 7.71/8 = 0.964 .

The linear covariation coefficient for a shift of one year is very strong. There is a strong (positive) covariation with a shift of one year between the two variables.
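A short check (not part of the original entry) of both coefficients from the centered and reduced series:

  import numpy as np

  Xt = np.array([0.00, 2.38, -0.14, -0.91, 0.00, -0.56, 0.28, -1.05])
  Yt = np.array([2.25, 0.10, -0.92, 0.31, -0.41, 0.41, -1.13, -0.61])
  n = len(Xt)

  C = (Xt * Yt).sum() / n                 # ~0.0225: almost no covariation
  C_lag = (Xt[1:] * Yt[:-1]).sum() / n    # ~0.964: strong covariation at lag 1
  print(C, C_lag)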
FURTHER READING
Correlation coefficient
Moving average
Secular trend
Standard deviation
Time series

REFERENCES
Kendall, M.G.: Time Series. Griffin, London (1973)
Py, B.: Statistique Descriptive. Economica, Paris (1987)

Cox, David R.
Cox, David R. was born in 1924 in Birmingham in England. He studied mathematics at the University of Cambridge and obtained his doctorate in applied mathematics at the University of Leeds in 1949. From 1966 to 1988, he was a professor of statistics at Imperial College London, and then from 1988 to 1994 he taught at Nuffield College, Oxford.
Cox, David is an eminent statistician. He was knighted by Queen Elizabeth II in 1982 in gratitude for his contributions to statistical science, and has been named Doctor Honoris Causa by many universities in England and elsewhere. From 1981 to 1983 he was President of the Royal Statistical Society; he was also President of the Bernoulli Society from 1972 to 1983 and President of the International Statistical Institute from 1995 to 1997.
Due to the variety of subjects that he has studied and developed, Professor Cox has had a profound impact in his field. He was named Doctor Honoris Causa of the University of Neuchâtel in 1992.
Cox, Sir David is the author and the coauthor of more than 250 articles and 16 books, and between 1966 and 1991 he was the editor of Biometrika.

Some principal works and articles of Cox, Sir David:

1964 (and Box, G.E.P.) An analysis of transformations (with discussion). J. Roy. Stat. Soc. Ser. B 26, 211–243.
1970 The Analysis of Binary Data. Methuen, London.
1973 (and Hinkley, D.V.) Theoretical Statistics. Chapman & Hall, London.
1974 (and Atkinson, A.C.) Planning experiments for discriminating between models. J. Roy. Stat. Soc. Ser. B 36, 321–348.
1978 (and Hinkley, D.V.) Problems and Solutions in Theoretical Statistics. Chapman & Hall, London.
1981 (and Snell, E.J.) Applied Statistics: Principles and Examples. Chapman & Hall.
1981 Theory and general principles in statistics. J. Roy. Stat. Soc. Ser. A 144, 289–297.
2000 (and Reid, N.) The Theory of the Design of Experiments. Chapman & Hall, London.

Cox, Gertrude M.
… she agreed to, although she remained in the field of psychology because she worked on the evaluation of statistical tests in psychology and the analysis of psychological data.
On 1st November 1940, she became the director of the Department of Experimental Statistics of the State of North Carolina.
In 1945, the General Education Board gave her permission to create an institute of statistics at the University of North Carolina, with a department of mathematical statistics at Chapel Hill.
She was a founding member of the International Biometric Society in 1947; she was a director of its journal from 1947 to 1955 and president of the society from 1968 to 1969. She was president of the American Statistical Association (ASA) in 1956. She died in 1978.
Cp Criterion
Instead of trying to explain one variable using all of the available explanatory variables, it is sometimes possible to determine an underlying model with a subset of these variables, where the explanatory power of this model is almost the same as that of the model containing all of the explanatory variables.
4. Set D contains one equation with three explanatory variables:

  Y = β0 + β1·X1 + β2·X2 + β3·X3 + ε .

  District   X1   X2   X3   Y
His main work in actuarial mathematics is Collective Risk Theory. In 1929, he was asked to create a new department in Stockholm, and he became the first Swedish professor of actuarial and statistical mathematics. At the end of the Second World War he wrote his principal work, Mathematical Methods of Statistics, which was published for the first time in 1945 and was recently (in 1999) republished.

Some principal works and articles of Cramér, Harald:

1946 Mathematical Methods of Statistics. Princeton University Press, Princeton, NJ.
1946 Collective Risk Theory: A Survey of the Theory from the Point of View of the Theory of Stochastic Processes. Skandia Jubilee Volume, Stockholm.

Criterion of Total Mean Squared Error
The criterion of total mean squared error is a way of comparing estimations of the parameters of a biased or unbiased model.

MATHEMATICAL ASPECTS
Let

  β̂ = (β̂1, ..., β̂p−1)

be a vector of estimators for the parameters of a regression model. We define the total mean squared error, TMSE, of the vector β̂ of estimators as being the sum of the mean squared errors (MSE) of its components.
We recall that

  MSE(β̂j) = E[(β̂j − βj)²] = V(β̂j) + (E[β̂j] − βj)² ,

where E(.) and V(.) are the usual symbols used for the expected value and the variance. We define the total mean squared error as:

  TMSE(β̂) = Σ_{j=1}^{p−1} MSE(β̂j)
           = Σ_{j=1}^{p−1} E[(β̂j − βj)²]
           = Σ_{j=1}^{p−1} Var(β̂j) + Σ_{j=1}^{p−1} (E[β̂j] − βj)²
           = (p − 1)σ² · Trace(V) + Σ_{j=1}^{p−1} (E[β̂j] − βj)² ,

where V is the variance-covariance matrix of β̂.
To calculate the total mean squared error

  TMSE(β̂) = Σ_{j=1}^{p−1} E[(β̂j − βj)²]

of a vector of estimators

  β̂ = (β̂1, ..., β̂p−1)

for the parameters of the model

  Yi = β0 + β1·X^s_{i1} + ... + βp−1·X^s_{i,p−1} + εi ,

we need to know the values of the model parameters βj, which are obviously unknown for all real data. The notation X^s_j means that the data for the jth explanatory variable were standardized (see standardized data). Therefore, in order to estimate these TMSE, we generate data from a structure similar …
… explanatory variables. We compare these three estimations using the total mean squared errors of the three vectors:

  β̂LS = (β̂LS1, β̂LS2, β̂LS3, β̂LS4) ,
  β̂R  = (β̂R1, β̂R2, β̂R3, β̂R4) ,
  β̂SV = (β̂SV1, β̂SV2, β̂SV3, β̂SV4) .

Here the subscript LS corresponds to the least squares estimation, R to the ridge estimation, and SV to the estimation after a selection of variables, in which the estimated coefficients of the non-selected variables are set to be zero. In our case we have:

  β̂LS = (9.12, 7.94, 0.65, −2.41) ,
  β̂R  = (7.64, 4.67, −0.91, −5.84) ,
  β̂SV = (8.64, 10.3, 0, 0) .

We have chosen to approximate the underlying process that results in the cement data by the following least squares equation:

  Yi = 95.4 + 9.12·X^s_i1 + 7.94·X^s_i2 + 0.65·X^s_i3 − 2.41·X^s_i4 + εi .

The three methods are applied to these 100 samples without any influence from the results of the equations

  Y^LS_i = 95.4 + 9.12·X^s_i1 + 7.94·X^s_i2 + 0.65·X^s_i3 − 2.41·X^s_i4 ,
  Y^R_i  = 95.4 + 7.64·X^s_i1 + 4.67·X^s_i2 − 0.91·X^s_i3 − 5.84·X^s_i4 , and
  Y^SV_i = 95.4 + 8.64·X^s_i1 + 10.3·X^s_i2 .

It is possible that, for one of these 100 samples, the selection method has selected the variables Xi2 and Xi3, or only Xi3, or any other subset of the four available variables. In the same way, despite the fact that the equation Y^R_i = 95.4 + 7.64·X^s_i1 + 4.67·X^s_i2 − 0.91·X^s_i3 − 5.84·X^s_i4 was obtained with k = 0.157, the value of k is recalculated for each of these 100 samples (for more details refer to the ridge regression example).
From these 100 estimations of β̂LS, β̂SV and β̂R, we can calculate 100 total mean squared errors for each method; for example:

  TMSE(β̂SV) = (β̂SV1 − 9.12)² + (β̂SV2 − 7.94)² + (β̂SV3 − 0.65)² + (β̂SV4 + 2.41)² .

After this simulation was performed, we can conclude (at least for the particular model used to generate the simulated data, and taking into account our aim of obtaining a small TMSE) that the ridge method is the best of the methods considered, followed by the variable selection procedure.
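The comparison criterion itself is a one-line computation; a minimal sketch (not from the original text), using the least squares values above as the reference parameters:

  import numpy as np

  def tmse(beta_hat, beta_true):
      # Total mean squared error: sum of squared estimation errors.
      return ((np.asarray(beta_hat) - np.asarray(beta_true)) ** 2).sum()

  beta_ref = [9.12, 7.94, 0.65, -2.41]               # least squares estimates

  print(tmse([7.64, 4.67, -0.91, -5.84], beta_ref))  # ridge
  print(tmse([8.64, 10.30, 0.00, 0.00], beta_ref))   # variable selection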
FURTHER READING
Mean squared error

Critical Value
… test on the right or the left), the probability distribution and the significance level α.
At a significance level of α = 5%, by looking at the normal table we find that the critical value z_{α/2} equals 1.96.

FURTHER READING
Confidence interval
Hypothesis testing
Significance level
Statistical table

REFERENCE
Neyman, J., Pearson, E.S.: On the use and interpretation of certain test criteria for purposes of statistical inference, Parts I and II. Biometrika 20A, 175–240, 263–294 (1928)
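The normal-table lookup can be checked with a quantile function; a one-line verification (not part of the original entry):

  from scipy.stats import norm

  alpha = 0.05
  print(norm.ppf(1 - alpha / 2))   # 1.959963... ~ 1.96, the critical value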
Cyclical Fluctuation
The first step when investigating a time series is always to determine the secular trend Tt, and then to determine the seasonal variation St. It is then possible to adjust the initial data of the time series Yt according to these two components:
• Yt/(St · Tt) = Ct · It (multiplicative model);
• Yt − St − Tt = Ct + It (additive model).
To avoid smoothing out the cyclical fluctuations, a weighted moving average is established over a few months only. The use of moving averages allows us to smooth the irregular variations It while preserving the cyclical fluctuations Ct.
The choice of a weighted moving average allows us to give more weight to the central values compared to the extreme values, in order to reproduce cyclical fluctuations in a more accurate way. Therefore, large weights will be given to the central values and small weights to the extreme values.
For example, for a moving average considered for an interval of five months, the weights −0.1, 0.3, 0.6, 0.3 and −0.1 can be used; since their sum is 1, there will be no need for normalization.
If the values of Ct · It (resulting from the adjustments performed with respect to the secular trend and to the seasonal variations) are denoted by Xt, the value of the cyclical fluctuation for the month t is determined by:

  Ct = −0.1·Xt−2 + 0.3·Xt−1 + 0.6·Xt + 0.3·Xt+1 − 0.1·Xt+2 .
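Because the weights are symmetric and sum to 1, this is a plain weighted moving average; a generic sketch (the input values are illustrative, not from the original entry):

  import numpy as np

  weights = np.array([-0.1, 0.3, 0.6, 0.3, -0.1])   # symmetric, sums to 1

  def cyclical_fluctuation(x):
      # Five-month weighted moving average; two values are lost at each end.
      return np.convolve(x, weights, mode="valid")

  x = np.array([99.9, 100.4, 100.2, 99.0, 98.5, 99.3])  # illustrative X_t
  print(cyclical_fluctuation(x))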
DOMAINS AND LIMITATIONS
Estimating cyclical fluctuations allows us to:
• Determine the maxima or minima that a time series can attain.
• Perform short- or medium-term forecasting.
• Identify the cyclical components.
The limitations and advantages of the use of weighted moving averages when evaluating cyclical fluctuations are the following:
• Weighted moving averages can smooth cyclical fluctuations, but these fluctuations usually vary in length and amplitude. This is due to the presence of a multitude of factors, where the effects of these factors can change from one cycle to the other. None of the models used to explain and predict such fluctuations have been found to be completely satisfactory.

EXAMPLES
Let us establish a moving average considered over five months of data adjusted according to the secular trend and seasonal variations.
Let us also use the weights −0.1, 0.3, 0.6, 0.3, −0.1; since their sum is equal to 1, there is no need for normalization.
Let Xi be the adjusted values of Ci · Ii:

  Ci = −0.1·Xi−2 + 0.3·Xi−1 + 0.6·Xi + 0.3·Xi+1 − 0.1·Xi+2 .

The table below shows the electrical power consumed by street lights every month, in millions of kilowatt hours, during the years 1952 and 1953. The data have been adjusted according to the secular trend and seasonal variations.

  Year   Month   Data Xi   Moving average for 5 months Ci
  1952   J        99.9
         F       100.4
         M       100.2     100.1
         A        99.0      99.1
  1953   J       100.6     100.3
         F       100.1     100.4
         M       100.5     100.1
         A        99.2      99.5

… a complete cycle is often sought, but this only appears every 20 years.

FURTHER READING
Forecasting
Irregular variation