Generating Data with Identical Statistics but Dissimilar Graphics: A Follow up to the Anscombe Dataset

Sangit Chatterjee and Aykut Firat

The Anscombe dataset is popular for teaching the importance of graphics in data analysis. It consists of four datasets that have identical summary statistics (e.g., mean, standard deviation, and correlation) but dissimilar data graphics (scatterplots). In this article, we provide a general procedure to generate datasets with identical summary statistics but dissimilar graphics by using a genetic algorithm based approach.

KEYWORDS: Genetic algorithms; Ortho-normalization; Non-linear optimization.

1. INTRODUCTION
To demonstrate the usefulness of graphics in statistical analysis, Anscombe (1973) produced four datasets, each with an independent variable $x$ and a dependent variable $y$, that had the same summary statistics (such as mean, standard deviation, and correlation) but produced completely different scatterplots. The Anscombe dataset is reproduced in Table 1 and the scatterplots of the four datasets are given in Figure 1. The dataset has now become famous as the Anscombe data, and is often used in introductory statistics classes as an example to illustrate the usefulness of graphics: an apt illustration of the well-known wisdom that a scatterplot can often reveal patterns that may be hidden by summary statistics. It is not known, however, how Anscombe came up with his datasets. In this article, we provide a general procedure to generate datasets with identical summary statistics but dissimilar graphics by using a genetic-algorithm-based approach.

Sangit Chatterjee is Professor, and Aykut Firat is Assistant Professor, College of Business Administration, Northeastern University, Boston, MA 02115 (E-mail addresses: s.chatterjee@neu.edu and a.firat@neu.edu). We greatly appreciate the editor’s and an anonymous associate editor’s comments that greatly improved the article.

2. PROBLEM DESCRIPTION

Consider a given data matrix $X^*$ consisting of two data vectors of size $n$: the independent variable $x^*$, and the dependent variable $y^*$. (Though we present the case for two data vectors, our methodology is generally applicable.) Let $\bar{x}^*, \bar{y}^*$ be the mean values and $s^*_x, s^*_y$ the standard deviations of the vectors, and let $r^*$ be the correlation coefficient between vectors $x^*$ and $y^*$. Let $X$ be another data matrix containing two data vectors of size $n$: $x$, $y$. The problem is to find at least one $X$ that has the same summary statistics as $X^*$. At the same time, the scatterplot of $x$, $y$ should be dissimilar to that of $x^*$, $y^*$ according to a function $g(X, X^*)$, which quantifies the graphical difference between the scatterplots of $x$, $y$ and $x^*$, $y^*$. This problem can be formulated as a mathematical program as follows:

\[
\begin{aligned}
\text{maximize} \quad & g(X, X^*) \\
\text{s.t.} \quad & |\bar{x}^* - \bar{x}| + |\bar{y}^* - \bar{y}| + |s^*_x - s_x| + |s^*_y - s_y| + |r^* - r| = 0.
\end{aligned}
\]

In the above formulation, the objective function to be maximized is the graphical dissimilarity between the generated matrix $X$ and the given matrix $X^*$, and the constraint ensures that the summary statistics will be identical. In order to measure the graphical dissimilarity between two scatterplots, we considered the absolute value differences between the following quantities of $X$ and $X^*$:

a. ordered data values: $g = \sum_i \left( |x_{(i)} - x^*_{(i)}| + |y_{(i)} - y^*_{(i)}| \right)$.
b. Kolmogorov–Smirnov test statistics over an interpolated grid of $y$ values: $g = \max_a |F(a) - F^*(a)|$, where $F(a)$ is the proportion of $y_i$ values less than or equal to $a$ and $F^*(a)$ is the proportion of $y^*_i$ values less than or equal to $a$, where $a$ corresponds to all possible values of $y_i$ and $y^*_i$.
c. the quadratic coefficients of the regression fit: $g = |b_2 - b^*_2|$, where $y_i = b_0 + b_1 x_i + b_2 x_i^2 + e_i$ and $y^*_i = b^*_0 + b^*_1 x^*_i + b^*_2 x_i^{*2} + e^*_i$.
d. Breusch-Pagan (1979) Lagrange multiplier (LM) statistics as a measure of heteroscedasticity: $g = |LM - LM^*|$.
e. standardized skewness: $g_{\text{skewness}} = \left| \sum_i \frac{(y_i - \bar{y})^3}{s_y^3} - \sum_i \frac{(y^*_i - \bar{y}^*)^3}{s_y^{*3}} \right|$.
f. standardized kurtosis: $g_{\text{kurtosis}} = \left| \sum_i \frac{(y_i - \bar{y})^4}{s_y^4} - \sum_i \frac{(y^*_i - \bar{y}^*)^4}{s_y^{*4}} \right|$.
g. maximum of the Cook’s $D$ statistic (Cook 1977): $g = |\max(d_i) - \max(d^*_i)|$, where $d_i$ is Cook’s $D$ statistic for observation $i$.

We also experimented with various combinations of the above items such as the multiplicative combination of standardized skewness and kurtosis measures ($g = g_{\text{skewness}} \times g_{\text{kurtosis}}$). We report on such experiments in the results section.

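Several of these measures are easy to compute directly. The following Matlab sketch is an illustration only, not the authors' code: it evaluates measures (a), (b), (c), (e), and (f) for Anscombe's datasets 1 and 2 as listed in Table 1, and leaves out the Breusch-Pagan and Cook's D measures. The variable names and the two helper functions `ecdf_at` and `std_moment` are assumptions made for this example, and the Kolmogorov-Smirnov grid is assumed to be the pooled set of observed y values.

```matlab
% Anscombe's datasets 1 and 2 from Table 1 (columns: x, y).
% Xo plays the role of the given matrix X*, X the role of the candidate.
x  = [10 8 13 9 11 14 6 4 12 7 5]';
Xo = [x, [8.04 6.95 7.58 8.81 8.33 9.96 7.24 4.26 10.84 4.82 5.68]'];
X  = [x, [9.14 8.14 8.76 8.77 9.26 8.10 6.13 3.10 9.13 7.26 4.74]'];

% (a) Sum of absolute differences between ordered data values.
%     sort() orders each column separately.
g_ordered = sum(sum(abs(sort(X) - sort(Xo))));

% (b) Kolmogorov-Smirnov statistic on the y values.
%     ecdf_at(v, a): proportion of column vector v that is <= each entry
%     of row vector a (hypothetical helper).
ecdf_at = @(v, a) mean(bsxfun(@le, v, a), 1);
a_grid  = unique([X(:,2); Xo(:,2)])';          % assumed grid: pooled y values
g_ks    = max(abs(ecdf_at(X(:,2), a_grid) - ecdf_at(Xo(:,2), a_grid)));

% (c) Difference between quadratic regression coefficients.
%     polyfit returns coefficients from the highest power down.
p  = polyfit(X(:,1),  X(:,2),  2);
po = polyfit(Xo(:,1), Xo(:,2), 2);
g_quad = abs(p(1) - po(1));

% (e), (f) Standardized skewness and kurtosis differences.
%     std_moment(v, k): sum of k-th powers of deviations over s^k
%     (hypothetical helper; std uses the n-1 divisor).
std_moment = @(v, k) sum((v - mean(v)).^k) / std(v)^k;
g_skew = abs(std_moment(X(:,2), 3) - std_moment(Xo(:,2), 3));
g_kurt = abs(std_moment(X(:,2), 4) - std_moment(Xo(:,2), 4));

% Multiplicative combination of the skewness and kurtosis measures.
g_combined = g_skew * g_kurt;
```

Each measure is zero when the two datasets coincide and grows as the corresponding feature of the scatterplots diverges, which is what makes it usable as an objective to maximize.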
248 The American Statistician, August 2007, Vol. 61, No. 3
© American Statistical Association DOI: 10.1198/000313007X220057
 
Table 1. Anscombe’s Original Dataset. All four datasets have identical summary statistics: means ($\bar{x} = 9.0$, $\bar{y} = 7.5$), regression coefficients ($b_0 = 3.0$, $b_1 = 0.5$), standard deviations ($s_x = 3.32$, $s_y = 2.03$), correlation coefficients, etc.

    Dataset 1        Dataset 2        Dataset 3        Dataset 4
   x       y        x       y        x       y        x       y
  10    8.04       10    9.14       10    7.46        8    6.58
   8    6.95        8    8.14        8    6.77        8    5.76
  13    7.58       13    8.76       13   12.74        8    7.71
   9    8.81        9    8.77        9    7.11        8    8.84
  11    8.33       11    9.26       11    7.81        8    8.47
  14    9.96       14    8.10       14    8.84        8    7.04
   6    7.24        6    6.13        6    6.08        8    5.25
   4    4.26        4    3.10        4    5.39        8    5.56
  12   10.84       12    9.13       12    8.15        8    7.91
   7    4.82        7    7.26        7    6.42        8    6.89
   5    5.68        5    4.74        5    5.73       19   12.5

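The statement in the caption of Table 1 can be checked numerically. The Matlab sketch below is an illustration rather than part of the original article: it enters the four datasets as printed in Table 1 and computes the mean, sample standard deviation, and correlation of each. The variable names are illustrative. It also evaluates the left-hand side of the constraint from the problem description for datasets 1 and 2, which comes out small rather than exactly zero because the tabulated values are rounded.

```matlab
% Anscombe's four datasets as printed in Table 1.
x123 = [10 8 13 9 11 14 6 4 12 7 5]';            % shared x for datasets 1-3
x4   = [8 8 8 8 8 8 8 8 8 8 19]';                % x for dataset 4
y1 = [8.04 6.95 7.58 8.81 8.33 9.96 7.24 4.26 10.84 4.82 5.68]';
y2 = [9.14 8.14 8.76 8.77 9.26 8.10 6.13 3.10 9.13 7.26 4.74]';
y3 = [7.46 6.77 12.74 7.11 7.81 8.84 6.08 5.39 8.15 6.42 5.73]';
y4 = [6.58 5.76 7.71 8.84 8.47 7.04 5.25 5.56 7.91 6.89 12.5]';
data = {[x123 y1], [x123 y2], [x123 y3], [x4 y4]};

% One row per dataset: [mean(x) mean(y) std(x) std(y) r].
stats = zeros(4, 5);
for k = 1:4
    D = data{k};
    R = corrcoef(D(:,1), D(:,2));                % 2-by-2 correlation matrix
    stats(k,:) = [mean(D), std(D), R(1,2)];      % std uses the n-1 divisor
end
disp(round(stats * 100) / 100);                  % rounded to two decimals

% Left-hand side of the constraint, comparing datasets 1 and 2.
violation = sum(abs(stats(1,:) - stats(2,:)));
```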
3. METHODOLOGY
We propose a genetic algorithm (GA) (Goldberg 1989) based solution to our problem. GAs are often used for problems that are difficult to solve with traditional optimization techniques; they are therefore a good choice for our problem, which has a discontinuous and nonlinear objective function with undefined derivatives. See also Chatterjee, Laudato, and Lynch (1996) for applications of genetic algorithms to problems of statistical estimation.

In a GA an individual solution is called a gene, and is typically represented as a vector of real numbers, bits (0/1), or character strings. In the beginning, an initial population of genes is created. The GA then repeatedly modifies this population of individual solutions over many generations. At each generation, child genes are produced from randomly selected parents (crossover), or from randomly modified individual genes (mutation). In accord with the Darwinian principle of “natural selection,” genes with high “fitness values” have a higher chance of survival in the next generations. Over successive generations, the population evolves toward an optimal solution. We now explain the details of this algorithm applied to our problem.
3.1 Representation
We conceptualize a gene as a matrix of size $n \times 2$ having real values. For example, when $n = 11$ (the size of Anscombe’s data), an example gene $X$ would be as follows (note that the transpose of $X$ is shown below):

\[
X^{T} = \begin{bmatrix}
-0.43 & -1.66 & 0.12 & 0.28 & -1.14 & 1.19 & 1.18 & -0.03 & 0.32 & 0.17 & -0.18 \\
0.72 & -0.58 & 2.18 & -0.13 & 0.11 & 1.06 & 0.05 & -0.09 & -0.83 & 0.29 & -1.33
\end{bmatrix}.
\]

Figure 1. Scatterplots of Anscombe’s data. Scatterplots of the Anscombe datasets reveal different data graphics.

3.2 Initial Population Creation

Individual solutions in our population should satisfy the constraint in our mathematical formulation in order to be feasible solutions. Given an original data matrix $X^*$ of size $n \times 2$, we accomplish this through orthonormalization and a transformation as outlined in the following for a single gene (Matlab statements for a specific case ($n = 11$) are also given for each step).

(i) Generate a matrix $X$ of size $n \times 2$ with randomly generated data from a standard normal distribution. Distributions other than the standard normal can also be used in this step.
Matlab> X = randn(11,2)
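(ii) Set the mean values of $X$’s columns to zero using $X = X - e_{n \times 1} \bar{X}$, where $e_{n \times 1}$ is an $n$-element column vector of ones and $\bar{X}$ is the row vector of column means of $X$. This step is needed to make sure that, after ortho-normalization, the standard deviation of each column is fixed by its unit norm: a zero-mean column of unit norm has sample standard deviation $1/\sqrt{n-1}$.
Matlab> X = X - ones(11,1)*mean(X)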
(iii) Ortho-normalize the columns of $X$. For this, we use the Gram-Schmidt process (Arfken 1985), taking a nonorthogonal set of linearly independent vectors $x$ and $y$ and constructing an orthonormal basis $e_1$ and $e_2$ as follows (for the two-vector case):

\[
u_1 = x, \qquad u_2 = y - \mathrm{proj}_{u_1} y,
\]

where

\[
\mathrm{proj}_{u} v = \frac{\langle v, u \rangle}{\langle u, u \rangle}\, u,
\]

and $\langle v_1, v_2 \rangle$ represents the inner product. Then

\[
e_1 = \frac{u_1}{\|u_1\|}, \qquad e_2 = \frac{u_2}{\|u_2\|},
\]

and $X_{\text{ortho-normalized}} = [e_1, e_2]$.
Matlab> X = grams(X);
where grams is a custom function that performs Gram-Schmidt ortho-normalization.

(iv) Transform $X$ with the following equation:

\[
X = \sqrt{n-1}\; X_{\text{ortho-normalized}}\; \mathrm{cov}(X^*)^{1/2} + e_{n \times 1} \bar{X}^*,
\]

where $\mathrm{cov}(X^*)$ is the covariance matrix of $X^*$ and $\bar{X}^* = [\bar{x}^*_1, \bar{x}^*_2]$ is the row vector of its column means. The factor $\sqrt{n-1}$ is needed since we are using the sample standard deviation in covariance calculations.
Matlab> X = sqrt(10) * X * sqrtm(cov(Xo)) + ones(11,1)*mean(Xo);
where Xo is the original data matrix $X^*$.

With these steps, we can create a gene that satisfies the constraint of our mathematical formulation, that is, the aforementioned summary statistics of our new gene are identical to those of the original data. We independently generated 1,000 such random genes to create our initial population.
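Steps (i) to (iv) can be combined into one short script. The following is a minimal sketch, assuming Anscombe's dataset 1 from Table 1 as the original data `Xo`. Because the article's custom `grams` function is not shown, the Gram-Schmidt step is written out inline for the two columns as a stand-in. The final lines, which are not part of the article, evaluate the left-hand side of the constraint to confirm that the generated gene is feasible up to floating-point rounding.

```matlab
% Original data Xo: Anscombe's dataset 1 from Table 1 (n = 11).
Xo = [10 8 13 9 11 14 6 4 12 7 5; ...
      8.04 6.95 7.58 8.81 8.33 9.96 7.24 4.26 10.84 4.82 5.68]';
n = size(Xo, 1);

% (i) Random n-by-2 matrix with standard normal entries.
X = randn(n, 2);

% (ii) Center each column at zero.
X = X - ones(n, 1) * mean(X);

% (iii) Gram-Schmidt ortho-normalization of the two columns
%       (inline stand-in for the article's custom grams function).
u1 = X(:,1);
u2 = X(:,2) - (X(:,2)' * u1) / (u1' * u1) * u1;   % subtract projection onto u1
X  = [u1 / norm(u1), u2 / norm(u2)];

% (iv) Impose the covariance matrix and the column means of Xo.
X = sqrt(n - 1) * X * sqrtm(cov(Xo)) + ones(n, 1) * mean(Xo);

% Feasibility check: left-hand side of the constraint.
Rx = corrcoef(X(:,1), X(:,2));
Ro = corrcoef(Xo(:,1), Xo(:,2));
violation = sum(abs(mean(X) - mean(Xo))) + sum(abs(std(X) - std(Xo))) ...
            + abs(Rx(1,2) - Ro(1,2));
```

The check works because the centered, ortho-normalized columns satisfy $X^{T}X = I$, so the sample covariance of the transformed matrix is $\mathrm{cov}(X^*)^{1/2}\, I\, \mathrm{cov}(X^*)^{1/2} = \mathrm{cov}(X^*)$.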
3.3 Fitness Values
At each generation, a fitness value needs to be calculated for each population member. For our problem a gene’s fitness is proportional to its graphical difference from the given data matrix $X^*$. We used the graphical difference functions mentioned in the problem description section in different runs of experiments.
3.4 Creating The Next Generation
3.4.1 Selection
Once the fitness values are calculated, parents are selected for the next generation based on their fitness. We use the “stochastic uniform” selection procedure, which is the default method in the Matlab Genetic Algorithm Toolbox (Matlab 2006). This selection algorithm first lays out a line in which each parent corresponds to a section of the line of length proportional to its scaled fitness value. Then the algorithm moves along the line in steps of equal size. At each step, the algorithm allocates a parent from the section it lands on.
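The selection procedure described above can be sketched in a few lines of Matlab. This is an illustration of the idea, not the toolbox's own implementation: the fitness values, the number of parents, and the variable names are hypothetical, and the random starting offset (smaller than one step) is an assumption, since the text only states that the steps are of equal size.

```matlab
% Hypothetical scaled fitness values for a population of five genes.
scaledFitness = [4 3 2 0.5 0.5];
nParents = 6;                          % number of parents to select

% Lay out the line: gene k owns the interval (edges(k), edges(k+1)].
edges = [0, cumsum(scaledFitness)];
stepSize = edges(end) / nParents;

% Assumption: start at a uniform random offset smaller than one step.
position = rand * stepSize;

parents = zeros(1, nParents);
for j = 1:nParents
    % Index of the section the current position lands on.
    parents(j) = find(position <= edges(2:end), 1, 'first');
    position = position + stepSize;    % move along the line in equal steps
end
```

Because the steps are equally spaced, a gene whose section spans k full steps is selected at least k times, so the number of times a gene is chosen tracks its share of the total scaled fitness more closely than independent random draws would.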
3.4.2 New Children
Three types of children are created for the next generation:
(i) Elite Children: Individuals in the current generation with the top two best fitness values are called elite children, and automatically survive in the next generation.
(ii) Crossover Children: These are children obtained by combining two parent genes. A child is obtained by splitting two selected parent genes at a random point, and combining the head of one with the tail of the other and vice versa, as illustrated in Figure 2. Eighty percent of the individuals in the new generation are created this way.
(iii) Mutation Children: Mutation children make up the remaining members of the new generation. A parent gene is modified by adding a random number, or mutation, chosen from a
Figure 2. Crossover operation.
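The crossover operation in item (ii) can be sketched as follows. The text here does not say how an n-by-2 gene is split, so this sketch assumes that each gene is flattened column by column into a vector of length 2n and cut at a single random position. The parent genes and all variable names are hypothetical. Children formed this way need not satisfy the summary-statistics constraint on their own.

```matlab
n = 11;
parentA = randn(n, 2);      % two hypothetical parent genes
parentB = randn(n, 2);

a = parentA(:);             % assumption: flatten column by column
b = parentB(:);
cut = randi(2*n - 1);       % random split point, 1 <= cut <= 2n-1

% Head of one parent joined to the tail of the other, and vice versa.
child1 = reshape([a(1:cut); b(cut+1:end)], n, 2);
child2 = reshape([b(1:cut); a(cut+1:end)], n, 2);
```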