Generating Data with Identical Statistics but Dissimilar Graphics: A Follow up to the Anscombe Dataset
Sangit Chatterjee and Aykut Firat
The Anscombe dataset is popular for teaching the importance of graphics in data analysis. It consists of four datasets that have identical summary statistics (e.g., mean, standard deviation, and correlation) but dissimilar data graphics (scatterplots). In this article, we provide a general procedure to generate datasets with identical summary statistics but dissimilar graphics by using a genetic algorithm based approach.

KEYWORDS: Genetic algorithms; Ortho-normalization; Non-linear optimization.
1. INTRODUCTION
To demonstrate the usefulness of graphics in statistical analysis, Anscombe (1973) produced four datasets, each with an independent variable $x$ and a dependent variable $y$, that had the same summary statistics (such as mean, standard deviation, and correlation), but produced completely different scatterplots. The Anscombe dataset is reproduced in Table 1 and the scatterplots of the four datasets are given in Figure 1. The dataset has now become famous as the Anscombe data, and is often used in introductory statistics classes as an example to illustrate the usefulness of graphics: an apt illustration of the well-known wisdom that a scatterplot can often reveal patterns that may be hidden by summary statistics. It is not known, however, how Anscombe came up with his datasets. In this article, we provide a general procedure to generate datasets with identical summary statistics but dissimilar graphics by using a genetic-algorithm-based approach.
2. PROBLEM DESCRIPTION
Consider a given data matrix $X^*$ consisting of two data vectors of size $n$: the independent variable $x^*$ and the dependent variable $y^*$. (Though we present the case for two data vectors, our methodology is generally applicable.) Let $\bar{x}^*, \bar{y}^*$ be the mean values and $s_x^*, s_y^*$ the standard deviations of the vectors, and let $r^*$ be the correlation coefficient between vectors $x^*$ and $y^*$. Let $X$ be another data matrix containing two data vectors of size $n$: $x$, $y$. The problem is to find at least one $X$ that has the same summary statistics as $X^*$. At the same time, the scatterplot of $x$, $y$ should be dissimilar to that of $x^*$, $y^*$ according to a function $g(X, X^*)$, which quantifies the graphical difference between the scatterplots of $x$, $y$ and $x^*$, $y^*$. This problem can be formulated as a mathematical program as follows:

$$
\begin{aligned}
\text{maximize} \quad & g(X, X^*) \\
\text{s.t.} \quad & |\bar{x}^* - \bar{x}| + |\bar{y}^* - \bar{y}| + |s_x^* - s_x| + |s_y^* - s_y| + |r^* - r| = 0.
\end{aligned}
$$

Sangit Chatterjee is Professor, and Aykut Firat is Assistant Professor, College of Business Administration, Northeastern University, Boston, MA 02115 (E-mail addresses: s.chatterjee@neu.edu and a.firat@neu.edu). We greatly appreciate the editor's and an anonymous associate editor's comments that greatly improved the article.
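As an illustration only, the left-hand side of the constraint in the program above can be evaluated numerically. The following is a minimal Python sketch, not the authors' implementation: the function names (`summary_stats`, `constraint_violation`, `is_feasible`) and the tolerance `tol` are hypothetical, and the use of the sample standard deviation (`ddof=1`) is an assumption, since the text does not say which estimator is meant.

```python
import numpy as np

def summary_stats(x, y):
    """Return (mean_x, mean_y, sd_x, sd_y, r) for two equal-length vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.array([
        x.mean(),
        y.mean(),
        x.std(ddof=1),             # sample standard deviation (assumption)
        y.std(ddof=1),
        np.corrcoef(x, y)[0, 1],   # Pearson correlation coefficient
    ])

def constraint_violation(x, y, x_star, y_star):
    """Sum of absolute differences between the five summary statistics."""
    diff = summary_stats(x_star, y_star) - summary_stats(x, y)
    return float(np.abs(diff).sum())

def is_feasible(x, y, x_star, y_star, tol=1e-8):
    """Treat the equality constraint as satisfied up to a small tolerance (assumption)."""
    return constraint_violation(x, y, x_star, y_star) <= tol

# Illustrative target data (randomly generated, not the Anscombe values).
rng = np.random.default_rng(0)
x_star = rng.normal(size=11)
y_star = 3.0 + 0.5 * x_star + rng.normal(scale=0.5, size=11)

print(constraint_violation(x_star, y_star, x_star, y_star))        # identical data: 0.0
print(constraint_violation(x_star, y_star + 1.0, x_star, y_star))  # mean of y shifted by 1
```

In floating-point arithmetic a candidate found by search will rarely give exactly zero, which is why the sketch compares against a tolerance instead of testing for equality.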
In the above formulation, the objective function to be maximized is the graphical dissimilarity between $X$ and $X^*$, and the constraint ensures that the summary statistics will be identical. In order to measure the graphical dissimilarity between two scatterplots, we considered the absolute value differences between the following quantities of $X$ and $X^*$:

a. ordered data values ($g = \sum_i \left( |x_{(i)} - x^*_{(i)}| + |y_{(i)} - y^*_{(i)}| \right)$).

b. Kolmogorov–Smirnov test statistics over an interpolated grid of $y$ values ($g = \max_a |F(a) - F^*(a)|$, where $F(a)$ is the proportion of $y_i$ values less than or equal to $a$, $F^*(a)$ is the proportion of $y^*_i$ values less than or equal to $a$, and $a$ corresponds to all possible values of $y_i$ and $y^*_i$).

c. the quadratic coefficients of the regression fit ($g = |b_2 - b_2^*|$, where $y_i = b_0 + b_1 x_i + b_2 x_i^2 + e_i$ and $y^*_i = b^*_0 + b^*_1 x^*_i + b^*_2 (x^*_i)^2 + e^*_i$).

d. Breusch-Pagan (1979) Lagrange multiplier (LM) statistics as a measure of heteroscedasticity ($g = |LM - LM^*|$).

e. standardized skewness

$$
g_{\text{skewness}} = \left| \frac{\sum_i (y_i - \bar{y})^3}{s_y^3} - \frac{\sum_i (y^*_i - \bar{y}^*)^3}{(s_y^*)^3} \right|.
$$

f. standardized kurtosis

$$
g_{\text{kurtosis}} = \left| \frac{\sum_i (y_i - \bar{y})^4}{s_y^4} - \frac{\sum_i (y^*_i - \bar{y}^*)^4}{(s_y^*)^4} \right|.
$$

g. maximum of the Cook's $D$ statistic (Cook 1977) ($g = |\max(d_i) - \max(d^*_i)|$, where $d_i$ is Cook's $D$ statistic for observation $i$).

We also experimented with various combinations of the above items, such as the multiplicative combination of standardized skewness and kurtosis measures ($g = g_{\text{skewness}} \times g_{\text{kurtosis}}$). We report on such experiments in the results section.
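Several of the dissimilarity measures listed above can be sketched in a few lines of Python. This is only an illustrative reading of items a, b, c, e, and f and of the multiplicative combination, not the authors' code. All function names are hypothetical. The sample standard deviation (`ddof=1`) is an assumption, and the Kolmogorov–Smirnov gap is evaluated at the pooled observed values instead of an interpolated grid. Items d and g are omitted because they need a fuller regression toolkit.

```python
import numpy as np

def g_ordered(x, y, x_star, y_star):
    """(a) Sum of absolute differences between ordered data values."""
    return float(np.abs(np.sort(x) - np.sort(x_star)).sum()
                 + np.abs(np.sort(y) - np.sort(y_star)).sum())

def g_ks(y, y_star):
    """(b) Largest gap between the empirical CDFs of y and y*."""
    grid = np.sort(np.concatenate([y, y_star]))   # pooled observed values (assumption)
    F = np.searchsorted(np.sort(y), grid, side="right") / len(y)
    F_star = np.searchsorted(np.sort(y_star), grid, side="right") / len(y_star)
    return float(np.abs(F - F_star).max())

def g_quadratic(x, y, x_star, y_star):
    """(c) Absolute difference between the quadratic regression coefficients."""
    b2 = np.polyfit(x, y, deg=2)[0]               # coefficients come highest power first
    b2_star = np.polyfit(x_star, y_star, deg=2)[0]
    return float(abs(b2 - b2_star))

def _std_moment(y, k):
    """Sum of k-th powers of deviations from the mean, divided by s**k."""
    y = np.asarray(y, dtype=float)
    return ((y - y.mean()) ** k).sum() / y.std(ddof=1) ** k

def g_skewness(y, y_star):
    """(e) Absolute difference in standardized skewness."""
    return float(abs(_std_moment(y, 3) - _std_moment(y_star, 3)))

def g_kurtosis(y, y_star):
    """(f) Absolute difference in standardized kurtosis."""
    return float(abs(_std_moment(y, 4) - _std_moment(y_star, 4)))

def g_skew_times_kurt(y, y_star):
    """Multiplicative combination of the skewness and kurtosis measures."""
    return g_skewness(y, y_star) * g_kurtosis(y, y_star)

# Illustrative data: a straight line versus a parabola on the same x values.
x_demo = np.arange(5.0)
y_line = x_demo
y_parabola = x_demo ** 2

print(g_ordered(x_demo, y_parabola, x_demo, y_line))
print(g_ks(y_parabola, y_line))
print(g_quadratic(x_demo, y_parabola, x_demo, y_line))
print(g_skew_times_kurt(y_parabola, y_line))
```

Each function returns zero when the candidate equals the target, so any of them can serve as the objective $g(X, X^*)$ to be maximized subject to the summary-statistics constraint.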
248 The American Statistician, August 2007, Vol. 61, No. 3
© American Statistical Association DOI: 10.1198/000313007X220057