
The Australian Economic Review, vol. 38, no. 3, pp. 333–50

For the Student

Descriptive Methods for Cross-Section Data

Joe Hirschberg, Lan Lu and Jenny Lye*


Department of Economics, The University of Melbourne; Department of the
Treasury, Canberra; and Department of Economics, The University of Melbourne,
respectively

1. Introduction

Econometric analysis often results in highly specialised quantitative results.1 However, as the old mantra of the computer programmer states, GIGO ('garbage in, garbage out'), and this applies here with respect to the nature of the data used in the analysis. Thus it is crucial to examine the data prior to the application of computational analysis. In addition, to paraphrase an old adage, 'one graph is worth a thousand numbers': the human eye can analyse data better in a visual format than in a tabular form. To serve both of these ends it is important to understand how graphic methods can be used to check and carefully examine data before performing econometric analysis.

Computer-driven pen plotters were one of the first additional output peripherals attached to computers back in the era of the mainframe systems. Prior to the availability of computers the majority of graphic representations of data were done by hand. Today the wide availability of graphic screens, laser printers and such software as Microsoft Excel has made the production of graphic representations of data almost automatic. The purpose of this article is to examine which graphs should be used for what purpose and to demonstrate how some of the most popular software packages can produce graphic images that will aid in the interpretation of statistical analysis.

To illustrate why it is so crucial to examine the data we show two simple examples where we wish to run a regression. First, consider the regression results presented in Table 1. Based on the values in this table it looks like a very good model, with a highly significant t-statistic associated with the explanatory variable and a high R² indicating that it explains a large proportion of the variation in y.

However, in Figure 1 a scatter plot of the data reveals that the data fall clearly into two

Table 1 Regression Results for Example 1
(dependent variable: y; sample: 1 36)

Variable   Coefficient   Standard error   t-statistic   P-value
c          1.11          0.60             1.85          0.07
x          1.29          0.08             17.08         0.00

R-squared                      0.89
Standard error of regression   2.07
F-statistic                    291.58
Probability (F-statistic)      0.00

* We would like to thank David Moreton for helpful suggestions and the Quality of Teaching Committee of the Faculty of
Economics and Commerce at the University of Melbourne for financial support.

© 2005 The University of Melbourne, Melbourne Institute of Applied Economic and Social Research
Published by Blackwell Publishing Asia Pty Ltd
334 The Australian Economic Review September 2005

subgroups indicating the presence of a 'dumbbell' plot as shown in Figure 2. Note that if the data in each subgroup are fit as separate models, the R² values obtained from the separate regressions are very close to 0.

Figure 1 A Scatter Plot of the Data for Example 1
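
The dumbbell effect can be reproduced with simulated data. The following is a minimal sketch in Python with NumPy, not the data behind Table 1: the cluster locations, spreads, group size and random seed are hypothetical values chosen for illustration, and `r_squared` is a helper introduced here.

```python
import numpy as np


def r_squared(x, y):
    """R-squared from a simple OLS regression of y on x with an intercept."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    return 1.0 - resid.var() / y.var()


rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
n_per_group = 50                 # hypothetical group size

# Two clusters; inside each cluster x and y are generated independently.
x_low = rng.normal(2.0, 0.5, n_per_group)
y_low = rng.normal(3.0, 0.5, n_per_group)
x_high = rng.normal(12.0, 0.5, n_per_group)
y_high = rng.normal(16.0, 0.5, n_per_group)

x_all = np.concatenate([x_low, x_high])
y_all = np.concatenate([y_low, y_high])

r2_pooled = r_squared(x_all, y_all)   # driven by the gap between clusters
r2_low = r_squared(x_low, y_low)      # no relationship within the cluster
r2_high = r_squared(x_high, y_high)

print(f"pooled R2 = {r2_pooled:.2f}")
print(f"within-group R2 = {r2_low:.2f} and {r2_high:.2f}")
```

The pooled regression is fitted almost entirely to the distance between the two cluster centres, which is why its fit statistics say little about the relationship inside either group.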
The regression results for the second example are presented in Table 2. These results appear to show that there is not a significant relationship between y and x and the regression has a very low R². With results such as these it is often the case that one would assume that these variables have no relationship to one another.
Figure 2 The Implied Dumbbell for the Data for Example 1

However, a scatter plot of x and y in Figure 3 indicates that there is a unique well-defined relationship between them. To estimate the parameters of this relationship would require a model that does not assume linearity.
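
A linear regression can miss an exact relationship entirely. The sketch below, in Python with NumPy, uses a quadratic on a symmetric grid purely as an assumed illustration; it does not reproduce the shape shown in Figure 3 or the data behind Table 2.

```python
import numpy as np

# Hypothetical data: y is an exact, non-linear function of x.
x_grid = np.linspace(-3.0, 3.0, 101)
y_exact = x_grid ** 2

# Linear summary measures see no relationship at all.
r_linear = np.corrcoef(x_grid, y_exact)[0, 1]
slope, intercept = np.polyfit(x_grid, y_exact, 1)

# A model that does not assume linearity recovers the relationship.
quad_coeffs = np.polyfit(x_grid, y_exact, 2)   # highest power first

print(f"correlation = {r_linear:.3f}, linear slope = {slope:.3f}")
print("quadratic coefficients:", np.round(quad_coeffs, 3))
```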
Figure 3 A Scatter Plot of the Data for Example 2

Many other authors have found examples similar to those presented here. Two worth mentioning are by Anscombe (1973) and Leamer (1994). In Anscombe (1973) four datasets each consisting of 11 data points on y and x are examined. For each of the four datasets the same results from an ordinary least squares regression, including the estimated coefficients and R², are obtained. However, when the data are plotted for each of the four datasets, differences between the datasets such as outliers and non-linearities can be seen. Leamer (1994, p. xiii) presents an example in which a consumption function is estimated using hypothetical expenditure and income data. The results of the ordinary least squares regression look very good with a high R² and a coefficient of the right sign. However, a scatter plot of the data reveals the data spell out the letters H E L P.

Table 2 Regression Results for Example 2
(dependent variable: y; sample: 1 100)

Variable   Coefficient   Standard error   t-statistic   P-value
c          0.07          0.85             0.09          0.93
x          –0.07         0.12             –0.14         0.89

R-squared                      0.00
Standard error of regression   8.51
F-statistic                    0.02
Probability (F-statistic)      0.89


The inspection of the data prior to running a regression is strongly encouraged in numerous econometric textbooks. For example, see Kennedy (2003, p. 408), Koop (2000, pp. 12–20), Pindyck and Rubinfeld (1998, p. 45), Greene (2003, pp. 878–81) and Griffiths, Carter Hill and Judge (1993, p. 22). However, in many cases the discussion is limited to the construction of histograms and scatter plots. The aim of this article is to provide a detailed student guide to useful methods of summarising and examining raw cross-section data before attempting to apply sophisticated econometric techniques.2

To illustrate the techniques presented we use detailed data for Dominick's Finer Foods supermarkets. A research project at the University of Chicago has made available a set of detailed data for the Dominick's supermarkets located in metropolitan Chicago.3 We use these data to create average daily sales by department for a set of 84 stores for which the data are relatively complete: at least three years of data are used for each store. In addition to the sales for each store, the data also contain information obtained from the US Census to describe the population in the neighbourhood in which the store is located and some additional marketing information concerning the nature of the customers in the store.

Various univariate methods for investigating the individual series are discussed first. In Section 2 we define a number of descriptive statistics that numerically summarise the properties of a single variable. In Section 3 we demonstrate how graphic displays for the nature of the distribution of the series can be generated. The first of these distribution plots is the box plot. This is a very useful summary in graphic form of the overall shape of the distribution of the data. The next method is the histogram which provides more detail. Here we introduce the method by which we can compare the series distribution to the normal distribution using a simple overlay. Following the histogram we introduce the kernel density estimate or smoothed histogram. This provides a more accurate estimate for the distribution of a continuous variable. The estimation of the density for the continuous random variable is often not sufficient. Because the normal distribution acts as a model on which many statistical tests rely, it is important to determine how the distribution for the series compares to a known distribution such as the normal. Q-Q plots and P-P plots provide a means of comparing series distributions to a standard distribution while also identifying the observations that do not conform, and these techniques are presented in Section 4.

Sections 5 and 6 discuss graphical techniques for analysing multivariate data. The relationship between two series is the focus of Section 5 and the use of the correlation coefficient and the scatter plot is examined. Whereas the correlation coefficient only indicates the presence of a linear association, the scatter plot can highlight the presence of a non-linear relationship between the variables or the presence of outliers. Section 6 discusses two graphical techniques that can be used to look at a group of variables at a time. Side by side box plots simultaneously display the distribution of a group of variables so that distributions and their properties can be compared between the variables. The matrix scatter plot is a graphic analogue to the correlation matrix and is a useful method of tracking particular observations across a group of variables.

An important use of graphs is to find patterns in the data. Therefore graphs need to be clear and well presented and often this requires changing the default options available in graphic routines in computer programs. In Section 7 we illustrate the steps involved in changing the visual impact of a graph obtained using the default options in Microsoft Excel. Section 8 discusses statistical packages with particular emphasis on widely used software packages for graphics in econometrics. Section 9 concludes the article.

2. Descriptive Statistics

A number of descriptive statistics are used to summarise the properties of a single variable x with n observations in terms of its location, dispersion and shape. The most common of which

is the mean. The mean is a statistical term for the average and is defined as:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad (1)$$

where the x_i are the observations and n is the number of observations. Another measure of central tendency is the median which, with the observations ordered from smallest to largest, is defined if n is an even number as:

$$\text{median} = \frac{x_{[n/2]} + x_{[(n/2)+1]}}{2} \qquad (2)$$

and if n is odd is defined as:

$$\text{median} = x_{[(n+1)/2]} \qquad (3)$$

For symmetric distributions the mean and median will be equal to each other. The dispersion is usually measured by the standard deviation s (or variance s²) which is defined as:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \qquad (4)$$

The shape of the distribution of observations is also often of interest. Measures usually reported include skewness:

$$\text{skewness} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{s^3 (n - 1)} \qquad (5)$$

and kurtosis:

$$\text{kurtosis} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{s^4 (n - 1)} \qquad (6)$$

For a symmetric distribution the benchmark for skewness is 0 and for a normally distributed random variable kurtosis is equal to 3. If the kurtosis exceeds 3, the distribution is peaked relative to the normal; and if the kurtosis is less than 3, the distribution is flat relative to the normal.

The Jarque-Bera test is a widely used statistic for testing whether the series is normally distributed. The test statistic compares the skewness and kurtosis of the series with the values that would be implied if the data were generated by a normal distribution and is computed as:

$$JB = n \left[ \frac{\text{skewness}^2}{6} + \frac{(\text{kurtosis} - 3)^2}{24} \right] \qquad (7)$$

Under the null hypothesis of a normal distribution the Jarque-Bera statistic is distributed in large samples as χ² with 2 degrees of freedom.

A number of other statistics can also be computed, such as the minimum, the maximum, the percentiles and the range of the series. Table 3 lists a number of descriptive statistics for the average daily Produce Sales for a set of supermarkets from the Dominick's database. From Table 3 the values of the mean and median are very similar. The Jarque-Bera test statistic is calculated as 38.03. The critical value of the χ² distribution with 2 degrees of freedom at the 0.01 level of significance is 9.21, indicating a strong rejection of the null hypothesis that Produce Sales are normally distributed. However, this is not very informative and we have produced a series of plots to investigate the implications of this test further.

Table 3 Descriptive Statistics for Produce Sales

Measure              Statistic
Mean                 5573.97
Median               5575.83
Variance             3321618.56
Standard deviation   1822.531
Minimum              2083.79
Maximum              12660.69
Skewness             1.070
Kurtosis             5.508
Jarque-Bera          38.03

3. Distributions

There are a number of useful methods for describing the distribution of a single variable. In this section we summarise the use of the box plot which graphically presents the quartiles of the data, the histogram which attempts to

provide a rough shape to the density form, and the density plot which is a smoothed version of the histogram.

3.1 Box Plots

The box plot provides a summary display of the distribution of the data (Chambers et al. 1983) by graphically showing quartiles of the data. It shows the centre of the distribution (the median or 50th percentile (Q_0.50)), the spread of the bulk of the data (the length of the box which is the distance from the 25th percentile (Q_0.25) to the 75th percentile (Q_0.75)), and how stretched the tails of the distribution are (the length of the lines relative to the box). Additional values that lie beyond the limits are plotted as well. In addition, some box plot programs also have options to locate the mean of the data and the 95 per cent confidence bounds of the mean.

In Figure 4 a simple box plot is illustrated for Produce Sales. The top and bottom of the rectangle represent the upper and lower quartiles of the data and the centre line within the rectangle is the median which is equal to $5576. The upper and the lower quartiles are found by ordering the data and finding those values that limit the upper 25 per cent and the lowest 25 per cent. Thus the interquartile range (IQR) is the range between the upper and lower quartiles, IQR = Q_0.75 – Q_0.25; for Produce Sales, IQR = $2529.

Figure 4 Box Plots of Produce Sales [N = 84]

The lines that extend from the ends of the box (sometimes referred to as whiskers) go to what is referred to as the 'upper adjacent value' and the 'lower adjacent value'. The upper adjacent value (x_ua) is the largest observation [max(x)] if it is less than or equal to the upper quartile plus 1.5 times the interquartile range, and otherwise that limit: x_ua = min[Q_0.75 + 1.5 IQR, max(x)]. For Produce Sales, x_ua = $8359. The lower adjacent value (x_la) is the smallest observation [min(x)] if it is greater than or equal to the lower quartile minus 1.5 times the IQR, and otherwise that limit: x_la = max[Q_0.25 – 1.5 IQR, min(x)]. Thus for Produce Sales, it is the smallest value, $2084.

Any data points that fall outside the range of the two adjacent values are referred to as outside or outlier values and are plotted as individual points. In the case of Produce Sales, there are two outside values, observations 62 and 84, which are respectively the sales value of store 109, $11895, and the sales value of store 137, $12661. If there are outside values it may be necessary to go back to the source of the data to verify that these values are valid. In this case these two extreme values are for two large stores which both have more than $94000 in total average daily sales while the mean of total average daily sales is $56046.

The width of the box plot is arbitrary which means that multiple box plots could be placed side by side to allow for comparisons between groups of data. We will examine this use of the box plot further in Subsection 6.1 when we discuss multivariate plotting techniques.

3.2 Histograms

Another way to summarise a data distribution is the histogram or density plot for data that take discrete values. The range of the data is partitioned into several intervals of equal length, the number of points is counted in each interval and plotted as bar lengths in a histogram (Chambers et al. 1983). The vertical axis shows the proportion of the observations in each bar and the relative heights of the bars represent the relative density of cases in the intervals. For Produce Sales, there is one store that sells around $2000 and two stores that sell more than $9000, as shown in Figure 5. The Produce Sales of most stores are between $3000 and $9000.
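
The box plot quantities of Subsection 3.1 and the interval proportions of the histogram can be sketched in Python with NumPy as follows. This is an illustration under stated assumptions rather than the procedure used to draw Figures 4 and 5: the sample is made up, the function names are introduced here, the adjacent values follow the min/max formulas given in Subsection 3.1, and the quartiles use NumPy's default linear interpolation, so other packages may return slightly different values.

```python
import numpy as np


def box_plot_summary(x):
    """Quartiles, IQR, adjacent values and outside values for one variable."""
    x = np.asarray(x, dtype=float)
    q25, q50, q75 = np.percentile(x, [25, 50, 75])  # linear interpolation
    iqr = q75 - q25
    # Adjacent values as given by the min/max formulas in Subsection 3.1.
    upper_adjacent = min(q75 + 1.5 * iqr, x.max())
    lower_adjacent = max(q25 - 1.5 * iqr, x.min())
    outside = x[(x > upper_adjacent) | (x < lower_adjacent)]
    return {
        "q25": q25, "median": q50, "q75": q75, "iqr": iqr,
        "upper_adjacent": upper_adjacent,
        "lower_adjacent": lower_adjacent,
        "outside": outside,
    }


def histogram_proportions(x, n_intervals):
    """Proportion of observations in each of n_intervals equal-length intervals."""
    counts, edges = np.histogram(x, bins=n_intervals)
    return counts / counts.sum(), edges


# Made-up sample with one extreme value (not the Dominick's data).
sample = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

summary = box_plot_summary(sample)
proportions, edges = histogram_proportions(sample, 6)

print("IQR:", summary["iqr"])
print("adjacent values:", summary["lower_adjacent"], summary["upper_adjacent"])
print("outside values:", summary["outside"])
print("proportions per interval:", proportions)
```

Changing `n_intervals` in the last call is a quick way to see how the visual impression of a histogram depends on the number of intervals.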

Figure 5 Histogram of Produce Sales

Figure 6 Histogram of Produce Sales with Six Intervals

The histogram is widely used because of its simplicity. However, histograms can give very different visual impressions due to the arbitrary choice of the number and placement of intervals. For the same sales data, the histogram in Figure 5 has 22 intervals and the histogram in Figure 6 has six intervals. The more intervals, the more detail the histogram shows. However, one should always be aware that these details may not be features of the data but merely artefacts generated by the placement of the intervals. This should be taken into account when choosing the number of intervals.

The histogram overlaid with a normal curve can help detect outliers (Henry 1995). The normal curve is the density of a normally distributed random variable with the same mean and standard deviation as estimated for the data (mean = 5574 and standard deviation = 1822.53) as defined by the formula for the estimated normal density as given below:

$$\hat{f}_N(z) = \frac{1}{\sqrt{2 \pi s^2}} \exp\left[ \frac{-(z - \bar{x})^2}{2 s^2} \right] \qquad (8)$$

where x̄ = 5574, s² = (1822.53)², z is the value at which the density is evaluated and x is the data in the sample. In Section 4 below we discuss more formal methods for making this comparison to other densities.

It is not uncommon that samples of real data include a small fraction of observations that lie outside the typical range of the data. The histogram illustrated in Figure 5 shows the two outlying values for stores 109 and 137 as identified from the box plot in Figure 4. This demonstrates that the box plot may be more useful than a histogram in identifying the location of these outliers.

3.3 Kernel Density Estimation

To estimate a density is to construct a probability model of the stochastic process that generated the observations from a sample of continuous data. One method already mentioned above involves assuming the data are distributed according to a well-defined density function, such as the normal, and the parameters of the density function are estimated by statistics calculated from the data. By plugging these values into an expression such as (8) we have an exact formula for the density under the assumption that the data are indeed distributed according to the particular distribution. Unfortunately, as we see from Figures 5 and 6, the histogram of the data may not indicate that a distribution such as the normal is appropriate. In Section 4 we discuss formal methods for comparing the distribution of a sample to those implied by various assumed distributions.

The histogram is one of the most common methods of density estimation that does not rely on the assumption of a particular density function. For data with only a finite set of discrete values the histogram is the primary method. However, as outlined above, when applying histograms to continuous data a number

of problems arise. First, histograms, unlike the data-generating process, are not continuous; they assume that all values within an interval have the same probability of occurring. Second, it is quite possible to obtain quite different looking histograms for the same data by changing the length of the interval used and by changing their locations.

One commonly employed alternative approach is to use the kernel density estimator. The kernel is a weighting function which is defined for each point at which the density is evaluated. The kernel density estimator is defined as:

$$\hat{f}_K(z) = \frac{1}{nh} \sum_{i=1}^{n} K\left( \frac{z - x_i}{h} \right) \qquad (9)$$

where x_1 … x_n are the observations in the sample, f̂_K(z) denotes the estimated density function, z is the value at which the density is being evaluated, h is the bandwidth (also known as the smoothing parameter or window width) and K(•) is the kernel function. z is usually a value within the span of the values of the sample but could be outside the range of the values of the sample; recall that in (8) we defined the normal density which is defined over all values from –∞ to ∞. The bandwidth (h) is similar to the width of the intervals in the histogram and determines the smoothness of the density estimate. The kernel function is a weighting function and is generally chosen so that less weight is placed on observations that are further from z than is placed on those that are near. That is, the weight, which for a symmetric kernel function depends on the distance |z – x_i| (the absolute value of the difference between z and x_i), takes larger values the closer x_i is to z and smaller values as they diverge from each other. When we use (8), the normal distribution, to approximate the density we downweight the density for those observations that are further from the mean of the sample. We also divide the squared distance from the mean by twice the variance, so a larger variance makes the estimated density flatter in shape in much the same way that larger values of the bandwidth imply a smoother density.

To implement the kernel estimator a smoothness parameter h and kernel function K(•) must be selected. Commonly the choice of K(•) will be a function that defines a symmetric unimodal probability density function and hence it will follow that f̂_K(z) will itself be a probability density function and will inherit all the continuity and differentiability properties of the kernel K(•).

One can think of the histogram as a form of kernel density estimate where the density is only evaluated at the midpoints of the intervals and the bandwidth is half the interval size. We define the kernel as K(u_i) = hI(|u_i| ≤ 1), where u_i = (1/h)(x_i – z) and I(•) is an indicator function that takes a value of 1 if its argument is true and 0 otherwise. This means that whenever an observation satisfies the inequality |x_i – z| ≤ h it is included in the interval with a midpoint at z. The estimated density for the histogram defined in the form of a kernel density estimator is given as:

$$\hat{f}_H(z) = \frac{1}{nh} \sum_{i=1}^{n} h\, I(|u_i| \le 1) = \frac{1}{n} \sum_{i=1}^{n} I(|x_i - z| \le h) \qquad (10)$$

Figure 7 The Epanechnikov Kernel

A widely employed kernel function is the Epanechnikov function (Epanechnikov 1969) which has the form:

$$K(u_i) = \frac{3}{4} (1 - u_i^2)\, I(|u_i| \le 1) \qquad (11)$$

where again we use u_i = (1/h)(x_i – z). Note that only those observations which satisfy the

criterion that |u_i| ≤ 1 are included in the computation of the density estimate for z. A graph of the Epanechnikov function is given in Figure 7. In Figure 7, as z → x_i, then u_i → 0 and K(u_i) reaches its maximum value.

There are numerous other kernels that have been suggested; for example, the biweight kernel:

$$K(u_i) = \frac{15}{16} (1 - u_i^2)^2\, I(|u_i| \le 1) \qquad (12)$$

and the uniform kernel:

$$K(u_i) = \frac{1}{2}\, I(|u_i| \le 1) \qquad (13)$$

We can also base the kernel density estimate on the normal density functional form by using the normal kernel function defined as:

$$K(u) = \frac{1}{\sqrt{2\pi}} \exp(-0.5 u^2) \qquad (14)$$

Although the normal density has a well-known form, it has the potential disadvantage that outliers in the sample may influence the density estimate for a certain value even when they are a great distance away. It also has the property that f̂_K(z) ≠ 0 for all z ∈ ℝ, that is, the estimated density is never equal to zero no matter where it is evaluated on the real line.

Silverman's (1986) monograph is an important reference for a more detailed discussion of these functions. Silverman comments that on the basis of efficiency measures there is very little to choose between the various kernels and it is quite appropriate to base the choice of kernel on which is easiest to compute.

Bandwidth (h) selection is crucial in density estimation as it controls the smoothness of the density estimate. The larger the value of h the smoother f̂_K(z). There are various ways of choosing h. One possibility is to plot out several values of f̂_K(z) corresponding to different values of h and then choose the estimate which meets with the prior expectation about the density. As with the choice of interval size/number with histograms, the choice of the bandwidth should be made so that the complexity of the data is not masked by too large a value for h, without making the artefacts in the particular sample into prominent features. Silverman (1986) provides a general formula which he claims works well for a variety of cases based on the interquartile range (IQR), the estimated standard deviation (s), and the sample size (n):

$$h = 0.9\, n^{-1/5} \min(s, \text{IQR}/1.34) \qquad (15)$$

In Figure 8 a histogram of Produce Sales is presented. The width of the intervals in this case is $1000. Given the location and size of these bins we have an indication of a bimodal probability function and a small probability of quite large values.

In Figure 9 the kernel density estimate is plotted by the Eviews computer program using the Epanechnikov kernel and a bandwidth (h) of 1000. A comparison shows that the kernel density estimate is much smoother than the histogram although they both show similar

Figure 8 Histogram of Produce Sales

Figure 9 Kernel Density Estimate of Produce Sales [panel title: Kernel Density (Epanechnikov, h = 1000.0); horizontal axis: Produce Sales (Fruit and Veg)]


characteristics. However, the probability of generating a value in the intermodal range appears greater from the kernel density estimate than from the histogram. When working with estimated density functions one must remember that the vertical axis, unlike in the histogram, does not measure probability. Because these are density functions for continuous data the probability of values can only be determined from the area under these functions for a particular span of values. Thus the units of measure for the variable of interest will influence the values on the vertical axis.

Figure 10 A Comparison of Kernel Density Estimates with Different Bandwidths [panel title: Kernel Density (Epanechnikov); horizontal axis: Produce Sales (Fruit and Veg)]

Figure 10 plots several kernel density estimates for Produce Sales using different bandwidths including h = 1488.1 which is the automatic value generated by using (15). Note that as the bandwidth gets larger the density estimate becomes smoother and in particular the bimodality and the hump in the tail features disappear. Thus it is necessary to be cautious when smoothing so that important information is not removed from the resulting density estimate.

Another issue in the estimation of densities is the location (z's) at which the estimated density is evaluated. Usually the points chosen for evaluation are evenly spaced values within the range of the sample. Note that if we used a kernel function that is not constrained to be zero by the indicator function, as in the case of the normal distribution, it is possible for the kernel density estimator to indicate a probability of having observations greater or less than the limits of the sample. This can also be the case when the random variable has a limiting value such as a price that can only be positive. In general, the more points the estimated density is evaluated at, the smoother is the estimate. In Figure 11 the line with dot points represents a kernel density estimate constructed by evaluating the density at only 10 points, whereas the continuous line represents the kernel density estimate when the density is constructed using 100 points.

Figure 11 The Sensitivity of Kernel Density Estimates to Variations in the Number of Points Used to Evaluate Them [panel title: Kernel Density (Epanechnikov, h = 1488.1); horizontal axis: Produce Sales (Fruit and Veg)]

The estimation of density functions for continuous data has been facilitated by the advent of computer programs such as Eviews. Aside from the graphical uses of these estimates one could also use these programs to obtain the values of these densities f̂_K(z) at each value of z.

4. Assessing Distributional Assumptions

As noted above an alternative to allowing the data of the sample to dictate the density of the data is to assume a well-defined distribution for the data-generating process. As shown in (8) the estimated mean and standard deviation of the sample could be used along with the assumption of the normal density function to produce an estimated density for a particular sample. In this section we discuss various graphical methods to compare the distribution of a sample to alternative distributions. They can be used to compare the distribution of a particular sample to another sample or they can be used to compare the distribution of a sample to a specific distribution such as the normal.

4.1 Q-Q Plots

Q-Q plots are often called theoretical quantile-quantile plots or probability plots (Chambers et al. 1983). A particular quantile, for example the 0.85 quantile, of a set of data is defined to be a number on the scale of the data that divides the data into two groups so that 85 per cent are below and 15 per cent are above that number. A convenient method of finding the quantiles or the empirical cumulative distribution is to take the sample of observations x(1) … x(n) in sorted form: that is, take the observations x1 … xn and order them from smallest to largest, obtaining the sorted data x(1) … x(n). The quantiles Q(p_i) are then defined by:

$$p_i = \frac{1}{n}\left(i - \frac{1}{2}\right), \qquad i = 1, \ldots, n \qquad (16)$$

This means that if there is a sample of size n then the i-th value of the sample, when sorted from smallest to largest, is the quantile Q(p_i) of the data, so that Q(p) is approximately the (np)-th sorted value. For example, if n = 120 then Q(0.1) is approximately x(12) (the 12th observation), if n = 500 then Q(0.05) is approximately x(25), and so forth. Thus the quantiles can be determined simply by sorting the data.

The Q-Q plot is constructed by plotting the quantiles of the sample data against the corresponding quantiles of another distribution, which can be defined by a particular distribution or another dataset. It is generally used to determine whether the distribution of data matches a given distribution, which can be any of a number of test distributions (for example, normal, lognormal, Student's t and uniform). Thus the plot has on one axis the empirical cumulative probability based on the sorted value of the sample and on the other axis the predicted cumulative probability if the data were generated by the test distribution based on the parameters of the distribution as estimated from the sample data. If the data are generated by the test distribution, the points plotted should lie along a 45-degree line from the origin. The deviations from the line indicate where the empirical distribution does not match the test distribution. If they are above the line then the empirical distribution has a greater mass here than the test distribution would predict. If they are below the line then the test distribution predicts that these values would have a greater probability of being observed.

From Figure 12 one can find those observations that deviate from the diagonal line in the Q-Q plot and identify those cases where the data do not conform to the test distribution. In this case we have a Q-Q plot of Produce Sales (generated by the SPSS program) compared to the test distribution defined as the normal. The two points identified in Figure 4 are identified here as observations that are not consistent with the assumption that the sample has been generated by a normal distribution. Apart from these observations the rest of the data are not that different from what one would expect from normally distributed data, except that at the left tail the normal would generate more data in the lower end and the right tail of the data appears to have more mass than the equivalent normally distributed sample.

Figure 12 Q-Q Plot of Produce Sales

4.2 P-P Plots

P-P plots are very similar to Q-Q plots; the only difference is that the values are translated into the corresponding cumulative probabilities instead of listed as the values of the random variable. That is, for the ordered sample of observations from smallest to largest, x(1) … x(n), a variable's cumulative probability is defined as:

$$\text{Prob}[x \le x_{(i)}] = \frac{i}{n} \qquad (17)$$
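The plotting positions in equations (16) and (17) are easy to compute directly. The following is a minimal Python sketch, not the SPSS procedure used for the figures in this article: it assumes a normal test distribution whose mean and standard deviation are estimated from the sample, it uses only the standard library, and the function names `qq_points` and `pp_points` as well as the sample values are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def qq_points(sample):
    """Return (observed value, expected normal value) pairs for a Q-Q plot."""
    xs = sorted(sample)  # x(1) <= ... <= x(n)
    n = len(xs)
    # Assumption: the test distribution is a normal fitted to the sample.
    fitted = NormalDist(mean(xs), stdev(xs))
    # Equation (16): p_i = (i - 1/2) / n for i = 1, ..., n
    probs = [(i - 0.5) / n for i in range(1, n + 1)]
    return [(x, fitted.inv_cdf(p)) for x, p in zip(xs, probs)]


def pp_points(sample):
    """Return (observed, expected) cumulative probabilities for a P-P plot."""
    xs = sorted(sample)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    # Equation (17): the observed cumulative probability of x(i) is i / n
    return [(i / n, fitted.cdf(x)) for i, x in enumerate(xs, start=1)]


# Illustrative values only (not the supermarket data).
example = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 9.5]
for observed, expected in qq_points(example):
    print(f"{observed:6.2f} {expected:6.2f}")
```

Plotting either list of pairs against the 45-degree line gives the corresponding diagnostic plot; points far from the line flag observations that the fitted normal does not describe well.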

The cumulative probabilities defined in equation (17) are then plotted against the cumulative probability of any of a number of test distributions. If the selected variable matches the given distribution, all the points will lie on the 45-degree line out of the origin. Again as in the Q-Q plot, departures from this line give us information about how the distribution of the selected variable differs from the test distribution.

Figure 13 is the corresponding P-P plot to the Q-Q plot in Figure 12 (generated by SPSS). Note that the rescaling of the data means that the two outlier points are now represented by the two observations that are predicted by the normal to have a cumulative probability of 1. In the reverse interpretation from the Q-Q plot, the points plotted above the line indicate that the normal distribution assigns them a higher cumulative probability than is actually observed, and the points below the line have a lower predicted probability than is actually observed.

Figure 13 P-P Plot of Produce Sales (Normal P-P Plot of Produce Sales (Fruit and Veg); horizontal axis: Observed Cum Prob; vertical axis: Expected Cum Prob)

5. Analysing Two Dimensional Data

Box plots, histograms, kernel density estimators, Q-Q plots and P-P plots are all useful techniques for describing the behaviour of a single variable. In this section, methods used to examine the relationships between pairs of variables will be discussed. We look at correlation coefficients and their graphic counterparts, scatter plots (also known as XY plots).

5.1 Correlation Coefficient

The correlation coefficient is the most commonly employed statistic used to quantify the relationship between two variables x and y. It is defined as:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} \qquad (18)$$

where $\bar{x}$ and $\bar{y}$ are the sample means of the n observations on x and y. The value of r lies between –1 and 1 and measures the linear association between x and y. If r = 0 there is no linear association; if r = 1 there is a perfect positive linear relationship; and if r = –1 there is a perfect negative linear relationship. The correlation coefficient which corresponds to the relationship between Produce Sales and the Total Average Daily Sales is found to be 0.905, which indicates a strong relationship. It can be shown that simple bivariate regression is closely related to the concept of correlation. The R² statistic that indicates the degree of fit for the regression is equal to the square of the correlation between the dependent and independent variables.

5.2 Scatter Plots

A simple scatter plot is a graph of one of the series on the horizontal axis against the other series on the vertical axis. It is a useful method for analysing the relationship between two variables. For example, it may highlight a nonlinear relationship between two variables or it may highlight the presence of outliers. An extension of the simple scatter plot is to also include in the graph the fitted regression line resulting from the ordinary least squares estimation of the two variables.

In Figure 14 the scatter plot of Produce Sales against Total Average Daily Sales is illustrated. From this figure it can be seen that there is a positive relationship between Produce Sales and Total Average Daily Sales. Also in this figure a number of observations have been highlighted which correspond to observations that

may potentially be outliers. In particular, two of these observations are observations that have been highlighted in the Q-Q and box plots for Produce Sales. Thus the two stores corresponding to these observations have high Produce Sales and high Total Average Daily Sales. This may indicate that for these two stores their Produce Sales are in fact a large component of their total sales. Note that the type of relationship shown in Figure 1 is not present in this case; thus we can conclude that although the correlation coefficient is high it is not due to the presence of a 'dumbbell' effect.

Figure 14 Scatter Plot of Produce Sales and Total Average Daily Sales

To demonstrate why the scatter plots provide additional information we examine two more scatter plots in Figure 15. In both of these cases the correlation coefficient between the x and y variables is around 0.7, yet the scatter plots show quite different relationships between the variables. Thus the relationship between them is not as well established by the correlation coefficient as by the scatter plot.

Figure 15 Two Scatter Plots with Correlation Coefficients Equal to 0.70

In the introduction of this article we presented two examples of simple regressions where the scatter plot told the story that the regression estimates did not. In the first example we found that the correlation was quite high but that this was due to a 'dumbbell'-type relationship. In the second example we determined from the regression that the correlation was very low but, just as in the second plot in Figure 15, the plot indicated a prominent relationship between the variables.

6. Multivariate Techniques

The examination of the relationships among a group of variables is frequently the objective of our analysis. Even if this is the case it is still useful to begin by looking at each variable individually, paying attention to such things as skewness, kurtosis, outliers, distributional assumptions and so on. The techniques described above in Sections 3 and 4 are useful for this purpose. However, there are other graphical techniques that can be used to look at a group of variables at a time. In this section two such techniques will be described. The first is side by side box plots and the second is the graphic


analogue to the correlation matrix: the matrix scatter plot.

6.1 Side by Side Box Plots

Side by side box plots are a collection of box plots which display the distributions of a number of cases or variables in such a way that we can compare not only the measures of central tendency but the distributions of the variables. As we mentioned above, the width of the box plots is usually arbitrary so they can be scaled in such a way that we can place them next to each other as long as the ranges of the values are similar in magnitude so they can be plotted against a common vertical axis.

In Table 4 we have listed the descriptive statistics for a set of seven different sections of the supermarkets in the sample. From this table it is hard to determine, aside from the order of the means, how the sales in each department compare.

Table 4 Summary Statistics for the Daily Sales of Supermarket Departments

                                 Number   Minimum    Maximum      Mean   Standard deviation
Produce Sales (Fruit and Veg)        84   2083.79   12660.69   5573.97              1822.53
Meat Sales                           84   2809.67    8491.62   5364.44              1491.95
Dairy Sales                          84   1951.03    9012.87   4610.69              1374.85
Frozen Sales                         84   1526.82    6659.48   3511.63              1085.45
Deli Services                        84   1222.91    4316.47   2437.31               739.30
Health and Beauty Sales              84    396.42    4209.98   1910.75               962.47
Bakery Sales                         84    508.33    2540.69   1437.96               462.06

In Figure 16, the side by side box plots produced by Eviews are shown for a selection of different departments in our sample of supermarkets in which the boxes are ordered by the level of the median of sales values (note the stars show the location of the mean values). This figure allows us not only to compare the median values but also the interquartile distances for each department. For example, this figure shows that the highest sales in the Health and Beauty department are within the lower 25 per cent of the Produce Sales. Also, it shows that Meat Sales are on average lower than Produce Sales but that the interquartile distance is the same for the two departments.

Figure 16 Side by Side Box Plots

6.2 Matrix Scatter Plots

This graph consists of an array of scatter plots such that any adjacent pair of plots has an axis in common. It is the graphical equivalent to the correlation matrix. It is a useful method of tracking an interesting point or group of points across a series of variables. While any number of variables can be included in such a graph it should be remembered that it is often easier to interpret a graph if it can fit onto a single page.

In Table 5 a correlation matrix for four variables is given. These variables are Grocery Sales, Meat Sales, Produce Sales and Total Average Daily Sales.

Table 5 The Correlation Matrix Corresponding to the Different Variables Plotted in Figure 17

                                 Grocery Sales   Meat Sales   Produce Sales (Fruit and Veg)   Total Average Daily Sales
Grocery Sales                             1.00         0.91                            0.87                        0.97
Meat Sales                                0.91         1.00                            0.76                        0.90
Produce Sales (Fruit and Veg)             0.87         0.76                            1.00                        0.91
Total Average Daily Sales                 0.97         0.90                            0.91                        1.00

Figure 17 is a matrix scatter plot, produced using SPSS, of the four variables used in the construction of Table 5. In this graph the observations 62 and 84 are tracked in each scatter plot.

From Figure 17 the observations tracked are outliers associated with Produce Sales. It is interesting to note that from the matrix scatter plot it looks like these are outliers for Produce

Sales only. Note that if these stores were not included when computing the correlation coefficient between Meat Sales and Produce Sales one would expect this relationship to be stronger. It is also of interest to note that in terms of Total Average Daily Sales, Meat Sales and Grocery Sales have a similar relationship whereas it appears to be much steeper between Total Average Daily Sales and Produce Sales. A plot of this type can be included with a correlation matrix of these variables to determine whether the correlations found are due to outlier values or not.

Figure 17 A Matrix Scatter Plot of Four Variables

7. General Principles of Graphical Display

The value of using graphs in data analysis comes when they show important patterns in the data. For this purpose graphs need to be legible and well designed. Modern software allows a lot of variation in the design of the elements of the plots. It is possible to change the shape, size, font, colour, darkness, orientation and location along an axis of the graph to maximise its visual impact. In this section we present an example of a scatter plot using the default options in Excel and we demonstrate what steps are involved in improving the visual impact of this graph.

7.1 Default Scatter Plot

In Figure 18 we show the result of generating a scatter plot of Dairy Sales (y axis) against Grocery Sales (x axis) by using the default options in Microsoft Excel. There are visual problems with the presentation of this graph: confusing plot labels; the use of too much ink, from the additional lines across the graph and the background colour, that does not represent data information; a lot of blank space caused by forcing the origin of the x and y axes to be at zero; and the use of large symbols for the plotted points that obscure the location of the points that are underneath. In this section we demonstrate how one can change the default options to improve the visual presentation using Microsoft's Excel program.

Figure 18 Default Scatter Plot
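The four problems listed above are not specific to Excel, and the same remedies can be applied in code. The following Python sketch is an illustration only and is not part of the procedure used in this article: it assumes the third-party matplotlib library is installed, the helper names `axis_floor` and `draw_clean_scatter` are hypothetical, and the step sizes used to round the axis minimums are arbitrary choices.

```python
import math


def axis_floor(values, step):
    """Largest multiple of `step` that does not exceed the smallest value.

    Starting an axis here, rather than at zero, removes blank space
    without cutting off any plotted point.
    """
    return math.floor(min(values) / step) * step


def draw_clean_scatter(x, y, path, x_step=1000, y_step=500):
    """Save a scatter plot of y against x with the four fixes applied."""
    # Imported here so that axis_floor can be used without matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display is needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(5, 4))
    ax.scatter(x, y, s=10, color="black")     # small symbols reduce overplotting
    ax.set_xlabel("Grocery")                  # clear axis labels, no legend or title
    ax.set_ylabel("Dairy")
    ax.set_xlim(left=axis_floor(x, x_step))   # do not force the origin to zero
    ax.set_ylim(bottom=axis_floor(y, y_step))
    ax.grid(False)                            # no gridlines
    ax.set_facecolor("white")                 # no background shading
    fig.savefig(path)
    plt.close(fig)
```

A call such as `draw_clean_scatter(grocery, dairy, "scatter.png")`, where `grocery` and `dairy` are lists of equal length, writes the plot to a file. Rounding the axis minimum down rather than to the nearest round number is a deliberate choice here so that the smallest observation always remains visible.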

7.2 Fixing Axis Labels

From the Chart Options dialogue box Titles is chosen. In this set of options we remove Dairy in Chart title and label the x axis Grocery and the y axis Dairy. These steps are illustrated in Figure 19. In addition, the label Dairy on the right of Figure 18 in the legend box can be removed by highlighting it and deleting it.

Figure 19 Chart Options Dialogue Box

7.3 Maximising the Data Images in the Graph

Most of the ink on a graph should present data-related information (Tufte 1983). In Figure 18 there is background shading and gridlines parallel to the x axis. To remove the gridlines from the Chart Options dialogue box choose Gridlines and remove the check on Major gridlines under Value (Y) axis. To remove the background click on the right-hand button on the mouse and choose the Format Plot Area option and select white as the colour. The result of performing both of these actions is illustrated in Figure 20.

Figure 20 Maximising Share of Data

7.4 Reducing Non-Informative Space

In Figure 18 the minimum of both the x axis and y axis is 0, although the minimum value of Grocery is around 10000 and the minimum value for Dairy is around 2000. To change the default minimum values click on a number on the x axis and choose the Format Axis option. Select scale and type the value 10000 as the minimum for the x axis. Similarly, to change the minimum value of the y axis, click on a number on the y axis and choose the Format Axis option. Select scale and type the value 2000 as the minimum for the y axis.

7.5 Reducing the Size of the Plotted Points

The last step is to click on the plotted points to obtain the Format Data Series option. Within this option by selecting Pattern the size of the points can be made smaller. In Figure 21 the final scatter plot is illustrated.

Figure 21 The Final Scatter Plot

It is important to remember that there are many statistical packages that provide a graphics package; however, it may be necessary to change the default options associated with these packages to obtain a graph that maximises the visual presentation. In some software it is possible to save the options so that all the plots in a series have a similar design.

8. Statistical Packages

Statistical packages have been available since the mainframe computers became widely


available for data analysis in the 1960s. The earliest programs included rudimentary graphics routines that produced what are often referred to as 'printer plots', and line graphs were only available if one had access to purpose-built plotters. These printer plots are still available in a number of programs and can provide a convenient method for scanning large amounts of data in that they can be produced very efficiently and scaled with a single selection command. With the widespread availability of graphics-capable computers (an offshoot of the demand for the capability of computers to play sophisticated computer games), more and more statistics packages have been written with graphics plotting software. Microsoft Excel is a widespread program for the generation of plots. Two widely used software packages for graphics in econometrics are SPSS and Eviews (here we refer to version 11.5 and version 5.0 respectively). Both of these programs employ the point-and-click menu-driven editing of the plot characteristics of the most widely used PC software such as
Microsoft Excel. Below we have listed the capabilities of Eviews and SPSS.

Eviews and SPSS have graphics editors that allow changes to almost all aspects of the plot. Once the initial plot has been created the plot can be brought into the graphics editor. Both Eviews and SPSS allow you to modify the font, colour and scale of the axis; however, SPSS allows you to identify particular observations as well as annotate the graph and to add reference lines wherever needed inside the graphic area (see Figures 12, 14, 16 and 17 where particular observations are identified). In addition, with SPSS multiple graphic images can be copied simultaneously and inserted into an MS-Word document. SPSS also has the capability of recording the particular modifications made to a plot to a file so that new plots can be made with the same format using what is referred to as a 'template' file (an option also available in Eviews). But a further feature of SPSS, which is available after every command, is that it creates an exact 'journal file' which allows all the point-and-click commands completed in a session to be recorded to a 'batch commands' file which can be read into the syntax window of the program. In this window the file can be edited with a text editor so that multiple identical runs can be made with the same data.

Eviews unlike SPSS will compute non-parametric density estimates. In addition, in Eviews multiple graphs can be produced and put into a single graphic file, with the capability of overlaying one plot on the other. This is particularly useful for summarising a number of plots on one page. Figure 22 is an example of how the distribution plots of the sales type listed in Figure 16 can be summarised in a series of kernel density plots. It is possible to request the estimated density be placed in a file that can be exported to another file so that densities can be compared on the same plot. In Figure 23 we have rescaled the density plots for both Health and Beauty Sales and Bakery Sales so that they can be compared directly. Note that the area under each density curve is scaled so that it is equal to unity.

Figure 22 The Multiple Density Plots for Supermarket Sales

Figure 23 The Overlay of the Density Plot for Health and Beauty Sales (Dashed Line) and Bakery Sales (Solid Line)

9. Conclusion

The message to be drawn from this article is that graphic representation of data can help to improve the understanding of the observational information used in statistical analysis. There are a number of methods for summarising data in a visual way. These methods can be used on individual series of values or on paired data or with multiple series. The distributional assumptions of the data can be examined and the interrelationships between two variables and multiple variables can be explored. In addition, we have demonstrated how a standard software-generated graphic image can be improved to enhance the message in the graphic information of the data.

The emphasis in this article is on cross-section observations in that we have not discussed the time-series aspects of the data. Implicitly we have assumed that the data under examination are identically and independently distributed. The second assumption is often not the case when a sample is measured over time. Unfortunately, when a sample is not independent the use of correlation methods with other dependent data may result in spurious results. In addition, if the data are not identically distributed, the estimation of the density function may be confounded by the fact that the data may be generated by multiple processes and thus trying to identify a single process may be akin to the use of data from a dumbbell plot case to estimate a correlation coefficient. Graphical methods can be used with

time-series data to identify the nature of the data. This will be the topic of a future paper.

November 2004

Endnotes

1. Note that this article assumes the knowledge from a first year introductory statistics subject.

2. Cross-section data are data on one or more variables collected at the same point in time, such as survey data. Although the methods described here can be applied to data over time (time series), time-series data require specialised methods which are not discussed in this article.

3. The Dominick's database covers store-level scanner data collected at Dominick's Finer Foods over a period of more than seven years. The data are the property of the Marketing group at the University of Chicago Graduate School of Business and are intended for academic use only. For more detail on any other parts of this dataset consult the web site at <http://gsbwww.uchicago.edu/research/mkt/Databases/DFF/DFF.html>.

References

Anscombe, F. 1973, 'Graphs in statistical analysis', American Statistician, vol. 27, pp. 17–21.
Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A. 1983, Graphical Methods for Data Analysis, Chapman & Hall, United States.
Epanechnikov, V. A. 1969, 'Nonparametric estimates of multivariate probability density', Theory of Probability and Applications, vol. 14, pp. 153–8.
Greene, W. 2003, Econometric Analysis, 5th edn, Prentice Hall, New Jersey.
Griffiths, W., Carter Hill, R. and Judge, G. 1993, Learning and Practicing Econometrics, John Wiley & Sons Ltd, New York.
Henry, G. T. 1995, Graphing Data: Techniques for Display and Analysis, Sage Publications, Thousand Oaks, California.
Kennedy, P. 2003, A Guide to Econometrics, 5th edn, Blackwell Publishing, United Kingdom.
Koop, G. 2000, Analysis of Economic Data, John Wiley & Sons Ltd, New York.
Leamer, E. 1994, Sturdy Econometrics, Edward Elgar Publishing Company, Great Britain.
Pindyck, R. and Rubinfeld, D. 1998, Econometric Models and Economic Forecasts, 4th edn, International edn, Irwin/McGraw Hill, Boston, Massachusetts.
Silverman, B. W. 1986, Density Estimation for Statistics and Data Analysis, Chapman & Hall, London.
Tufte, E. R. 1983, The Visual Display of Quantitative Information, Graphics Press, Cheshire, Connecticut.
