Econ 396A

Fall 2016

STATA DEMO 3: DESCRIPTIVE STATISTICS WITH STATA

GETTING STARTED

This exercise uses the data file autolab.dta, which is posted on Moodle. Copy that file into the
“Work” folder on the computer you are working on.

Double-click on autolab.dta to launch Stata and open the auto data.

At a few points in this demo, some questions about how you would do certain things have been
inserted. When you get to these, stop, think about them, and work on them until you have
figured them out. You don’t have to turn anything in, but you will learn a lot more if you tackle
these questions seriously.

Descriptive statistics for a single quantitative variable: numerical summaries

We have seen how to get the number of observations, mean, standard deviation, and the
minimum and maximum values of one or more variables:

sum price
sum price mpg

And you can get additional descriptive statistics by typing

sum price mpg, detail

Finding percentiles with Stata is a little weird, but not hard. It requires two commands:

_pctile price, p(12 25 83 99)


return list

Note: This is an example of how, after some commands, Stata temporarily saves some of
the results it has generated. In these cases, it is possible to see what results are saved, and
the labels Stata has given them, by typing “return list.” Here is another example:

summarize price, detail


return list
summarize price if price>=r(p50)
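
Because the r() results are replaced the next time you run a command that stores results, it can be safer to copy the ones you need into scalars first. Here is a minimal sketch using a small made-up variable x (the seven values are invented for illustration and are not from autolab.dta; the scalar names are also just illustrative). The preserve and restore commands put your own data back when the example is done:

```stata
* set the data in memory aside and enter seven made-up values
preserve
clear
input x
1
2
3
4
5
6
100
end

* copy the stored results into scalars before they are overwritten
summarize x, detail
scalar s_median = r(p50)
scalar s_mean = r(mean)
display "median = " s_median "   mean = " s_mean

* count also stores results, so it replaces what summarize left in r()
count if x >= s_median
return list

* bring back the data that were in memory
restore
```

After count has run, return list shows only the result stored by count; the median and mean survive because they were copied into scalars.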

Another command that produces descriptive statistics for quantitative variables is tabstat:

tabstat price

And you can use tabstat to find more than just the mean. For instance:

tabstat price, stat(min p25 p50 p75 max)

Note: How do you know which statistics you can specify in the stat option of the
tabstat command?

help tabstat

Descriptive statistics for a single quantitative variable: graphical summaries

Histograms
To make a histogram of the gas mileage variable:

histogram mpg

You can specify other scales for the vertical axis:

histogram mpg, percent


histogram mpg, frequency

You can tell Stata how many intervals (or “bins”) you want the values divided into:

histogram mpg, percent bin(20)

You can tell Stata to show the histograms “by” a categorical variable—i.e., for subsets of the
entire data set corresponding to different values of a categorical variable. You might expect it to
work like this:

sort origin
by origin: histogram mpg, percent

but that’s not how it works. Instead:

histogram mpg, percent by(origin)

And try one more option:

histogram mpg, percent by(origin, total)

Look what happens if we make a histogram of repair:

histogram repair, percent

That doesn’t look good—it would be nicer if the bars lined up nicely over the values of 1, 2, 3, 4
and 5. When the variable for which you are constructing a histogram consists of “discrete”
values such as these, you can get a better looking histogram like this:

histogram repair, percent discrete

Box plots
To make a box plot of the price variable:

graph box price

To make separate box plots for different subsets of the data defined by a categorical variable:

graph box price, by(origin)


graph box price, by(repair)
graph box price, over(repair)

You can even do this for subsets of data defined by two categorical variables:

graph box price, over(repair) over(origin)

Descriptive statistics for a single categorical variable: frequency tables

For an ordered categorical variable, it is easy to make a frequency table. The tabulate
command gives you a table of frequencies, relative (percent) frequencies, and cumulative relative
frequencies:

tabulate repair

And you can do this for subsets of the data defined in various ways:

tabulate repair if price<6000 & origin=="Domestic"

bysort origin: tabulate repair

Note: The bysort prefix automatically sorts the data before executing
the command, so you don’t have to run a separate sort command before
a command starting with the by prefix.
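
To see the sorting requirement in action, here is a minimal sketch on a small made-up data set (the variables group and y are invented for illustration and are not in autolab.dta; the scalar name is also just illustrative). The capture prefix lets the example continue after the by command fails, and preserve and restore put your own data back afterwards:

```stata
* set the data in memory aside and enter four made-up observations
preserve
clear
input group y
2 10
1 20
2 30
1 40
end

* by on unsorted data fails; capture noisily shows the error but keeps going
capture noisily by group: summarize y
scalar rc_unsorted = _rc
display "return code = " rc_unsorted

* two-step version: sort first, then use the by prefix
sort group
by group: summarize y

* one-step version: bysort does the sorting for you
bysort group: summarize y

* bring back the data that were in memory
restore
```

The return code displayed should be 5, which is Stata's code for "not sorted."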

For an unordered categorical variable, things are almost the same, but you need to be
careful about one thing. Try this:

tabulate make

Does all the output that Stata produces make sense?

The thing to be careful about: Stata doesn’t distinguish between ordered and unordered
categorical variables; if the categories of a categorical variable do have some meaningful order,
you just need to keep that in mind when you work with the data and interpret your results. And
one particular thing to keep in mind: when you construct a frequency table for an unordered
categorical variable, the cumulative relative frequencies do not mean anything, and you should
therefore make sure they are not included in any tables you make.

Creating a set of dummy variables to represent a categorical variable

The labor intensive way:

generate rep1=1 if repair==1
replace rep1=0 if repair!=1 & repair!=.

generate rep2=1 if repair==2
replace rep2=0 if repair!=2 & repair!=.

generate rep3=1 if repair==3
replace rep3=0 if repair!=3 & repair!=.

generate rep4=1 if repair==4
replace rep4=0 if repair!=4 & repair!=.

generate rep5=1 if repair==5
replace rep5=0 if repair!=5 & repair!=.

A much easier way to accomplish the same thing:

tabulate repair, gen(altrep)
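
One way to convince yourself that the two approaches agree is to compare a hand-made dummy with the one tabulate creates. Here is a minimal sketch on a small made-up data set (the seven values of repair below are invented, not taken from autolab.dta, and the scalar name is just illustrative). Note that tabulate numbers the new variables by the order of the categories, so altrep3 matches repair==3 here only because every value from 1 to 5 appears in the data:

```stata
* set the data in memory aside and enter seven made-up values
preserve
clear
input repair
1
2
3
3
4
5
.
end

* the easy way: one dummy per category, named altrep1, altrep2, ...
tabulate repair, gen(altrep)

* the labor-intensive way, for the third category only
generate rep3=1 if repair==3
replace rep3=0 if repair!=3 & repair!=.

* assert is silent if the condition holds for every observation
* and stops with an error otherwise
assert rep3==altrep3

* how many observations fall in the third category?
count if altrep3==1
scalar n_rep3 = r(N)

* bring back the data that were in memory
restore
```

Both versions leave the dummy missing for the observation whose repair value is missing, which is why the assert passes on that row as well.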

Graphical representation of the distribution of a categorical variable: bar charts

Stata’s basic syntax for bar charts is as follows:

graph bar (stat_choice) quant_var1 quant_var2 … quant_varK

where some possible choices for stat_choice are mean, sd, count, p50, and sum; and
quant_var1 quant_var2 … quant_varK is a list of K quantitative variables.

You get one bar for each variable you specify, and the height of each bar is the value of the
stat_choice you specified.

For example:

graph bar (mean) price weight mpg


graph bar (mean) price weight mpg if origin=="Domestic"
graph bar (p50) price weight mpg if origin=="Domestic"

graph bar (sd) price weight mpg


graph bar (sum) price weight mpg

You can do the same thing with just one quantitative variable, but you don’t get much:

graph bar (sd) price


graph bar (sum) price

But you can use the over option to get one bar for each value of a categorical variable you
choose, with the height of each bar equal to the value of the stat_choice you chose for the
quantitative variable you specified:

graph bar (mean) price, over(repair)

And you can make a similar bar graph over two different categorical variables:

graph bar (mean) price, over(repair) over(origin)

Compare that to what happens if you type the over(repair) and over(origin) in the
reverse order:

graph bar (mean) price, over(origin) over(repair)

You can avoid the overlapping labels for “Domestic” and “Foreign” by inserting another option
inside the over(origin) option:

graph bar (mean) price, over(origin, label(angle(45))) ///
over(repair)

You can get Stata to leave out the categories with no observations, i.e., (repair=F &
origin="Foreign") and (repair=D & origin="Foreign"):

graph bar (mean) price, over(origin, label(angle(45))) ///
over(repair) nofill

QUESTIONS:

How would you get a bar chart with one bar for each repair record, and the height of each bar
equal to the proportion of observations with that repair record?

How would you get a bar chart with one bar for each repair record, and the height of each bar
equal to the number of observations with that repair record?

How would you get a bar chart with two bars, one representing domestic cars and the other
representing foreign cars, and the height of each bar equal to the mean value of repair for the
cars in the category represented by the bar?

How would you get a bar chart with two bars, one representing domestic cars and the other
representing foreign cars, and the height of each bar equal to the proportion of cars (in the
category represented by the bar) that have repair records of 4 or 5?

Bivariate descriptive statistics for two quantitative variables

Scatterplots
To make a scatterplot of weight and mpg:

scatter weight mpg

What’s the difference if you type

scatter mpg weight

Note: The first variable you list after scatter is shown on the vertical axis; the second
variable is shown on the horizontal axis.

You can get Stata to use different symbols to mark the points on the scatter plot:

scatter mpg weight, m(+)


scatter mpg weight, m(t)
scatter mpg weight, m(th)

Note: The m is short for msymbol (marker symbol). To see what other marker symbols are
available, type

help symbolstyle

In addition to marking points on the scatterplot with symbols, you can also label them on the
basis of some other variable. For instance, to get a scatterplot with points representing domestic
cars labeled with 0 and foreign cars labeled with 1:

scatter mpg weight, m(+) ml(import)

Note: The ml is short for mlabel (marker label).

Here is an example of how we could use the commands demonstrated above to notice that
something seems to be going on in this data, and to figure out what it is.

Look at a scatterplot of price and turning radius:

scatter price turn

Looking at that scatterplot, I’d say that there is something going on. It looks like there are two
main clusters of points, with some less densely clustered points scattered above them. Do the
two main clusters of points represent two different groups of cars? If so, can we discover the
characteristic or characteristics that separate or define these two groups?

Maybe one cluster of points is mostly foreign cars and one cluster of points is mostly domestic
cars. To check whether this is true:

scatter price turn, ml(import)

Wow—that is pretty striking. All the points in the cluster on the right side of the scatterplot (as
well as all the points scattered above this cluster) are domestic cars; the cars in the cluster on the
left side of the scatterplot (as well as all the points scattered above this cluster) include both
foreign and domestic cars, but it looks like they are predominantly foreign.

We could also try to see whether the clustering of points in the price/turn radius scatterplot is
related to the repair records of the cars. For this, I want to create a new variable called
repair3 which is a recoding of repair that collapses the five categories of repair into
three categories.

Here is a way to generate this new variable using the recode command, a command we
have not seen before. I want repair3 to equal 0 if repair is equal to 1 or 2; to equal
1 if repair equals 3 or 4; and to equal 2 if repair equals 5. Here is how I can get
that with the recode command:

recode repair 1=0 2=0 3=1 4=1 5=2, generate(repair3)
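
After recoding, it is worth checking that every old value landed in the intended new category. Here is a minimal sketch on a small made-up data set (six invented values of repair, including one missing value; not taken from autolab.dta, and the scalar name is just illustrative). It writes the same rules in recode's parenthesized form, and then cross-tabulates the old variable against the new one:

```stata
* set the data in memory aside and enter six made-up values
preserve
clear
input repair
1
2
3
4
5
.
end

* same rules as above, written with one parenthesized rule per new value
recode repair (1 2 = 0) (3 4 = 1) (5 = 2), generate(repair3)

* each row (old value) should have all its observations in one column;
* the missing option also shows what happened to missing values
tabulate repair repair3, missing

* how many observations ended up in the lowest category?
count if repair3==0
scalar n_low = r(N)

* bring back the data that were in memory
restore
```

Values that no rule mentions, including missing values, are carried over unchanged, so the missing observation stays missing in repair3.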

Now I can use repair3 to label points in the price/turn radius scatterplot:

scatter price turn, ml(repair3)

Do you see any kind of pattern in that scatterplot? It might be easier to see any pattern that
might exist if we make separate scatterplots for each category of repair3:

scatter price turn, by(repair3)

A digression: When you are examining data and you look at it in one way, or with the
goal of answering a certain question, what you see often suggests that you might want to
look at it another way, or explore some other question you had not thought about before.
Exploring unanticipated questions that arise as you look at the results you generate with
Stata is a good way to learn about your data and make interesting discoveries.

At this point in this demo, we have come to an example of this phenomenon: We have
been looking at scatterplots of the prices of cars and their turning radii, with the goal of
uncovering any association that might exist between the two variables. But in the last
graph, I noticed something: it looks like the distribution of the price variable does not
differ much across repair categories; but the distribution of the turning radius variable
does appear to differ quite a bit across repair categories.

So suppose we want to focus on this apparent pattern, and look carefully at how the
univariate distributions of price and turning radius do or don’t differ across repair
categories. (So we have moved on to a question that is different from the issue we began
with, namely exploring the bivariate relationship between price and turning radius.)

Here is a way to compare the distribution of price across the three repair categories:

graph box price, over(repair3)

Those three distributions look pretty similar, don’t they? If they don’t look so similar to
you, wait until you see what the distribution of turn looks like for the three different
repair categories:

graph box turn, over(repair3)

How would you describe the differences in the distribution of turn for the three repair
categories?

Let me reiterate the point of this digression: The moral of the story is that if you notice
something that you hadn’t expected, or that you think needs to be explained, or that you
think might tell you something valuable to know—follow up on it: think about what
questions the thing you have noticed raises, and then figure out how to get Stata to help
you look at the data in ways that will help you answer those questions. This digression
was just a little example of this process in action.

Covariance and correlation


Because (i) mpg and weight are quantitative variables, and (ii) the relationship between these
two variables looks more-or-less linear, it makes sense to calculate the covariance and
correlation between them. To get the covariance:

correlate mpg weight, covariance

To get the correlation:

correlate mpg weight
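
The two commands are closely related: the correlation is the covariance divided by the product of the two standard deviations. Here is a minimal sketch that checks this using the results correlate stores, on five made-up pairs of values (x and y are invented for illustration and are not autolab variables; the scalar names are also just illustrative):

```stata
* set the data in memory aside and enter five made-up pairs
preserve
clear
input x y
1 2
2 4
3 5
4 4
5 5
end

* with the covariance option, correlate stores the covariance and both variances
correlate x y, covariance
scalar r_by_hand = r(cov_12) / sqrt(r(Var_1) * r(Var_2))

* without the option, correlate stores the correlation itself
correlate x y
scalar r_stata = r(rho)

display "by hand: " r_by_hand "   Stata: " r_stata

* bring back the data that were in memory
restore
```

The two displayed numbers should agree up to rounding.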

HEY! When you ask Stata to find the covariance between two variables, does it give you
the population covariance or the sample covariance? Which would you guess it gives
you? How could you find out for sure?

As a matter of fact, we should have asked a similar question a long time ago.
I.e.: when you ask Stata to find the variance of a single variable (e.g., using the
summarize or the tabstat command), does it give you the population
variance or the sample variance? Which would you guess it gives you? How
could you find out for sure?

Should we also ask whether the correlation Stata calculates for us is the population
correlation or the sample correlation?

A final reminder about calculating the covariance and/or correlation between two variables

BEWARE!! Remember that you should calculate the covariance and/or correlation between two
variables only if:

--both of them are quantitative variables

and

--you have reason to believe (often on the basis of a scatterplot) that the relationship
between the two variables is at least approximately linear

Bivariate descriptive statistics for two categorical variables

For this part of the demo, I’d like to have an additional categorical variable to work with, so let’s
generate a new variable called country that shows the country in which the car was
manufactured:

generate country="US"

replace country="Japan" if make=="Datsun" ///
| make=="Honda" | make=="Mazda" ///
| make=="Subaru" | make=="Toyota"

replace country="Germany" if make=="Audi" ///
| make=="BMW" | make=="VW"

replace country="Other" if make=="Fiat" ///
| make=="Peugeot" | make=="Renault" | make=="Volvo"

Note: The three forward slashes at the end of the line mean that the command wraps on
to the next line. If you type three forward slashes at the end of a line in a do-file, Stata
ignores the “return” at the end of the line—so it keeps reading to the end of the command
on the next line, rather than executing the part of the command you have typed when you
get to the “return” at the end of the line. If you are typing the commands shown above in
Stata’s command line (rather than working from a do-file), you should leave the slashes
out.
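
When a condition compares one variable with a long list of values, Stata's inlist() function is a possible alternative that often fits on one line, so that no continuation slashes are needed. Here is a minimal sketch on a made-up make variable (five invented observations, not the autolab data; the scalar name is just illustrative):

```stata
* set the data in memory aside and enter five made-up observations
preserve
clear
input str10 make
"Honda"
"Audi"
"Buick"
"Volvo"
"Toyota"
end

* inlist(make, ...) is true if make equals any of the listed values
generate country="US"
replace country="Japan" if inlist(make, "Datsun", "Honda", "Mazda", "Subaru", "Toyota")
replace country="Germany" if inlist(make, "Audi", "BMW", "VW")
replace country="Other" if inlist(make, "Fiat", "Peugeot", "Renault", "Volvo")

list make country

* how many of the made-up cars were classified as Japanese?
count if country=="Japan"
scalar n_japan = r(N)

* bring back the data that were in memory
restore
```

With string values, inlist() accepts only a limited number of arguments, so very long lists may still need the | form shown above.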

Cross-tabulations
A cross-tabulation (or cross-tab) is a way of representing the joint distribution of two categorical
variables. The rows of the table represent the categories of one of the variables, and the columns
represent the categories of the other variable. Each cell of the table shows the number of
observations for which the values of the two categorical variables are those that correspond to
the row and column of the cell.

To get Stata to construct a cross-tab, you use the same tabulate command you used to get a
frequency table for a single categorical variable; the difference is that, to get a cross-tabulation,
you type the names of two categorical variables:

tabulate country repair3

Bells and Whistles: Let’s make that look nicer, by:

--labeling the variables country and repair3

label variable country "Country of Origin"
label variable repair3 "Repair Record"

--and by labeling the values of the variable repair3 (which, you may recall, must
be done in two steps: first, define a set of labels, and second, apply that set of labels
to the variable repair3)

label define repairlabel 0 Poor 1 Average 2 Good


label values repair3 repairlabel

So now let’s see the nicer-looking version:

tabulate country repair3

Cell percentages: In the above cross-tab, the number in each cell is the number of observations
whose values of country and repair3 correspond to the row and column of the cell. We can
also get Stata to report the percentage (rather than total number) of all observations
falling in each cell:

tabulate country repair3, cell

This tells us, for example, that 52.17 percent of the cars in the sample were made in the US and
had a repair record of “average.”

Row percentages: If you type

tabulate country repair3, row

then (as always) Stata shows the number of observations corresponding to each cell, as well as
the percentage of the observations in the row that fall in the cell. For instance, the total number
of observations falling in the “US” row is 48, and 75 percent of those observations (i.e., 36 out of
48) fall in the column representing “average” repair record.
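
Both percentages can be checked by hand from the counts. There are 36 US cars with an average repair record and 48 US cars in all; the cell percentage of 52.17 implies that the table contains 69 cars in total (that total is inferred here from the quoted figures rather than stated above). A quick check with display:

```stata
* cell percentage: cell count divided by the table total
display %5.2f 100*36/69
* row percentage: cell count divided by the row total
display %5.2f 100*36/48
```

These should print 52.17 and 75.00.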

Column percentages: Try typing

tabulate country repair3, column

and interpreting the percentages given in the cells of the cross-tab.

You can specify all three options—cell, row, column—(or any two of these three options) when
you construct a cross-tab:

tabulate country repair3, cell row column

Now let’s see how you can make bar charts that show graphically the information you see in
a cross-tab.

First, let’s make a bar chart that shows the frequencies in the cells of a cross-tab. To make this
bar chart, it will help to have a set of dummy variables representing the four categories of
country and the three categories of repair3, so to begin with let’s generate those dummy
variables:

tabulate country, gen(ctry)


tabulate repair3, gen(rep3_)

Now we can make the bar graph. Actually, there are two different ways to do it, both of which
show you exactly what you see in the cross-tabulation, though arranged a little differently in the
two cases:

graph bar (sum) ctry1 ctry2 ctry3 ctry4, over(repair3)

or:

graph bar (sum) rep3_1 rep3_2 rep3_3, over(country)

In both of these bar charts, each bar represents one cell of the cross-tabulation; the height of the
bar is the number of observations in the cell.

We can even get the heights of the bars (i.e., the number of observations in the corresponding
cell of the cross-tabulation) shown at the top of each bar:

graph bar (sum) ctry1 ctry2 ctry3 ctry4, over(repair3) ///
blabel(bar)

or:

graph bar (sum) rep3_1 rep3_2 rep3_3, over(country) ///
blabel(bar)

Here is something to think about: could you make a bar chart with one bar corresponding to
every cell of the cross-tabulation (as in the last two bar charts we made above), but with the
height of each bar equal to the row percentage for the corresponding cell of the cross-tabulation,
instead of the number of observations in the cell?

Note: Actually, to be precise, try to make this bar chart with the height of each bar equal
to the row proportion for the corresponding cell of the cross-tabulation, rather than the
row percentage. (The row proportion will be between 0 and 1; the row percentage will be
between 0 and 100.) The chart will look just the same whether you use proportions or
percentages; it just happens to be easier to get Stata to do it with proportions.

And similarly, could you make a bar chart with one bar corresponding to every cell of the
cross-tabulation, but with the height of each bar equal to the column proportion for the
corresponding cell of the cross-tabulation, instead of the number of observations in the cell?

The answer to both of these questions is “yes.” So the questions I’ll leave to you to think about
are these:

How would you make a bar chart with one bar corresponding to every cell of the cross-
tabulation (as in the last two bar charts we made above), but with the height of each bar
equal to the row proportion for the corresponding cell of the cross-tabulation, instead of
the number of observations in the cell?

How would you make a bar chart with one bar corresponding to every cell of the cross-
tabulation (as in the last two bar charts we made above), but with the height of each bar
equal to the column proportion for the corresponding cell of the cross-tabulation, instead
of the number of observations in the cell?
