Fall 2016
GETTING STARTED
This exercise uses the data file autolab.dta, which is posted on Moodle. Copy that file into the
“Work” folder on the computer you are working on.
At a few points in this demo, some questions about how you would do certain things have been
inserted. When you get to these, stop, think about them, and work on them until you have
figured them out. You don’t have to turn anything in, but you will learn a lot more if you tackle
these questions seriously.
We have seen how to get the number of observations, mean, standard deviation, and the
minimum and maximum values of one or more variables:
sum price
sum price mpg
Finding percentiles with Stata is a little weird, but not hard. It requires two commands:
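The two commands themselves are not shown here. One pair that fits the description, assuming autolab.dta is loaded, is _pctile followed by return list (the choice of the 10th and 90th percentiles is only an illustration):

```stata
* Compute the 10th and 90th percentiles of price (nothing is printed)
_pctile price, p(10 90)
* Show the saved results; the percentiles are stored as r(r1) and r(r2)
return list
```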
Note: This is an example of how, after some commands, Stata temporarily saves some of
the results it has generated. In these cases, it is possible to see what results are saved, and
the labels Stata has given them, by typing “return list.” Here is another example:
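The example is not shown here. A minimal sketch, assuming autolab.dta is loaded, uses the summarize command from earlier in this handout:

```stata
* summarize saves its results temporarily
summarize price
* List what was saved, e.g. r(N), r(mean), r(sd), r(min), r(max)
return list
```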
Another command that produces descriptive statistics for quantitative variables is tabstat:
tabstat price
And you can use tabstat to find more than just the mean. For instance:
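The original example is not shown here. A sketch of the idea, with the particular statistics chosen only for illustration:

```stata
* Request several statistics at once for two variables
tabstat price mpg, stat(mean sd p50 min max)
```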
Note: How do you know which statistics you can specify in the stat option of the
tabstat command? Type:
help tabstat
Histograms
To make a histogram of the gas mileage variable:
histogram mpg
You can tell Stata how many intervals (or “bins”) you want the values divided into:
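For instance, a sketch assuming you want 10 bins (the number 10 is arbitrary):

```stata
* Divide the values of mpg into 10 intervals
histogram mpg, bin(10)
```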
You can tell Stata to show the histograms “by” a categorical variable—i.e., for subsets of the
entire data set corresponding to different values of a categorical variable. You might expect it to
work like this:
sort origin
by origin: histogram mpg, percent
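That form is likely to produce an error, because histogram is generally not allowed with the by prefix. A sketch of the alternative, assuming the by() option is what is wanted here:

```stata
* Use the by() option instead of the by prefix; no sorting is needed
histogram mpg, percent by(origin)
```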
That doesn’t look good—it would be nicer if the bars lined up nicely over the values of 1, 2, 3, 4
and 5. When the variable for which you are constructing a histogram consists of “discrete”
values such as these, you can get a better looking histogram like this:
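The command is not shown here. A sketch, assuming the variable in question is repair (which takes the values 1 through 5):

```stata
* The discrete option gives each distinct value its own bar, centered on the value
histogram repair, discrete percent
```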
Box plots
To make a box plot of the price variable:
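A minimal sketch, assuming autolab.dta is loaded:

```stata
* One box plot for the whole sample
graph box price
```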
To make separate box plots for different subsets of the data defined by a categorical variable:
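A sketch, assuming origin is the categorical variable of interest:

```stata
* One box per value of origin
graph box price, over(origin)
```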
You can even do this for subsets of data defined by two categorical variables:
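A sketch, assuming repair and origin are the two categorical variables (the order of the two over() options is a guess):

```stata
* One box for each combination of repair and origin
graph box price, over(repair) over(origin)
```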
For an ordered categorical variable, it is easy to make a frequency table. The tabulate
command gives you a table of frequencies, relative (percent) frequencies, and cumulative relative
frequencies:
tabulate repair
And you can do this for subsets of the data defined in various ways:
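The examples are not shown here. Two possibilities, assuming origin defines the subsets and is coded 0 for domestic and 1 for foreign:

```stata
* A separate frequency table of repair for each value of origin
bysort origin: tabulate repair
* A frequency table for one subset only (here, cars with origin equal to 1)
tabulate repair if origin == 1
```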
Note: The bysort prefix automatically sorts the data before executing
the command, so that you don’t have to run a separate sort command before
the command starting with the by prefix.
For an unordered categorical variable, things are almost the same, but you need to be
careful about one thing. Try this:
tabulate make
Does all the output that Stata produces make sense?
The thing to be careful about: Stata doesn’t distinguish between ordered and unordered
categorical variables; if the categories of a categorical variable do have some meaningful order,
you just need to keep that in mind when you work with the data and interpret your results. And
one particular thing to keep in mind: when you construct a frequency table for an unordered
categorical variable, the cumulative relative frequencies do not mean anything, and you should
therefore make sure they are not included in any tables you make.
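The general form of the command for a bar graph is
graph bar (stat_choice) quant_var1 quant_var2 … quant_varK
where some possible choices for stat_choice are mean, sd, count, p50, and sum; and
quant_var1 quant_var2 … quant_varK is a list of K quantitative variables.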
You get one bar for each variable you specify, and the height of each bar is the value of the
stat_choice you specified.
For example:
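The original example is not shown here. A sketch, with the statistic and the two variables chosen only for illustration:

```stata
* Two bars: the mean of price and the mean of mpg
graph bar (mean) price mpg
```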
You can do the same thing with just one quantitative variable, but you don’t get much:
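A sketch with a single variable:

```stata
* A single bar whose height is the mean of price
graph bar (mean) price
```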
But you can use the over option to get one bar for each value of a categorical variable you
choose, with the height of each bar equal to the value of the stat_choice you chose for the
quantitative variable you specified:
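A sketch, assuming repair is the categorical variable and the mean of price is the statistic:

```stata
* One bar per repair category; each bar's height is the mean price in that category
graph bar (mean) price, over(repair)
```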
And you can make a similar bar graph over two different categorical variables:
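A sketch, assuming repair and origin are the two categorical variables:

```stata
* Bars for each repair category, grouped by origin
graph bar (mean) price, over(repair) over(origin)
```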
Compare that to what happens if you type the over(repair) and over(origin) in the
reverse order:
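A sketch of the reversed version, using the mean of price as an illustrative statistic:

```stata
* Bars for each origin, grouped by repair category
graph bar (mean) price, over(origin) over(repair)
```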
You can avoid the overlapping labels for “Domestic” and “Foreign” by inserting another option
inside the over(origin) option:
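The option used in the original is not shown here. One possibility is the label() suboption, which controls how the category labels are drawn; the 45-degree angle is an arbitrary choice:

```stata
* Draw the origin labels at an angle so they no longer overlap
graph bar (mean) price, over(origin, label(angle(45))) over(repair)
```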
You can get Stata to leave out the categories with no observations, i.e., (repair=F &
origin=“Foreign”) and (repair=D & origin=“Foreign”):
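A sketch, assuming the nofill option of graph bar is what is intended:

```stata
* nofill drops the combinations of categories that contain no observations
graph bar (mean) price, over(origin) over(repair) nofill
```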
QUESTIONS:
How would you get a bar chart with one bar for each repair record, and the height of each bar
equal to the proportion of observations with that repair record?
How would you get a bar chart with one bar for each repair record, and the height of each bar
equal to the number of observations with that repair record?
How would you get a bar chart with two bars, one representing domestic cars and the other
representing foreign cars, and the height of each bar equal to the mean value of repair for the
cars in the category represented by the bar?
How would you get a bar chart with two bars, one representing domestic cars and the other
representing foreign cars, and the height of each bar equal to the proportion of cars (in the
category represented by the bar) that have repair records of 4 or 5?
Scatterplots
To make a scatterplot of weight and mpg:
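A minimal sketch; which variable goes on which axis is a guess here:

```stata
* mpg on the vertical axis, weight on the horizontal axis
scatter mpg weight
```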
Note: The first variable you list after scatter is shown on the vertical axis; the second
variable is shown on the horizontal axis.
You can get Stata to use different symbols to mark the points on the scatter plot:
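A sketch using the full option name msymbol (the note that follows refers to its abbreviated form); the hollow circle Oh is only one of many choices:

```stata
* Mark each point with a hollow circle
scatter mpg weight, msymbol(Oh)
```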
Note: The m is short for msymbol (marker symbol). To see what other marker symbols are
available, type
help symbolstyle
In addition to marking points on the scatterplot with symbols, you can also label them on the
basis of some other variable. For instance, to get a scatterplot with points representing domestic
cars labeled with 0 and foreign cars labeled with 1:
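A sketch, assuming origin is coded 0 for domestic and 1 for foreign. Suppressing the marker symbol and centering the label are stylistic assumptions, not necessarily what the original command did:

```stata
* Replace each marker with the value of origin, placed at the point itself
scatter mpg weight, msymbol(none) mlabel(origin) mlabposition(0)
```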
Here is an example of how we could use the commands demonstrated above to notice that
something seems to be going on in this data, and to figure out what it is.
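The scatterplot discussed next plots price against turning radius. A sketch, assuming the turning radius variable is named turn (as it is later in this handout) and goes on the horizontal axis:

```stata
* price on the vertical axis, turning radius on the horizontal axis
scatter price turn
```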
Looking at that scatterplot, I’d say that there is something going on. It looks like there are two
main clusters of points, with some less densely clustered points scattered above them. Do the
two main clusters of points represent two different groups of cars? If so, can we discover the
characteristic or characteristics that separate or define these two groups?
Maybe one cluster of points is mostly foreign cars and one cluster of points is mostly domestic
cars. To check whether this is true:
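A sketch, assuming origin distinguishes domestic from foreign cars:

```stata
* Label each point in the price/turn scatterplot with its value of origin
scatter price turn, msymbol(none) mlabel(origin) mlabposition(0)
```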
Wow—that is pretty striking. All the points in the cluster on the right side of the scatterplot (as
well as all the points scattered above this cluster) are domestic cars; the cars in the cluster on the
left side of the scatterplot (as well as all the points scattered above this cluster) include both
foreign and domestic cars, but it looks like they are predominantly foreign.
We could also try to see whether the clustering of points in the price/turn radius scatterplot is
related to the repair records of the cars. For this, I want to create a new variable called
repair3 which is a recoding of repair that collapses the five categories of repair into
three categories.
Here is a way to generate this new variable using the recode command, a command we
have not seen before. I want repair3 to equal 0 if repair is equal to 1 or 2; to equal
1 if repair equals 3 or 4; and to equal 2 if repair equals 5. Here is how I can get
that with the recode command:
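A sketch of a recode command that implements the mapping just described:

```stata
* 1 or 2 becomes 0; 3 or 4 becomes 1; 5 becomes 2. The original repair is left unchanged.
recode repair (1 2 = 0) (3 4 = 1) (5 = 2), generate(repair3)
* Check the recoding against the original variable
tabulate repair repair3
```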
Now I can use repair3 to label points in the price/turn radius scatterplot:
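A sketch, assuming repair3 has already been created as described above:

```stata
* Label each point with its value of repair3
scatter price turn, msymbol(none) mlabel(repair3) mlabposition(0)
```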
Do you see any kind of pattern in that scatterplot? It might be easier to see any pattern that
might exist if we make separate scatterplots for each category of repair3:
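A sketch, assuming repair3 has already been created as described above:

```stata
* One scatterplot panel for each category of repair3
scatter price turn, by(repair3)
```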
A digression: When you are examining data and you look at it in one way, or with the
goal of answering a certain question, what you see often suggests that you might want to
look at it another way, or explore some other question you had not thought about before.
Exploring unanticipated questions that arise as you look at the results you generate with
Stata is a good way to learn about your data and make interesting discoveries.
At this point in this demo, we have come to an example of this phenomenon: We have
been looking at scatterplots of the prices of cars and their turning radii, with the goal of
uncovering any association that might exist between the two variables. But in the last
graph, I noticed something: it looks like the distribution of the price variable does not
differ much across repair categories; but the distribution of the turning radius variable
does appear to differ quite a bit across repair categories.
So suppose we want to focus on this apparent pattern, and look carefully at how the
univariate distributions of price and turning radius do or don’t differ across repair
categories. (So we have moved on to a question that is different from the issue we began
with, namely exploring the bivariate relationship between price and turning radius.)
Here is a way to compare the distribution of price across the three repair categories:
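The original command is not shown here. One possibility is a set of histograms, one per category; box plots with over(repair3) would be another:

```stata
* Histogram of price for each category of repair3
histogram price, percent by(repair3)
```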
Those three distributions look pretty similar, don’t they? If they don’t look so similar to
you, wait until you see what the distribution of turn looks like for the three different
repair categories:
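A sketch of the same kind of comparison for turn, again assuming histograms are the display of choice:

```stata
* Histogram of turning radius for each category of repair3
histogram turn, percent by(repair3)
```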
How would you describe the differences in the distribution of turn for the three repair
categories?
Let me reiterate the point of this digression: The moral of the story is that if you notice
something that you hadn’t expected, or that you think needs to be explained, or that you
think might tell you something valuable to know—follow up on it: think about what
questions the thing you have noticed raises, and then figure out how to get Stata to help
you look at the data in ways that will help you answer those questions. This digression
was just a little example of this process in action.
HEY! When you ask Stata to find the covariance between two variables, does it give you
the population covariance or the sample covariance? Which would you guess it gives
you? How could you find out for sure?
As a matter of fact, we should have asked a similar question a long time ago.
I.e.: when you ask Stata to find the variance of a single variable (e.g., using the
summarize or the tabstat command), does it give you the population
variance or the sample variance? Which would you guess it gives you? How
could you find out for sure?
Should we also ask whether the correlation Stata calculates for us is the population
correlation or the sample correlation?
A final reminder about calculating the covariance and/or correlation between two variables
BEWARE!! Remember that you should calculate the covariance and/or correlation between two
variables only if:
For this part of the demo, I’d like to have an additional categorical variable to work with, so let’s
generate a new variable called country that shows the country in which the car was
manufactured:
generate country="US"
Note: The three forward slashes at the end of the line mean that the command wraps on
to the next line. If you type three forward slashes at the end of a line in a do-file, Stata
ignores the “return” at the end of the line—so it keeps reading to the end of the command
on the next line, rather than executing the part of the command you have typed when you
get to the “return” at the end of the line. If you are typing the commands shown above in
Stata’s command line (rather than working from a do-file), you should leave the slashes
out.
Cross-tabulations
A cross-tabulation (or cross-tab) is a way of representing the joint distribution of two categorical
variables. The rows of the table represent the categories of one of the variables, and the columns
represent the categories of the other variable. Each cell of the table shows the number of
observations for which the values of the two categorical variables are those that correspond to
the row and column of the cell.
To get Stata to construct a cross-tab, you use the same tabulate command you used to get a
frequency table for a single categorical variable; the difference is that, to get a cross-tabulation,
you type the names of two categorical variables:
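A sketch, using the two variables that appear in the cross-tabs later in this handout:

```stata
* Rows are the categories of country; columns are the categories of repair3
tabulate country repair3
```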
The table is easier to read if you label the values of the variable repair3 (which, you
may recall, must be done in two steps: first, define a set of labels, and second, apply that
set of labels to the variable repair3):
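The two steps might look like this. Only the label “average” appears in this handout; the label name repair3lbl and the other two label texts are placeholders:

```stata
* Step 1: define a set of value labels (the name and the texts "poor" and "good" are hypothetical)
label define repair3lbl 0 "poor" 1 "average" 2 "good"
* Step 2: apply that set of labels to repair3
label values repair3 repair3lbl
```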
Cell percentages: In the above cross-tab, the number in each cell is the number of observations
whose values of country and repair3 correspond to the row and column of the cell. We can also
get Stata to report the percentage (rather than the total number) of all observations falling in
each cell:
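A sketch using the cell option:

```stata
* Each cell shows the frequency and the percentage of all observations
tabulate country repair3, cell
* Adding nofreq would suppress the frequencies and leave only the percentages
tabulate country repair3, cell nofreq
```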
This tells us, for example, that 52.17 percent of the cars in the sample were made in the US and
had a repair record of “average.”
Row percentages: If you type
tabulate country repair3, row
then (as always) Stata shows the number of observations corresponding to each cell, as well as
the percentage of the observations in the row that fall in the cell. For instance, the total number
of observations falling in the “US” row is 48, and 75 percent of those observations (i.e., 36 out of
48) fall in the column representing an “average” repair record.
You can specify all three options—cell, row, column—(or any two of these three options) when
you construct a cross-tab:
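For instance, with all three options:

```stata
* Each cell shows the frequency, row percentage, column percentage, and cell percentage
tabulate country repair3, cell row column
```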
Now let’s see how you can make bar charts that show the information you see in a cross-tab
graphically.
First, let’s make a bar chart that shows the frequencies in the cells of a cross-tab. To make this
bar chart, it will help to have a set of dummy variables representing the four categories of
country and the three categories of repair3, so to begin with let’s generate those dummy
variables:
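One way to do this is the generate() option of tabulate. The names ctry1 through ctry4 match the commands later in this handout; the stub rep for the repair3 dummies is a hypothetical name:

```stata
* Creates ctry1, ctry2, ctry3, ctry4: one 0/1 dummy per category of country
tabulate country, generate(ctry)
* Creates rep1, rep2, rep3: one 0/1 dummy per category of repair3 (stub name is an assumption)
tabulate repair3, generate(rep)
```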
Now we can make the bar graph. Actually, there are two different ways to do it, both of which
show you exactly what you see in the cross-tabulation, though arranged a little differently in the
two cases:
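A sketch of the first version, assuming the dummy variables ctry1 through ctry4 exist. The sum of a 0/1 dummy is the number of observations in that category:

```stata
* Within each repair3 category, one bar per country; height = number of cars
graph bar (sum) ctry1 ctry2 ctry3 ctry4, over(repair3)
```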
or:
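A sketch of the second version, assuming dummy variables for the three repair3 categories exist under the hypothetical names rep1, rep2, and rep3:

```stata
* Within each country, one bar per repair3 category; height = number of cars
graph bar (sum) rep1 rep2 rep3, over(country)
```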
In both of these bar charts, each bar represents one cell of the cross-tabulation; the height of the
bar is the number of observations in the cell.
We can even get the heights of the bars (i.e., the number of observations in the corresponding
cell of the cross-tabulation) shown at the top of each bar:
graph bar (sum) ctry1 ctry2 ctry3 ctry4, over(repair3) ///
blabel(bar)
or:
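A sketch of the alternative arrangement, again assuming repair3 dummies under the hypothetical names rep1, rep2, and rep3:

```stata
* Same bars grouped by country, with each bar's height printed above it
graph bar (sum) rep1 rep2 rep3, over(country) ///
blabel(bar)
```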
Here is something to think about: could you make a bar chart with one bar corresponding to
every cell of the cross-tabulation (as in the last two bar charts we made above), but with the
height of each bar equal to the row percentage for the corresponding cell of the cross-tabulation,
instead of the number of observations in the cell?
Note: Actually, to be precise, try to make this bar chart with the height of each bar equal
to the row proportion for the corresponding cell of the cross-tabulation, rather than the
row percentage. (The row proportion will be between 0 and 1; the row percentage will be
between 0 and 100.) The chart will look just the same whether you use proportions or
percentages; it just happens to be easier to get Stata to do it with proportions.
And similarly, could you make a bar chart with one bar corresponding to every cell of the cross-
tabulation, but with the height of each bar equal to the column proportion for the corresponding
cell of the cross-tabulation, instead of the number of observations in the cell?
The answer to both of these questions is “yes.” So the questions I’ll leave to you to think about
are these:
How would you make a bar chart with one bar corresponding to every cell of the cross-
tabulation (as in the last two bar charts we made above), but with the height of each bar
equal to the row proportion for the corresponding cell of the cross-tabulation, instead of
the number of observations in the cell?
How would you make a bar chart with one bar corresponding to every cell of the cross-
tabulation (as in the last two bar charts we made above), but with the height of each bar
equal to the column proportion for the corresponding cell of the cross-tabulation, instead
of the number of observations in the cell?