
PRACTICAL EXERCISE 2: SOLUTION

Part A: Introduction to STATA

1. Log on to the workstation using your student username and password.


2. Open STATA.

Opening a Data Set:


3. You can now open the data set that we’re working with for this prac exercise.
use "Prac 2 Dataset\univariate.dta", clear
(General Household Survey 2005)

Starting a Log File:


4. A log is a text file that keeps a record of everything that you have done within STATA and can be
saved for use later. You should always keep a log file of your work. You will also be required to
submit your log files to show your work for assignment purposes.
Start a log file:
. log using "F:\Quantitative Economics\prac2.log"
--------------------------------------------------------------------------------
log: F:\Quantitative Economics\prac2.log
log type: text
opened on: 25 July 2013, 14:53:05

Looking at the Data:


5. To see the data values that are contained in the dataset you have just opened, click on the ‘Data
Browser’ icon (it looks like a data table with a magnifying glass over it).
What is the value of the 20th observation of the variable age?
Answer: 48
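The same value can also be read from the command line instead of the Data Browser. A minimal sketch, assuming the dataset from step 3 is already in memory:

```stata
* Show the value of age for observation 20 only
list age in 20

* Equivalent: subscript the variable with the observation number
display age[20]
```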

Basic STATA Commands:


6. describe 
(or desc for short) This command lists all the variable names and descriptions
contained in the data set.

Contains data from Mahomedy\Quantitative Economics\Tutorials\Data\univariate.dta


  obs:        22,167                          General Household Survey 2005
 vars:             8                          25 July 2013 15:25
 size:       465,507 (99.4% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
gender          byte   %8.0g       gender     Sex
age             long   %12.0g                 Age
race            byte   %8.0g       race       Race
maristat        byte   %37.0g      maristat   Marital status
province        byte   %13.0g      province   Provinces
valrent         long   %12.0g                 Total monthly rent paid
tothhexp        byte   %12.0g      tothhexp   Total household expenditure in
                                                last month
extrans         long   %12.0g                 Household expenditure on
                                                transport in last month
-------------------------------------------------------------------------------

603969139.doc 1
How many variables and how many observations are contained in this dataset?
Answer: 8 variables and 22 167 observations.
What level of analysis would be performed with this sort of dataset? (e.g. are the observations in
the dataset individuals, households, firms, countries, etc?)
Each observation corresponds to an individual (their gender, age, race, etc.) therefore
we can perform the analysis at an individual (or disaggregated) level.
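The two counts can also be displayed directly, without reading them off the describe header. A minimal sketch, assuming the dataset is in memory:

```stata
* Number of observations currently in memory
display _N

* Number of variables currently in memory
display c(k)
```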

Part B: Calculating and Displaying Univariate Descriptive Statistics

Distribution of a Continuous Variable:


7. To get a histogram of a continuous variable, click on the ‘Graphics’ drop-down menu, then
‘Histogram’.
[Figure: histogram of Total monthly rent paid (x-axis, 0 to 8000) against Percent (y-axis, 0 to 60)]

How would you describe the shape of this graph? What does the graph tell us about rent paid in
the sample?
The graph is heavily skewed to the right – most people pay much less than R2 000 per
month, but there are some who pay almost R8 000. These very large values might be
outliers.
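The menu steps in step 7 have a command-line equivalent. A minimal sketch, assuming the default number of bins is acceptable (the bin settings used for the graph above are not stated in this exercise):

```stata
* Histogram of monthly rent with the y-axis scaled in percent
histogram valrent, percent
```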

Summary Statistics for a Continuous Variable:


There are several different ways of obtaining summary statistics such as the mean, median, range
and standard deviation.

8. One method is to simply type:


summarize valrent  (or sum valrent for short)
Note: STATA is a US program, so you must use US spelling for the commands!
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     valrent |      3351    536.8621    801.9689          0       8000
The mean rent paid is R536.86 per month, with a standard deviation of R801.97.
The fact that the standard deviation is greater than the mean indicates that there is
a lot of variation amongst the values.
• Using the guide that values more than 3 standard deviations away from the mean may be
outliers, are there any potential outliers in this variable?
The cut-off points are the mean plus and minus 3 standard deviations:
R536.86 + 3 × R801.97 = R2 943 and R536.86 – 3 × R801.97 = –R1 869 (the negative
value is irrelevant, because rent cannot be negative).
The maximum value of R8 000 is more than 9 standard deviations above the mean, and
can certainly be considered an outlier.
In fact, many of the figures above R3 000 might be considered as outliers (more than
3 std dev above mean). However, these are real values collected from real people,
and should not be ignored simply because they are very large.
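The three-standard-deviation check can also be done from the command line using the results that summarize stores. A minimal sketch, assuming the dataset is in memory; the scalar names cut_hi and cut_lo are illustrative and not part of the original exercise:

```stata
* Store the mean, sd and max of valrent in r() without printing the table
quietly summarize valrent

* Cut-offs at 3 standard deviations either side of the mean
scalar cut_hi = r(mean) + 3*r(sd)
scalar cut_lo = r(mean) - 3*r(sd)
display "Upper cut-off: " cut_hi
display "Lower cut-off: " cut_lo

* How many standard deviations the maximum lies above the mean
display (r(max) - r(mean)) / r(sd)

* Count non-missing rents above the upper cut-off
* (STATA treats missing values as larger than any number, so exclude them)
count if valrent > cut_hi & !missing(valrent)
```

The cut-offs are saved as scalars before count is run because count is itself an r-class command and replaces the stored results.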

9. The sum command doesn’t give a range or median. The tabstat command is useful as you
can specify exactly what descriptive statistics you want STATA to calculate:
tabstat valrent, statistics(n mean median range sd cv) 
Note: p50 is the median; sd is the standard deviation; cv is the coefficient of variation
    variable |         N      mean       p50     range        sd        cv
-------------+------------------------------------------------------------
     valrent |      3351  536.8621       200      8000  801.9689  1.493808
--------------------------------------------------------------------------

• Compare the mean and the median – what can you conclude? (Hint: look at your
histogram from above).
p50 is the median i.e. 50% of the data values are at or below R200. Thus the
mean is more than 2.5 times the median (536.86 / 200 ≈ 2.68). This indicates a
distribution that is heavily skewed to the right, as was indicated by the
histogram.
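As an alternative to tabstat, the detail option of summarize also reports the median along with other percentiles and a skewness measure. A minimal sketch, assuming the dataset is in memory:

```stata
* detail adds percentiles (the 50% row is the median), skewness and kurtosis
summarize valrent, detail

* Ratio of mean to median, using the results stored by summarize, detail
display r(mean) / r(p50)
```

A positive skewness value in this output is consistent with the right-skewed shape seen in the histogram.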

Summary Statistics for a Categorical Variable:


10. Look at the variable names and descriptions (scroll up in the Results window to see the output
from your describe command, or type the command again, or drag the right margin of the
Variables window out until you can read the descriptions):
Which variable/s are categorical?
Are there any for which you can’t tell simply by looking at their descriptions?
The variables gender, race, maristat and province must be categorical. We know
that valrent is continuous, because we used it above, but it is unclear whether
tothhexp and extrans are categorical or continuous – it depends on whether people
were asked to state their expenditure as an exact figure (continuous) or were asked
what category their expenditure fell into (categorical).

11. One way of seeing whether a variable is continuous or categorical is to use the codebook
command as shown below. It displays a large quantity of information about the variable
specified:
Try it first on a variable you know to be continuous:

codebook valrent 
--------------------------------------------------------------------------------
valrent Total monthly rent paid
--------------------------------------------------------------------------------

type: numeric (long)

range: [0,8000] units: 1


unique values: 291 missing .: 18816/22167

mean: 536.862
std. dev: 801.969

percentiles:        10%       25%       50%       75%       90%
                     50        90       200       600      1600

Now try it with a variable you know to be categorical, for example:


codebook maristat 
--------------------------------------------------------------------------------
maristat Marital status
--------------------------------------------------------------------------------

type: numeric (byte)


label: maristat

range: [1,5] units: 1


unique values: 5 missing .: 0/22167

tabulation:  Freq.   Numeric  Label
              9010         1  Married
              2247         2  Living together like husband and wife
              3988         3  Widow
              1141         4  Divorced/Separated
              5781         5  Never married

What different sorts of information does STATA give, depending on the type of variable?
For both types of variable, STATA gives the range, number of unique (i.e. different)
values, units and number of missing values.
• For a continuous variable (which takes on lots of different values), STATA also
calculates the mean and standard deviation, and the percentiles.
• For a categorical variable (which takes on only a few values which correspond to
its categories), STATA also gives the frequency of each category, and the
category labels.

The variable for total household expenditure could be continuous, or recorded in categories.
Use codebook to determine its type:
codebook tothhexp
--------------------------------------------------------------------------------
tothhexp Total household expenditure in last month
--------------------------------------------------------------------------------

type: numeric (byte)


label: tothhexp

range: [1,10] units: 1


unique values: 10 missing .: 0/22167

examples: 1 R0-R399
2 R400-R799
3 R800-R1199
5 R1800-RR2499

This is a categorical variable. It has ten categories (we know this because the
variable takes on 10 unique values), but here the codebook command lists only a few
example categories rather than all ten.
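To see all ten categories rather than a few examples, the variable can be tabulated or its value label listed. A minimal sketch, assuming the value label is named tothhexp as shown in the describe output; the tabulate(10) option raises the number of unique values up to which codebook treats a variable as categorical (the default is 9):

```stata
* Full frequency table, one row per category
tabulate tothhexp

* Every numeric code and its label for the value label attached to tothhexp
label list tothhexp

* Ask codebook to tabulate variables with up to 10 unique values
codebook tothhexp, tabulate(10)
```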

12. Construct a frequency distribution for the categorical variable maristat:


tab maristat 
Marital status | Freq. Percent Cum.
--------------------------------------+-----------------------------------
Married | 9,010 40.65 40.65
Living together like husband and wife | 2,247 10.14 50.78
Widow | 3,988 17.99 68.77
Divorced/Separated | 1,141 5.15 73.92
Never married | 5,781 26.08 100.00
--------------------------------------+-----------------------------------
Total | 22,167 100.00

• What is the mode for marital status in this sample?
The largest group (40.65% of the sample) is married, therefore married is the
mode (also referred to as the ‘modal category’).
• What does the final cumulative column do?
This column adds up the percentage figures for each successive category. Thus
the figure of 50.78 indicates that 50.78% of people in this sample are either
married or living together with a partner.
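When a variable has many categories, the mode is easier to spot if the table is ordered by frequency. A minimal sketch, assuming the dataset is in memory:

```stata
* sort lists the categories in descending order of frequency,
* so the modal category appears in the first row
tabulate maristat, sort
```

Note that with sort the cumulative column accumulates in the sorted order, so its figures differ from those in the unsorted table above.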

Graphical Representation of a Categorical Variable:


13. To construct a bar graph for a categorical variable, again use the drop-down menu for
‘Graphics’, then ‘Easy graphs’ then ‘Histogram’ (Note: Do not select ‘Bar chart’. In STATA
terminology, the type of graph we want to construct is referred to as a discrete histogram).
[Figure: discrete histogram of Marital status (x-axis, numeric codes 0 to 5) against Percent (y-axis, 0 to 40)]

Is it easy to interpret this graph? For example, can you tell what proportion of the sample are
widows?
This graph is difficult to interpret because the categories on the x-axis are not
labelled.
Note that STATA labels the x-axis with the code it uses to store the categories. You can get
STATA to label the bars with the names of the categories, but it’s complex.
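One possible way to get the category names onto the x-axis is to ask the histogram command to use the value labels for its axis ticks. A minimal sketch, assuming the codes run from 1 to 5 as in the codebook output; the 45-degree angle is an illustrative choice to make the long labels fit:

```stata
* Discrete histogram in percent, with x-axis ticks 1 to 5 shown as
* the value labels of maristat and angled so the long labels fit
histogram maristat, discrete percent xlabel(1(1)5, valuelabel angle(45))
```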
14. You can also represent categorical data using a pie chart. Again, use the drop-down menu for
‘Graphics’, then ‘Pie chart (by category)’.

[Figure: pie chart of Marital status. Married 40.65%; Living together like husband and wife 10.14%; Widow 17.99%; Divorced/Separated 5.147%; Never married 26.08%]

Because the pie chart function automatically adds a legend for the graph, it’s easy to see which
category is which.
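The pie chart menu steps also have a command-line equivalent. A minimal sketch, assuming the dataset is in memory; labelling every slice with its percentage is an illustrative choice that matches the figure above:

```stata
* One slice per category of maristat, each labelled with its percentage
graph pie, over(maristat) plabel(_all percent)
```

Replacing maristat with another categorical variable, such as tothhexp, gives the chart asked for in the next step.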

15. Create a pie chart for any other categorical variable. Write a short interpretation.
For example, using tothhexp:

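[Figure: pie chart of Total household expenditure in last month. Slice percentages shown: 32.03%, 20.91%, 15.51%, 9.947%, 8.562%, 6.113%, 4.61%, 1.453%, .6541%, .212%. Legend: R0-R399; R400-R799; R800-R1199; R1200-R1799; R1800-RR2499; R2500-R4999; R5000-R9999; >R10000; Don't know; Refuse]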

Roughly a third of the sample (32%) report total household expenditure of between
R400 and R799 per month. A further large share (about 21%) report even less, below
R400 per month. Interestingly, 6% of those surveyed refused to answer this question.
16. Take a look at what information STATA has stored in your log file.
You’ll notice that STATA has recorded all of the output from your Results window. When you
finish your prac session today, your log file will automatically close. Then at a later stage, you
can open your log file using Word, and edit it like any normal document.
However, your log file does not save your graphs.
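Because the log does not store graphs, each graph has to be saved separately, and the log can be closed explicitly rather than left to close on exit. A minimal sketch, assuming a graph is redrawn first; prac2_hist.png is a hypothetical file name, and the image formats available may depend on the operating system:

```stata
* Redraw a graph, then export the graph currently displayed to an image file
* (prac2_hist.png is a hypothetical file name)
histogram valrent, percent
graph export "prac2_hist.png", replace

* Close the log explicitly instead of relying on STATA closing it on exit
log close
```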