INTRODUCTION TO STATA

COMPILED BY SOLOMON KEBEDE


1. WHAT IS STATA?
• Stata is a multi-purpose statistical package that helps you explore, summarize and
analyze datasets.
• A dataset is a collection of several pieces of information called variables (usually
arranged in columns).
• Stata lets you carry out complex manipulations of large datasets using
simple but powerful commands.
• Stata comes in three classes of edition, which differ in the maximum size of
dataset they can handle.
Characteristic                                Stata/SE and Stata/MP   Intercooled Stata   Small Stata
Max. no. of variables                         32,766                  2,047               99
Max. no. of observations                      2,147,483,647           2,147,483,647       1,000
Max. no. of characters for string variables   244                     80                  80
Order of the matrix                           11,000 x 11,000         800 x 800           40 x 40
(Stata/SE = Special Edition; Stata/MP = multiprocessor)
• Stata is user-friendly and flexible.
• It is easy to learn.
• It is easy to incorporate changes into Stata.
• Stata is both a command-driven and a menu-driven statistical
package.
• Typing commands is the more efficient way to work, because it makes
the move to writing programs
much faster.
• Although efficient, typing commands can be
challenging at first because Stata
is case sensitive.
• Stata does not tolerate spelling, capitalization or
spacing errors in programs (Stata
commands).
2. The Stata Interface
The Stata interface consists of the Stata windows, the
menu bar and the button (tool) bar.
A. Stata Windows
The Stata windows give you all the key information about:
– The Stata product that you are using
– Recent commands executed, together with their outputs
They also let you get help on how to use Stata and notify you of
any errors you make while executing programs
(Stata commands).
Some of the Stata windows open automatically when you
launch Stata, while others can be opened from the Window
menu or with the buttons on the toolbar.
• The Stata windows are:
1. Results: shows information about the Stata product
you are using, recent commands
together with their outputs (except graphs), and
error messages for any mistakes made while
executing Stata commands (written in red)
2. Command: to enter a command into Stata for
execution
3. Variables: to see the names and descriptions of
the variables in the dataset that you are currently
using
4. Review: to see the commands that you recently
executed
5. Data Editor: to edit the data in the dataset that you are using
in a spreadsheet-like view,
and to enter raw data into Stata directly
6. Data Browser: to see a spreadsheet-like view
of the data in the dataset that you are using (you
can’t edit data in it)
7. Do-file Editor: to construct a file of Stata
commands, to write and edit programs, to
run one command or several commands at once,
and to write notes describing your
work so that others can follow it
8. Viewer: to get help on how to use Stata
9. Graph: to see the graph you drew
10. Variables Manager: to edit the names and
descriptions of the variables in a dataset
B. Menu bar
• Stata’s menu bar extends from left to right across the
top of the outer window.
• It contains 8 drop-down menus from which
items for different tasks in
Stata can be selected directly.
• These drop-down menus are generally used as a
substitute for typing Stata
commands.
• The 8 drop-down menus, from
left to right on the menu bar, are:
• File
I. Open: to open a dataset from a Stata data file
II. Save: to save the changes you made to the dataset
III. Save As: to save the whole dataset
IV. Do: to open the Do-file Editor
V. Filename: to copy the file name of a dataset to
the command line
VI. Exit: to quit Stata
VII. Print: to print a log or graph
• Edit
I. Copy/Paste: to copy text among the Command,
Results and log windows
II. Copy Table: to copy a table from the Results window to
another file
III. Table Copy Options: to choose what to do with table lines in
Copy Table
IV. Preferences: gives various options for setting
preferences
• Data: for data and variable
management tasks
• Graphics: to draw different kinds of ordinary
and statistical graphs
• Statistics: for statistical summaries
and analysis
• User: to access new Stata commands from
the internet
• Window: to open a Stata window
• Help: to search for a Stata command or to get help
on how to use Stata
C. Button (tool) bar
• The button bar contains buttons that
carry out routine tasks in Stata directly when
you click on them.
• The buttons on the button bar, from left to
right, their functions and their equivalent
commands are summarized in the table
below.
Button               Function                                                 Equivalent command
Open                 To open a Stata data file                                use
Save                 To save changes made to the data                         save
Print                To print a graph or log file                             print
Log                  To open, suspend or close a log file                     log
New Viewer           To open the Viewer window                                view
Graph                To bring the Graph window to the front                   graph
New Do-file Editor   To open a new Do-file Editor                             doedit
Data Editor          To open the Data Editor window                           edit
Data Browser         To open the Data Browser window                          browse
Variables Manager    To open the Variables Manager window                     varmanage
Clear                To clear the --more-- condition in the Results window    Space bar
Break                To stop the current command or do-file                   Ctrl + Break

3. File management in Stata
• A folder that contains files is
called a directory.
• The folder in which the active Stata session operates is
called the working directory.
• Using Stata commands you can easily see the location of the working
directory, create your own folders (directories)
in which to save files, remove
directories or files from any directory, and
change the working directory:
– pwd: to display the working directory (folder)
– dir: to list the directories and files inside the working
directory
– cd: to change the working directory
– mkdir: to create a new directory (folder) inside
the working directory or in any other directory
– rmdir: to delete a directory (the folder must be
empty, otherwise Stata can’t remove it)
• Specify the location of the directory after any
of these commands if the task you want to
carry out is outside the working directory.
• Otherwise Stata assumes that the task
is in the working directory.
• Use double quotes (" ") to enclose the
directory name if the directory name contains
spaces.
• E.g. . mkdir "F:/mube computer/StataTR/mube"
. mkdir "section A"
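The directory commands above can be tried together in a short sequence. This is only a sketch: it assumes you have write permission in the current working directory, and the folder name "section A" is just the example name from the text.

```stata
* Sketch: assumes write permission in the current working directory
pwd                          // show the working directory
capture mkdir "section A"    // capture ignores the error if the folder already exists
cd "section A"               // quotes are needed because the name contains a space
pwd                          // the working directory has changed
cd ..                        // move back up one level
dir                          // list files and folders in the working directory
rmdir "section A"            // works only because the folder is empty
```

The quotes around the folder name matter only because of the space; a name such as sectionA could be typed without them.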
3.1 Creating do-files in Stata
• The Do-file Editor allows you to construct a Stata
file in order to:
– Store permanently all the Stata commands you have used
throughout your research project
– Write new Stata commands or edit the existing
commands in the file
– Run one Stata command or a number of Stata
commands at once
– Write explanations of your work to let others
know how you got your results
• To create a do-file, first open a new Do-file Editor.
• To open the Do-file Editor, you can click on
Window/Do-file Editor, click on the envelope icon
on the toolbar, or use the command doedit.
• To run commands from a do-file:
– click on the Do button, or
– click on Tools, then Do
• If you want to run one or a few commands
rather than all the commands in the do-file,
select the commands you want to run and then
click on the Do button.
• If you have notes in your do-file and you want
Stata not to treat them as commands, mark
them with:
– * at the start of the line if the note is on a single line, and
– /* */ around the note if it runs over more than a single line
• Otherwise Stata treats your notes as
commands and tries to run them when you run
the commands in your do-file.
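A minimal do-file showing both kinds of note might look like the sketch below. It uses the auto dataset that ships with Stata, which is not part of these slides, and the trailing // comments are an additional comment style that the text does not cover.

```stata
* A single-line note: Stata ignores a line that begins with an asterisk
/* A longer note that runs
   over several lines goes
   between these delimiters */
sysuse auto, clear          // load an example dataset shipped with Stata
summarize price             // this line is run as a command
```

Selecting only the last line in the Do-file Editor and clicking the Do button would run summarize on its own.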
3.2 Creating and saving Stata
results in log files
• The Stata Results window does not keep all the
output you generate.
• It stores only about 300-600 lines; when it
reaches its limit, Stata makes room for newer output
by deleting the oldest.
• Thus, you need a file in which you can
permanently store all the output you
generate.
• Log files are Stata files in which you can permanently save all
the results you generate.
• log using is the command for saving all
the results you generate, together with the
commands used to generate them.
• The syntax is:
log using filename [, append | replace]
– append: tells Stata to add the output to the end of an existing
file
– replace: tells Stata to overwrite the existing
file with the new output
• Example:
log using mube (in the working directory)
log using C:\Users\mube\Desktop\mube (on the
desktop)
• log off: to suspend the logging of output temporarily
• log on: to resume logging
• log close: to stop logging and close the file
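The logging commands above fit together as in the following sketch. The file name mylog is hypothetical, and the text option (which writes a plain-text .log file instead of the default format) is an addition not mentioned in the slides.

```stata
* Sketch only: "mylog" is a hypothetical file name
capture log close                  // close any log that is already open
log using mylog, replace text      // start a plain-text log, overwriting mylog.log
sysuse auto, clear                 // example dataset shipped with Stata
summarize price mpg                // recorded in the log
log off                            // suspend logging
describe                           // not recorded
log on                             // resume logging
tabulate foreign                   // recorded again
log close                          // stop logging and close the file
```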
4. Entering raw data into Stata
• You can enter raw data into Stata to create
your own data file.
• There are many possible ways of entering
raw data into Stata.
• Here we will see only two of them:
– Typing directly into the Data Editor window
– Copying and pasting from Excel
4.1 Setting the memory of Stata
• You can adjust the amount of memory available to
Stata according to the size of the dataset
you want to explore.
• You can also first check how much memory is
allocated to hold your dataset.
• To do this, use the following command:
– memory
• To adjust the amount of memory available to Stata:
– set memory 25m
– set memory 250m, permanently
5. Variable management
• You can easily accomplish several variable
management tasks in Stata, such as:
– Changing the name of one or more variables
– Labelling variables
– Changing the order of the variables in a dataset
– Assigning value labels to variables in the dataset
– Deleting variables from a dataset
• These can be done in two ways:
– Manually, using the Variables Manager window
– Using Stata commands
• The following Stata commands are used for
variable management tasks on a
dataset:
rename: to change the name of a variable
• rename oldvar newvar
label var: to label a variable in a dataset
• label var varname "label"
order: to adjust the order of the variables in the
dataset
• order varnames
drop or keep: to delete variables from a
dataset
• drop varnames: deletes the listed variables from the
dataset
• keep varnames: keeps the listed variables and deletes the
remaining variables from the dataset
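The commands above can be tried on a small made-up dataset. In this sketch every variable name and value is invented for illustration and has nothing to do with the course data.

```stata
* Toy data; every variable name here is made up for illustration
clear
input id inc a1
1 500 34
2 820 51
3 640 29
end
rename inc income                       // change a variable name
rename a1 age
label var income "Monthly income"       // attach a descriptive label
label var age "Age in years"
order id age income                     // put the variables in this order
drop age                                // delete one variable ...
keep id income                          // ... or name the ones to keep
describe
```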
6. Exploring data files
• A dataset is a record of information on one or
more variables.
• The number of observations recorded for a
variable or a set of variables in a
dataset is called the size of the observations.
• The number of non-missing observations is not necessarily
equal for all variables in a dataset, because of
missing values.
• A file in which a dataset is saved is called a data
file.
• Datasets can be saved in different
formats.
• A Stata data file holds a dataset saved in Stata
format (.dta format).
• But datasets can also be saved in Excel format
(.xlsx format) or as text (ASCII).
• In this section we will see how to explore data
files saved in Stata format by using Stata
commands (programs).
• The general syntax of a Stata command
is as follows:
[by varlist1:] command [varlist2] [if exp] [in
range] [weight] [, options]
• Example:
– bysort sex: reg gpa po
Stata produces separate regression results for female
and male students
– bysort sex: reg gpa po if gpa >=3
Stata produces separate regression results for female
and male students whose GPA >=3
• The relational and logical operators used in Stata are listed
in the table below.
~    not
==   equal
~=   not equal
!=   not equal
>    greater than
>=   greater than or equal
<    less than
<=   less than or equal
&    and
|    or
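To see how the by prefix, the if and in qualifiers and the operators combine, here is a small sketch on invented data. The variables sex, gpa and hours and all their values are assumptions made up for this illustration.

```stata
* Toy data: sex (0 = female, 1 = male), gpa and hours are invented values
clear
input sex gpa hours
0 3.6 12
0 2.9 8
0 3.2 10
1 3.1 9
1 2.4 5
1 3.8 14
end
bysort sex: summarize gpa                     // by prefix: one summary per sex
summarize gpa if gpa >= 3                     // if qualifier with a relational operator
count if sex == 1 & gpa >= 3                  // & means and
count if sex == 0 | hours > 10                // | means or
count if sex != 1 in 1/3                      // != combined with an in range
```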
Examining dataset
• clear
– The clear command removes the data, variables and
labels from memory so that you are ready to use a new
data file.
– You can clear memory with the clear command or
with the clear option of the use command.
– This command does not delete any data saved on
the hard drive.
• use
– This command opens an existing Stata data file. The syntax is:
• use filename [, clear]   opens a new file
• use [varlist] [if exp] [in range] using filename [, clear]   opens
selected parts of a file
– If there is no extension, Stata assumes it is .dta.
– If there is no path, Stata assumes the file is in the current folder.
– You can use a path name such as: use C:\...\ERHScons1999
– If the path name has spaces, you must use double quotes:
. use "d:\my data\ERHScons1999"
– You can open selected variables of a file using a variable list.
– You can open selected records of a file using if or in.
Here are some examples of the use command:
• use ERHScons1999   opens the file ERHScons1999.dta for analysis
• use if q1a == 1 using ERHScons1999   opens data from region 1
• use in 5/25 using ERHScons1999   opens records 5 through 25 of the file
• use hhid hhsize cons using ERHScons1999   opens 3 variables from the
ERHScons1999 file
• use C:\training\ERHScons1999   opens the file ERHScons1999.dta in the
specified folder
• use "C:\data files\ERHScons1999"   use quotation marks if there are
spaces
• use ERHScons1999, clear   clears memory before opening the new file
• save
– The save command saves the dataset as a .dta file under the
name you choose. Editing the dataset changes the data in the
computer's memory; it does not change the data stored on
the computer's disk until you save.
• save "C:\...\consumption.dta", replace
– The replace option allows you to save a changed file to the disk,
replacing the original file. Stata guards against accidentally
overwriting your data file: you need the replace option to tell
Stata that you know that the file exists and you want to replace it.
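The use and save commands can be tried without touching the course files. The sketch below assumes the auto dataset that ships with Stata and writes to a temporary file (the macro name cars is hypothetical), so nothing on disk is overwritten. Run it as a whole from a do-file, because the temporary name is a local macro.

```stata
* Sketch: auto dataset plus a temporary file, so no real file is overwritten
sysuse auto, clear
tempfile cars                               // hypothetical temporary file name
save `cars'                                 // write the data in memory to disk
use make price foreign using `cars', clear  // open three variables only
use if foreign == 1 using `cars', clear     // open selected records with if
count                                       // number of foreign cars
use in 5/25 using `cars', clear             // open records 5 through 25
count                                       // 21 records
save `cars', replace                        // replace is required: the file exists
```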
• edit
– This command opens the Data
Editor window, which lets us view all the observations
in memory.
– You can change the data in the Data Editor window,
but editing data this way is not
recommended.
– It is better to correct errors in the data using a do-file
that can be saved (we will see do-files
later).
• browse
– This command opens the Data Browser window, which is exactly like the Data Editor window
except that you can’t change the data.
• describe
– This command provides a brief description of the data
file. You can use “des” or “d” and Stata will
understand. The output includes:
• the number of variables
• the number of observations (records)
• the size of the file
• the list of variables and their characteristics
Example 1: Using “describe” to show information about a data file
. des

Contains data from C:\training\ERHSCONS1999.dta
  obs:         1,452
 vars:            15                          24 Feb 2007 07:07
 size:       113,256 (98.9% of memory free)   (_dta has notes)
-----------------------------------------------------------------------------
                storage  display    value
variable name   type     format     label      variable label
-----------------------------------------------------------------------------
q1a             float    %9.0g      reg        Region
q1b             double   %15.0g     w          Wereda
q1c             double   %17.0g     pa         Peseant association
q1d             double   %12.0g                Household id
sexh            byte     %8.0g      sexhh      Sex of household head
ageh            float    %9.0g      p1s1q4     Age of household head
cons            float    %9.0g                 consumption per month
food            float    %9.0g                 food cons per month
hhsize          byte     %8.0g                 household size
aeu             float    %9.0g                 adult equivalent units in
                                                 household
fpi             float    %9.0g                 food price index
rconspc         float    %9.0g                 real consumption per capita
                                                 1994 prices
rconsae         float    %9.0g                 real consumption per adult 1994
                                                 prices
poor            double   %8.2f
hhid            double   %12.0f                selected household unique id
-----------------------------------------------------------------------------
Sorted by: hhid
• list
– This command lists the values of variables in the dataset.
The syntax is:
• list [varlist] [if exp] [in range]
• Examples:
– . list   lists the entire dataset
– . list in 1/10   lists observations 1 through 10
– . list hhsize q1a food   lists selected variables
– . list hhsize sexh in 1/20   lists observations 1-20 for selected variables
– . list if q1a < 6   lists cases in regions coded 1 through 5
• if
– This qualifier is used to select certain records when carrying out
a command:
• command if exp
Examples:
– . list hhid q1a food if food >1200   lists data if food is above 1200
– . tab q1a if cons>1000 & cons<2000   frequency table of region if
consumption is in range
– . summarize food if q1a==3 | q1a==4   statistics on food consumption for
regions 3 and 4
– . browse hhid q1a food if food >=1200   browse data if food consumption is
1200 or above
• Note that “if” expressions always test equality with ==, not a single =. Also note that |
indicates “or” while & indicates “and”.
• in
– We have also used in to select records by
observation number. The syntax is:
• command in range
For example:
• . list in 10   lists observation number 10
• . summarize in 10/20   summarizes observations
10-20
• . l in -10/-1   lists the last 10 observations
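The list command and the if and in qualifiers can be practised together. The sketch below assumes the auto dataset shipped with Stata rather than the course data, so the variable names (make, price, mpg, foreign, rep78) differ from those in the slides.

```stata
sysuse auto, clear
list make price in 1/5                         // first five observations
list make price if price > 12000               // records meeting a condition
summarize mpg if foreign == 1 | rep78 == 5     // either condition may hold
list make in -3/-1                             // last three observations
count if price > 5000 & price < 6000           // both conditions must hold
```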
• codebook
– The codebook command is a great tool for getting
a quick overview of the variables in the data file.
– It produces a kind of electronic codebook from the
data file, displaying information about the variables'
names, labels and values.
. codebook
sexh                                            Sex of household head
----------------------------------------------------------------------------
                  type:  numeric (byte)
                 label:  sexhh

                 range:  [0,1]                        units:  1
         unique values:  2                        missing .:  0/1452

            tabulation:  Freq.   Numeric  Label
                           400         0  Female
                          1052         1  Male
• inspect
– This is another useful command for getting a quick
overview of a data file.
– The inspect command displays information about the
values of variables and is useful for checking data
accuracy.
. inspect sexh
sexh:  Sex of household head          Number of Observations
----------------------------      ------------------------------
                                     Total   Integers   Nonintegers
|      #          Negative               -          -             -
|      #          Zero                 400        400             -
|      #          Positive            1052       1052             -
|      #                             -----      -----         -----
|  #   #          Total               1452       1452             -
|  #   #          Missing                -
+----------------------              -----
0                1                    1452
  (2 unique values)

sexh is labeled and all values are documented in the label.
• count
– The count command shows the number of
observations that satisfy an if condition. If no
condition is specified, count displays the number
of observations in the data.
• . count
1452
• . count if q1a==3
466
Descriptive Statistics
• tabulate, tab1, tab2
– These are three related commands that
produce frequency tables for discrete
variables.
– They can produce one-way frequency tables
(tables with the frequency of one variable)
or two-way frequency tables (tables with a
row variable and a column variable).
• tabulate or tab   produces a frequency table
for one or two variables
• tab1   produces a one-way
frequency table for each variable in
the variable list
• tab2   produces all possible two-variable
tables from the list of
variables
You can use several options with these commands:
• all   gives all the tests of association for two-way tables
• cell   gives the overall percentage for two-way tables
• column   gives column percentages for two-way tables
• row   gives row percentages for two-way tables
• nofreq   suppresses printing of the frequencies
• chi2   provides the chi-squared test for two-way tables
There are many other options, including other statistical tests. For more information,
type “help tabulate”.
Some examples of the tabulate commands are:
• . tabulate q1a   produces a table of frequencies by region
• . tabulate q1a sexh   produces a cross-tab of frequencies by
region and sex of head
• . tabulate q1a hhsize, row   produces a cross-tab by region and
hhsize with row percentages
• . tabulate sexh hhsize, cell nofreq   produces a cross-tab of overall
percentages by sex and hhsize
• . tab1 q1a q1b hhsize   produces three tables, a frequency table
for each variable
• . tab2 q1a poor sexh   produces three tables, a cross-tab of
each pair of variables
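The same commands can be tried on the auto dataset that ships with Stata. This is a sketch on that substitute dataset, so the variables (foreign, rep78, headroom) are not the ones used in the course file.

```stata
sysuse auto, clear
tabulate foreign                       // one-way frequency table
tabulate foreign rep78                 // two-way table of frequencies
tabulate foreign rep78, row nofreq     // row percentages only
tabulate foreign rep78, chi2           // adds the chi-squared test
tab1 foreign rep78                     // one one-way table per variable
tab2 foreign rep78 headroom            // every pairwise two-way table
```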
• summarize
– The summarize command produces statistics on continuous variables such as age, food, cons and hhsize.
The syntax looks like this:
summarize [varlist] [if exp] [in range] [, detail]
By default, it produces the following statistics:
• Number of observations
• Average (or mean)
• Standard deviation
• Minimum
• Maximum
If you specify “detail”, Stata gives you additional statistics, such as
• skewness,
• kurtosis,
• the four smallest values,
• the four largest values,
• various percentiles.
• Here are some examples:
• . summarize   gives statistics on
all variables
• . summarize hhsize food   gives statistics on
selected variables
• . summarize hhsize cons if q1a==3   gives statistics on
two variables for
one region
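A short sketch of summarize on the auto dataset shipped with Stata (an assumption; it is not the course data). It also shows that summarize leaves its results behind in r(), which later commands can use.

```stata
sysuse auto, clear
summarize price mpg                    // default statistics
summarize price if foreign == 1        // restricted to foreign cars
summarize price, detail                // adds percentiles, skewness, kurtosis
display "median price = " r(p50)       // results are left behind in r()
display "skewness     = " r(skewness)
```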
• by
– This prefix goes before a command and asks Stata
to repeat the command for each value of a variable.
The general syntax is:
by varlist: command
• Note: by requires the data to be sorted by varlist; bysort sorts the data
and runs the command in one step, so it is the form most commonly used.
An example of the by prefix is:
– bysort sexh: sum rconsae   for each sex of hh head, gives statistics on real
consumption per adult equivalent.
• help
– The help command gives you information about any
Stata command or topic:
help [command]
For example,
• . help tabulate   gives a description of the
tabulate command
• . help summarize   gives a description of the
summarize command
CREATING NEW VARIABLES
• We have seen how to explore the data using
existing variables so far.
• Now we will discuss how to create new
variables.
• When new variables are created, they are in
memory and they will appear in the Data
Browser, but they will not be saved on the
hard disk unless you use the save command.
• generate
– This command is used to create a new variable. It
is similar to “compute” in SPSS.
• The syntax is:
generate newvar = exp [if exp]
where “exp” is an expression like
“price*quant” or
“1000*kg”
• generate cannot be used to change the definition of an
existing variable.
• You can use “gen” or “g” as an abbreviation for
“generate”.
• If the expression is an equality or inequality,
the variable will take the value 0 if the
expression is false and 1 if it is true.
• If you use “if”, the new variable will have
missing values where the “if” condition is false.
• For example,
– generate age2 = age*age
• creates an age-squared variable
– gen yield = outputkg/area if area>0
• creates a new yield variable if area is positive
– gen price = value/quant if quant>0
• creates a new price variable if quant is positive
– gen highprice = (price>1000)
• creates a dummy variable equal to 1 for high prices
• replace
– This command is used to change the definition of
an existing variable.
• The syntax is the same:
replace oldvar = exp [if exp] [in range]
• For example,
– replace price = avgprice if price > 100000
replaces high values with an average price
– replace income = . if income<=0
replaces zero or negative income with a missing value
– replace age = 25 in 1007
replaces age with 25 in observation #1007
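The following sketch runs generate and replace on a few invented records (the variables value and quant and their values are made up). It also illustrates one point worth keeping in mind: Stata treats a missing value as larger than any number, so a dummy built from price>1000 would be 1 where price is missing unless that case is handled.

```stata
* Toy data; variable names and values are invented for illustration
clear
input value quant
1200 4
900 0
5000 2
300 3
end
generate price = value/quant if quant > 0   // missing where quant is 0
gen highprice = (price > 1000)              // 1 if true, 0 if false
* Caution: missing counts as larger than any number, so the observation
* with missing price gets highprice = 1 until the next line corrects it
replace highprice = . if missing(price)
replace price = 2000 if price > 2000 & !missing(price)   // cap high values
replace quant = 1 in 2                      // change one observation by number
list
```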
• tabulate … generate
– This command is useful for creating a set of
dummy variables (variables with a value of 0 or 1)
based on the values of an existing categorical
variable.
• The syntax is:
tabulate oldvariable, generate(newvariable)
• For example,
tab q1a, gen(region)
• This creates 6 new variables, one for each value of q1a:
region1 = 1 if q1a == 1 and 0 otherwise
region2 = 1 if q1a == 3 and 0 otherwise
……
region6 = 1 if q1a == 8 and 0 otherwise
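As a sketch of the same idea on the auto dataset shipped with Stata (not the course data), the variable rep78 takes the values 1 to 5, so tabulate with generate() creates five dummies. The stub name rep is an arbitrary choice.

```stata
sysuse auto, clear
tabulate rep78, generate(rep)     // rep78 takes the values 1 to 5
describe rep1-rep5                // five new 0/1 variables
list rep78 rep1-rep5 in 1/5       // each row has a single 1 among the dummies
```

Observations where rep78 is missing get missing values in all five dummies.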
• egen
– This is an extended version of
“generate” (extended generate) that creates a new
variable by aggregating the existing data.
• The syntax is:
egen newvar = fcn(arguments) [if exp] [in range] [, by(varlist)]
• Some of the available functions are:
count()    number of non-missing values
diff()     compares variables, 1 if different, 0 otherwise
fill()     fills with a pattern
group()    creates a group id from a list of variables
iqr()      interquartile range
ma()       moving average
max()      maximum value
mean()     mean
median()   median
min()      minimum value
pctile()   percentile
rank()     rank
rmean()    mean across variables (named rowmean() in Stata 9 and later)
sd()       standard deviation
std()      standardizes variables
sum()      sums (named total() in Stata 9 and later)
• For example,
• egen avg = mean(cons)   creates a
variable of average consumption over the entire
sample
• egen avg2 = median(cons), by(sex)   creates a
variable of median consumption for each
sex
• egen regprod = sum(cons), by(region)   creates a
variable of total consumption for each
region
• Exercise:
• We want to know which households have
expenditure (cons) above the village average.
• I.e. create a dummy (1 for those who
consume above the average of their village/peasant
association and 0 otherwise).
• egen avecon = mean(cons), by(q1c)
• gen highavecon = (cons > avecon)
• list hhid q1c cons avecon highavecon in
650/675
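The logic of the exercise can be checked by hand on a tiny invented dataset. In this sketch village and cons are hypothetical stand-ins for q1c and cons, and total() is the current name of the egen sum() function.

```stata
* Toy data: village and cons are hypothetical stand-ins for q1c and cons
clear
input village cons
1 100
1 300
1 200
2 50
2 150
end
egen avecon = mean(cons), by(village)        // village means are 200 and 100
gen highavecon = (cons > avecon) if !missing(cons)
egen villtotal = total(cons), by(village)    // village totals are 600 and 300
list, sepby(village)
```

Only the households with cons of 300 and 150 end up with highavecon equal to 1.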
• Arithmetic
+    addition
-    subtraction
*    multiplication
/    division
^    power
• Relational
>    greater than
<    less than
>=   greater than or equal
<=   less than or equal
==   equal
~=   not equal
!=   not equal
• Logical
~    not
|    or
&    and
• Here are some examples to illustrate the use
of these operators. Suppose you want to
create a dummy variable indicating households in the
Amhara region. One way is to write:
– generate AmD = 0
– replace AmD = 1 if q1a==3
– Or you can get exactly the same result with just
one command:
– generate AmD = (q1a==3)
• As another example, suppose a household must have a
female head and be in Dodota wereda to be
selected.
– gen DDfemale = 0
– replace DDfemale = 1 if q1b==9 & sexh==0
– An easier way to do this would be:
– gen DDfemale = (q1b==9 & sexh==0)
abs(x)         computes the absolute value of x
exp(x)         calculates e to the x power
ln(x)          computes the natural logarithm of x
log(x)         is a synonym for ln(x), the natural logarithm
log10(x)       computes the log base 10 of x
sqrt(x)        computes the square root of x
invnorm(p)     provides the inverse cumulative normal; invnorm(norm(z)) = z
normden(z)     provides the standard normal density
normden(z,s)   provides the normal density; normden(z,s) = normden(z)/s if s>0 and s is not
               missing, otherwise the result is missing
norm(z)        provides the cumulative standard normal
group(x)       creates a categorical variable that divides the data into x subsamples that are
               as nearly equal-sized as possible, numbering the first group 1, the second
               group 2, etc. It uses the current order of the data.
int(x)         gives the integer obtained by truncating x
round(x,y)     gives x rounded into units of y
• recode
– This command changes the values of a categorical
variable according to the rules specified.
• The syntax is:
recode varname old=new [old=new ...] [if exp] [in
range]
• Here are some examples:
• recode x 1=2   changes all values of x=1 to x=2
• recode x 1=2 3=4   changes 1 to 2 and 3 to 4
• recode x 1=2 2=1   exchanges the values 1 and 2 in x
• recode x 1=2 *=3   changes 1 in x to 2 and all other
values to 3
• recode x 1/5=2   changes 1 through 5 in x to 2
• recode x 1 3 4 5 = 6   changes 1, 3, 4 and 5 to 6
• recode x .=9   changes missing to 9
• recode x 9=.   changes 9 to missing
• Notice that you can use some special symbols
in the rules:
*     means all other values
.     means missing values
x/y   means all values from x to y
x y   means x and y
• For example, recode the region values 8 and 9 to 7.
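One way to approach the exercise is sketched below on invented data, where region is a hypothetical stand-in for the region code. The rules are written in parentheses and the generate() option stores the result in a new variable so that the original is kept; both are conveniences of the recode command that the slides do not show.

```stata
* Toy data: region is a hypothetical stand-in for a region code
clear
input region
1
3
8
9
.
end
recode region (8 9 = 7), generate(region2)              // 8 and 9 become 7
recode region (. = 99) (1/3 = 1) (* = 2), gen(region3)  // first matching rule wins
list
```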


• xtile
– This command creates a new variable that
indicates which category a record falls into when
the sample is sorted by an existing variable and
divided into “n” groups of equal size.
• Example: xtile can be used to create a variable
that indicates which income quintile a
household belongs to, which decile in terms of
farm size, or which tercile in terms of coffee
production.
• The syntax is:
xtile newvar = variable [if exp] [in range] , nq(#)
– where newvar is the new categorical variable
created; variable is the existing variable used to
create the quantiles (e.g. income, farm size); # is the
number of categories (e.g. 5 for quintiles,
3 for terciles)
• For example,
xtile consquint = cons, nq(5)
xtile rconsquint = rconsae, nq(10)
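A quick way to check what xtile has done is to tabulate the new variable and look at the range of the original variable within each group. The sketch assumes the auto dataset shipped with Stata, and the name pricequart is made up.

```stata
sysuse auto, clear
xtile pricequart = price, nq(4)              // quartiles of price
tabulate pricequart                          // group sizes are roughly equal
tabstat price, by(pricequart) stats(min max) // price range within each group
```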
MODIFYING VARIABLES
• We begin with an explanation of how to label
data in Stata, and then see how to format
variables.
– rename variable
– label variable
– label define
– label values
– format variable
• rename variables
– This command is used to give a variable
a new name. The command
is
rename old_variable new_variable
• Example: Generate dummies for the region
variable and rename the new dummy
variables accordingly.
• label define
– This command gives a name to a set of value
labels. For example, instead of numbering the
regions, we can assign a label to each region.
• The syntax is:
label define lblname # "label" [# "label" ...]
[, add modify]
• where
– lblname is the name given to the set of value labels
– # are the value numbers
– “label” are the value labels
– add means that you want to add these value labels to the
existing set
– modify means that you want to change these values in the
existing set
• Note that:
• You can use the abbreviation “label def”
• The double quotation marks are only necessary if there are spaces in the
labels
• Stata will not let you define an existing label unless you say “modify” or “add”
• label values
– This command attaches a named set of value labels
to a categorical variable.
• The syntax is:
label values varname [lblname] [, nofix]
• For example,
label define reg 1 "Tigray" 3 "Amhara" 4 "Oromia" 7 "SNNP", modify
label values q1a reg
• Some additional commands that may be
useful in labelling:
– label dir   to request a list of existing label names
– label list   to request a list of all the existing value
labels
– label drop   to delete one or more labels
– label save using   to save label definitions as a do-file
– label data   to give a label to a data file
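The two labelling steps (define the set, then attach it) can be seen end to end in this sketch. The region codes and names follow the label definition in the text, but the five records themselves are invented.

```stata
* Toy data: region codes follow the label definition in the text
clear
input q1a
1
3
4
7
3
end
label define reg 1 "Tigray" 3 "Amhara" 4 "Oromia" 7 "SNNP"
label values q1a reg             // attach the value labels to q1a
label variable q1a "Region"
tabulate q1a                     // the table now shows names, not codes
label list reg                   // show the contents of the label set
```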
• format
– allows you to specify the display format for
variables. The internal precision of the variables is
unaffected.
• The syntax for the format command is
format varlist %fmt
%fmt        description                              example
-----------------------------------------------------------------------------
Right-justified formats
%#.#g       general numeric format                   %9.0g
%#.#f       fixed numeric format                     %9.2f
%#.#e       exponential numeric format               %10.7e
%d          default numeric elapsed date format      %d
%d...       user-specified elapsed date format       %dM/D/Y
%#s         string format                            %15s

Right-justified, comma formats
%#.#gc      general numeric format                   %9.0gc
%#.#fc      fixed numeric format                     %9.2fc

Leading-zero formats
%0#.#f      fixed numeric format                     %09.2f
%0#s        string format                            %015s

Left-justified formats
%-#.#g      general numeric format                   %-9.0g
%-#.#f      fixed numeric format                     %-9.2f
%-#.#e      exponential numeric format               %-10.7e
%-d         default numeric elapsed date format      %-d
%-d...      user-specified elapsed date format       %-dM/D/Y
%-#s        string format                            %-15s

Left-justified, comma formats
%-#.#gc     general numeric format                   %-9.0gc
%-#.#fc     fixed numeric format                     %-9.2fc

Centered formats
%~#s        string format (special)                  %~15s
-----------------------------------------------------------------------------
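A few of the formats in the table can be tried on the auto dataset shipped with Stata (an assumption; it is not the course data). Listing the same observations before and after shows that only the display changes.

```stata
sysuse auto, clear
list make price gear_ratio in 1/3
format price %9.2fc               // commas and two decimals
format gear_ratio %4.1f           // one decimal; stored values are unchanged
format make %-18s                 // left-justified string
list make price gear_ratio in 1/3
```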
ADVANCED DESCRIPTIVE STATISTICS
• tabulate … summarize
– This command creates one- and two-way tables
that summarize continuous variables. With the
“summarize” option, the cells of the table hold means and other
statistics of a continuous variable.
• The syntax is:
tabulate varname1 [varname2] [if exp] [in range],
summarize(varname3) [options]
• You can specify which statistics to show with the options “means”,
“standard” and “freq”
• You can use the abbreviation “tab … sum( )”
• Some examples:
• tab q1a, sum(cons)   gives the mean, std
deviation, and frequency of
consumption for each region
• tab q1b, sum(cons) mean   gives the mean
consumption for each wereda
• tab q1a sexh, sum(food)   gives the mean, std
deviation, and frequency in each
cell of hh head sex per region
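The same pattern on the auto dataset shipped with Stata, as a sketch (the variables foreign, rep78, price and mpg are from that dataset, not from the course file):

```stata
sysuse auto, clear
tabulate foreign, summarize(price)              // mean, sd and frequency
tabulate foreign, summarize(price) means        // means only
tabulate foreign rep78, summarize(mpg) means    // two-way table of means
```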
ADVANCED DESCRIPTIVE
STATISTICS
• tabstat
– This command gives summary statistics for a set
of continuous variables for each value of a
categorical variable.
• The syntax is:
tabstat varlist [if exp] [in range] [, stat(statname [...]) by(varname)]
• where
– varlist is a list of continuous variables
– statname is a type of statistic
– varname is a categorical variable
ADVANCED DESCRIPTIVE
STATISTICS
Some facts about this command:
– The default statistic is the mean.
– Optional statistics include mean, count, sum, max,
min, range, sd (standard deviation), variance,
skewness, kurtosis, median, iqr, and selected percentiles such as p25 and p75.
– Without the by() option, tabstat is like “summarize” except
that it allows you to specify the list of statistics to be
displayed.
– With the by() option, tabstat is like “tabulate … summarize”
except that tabstat is more flexible in the statistics and
format
ADVANCED DESCRIPTIVE
STATISTICS
• Examples
– tabstat food hhsize, stats(mean max min) gives mean,
max, and min of food & hhsize
– tabstat food hhsize, by(q1a) gives mean of two
variables for each region
– tabstat food, stats(median) by(q1a) gives the median
food consumption
for each region
• The tabstat command displays summary statistics for
a series of numeric variables in a single table.
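• A minimal tabstat sketch, assuming the bundled auto dataset rather than the survey data. The save option, which is not discussed in the slides, stores the statistics in r() so that they can be reused.

```stata
* Assumption: auto.dta stands in for the household survey
sysuse auto, clear
* Several statistics for two variables, by group
tabstat price mpg, statistics(mean median sd) by(foreign)
* Percentiles are requested by name, for example p25 and p75
tabstat price, statistics(p25 p50 p75 iqr)
* Keep the overall mean in a matrix for later use
quietly tabstat price, statistics(mean) save
matrix M = r(StatTotal)
matrix list M
```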
ADVANCED DESCRIPTIVE STATISTICS
• table
– This command creates a wide variety of tables. It
is probably the most flexible and useful of all the
table commands in Stata.
• The syntax is:
table rowvar [colvar] [if exp] [in range], c(clist) [row col]
• where
– rowvar is the categorical row variable
– colvar is the (optional) categorical column variable
– clist is a list of statistics and variables
– row is an option to include a summary row
– col is an option to include a summary column
ADVANCED DESCRIPTIVE
STATISTICS
• Some useful facts about this command:
– The default statistic is the frequency.
– Optional statistics are mean, sd, sum, rawsum (unweighted), count, max, min,
median, and pn (nth percentile).
– The c( ) is short for contents of each cell.
– Like tab, it can create one- and two-way frequency tables, but table
cannot show percentages
– Like tab…sum, it can calculate basic statistics for each value of a
categorical variable
– Its advantage over tab…sum is that it offers more statistics and can take
more than one continuous variable
– Like tabstat, it can calculate advanced statistics for each value of a
categorical variable
– Its advantage over tabstat is that it can produce two-way (and higher-way) tables; its
disadvantage is that it offers fewer statistics.
ADVANCED DESCRIPTIVE
STATISTICS
• Here are some examples:
– table q1a, row                                    table of frequencies by region with total row
– table q1a, c(mean income)                         table of average income by region
– table q1a, c(mean yield sd yield median yield)    table of yield statistics by region
– table q1a, c(mean yield) format(%9.2f)            table of average yields by region with a display format
– table q1a sexh, c(mean yield)                     table of average yield by region and sex
– table q1a sexh, c(mean income mean yield)         table of avg yield & income by region & sex
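• A hedged sketch of the c() syntax on the bundled auto dataset. The table command was redesigned in Stata 17, so this sketch assumes that the older syntax shown in these slides is requested through version control; on Stata 16 the version prefix is harmless, and on earlier releases it should be removed.

```stata
* Assumptions: auto.dta stands in for the survey data, and the
* installed Stata is release 16 or newer (needed for "version 16:")
sysuse auto, clear
* Mean and std. deviation of price by origin, with a total row
version 16: table foreign, contents(mean price sd price) format(%9.2f) row
* Two-way table of mean mpg by repair record and origin
version 16: table rep78 foreign, contents(mean mpg) row col
```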
PRESENTING DATA WITH GRAPH
• The commands that draw graphs are
– graph twoway    scatterplots, line plots, etc.
– graph matrix    scatterplot matrices
– graph bar       bar charts
– graph dot       dot charts
– graph box       box-and-whisker plots
– graph pie       pie charts
PRESENTING DATA WITH GRAPH
• Examples
– graph twoway scatter cons food
• We can show the regression line predicting
cons from food using the lfit plot type.
– twoway lfit cons food
• The two graphs can be overlaid like this
– twoway (scatter cons hhsize) (lfit cons hhsize)
PRESENTING DATA WITH GRAPH
• Labeling graphs
scatter read write, title("title") subtitle("subtitle")
xtitle("xtitle") ytitle("ytitle") note("note")
• Example
scatter ageh cons , title("title") subtitle("subtitle")
xtitle("xtitle") ytitle("ytitle") note("note")
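• A minimal sketch that combines the overlay and the labeling options, assuming the bundled auto dataset in place of the survey data. The graph name scatterfit is only an illustrative choice.

```stata
* Assumption: auto.dta stands in for the survey data
sysuse auto, clear
twoway (scatter price weight) (lfit price weight), ///
    title("Price against weight") ///
    subtitle("Scatter with fitted line") ///
    xtitle("Weight (lbs.)") ytitle("Price (USD)") ///
    note("Source: auto.dta") name(scatterfit, replace)
```

The /// continuation marks only work inside a do-file, not when typed in the Command window.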
NORMALITY AND OUTLIER
• skewness and kurtosis
sum rconspc
sum rconspc, detail
• check normality of a variable visually by
looking at some basic graphs
histogram rconspc
histogram rconspc, normal
histogram rconspc, normal bin(100)
NORMALITY AND OUTLIER
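• graph box draws vertical box plots
graph box rconspc, by(sexh)
– y axis is numerical; by(sexh) draws one panel per category, while
over(sexh) would place the categories along the x axis
– lower and upper bounds of the box are the 25th and 75th percentiles
– line within the box is the median
– whiskers extend to the lower and upper adjacent values, that is, the most
extreme observations within 1.5 times the interquartile range of the box;
points beyond them are plotted individually as outside values
• If rconspc is normal, the median would be in the
center of the box and the ends of the whiskers would be
roughly equidistant from the box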
NORMALITY AND OUTLIER
• The kdensity command with the normal option
kdensity rconspc, normal
– kernel density graph of the variable with a normal density
superimposed on the graph
– useful for checking whether a variable (or a set of regression
residuals) is normally distributed
• The pnorm command produces a standardized normal probability (P-P) plot
pnorm rconspc
– It should be approximately linear if the variable follows a
normal distribution
NORMALITY AND OUTLIER
• The qnorm command plots the quantiles of a variable
against the quantiles of a normal distribution
qnorm rconspc
– The closer the points of the Q-Q plot lie to the 45 degree
line, the closer the variable is to a normal distribution
• Both the P-P and Q-Q plots indicate that rconspc is not normal, with
a long tail to the right
• The qnorm plot is more sensitive to deviations from normality
in the tails of the distribution
• The pnorm plot is more sensitive to deviations near the middle
of the distribution
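• The normality checks from these slides can be sketched in one short do-file, assuming the bundled auto dataset, where price plays the role of rconspc. The graph names are illustrative only.

```stata
* Assumption: auto.dta stands in for the survey data
sysuse auto, clear
* Skewness and kurtosis are reported by summarize, detail
summarize price, detail
display "skewness = " r(skewness) ", kurtosis = " r(kurtosis)
* Visual checks, each kept in memory under its own name
histogram price, normal name(hist_price, replace)
kdensity price, normal name(kd_price, replace)
pnorm price, name(pp_price, replace)
qnorm price, name(qq_price, replace)
```

For a normal variable the skewness is close to 0 and the kurtosis close to 3.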
NORMALITY AND OUTLIER
• Dealing with outliers
– delete them from analyses
– use measures that are not sensitive to them, such
as the median instead of the mean
– transform the data to be more normal
– replace them by imputation
NORMALITY AND OUTLIER
* Calculate the number of standard deviations from the median, by sex of hh head
egen median=median(rconspc), by(sexh)
egen sd=sd(rconspc), by(sexh)
* generate the ratio of the deviation from the median to the standard deviation
gen ratio=(rconspc-median)/sd
* (3 missing values generated)
* generate an outlier dummy equal to 1 if the value lies more than 3 standard deviations above the median
gen outlier=1 if ratio>3 & ratio~=.
* (1414 missing values generated)
replace outlier=0 if outlier==. & ratio~=.
* (1411 real changes made)
tabulate outlier, missing
table sexh outlier, contents(mean rconspc) row col missing
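• The same outlier flag can be written more compactly. This is a sketch on the bundled auto dataset, with price in place of rconspc and foreign in place of sexh; the variable names med_price, sd_price, ratio and outlier are illustrative.

```stata
* Assumption: auto.dta stands in for the survey data
sysuse auto, clear
egen med_price = median(price), by(foreign)
egen sd_price  = sd(price), by(foreign)
gen ratio = (price - med_price) / sd_price
* 1 if more than 3 SDs above the group median, 0 otherwise,
* and missing whenever the ratio itself is missing
gen byte outlier = ratio > 3 if !missing(ratio)
tabulate outlier foreign, missing
```

The if !missing(ratio) qualifier matters because Stata treats a missing value as larger than any number, so ratio > 3 alone would flag missing values as outliers.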
NORMALITY AND OUTLIER
• Listwise deletion
histogram rconspc if outlier==0, normal
• Data transformation
– a log transformation
gen lnrconspc=ln(rconspc)
histogram lnrconspc if rconspc~=., normal
• Imputation
– First the analyst estimates a regression model in which the
dependent variable has missing values
– In the second step, the estimated regression coefficients
are used to predict (impute) missing values of that variable
NORMALITY AND OUTLIER
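replace rconspc=. if outlier==1
* set the log variable to missing as well, so that outliers are excluded from the regression
replace lnrconspc=. if outlier==1
xi: regress lnrconspc i.q1a i.sexh i.poor hhsize ageh, robust
predict yhat

replace lnrconspc=yhat if rconspc==.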

• Or

xi: impute lnrconspc i.q1a i.sexh i.poor hhsize ageh, gen(new1)
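• The two imputation steps can be sketched as follows, assuming the bundled auto dataset, in which rep78 has a few missing values. The predictors and the names rep78_hat and rep78_imp are illustrative choices, not part of the course data.

```stata
* Assumption: auto.dta stands in for the survey data
sysuse auto, clear
* Step 1: fit the model on the observations where rep78 is observed
regress rep78 price mpg weight foreign
* Step 2: predict for every car and fill in only the missing values
predict rep78_hat
gen rep78_imp = rep78
replace rep78_imp = rep78_hat if missing(rep78)
count if missing(rep78_imp)
```

Keeping the imputed values in a new variable leaves the original variable untouched for comparison.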


STATISTICAL TESTS
• compare
– The compare command is an easy way to check if
two variables are the same
• correlate command
– The correlate command displays a matrix of
Pearson correlations for the variables listed.
– correlate cons hhsize
STATISTICAL TESTS
• ttest command
– We would like to see if the mean of hhsize equals
6 by using a one-sample t-test, which tests whether
the sample was drawn from a population with a
mean of 6.
ttest hhsize=6
• We would also like to know whether the mean of cons equals the mean of
food. With two variables, ttest performs a paired t-test.
ttest cons=food
STATISTICAL TESTS
• ttest command for independent groups with pooled
(equal) variance
– ttest cons, by(sexh)
• ttest command for independent groups using
unequal variance
– ttest cons, by(sexh) unequal
• hotelling command performs Hotelling's T-squared
test of whether the means are equal between two
groups.
– hotel cons, by(sexh)
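• A runnable sketch of the t-tests above, assuming the bundled auto dataset, with mpg as the outcome and foreign as the grouping variable. The hypothesized mean of 20 is an arbitrary illustrative value.

```stata
* Assumption: auto.dta stands in for the survey data
sysuse auto, clear
ttest mpg == 20                  // one-sample test
ttest mpg, by(foreign)           // two groups, pooled variance
ttest mpg, by(foreign) unequal   // two groups, unequal variances
* The two-sided p-value of the last test is left in r(p)
display "two-sided p-value = " r(p)
```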
LINEAR REGRESSION
• Regression analysis involves estimating an
equation that best describes the data
• One variable is considered the dependent
variable, while the others are considered
independent (or explanatory) variables
• Stata is capable of many types of regression
analysis and associated statistical tests
• Here we touch on only a few of the more
common commands and procedures
LINEAR REGRESSION
• regress
– This is an example of ordinary least squares (OLS) regression using the
regress command.
reg cons hhsize
– This regression tells us that for every extra person (hhsize)
in a household, total monthly expenditure (cons) is predicted to
increase by about 40 Ethiopian Birr
– This increase is statistically significant, as indicated by the
p-value of 0.000 (that is, less than 0.0005) reported for this coefficient
LINEAR REGRESSION
– The r-squared (r2) equals 0.0676. This value tells us
that our independent variable (hhsize) accounts for
approximately 7% of the variation of the dependent variable
(cons)
– Running a regression with the robust option reports Huber-White
(robust) standard errors, which remain valid when the
residuals are not identically distributed
– This is very useful when there is heteroskedasticity.
– The robust option does not affect the estimates of the
regression coefficients
reg cons hhsize, robust
LINEAR REGRESSION
– Stata stores results from estimation commands in e(), and
you can see a list of what exactly is stored using the
ereturn list command.
ereturn list
– Using the generate command, we can extract those results,
such as estimated coefficients and standard errors, to be
used in other Stata commands.
• reg cons hhsize
• gen intercept=_b[_cons]
• display intercept
• gen slope=_b[hhsize]
• display slope
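• A sketch of the same idea with scalars instead of generate, assuming the bundled auto dataset. A scalar holds a single number, whereas gen creates a variable that repeats the value in every observation.

```stata
* Assumption: auto.dta stands in for the survey data
sysuse auto, clear
regress price weight
ereturn list
scalar slope     = _b[weight]
scalar intercept = _b[_cons]
scalar se_slope  = _se[weight]
display "slope = " slope ", intercept = " intercept
display "t statistic = " slope / se_slope ", R-squared = " e(r2)
```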
LINEAR REGRESSION
– The estimates table command displays a table
with coefficients and statistics for one or more
estimation sets in parallel columns
estimates table, b se t p
– The predict command computes predicted value
and residual for each observation
predict pred
– With the residuals option, the predict command
calculates the residual.
predict e, residual
LINEAR REGRESSION
– We can plot the predicted value and observed value using the
graph twoway command.
regress cons hhsize
predict pred
graph twoway (scatter cons hhsize) (lfit pred hhsize)
– The rvfplot command generates a plot of the residual
versus the fitted values. It is used after regress command.
regress cons food
rvfplot
– The rvpplot command produces a plot of the residual
versus a specified predictor
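• A minimal sketch of the post-estimation commands on this slide, assuming the bundled auto dataset. The names fitted and resid are illustrative.

```stata
* Assumption: auto.dta stands in for the survey data
sysuse auto, clear
regress price weight
predict fitted               // linear prediction
predict resid, residuals     // observed minus predicted
* Observed values with the fitted line
twoway (scatter price weight) (line fitted weight, sort)
* Residuals against fitted values, then against one predictor
rvfplot, yline(0)
rvpplot weight, yline(0)
```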
LINEAR REGRESSION
• Hypothesis tests
– The test command performs Wald tests for simple
and composite linear hypotheses about the
parameters of estimation
recode q1a 7/9=7
gen reg1=q1a==1
gen reg3=q1a==3
gen reg4=q1a==4
gen reg7=q1a==7
regress cons hhsize reg1 reg3 reg4 reg7
LINEAR REGRESSION
• test reg3=0
• test reg3= reg4= reg7
– The first command tests the hypothesis that the coefficient on the
region 3 dummy is zero (test reg3=0). The second tests that the
coefficients on the region dummies are equal to each other
(reg3 = reg4 = reg7), not that they are zero. The reported
probability is very low (0.000), so the null hypothesis is rejected.
– If you want to test the joint significance of a set of
related variables, you can use
– testparm reg*      test of the hypothesis that the coefficients on all
reg* dummies are zero
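• A runnable sketch of the three kinds of Wald test, assuming the bundled auto dataset, with dummies for the repair record rep78 in place of the region dummies.

```stata
* Assumption: auto.dta stands in for the survey data
sysuse auto, clear
tabulate rep78, generate(rep)     // creates dummies rep1 ... rep5
regress price weight rep2 rep3 rep4 rep5
test rep3 = 0                     // one coefficient equals zero
test rep3 = rep4 = rep5           // three coefficients are equal
testparm rep2-rep5                // all four dummies are jointly zero
display "joint test: F = " r(F) ", df = " r(df) ", p = " r(p)
```

The range rep2-rep5 is used instead of rep* because rep* would also match rep78 and rep1, which are not in the model.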
LINEAR REGRESSION
• Ramsey RESET to test for omitted variables
(misspecification)
ovtest [, rhs]
– This test amounts to estimating y = xb + zt + u and
then testing t = 0
regress cons hhsize reg3 reg4 reg7
ovtest          tests significance of powers of predicted cons
ovtest, rhs     tests significance of powers of hhsize, reg3, reg4 and reg7
LINEAR REGRESSION
• Example;
 ovtest
 Ramsey RESET test using powers of the fitted values of cons
Ho: model has no omitted variables
F(3, 1441) = 4.47
Prob > F = 0.0039
– The ovtest rejects the hypothesis that there are no
omitted variables, indicating that we need to
improve the specification
LINEAR REGRESSION
• Heteroskedasticity
– We can use the hettest command (Breusch-Pagan / Cook-Weisberg test), which is
based on an auxiliary regression of the squared residuals on the fitted values.
 hettest
Ho: Constant variance
Variables: fitted values of cons
chi2(1) = 81.50
Prob > chi2 = 0.0000
– The hettest indicates that there is
heteroskedasticity, which needs to be dealt with
LINEAR REGRESSION
• We can also use information matrix test by
imtest command, which provides a summary
test of violations of the assumptions on
regression errors.
imtest
• The imtest also confirms the existence of
heteroskedasticity, skewness and kurtosis
problems
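• A sketch of the three specification tests run through the estat prefix, assuming Stata 9 or newer and the bundled auto dataset in place of the survey data.

```stata
* Assumptions: auto.dta stands in for the survey data; the estat
* forms of these tests are available (Stata 9 or newer)
sysuse auto, clear
regress price weight mpg
estat ovtest      // Ramsey RESET, powers of the fitted values
estat hettest     // Breusch-Pagan / Cook-Weisberg test
display "hettest: chi2 = " r(chi2) ", p = " r(p)
estat imtest      // information matrix test
```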
LINEAR REGRESSION
– The xi prefix is used to dummy code categorical
variables, and we tag these variables with an “i.”
in front of each target variable
xi: regress cons hhsize i.q1a, robust
– By default, xi omits the category with the lowest value of the
categorical variable and uses it as the reference category.
To declare a different category as the
reference category, set the omit characteristic with the char command
char q1a[omit] 7
xi:regress cons hhsize i.q1a, robust
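• A sketch of choosing the reference category, assuming the bundled auto dataset with rep78 as the categorical variable. The last line uses factor-variable notation (ib3.), which is an assumption about the installed release: it requires Stata 11 or newer and is not part of the slides.

```stata
* Assumption: auto.dta stands in for the survey data
sysuse auto, clear
* Default: the lowest category (rep78 == 1) is the reference
xi: regress price weight i.rep78
* Declare category 3 as the reference for xi
char rep78[omit] 3
xi: regress price weight i.rep78
* Same model with factor variables (Stata 11 or newer)
regress price weight ib3.rep78
```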
LINEAR REGRESSION
– Logistic regression
• logistic poor hhsize ageh sexh, coef
• xi:logit poor hhsize ageh sexh i.q1b
• ereturn list
• estat summarize
• estat ic
• mfx [, options]
– Options
dydx specifies that marginal effects dy/dx be calculated; it is the default.
eyex specifies that elasticities be calculated in the form of d(lny)/d(lnx)
dyex specifies that elasticities be calculated in the form of d(y)/d(lnx)
eydx specifies that elasticities be calculated in the form of d(lny)/d(x)
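• A hedged sketch of marginal effects after a logit, assuming the bundled auto dataset and Stata 11 or newer, where the margins command is available as the successor to mfx. Note that margins, dydx(*) reports average marginal effects by default, while mfx evaluates them at the means of the regressors.

```stata
* Assumptions: auto.dta stands in for the survey data; margins
* requires Stata 11 or newer and is not used in the slides
sysuse auto, clear
logit foreign mpg weight
margins, dydx(*)             // average marginal effects dy/dx
margins, dydx(*) atmeans     // evaluated at the means, as mfx does
margins, eyex(*)             // elasticities d(lny)/d(lnx)
```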
LINEAR REGRESSION
– The lrtest command performs a likelihood-ratio
test for the null hypothesis that the parameter
vector of a statistical model satisfies some smooth
constraint.
– To conduct the test, both the unrestricted and the
restricted models must be fitted using the
maximum likelihood method, and the results of at
least one must be stored using estimates store.
– The lrtest command provides an important
alternative to Wald testing for models fitted by
maximum likelihood.
LINEAR REGRESSION
– We would like to see if the introduction of regional dummies
improves our estimation. We perform a likelihood ratio test
using the lrtest command.
xi: logit poor hhsize ageh i.q1a
estimates store n1
logit poor hhsize ageh
lrtest n1
Likelihood-ratio test                      LR chi2(5)  =  169.86
(Assumption: . nested in n1)               Prob > chi2 =  0.0000
– The null hypothesis is firmly rejected.
– Other hypothesis tests for parameters are the same as
described in OLS
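• A self-contained sketch of the likelihood-ratio test, assuming the bundled auto dataset. The stored names full and restricted are illustrative.

```stata
* Assumption: auto.dta stands in for the survey data
sysuse auto, clear
logit foreign mpg weight
estimates store full
logit foreign mpg
estimates store restricted
* H0: the coefficient on weight is zero
lrtest full restricted
display "LR chi2 = " r(chi2) ", df = " r(df) ", p = " r(p)
```

Both models must be fitted on the same observations for the test to be valid.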
DATA MANAGEMENT
• We can subset data by keeping or dropping
variables, or by keeping and dropping
observations
– keep and drop variables
• The keep command is used to keep variables in the list
while dropping other variables
• The drop command is used to delete variables in the
list while keeping other variables
– keep and drop observations
• The keep if command keeps the observations for which the
condition is met; drop if deletes the observations for which it is met
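• A short sketch of keep and drop for variables and observations, assuming the bundled auto dataset in place of the survey data. The price threshold is arbitrary.

```stata
* Assumption: auto.dta stands in for the survey data
sysuse auto, clear
keep make price mpg foreign     // keep these variables, drop the rest
drop mpg                        // delete one variable
keep if foreign == 1            // keep foreign cars only
drop if price > 10000           // delete expensive cars
describe, short
```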
DATA MANAGEMENT
• sort
– The sort command arranges the observations of the
current data into ascending order based on the values of
the variables listed
• Variable ordering
– The order command helps us to organize variables in a way
that makes sense by changing the order of the variables
• by prefix
– With the by prefix, _N is the total number of observations
within each group listed in the by command, and _n is
the running counter that uniquely identifies observations
within the group. The data must be sorted by the grouping variables first
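• A sketch of sort, by, _n, _N and order together, assuming the bundled auto dataset. The names rank_in_group and group_size are illustrative.

```stata
* Assumption: auto.dta stands in for the survey data
sysuse auto, clear
sort foreign price
by foreign: gen rank_in_group = _n    // 1 = cheapest car in its group
by foreign: gen group_size = _N       // number of cars in the group
list make foreign price rank_in_group group_size if rank_in_group == 1
* Move the new variables next to the identifiers
order make foreign rank_in_group group_size
```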
DATA MANAGEMENT
• Often we don’t have all the info that we need
in one dataset, and we have to merge them
into one (since STATA allows for only one
dataset in memory).
• There are several types of “merging”
datasets…
DATA MANAGEMENT
• As long as the variables
in the files are the same
and the only thing you
need to do is to add
observations, this is
vertical combination.
• For this we use the
append command.
• Since this is used less
often, I will skip it, but
you can look at it in the
help file.
DATA MANAGEMENT
• Appending data files
– concatenates two datasets, that is, stick them
together vertically, one after another
use ERHS1999.dta, clear
append using ERHS1997.dta
append using ERHS1995.dta
– The append command does not require that the
two datasets contain the same variables, but it is
highly recommended to use an identical list of
variables, because a variable that exists in only one dataset
is set to missing for the observations from the other
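• A self-contained sketch of append, assuming the bundled auto dataset split into two pieces instead of the ERHS survey rounds. It uses a temporary file, so it should be run as a whole do-file.

```stata
* Assumption: auto.dta stands in for the survey rounds
sysuse auto, clear
keep if foreign == 0
tempfile domestic
save `domestic'
sysuse auto, clear
keep if foreign == 1
* Stack the domestic cars underneath the foreign ones
append using `domestic'
count
```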
DATA MANAGEMENT
• If the identifying variable
which appears in the files
is unique in both files,
then it's a one-to-one
match. Unique means that
for each value of this
variable, there is only one
observation that contains
it. In the figure below,
country is the identifying
variable. In both datasets,
each country has only one
observation.
DATA MANAGEMENT
• One-to-one match merging
• The merge command sticks two datasets together horizontally, one next to
the other. Before a merge with this syntax, both datasets must be sorted by
the same merge variables
. use p2sec9a.dta, clear
. sort hhid item1234
. save consumption.dta, replace
. use p_r5, clear
. sort hhid item1234
. save comprice.dta, replace
. use consumption.dta, clear
. merge hhid item1234 using compri
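• A hedged sketch of a one-to-one merge in the newer syntax, assuming Stata 11 or newer, where merge 1:1 sorts the data itself, and assuming the bundled auto dataset split into two files that share the key make. It uses a temporary file, so it should be run as a whole do-file.

```stata
* Assumptions: auto.dta stands in for the survey files; the
* merge 1:1 syntax requires Stata 11 or newer
sysuse auto, clear
keep make price
tempfile prices
save `prices'
sysuse auto, clear
keep make mpg
* make identifies exactly one observation in each file
merge 1:1 make using `prices'
tabulate _merge
```

The _merge variable reports, for each observation, whether it was found in the master data, the using data, or both.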
DATA MANAGEMENT
• One-to-many
matching
– If the identifying
variable is
unique in one
file, but not
unique in the
other, then it's a
one-to-many
matching.
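• A sketch of a many-to-one match, assuming Stata 11 or newer and the bundled auto dataset. A small file of group means (one row per value of foreign) is attached to the car-level data; the name mean_price is illustrative.

```stata
* Assumptions: auto.dta stands in for the survey files; the
* merge m:1 syntax requires Stata 11 or newer
sysuse auto, clear
collapse (mean) mean_price = price, by(foreign)
tempfile groupmeans
save `groupmeans'
sysuse auto, clear
* foreign is unique in the using file but repeated in the master
merge m:1 foreign using `groupmeans'
list make foreign price mean_price in 1/3
```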
DATA MANAGEMENT
• Collapse
– Sometimes we have data files that need to be
aggregated at a higher level to be useful for us. For
example, we have household data but we are really
interested in regional data. The collapse command
serves this purpose by converting the dataset in
memory into a dataset of means, sums, medians
and percentiles
• For instance, we would like to see the mean cons in
each q1a and sex of hh head.
collapse (mean) cons, by(q1a sexh)
DATA MANAGEMENT
• The reshape wide command tells Stata that
we want to go from long to wide format after
collapsing. The i() option names the variable that identifies the rows,
while j() names the variable whose values become the columns
reshape wide cons, i(q1a) j(sexh)
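• A runnable sketch of collapse followed by reshape wide, assuming the bundled auto dataset, with rep78 in the role of region and foreign in the role of sex of hh head.

```stata
* Assumption: auto.dta stands in for the survey data
sysuse auto, clear
drop if missing(rep78)
* One row per combination of repair record and origin
collapse (mean) price, by(rep78 foreign)
* One row per repair record, one price column per origin
reshape wide price, i(rep78) j(foreign)
list
```

After the reshape, price0 holds the domestic mean and price1 the foreign mean; combinations that do not occur in the data are missing.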
IMPORT DATA IN STATA
• The insheet command can import data in text format (Tab
delimited, or comma separated values CSV files).
• Syntax:
insheet [varlist] using filename [, options]

• Options:
– tab : tab-delimited data
– comma : comma-delimited data
– delimiter("char"): use char as delimiter
– clear: replace data in memory
– names : variable names are included on the first line of the file
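• A round-trip sketch of insheet, assuming the bundled auto dataset and write access to the current working directory. The file name auto_sample.csv is hypothetical. In Stata 13 and newer, import delimited is the documented replacement, although insheet is expected to keep working.

```stata
* Assumptions: auto.dta stands in for the survey data and the
* working directory is writable; auto_sample.csv is a made-up name
sysuse auto, clear
outsheet make price mpg using auto_sample.csv, comma replace
* Read the file back, taking variable names from the first line
insheet using auto_sample.csv, comma names clear
describe
list in 1/3
```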
GOOD SITES TO LOOK AT!

• STATA HELP – either online or in the software itself.


• http://stataproject.blogspot.com.
• http://www.stata.com/
• http://www.stata.com/statalist/
• http://ideas.repec.org/s/boc/bocode.html
• http://www.princeton.edu/~erp/stata/main.html
• http://www.cpc.unc.edu/services/computer/presentations/statatutorial/
• http://www.ats.ucla.edu/stat/stata/
GOOD SITES TO LOOK AT!
• Statalist is hosted at the Harvard School of Public
Health, and is an email listserver where Stata users,
from experts writing Stata programs to users like
us, maintain a lively dialogue about all things
statistical and Stata.
• You can sign on to Statalist so that
you can receive as well as
post your own questions
through email.
