STATA Notes 2022

STATA
Choosing a statistical software package is often a strategic decision which entails the
investment of time and money. One should think about the future development and
compatibility of the software, and about the support available from peers for exchanging
experience.
1. Introduction
Stata is a powerful statistical package with smart data-management facilities, a wide array of
up-to-date statistical techniques, and an excellent system for producing publication-quality
graphs. The official website is http://www.stata.com/. Its main strengths are handling and
manipulating large data sets, and capabilities for handling panel and time-series regression analysis.
Stata is designed to be used from the command line, but it also offers a GUI that allows for
working with menus. Command line usage has the advantage that all steps of the
analysis, and thus all results, are easily replicable.
Versions of Stata
There are a number of different versions of Stata available: Stata/SE (Special
Edition), Intercooled Stata and Small Stata. The standard version is called Stata/IC (or
Intercooled Stata) and can handle up to 2,047 variables. Stata/SE can handle up to
32,767 variables.
The Stata Interface
When Stata starts up you see four or five docked windows, depending on the version; the
later versions have 5 windows, which are:
• Command window: where you type your commands.
• Review window: keeps track of the commands you have used.
• Results window: displays your output/results. Only graphics appear in a separate
window.
• Variables window: lists the variables in the data set you are using. Clicking on a
variable transfers its name to the Command window.
L Siziba HSTS204 2016

• Properties window (version 12 upwards): displays properties of your variables and dataset.
Important to note:
✓ You can resize or even close some of these windows. Stata remembers its settings the next
time it runs.
✓ You can also save (and then load) named preference sets using the menu Edit|Preferences.
✓ You can also choose the font used in each window; just right click and select font from the
context menu.
✓ You can change the colour scheme. For earlier versions of Stata the preset scheme is classic,
the traditional black background.
For version 8 and above, Stata's graphical user interface (GUI) allows selecting commands
and options from a menu and dialog system. The GUI is often the easier route;
however, the command language ensures reproducibility of your results. Commands can be
typed in a separate file, called a do file, or in the Command window. The GUI can be helpful
when you are starting to learn Stata, particularly because after you point and click on the
menus and dialogs, Stata types the corresponding command for you.
Typing Commands
Stata commands must be typed in lowercase.
Stata can work as a calculator using the display command.

Stata marks the lines you type using a dot at the start of a line: Example
. display 5+2
7
Stata commands are case-sensitive.
The Page Up key can be used to recall commands typed earlier, which can then be edited using the
arrow, insert and delete keys.
Stata Command Syntax
The standard structure of Stata commands usually follows the following syntax, where bold
indicates keywords and square brackets indicate optional elements:
[by varlist:] command [varlist] [=exp] [if exp] [in range] [weight] [using filename]
[,options]
We now describe each syntax element:

command:
The only required element is the command itself, which is usually (but not always) an
action verb, and is often followed by the names of one or more variables.
Commands can usually be abbreviated. Stata's help files underline the minimum
letters that are required; for example, the regress command can be abbreviated
to reg.
varlist:

The command is often followed by the names of one or more variables, for example
describe y or regress y x. Variable names are case sensitive.
=exp:
Commands used to generate new variables, such as generate log_x = log(x),
include an arithmetic expression, basically a formula using the standard operators (+ -
* and / for the four basic operations and ^ for exponentiation), functions, and
parentheses.
if exp and in range:
A command's action can be restricted to a subset of the data by
specifying a logical condition that evaluates to true or false, such as x < 55.
Relational operators are <, <=, ==, !=, >= and >, and logical negation is expressed using
! or ~.
Alternatively, you can specify a range of the data, for example in 1/10 will restrict
the command's action to the first 10 observations.
weight:
Some commands allow the use of weights, type help weights to learn more.
using filename:
The keyword using introduces a file name; this can be a file in your computer, on the
network, or on the internet.
options:
Most commands have options that are specified following a comma. To obtain a list
of the options available with a command type help command where command is the
actual command name.
by varlist:
It instructs Stata to repeat the command for each group of observations defined by
distinct values of the variables in the list (for a given grouping or disaggregating
variable).
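As a rough illustration of how these syntax elements combine, the sketch below uses the auto dataset that ships with Stata. That dataset and its variable names (price, mpg, foreign) are assumptions made for the example; they are not part of the data used in these notes.

```stata
* Illustration only: the auto dataset is installed with Stata
sysuse auto, clear

* command + varlist + if exp + in range + options
summarize price mpg if foreign == 0 in 1/50, detail

* by varlist: repeat the command for each group (bysort sorts the data first)
bysort foreign: summarize price mpg

* =exp: create a new variable from an expression
generate log_price = log(price)
```

The bysort prefix is used instead of by because by requires the data to be sorted on the grouping variable already.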
A variable list (varlist) is a list of variable names with spaces/blanks in between. There are a
number of shorthand conventions (use of wildcards) to reduce the amount of typing.
myvar*          All variables starting with myvar
*var            All variables ending with var
my*var          All variables starting with my and ending with var
my~var          A single variable starting with my and ending with var
my?var          All variables starting with my and ending with var, with
                one other character between
myvar1-myvar5   All variables from myvar1 through myvar5, in the order of the
                Variables window
The * character indicates to match zero or more characters.
The ~ character also indicates to match zero or more characters, but unlike *, only one
variable is allowed to match. If more than one variable matches, an error message is returned.
The ? character matches a single character. All variables matching the pattern are returned.
The - character indicates that all variables in the dataset, starting with the variable to the left
of the - and ending with the variable to the right of the -, are to be returned.

Any command that takes a varlist understands the keyword _all to mean all variables. Some
commands use all variables by default if none are specified (e.g., summarize shows
summary statistics for all variables, and is equivalent to summarize _all).
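The shorthand conventions can be tried out on any dataset. The sketch below again assumes the auto dataset that ships with Stata, so the variable names are those of that dataset rather than of the data in these notes.

```stata
sysuse auto, clear

describe m*            // all variables whose names start with m
summarize t???         // names starting with t followed by exactly three characters
describe gear~o        // a single variable starting with gear and ending with o
summarize price-trunk  // price through trunk, in dataset order
summarize _all         // every variable in the dataset
```

If a ~ pattern matches more than one variable Stata stops with an error, which makes it a safe abbreviation to use inside do files.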
Working Directory
The working directory is displayed at the bottom left hand corner of the window. Stata
displays the default directory. Any files you save without specifying a directory will
be saved here. To change your working directory, use the cd command:
cd directoryname
Allocating memory
set memory
Stata reads the whole dataset into the working memory, thus, sufficient memory has to be
reserved (or an error message will be displayed). Therefore, you should set the size of the
working memory reserved for Stata before loading a (big) dataset with the command set
memory:
Syntax: set memory Xm [, permanently]
e.g: set memory 200m
X represents the number of megabytes and the permanently option allows for a
permanent setting of the respective memory size. Nonetheless, not more than ¾ of the
available memory of the computer should be reserved for Stata in order to guarantee a
good performance of the system. From Stata 12 onwards memory is managed
automatically and set memory is no longer needed.
Help in STATA
Help in STATA can be accessed by typing help command
E.g help graph or help graph_intro
Using Stata professionally
You may need to save your results and keep track of the commands that you have used.
Using a Log File to save results
By default all output goes to the Results window, where it can be viewed but
eventually disappears. To keep a record of your results, you should log your session, after
which Stata writes all results to both the Results window and the file you specify.
Use the command: log using filename, text replace
where filename is the name of your log file. Note the use of two recommended options: text
and replace. The text option creates logs in plain text (ASCII) format (the file has extension
.log), which can be viewed in an editor such as Notepad or a word processor such as MS
Word; otherwise the log file can only be viewed using Stata's Viewer and has extension .smcl.
The replace option specifies that the file is to be overwritten if it already exists; the
alternative is append, which adds new output to what is already in the file.

To close any open logs, use the log close command. The problem with this solution is that
it will not work if there is no log open. Alternatively you can use:
capture log close
The capture keyword tells Stata to run the command that follows and ignore any errors.
Use a Do File to save commands
A do file is just a set of Stata commands typed in a plain text file.
Ways of creating a Do file
(i)You can use Stata's own built-in do-file Editor, which has the great advantage that you can
run your program directly from the editor by clicking on the run icon or selecting Tools|Run
from the menu. You can also select just a few commands and run them by selecting
Tools|Run Selection in the menu. To access Stata's do editor use Ctrl-8 in earlier versions or
Ctrl-9 in versions 12 and 13 or select Window|Do-file Editor|New Do-file Editor in the menu
system.
(ii) Alternatively, you can use an editor such as Notepad. Save the file using extension .do
and then execute it using the command do filename.
(iii) You could even use a word processor such as Word, but you would have to remember to
save the file in plain text format, not in Word document format.
Use Comments and Annotations
It is always a good idea to annotate your do files with explanatory comments that provide the
explanation of what you are trying to do.
(i)In the Stata command window you can start a line with a * to indicate that it is a comment,
not a command. This can be useful to annotate your output.
(ii) In a do file you can also use three other types of comments: //, /* */ and ///
(a) // is used to indicate that everything that follows to the end of the line is a
comment and should be ignored by Stata. For example you could write
gen one = 1 // this will serve as a constant in the model
(b) /* */ is used to indicate that all the text between the opening /* and the
closing */, which may be a few characters or may span several lines, is a comment to
be ignored by Stata.
(c) /// is used to indicate that a command continues on the next line: everything
else to the end of the line is a comment and the command itself
continues on the next line.
Continuation Lines

When you are typing on the command window a command can be as long as needed. In a do-
file you will probably want to break long commands into lines to improve readability.
(i) To indicate to Stata that a command continues on the next line you use ///, which
says everything else to the end of the line is a comment and the command itself
continues on the next line. For example you could write
graph twoway (scatter lexp loggnppc) ///
(lfit lexp loggnppc)
(ii) Alternatively tell Stata to use a semi-colon instead of the enter key at the end of
the line to mark the end of a command, using #delimit ;, as in this example:
#delimit ;
graph twoway (scatter lexp loggnppc)
(lfit lexp loggnppc) ;
The #delimit command sets the character that marks the end of a command. Here the
character chosen is ; which is helpful when writing do-files because otherwise each
command has to be written on a single line. After #delimit ; all commands need to terminate with a
semi-colon. To return to using carriage return as the delimiter use
#delimit cr
The delimiter can only be changed in do files.
NB:
It is good practice to start a do file by specifying the version of Stata you are using, for example
version 10. This helps ensure that future versions of Stata will continue to interpret the commands correctly,
even if Stata has changed.
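Putting the pieces of this section together, a do file could be laid out roughly as follows. This is only a sketch: the file name mysession and the choice of version 10 are assumptions for illustration, and the analysis commands are left as a placeholder.

```stata
version 10                         // version the commands were written for; adjust to yours
capture log close                  // close any open log; no error if none is open
log using mysession, text replace  // mysession is a hypothetical log file name
clear

* ... data management and analysis commands go here ...

log close
```

Saving this as, say, mysession.do and typing do mysession reruns the whole analysis and rewrites the log each time.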
2. Data Management
In this section we describe Stata data files, discuss how to read raw data into Stata in free and
fixed formats, how to create new variables, how to document a dataset by labelling the variables and
their values, and how to manage Stata system files.
Stata 11 introduced a variables manager that allows editing variable names, labels, types, formats,
and notes, as well as value labels, using an intuitive graphical user interface available under
Data|Variables Manager in the menu system. While the manager is certainly convenient, we still
prefer writing all commands in a do file to ensure research reproducibility.
Stata Files
Stata datasets are rectangular arrays with n observations on m variables
(i) Variable Names
Variable names can have up to 32 characters, but many commands print only 12, and shorter
names are easier to type. Stata names are case sensitive.

(ii) Variable Types
Variables can contain numbers or strings. Numeric variables can be stored as integers (bytes,
integers, or longs) or floating point (float or double). These types differ in the range or precision
of the values they can hold, type help datatype for details.
You usually don't need to be concerned about the storage mode; Stata does all calculations using
doubles, and the compress command will find the most economical way to store each variable in
your dataset, type help compress to learn more.
You do have to be careful with logical comparisons involving floating point types. If you store
0.1 in a float called x you may be surprised to learn that x == 0.1 is never true. The reason is that
0.1 is "rounded" to different binary numbers when stored as a float (x) or as a double (the
constant 0.1). This problem does not occur with integers or strings.
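The float comparison problem is easy to reproduce. The short sketch below creates a one-observation dataset (the variable name x follows the text) and shows one common remedy, which is to round the constant to float precision with the float() function before comparing.

```stata
clear
set obs 1
generate float x = 0.1

count if x == 0.1          // counts 0: the float x differs from the double constant 0.1
count if x == float(0.1)   // counts 1: both sides are rounded to float precision
```

Storing the variable as a double in the first place is another option, at the cost of more memory.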
String variables can have varying lengths up to 244 characters in Stata 12, or up to two billion
characters in Stata 13, where you can use str1...str2045 to define fixed-length strings of up to
2045 characters, and strL to define a long string.
To convert string variables that contain numbers into numeric variables you use the command destring or the function real().
You can also use encode to convert string categories into a labelled numeric variable, or decode to convert
a labelled numeric variable to a string.
Missing Values
Like other statistical packages, Stata distinguishes missing values. The basic missing value for
numeric variables is represented by a dot . Starting with version 8 there are 26 additional missing-
value codes denoted by .a to .z. These values are represented internally as very large numbers, so
valid_numbers< . < .a < ... < .z.
To check for missing you need to write var >= . (not var == .). Stata has a function that can do
this comparison, missing(varname) and I recommend it because it leads to more readable code,
e.g. I prefer list id if missing(age) to list id if age >= .
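The ordering of missing values can be checked on a small made-up dataset. The four observations below are invented purely for illustration.

```stata
clear
input id age
1 25
2 .
3 .a
4 40
end

list id if missing(age)             // lists ids 2 and 3
count if age > 30                   // counts 3: missing codes exceed every number
count if age > 30 & !missing(age)   // counts 1: only id 4
```

The second count shows why open-ended comparisons on variables with missing values need an explicit missing() check.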
Data File Definition

a) Creation within Stata
(i) Entering data via the command window

✓ Begin with the input command.
✓ Following the input command, type the sequence of variable names (up to 32 characters each)
separated by blanks. For example, you could type: input quesid age sex marital
✓ You can then enter the values for case one and hit the ENTER key or the delimiter if declared,
and continue in a similar manner up to the last case.
✓ For missing data you can enter a period for a numeric variable or a blank for a string variable.
✓ When finished entering the data, type the word end.
✓ You then type the command to save: save myfilename and the data file will be saved with file
extension .dta.
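The steps above might look like this in practice. The values are invented, and marital is declared as a string (str10) here only to show how a blank string is entered; in the example above all four variables are numeric.

```stata
clear
input quesid age sex str10 marital
1 34 1 "married"
2 .  2 "single"
3 51 1 ""
end

save myfilename, replace   // written to the working directory as myfilename.dta
```

Variables are numeric unless a string type such as str10 precedes the name, which is why only marital carries a type.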
(ii) Entering data via the data spreadsheet
✓ To get the data spreadsheet type edit in the command window

✓ Treating each column as a separate variable, you can begin entering the data. For example,
enter the respondent quesid values in the first column before entering the other data. Stata will
recognize the data type and allocate the appropriate format itself. The user can double click
on Var1 and then enter a variable name.
(b)Reading Data Into Stata
From Excel
If you have an Excel file, you need to save it as a CSV file (a comma delimited file). Then, the
insheet command will read the file into Stata. E.g.
insheet using "C:\myexcelfile.csv", comma
On some occasions, you also need to include a variable list. So, the command would read insheet
var1 var2 var3 … using "C:\myexcelfile.csv", comma
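In more recent versions of Stata (12 and later for import excel, 13 and later for import delimited) a workbook or CSV file can also be read without insheet. The paths below are hypothetical, and the firstrow option assumes that the first row of the sheet holds the variable names.

```stata
* Hypothetical paths: replace with the location of your own files
import excel using "C:\myexcelfile.xlsx", firstrow clear

* Comma-delimited alternative to insheet
import delimited using "C:\myexcelfile.csv", clear
```

Each command replaces the data in memory, so only the one that matches your file type is needed.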
From another Statistical package (SPSS,SAS)
In this section we discuss how to read raw data files. If your data come from another statistical
package, such as SAS or SPSS, consider using a tool such as Stat/Transfer
(www.stattransfer.com) or DBMSCopy (www.dataflux.com).
Free Format
If your data are in free format, with variables separated by blanks, commas, or tabs, you can use
the infile command.
For an example of a free format file, see the data available on the web at a given URL, e.g.
http://uz.stats/datasets
We can read a free format data set into Stata using the command
. infile str20 var1 var2 var3 var4 using http://uz.stats/datasets/hhdata.raw
The infile command is followed by the names of the variables. Because var1 is a string rather
than a numeric variable we precede the name with str20, which sets the type of the variable as a
string of up to 20 characters. All other variables are numeric, which is the default type.
The keyword using is followed by the name of the file, which can be a file on your computer, a
local network, or the internet. In this example we are reading the file directly off the internet. And
that's all there is to it. For more information on this command type help infile1. To see what we
got we can list a few cases
. list in 1/3
Fixed Format (e.g from CSPRO)
Survey data often come in fixed format, with one or more records per case and each variable in a
fixed position in each record.

The simplest way to read fixed-format data is using the infix command to specify the columns
where each variable is located. If the data are neatly lined up in columns, we
could read them as follows:
infix str var1 2-19 var2 20-24 var3 25-29 var4 30-40 using http://uz.stats/datasets/hhdata.raw, clear
This says to read var1 from columns 2-19, var2 from columns 20-24, and so on. It is, of
course, essential to read the correct columns. We specified that var1 was a string variable
but didn't have to specify the width, which was clear from the fact that the data are in columns
2-19. The clear option is used to overwrite the existing dataset in memory.
If you have a large number of variables you should consider typing the names and locations on a
separate file, called a dictionary, which you can then call from the infix command.
The infile command can also be used with fixed-format data and a dictionary. This is a very
powerful command that gives you a number of options not available with infix; for example it
lets you define variable labels right in the dictionary, but the syntax is a bit more complicated.
In most cases you will find that you can read free-format data using infile and fixed-format data
using infix. For more information on the trade-offs see help infiling.
Data can also be typed directly into Stata using the input command, see help input, or using the
built-in Stata data editor available through Data|Data editor on the menu system.
Data Label and Notes
Stata lets you label your dataset using the label data command followed by a label of up to 80
characters (244 in Stata SE). You can also add notes of up to ~64K characters each using the
notes command followed by a colon and then the text:
. label data "Household Data"

. notes: Food and Nutrition Security Data set for farmers in Mberengwa District
Users of the data can type notes to see your annotation. Documenting your data carefully always
pays off.
Variable Labels and Notes
You can (and should) label your variables using the label variable command followed by the
name of the variable and a label of up to 80 characters enclosed in quotes. With the infile
command you can add these labels to the dictionary, which is a natural home for them. Otherwise
you should prepare a do file with all the labels. Here's how to define labels for two of the variables
in our dataset:
. label variable var1 "First name of Respondent"

. label variable var2 "Surname of Respondent"
Stata also lets you add notes to specific variables using the command notes varname: text. Note
that the command is followed by a variable name and then a colon:

. notes var3: The name of the person in charge at the household
Type describe and then notes to check our work so far.
Value Labels
You can also label the values of categorical variables, for example sexHHH.
. label define sexHHH 1 "Male"
. label define sexHHH 2 "Female", add
. label values sexHHH sexHHH
Stata has a two-step approach to defining labels. First you define a named label set which
associates integer codes with labels of up to 80 characters (244 in Stata SE), using the label
define command. Then you associate the set of labels with a variable, using the label values
command. Often you use the same name for the label set and the variable, as we did in our
example.
One advantage of this approach is that you can use the same set of labels for several variables.
The canonical example is label define yesno 1 "yes" 0 "no", which can then be associated with all
0-1 variables in your dataset, using a command of the form label values variablename yesno for
each one. When defining labels you can omit the quotes if the label is a single word, but I prefer
to use them always for clarity.
. label define yesno 1 "Yes" 2 "No"

. label values chronicill yesno
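The two-step approach can be seen end to end on a small made-up dataset. The three observations and the 0/1 coding of chronicill are assumptions for this sketch only.

```stata
clear
input sexHHH chronicill
1 0
2 1
2 0
end

* Step 1: define named label sets
label define sexHHH 1 "Male" 2 "Female"
label define yesno  1 "Yes"  0 "No"

* Step 2: attach each set to a variable
label values sexHHH sexHHH
label values chronicill yesno

label list               // show the label sets defined so far
tabulate sexHHH chronicill
```

The tabulation now shows the labels instead of the numeric codes, while the underlying values are unchanged.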
Creating New Variables
The most important Stata commands for creating new variables are generate/replace and recode,
and they are often used together.
Generate and Replace
The generate command creates a new variable using an expression that may combine constants,
variables, functions, and arithmetic and logical operators. Let's start with a simple example: here
is how to create a variable holding 2016 minus age (the approximate year of birth):
. gen agecat = 2016-age
You can also use drop var_names to drop one or more variables from the dataset.
Operators and Expressions
The following table shows the standard arithmetic, logical and relational operators you may use
in expressions:
Arithmetic                 Logical            Relational
+  add                     !  not (also ~)    ==  equal
-  subtract                |  or              !=  not equal (also ~=)
*  multiply                &  and             <   less than
/  divide                                     <=  less than or equal
^  raise to power                             >   greater than
+  string concatenation                       >=  greater than or equal
Here's how to create an indicator variable for adults (age above 17):
. generate adult = age > 17
This is a common Stata idiom, taking advantage of the fact that logical expressions take the value
1 if true and 0 if false. A common alternative is to write
. generate adult1 = 0
. replace adult1 = 1 if age > 17
The two strategies yield exactly the same answer. Both will be wrong if there are missing values,
which will be coded as adults because missing value codes are very large values. You should
develop a good habit of avoiding open ended comparisons. My preferred approach is to use
. generate adult2 = age > 17 if !missing(age)
which gives true for age above 17, false for age less than or equal to 17, and missing when age is
missing. Logical expressions may be combined using & for "and" or | for "or". Here's how to
create an indicator variable for age between 5 and 15:
. gen age5to15 = (age >=5 & age <= 15)
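The difference between the approaches shows up as soon as age has a missing value. The four ages below are invented for illustration.

```stata
clear
input age
10
17
30
.
end

generate adult  = age > 17                    // the missing age is coded 1
generate adult2 = age > 17 if !missing(age)   // the missing age stays missing
list
```

In the listing the last observation has adult equal to 1 but adult2 equal to missing, which is the behaviour you normally want.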
Functions
Stata has a large number of functions, here are a few frequently-used mathematical functions,
type help mathfun to see a complete list:
abs(x)              the absolute value of x
exp(x)              the exponential function of x
int(x)              the integer obtained by truncating x towards zero
ln(x) or log(x)     the natural logarithm of x if x>0
log10(x)            the log base 10 of x (for x>0)
logit(x)            the log of the odds for probability x: logit(x) = ln(x/(1-x))
max(x1,x2,...,xn)   the maximum of x1, x2, ..., xn, ignoring missing values
min(x1,x2,...,xn)   the minimum of x1, x2, ..., xn, ignoring missing values
round(x)            x rounded to the nearest whole number
sqrt(x)             the square root of x if x >= 0
These functions are automatically applied to all observations when the argument is a variable in
your dataset.
Stata also has a function to generate random numbers (useful in simulation), namely uniform(). It
also has an extensive set of functions to compute probability distributions (needed for p-values)
and their inverses (needed for critical values), including normal() for the normal cdf and
invnormal() for its inverse, see help density functions for more information. To simulate normally
distributed observations you can use
. generate z = rnormal() // or invnormal(uniform())
There are also some specialized functions for working with strings, see help string functions, and
with dates, see help date functions.
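A small simulation shows these functions at work. The seed, the number of observations and the variable names z and u are arbitrary choices for this sketch; runiform() is the newer name for the uniform random number function.

```stata
clear
set seed 12345          // any seed; fixing it makes the draws reproducible
set obs 1000            // create 1000 empty observations

generate z = rnormal()  // standard normal draws
generate u = runiform() // uniform draws on the unit interval
summarize z u

display normal(1.96)      // normal cdf at 1.96, roughly 0.975
display invnormal(0.975)  // inverse cdf, roughly 1.96
```

The sample mean and standard deviation of z should be close to 0 and 1, but they will not match exactly in a finite sample.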
Recoding Variables
The recode command is used to group a numeric variable into categories. Suppose for example a
survey has age in single years, and you would like to code it into 5-year age groups.
. recode age (15/19=1) (20/24=2) (25/29=3) (30/34=4) ///
> (35/39=5) (40/44=6) (45/49=7), gen(age5)
Each expression in parentheses is a recoding rule, and consists of a list or range of values, followed
by an equal sign and a new value. A range, specified using a slash, includes the two boundaries,
so 15/19 is 15 to 19, which could also be specified as 15 16 17 18 19 or even 15 16 17/19. You
can use min to refer to the smallest value and max to refer to the largest value, as in min/19 and
44/max. The parentheses can be omitted when the rule has the form range=value, but they usually
help make the command more readable.
Values are assigned to the first category where they fall. Values that are never assigned to a
category are kept as they are. You can use else (or *) as the last clause to refer to any value not
yet assigned. Alternatively, you can use missing and nonmissing to refer to unassigned missing
and nonmissing values; these must be the last two clauses and cannot be combined with else.
You can also specify value labels in each recoding rule. This is simpler and less error prone than
creating the labels in a separate statement. The option label(label_name) lets you assign a name
to the labels created (the default is the same as the variable name).
. recode age (0/4=1 Under5) (5/14=2 young) (15/24=3 adolescent) ///
> , generate(ageg) label(ageg)
It is often a good idea to cross-tabulate original and recoded variables to check that the
transformation has worked as intended.
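One way to carry out such a check is sketched below on a handful of invented ages. With real survey data the cross-tabulation can be large, so the tabstat line gives a more compact view of the age range inside each group.

```stata
clear
input age
15
19
20
34
49
end

recode age (15/19=1) (20/24=2) (25/29=3) (30/34=4) ///
    (35/39=5) (40/44=6) (45/49=7), gen(age5)

tabulate age age5                                  // original against recoded values
tabstat age, by(age5) statistics(min max)          // smallest and largest age per group
```

If the minimum and maximum in each group match the intended boundaries, the recoding rules did what was expected.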
Managing Stata Files

Once you have created a Stata system file you will want to save it on disk using save filename,
replace, where the replace option, as usual, is needed only if the file already exists. To load a
Stata file you have saved in a previous session you issue the command use filename.
The clear command deletes the data currently held in memory and any value labels you
might have. Typing clear all also removes other objects, such as matrices, stored results
and programs, from memory, ensuring that you start
with a completely clean slate.
To delete cases and variables the drop command is used, e.g. drop varlist for deleting
variables and drop if exp or drop in range [if exp] for deleting observations or
cases.
Similarly, you may specify the variables or observations you want to keep, using the keep command,
e.g. keep varlist for keeping variables and keep if exp or keep in range [if
exp] for keeping observations or cases.
Changing data types
After importing data into Stata from external sources, variables containing only numbers
are sometimes saved in string format. Thus, no arithmetic operations can be performed
with such a variable. The conversion can then be performed with the destring command:
destring [varlist], {generate(newvarlist)|replace}
Variables that are already numeric will not be changed. If any nonnumeric character is
found in a variable, that variable will not be changed. It might be necessary to remove or replace
non-numeric characters prior to converting the string variable into numerical format (e.g., decimal
separator comma instead of period).
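Two options of destring cover the most common cases of stray characters. The data and the variable names income_str and price_str below are invented for this sketch.

```stata
clear
input str10 income_str str10 price_str
"1,200" "3,5"
"950"   "12,25"
end

* Thousands separators: strip the commas before converting
destring income_str, generate(income) ignore(",")

* Decimal commas: treat the comma as the decimal point
destring price_str, generate(price) dpcomma

list
```

Using generate() rather than replace keeps the original strings, so the conversion can be checked against them.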
Data Merging
Adding cases
The append command is used to join another dataset to the end of the one currently in use. If the data
types of the variables are not the same Stata will promote data types, and it will keep all
variables when the two datasets have differing variable names:
Syntax: append using filename.dta [, options]
E.g.: append using newdata.dta
Adding variables
Datasets sharing the same kind of observations, but having different variables, can be
joined with the merge command. The currently used dataset is then extended with the
corresponding observations from another file, using one of the following forms:
One-to-one merge
Syntax: merge 1:1 varlist using filename.dta [, options]
Many-to-one merge
Syntax: merge m:1 varlist using filename.dta [, options]

One-to-many merge
Syntax: merge 1:m varlist using filename.dta [, options]
Many-to-many merge
Syntax: merge m:m varlist using filename.dta [, options]
One-to-one merge by observation

Syntax: merge 1:1 _n using filename.dta [, options]
The current/working and “using” datasets need to share at least one common variable, called
the primary key/matching variable. The match variable(s) is (are) defined in varlist.
The merging options handle the issue of variables and observations in the incoming data
set.
The option keepusing(varlist) specifies the variables to be kept from the “using”
dataset. If keepusing() is not specified, all variables are kept.
The option keep(master match) causes merge to drop observations from the “using” dataset that have
no corresponding observation in the working data set (the older merge syntax of Stata 10
and earlier called this option nokeep).
After merging, Stata automatically generates a variable _merge which contains information
about the matching of the data with the following codes:
1 Observations only from active dataset
2 Observations only from “using” dataset(s)
3 Observations from active and “using” dataset(s)
4 Observations from both, missing values updated
5 Observations from both, conflicting nonmissing values
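A one-to-one merge and the resulting _merge variable can be tried on two small made-up files. The file names households and incomes, the key hhid and the values are all assumptions for this sketch.

```stata
* Build and save the first file
clear
input hhid hhsize
1 4
2 6
3 3
end
save households, replace

* Build and save the second file
clear
input hhid income
1 250
2 400
4 120
end
save incomes, replace

* Merge on the key variable and inspect the result
use households, clear
merge 1:1 hhid using incomes
tabulate _merge        // hhid 3 is master only, hhid 4 is using only, 1 and 2 match
keep if _merge == 3    // keep matched households only
drop _merge
```

Dropping _merge at the end matters because a later merge would otherwise fail when it tries to create the variable again.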
Data Exportation
Data can be exported from Stata to a tab-separated text file with outsheet using
filename.
Data can be exported from Stata to an Excel worksheet with export excel using
filename.
Exporting Results to Excel
The putexcel command writes Stata expressions, matrices, tables, images, and returned results
to an Excel file. This allows you to automate exporting and formatting of, for example, Stata
estimation results. Learn more about the putexcel command using help putexcel

3. Stata Graphics
Stata has excellent graphic facilities, accessible through the graph command, see help graph for an
overview or help graph_intro.
The syntax of the graph commands is different from that of most other Stata commands.
A Stata command for creating graphs comprises two elements: the graph command and a
graph type
e.g graph box hhsize
The most common graphs in statistics are X-Y plots showing points or lines. These are available in Stata
through the twoway subcommand, which in turn has 31 sub-subcommands or plot types.
For the twoway graph type, a plot type must also be specified. Here is an example with the
plot type scatter:
To produce a simple scatterplot of y by x you use the command
graph twoway scatter yvar xvar
. graph twoway scatter hhsize income
For the twoway graph type, you can leave out graph to save typing. For the most important
plot types, scatter and line, you can even leave out twoway. The following
commands are therefore identical to the one given above:
. twoway scatter hhsize income
. scatter hhsize income
The plot types of the twoway graph type can be overlaid. Here is an example with
scatter and lfit:
. graph twoway (scatter hhsize income) (lfit hhsize income)
Here both types are set in parentheses. However, you can also separate the plot types with ||
as in the following example, where we also leave out graph and twoway:
. scatter hhsize income || lfit hhsize income
Stata 10 introduced a graphics editor that can be used to modify a graph interactively.
Graph elements
The appearance of graphs is defined by a series of elements.
• Elements that control the display of data, including the shape, color, and size of
the “marker symbols”, as well as lines, bars, and other ways to display data.
• Elements that control the size and shape of the graph, including the “graph region”
and “plot region”. The graph region is the size of the entire graph, including titles,
legends, and surrounding text. The plot region is the space that can hold data
points. Just inside the plot region is the data region. The plot region’s size
is determined by the axes, whereas the data region’s size is determined by the
minimums and maximums of the variables being plotted.
• Elements that convey additional information within the graph region, including,
for instance, reference lines for crucial values, marker symbol labels, or any other
text in the plot region.
• Information outside the plot region, which affects the appearance of the axes that
border the plot region on the left (y axis), bottom (x axis), top (upper axis),
and right (right axis). The appearance of information outside the plot region is
controlled by various elements, for example, tick lines, axis labels, and axis titles.
Fitted Lines
Suppose we want to show the fitted regression line as well. In some packages you would need to
run a regression, compute the fitted line, and then plot it. Stata can do all that in one step using
the lfit plot type. (There is also a qfit plot for quadratic fits.) This can be combined with the
scatter plot by enclosing each sub-plot in parentheses. (One can also combine plots using two
vertical bars, ||.)
graph twoway (scatter yvar xvar)(lfit yvar xvar)
Now suppose we wanted to put confidence bands around the regression line. Stata can do this
with the lfitci plot type, which draws the confidence region as a gray band. (There is also a
qfitci band for quadratic fits.) Because the confidence band can obscure some points, we
draw the region first and the points later:
graph twoway (lfitci yvar xvar)(scatter yvar xvar)
Note that this command doesn't label the y-axis but uses a legend instead. You could specify a
label for the y-axis using the ytitle() option, and omit the (rather obvious) legend using
legend(off).
graph twoway (lfitci yvar xvar) (scatter yvar xvar), ytitle("Y variable") legend(off)
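For a version that can be run as it stands, here is a small sketch using the auto.dta sample file shipped with Stata. The variables mpg and weight stand in for yvar and xvar and are not part of the household data used elsewhere in these notes.

```stata
* Illustrative only: mpg and weight come from the auto.dta sample file
sysuse auto, clear

* Confidence band first, points second, so the band does not hide the points
graph twoway (lfitci mpg weight) (scatter mpg weight), ytitle("Mileage (mpg)") legend(off)
```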
Titles, Legends and Captions
There are options that apply to all two-way graphs, including titles, labels, and legends. Stata
graphs can have a title() and subtitle(), usually at the top, and a legend(), note()
and caption(), usually at the bottom; type help title_options to learn more. Usually a
title is all you need. Since Stata 11, text in graphs can include bold, italics, Greek letters,
mathematical symbols, and a choice of fonts; type help graph text to learn more.
To add a legend identifying the linear fit and the 95% confidence interval, use the order(2
"linear fit" 1 "95% CI") suboption of legend() to label the second and first items in that
order. (In a do file you can split the command across lines with /// as follows.)
graph twoway (lfitci yvar xvar) ///
    (scatter yvar xvar, mlabel(varname)) ///
    , title("Title of graph") ///
    ytitle("Y-axis title") ///
    legend(ring(0) pos(5) order(2 "linear fit" 1 "95% CI"))
The mlabel(varname) option is used to label the points using text stored in another
variable.

A Simple Line Plot
The simplest plot uses all the defaults:
graph twoway line y1 y2 ... ym x
The line plot allows you to specify more than one "y" variable. The variables are separated by
spaces, not commas, and are listed in the order y1 y2 ... ym x, with the x variable last.
Alternatively, we could have used several line plots: (line y1 x) (line y2 x) etc.
Titles and Legends
graph twoway line y1 y2 x ///
    , title("My title") subtitle("X's") ///
    legend( order(1 "category 1" 2 "category 2") )
Here we used three options, which as usual in Stata go after a comma: title, subtitle and
legend. The legend option has many suboptions; we used order to list the keys and their
labels, saying that the first line represents category 1 and the second category 2. To omit a key
you just leave it out of the list. To add text without a matching key use a hyphen (or minus sign)
for the key. There are many other legend options; see help legend_options to learn more.
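The legend suboptions described here can be tried on the uslifeexp.dta sample file shipped with Stata. This is only a sketch: the variables le_wmale and le_bmale (life expectancy of white and black males) and the label texts are chosen for illustration. Because it uses /// continuation lines, run it from a do file.

```stata
* Illustrative only: uslifeexp.dta ships with Stata
sysuse uslifeexp, clear

* The hyphen adds the text "Race:" to the legend without a matching key
graph twoway line le_wmale le_bmale year, ///
    title("Life expectancy at birth") subtitle("U.S. males") ///
    legend(order(- "Race:" 1 "White" 2 "Black"))
```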
Line Styles
Stata lets you control the line style in different ways. The clstyle() option lets you use a
named style, such as foreground, grid, yxline, or p1-p15 for the styles used by lines 1 to
15, see help linestyle. This is useful if you want to pick your style elements from a scheme,
as noted further below.
Alternatively, you can specify the three components of a style: the line pattern, width and color:
• Patterns are specified using the clpattern() option. The most common patterns are
solid, dash, and dot; see help linepatternstyle for more information.
• Line width is specified using clwidth(); the available options include thin, medium
and thick; see help linewidthstyle for more.
• Colors can be specified using the clcolor() option using color names (such as red,
white and blue, teal, sienna, and many others) or RGB values, see help
colorstyle.
Here's how to specify blue for the first line and red for the second:
graph twoway (line y1 y2 x, clcolor(blue red) ) ///
    , title("My title") subtitle("X's") ///
    legend( order(1 "category 1" 2 "category 2") )
Note that clcolor() is an option of the line plot, so we put parentheses round the line
command and inserted it there.
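The three style components can be combined in a single line plot. The sketch below uses the uslifeexp.dta sample file and the option names lcolor(), lpattern() and lwidth(), which are the spellings used in current Stata documentation; as far as I know the cl-prefixed names described above are still accepted as synonyms. The colours, patterns and widths are arbitrary choices. Run it from a do file because of the /// continuation.

```stata
* Illustrative only: uslifeexp.dta ships with Stata
sysuse uslifeexp, clear

* First line: blue, solid, medium; second line: red, dashed, thick
graph twoway (line le_wmale le_bmale year, ///
    lcolor(blue red) lpattern(solid dash) lwidth(medium thick)), ///
    legend(order(1 "White males" 2 "Black males"))
```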
Scale Options

In a plot of life expectancy over time, it can look as if improvements slowed down a bit in the
second half of the century. This can be better appreciated using a log scale, where a straight line
indicates a constant percent improvement. This is easily done using the axis options of the twoway
command, see help axis_options, and in particular yscale(), which lets you choose arithmetic, log,
or reversed scales. There is also a suboption range() to control the plotting range; for example,
yscale(log range(25 80)) sets the y-range to 25 to 80, which moves the curves up a bit.
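A minimal sketch of such a command, assuming the uslifeexp.dta sample file shipped with Stata is a suitable stand-in for the life expectancy data discussed here (the variables plotted and the tick marks in ylabel() are illustrative choices):

```stata
* Illustrative only: uslifeexp.dta ships with Stata
sysuse uslifeexp, clear

* Log scale on the y axis, with the plotting range extended to cover 25 to 80
graph twoway line le_wmale le_bmale year, yscale(log range(25 80)) ylabel(30(10)80)
```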
Graph Schemes
Stata uses schemes to control the appearance of graphs, see help scheme. You can set the
default scheme to be used in all graphs with set scheme scheme_name. You can also redisplay the
(last) graph using a different scheme with graph display, scheme(scheme_name).
To see a list of available schemes type graph query, schemes. Try s2color for screen
graphs, s1manual for the style used in the Stata manuals, and economist for the style used in
The Economist. Using the latter we obtain the graph shown at the start of this section.
graph display, scheme(economist)
Combining graphs
Stata allows you to combine as many graphs as you want into a joint graph. To do this,
you first save the individual graphs and then combine them using graph combine. We
demonstrate this with scatterplots of hhsize against income, drawn separately for
respondents from two districts:
. scatter hhsize1 income1, name(dist1, replace)
. scatter hhsize2 income2, name(dist2, replace)
. graph combine dist1 dist2
It is important to set the number of rows and columns in the combined graph. The individual
graphs are placed in the combined graph in rows and columns in a matrix-like fashion. The
positioning of the individual graphs depends on how many rows and columns the matrix has.
. graph combine dist1 dist2, rows(2)
The by() option displays separate graphs for each group defined by the variable in the
parentheses. If more than one variable is entered in the parentheses, graphs are provided
for every combination of the chosen variables. The variables are separated by spaces; anything
after a comma inside by() is treated as a suboption, not as another variable.
. scatter hhsize income, by(sexhhh benstatus)
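Both approaches can be tried on the auto.dta sample file shipped with Stata. In this sketch the graph names domestic and imported are arbitrary, and foreign is the grouping variable that comes with that file.

```stata
* Illustrative only: auto.dta ships with Stata
sysuse auto, clear

* Approach 1: store two named graphs in memory, then combine them side by side
scatter mpg weight if foreign == 0, name(domestic, replace) title("Domestic")
scatter mpg weight if foreign == 1, name(imported, replace) title("Foreign")
graph combine domestic imported, rows(1)

* Approach 2: let by() draw one panel per category of foreign
scatter mpg weight, by(foreign)
```

With by() the panels share axis scales by default, which makes groups easier to compare; graph combine gives more control over each individual graph.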
4. Practise with a Sample Data File

Stata comes with a few sample data files. You can load one of the sample files, named lifeexp.dta,
which has data on life expectancy and gross national product (GNP) per capita in 1998 for 68
countries. To see a list of the files shipped with Stata type sysuse dir.
To load the file type sysuse lifeexp (the file extension is optional).
To see what's in the file type describe. (This command can be abbreviated to a single letter but I prefer
desc.)
. sysuse lifeexp
. desc
How many variables does the data set contain?

To see notes about the data set type:
. notes
. notes varname

Descriptive Statistics
We use the command summarize followed by the names of the variables to get descriptive
statistics for those variables; with no variable names it summarizes every variable in the
data set. Type the following commands and note the difference:
. summarize
. summarize lexp gnppc
Computing New Variables
We compute a new variable using the generate command with a new variable name and an
arithmetic expression. For example to compute natural logs we use the built-in function log:
. gen loggnppc = log(gnppc)

(5 missing values generated)
Simple Linear Regression
To run a linear regression of a dependent variable on an independent variable, use the regress
command, which lists the outcome followed by the predictors, e.g.
. regress lexp loggnppc
Stata omits observations that are missing the outcome or one of the predictors.
Interpret the results.
Following a regression or any estimation command you can retype the command with no
arguments to see the results again. Try typing reg.
Post-Estimation Commands
Stata has a number of post-estimation commands that build on the results of a model fit. A
useful command is predict, which can be used to generate fitted values or residuals
following a regression.
. predict plexp
(option xb assumed; fitted values)
(5 missing values generated)
The command generates a new variable, plexp, that has the life expectancy predicted from
our regression equation.
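Residuals are obtained the same way by adding the residuals option. The sketch below is self-contained, so it reloads the data and refits the model; the names fitted_lexp and resid_lexp are hypothetical and chosen so that they do not clash with plexp.

```stata
* Illustrative only: reload the sample file and refit the regression
sysuse lifeexp, clear
generate loggnppc = log(gnppc)
regress lexp loggnppc

* Fitted values (the default, option xb) and residuals
predict fitted_lexp
predict resid_lexp, residuals

* Residuals from a least-squares fit with a constant average to zero
summarize resid_lexp
```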
Plotting the Data and a Linear Fit
A common task is to superimpose a regression line on a scatter plot to inspect the quality of
the fit. We could do this using the predictions we stored in plexp, but Stata's graph
command knows how to do linear fits on the fly using the lfit plot type, and can
superimpose different types of twoway plots. Try the command
. graph twoway (scatter lexp loggnppc) (lfit lexp loggnppc)

In this command each expression in parentheses is a separate two-way plot to be overlaid in
the same graph.
Listing Selected Observations
It's hard not to notice the country on the bottom left of the graph, which has much lower life
expectancy than one would expect, even given its low GNP per capita. To find which country
it is we list the (names of the) countries where life expectancy is less than 55:
. list country lexp plexp if lexp < 55, clean
The keyword clean after the comma is an option which omits the borders on the listing.
Many Stata commands have options, and these are always specified after a comma.
. list gnppc loggnppc lexp plexp if country == "United States", clean
       gnppc   loggnppc   lexp      plexp
 58.   29240   10.28329     77   77.88277
Here we restricted the listing to cases where the value of the variable country was "United
States". Note the use of a double equal sign in a logical expression. In Stata x = 2 assigns the
value 2 to the variable x, whereas x == 2 checks to see if the value of x is 2.
Changing from a string to a numeric variable
destring var_name, replace
or
destring var_name, generate(newvariable_name)
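A small self-contained sketch of the second form, using a made-up string variable income_txt typed in with input (the variable names and values are hypothetical):

```stata
* Illustrative only: build a tiny data set with numbers stored as text
clear
input str5 income_txt
"1500"
"4500"
"2800"
end

* Create a numeric copy and keep the original string variable
destring income_txt, generate(income)

* income is now numeric, so arithmetic and summaries work on it
describe income_txt income
summarize income
```

Using generate() keeps the original variable for checking, whereas replace overwrites it in place.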
PRACTICAL
Consider the following dataset:
                                                          Sources of income
hhid  Sex of  Town      Marital status  Opinion   Age of  Salary  Rentals  Petty  Other  Ownership  Weight  Height
      HHH               of HHH                    HHH                      trade         of house
1     Male    Harare    Divorced        agree     28      160     75       680    1500   Yes        65      1.87
2     Female  Bulawayo  Married         disagree  45      640     160      2600   4500   Yes        73      1.74
3     Female  Mutare    Single          disagree  33      468     130      900    2800   Yes        57      1.64
4     Male    Bulawayo  Married         agree     38      537     140      1985   5000   No         66      1.52
5     Male    Bulawayo  Widowed         agree     51      615     180      2355   4800   Yes        82      1.48
6     Male    Mutare    Married         agree     37      512     490      1200   3000   No         94      1.81
7     Female  Harare    Married         disagree  41      530     100      720    1900   Yes        59      1.69
8     Female  Bulawayo  Divorced        agree     57      640     180      2100   5200   Yes        88      1.65
9     Male    Mutare    Married         agree     66      710     150      1000   3400   Yes        76      1.21
10    Female  Mutare    Widowed         disagree  40      520     180      1200   4000   Yes        61      1.43
11    Male    Harare    Married         agree     71      825     640      500    6400   No         63      1.78
12    Male    Mutare    Married         disagree  60      630     200      800    4200   No         98      1.56
Create a do file with all the commands for the following procedures
(i) Log your work
(ii) Supply the data by direct input in STATA
(iii) Label all the variables and label values for all categorical variables
(iv) Save the data file in STATA as income.dta
(v) Construct frequency tables for all categorical variables
(vi) Construct bar charts for all the categorical variables, with bars showing percentages
(vii) Construct a stacked bar for the sources of income by town, inserting percentages in each
sector
(viii) Construct a scatter plot for age and salary and comment. Include the line of best fit with
95% confidence bands.
(ix) Calculate the Pearson’s product moment correlation coefficient for age and salary and
comment
(x) Fit a linear regression between age and salary and comment.
(xi) Construct a clustered bar chart for town and marital status and comment
(xii) Split your data by marital status and produce a frequency table for opinion
(xiii) Determine the measures of central tendency for households whose household heads
(HHH) are aged between 40 and 60
(xiv) Compute measures of dispersion for all households.
(xv) Compute the total income and all the quartiles for total income and comment
(xvi) Construct a box-and-whisker plot for ‘other’ income sources
(xvii) Recode age into a different variable with categories 21-30, 31-40, 41-50, 51-60, 61-70,
71-80 and produce a frequency distribution for this new variable.
(xviii) Compute the Body Mass Index (BMI) for each HHH, where $\mathrm{BMI} = \dfrac{\text{Weight}}{\text{Height}^{2}}$

Assignment: Due date 19 January 2022. Submit your STATA data file,
do-file, log file and a report in MS Word/PDF.
Instructions: Work in groups of at most 3 (PLEASE maintain the same groups you
worked in for the SPSS assignment).
Do the question you did not do for the SPSS assignment: assist the client by
doing an analysis for them in STATA, produce the syntax for your analysis and write a
report on your analysis in MS Word or LaTeX.
Submit the data file, do-file, log file and report (MS Word/PDF) on the Google Forms link
https://forms.gle/3eaMuvc1WfwgxVVs8
The report should have a cover page with a UZ logo and the names, regnos and
programmes of the group members. Save your files using all the regnos
separated by _
Q1.
Consider a database that contains an entire quarter’s sales transactions related
to an auto dealership’s sales force. The database documents the details of individual
sales of autos as they occur. Such data will likely include the name of the salesperson, the
vehicle sold, the amount paid for the vehicle, the commission earned by the
salesperson, any rebates or bonuses the buyer receives on the sale, and the amount of
time from first contact with the customer until the eventual sale. The major aim of the manager
is to understand the performance of their team in every respect.
Auto Sales Data: 01/01/2005 to 01/31/2005
Rcd No.  Slsprn   Date      Make    Model  Amt paid  Rebates  Sales com
1        Bill     01/02/05  Ford    Wgn    24,000    2500     2150
2        Henry    01/02/05  Toyota  Sdn    26,500    1000     2550
3        Harriet  01/03/05  Audi    Sdn    34,000    0        3400
4        Ahmad    01/06/05  Audi    Cpe    37,000    0        5550
5        Ahmad    01/06/05  Ford    Sdn    17,500    2000     2325
6        Henry    01/08/05  Toyota  Trk    24,500    1500     2300
7        Lupe     01/10/05  Ford    Wgn    23,000    2500     2050
8        Piego    01/12/05  Ford    Sdn    14,500    500      1400
9        Kenji    01/13/05  Toyota  Trk    27,000    1200     2580
10       Ahmad    01/14/05  Audi    Cpe    38,000    0        5700
11       Kenji    01/16/05  Toyota  Trk    28,500    1500     2700
12       Bill     01/16/05  Toyota  Sdn    23,000    2000     2100
13       Kenji    01/18/05  Ford    Wgn    21,500    1500     2000
14       Ahmad    01/19/05  Audi    Sdn    38,000    0        5700
15       Bill     01/19/05  Ford    Wgn    23,000    1000     2200
16       Kenji    01/21/05  Toyota  Trk    26,500    1500     2500
17       Lupe     01/24/05  Ford    Sdn    13,500    500      1300
18       Piego    01/25/05  Ford    Sdn    12,500    500      1200
19       Bill     01/26/05  Toyota  Trk    22,000    1000     2100
20       Ahmad    01/29/05  Audi    Cpe    36,500    0        5475
21       Bill     01/31/05  Ford    Sdn    12,500    500      1200
22       Piego    01/31/05  Ford    Sdn    13,000    500      125
LS 2022_UZ stats (Maths & Computational Sciences dept)
Q2.
A parent/guardian has decided to monitor how much their child spends during a certain
semester. The second semester will constitute a data collection period to study expenditures.
The main aim is to learn from this semester and use that as a basis for the parent’s advice to
the student (child) regarding their future spending. During the semester under study, 60
expenditures were incurred. The parent’s goal for the analysis is the determination of how and
why expenditures occur over time.
1. What types of purchases are being made?
2. Are there spending patterns occurring during the week, month, and semester?
3. How are the payments of expenditures divided among the credit card and cash?
4. Finally, can some of the expenditures be identified as unnecessary?
The pair agree on 6 data fields to describe each observation:
(1) the number identifier of the week (1–15) for the 15-week semester in which
the expenditure occurs, (2) the date, (3) the weekday (Sunday = Sn, Monday = M,
etc.) corresponding to the date, (4) the amount of the expenditure in dollars,
(5) whether cash (C) or credit card (R) was used for payment, and finally, (6) one
of three categories of expenditure types: food (F), personal (P), and school (S).
Table 2.4 2nd semester university student expenses
Obs.  Week  Date    Weekday  Amount  Cash/Credit Card  Food/Personal/School
1     1     6-Jan   Sn       111.46  R                 F
2     1     7-Jan   M        43.23   C                 S
3     1     8-Jan   T        17.11   C                 S
4     1     10-Jan  Th       17.67   C                 P
5     2     13-Jan  Sn       107.00  R                 F
6     2     14-Jan  M        36.65   C                 P
7     2     14-Jan  M        33.91   C                 P
8     2     17-Jan  Th       17.67   C                 P
9     2     18-Jan  F        41.17   R                 F
10    3     20-Jan  Sn       91.53   R                 F
11    3     21-Jan  M        49.76   C                 P
12    3     21-Jan  M        32.97   C                 S
13    3     22-Jan  T        14.03   C                 P
14    3     24-Jan  Th       17.67   C                 P
15    3     24-Jan  Th       17.67   C                 P
16    4     27-Jan  Sn       76.19   R                 F
17    4     31-Jan  Th       17.67   C                 P
18    4     31-Jan  Th       17.67   C                 P
19    4     1-Feb   F        33.03   R                 F
20    5     3-Feb   Sn       66.63   R                 F
21    5     5-Feb   T        15.23   C                 P
22    5     7-Feb   Th       17.67   C                 P
23    6     10-Feb  Sn       96.19   R                 F
24    6     12-Feb  T        14.91   C                 P
25    6     14-Feb  Th       17.67   C                 P
26    6     15-Feb  F        40.30   R                 F
27    7     17-Feb  Sn       96.26   R                 F
28    7     18-Feb  M        36.37   C                 S
29    7     18-Feb  M        46.19   C                 P
30    7     19-Feb  T        18.03   C                 P
31    7     21-Feb  Th       17.67   C                 P
32    7     22-Feb  F        28.49   R                 F
33    8     24-Feb  Sn       75.21   R                 F
34    8     24-Feb  Sn       58.22   R                 F
35    8     28-Feb  Th       17.67   C                 P
36    9     3-Mar   Sn       90.09   R                 F
37    9     4-Mar   M        38.91   C                 P
38    9     8-Mar   F        39.63   R                 F
39    10    10-Mar  Sn       106.49  R                 F
40    10    11-Mar  M        27.64   C                 S
41    10    11-Mar  M        34.36   C                 P
42    10    16-Mar  S        53.32   R                 S
43    11    17-Mar  Sn       111.78  R                 F
44    11    19-Mar  T        17.91   C                 P
45    11    23-Mar  S        53.52   R                 P
46    12    24-Mar  Sn       69.00   R                 F
47    12    28-Mar  Th       17.67   C                 P
48    13    31-Mar  Sn       56.12   R                 F
49    13    1-Apr   M        48.24   C                 S
50    13    4-Apr   Th       17.67   C                 P
51    13    6-Apr   S        55.79   R                 S
52    14    7-Apr   Sn       107.88  R                 F
53    14    8-Apr   M        47.37   C                 P
54    14    13-Apr  S        39.05   R                 P
55    15    14-Apr  Sn       85.95   R                 F
56    15    16-Apr  T        22.37   C                 S
57    15    16-Apr  T        23.86   C                 P
58    15    18-Apr  Th       17.67   C                 P
59    15    19-Apr  F        28.60   R                 F
60    15    20-Apr  S        48.82   R                 S

