
Sun Li

Centre for Academic Computing


lsun@smu.edu.sg
STATA Training Session 1
Outline
Computing Resources
Getting Started with STATA
Running STATA
Datasets in STATA
Data Management with STATA
Exercise 1
Data Descriptions & Simple Graphs
Exercise 2
Download training slides, data, and syntax:
http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/Training%20Slides%20and%20Syntax.aspx
Computing Resources
STATA
is a statistical package for managing, analyzing, and graphing
data.
has both command-driven and menu-driven interfaces
has cross-platform compatibility: Windows, Unix, and Mac.
has three flavors:
the standard Intercooled STATA (up to 2,047 variables)
the more limited Small STATA (up to 99 variables)
the extended STATA/SE (up to 32,767 variables).
CAC Computing Resources for STATA users
Windows:
STATA/SE version 10.0
10-user network perpetual license
Installation guide
(http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/STATA-Software Questions.aspx)
Linux CAC Beowulf Cluster:
STATA/SE version 10.0
Unlimited users
About CAC Beowulf Cluster:
(http://research2.smu.edu.sg/CAC/HPC/Wiki/MAIN.aspx)
New features in STATA 10.0 (http://www.stata.com/stata10)
Getting Started
(Figure: the STATA main window, showing the Review box, Variable window, Command line, and Results window.)
Getting help in STATA
Help menu:
contents: a list of command categories and language syntax
help: help for a STATA command, with examples
search: search help by keywords
From command line:
help list
search logistic models
net search multilevel model
User-written programs SJ, STB, STATAlist, and others:
help net_mnu
Website resources:
The STATA website: http://www.stata.com
The STATA Journal (reviewed papers, regular columns, user-written
software): http://www.stata-journal.com/
STATA FAQ : http://www.stata.com/support/faqs
STATA User Support : http://www.stata.com/support
Books: http://www.stata.com/bookstore/statabooks.html
CAC STATA support:
Website:
http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/STATA.aspx
Contact:
For statistical consultation: Sun Li: lsun@smu.edu.sg
For software installation: TAN SuhWen: swtan@smu.edu.sg
Running STATA
Files in STATA
Commands and Output
STATA Variable Definitions
Missing Values in STATA
Expressions and Functions
Memory Consideration
Files in STATA
.dta STATA dataset
STATA can also read and write ASCII text files, such as tab-delimited files exported from Excel.
.do STATA do-file, command file
Do-files can be edited and displayed with any text editor, such as Notepad.
.log, .smcl STATA log file, output file
Log files record the commands and results displayed in the Results
window, including error messages. Plain-text log files (.log) can be edited and displayed with a
text editor.
.gph STATA graph file
Commands and Output
Command prompt driven in:
Batch mode: do-file
Interactive mode: command line
E.g.: verinst -- verifying version and installation of STATA
To save results: log-files
File -> Log -> Begin..., View..., or Close.
.smcl or .log extension.
Record everything in Results Window, including commands, results, error
messages, etc.
If the file already exists, another dialog opens to allow you to decide whether to
overwrite the file with new output or to append new output to the existing file.
From command:
cd // display current working directory
cd D:\lsun // change working directory to D:\lsun
dir // list files in the current working directory
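The log menu actions described above also have command equivalents. The following is a minimal sketch for a do-file, assuming hypothetical log names (session1 and session1_plain) in the current working directory and no log already open:

```stata
log using session1, replace                  // start session1.smcl, overwriting any existing file
sysuse auto, clear                           // commands and output are now recorded
summarize price mpg
log close                                    // stop recording

log using session1, append                   // reopen the same log and add new output to it
describe
log close

log using session1_plain.log, text replace   // plain-text log, readable in Notepad
display "hello"
log close
```

The replace and append options correspond to the overwrite and append choices offered by the dialog.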
STATA Variable Definitions
Variable names
1 to 32 characters; 8 or fewer are recommended
Valid characters: letters a-z and A-Z, digits 0-9, and underscore _
Names must start with a letter (or an underscore, which is discouraged because
STATA-generated variables start with an underscore).
Names are case-sensitive: lowercase and uppercase letters are distinct
Variable types
String (storage types str1 to str80; up to str244 in STATA/SE)
Numeric (categorical, continuous)
Format of numeric variables
Numeric formats: %w.dg; %w.df; %w.de
w: the total width, including the decimal point and decimals
d: number of decimals
Format       Formula  Example  sqrt(2)    1,000      10,000,000
General      %w.dg    %9.0g    1.414214   1000       1.00e+07
Fixed        %w.df    %9.0f    1          1000       10000000
                      %9.2f    1.41       1000.00    1.00e+07
Exponential  %w.de    %10.3e   1.414e+00  1.000e+03  1.000e+07
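Formats can be tried directly with display, or attached to a variable with format. This is a small sketch using the system dataset auto.dta. Note that a format changes only how a value is shown, not the stored value.

```stata
display %9.2f sqrt(2)        // fixed format with 2 decimals: 1.41
display %10.3e 1000          // exponential format: 1.000e+03
display %9.0g sqrt(2)        // general format

sysuse auto, clear
format price %9.2f           // show price with 2 decimals; stored values are unchanged
list price in 1/3
format price %8.0gc          // general format with comma separators
list price in 1/3
```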
Missing Values in STATA
Missing values are created on input or import when a numeric field is empty,
or by an invalid calculation, e.g. division by zero.
The system missing value is shown as . (a period). Extended missing values are a period followed by a
letter, such as .a, .b, etc.
Missing values are treated as very large positive numbers, ordered as:
all numbers < . < .a < .b < etc.
This can lead to mistakes in logical expressions: for example, x > 100 is true when x is missing.
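The pitfall can be seen with the system dataset auto.dta, where rep78 is missing for 5 cars. This is a minimal sketch; the variable name ratio is only illustrative.

```stata
sysuse auto, clear
count if rep78 > 3                     // also counts the 5 cars whose rep78 is missing
count if rep78 > 3 & !missing(rep78)   // counts only observed values above 3
count if rep78 > 3 & rep78 < .         // same result, using the ordering of missing values

generate ratio = price / 0             // division by zero yields missing, not an error
count if missing(ratio)                // every observation is missing
```

Adding !missing(varname) (or varname < .) to any "greater than" condition is a simple habit that avoids the problem.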
Expressions and Functions
Operators
Arithmetic          Relational                    Logical
^  power            >   greater than              !  not
*  multiplication   <   less than                 ~  not
/  division         >=  greater than or equal     |  or
+  addition         <=  less than or equal        &  and
-  subtraction      ==  equal
                    !=  not equal
                    ~=  not equal
Help command:
help functions
Memory Consideration
When your dataset is very large, you may need to:
Set the size of memory: set memory
Set the maximum number of variables: set maxvar
Set the maximum dimension of matrices: set matsize
E.g., to allocate 64 MB of memory:
set memory 64m
Parameter   Default   Min     Max
memory      10M       500K
maxvar      5,000     2,048   32,767
matsize     400       10      11,000
Getting started: Q & A
Q1: Is there a way to stop the Results window from breaking output into pages, i.e.
how can I get rid of the --more-- message and let the Results window scroll to
the last line of output?
Hint: type help set to learn about system parameters.
Q2: Why do I get the error message "no room to add more observations"
even after I reset STATA memory to load my dataset?
Hint: Two important considerations:
1) Make sure that you allocate more memory than the size of the file that
you are using. STATA needs the extra room to perform commands and
calculations.
2) Make sure that you do not allocate too much memory, or your computer
will not have enough memory (RAM) left to perform other tasks.
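One possible set of answers, sketched under the assumption of STATA 10 as used in this session (the 64m figure is only illustrative; choose a size that fits your data and your machine):

```stata
set more off          // do not pause at --more--; let output scroll to the end
clear                 // memory can be resized only when no data are loaded
set memory 64m        // illustrative size; must exceed the size of the dataset
query memory          // display the current memory settings
```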
Datasets in STATA
Starting Point: A Rectangular Matrix
Data Input and Output
Edit Data Properties
Variable Management
Data Reorganization
Date and Time Values in STATA
Starting Point: A Rectangular Matrix
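\[
\begin{pmatrix}
X_{11} & X_{12} & X_{13} & \cdots & X_{1K} \\
X_{21} & X_{22} & X_{23} & \cdots & X_{2K} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
X_{N1} & X_{N2} & X_{N3} & \cdots & X_{NK}
\end{pmatrix}
\]
N observations (rows)
K variables (columns)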
Data Input and Output
Load STATA-format dataset:
use filename [, clear]
use [varlist] [if] [in] using filename [, clear]
Save data in memory to file:
save [filename] [, save_options]
Clear dataset from memory:
clear
Note:
STATA is case-sensitive.
All STATA commands are lowercase.
STATA holds only one dataset in memory at a time.
varlist: a list of variables separated by blanks.
var1 just one variable
var1 var2 var3 three variables
var* variables starting with var
*var variables ending with var
var1-var3 all variables from var1 through var3, in dataset order
if: conditional expression
if mpg>40
if mpg>40 & income==70
if mpg>40 | mpg<10
in: range of observations
in 1/10 from obs 1 to 10
in 10/l from obs 10 to the last (l is the lowercase letter L)
in -5/l the last 5 obs
Note: in accepts only a single range #/#. To select obs 1, 3, 5, 7, 9, use if mod(_n,2)==1 & _n<=9
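The varlist, if, and in qualifiers can be combined with most commands. This is a short sketch using the system dataset auto.dta:

```stata
sysuse auto, clear
describe m*                                   // wildcard: variables starting with m
summarize price-weight                        // range: price through weight, in dataset order
list make mpg in 1/5                          // first 5 observations
list make mpg in -3/l                         // last 3 observations
list make price if mpg > 30 & foreign == 1    // both conditions must hold
count if mpg > 30 | mpg < 15                  // either condition may hold
```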
Import dataset of other formats
STATA can import tab-delimited ASCII text files directly.
Excel can write tab-delimited ASCII text files:
choose File -> Save As -> Save as type: Text (tab delimited)
Import the text file into STATA:
choose File -> Import -> ASCII data created by a spreadsheet
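The same import can be done from the command line with insheet. The sketch below is self-contained: it first writes a tab-delimited file with outsheet and then reads it back. The file name auto_tab.txt is hypothetical. In practice the text file would come from Excel.

```stata
sysuse auto, clear
outsheet make price mpg using auto_tab.txt, replace   // write a tab-delimited text file with variable names in row 1
insheet using auto_tab.txt, tab clear                 // read it back, replacing the data in memory
describe
```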
Example
sysuse auto, clear //open system dataset auto.dta and clear any dataset in memory if any
save auto, replace //save the data in memory to working directory and replace if any
describe //describe the dataset
browse //open data browser
edit //open data editor
Edit Data Properties
generate x=price/mpg //create new variable from an expression
rename x priceunit //rename variable
label variable priceunit "price per mpg" //label variable
list priceunit in 1/10 //list first 10 obs for the variable priceunit
Edit variable properties from Data Editor in edit mode
Data Management with STATA
Variable Management
recode price (10000/max=5 "10000+") ///
(6000/10000=4 "6000-9999") (5000/6000=3 "5000-5999") ///
(4000/5000=2 "4000-4999") (min/4000=1 "-4000"), generate(pricegrp)
label var pricegrp "price group"
d pricegrp
codebook pricegrp
recode pricegrp (1/2 = 1 "-5000") ///
(3/4=2 "5000-9999") (5=3 "10000+"), generate (pricegrp2)
codebook pricegrp pricegrp2
save auto, replace
generate x="F" //generate a new string variable with value "F"
replace x="M" in 20/l //replace the value of x with "M" from obs 20 to the last
encode x, generate(xcode) //convert string variable x to numeric and save it as a new variable
d x xcode
browse x xcode
drop x xcode
Data Reorganization
sort foreign //sort dataset by variable foreign
by foreign: summarize price //descriptive statistics of price by foreign group
bysort foreign: summarize price //alternative way
keep price pricegrp foreign //keep the three variables and drop the rest
keep in 1/50 //keep the first 50 obs
drop if price < 4000 //drop obs if price < 4000
save price, replace //save it into a new dataset
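Note: sort only sorts in ascending order. To sort in descending order, use gsort with a minus sign before the variable:
gsort price -mpg //ascending price, then descending mpg within price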
dir
use hsmale, clear
codebook gender
use hsfemale, clear
codebook gender
append using hsmale
codebook gender
save hsappend, replace
append: to combine the information from two files with the same variables but different obs.
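For readers without the training files, the idea can be reproduced with two tiny datasets typed in with input. The values and the temporary file name males below are made up for illustration:

```stata
* first file: two male students
clear
input id str6 gender
1 "male"
2 "male"
end
tempfile males
save `males'

* second file: three female students, same variables
clear
input id str6 gender
3 "female"
4 "female"
5 "female"
end

append using `males'      // stack the male obs below the female obs
tabulate gender           // 3 female, 2 male
```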
dir
use hsdem, clear
sort id
save, replace
use hstest, clear
sort id
merge id using hsdem
list
save hsmerge
tab _merge
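Note: Both files must be sorted beforehand by the matching key (id in the example above), and the
matching key must have the same name in both datasets.
merge: to combine the information from two files with different information about the same obs.
The generated variable _merge records whether each obs came from the master file only (1), the using file only (2), or both (3).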
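The following self-contained sketch shows how _merge behaves when the two files do not cover exactly the same ids. It uses the STATA 10 merge syntax from this session. The data values and the temporary file name dem are invented for illustration.

```stata
* demographics: ids 1, 2 and 4
clear
input id female
1 0
2 1
4 1
end
sort id
tempfile dem
save `dem'

* test scores: ids 1, 2 and 3
clear
input id score
1 80
2 75
3 90
end
sort id

merge id using `dem'      // match on id; both files are sorted by id
tabulate _merge           // 1 = scores only, 2 = demographics only, 3 = matched
list, clean
```

Here ids 1 and 2 are matched (_merge==3), id 3 appears only in the scores file (_merge==1), and id 4 only in the demographics file (_merge==2).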
Date and Time Values in STATA
sysuse sp500
d
list date in 1/10
generate year1=year(date)
generate month1=month(date)
d date year1 month1
How STATA records dates and times:
Dates and times are called %t values. %t values are
numerical and integral. The integral value records
the number of time units that have passed since the start of 1960;
for example, a daily (%td) date is the number of days since 01jan1960.
generate year2=yofd(date) //year of a daily date, same as year(date)
generate month2=mofd(date) //monthly date: months since 1960m1, not the calendar month
d date year1 month1 year2 month2
list date year1 year2 month1 month2 in 1
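The difference between month() and mofd() is easiest to see on a single date. This is a small sketch using 02jan2001 as an example date (the local macro name d is arbitrary):

```stata
display mdy(1, 1, 1960)        // 0: day zero of the daily scale
local d = mdy(1, 2, 2001)      // daily date for 02jan2001
display %td `d'                // shown as 02jan2001
display year(`d')              // 2001
display yofd(`d')              // 2001
display month(`d')             // 1: calendar month
display mofd(`d')              // 492: months since 1960m1, i.e. (2001-1960)*12
display %tm mofd(`d')          // the same monthly date shown as 2001m1
```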
Data Management: Q & A
Q1: Why does my merge produce a dataset with too many observations?
Q2: How do I create dummy variables?
Q3: How can I list, drop, and keep a consecutive set of variables without
typing the names individually?
Q4: Why does my do-file or ado-file produce different results every time I
run it?
Q5: How do I deal with multiple responses?
http://www.stata.com/support/faqs/data/
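As a starting point for Q2 and Q3, here is one possible approach sketched with the system dataset auto.dta. The FAQ pages above give fuller answers. The names rep and expensive are only illustrative.

```stata
sysuse auto, clear
tabulate rep78, generate(rep)                               // Q2: one dummy per category: rep1 ... rep5
generate byte expensive = price > 6000 if !missing(price)   // Q2: dummy from a condition
list make-weight in 1/3                                     // Q3: a consecutive set of variables
drop rep1-rep5                                              // Q3: drop a consecutive set
```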
Exercise 1
Brief Introduction of Graphics
(Figure 1: The anatomy of a graph, showing the title, subtitle, plot area, Y-axis title, X-axis title, legend with first and second entries, and a note in the outer region or background.)
sysuse auto, clear
twoway (scatter mpg weight if foreign==0, msymbol(diamond) mcolor(green)) ///
(scatter mpg weight if foreign==1, msymbol(diamond) mcolor(red)), ///
title("Title: Figure 1") subtitle("Subtitle: The anatomy of a graph") ///
ytitle("Y-axis title") xtitle("X-axis title") ///
note("Note: This is the outer region or background") ///
legend(title("Legend") label(1 "first legend") label(2 "second legend")) ///
text(35 3400 "Plot area")
Data Description & Simple Graphs
Describing Datasets
use auto, clear
describe
Describing Variables
codebook
summarize price mpg weight length
summarize price mpg, detail
bysort foreign: summarize price mpg
Command summarize: provides descriptive statistics, with the detail option for additional statistics.
hist weight, frequency normal
hist weight, frequency normal start(750) width(250) xlabel(1000(500)5000)
(Figures: two histograms of Weight (lbs.) with Frequency on the Y-axis and a normal curve overlaid. The first uses default bins; the second uses bins of width 250 starting at 750, with X-axis labels from 1,000 to 5,000 in steps of 500.)
graph box mpg
graph box mpg, over(foreign)
graph bar (mean) mpg trunk, over(pricegrp) ///
legend(label(1 "mpg") label(2 "trunk")) ///
blabel(bar, position(inside) format(%9.1f) color(white))
(Figure: box plots of Mileage (mpg), overall and by car type, Domestic and Foreign.)
(Figure: bar chart of mean mpg and mean trunk by price group, with bar labels as follows.)
pricegrp     mpg    trunk
-4000        26.2   10.5
4000-4999    22.5   12.8
5000-5999    20.6   15.4
6000-9999    20.4   14.5
10000+       15.0   16.6
Tabulating Data
tab1 foreign pricegrp
tab2 foreign pricegrp
tab2 foreign pricegrp, row column
Command tab1: provides a one-way frequency table for each variable.
Command tab2: provides a contingency (two-way) table.
tabulate foreign, summarize(price)
tabstat price mpg, by(foreign)
tabstat price mpg, stat(n mean sd p25 p50 p75) by(foreign)
Command tabulate with summarize(): tabulates descriptive statistics of a continuous variable by group.
Command tabstat: displays a table of summary statistics.
graph pie, over(pricegrp)
graph pie, over(pricegrp) plabel(_all percent, color(white)) by(foreign)
(Figure: pie charts of the price groups -4000, 4000-4999, 5000-5999, 6000-9999, and 10000+, with percentage labels, drawn separately for Domestic and Foreign cars. Graphs by Car type.)
Graphics: Q & A
http://www.stata.com/support/faqs/graphics/
Exercise 2
Next Session
Statistical Analysis
17 Oct Friday, 9.30am-12pm
Training Room @ Library Level 5
Data Description And Simple Inference
Group Comparison And Correlation
General Linear Regression
Logistic Model
Binary Logistic Model
Ordinal Logistic Model
Multinomial Logistic Model
