lsun@smu.edu.sg
STATA Training Session 1

Outline
- Computing Resources
- Getting Started with STATA
- Running STATA
- Datasets in STATA
- Data Management with STATA
- Exercise 1
- Data Descriptions & Simple Graphs
- Exercise 2

Download training slides, data, and syntax:
http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/Training%20Slides%20and%20Syntax.aspx

Computing Resources

STATA is a statistical package for managing, analyzing, and graphing data. It
- has both command-driven and menu-driven interfaces
- is cross-platform: Windows, Unix, and Mac
- comes in three flavors:
  - the standard Intercooled STATA (2,047 variables)
  - the more limited Small STATA (99 variables)
  - the extended STATA/SE (32,766 variables)

CAC computing resources for STATA users
- Windows: STATA/SE version 10.0, 10-user network perpetual license
  Installation guide: http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/STATA- Software Questions.aspx
- Linux (CAC Beowulf Cluster): STATA/SE version 10.0, unlimited users
  About the CAC Beowulf Cluster: http://research2.smu.edu.sg/CAC/HPC/Wiki/MAIN.aspx
- New features in STATA 10.0: http://www.stata.com/stata10

Getting Started

The main STATA window has four areas:
- Review box
- Variable window
- Command line
- Results window

Getting help in STATA
- Help menu:
  - Contents: a list of command categories and language syntax
  - Help: help for a STATA command, with examples
  - Search: search help by keywords
- From the command line:
    help list
    search logistic models
    net search multilevel model
- User-written programs (SJ, STB, STATAlist, and others):
    help net_mnu

Website resources
- The STATA website: http://www.stata.com
- The STATA Journal (reviewed papers, regular columns, user-written software): http://www.stata-journal.com/
- STATA FAQ: http://www.stata.com/support/faqs
- STATA User Support: http://www.stata.com/support
- Books: http://www.stata.com/bookstore/statabooks.html

CAC STATA support
- Website: http://research2.smu.edu.sg/CAC/StatisticalComputing/Wiki/STATA.aspx
- Contact:
  - For statistical consultation: Sun Li, lsun@smu.edu.sg
  - For software installation: TAN SuhWen, swtan@smu.edu.sg

Running STATA
- Files in STATA
- Commands and Output
- STATA Variable Definitions
- Missing Values in STATA
- Expressions and Functions
- Memory Consideration

Files in STATA
- .dta: STATA dataset. STATA can also read from and write to ASCII (plain-text) files, such as tab-delimited text files saved from Excel.
- .do: STATA do-file (command file). Do-files can be viewed and edited in any text editor, such as Notepad.
- .log, .smcl: STATA log file (output file). Log files record the commands and analysis results displayed in the Results window, including error messages. Plain-text log files (.log) can be viewed and edited in any text editor.
- .gph: STATA graph file

Commands and Output

STATA is driven by commands, in one of two modes:
- Batch mode: do-file
- Interactive mode: command line
  E.g.: verinst (verifies the version and installation of STATA)

To save results, use log files:
- From the menu: File -> Log -> Begin..., View..., or Close.
- The file has a .smcl or .log extension.
- The log records everything shown in the Results window, including commands, results, and error messages.
- If the file already exists, a dialog opens so you can choose whether to overwrite the file with new output or append new output to the existing file.

From the command line:
    cd            // show the current working directory
    cd D:\lsun    // change the working directory to D:\lsun
    dir           // list the files in the current working directory

STATA Variable Definitions

Variable names
- 1-32 characters; 8 or fewer are recommended
- Valid characters: letters a-z (and A-Z), digits 0-9, and the underscore _
- A name must start with a letter (an underscore is also allowed, but discouraged because STATA-generated variables start with an underscore)
- Case-sensitive: lowercase and uppercase letters are distinct, so price and Price are different variables

Variable types
- String (storage types str1 to str80; up to str244 in STATA/SE)
- Numeric (categorical, continuous)

Format of numeric variables

Numeric formats: %w.dg, %w.df, %w.de
- w: the total width, including the decimal point and decimals
- d: the number of decimals

Format       Formula  Example  sqrt(2)    1,000      10,000,000
General      %w.dg    %9.0g    1.414214   1000       1.00e+07
Fixed        %w.df    %9.0f    1          1000       10000000
                      %9.2f    1.41       1000.00    1.00e+07
Exponential  %w.de    %10.3e   1.414e+00  1.000e+03  1.000e+07

Missing Values in STATA
- Missing values are created on input or import when a numeric field is empty, or by an invalid calculation such as division by zero.
- The system missing value is shown as . (a period). Extended missing values are a period followed by a letter: .a, .b, etc.
- Missing values are treated as very large positive numbers, ordered as: . < .a < .b < etc.
- This can lead to mistakes in logical expressions. For example, the condition x > 100 is also true for every observation where x is missing.

Expressions and Functions

Operators

Arithmetic           Relational                   Logical
^  power             >   greater than             !  not
*  multiplication    <   less than                ~  not
/  division          >=  greater than or equal    |  or
+  addition          <=  less than or equal       &  and
-  subtraction       ==  equal
                     !=  not equal
                     ~=  not equal

Help command: help functions

Memory Consideration

When your dataset is very large, you may need to:
- Set the size of memory: set memory
- Set the maximum number of variables: set maxvar
- Set the maximum dimension of matrices: set matsize

E.g.:
    set memory 64m

Parameter  Default  Min    Max
memory     10M      500K
maxvar     5,000    2,047  32,766
matsize    400      10     11,000

Getting started: Q & A

Q1: Is there a way to stop the Results window from breaking output into pages, i.e. how do I get rid of the --more-- message and let the Results window scroll to the last line of output?
Hint: type help set to read about the system parameters.

Q2: Why do I get the error message "no room to add more observations" even after I reset STATA memory to load my dataset?
Hint: Two important considerations:
1) Allocate more memory than the size of the file you are using. Stata needs the extra room to perform commands and calculations.
2) Do not allocate too much memory, or your computer will not have enough memory (RAM) left to perform other tasks.

Datasets in STATA
- Starting Point: A Rectangular Matrix
- Data Input and Output
- Edit Data Properties
- Variable Management
- Data Reorganization
- Date and Time Values in STATA

Starting Point: A Rectangular Matrix

    X11  X12  X13  ...  X1K
    X21  X22  X23  ...  X2K
    ...  ...  ...  ...  ...
    XN1  XN2  XN3  ...  XNK

N observations (rows), K variables (columns)

Data Input and Output

Load a STATA-format dataset:
    use filename [, clear]
    use [varlist] [if] [in] using filename [, clear]
Save the data in memory to a file:
    save [filename] [, save_options]
Clear the dataset from memory:
    clear

Note:
- STATA is case-sensitive. All STATA commands are lowercase.
- STATA holds only one dataset in memory at a time.

varlist: a list of variables separated by blanks
    var1              just one variable
    var1 var2 var3    three variables
    var*              variables whose names start with var
    *var              variables whose names end with var
    var1-var3         var1, var3, and every variable stored between them (here var1, var2, and var3)

if: a condition on the observations
    if mpg>40
    if mpg>40 & income==70
    if mpg>40 | mpg<10

in: a range of observation numbers
    in 1/10           from obs 1 to obs 10
    in 10/l           from obs 10 to the last obs (l is the letter "el", for last)
Note: in accepts only a single range of the form first/last, not a step such as 1(2)9. To select obs 1, 3, 5, 7, 9, use a condition instead: if mod(_n, 2)==1 & _n<=9

Import dataset of other formats
- Stata can import tab-delimited ASCII text files directly.
- Excel can write tab-delimited ASCII text files: choose File > Save As > Save as type: Text (tab delimited)
- To import the text file into STATA: choose File > Import > ASCII data created by a spreadsheet

Example
    sysuse auto, clear    // open the system dataset auto.dta, clearing any dataset in memory
    save auto, replace    // save the data in memory to the working directory, replacing any existing file
    describe              // describe the dataset
    browse                // open the data browser
    edit                  // open the data editor

Edit Data Properties
    generate x=price/mpg                        // create a new variable from an expression
    rename x priceunit                          // rename the variable
    label variable priceunit "price per mpg"    // label the variable
    list priceunit in 1/10                      // list the first 10 obs of priceunit

Variable properties can also be edited from the Data Editor in edit mode.

Data Management with STATA

Variable Management
    recode price (10000/max=5 "10000+") ///
        (6000/10000=4 "6000-9999") (5000/6000=3 "5000-5999") ///
        (4000/5000=2 "4000-4999") (min/4000=1 "-4000"), generate(pricegrp)
    label var pricegrp "price group"
    d pricegrp
    codebook pricegrp
    recode pricegrp (1/2=1 "-5000") ///
        (3/4=2 "5000-9999") (5=3 "10000+"), generate(pricegrp2)
    codebook pricegrp pricegrp2
    save auto, replace

    generate x="F"                 // generate a new string variable with the value F
    replace x="M" in 20/l          // set x to M from obs 20 to the last
    encode x, generate(xcode)      // convert the string variable x to numeric and store it in a new variable
    d x xcode
    browse x xcode
    drop x xcode

Data Reorganization
    sort foreign                       // sort the dataset by the variable foreign
    by foreign: summarize price        // descriptive statistics of price within each foreign group
    bysort foreign: summarize price    // alternative that sorts and groups in one step
    keep price pricegrp foreign        // keep these three variables and drop the rest
    keep in 1/50                       // keep the first 50 obs
    drop if price < 4000               // drop obs with price below 4000
    save price, replace                // save the result as a new dataset

Note: sort only sorts in ascending order.
To sort in descending order, use gsort and put a minus sign before the variable:
    gsort price -mpg    // ascending price, then descending mpg within equal prices

append: combines two files that have the same variables but different obs.
    dir
    use hsmale, clear
    codebook gender
    use hsfemale, clear
    codebook gender
    append using hsmale
    codebook gender
    save hsappend, replace

merge: combines two files that hold different information about the same obs.
    dir
    use hsdem, clear
    sort id
    save, replace
    use hstest, clear
    sort id
    merge id using hsdem
    list
    save hsmerge
    tab _merge

Note: Both files must be sorted beforehand by the matching key (id in the example above), and the matching key must have the same name in both datasets.

Date and Time Values in STATA
    sysuse sp500
    d
    list date in 1/10
    generate year1=year(date)
    generate month1=month(date)
    d date year1 month1

How STATA records dates and times:
- Dates and times are called %t values.
- %t values are numeric integers.
- The integer records the number of time units that have passed since the start of 1960. For a daily date, that is the number of days since 01jan1960.

    generate year2=yofd(date)
    generate month2=mofd(date)
    d date year1 month1 year2 month2
    list date year1 year2 month1 month2 in 1

Note: mofd() returns a monthly date (the number of months since 1960m1), not the calendar month 1-12 that month() returns.

Data Management: Q & A
Q1: Why does my merge produce a dataset with too many observations?
Q2: How do I create dummy variables?
Q3: How can I list, drop, and keep a consecutive set of variables without typing the names individually?
Q4: Why does my do-file or ado-file produce different results every time I run it?
Q5: How do I deal with multiple responses?
http://www.stata.com/support/faqs/data/

Exercise 1

Brief Introduction of Graphics

[Figure 1: The anatomy of a graph, showing the title, subtitle, Y-axis title, X-axis title, plot area, legend (first legend, second legend), and a note in the outer region or background]

    sysuse auto, clear
    twoway (scatter mpg weight if foreign==0, msymbol(diamond) mcolor(green)) ///
           (scatter mpg weight if foreign==1, msymbol(diamond) mcolor(red)), ///
           title("Title: Figure 1") ///
           subtitle("Subtitle: The anatomy of a graph") ///
           ytitle("Y-axis title") xtitle("X-axis title") ///
           note("Note: This is the outer region or background") ///
           legend(title(Legend) label(1 "first legend") label(2 "second legend")) ///
           text(35 3400 "Plot area")

Data Description & Simple Graphs

Describing Datasets
    use auto, clear
    describe

Describing Variables
    codebook
    summarize price mpg weight length
    summarize price mpg, detail
    bysort foreign: summarize price mpg

Command summarize: provides descriptive statistics, with the detail option for additional statistics.

    hist weight, frequency normal
    hist weight, frequency normal start(750) width(250) xlabel(1000(500)5000)

[Figure: two histograms of Weight (lbs.) against Frequency, the first with default bins and the second with bins of width 250 starting at 750 and x-axis labels from 1,000 to 5,000 in steps of 500]

    graph box mpg
    graph box mpg, over(foreign)
    graph bar (mean) mpg trunk, over(pricegrp) ///
        legend(label(1 "mpg") label(2 "trunk")) ///
        blabel(bar, position(inside) format(%9.1f) color(white))

[Figure: box plot of Mileage (mpg) for Domestic and Foreign cars; bar chart of mean mpg and trunk by price group (-4000, 4000-4999, 5000-5999, 6000-9999, 10000+)]

Tabulating Data
    tab1 foreign pricegrp
    tab2 foreign pricegrp
    tab2 foreign pricegrp, row column

Command tab1: provides a one-way frequency table for each variable.
Command tab2: provides a contingency (two-way) table.

    tabulate foreign, summarize(price)
    tabstat price mpg, by(foreign)
    tabstat price mpg, stat(n mean sd p25 p50 p75) by(foreign)

Command tabulate with the summarize() option: tabulates descriptive statistics of a continuous variable within each category.
Command tabstat: displays a table of summary statistics.

    graph pie, over(pricegrp)
    graph pie, over(pricegrp) plabel(_all percent, color(white)) by(foreign)

[Figure: pie charts of price group (-4000, 4000-4999, 5000-5999, 6000-9999, 10000+) with percentage labels, drawn separately for Domestic and Foreign cars; graphs by Car type]

Graphics: Q & A
http://www.stata.com/support/faqs/graphics/

Exercise 2

Next Session: Statistical Analysis
17 Oct, Friday, 9.30am-12pm
Training Room @ Library Level 5
- Data Description and Simple Inference
- Group Comparison and Correlation
- General Linear Regression
- Logistic Model
  - Binary Logistic Model
  - Ordinal Logistic Model
  - Multinomial Logistic Model
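As a closing illustration of the missing-value caveat from the Running STATA section (missing values are treated as very large positive numbers), the short do-file below is a minimal sketch of how a relational condition can silently include missing observations. It assumes the built-in auto dataset, whose rep78 variable contains some missing values; the choice of rep78, the cutoff of 3, and the scalar names are ours for illustration and are not part of the original slides.

```stata
sysuse auto, clear

// Naive count: obs with missing rep78 are included,
// because a missing value compares as larger than any number
count if rep78 > 3
scalar n_naive = r(N)

// Safer count: exclude missing values explicitly
count if rep78 > 3 & !missing(rep78)
scalar n_safe = r(N)

// The gap between the two counts equals the number of missing values
count if missing(rep78)
scalar n_miss = r(N)
display "naive = " n_naive ", safe = " n_safe ", missing = " n_miss
```

The same guard, & !missing(varname), applies to any if condition that uses > or >= on a variable that may contain missing values, for example in generate, replace, keep, and drop.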