
Stata Reference Manual

What you should know about Stata after taking the Stata introduction course
A collection of technical hints

Ivan Iachine, Lars Korsholm, Henrik Støvring, Kirstin Vach, Werner Vach

Version 1.5, Feb., 2004

Contents

1 Entering commands
2 Online help
3 Producing output
4 The general syntax of Stata commands
5 Typical errors and error messages
6 Protection of files and data
7 Data checking
8 The graph command
9 Stratification using by
10 Generating new variables
11 Creating subsamples
12 Making tables in Stata
13 Categorization of variables
14 Using Stata as a pocket calculator: The display command
15 Loops in Stata
16 Working with do-files
17 Reshaping datasets
18 Working with string variables
19 Labels
20 Switching between labels, strings and numbers
21 Creating variables with statistics
22 Survival analysis commands
23 Online facilities
24 How to find a statistical method

1 Entering commands
New command: Type the command in the Stata Command window. To execute it, press Enter. See Section 4 for the syntax of Stata commands.
Previous command: Double-click the command in the Review window, or press Page Up until you reach the command you want, then press Enter. In general, Page Up and Page Down browse previously executed commands.
Execute a do-file: See Section 16.

2 Online help
Known command name: Use the help menu or the command help:
. help ttest
-------------------------------------------------------------------------------
help for ttest, ttesti                                      (manual: [R] ttest)
-------------------------------------------------------------------------------

Mean comparison tests
---------------------

    ttest ... varname = # [if exp] [in range] [, level(#) ]

The command whelp opens a new window with the same information and clickable links.
Known name of statistical method: Use the help menu or the command findit:
. findit paired

[R]     ttest . . . . . . . . . . . . . . . . . . . . . Mean comparison tests
        (help ttest)

FAQ     . . . . Comparing the p-values between a paired t test and a signrank
        . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . W. Sribney
        3/97    Is my boss correct in saying that the p-value given with a
                paired ttest should always be lower than the signrank?
                http://www.stata.com/support/faqs/stat/signrank.html

...

The findit command often results in hints to the Stata Technical Bulletin (STB) and to commands you can download from the internet. For more on online facilities see section 23.

3 Producing output
Create a log-file: Use the command log using filename. Execute your commands and finish with the command log close. The output is now stored in filename. Usually the file filename will have the extension .smcl. See help log for further information.

Copy and Paste: Mark the desired output in the Results window, then copy it with Ctrl-C. Paste it into the file of your choice with Ctrl-V. You may also copy and paste in the ordinary Windows fashion using the mouse or menus.

4 The general syntax of Stata commands


Syntax: The general syntax of a command is
commandname varlist selector, options

varlist can be one or several variable names, or it may be empty. Several variables can be given as, say, var1-var5, which means all variables from var1 to var5 in the current order shown by describe, or as var*, which means all variables in the dataset whose names start with the letters var.

selector can be something like
if sex=="m"
if age>18
in 1/3

Any combination of these may be used as selector. Note that the logical "equal to" symbol is a doubled equals sign, ==. in 1/3 means the first through third observation in the dataset (in the current order).

options vary from command to command. They are either single names (e.g. histo) or include additional information in parentheses (e.g. bin(7) or xscale(0,20)).

Note: There is at most one comma separating a Stata command from its options! Further commas may appear only inside the parentheses of an option.

Abbreviations: Usually you can abbreviate command names and options. For example, the following two commands are equivalent:
. regress bweight hypertension ..., robust
. reg bw hyp ..., r

Each command and option has a minimal abbreviation, which you can look up using the help command: the minimal number of letters is underlined in the syntax description. You can also abbreviate variable names to their first letter(s), as long as the abbreviation remains unique. In the example above, bweight and hypertension must be the only variables beginning with bw and hyp.
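The pieces of the general syntax can be tried out on any dataset. The following is a minimal sketch that assumes the auto example dataset shipped with Stata (loaded with sysuse, available in Stata 8 and later); the dataset and its variable names are not part of the course material.

```stata
* Sketch only: the auto dataset is an assumption, not the course data
sysuse auto, clear

* commandname varlist selector, options
summarize price mpg if foreign==1 in 1/60, detail

* A range of variables in dataset order, and a wildcard
describe make-mpg
describe t*

* Abbreviated command name and abbreviated variable names
su pr if fore==1
```

The last line works only because price and foreign are the only variables beginning with pr and fore.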

5 Typical errors and error messages


If you are using the Windows version of Stata, all error messages are shown in red.
Error messages: Error messages try to tell you what may be wrong, for example if you misspell a variable name,
. tab variabble
variabble not found

if you use an incorrect option
. tabulate var1, by(var2)
by() invalid

or if the data are assumed to be sorted but are not
. by var1: tabulate var2
not sorted

Below the error message in red is an r(xxx) code in blue. This code is clickable and provides more details on what might be wrong and what you should do.
The logic of error messages: Stata cannot know what you intend to do; it can only detect errors by syntax checks. This means that you get only indirect hints. For example, if you forget to separate an option by a comma, you will get:
. tabulate var1 var2 chi
chi not found

because Stata believes that you meant chi to be a variable. Or if you forget that by requires parentheses, you get:
. table var1, by var2
by invalid

Here Stata does not realise that you forgot the parentheses; it believes that you tried to use by as a single option. These examples show that error messages are often rather cryptic.

Some typical error messages and what they may indicate:

no; data in memory would be lost
    Possible explanation: changing a dataset without saving
    Solution/example: save the dataset (save newdata.dta) or, if you want to use a new dataset, clear

no variables defined
    Possible explanation: no data loaded
    Solution/example: use data.dta

not sorted
    Possible explanation: before using a by-option the data has to be sorted, or use the bysort command
    Solution/example: sort var1, or bysort var1: ...

xxx not found
    Possible explanation: unknown variable (e.g. incorrect spelling)
    Possible explanation: no comma before option, e.g. tabulate var1 var2 chi
    Possible explanation: no blank after function, e.g. di Binomial (20,10,0.5)

xxx() invalid
    Possible explanation: incorrect/unknown option, e.g. tab var1, by(var2)
    Solution/example: correct is by var2: tab var1

xxx invalid
    Possible explanation: incorrect option (e.g. missing ()), e.g. table var1, by var2

xxx invalid name
    Possible explanation: incorrect syntax (e.g. ; instead of :), e.g. by var1; tab var2

no observations
    Possible explanation: incorrect variable type, e.g. regress STRINGVAR var
    Possible explanation: variable with missings only, e.g. regress MISSINGVAR var

=exp not allowed
    Possible explanation: == is needed, e.g. list var1 if var2=0

type mismatch
    Possible explanation: wrong variable type for this operation, e.g. list var1 if STRINGVAR==0 (string variable) or list var1 if var2=="0" (numeric variable)

The "not enough space to add more ..." error messages: The default installation of Stata starts with a small amount of memory. You have run out of the memory allocated to Stata. The quick solution: save your dataset, clear Stata, add more memory, and load the data again:
. save dummyname
file dummyname.dta saved
. clear
. set memory 16m
(16384k)
. use dummyname
. erase dummyname.dta

where 16m is 16 megabytes of RAM. Select the amount you need. See help memory.
The lasting solution: If you are working with a do-file (and you should!), insert at the top of the file:
set memory 16m

and rerun your do-file.
The (almost) permanent solution: Right-click on the Stata icon, select Properties, and change the path field to, e.g., C:\stata\wstata.exe /m16.

6 Protection of files and data


Stata tries to protect you from yourself so that you do not unintentionally lose data.
The clear and save commands: When you have performed data manipulations and want to analyze a new dataset or exit the session, Stata requires that you decide what to do with your present dataset. You must either save it with save newdata or discard the changes by typing clear. In the latter case Stata also accepts clear as an option, e.g.
. use nextdata, clear

or
. exit, clear

The replace option: When you write to external files that already exist, Stata refuses to overwrite them unless you deliberately use the replace option, e.g.
. log using myfile
file myfile.log already exists
r(602);
. log using myfile, replace

This will overwrite the contents of myfile. The same holds for files containing graphs:
. scatter x y, saving(mygraph, replace)

Note: You can use the replace option even if the corresponding file does not exist.
The replace command: See Section 10.
Be careful with your data!

7 Data checking
Before you analyse your data you should verify that they are as expected.
describe varlist: Gives an overview of your variables, storage types etc.
codebook varlist: Provides detailed information on each variable. See Section 19.
tabulate and list: The commands tab varname and list varname give you on-screen information on varname, but you have to inspect the output yourself and remember what to look for.
The assert command: The assert command lets you automate the checking. The command does nothing if everything is as expected, but stops with an error message if the assertion fails (and stops executing your do-file). Some examples:
Simple arithmetic
. assert 2+2==4
. assert 2>3
assertion is false
r(9);

If the variable age should contain the age in years (integers) and everyone is between 20 and 50:
. assert age==int(age) & age>=20 & age<=50

If the variable sex contains the gender of the person as F or M


. assert sex=="F" | sex=="M"

If datein is the first time an object is observed and dateout is the last time
. assert datein<dateout

This may fail if datein can be missing. If you want to run the check allowing for missing values of datein
. assert datein<dateout if datein!=.

When the assertion fails, list the illegitimate cases, e.g.
. list id sex if !(sex=="F" | sex=="M")

         id   sex
 19.     19     f
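In a do-file it can be useful to record a failed check without stopping immediately. The sketch below uses a small made-up dataset (the values are hypothetical) and the capture prefix, which stores the return code of the captured command in _rc.

```stata
* Hypothetical three-person dataset for illustration
clear
input id str1 sex age
1 "F" 25
2 "M" 31
3 "f" 44
end

* Passes silently: all ages are integers between 20 and 50
assert age==int(age) & age>=20 & age<=50

* capture keeps a failing assert from stopping the do-file
capture assert sex=="F" | sex=="M"
display _rc

* List the offending observations
list id sex if !(sex=="F" | sex=="M")
```

A nonzero _rc signals that the assertion was false; here the lower-case f in observation 3 is the culprit.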

8 The graph command


We refer to Chapter 14 in Introduction to Stata 8 by Svend Juul, available from http://www.biostat.au.dk/teaching/software/

9 Stratification using by
by-option: Many commands in Stata allow or require stratification of your data into groups using the by-option, e.g.
. gr size, box by(sex)

by-construct: Other commands allow a preceding by for a stratified analysis, e.g.


. by sex: sum size

In both cases, you have to sort the data first:


. sort sex

There exists no common rule, when by-constructs or by-options are allowed. However, this is always indicated in the syntax description offered by the help command.
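As a small illustration of the by-construct, the following sketch again assumes the auto example dataset shipped with Stata (not part of the course data). It shows the explicit sort followed by by, and the bysort shorthand that does both in one step.

```stata
* Sketch only: the auto dataset is an assumption
sysuse auto, clear

* Explicit sort, then stratified summary
sort foreign
by foreign: summarize mpg

* bysort sorts and stratifies in one step
bysort rep78: summarize mpg
```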

10 Generating new variables


The generate command: You can use the generate command to generate new variables. In the following example, we generate a variable for body mass index, an indicator of overweight, and an indicator for absence of fever, emesis and fatigue:
. l

       weight   height   fever   emesis   fatigue
  1.       54     1.73       1        0         1
  2.       88     1.81       1        0         0
  3.      102     1.77       0        0         0
  4.       91     1.91       0        1         0
  5.       74     1.66       0        1         1

. gen bmi=weight/height^2
. gen overw=bmi>25
. gen success=(~fever) & (~emesis) & (~fatigue)
. l

       weight   height   fever   emesis   fatigue        bmi   overw   success
  1.       54     1.73       1        0         1    18.0427       0         0
  2.       88     1.81       1        0         0   26.86121       1         0
  3.      102     1.77       0        0         0   32.55769       1         1
  4.       91     1.91       0        1         0   24.94449       0         0
  5.       74     1.66       0        1         1   26.85441       1         0

Note: If you want to generate string variables, you have to specify the length of the string. See Section 18.
The replace command: If you want to overwrite existing values of a variable, you have to use replace instead of generate. For example, if height is recorded in centimetres in the dataset but you want it in metres, you just type

. replace height=height/100

A perhaps unexpected use of replace appears when you try to define a new variable with a subgroup-dependent definition. For example, if the limit for overweight differs between males and females, you typically use code like
. generate overw=bmi>23 if sex=="m"
. replace overw=bmi>25 if sex=="f"

The reason is that the first statement already fills the variable overw with missing values for all female subjects, and these have to be replaced by the second statement.
Overview of available functions and operators: When generating new variables, you can combine existing variables using many operators and functions. help operators and help functions give an overview. The most important ones are summarized in the following list.
. help operators
-------------------------------------------------------------------------------
help for operators              (manual: [U] 20 Functions and expressions)
-------------------------------------------------------------------------------

Operators in expressions
------------------------
                                           Relational
    Arithmetic           Logical       (numeric and string)
-------------------   -------------   ---------------------
 +   addition          ~   not          >   greater than
 -   subtraction       |   or           <   less than
 *   multiplication    &   and          >=  > or equal
 /   division                           <=  < or equal
 ^   power                              ==  equal
                                        ~=  not equal
....

Note that the "equal to" operator is a doubled equals sign, ==.

. help functions
-------------------------------------------------------------------------------
help for functions                                  (manual: [R] functions)
-------------------------------------------------------------------------------
...
Mathematical functions
----------------------
    abs(x)      absolute value
    cos(x)      cosine of radians
    exp(x)      exponentiation
    ln(x)       natural logarithm
    log(x)      same as ln(x)
    log10(x)    base 10 logarithm
    sin(x)      sine of radians
    sqrt(x)     square root
    tan(x)      tangent of radians
    ...

See also Section 21.
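The subgroup-dependent definition of overw discussed in this section can also be written in one step with the cond() function. The following is a sketch on a small made-up dataset (the values are hypothetical); it assumes sex takes only the values m and f and bmi is never missing, since missing values count as larger than any number in comparisons.

```stata
* Hypothetical data for illustration
clear
input str1 sex bmi
"m" 24
"m" 22
"f" 24
"f" 27
end

* Two-step version as in the text
generate overw = bmi>23 if sex=="m"
replace overw = bmi>25 if sex=="f"

* One-step alternative: cond(x,a,b) returns a if x is true and b otherwise
generate overw2 = bmi > cond(sex=="m", 23, 25)

* Check that both definitions agree
assert overw==overw2
list
```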

11 Creating subsamples
There are two ways in which you can create subsamples. You can select a subset of your variables (vertical selection) or you can select a subset of your observations (horizontal selection). For both procedures we have the commands drop and keep. For variables: The data set has three variables ID, sex and income.
. drop income

which produces the same result as


. keep ID sex

For observations: Drop all observations associated with female individuals (the code f in the variable sex indicates a female)
. drop if sex=="f"

which produces the same result as the following, provided sex takes only the values f and m:


. keep if sex=="m"

The consequence of these commands is that the dataset in memory is permanently changed. The dataset on disk is not affected until you issue save dataname, replace. To save under a new filename, type save newdataname
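The equivalence of drop and keep can be checked by counting observations. This sketch assumes the auto example dataset shipped with Stata, where foreign is coded 0/1; it is an illustration, not the course data.

```stata
* Sketch only: the auto dataset is an assumption
sysuse auto, clear
count
drop if foreign==1
count

* Reload and use the complementary keep: same number of observations remain
sysuse auto, clear
keep if foreign==0
count

* Vertical selection: keep two variables only
keep make price
describe
```

Only the data in memory are changed; the file on disk stays as it was unless you save with the replace option.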

12 Making tables in Stata


The tabstat command: You use tabstat when you want to display a series of summary statistics for one or several variables.
tabstat varlist [, statistics(statname [...]) by(varname) columns(var|stat) long ]

where statname [...] are the summary statistics that you want to display.
. tabstat erateWL, s(n mean sd)

    variable |         N      mean        sd
-------------+------------------------------
     erateWL |       170    .19375  .1836576
--------------------------------------------

If you want separate summary statistics for each group defined by varname, you should use the options by(varname) c(s) lo.
. tabstat erateS erateWL, s(n mean sd) c(s) by(gender) lo

gender     variable |         N      mean        sd
--------------------+------------------------------
Female       erateS |        89  .5580524  .2354242
            erateWL |        89  .1685393  .1723018
--------------------+------------------------------
Male         erateS |        81  .5925926  .2487279
            erateWL |        81  .2214506  .1926505
--------------------+------------------------------
Total        erateS |       170  .5745098  .2417539
            erateWL |       170    .19375  .1836576
---------------------------------------------------

See more details in help tabstat.
The table command: You use table when you want to display a series of summary statistics for each level of another variable.
table rowvar [colvar [supercolvar] ...] [, contents(clist) row col [options] ]

The philosophy behind the syntax is that we want a table where, for each value of the variable rowvar (and colvar and supercolvar), the cell contains clist with the layout given in options, where clist is a list of summary statistics of other variables. The option row adds a row with totals to the table (similarly, the option col adds a column with totals). For details on the format options see help table.
. table treat, c(n dec med dec p5 dec p95 dec)

----------+-----------------------------------------------------------
    treat |   N(decrease)  med(decrease)   p5(decrease)  p95(decrease)
----------+-----------------------------------------------------------
        1 |           205       5.211085      -10.97878       23.59735
        2 |           204       16.30814      -2.117609       33.61396
        3 |           204       13.19776      -25.15851       30.93353
----------+-----------------------------------------------------------

The tabulate command: You use the tabulate command when you want to investigate the association between two (or more) variables.
tabulate varname1 varname2 [, all cell chi2 column exact gamma lrchi2 row taub V ...]

The interpretation of the syntax is that we tabulate the frequency counts of varname1 versus varname2 with various measures of association, including the common Pearson chi-squared, the likelihood ratio chi-squared, Cramer's V, Fisher's exact test, Goodman and Kruskal's gamma, and Kendall's tau-b.
. tab res treat, chi2

           |              treat
    result |         1          2          3 |     Total
-----------+---------------------------------+----------
         1 |        74         21         56 |       151
         2 |        71         47         35 |       153
         3 |        36         57         52 |       145
         4 |        24         79         61 |       164
-----------+---------------------------------+----------
     Total |       205        204        204 |       613

          Pearson chi2(6) =  75.7134   Pr = 0.000

It is possible to combine tabulate with summarize to obtain table-like output in a fast way.
. tab treat, summarize(dec)

            |        Summary of decrease
      treat |        Mean   Std. Dev.       Freq.
------------+------------------------------------
          1 |   5.6048431   11.082792         205
          2 |   15.710805   11.359821         204
          3 |   9.3633245   17.387196         204
------------+------------------------------------
      Total |   10.218785   14.193435         613

See help tabsum.
Specialized tables: There exist a number of table commands designed for specific purposes: for epidemiologic data see help epitab, for cross-sectional time-series data (also called panel data) see help xttab, and for survival data see help ltable.
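The table commands of this section can be tried on any dataset with a grouping variable. The sketch below assumes the auto example dataset shipped with Stata (not the course data) and writes the tabstat options out in full; longstub is the unabbreviated form of lo.

```stata
* Sketch only: the auto dataset is an assumption
sysuse auto, clear

* Summary statistics by group, statistics in columns
tabstat price mpg, statistics(n mean sd) by(foreign) columns(statistics) longstub

* One-way table of means, standard deviations and frequencies
tabulate foreign, summarize(mpg)

* Two-way frequency table with Pearson chi-squared test
tabulate rep78 foreign, chi2
```

Observations with a missing value in either variable are left out of the two-way table unless the missing option is given.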

13 Categorization of variables
In many medical applications continuous variables are reduced to variables with a few categories like low, middle and high. Stata supports this step with several functions and commands.
Categorizing a variable at specific cutpoints using the recode function: If you want to categorize a variable at specific cutpoints, you can use the recode function as in the following example. The new variable assigns to each value the upper limit of the interval into which it falls. Note that you have to ensure that the last specified cutpoint is not smaller than the maximal value in your dataset in order to obtain the desired result (see the generation of catvar1). In general, the last argument of recode is not a cutpoint, but the value assigned to every value larger than the last but one argument. This property is used in generating catvar2 to assign a missing value to all values larger than 110.

. list

          var
  1.       23
  2.       56
  3.       67
  4.      123
  5.       99
  6.       17

. gen catvar1=recode(var,50,100,150)
. gen catvar2=recode(var,40,60,80,110,.)
(1 missing value generated)

. list

          var   catvar1   catvar2
  1.       23        50        40
  2.       56       100        60
  3.       67       100        80
  4.      123       150         .
  5.       99       100       110
  6.       17        50        40
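The recode example can be reproduced by typing in the six values with input. This is a minimal runnable sketch of the same steps; only the use of input to enter the data is added here.

```stata
* Enter the six example values
clear
input var
23
56
67
123
99
17
end

* Upper interval limits; the last argument catches everything above the last but one
generate catvar1 = recode(var,50,100,150)
generate catvar2 = recode(var,40,60,80,110,.)
list
```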

If you want to recode the values of the grouped variable, you can use the recode command, or you can use the egen command with the group function, which assigns the values 1, 2, 3 etc. to the smallest, the next smallest etc. value. Both are illustrated in continuing our example:

. list

          var   catvar1   catvar2
  1.       23        50        40
  2.       56       100        60
  3.       67       100        80
  4.      123       150         .
  5.       99       100       110
  6.       17        50        40

. egen catvarg1=group(catvar1)
. recode catvar2 40=1 60=2 80=3 110=4
(5 changes made)
. list

          var   catvar1   catvar2   catvarg1
  1.       23        50         1          1
  2.       17        50         1          1
  3.       56       100         2          2
  4.       67       100         3          2
  5.       99       100         4          2
  6.      123       150         .          3

Note that using the group function implies that the data are reordered.
Categorizing a variable at equidistant cutpoints using the autocode function: autocode is an automated version of recode, which you can use if the cutpoints are equidistant. You then only have to specify the number of intervals, the smallest cutpoint and the largest cutpoint. Note that all values larger than the largest cutpoint are assigned the largest cutpoint, so you should ensure that the largest cutpoint is larger than the maximal value in your dataset. As categorization and recoding are always dangerous actions, you should always check the result, for example by a cross tabulation. This is also illustrated in the following example.
. list

          var
  1.       23
  2.       56
  3.       67
  4.      123
  5.       99
  6.       17

. gen catvar=autocode(var,5,0,100)
. list

          var   catvar
  1.       23       40
  2.       56       60
  3.       67       80
  4.      123      100
  5.       99      100
  6.       17       20

. tab var catvar, missing

           |                        catvar
       var |        20         40         60         80        100 |     Total
-----------+-------------------------------------------------------+----------
        17 |         1          0          0          0          0 |         1
        23 |         0          1          0          0          0 |         1
        56 |         0          0          1          0          0 |         1
        67 |         0          0          0          1          0 |         1
        99 |         0          0          0          0          1 |         1
       123 |         0          0          0          0          1 |         1
-----------+-------------------------------------------------------+----------
     Total |         1          1          1          1          2 |         6

Categorizing a variable in groups of equal size using xtile: The xtile command creates a new variable categorizing an existing variable in groups of (approximately) equal size. The number of groups has to be specified using the nq option. This is illustrated in the following example:

. list

          var
  1.       23
  2.       56
  3.       67
  4.      123
  5.       99
  6.       17

. xtile cat2=var, nq(2)
. xtile cat3=var, nq(3)
. xtile cat4=var, nq(4)
. list

          var   cat2   cat3   cat4
  1.       17      1      1      1
  2.       23      1      1      1
  3.       56      1      2      2
  4.       67      2      2      3
  5.       99      2      3      3
  6.      123      2      3      4

Note that xtile reorders the dataset. You can also use xtile to categorize at cutpoints defined by another variable. Combining it with pctile allows you to categorize at percentiles of subgroups. For further details try help xtile and look into the Stata reference manual.
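Since the text recommends checking every categorization, the following sketch repeats the autocode and xtile steps on the six example values and verifies them by tabulation. Entering the data with input is the only addition; the cutpoints are those of the examples above.

```stata
* Enter the six example values
clear
input var
23
56
67
123
99
17
end

* Five equidistant intervals between 0 and 100
generate catvar = autocode(var,5,0,100)

* Two groups of equal size (split at the median)
xtile cat2 = var, nq(2)

* Cross-check both categorizations
tabulate var catvar, missing
tabulate cat2
```

The cross tabulation makes it visible that 123 ends up in the top category 100 although it lies above the largest cutpoint.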

14 Using Stata as a pocket calculator: The display command


The display command allows you to type in expressions and to look at the results. You can use all operators and functions defined in Stata. Typical examples look like these:
. di 3+4
7
. di 10.6 - 2 * 7.35
-4.1
. di 3^4
81
. di (2.1 + 2.3)/(4.1 + 47.3)
.08560311
. di 2+3, 2+5.6, 3+6
5 7.6 9
. di 23.4-invnorm(0.995)*12.3, 23.4 + invnorm(0.995)*12.3
-8.2827004 55.0827
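display also accepts a display format and text, and intermediate results can be kept in scalars. The following sketch uses made-up numbers; the scalar name se is purely illustrative.

```stata
* A format such as %5.2f controls the number of decimals shown
display %5.2f (2.1 + 2.3)/(4.1 + 47.3)

* Text and expressions can be mixed
display "BMI: " 88/1.81^2

* Keep an intermediate result in a scalar and reuse it
scalar se = 12.3/sqrt(25)
display "standard error = " scalar(se)
```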

15 Loops in Stata
The for command: You can execute a series of Stata commands with the command for. Example:
. for num 1/5: replace varX=varX/1000

The index X is substituted in each loop. num tells Stata that we use numerical values for X. 1/5 is the list of values 1 2 3 4 5. The : indicates that the Stata commands to be executed in each step of the loop follow. It is possible to have several indices (separated by \). For example, I may wish to keep var1-var5 and have new variables var11-var15 in kilo scale.
. for num 1/5 \ num 11/15: generate varY=varX/1000

where \ tells Stata that a second index Y starts here. Further, we may nest a for-loop within another for-loop to obtain repetitions in matrix form (see footnote 1).
. for A in num 1/5: for B in num 1/5: gen varAB=varA*varB

would generate 25 variables var11, var12, ..., var55. If you use for combined with graph, remember the pause option. In Stata 7 and later, see also help foreach and help forvalues. See the manual for further details.
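The first two for examples can also be written with forvalues, which is available from Stata 7 on. This is a sketch under the assumption that var1-var5 exist; here they are created with made-up values first so that the block runs on its own.

```stata
* Create hypothetical variables var1-var5 with three observations each
clear
set obs 3
forvalues i = 1/5 {
    generate var`i' = `i'*1000
}

* Same effect as: for num 1/5: replace varX=varX/1000
forvalues i = 1/5 {
    replace var`i' = var`i'/1000
}

* Same effect as the two-index example: var11-var15 in kilo scale
forvalues i = 1/5 {
    local j = `i' + 10
    generate var`j' = var`i'/1000
}
describe
```

Inside the loop the local macro `i' plays the role of the index X, and `j' the role of the second index Y.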

16 Working with do-files


What is a do-file: A do-file is a flat text file (i.e. ASCII format) containing Stata commands.
Creating a do-file: Open the Do-file editor. Type in the commands you would ordinarily type in the Command window. The editor is similar to, for example, NotePad.
Executing a do-file: Press the Do button (number two from the top right).
Debugging a do-file: Read the error messages. If this doesn't help, try the command set trace on, which gives very detailed information on command execution. It is reset to its original setting by set trace off. The command set trace on places a - in front of each line which is executed. The last line without a - sign contains the error. It is often useful in combination with set more off.
Why use do-files: For two reasons:
1. It gives you the option of modifying and re-running your commands, i.e. it is a time saver (in the long run...).
2. It provides you with documentation on just how you arrived at your precious conclusions.
Comments in do-files: It is fruitful to write comments to yourself or any reader in your do-files. You write a comment by beginning the line with an asterisk *; Stata will then ignore whatever is in that line. A nice do-file looks like:
log using filename, replace
* This do-file is an example
use data, clear
describe
... some other commands
log close
Footnote 1 (to Section 15, nested for-loops): This feature is new in Stata 6.

17 Reshaping datasets
Reshaping wide datasets: Suppose you have the following dataset with measurements of nausea on 3 consecutive days after chemotherapy:

. list in 1/3

        id   sex   nausea1   nausea2   nausea3
  1.     1     m        78        56        34
  2.     2     f        83        45        67
  3.     3     m        27        22        22

You would like to investigate the increase over time by a regression model. For this, you need a dataset where each line corresponds to one day of one individual. You can use the reshape command to achieve this:
. reshape long nausea, i(id) j(day)
. list in 1/9

        id   day   sex   nausea
  1.     1     1     m       78
  2.     1     2     m       56
  3.     1     3     m       34
  4.     2     1     f       83
  5.     2     2     f       45
  6.     2     3     f       67
  7.     3     1     m       27
  8.     3     2     m       22
  9.     3     3     m       22

. regress nausea day, cluster(id)

In Stata's terminology, you have changed a dataset from wide format to long format.
Note: The i-option specifies the logical unit, whereas the j-option specifies the variable which indicates observations within a unit.
Reshaping long datasets: Suppose you have the following dataset with measurements of nausea on 3 consecutive days after chemotherapy:

        id   day   sex   nausea
  1.     1     1     m       78
  2.     1     2     m       56
  3.     1     3     m       34
  4.     2     1     f       83
  5.     2     2     f       45
  6.     2     3     f       67
  7.     3     1     m       27
  8.     3     2     m       22
  9.     3     3     m       22

You would like to make a scatterplot of the measurement on day 2 versus the measurement on day 1. For this you need a dataset where you have the variables nausea1 and nausea2. You can use the reshape command to achieve this:

. reshape wide nausea, i(id) j(day)
. list in 1/3

        id   nausea1   nausea2   nausea3   sex
  1.     1        78        56        34     m
  2.     2        83        45        67     f
  3.     3        27        22        22     m

. gr nausea2 nausea1, twoway

In Stata's terminology, you have changed a dataset from long format to wide format.
Note: If you switch from long to wide format, all variables not used as arguments for reshape must be constant within each unit specified by the i-option. Otherwise, you get an error message.
Reshaping several variables simultaneously with nonnumeric suffixes: In reshaping datasets, the variables can also have nonnumeric suffixes, for example left and right. In this case you have to specify the string option. You can also reshape several variables simultaneously. Both are illustrated in the following example:
. list in 1/2

        id   sex   eyeleft   eyeright   earleft   earright
  1.     1     m         1          1         0          0
  2.     2     f         1          0         1          0

. reshape long eye ear, i(id) j(side) string
. list in 1/4

        id    side   sex   eye   ear
  1.     1    left     m     1     0
  2.     1   right     m     1     0
  3.     2    left     f     1     1
  4.     2   right     f     0     0

. reshape wide eye ear, i(id) j(side) string
. list in 1/2

        id   eyeleft   earleft   eyeright   earright   sex
  1.     1         1         0          1          0     m
  2.     2         1         1          0          0     f

You can also use the reshape command for more complex situations. Take a look at the Stata Reference Manual.
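The wide-to-long-to-wide round trip of this section can be reproduced with the three example patients. This is a minimal runnable sketch; entering the data with input is the only addition to the commands shown above.

```stata
* The three example patients in wide format
clear
input id str1 sex nausea1 nausea2 nausea3
1 "m" 78 56 34
2 "f" 83 45 67
3 "m" 27 22 22
end

* Wide to long: one line per patient and day
reshape long nausea, i(id) j(day)
list

* Long back to wide: one line per patient
reshape wide nausea, i(id) j(day)
list
```

The second reshape works because sex, the only variable not named in the command, is constant within each id.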

18 Working with string variables


Generating string variables: If you want to generate a new string variable, you have to specify the length of the variable in the generate statement, e.g.
. gen str3 s="abc"

Operations on strings: If you want to concatenate strings, you can use the + operator:

. l

       treat   group
  1.       A       2
  2.       A       1

. gen str3 tr_gr=treat+" "+group
. l

       treat   group   tr_gr
  1.       A       2     A 2
  2.       A       1     A 1

There exist a lot of functions to work with strings, especially to switch from numbers to strings and vice versa.
. help functions
-------------------------------------------------------------------------------
help for functions                                  (manual: [R] functions)
-------------------------------------------------------------------------------
....

String functions
----------------
    index(s1,s2)     --- returns position in s1 in which s2 is first found
                         or 0 if s1 does not contain s2
    length(s)        --- returns length of string s
    lower(s)         --- returns lowercased variant of s
    ltrim(s)         --- returns s with leading blanks removed
    real(s)          --- converts s into a numeric value
    rtrim(s)         --- returns s with trailing blanks removed
    string(n)        --- converts n into a string
    string(n,%fmt)   --- converts n into a string with %fmt display format
    substr(s,n1,n2)  --- returns the substring of s starting at n1 for a
                         length of n2; if n1<0, starting position is
                         interpreted as distance from end of string; if
                         n2==., the remaining portion of the string is
                         returned
    trim(s)          --- returns s with leading and trailing blanks removed
    upper(s)         --- returns uppercased variant of s
    ....

See also Section 20.
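The + operator only works on strings. The sketch below assumes, unlike the listing above, that group is stored as a numeric variable (a hypothetical variant of the example), so it is converted with string() before concatenating; length(), substr() and real() then take the result apart again.

```stata
* Hypothetical variant of the example: group is numeric here
clear
input str1 treat group
"A" 2
"A" 1
end

* Convert the number to a string before concatenating
generate str3 tr_gr = treat + " " + string(group)

* Take the string apart again
generate len = length(tr_gr)
generate grp2 = real(substr(tr_gr,3,1))
list
```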

19 Labels
Labelling an existing variable: If a variable is coded by numerical values, it is often useful to have the meaning of the values and not the values themselves in tabulations and listings. You can achieve this by assigning labels to the variable values using the label command:
. list

        sex   age
  1.      0    17
  2.      1    23

. label define labsex 0 male 1 female
. label values sex labsex
. list

          sex   age
  1.     male    17
  2.   female    23

Note: The labels are only used to display the values. Internally, the values are still stored as numbers, so you can only use sex as a numeric variable.

Distinguishing values and labels: Once a variable is labelled, it can be difficult to find out what the real values are. The codebook command always shows both the values and the labels:
. codebook sex

sex --------------------------------------------------------------- (unlabeled)
                  type:  numeric (float)
                 label:  labsex

                 range:  [0,1]                        units:  1
         unique values:  2                    coded missing:  0 / 2

            tabulation:  Freq.   Numeric  Label
                             1         0  male
                             1         1  female

Note: If you import datasets from other systems, for example using StatTransfer, values are often already labeled. Hence it is always a good idea to use codebook in the beginning.

Note: Some commands, for example list and tabulate, allow a nolabel option, so that the values are shown instead of the labels.
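The practical consequence of the notes above is that conditions must refer to the stored numbers, not to the label text. A minimal sketch, re-creating the two observations of the example (the data are typed in here only to make the block self-contained):

```stata
clear
input sex age
0 17
1 23
end
label define labsex 0 "male" 1 "female"
label values sex labsex

* conditions use the numeric codes; sex == "female" would be a type mismatch
list if sex == 1
count if sex == 1

* show the underlying codes and the mapping itself
list, nolabel
tabulate sex, nolabel
label list labsex
```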

20 Switching between labels, strings and numbers


Labels and Strings: Sometimes, you would like to use the labels of a variable as strings, for example if you want to create a new variable by concatenating. This is done by the decode command, and encode does the opposite:
. list

          sex   age
  1.     male    17
  2.   female    23

. list, nolabel

        sex   age
  1.      0    17
  2.      1    23

. decode sex, gen(sexstr)
. list

          sex   age   sexstr
  1.     male    17     male
  2.   female    23   female

. encode sexstr, gen(gender)
. list

          sex   age   sexstr   gender
  1.     male    17     male     male
  2.   female    23   female   female

. codebook gender

gender ------------------------------------------------------------ (unlabeled)
                  type:  numeric (long)
                 label:  gender

                 range:  [1,2]                        units:  1
         unique values:  2                    coded missing:  0 / 2

            tabulation:  Freq.   Numeric  Label
                             1         1  female
                             1         2  male
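The codebook output shows that encode numbers the distinct strings in alphabetical order, so gender is coded 1 = female, 2 = male, while sex is coded 0 = male, 1 = female. The sketch below repeats the steps on the same two observations and also uses the decoded string for a concatenation; the variable who is a hypothetical name introduced for the illustration.

```stata
clear
input sex age
0 17
1 23
end
label define labsex 0 "male" 1 "female"
label values sex labsex

* label text as a string variable
decode sex, gen(sexstr)
* build a string from the label text and the age
gen str10 who = sexstr + "/" + string(age)

* back to a labelled numeric variable; codes follow alphabetical order
encode sexstr, gen(gender)
list, nolabel
```

Because the codes of sex and gender differ, compare such variables through their labels or recode them explicitly rather than assuming equal numbers.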

Strings and Numbers: The string function converts numbers to strings, and the real function converts strings to numbers.

. list

          sex   age
  1.   female    23
  2.     male    17

. gen str2 agestr=string(age)
. gen years=real(agestr)
. list

          sex   age   agestr   years
  1.   female    23       23      23
  2.     male    17       17      17

. describe

Contains data
  obs:             2
 vars:             4
 size:            56 (98.5% of memory free)
-------------------------------------------------------------------------------
  1. sex       float  %9.0g      labsex
  2. age       float  %9.0g
  3. agestr    str2   %9s
  4. years     float  %9.0g
-------------------------------------------------------------------------------
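Two details of these conversions are worth seeing in a small sketch on hypothetical data: real() returns a missing value when the string is not a number, and string() accepts a display format, which has to be given in quotes. The variable padded and the format %05.0f (leading zeros, width 5) are choices made for this illustration only.

```stata
* Hypothetical data: ages stored as text, one of them not a number
clear
input str4 agestr
"23"
"17"
"n/a"
end

* real() gives missing (.) for "n/a"
gen years = real(agestr)

* string() with a display format: pad with leading zeros to width 5
gen str5 padded = string(years, "%05.0f")
list
```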

21 Creating variables with statistics


It is often necessary for an analysis to prepare the dataset by computing new variables with statistics, for example the maximum value observed during a day or subject-specific mean values. The following illustrates some typical tools for this task.

Computing statistics over several variables using egen: The egen command offers functions like rmax or rmean to compute a maximum or a mean row-wise. This is illustrated in the following example, where we have for each subject and each day a measurement at 6 o'clock, 12 o'clock and 18 o'clock. We can use rmax to compute the maximum within each day:
. list in 1/6

       subj   day   val6   val12   val18
  1.      1     1   23.5    34.3    22.9
  2.      1     2   25.8    33.6    27.8
  3.      1     3   12.8    18.9    22.3
  4.      2     1   14.5    17.9    22.8
  5.      2     2   19.8    17.3    15.4
  6.      2     3   33.9    30.3    27.8

. egen maxv=rmax(val6 val12 val18)
. list in 1/6

       subj   day   val6   val12   val18   maxv
  1.      1     1   23.5    34.3    22.9   34.3
  2.      1     2   25.8    33.6    27.8   33.6
  3.      1     3   12.8    18.9    22.3   22.3
  4.      2     1   14.5    17.9    22.8   22.8
  5.      2     2   19.8    17.3    15.4   19.8
  6.      2     3   33.9    30.3    27.8   33.9

For this type of task egen offers the functions rmax, rmin, rmean, rsum, rsd and robs, where the latter gives the number of nonmissing observations. Note that these functions expect a list of variables separated by blanks. Do not confuse them with the functions mean, min, max etc., which are also offered by egen for other purposes.

Computing statistics over several observations using collapse: The collapse command allows you to compute statistics from groups of observations. Looking at the last example, we might now be interested in taking the average over three days for each subject. This can be done in the following way:
. list in 1/6

       subj   day   val6   val12   val18   maxv
  1.      1     1   23.5    34.3    22.9   34.3
  2.      1     2   25.8    33.6    27.8   33.6
  3.      1     3   12.8    18.9    22.3   22.3
  4.      2     1   14.5    17.9    22.8   22.8
  5.      2     2   19.8    17.3    15.4   19.8
  6.      2     3   33.9    30.3    27.8   33.9

. collapse (mean) meanmax=maxv, by(subj)
. list in 1/2

       subj    meanmax
  1.      1   30.06667
  2.      2       25.5
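Since collapse replaces the dataset in memory, it can be convenient to wrap it in preserve and restore when the original observations are still needed. The following sketch re-enters the six observations of the example and repeats both steps. It keeps the function name rmax used in this text; as far as we know, later Stata releases document it as rowmax(), so treat the name as an assumption about your Stata version.

```stata
clear
input subj day val6 val12 val18
1 1 23.5 34.3 22.9
1 2 25.8 33.6 27.8
1 3 12.8 18.9 22.3
2 1 14.5 17.9 22.8
2 2 19.8 17.3 15.4
2 3 33.9 30.3 27.8
end

* row-wise maximum of the three daily measurements
egen maxv = rmax(val6 val12 val18)

* keep a copy of the data, collapse to one row per subject, then go back
preserve
collapse (mean) meanmax=maxv, by(subj)
list
restore

* the six original observations are in memory again
describe, short
```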

You can generate several statistics simultaneously. For example, you can use
collapse (min) minval6=val6 (max) maxval6=val6, by(subj)
to generate the minimum and maximum of the measurements at 6 o'clock over the three days for each subject. Other statistics offered by collapse are median, sd, sum, iqr and all percentiles.

Note: If you have variables that are constant within the unit you would like to collapse, and that you want to keep in the new dataset (for example the age and sex of a subject), you can include them in the by-option. (For example: collapse ..., by(subj age sex))

Computing statistics over several observations using egen: Sometimes it is necessary to generate statistics over observations without reducing the dataset, for example if you want to compare single values with subject-specific mean values. The egen command together with a by-option allows you to do this easily. In the following example we have 6 measurements for each subject, and we would like to compare the values with the subject-specific means in order to check when a subject suffers from a high or low value. This can be done in the following way:

. list in 1/12

       subj   time   value
  1.      1      1    17.9
  2.      1      2    23.7
  3.      1      3    45.8
  4.      1      4    37.2
  5.      1      5    19.4
  6.      1      6    20.8
  7.      2      1    44.5
  8.      2      2    48.7
  9.      2      3    52.1
 10.      2      4    46.7
 11.      2      5    44.5
 12.      2      6    40.3

. egen meanval=mean(value), by(subj)
. gen high=val>meanval
. list in 1/12

       subj   time   value    meanval   high
  1.      1      1    17.9   27.46667      0
  2.      1      2    23.7   27.46667      0
  3.      1      3    45.8   27.46667      1
  4.      1      4    37.2   27.46667      1
  5.      1      5    19.4   27.46667      0
  6.      1      6    20.8   27.46667      0
  7.      2      1    44.5   46.13333      0
  8.      2      2    48.7   46.13333      1
  9.      2      3    52.1   46.13333      1
 10.      2      4    46.7   46.13333      1
 11.      2      5    44.5   46.13333      0
 12.      2      6    40.3   46.13333      0

egen also offers functions like min, max, median, sd, iqr, rank, sum and functions for percentiles. A typical use of egen is to standardize a variable to the range 0-1 for each subject:
. egen min=min(var), by(subject)
. egen max=max(var), by(subject)
. gen standvar=(var-min)/(max-min)
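A worked instance of this standardization, on a hypothetical subset of the measurements shown earlier, might look as follows. The names lo and hi are invented for the example, and the last command shows the by-prefix form, which we take to be equivalent to the by-option here.

```stata
clear
input subj time value
1 1 17.9
1 2 23.7
1 3 45.8
2 1 44.5
2 2 48.7
2 3 52.1
end

* subject-specific minimum and maximum, repeated on every row of the subject
egen lo = min(value), by(subj)
egen hi = max(value), by(subj)

* 0 at the subject's minimum, 1 at the maximum (missing if hi equals lo)
gen standvar = (value - lo)/(hi - lo)

* prefix form of the by-option
bysort subj: egen meanval = mean(value)
list
```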

22 Survival analysis commands


A characteristic feature of survival data is the presence of censoring and left truncation. Without censoring and truncation the data are represented by the survival time variable $T$, which measures the duration of time between the initial event and the final event. In the presence of censoring and truncation more variables are required to represent the incomplete observation of the survival time $T$. With censoring at time $C$ (e.g. end of follow-up) it is only possible to observe $T$ if the final event occurs before time $C$. The final event indicator $D$ is equal to 1 if $T \le C$ (i.e. uncensored observation) and it is equal to 0 if $T > C$ (i.e. censored observation). The censored survival time $\tilde{T}$ is equal to $T$ if $D = 1$ and is equal to $C$ if $D = 0$. With left truncation at time $V$ the censored observations are only observed if $\tilde{T} > V$ (otherwise no information is collected). Consequently, under right censoring and left truncation the survival time is represented by three variables $(\tilde{T}, D, V)$. In Stata datasets these variables are usually called time, event and time0 respectively. If all subjects enter at time 0 (i.e. $V = 0$) the respective variable time0 may be omitted in the dataset.

Prepare the dataset for analysis: In order to avoid entering the three variable names representing the survival time observations in each survival analysis command, Stata requires an extra step before any survival analysis command may be executed. This step is carried out using the stset command:
stset time, failure(event) enter(time0)
This ensures that the variables time, event and time0 will be used automatically by Stata in all subsequent survival analysis commands to represent the censored observations. When all subjects enter at time 0, the enter() option may be omitted:
stset time, failure(event)

Kaplan-Meier plot: The sts graph command will produce graphs of Kaplan-Meier estimates of the survival function:
sts graph

Kaplan-Meier plots with 95% CI: The sts graph command may be combined with the by(indepvar) option to produce separate Kaplan-Meier plots for subgroups of the data specified by the different values of indepvar. The gwood option may be used to add pointwise 95% confidence intervals to the plots.
sts graph, by(indepvar) gwood

Kaplan-Meier at age 200:
sts list, at(200 201)
Note the argument at(200 201), where two time values are required, because without 201, at(200) will tabulate the Kaplan-Meier estimator at 200 equidistant time points.

Estimate median survival: The stci command produces median estimates along with confidence intervals:
stci, median by(group)

Logrank test: The sts test command may be used to compare survival in two or more groups. The groups are defined by distinct values of the indepvar variable. The logrank option specifies that the logrank test (default) is to be used for the comparison:
sts test indepvar, logrank

Cox regression: The stcox command is used to carry out analysis using the Cox regression model:
stcox indepvar1 indepvar2 ... indepvarN
This will report hazard ratio estimates. To produce estimates of regression coefficients the nohr option may be used.

Increase memory size: Sometimes the extra variables created by the stset command do not fit in the available memory. In this case see section 5 for commands to increase the memory size. Note that you will have to reload and re-stset the dataset after this operation.
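The sequence of commands in this section can be tried out on an example dataset. The sketch below assumes a Stata release that ships the cancer example data via sysuse, with the variables studytime, died, drug and age; these names belong to that dataset, not to the text above. All subjects enter at time 0 in these data, so no enter() option is needed.

```stata
* assumption: the cancer example dataset is available via sysuse
sysuse cancer, clear

* declare survival time and failure indicator once
stset studytime, failure(died)

* Kaplan-Meier estimates, median survival and logrank test by treatment arm
sts list, by(drug)
stci, by(drug)
sts test drug, logrank

* Cox model with drug entered as a single numeric covariate for brevity:
* hazard ratios first, then regression coefficients
stcox drug age
stcox drug age, nohr
```

After stset, Stata keeps its own copies of the analysis time and the event indicator in the variables _t and _d, which is why the later commands need no variable names for them.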

23 Online facilities
Stata is web-aware in the sense that it offers commands that allow you to update and enhance your Stata version if you are connected to the Internet. The most important commands are:

update: Typing update gives an overview of when your Stata system was last updated. The command update query checks whether or not your Stata would benefit from an update. Finally, the command update all updates both your ado-files and the executable.

findit: In an up-to-date Stata 7.0 the command findit searches all relevant Internet sites for Stata material containing your search word. For example:
. findit smooth
                                                        13 Sep 2002 13:53:35
Keyword search
--------------
        Keywords:  smooth
          Search:  (1) Official help files, FAQs, and STBs
                   (2) Web resources from Stata and from other users

Search of official help files, FAQs, and STBs
---------------------------------------------

[R]     kdensity . . . . . . . . . . . . Univariate kernel density estimation
        (help kdensity)

[R]     ksm . . . . . . . . . . . . . . . . . . . Smoothing including lowess
        (help ksm)

<...cut...>

Example . Applied Survival Analysis: Regression Modeling of Time to Event Data
        . . . . . . . . . . . . . . . . . . UCLA Academic Technology Services
        9/01    http://www.ats.ucla.edu/stat/books/asa/default.htm
                examples from the book Applied Survival Analysis: Regression
                Modeling of Time to Event Data by David W. Hosmer, Jr. and
                Stanley Lemeshow

<...cut...>

STB-53  sg128 . . . Some programs for growth estimation in fisheries biology
        (help bevholt, fordwal, gullholt, gullplot, nlvbgf, ... if installed)
        . . . Salgado-Ugarte, Martinez-Ramirez, Gomez-Marquez, & Pena-Mendoza
        1/00    pp.35--47; STB Reprints Vol 9, pp.278--293
                programs to estimate and plot the von Bertalanffy growth
                function

STB-41  gr27 . . . . . . . . . An adaptive variable span running line smoother
        (help autosmoo if installed)  . . . . . . . . . . . . . . . P. Sasieni
        1/98    pp.4--7; STB Reprints Vol 7, pp.63--68
                smooths yvar on xvar where the smooth is a running line fit
                with a variable span

<...cut...>

Web resources from Stata and other users
----------------------------------------

(contacting http://www.stata.com)

14 packages found (STB omitted)
-------------------------------

sthaz from http://www.sun.rhbnc.ac.uk/uhss021/stata
    sthaz. Smoothed hazard (transition/failure) rate plots. / Program by
    Kenneth L. Simons. / Compute nonparametric estimates of smoothed hazard
    rates, and create graphs / of the results. The program also can compute
    and graph standard errors and / confidence bounds. The estimates use

hazplot from http://www.sun.rhbnc.ac.uk/uhss021/stata
    hazplot. Smoothed hazard (transition/failure) rate plots. / Program by
    Kenneth L. Simons. / hazplot plots hazard rates or smoothed hazard rates.
    It works only on data in / panel form with integer time variables, and
    the data must have been stset / using the time0() option. For example, you

<...cut...>

6 references found in tables of contents
----------------------------------------

http://www.sun.rhbnc.ac.uk/uhss021/stata/
    Materials by Kenneth L. Simons / Here are assorted utilities for Stata. /
    Check dummy (indicator) variables to ensure they are okay / Distance
    between latitude & longitude coordinates / Count data points in a
    geographic radius of each point / Create data points for extra geographic

http://www.stata.com/users/njc/
    Materials by Nicholas J. Cox, University of Durham / Nicholas J. Cox
    <N.J.Cox@durham.ac.uk> is a geographer at the University / of Durham and
    a frequent contributor to Statalist. His areas of interest / include
    graphics, smoothing, probability distributions, circular statistics, /
    and

<...cut...>

(end of search)

First you see what is in the reference manual, on the Stata FAQ pages, and in the STB. STB refers to the Stata Technical Bulletin, a journal in which various enhancements (ado-files) are published with examples of their use. Next you get the results of searching the web for user-written resources.

Installation: To install a specific package you found with findit, just follow the blue clickable links.

24 How to find a statistical method


The following list should give you some hints as to where you can find specific statistical methods. Note that Stata offers many more methods than shown in this list. The list should only help you to find the corresponding Stata command. Hint: many tables and simple calculations for epidemiologists can be found under epitab.

Description                                 Stata-command

ANOVA                                       anova or oneway
χ²-test for contingency tables              tabulate var1 var2, chi, see also epitab
confidence intervals for
  means, proportions, probabilities         ci or cii (immediate form)
  percentiles                               centile
contingency tables                          tabulate
correlation
  Spearman                                  spearman var1 var2
  Pearson                                   pwcorr [varlist] or correlate [varlist]
cumulative distribution function            cdf from STB
cox regression                              stcox indepvars
Fisher's exact test                         tabulate var1 var2, exact, see also epitab
Friedman test                               friedman from STB, try search friedman
four fold table                             tabulate, see also epitab
interrater agreement                        kappa var1 var2
Kaplan-Meier curves                         sts graph
kappa                                       kappa var1 var2
Kruskal-Wallis test                         kwallis
likelihood ratio test                       lrtest
linear regression                           regress depvar [varlist]
logistic regression                         logistic depvar [varlist]
log rank test                               sts test indepvar, logrank
Mann-Whitney two sample test                ranksum
mean, median, sd                            summarize [varlist] or table
meta analysis                               meta from STB, try search meta
McNemar test                                symmetry casevar controlvar
multiple linear regression                  regress depvar [varlist]
OR (odds ratio)                             cc case-var ex-var or cci a b c d (immediate form)
percentiles                                 table var1,c(p25 var2 ...) or centile
person years                                ir
relative frequencies                        tabulate
RR (relative risk)                          cs case-var ex-var or csi a b c d (immediate form)
risk ratio                                  cs, csi or ir (for incidence data)
ROC curves                                  roctab or rocfit from STB, try search roc
signtest                                    signtest
simple linear regression                    regress depvar [varlist]
t-test                                      ttest
trend tests                                 nptrend
Wilcoxon matched-pairs signed-ranks test    signrank, see also signtest
Wilcoxon ranksum test                       ranksum
