Training Module 1: Using Stata for Survey Data Analysis

Project: Poverty mapping and market access in Vietnam
Funding: New Zealand Embassy with coordination by The World Bank
Implementation: International Food Policy Research Institute (IFPRI) and the Institute for Development Studies (IDS)
Lead Trainer: Nicholas Minot, IFPRI
Dates: 5-9 August 2002
Host institutions: Information Center for Agriculture and Rural Development, Ministry of Agriculture and Rural Development, with the Ministry of Labor, Invalids, and Social Affairs and the Ministry of Planning and Investment
Hanoi, Vietnam

Module 1: Using Stata to Analyze Survey Data

IFPRI-IDS Poverty Mapping Project

Background

This is the first of three one-week training modules offered as part of the project “Poverty mapping and market access in Vietnam.” The project is funded by the Embassy of New Zealand and implemented by the International Food Policy Research Institute (IFPRI) in Washington, D.C. and the Institute for Development Studies (IDS) in Sussex, England. The training modules will cover the following topics:

1. Using Stata for survey data analysis
2. Introduction to geographic information systems (GIS)
3. Poverty mapping methods: Combining census and survey data

Four characteristics of these modules need to be emphasized because they have implications for the role of the participants.

• The training modules are not lecture courses. They are semi-structured, hands-on workshops in which trainees use computers to learn different methods of analyzing data. Active participation of the trainees is therefore expected and necessary to get the full benefit of the training.
• The training modules focus on how to use computer software to implement a wide range of analytical methods. To cover this range, the course cannot provide detailed explanations of the statistical methods themselves, so it is assumed that trainees have some familiarity with concepts such as means, frequency distributions, and regression analysis.
• The training modules are cumulative: understanding the material of one day depends on having attended the course the day before. If you cannot attend every day for the full day, it will be difficult to understand the new material. For this reason, we will ask those who cannot attend regularly to withdraw to make space for other trainees.
• The training modules will be offered in English. Trainees are not expected to understand all the technical terms used in the course, but they should have a solid understanding of conversational English in order to take full advantage of the training.

At the end of each module, we will issue Certificates of Completion to each trainee who has attended all the sessions and mastered the concepts taught in the course. We reserve the right not to issue Certificates to trainees who do not attend all sessions or who do not master the material taught.

Objectives

The objective of this training module is to improve the ability of the trainees to use Stata to generate descriptive statistics and tables from survey data, as well as to carry out multiple linear regression analysis of those data. In particular, the course aims to train the participants in the following methods:

• basic file management such as opening, modifying, and saving files
• advanced file management such as merging, appending, and aggregating files
• documenting data files with variable labels and value labels
• generating new variables using various functions and operations
• creating tables to describe the distribution of continuous and discrete variables
• creating tables to describe the relationships between two or more variables
• using regression analysis to study the impact of various variables on a dependent variable
• testing hypotheses using statistical methods

N. Minot


Course requirements

In order to take full advantage of the materials taught in the course, trainees must have the following background:

• Conversational English that allows them to follow the instructions of the trainer
• Basic statistics such as familiarity with the concepts of means, variance, frequency distributions, and regression analysis
• Familiarity with computers, including the keyboard and mouse

Organization of the course

The training course is divided into ten sections. We will cover some material in all 10 sections, but we may not be able to cover all the material, depending on the background of the trainees.

Section 1: Introduction to survey data files
Section 2: Introduction to Stata
Section 3: Exploring data files with Stata
Section 4: Saving and using Stata output
Section 5: Creating new variables
Section 6: Making tables to describe data
Section 7: Making graphs
Section 8: Modifying data files
Section 9: Introduction to programming with Stata
Section 10: Regression analysis with Stata

Each section will include some training in the use of Stata commands and a practical application of these commands to the analysis of the 1998 Vietnam Living Standards Survey (VLSS). The VLSS contains over one hundred files, but we will focus our attention on the following files:
Table 1. Sample data programs from the 1998 VLSS

Questionnaire section     Topic                            Level        File name
Extracted from various    Household characteristics        Household    hhexp98n.dta
Section 1A                List of household members        Individual   scr01a2.dta
Section 2                 Education                        Individual   scr02a.dta
Section 6A                Type of housing                  Household    scr06a.dta
Section 6B p1             Housing expenses                 Household    scr06b1.dta
Section 6B p2             Housing expenses                 Household    scr06b2.dta
Section 6C                Housing characteristics          Household    scr06c.dta
Section 9B1               Rice production                  Crop         scr09b1.dta
Section 9B2               Other food crop production       Crop         scr09b2.dta
Section 9B4               Perennial cash crop production   Crop         scr09b4.dta


SECTION 1: INTRODUCTION TO SURVEY DATA FILES

1. List of useful terms

The following are some key concepts that will be used throughout this training module. Most of you will be familiar with them, but it is worth reviewing the terms for those that may not know all of them.

Records (or cases or observations) are individual observations such as individuals, households, villages, farm plots, or provinces. They are usually considered to be the “rows” of the data file. For example, data set A (below) has 5 records and data set B has 6 records. The VLSS files usually have between 6000 and 120,000 records.

Variables are the characteristics, location, or dimensions of each record. They are considered the “columns” of the data file.
• In data set A (below), there are four variables: the household identification number, the region where the household lives, the size of the household, and the distance from the house to the nearest source of water.
• In data set B (below), there are six variables: the region, province, household, plot number, whether or not it is irrigated, and the size of the plot.
The VLSS files usually have between 10 and 30 variables.

The level of the dataset describes what each record represents.
• In data set A, each record is a different household, so it is a household-level data set.
• In data set B, each record is a farm plot, so it is a plot-level data set. Note that more than one record has the same household identification number.

Key variables are the variables that are needed to identify a record in the data.
• In data set A, the variable HHID is enough to uniquely identify the record, so HHID is the only key variable.
• In data set B, the key variables are REG, PROV, HH, and PLOT because all four variables are needed to uniquely identify the record. The first two records have the same region, province, and household, so these three variables are not enough to uniquely identify a record.

Data set A
HHID   REG   HHSIZE   DISTWAT
3456    1      5        ...
3457    1      5        ...
3458    1      4        ...
3459    2      2        ...
3460    3      8        ...

Data set B
REG   PROV   HH   PLOT   IRRIG   AREA
 1      4     1     1      1     1.5
 1      4     1     2      0     ...
 1      5     3     1      1     ...
 2     26     2     1      0     ...
 2     26     2     2      1     ...
 3     45     1     1      1     ...
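The idea of key variables can be checked directly in Stata. The following is a minimal sketch, not part of the original course material, that types in the identifying columns of data set B and tests whether a set of variables identifies each record uniquely. The variable name nplots is made up for illustration.

```stata
* Enter the identifying columns of data set B by hand
clear
input reg prov hh plot irrig
1  4 1 1 1
1  4 1 2 0
1  5 3 1 1
2 26 2 1 0
2 26 2 2 1
3 45 1 1 1
end

* Three variables are not enough: count the records per household
sort reg prov hh
by reg prov hh: gen nplots = _N
list reg prov hh plot nplots

* All four key variables together should identify every record;
* assert stops with an error if any group holds more than one record
sort reg prov hh plot
by reg prov hh plot: assert _N == 1
```

If the assert line runs without an error message, the four variables are a valid set of key variables for the file.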

Discrete variables (or categorical variables) are variables that have only a limited number of different values. Examples include region, sex, type of roof, income category, and education level. Yes/no variables such as whether a household has electricity are also discrete variables.

Binary variables (or dummy variables) are a type of discrete variable that only takes two values. They may represent yes/no, male/female, have/don’t have, or other variables with only two values.

Continuous variables are variables whose values are not limited. Examples include income, farm size, coffee production, rice consumption, number of trees, and distance to the road. Unlike discrete variables, continuous variables are usually expressed in some units such as Vietnamese dong, kilometers, hectares, or kilograms and may take fractional values (4.5639).

Variable labels are longer names associated with each variable to explain them in tables and graphs. For example, the variable label for HHSIZE might be “Household size” and the label for DISTWAT could be “Distance to water (km)”. Whenever possible, variable labels should include the unit (e.g. km).

Value labels are longer names attached to each value of a variable. For example, if the variable REG has eight values, each value is associated with a name: REG=1 could be “Northeast Region”, REG=2 could be the “Northwest Region”, and so on.

2. Structure of 1998 VLSS data files

The Vietnam Living Standards Survey was carried out in 1992-93 and 1997-98. In this section, we describe the 1997-98 VLSS, although most of the description fits the earlier survey as well since the questionnaire and data files are quite similar. The 1998 VLSS had three types of questionnaires: a household questionnaire, a community questionnaire, and a price questionnaire. Here we focus on the household questionnaire.

Household questionnaire

The household questionnaire consists of 116 files and about 60 Mb of data (in Stata format). Each file contains the data for one section or sub-section of the questionnaire, usually covering several pages of the questionnaire. The files cover the following topics:

Section 1:  Household members
Section 2:  Education
Section 3:  Health
Section 4:  Employment
Section 5:  Migration
Section 6:  Housing
Section 7:  Respondents for 2nd round
Section 8:  Fertility
Section 9:  Agriculture, forestry, and fishery activities
Section 10: Non-farm self-employment
Section 11: Food expenditures
Section 12: Non-food expenditures and durable goods
Section 13: Income from remittances
Section 14: Borrowing, lending, and savings

The file names include the section number, the part letter, and sometimes a number indicating the sub-part. For example, in the file scr09B4.dta
• 09 refers to Section 9,
• B refers to Part B,
• 4 refers to the 4th sub-part of Part B, and
• .dta is the file extension for Stata data files.
Section 9 covers agriculture, Part B covers crop production, and Part B4 covers permanent industrial crops such as tea, coffee, and rubber.

Within each file, the variables are named according to the section and question number. For example, in the variable s9b4q031
• s9b4 refers to Section 9, Part B4,
• q03 refers to question 3, and
• 1 refers to the 1st column within the question.
The variable s9b4q031 gives the area planted with a given crop, expressed in terms of hectares or number of trees. The next variable, s9b4q032, indicates whether the area is expressed in hectares or trees.
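To make the naming rule concrete, here is a small sketch (an illustration only, not from the course files) that splits the variable name s9b4q031 into its parts with the substr() string function. The local macro v is a made-up helper; macros are introduced properly in Section 9.

```stata
* Store the variable name in a local macro and cut it into pieces
local v "s9b4q031"
display "section and part: " substr("`v'", 1, 4)
display "question:         " substr("`v'", 5, 3)
display "column:           " substr("`v'", 8, 1)
```

The three lines pick out s9b4, q03, and 1, the same pieces described in the text.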

SECTION 2: INTRODUCTION TO STATA

When you open Stata, you will see a menu bar across the top, a tool bar with buttons, and 3-5 windows (the number of windows open depends on which windows were open the last time Stata was used). Each is described briefly below.

1. Menu bar

The menu bar has lists of commands that can be opened by clicking on a word. If you use Stata a lot, you probably will not use the menu bar often because the most common tasks can be done with the buttons on the tool bar and keystrokes. Below we provide a quick description of the different options.

File
  Open                 Open data file
  View                 View data file (only in Stata 7)
  Save                 Save data file
  Save as              Save data file under new name
  File name            Select data file name to put in command
  Log                  Open, close, review, or convert log file
  Save graph           Save file with graph
  Print graph          Print graph
  Print results        Print contents of current window (only in Stata 7)
  Exit                 Leave Stata

Edit
  Copy text            Copy marked text (Control-C can also be used to copy)
  Copy tables          Copy tables to insert in spreadsheet or word processor
  Paste                Insert something previously copied (Control-V will also paste)
  Table copy options   Options for how tables are copied
  Graph copy options   Options for how graphs are copied (not in Stata 7)

Prefs
  Various options for setting preferences. For example, you can save a particular layout of the different Stata windows or change the colors used in Stata windows.

Window
  Results              Bring output window to front
  Graph                Bring graph window to front
  Log                  Bring log window to front
  Viewer               Open help window (only in Stata 7)
  Command              Bring command window to front
  Review               Bring list of recent commands to front
  Variables            Bring list of variables to front
  Help/search          Open help window (not in Stata 7)
  Data editor          Open window to look at data
  Do-file editor       Open window to write a new program (“Do” file) or edit an existing one

Help
  Contents             Information on Stata organized by topic
  Search               Search for information on a certain topic
  Stata command        Search for information on a certain Stata command
  What’s new           Differences between different versions of Stata
  Other options allow you to access web sites with Stata news and information.

2. Tool bar

The buttons on the tool bar are designed to make it easier to carry out the most common tasks. The left column describes the button on the tool bar, while the right column tells what the button does.

  Open folder                 Use data file
  Diskette                    Save data file in memory to disk
  Printer                     Print contents of current window
  Scroll with traffic light   Open, close, or view log file
  Scroll without light        Bring log window to front (not in Stata 7)
  Eye                         Open window with help on using Stata (only in Stata 7)
  Box 1                       Bring Dialog Window to front
  Box 2                       Bring Results Window to front
  Box 3                       Bring Graph Window to front
  Envelope                    Open window to write a new program (“Do” file) or edit an existing one
  Table                       Open window to view and edit data
  Table and circle            Open window to view data
  Go                          Turn off “More”
  X                           Stop processing

3. Stata windows

The Stata windows give you all the key information about the data file you are using, recent commands, and the results of those commands. Some of them open automatically when you start Stata, while others can be opened using the Windows pull-down menu or the buttons on the tool bar. These are the Stata windows:

  Stata Results          To see recent commands and output
  Stata Command          To enter a command
  Stata Browser          To view the data file (needs to be opened)
  Stata Editor           To edit the data file (needs to be opened)
  Stata Viewer           To get help on how to use Stata
  Variables              To see a list of variables
  Review                 To see recent commands
  Stata Do-file Editor   To write or edit a program (needs to be opened)

Each is described in more detail below.

Stata Results
This window (with the black background) shows all recent commands, output, error messages, and help info. In Stata 7, the text is color-coded as follows:

  white    Stata commands
  green    General information and the frame and headings of output tables
  blue     Commands or error messages that can be clicked on for more information (in Stata 7 only)
  yellow   Numbers in output tables
  red      Error messages

The slide bar on the right side can be used to look at earlier results that are not on the screen. However, unlike SPSS, the Stata Results window does not keep all output generated. It will keep about 300-600 lines of the most recent output, deleting earlier output. If you want to store output in a file, you must use the log command.

Stata Command
This window (small with a white background) allows you to enter commands which will be executed as soon as you press the Return key. You can also use recent commands again by using the PageUp key (to go to the previous command) and PageDown key (to go to the next command).

Review
This window (with a white background) lists all the recent commands. If you click on one of the commands, it appears in the Stata Command window and can be executed by pressing the Return key. The slide bar can be used to view earlier commands.

Variables
This window (tall with a white background) lists all the variables that exist in memory. When you open a Stata data file, it lists the variables in the file. If you create new variables, they will be added to the list of variables. If you delete variables, they will be removed from the list. You can insert a variable into the Stata Command window by clicking on it in the Variables window.

Stata Viewer
This window provides help on Stata commands and rules. To open the Stata Viewer window, you can click on Windows/Viewer or click on the eye button on the tool bar. To use the Stata Viewer window, type a command in the space at the top and the Viewer will give you the purpose and rules for using that command, along with some examples. Any blue text in the Viewer can be clicked on for more information about that command.

Stata Browser
This window shows all the data in memory. The Stata Browser does not appear automatically when you start Stata. The only way to open the Browser is to click on the button with a table and magnifying glass. Unlike SPSS, when the Stata Browser is open, you cannot execute any commands, either from the Stata Command window or from the Do-file Editor. In addition, you cannot change any of the data. You can, however, sort the data or hide certain variables using buttons at the top of the Stata Browser window.

Stata Editor
This window is exactly like the Stata Browser window except that you can change the data. We do not recommend using this window because you will have no record of the changes you make in the data. It is better to correct errors in the data using a Do-file program that can be saved.

Do-file Editor
This window allows you to write, edit, save, and execute a Stata program. A Stata program (or Do-file) is simply a set of Stata commands written by the user. The advantage of using the Do-file Editor rather than the Stata Command window is that the Do-file allows you to save, revise, and rerun a set of commands. The Do-file Editor can be opened by clicking on Windows/Do-file Editor or by clicking on the envelope button. Exploratory analysis of the data can be done with the Stata Command window, but any serious data analysis should be carried out using the Do-file Editor, not the Stata Command window.

With so many windows, it is sometimes difficult to fit them all on the screen. You can adjust the size and position of each window the way you like it and then save the layout by clicking on Prefs/Save Windowing Preferences. Each time you open Stata, the windows will be arranged according to your preferred layout.

Table 2 (below) provides a list of Stata commands that will be introduced in Module 1:

Table 2. Stata commands and topics covered in Module 1

3. Exploring data
   clear, use, describe, list, summarize, tabulate, tab1, tab2, save, help, by prefix, if suffix, in suffix, set more, set mem, set scrollbufsize

4. Storing commands and output
   Stata Do-file editor, log, exporting tables

5. Creating new variables
   gen, replace, operators, functions, recode, tab …, generate, xtile

6. Making tables
   labeling data, #delimit, tabulate … summarize, tabstat, table, using weights

7. Graphs
   graph, histogram, scatterplot, bar, xlabel, ylabel, connect( ), symbol( )

8. Modifying files
   drop, drop if, keep, keep if, sort, compress, collapse, merge, append, fillin, reshape

9. Programming
   creating and using macros, creating and using loops, matrix algebra

10. Regression analysis
   regress, test, testparm, predict, probit, ovtest, hettest
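Because these commands are normally collected in a Do-file rather than typed one at a time, a minimal sketch of a Do-file may help. It is an illustration only: the log file name explore.log is made up, and the sketch assumes hhexp98n.dta is in the current folder.

```stata
* explore.do: a first look at the household summary file
* (explore.log is a hypothetical name chosen for this sketch)
log using explore.log, replace

use hhexp98n, clear
describe
summarize age hhsize
tabulate reg8 sex, row

log close
```

Running the file from the Do-file Editor executes every line in order and leaves a copy of the output in the log file, so the analysis can be revised and rerun later.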

SECTION 3: EXPLORING DATA FILES

This section covers commands that are used for preliminary exploration of data in a file. The following commands and topics are described:

clear, use, describe, list, summarize, tabulate, by prefix, if suffix, in suffix, save, help, set mem, set more, set scrollbufsize

clear

The clear command deletes all files, variables, and labels from the memory to get ready to use a new data file. You can clear memory using the clear command or by using the clear subcommand as part of the use command (see the use command). This command does not delete any data saved to the hard drive.

use

This command opens an existing Stata data file. It is equivalent to “get” in SPSS. The syntax is:

use filename [, clear ]                                        opens new file
use [varlist] [if exp] [in range] using filename [, clear ]    opens selected parts of file

• If there is no extension, Stata assumes it is .dta.
• If there is no path, Stata assumes it is in the current folder.
• You can use a path name such as: use d:\data\scr02a
• If the path name has spaces, you must use double quotes: use "d:\my data\scr02a"
• You can open selected variables of a file using a variable list.
• You can open selected records of a file using if or in.

Here are some examples of the use command:

use hhexp98n                              opens the file hhexp98n.dta for analysis
use hhexp98n if reg7 == 1                 opens data from one region
use hhexp98n in 5/25                      opens records 5 through 25 of file
use househol age sex using hhexp98n       opens 3 variables from hhexp98n file
use d:\data\VLSS\scr01a2                  opens the file scr01a2.dta in the specified folder
use "d:\data files\VLSS 98\scr01a2"       use quotation marks if there are spaces
use scr01a2, clear                        clears memory before opening the new file
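The pieces of the second syntax can be combined in one command. The sketch below is an illustration, assuming hhexp98n.dta is in the current folder; it loads four variables for one region only and clears memory first.

```stata
* Load selected variables and selected records in a single step;
* note that if comes before using, and clear goes after the comma
use househol age sex reg7 if reg7 == 1 using hhexp98n, clear

* Confirm what was loaded
describe
```

Loading only the variables and records you need keeps the file in memory small, which matters when memory is limited.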

describe

This command provides a brief description of the data file. You can use “des” and Stata will understand. The output includes:

• the number of variables
• the number of observations (records)
• the size of the file
• the list of variables and their characteristics

Example 1 gives the description of the summary file from the VLSS called hhexp98n.

Example 1: Using “describe” to show information about a data file

. use hhexp98n
. describe

Contains data from hhexp98n.dta
  obs:         5,999
 vars:            67                          6 Jan 2000 08:43
 size:     1,553,741 (98.5% of memory free)
-------------------------------------------------------------------------------
                storage  display    value
variable name   type     format     label      variable label
-------------------------------------------------------------------------------
househol        long     %12.0g                household code
year            float    %9.0g                 Year of interview
month           float    %9.0g                 Month of interview
vlssmphs        byte     %8.0g                 1 if vlss, 2 if mphs source
sex             byte     %8.0g                 Gender of HH.head (1:M.2:F)
age             int      %8.0g                 Age of household head
agegroup        byte     %8.0g      agegroup   age group of HH.head
comped98        float    %9.0g      diploma    completed diploma HH.head
educyr98        float    %9.0g                 schooling year of HH.head
farm            float    %9.0g      loaiho     Type of HH (1:farm, 0:nonfarm)
urban98         byte     %8.0g      urban      1:urban 98, 0:rural 98
urban92         float    %9.0g      urban      1:urban92, 0:rural92
province        float    %9.0g                 Province code
reg7            int      %8.0g                 Code by 7 regions
reg8            int      %8.0g                 Code by 8 regions
reg10           int      %8.0g                 Code by 10 regions
hhsize          long     %12.0g                Household size
hhcat           float    %9.0g                 hhsize categories
wt              int      %8.0g                 sample weight
hhsizewt        float    %9.0g                 =hhsize*wt
vill            float    %9.0g                 village code
[output truncated here]

It also provides the following information on each variable in the data file:

• the variable name
• the storage type: byte is used for binary variables, int is used for integers, and float is used for continuous variables that may have decimals. To see the limits on each storage type, type “help datatypes”
• the display format indicates how it will appear in the output
• the value label is the name of a set of labels for different values
• the variable label is a name for the variable that is used in output.
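Two variations on describe are often useful. The sketch below is an illustration, assuming hhexp98n.dta is in the current folder: the first command describes a file on disk without loading it, and the last one limits the description to a few variables.

```stata
* Describe a file on disk without opening it
describe using hhexp98n

* Open the file, then describe only selected variables
use hhexp98n, clear
describe househol reg7 reg8 hhsize
```

Describing a file before opening it is a quick way to check its size and variable names before deciding which parts to load.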

list

This command lists values of variables in the data set. It is similar to “list” in SPSS. The syntax is:

list [varlist] [if exp] [in range]

With varlist, you can specify which variables’ values will be presented. If no list is specified, all variables will be listed. With if and in, you can specify which records will be listed. Here are some examples:

. list                           lists entire dataset
. list in 1/10                   lists observations 1 through 10
. list househol reg7             lists selected variables
. list househol age in 1/20      lists observations 1-20 for selected variables
. list if reg7 < 6               lists cases in regions 1 through 5

Example 2: Using “list” to look at data

. use hhexp98n
. list househol urban98 reg8 in 1/10

       househol   urban98   reg8
  1.        101     Urban      1
  2.        103     Urban      1
  3.        105     Urban      1
  4.        107     Urban      1
  5.        108     Urban      1
  6.        109     Urban      1
  7.        110     Urban      1
  8.        111     Urban      1
  9.        112     Urban      1
 10.        113     Urban      1

. list househol reg8 vill if vill==32

       househol   reg8   vill
482.       3201      7     32
483.       3203      7     32
484.       3205      7     32
485.       3206      7     32
486.       3207      7     32
487.       3208      7     32
488.       3215      7     32
489.       3216      7     32
490.       3218      7     32
491.       3221      7     32
492.       3222      7     32
493.       3223      7     32
494.       3224      7     32
495.       3225      7     32
496.       3226      7     32
497.       3227      7     32

If you are not careful with list, you will get a lot more output than you want. If Stata starts giving you more output than you really want, use the stop button (red button with an X).

summarize

The summarize command produces statistics on continuous variables like age, income, or farm size. This is like the “descriptives” command in SPSS. The syntax looks like this:

summarize [varlist] [if exp] [in range] [, detail]

By default, it produces the following statistics:
• Number of observations
• Average (or mean)
• Standard deviation
• Minimum
• Maximum

If you specify “detail”, Stata gives you additional statistics, such as
• skewness,
• kurtosis,
• the four smallest values
• the four largest values
• various percentiles.

Here are some examples:

. summarize                          gives statistics on all variables
. summarize age income               gives statistics on selected variables
. summarize age income if reg8==3    gives statistics on two variables for one region

Example 3. Using “summarize” to study continuous variables

. sum age educyr98 food ricexpd

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
         age |    5999    48.01284    13.7702         16         95
    educyr98 |    5999    7.094419   4.416092          0         22
        food |    5999         ...        ...        ...   85499.25
     ricexpd |    5999         ...        ...          0       9792

. sum age educyr98 food ricexpd if reg8==3

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
         age |     128    42.07031   11.38184         24         79
    educyr98 |     128    8.609375   3.224499          0         17
        food |     128         ...        ...       1795   13022.08
     ricexpd |     128         ...        ...        747       7344

The first example gives the statistics for the whole sample, while the second gives the statistics only for households in Region 3, the Red River Delta. Notice that household heads in the Red River Delta are somewhat younger and have more education than the national averages.
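The detail option described above can be tried on household size. This is a minimal sketch, assuming hhexp98n.dta is in the current folder; it also shows that summarize leaves its results behind in r(), so that one of them (the median, stored as r(p50)) can be displayed afterwards.

```stata
use hhexp98n, clear

* Percentiles, smallest and largest values, skewness, and kurtosis
summarize hhsize, detail

* Results of the last summarize are stored in r(); p50 is the median
display "median household size = " r(p50)
```

Note that r(p50) is only available after summarize with the detail option, and it is replaced the next time a command that stores results in r() is run.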

tabulate, tab1, tab2

These are three related commands that produce frequency tables for discrete variables. They can produce one-way frequency tables (tables with the frequency of one variable) or two-way frequency tables (tables with a row variable and a column variable). These commands are similar to the “frequencies” and “crosstabs” commands in SPSS. How do they differ?

• tabulate or tab produces a frequency table for one or two variables
• tab1 produces a one-way frequency table for each variable in the variable list
• tab2 produces all possible two-variable tables from the list of variables

You can use several options with these commands:

• all gives all the tests of association for two-way tables
• cell gives the overall percentage for two-way tables
• column gives column percentages for two-way tables
• row gives row percentages for two-way tables
• nofreq suppresses printing the frequencies
• chi2 provides the chi-squared test for two-way tables

There are many other options, including other statistical tests. For more information, type “help tabulate”.

Some examples of the tabulate commands are:

. tabulate reg7                     produces a table of frequency by region
. tabulate reg8 sex                 produces a cross-tab of frequencies by region and sex
. tabulate reg8 sex, row            produces a cross-tab by region and sex with row percentages
. tabulate reg8 sex, cell nofreq    produces a cross-tab of overall percentages by region and sex
. tab1 reg8 sex ethnic              produces three tables, a frequency table for each variable
. tab2 reg8 sex urban98             produces three tables, a cross-tab of each pair of variables

Example 4. Using “tabulate” on categorical variables

. tab farm

 Type of HH |
   (1:farm, |
 0:nonfarm) |      Freq.     Percent        Cum.
------------+-----------------------------------
   non farm |       2561       42.69       42.69
       farm |       3438       57.31      100.00
------------+-----------------------------------
      Total |       5999      100.00

. tab sex farm

 Gender of | Type of HH (1:farm,
   HH.head |     0:nonfarm)
 (1:M.2:F) |  non farm       farm |     Total
-----------+----------------------+----------
         1 |      1673       2702 |      4375
         2 |       888        736 |      1624
-----------+----------------------+----------
     Total |      2561       3438 |      5999

. tab sex farm, row col chi2

 Gender of | Type of HH (1:farm,
   HH.head |     0:nonfarm)
 (1:M.2:F) |  non farm       farm |     Total
-----------+----------------------+----------
         1 |      1673       2702 |      4375
           |     38.24      61.76 |    100.00
           |     65.33      78.59 |     72.93
-----------+----------------------+----------
         2 |       888        736 |      1624
           |     54.68      45.32 |    100.00
           |     34.67      21.41 |     27.07
-----------+----------------------+----------
     Total |      2561       3438 |      5999
           |     42.69      57.31 |    100.00
           |    100.00     100.00 |    100.00

          Pearson chi2(1) = 130.8340   Pr = 0.000
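The percentages in the last table of Example 4 can be checked by hand with the display command, using the counts from the table. This is only an arithmetic sketch of what the row, col, and cell options compute for the first cell (male-headed, non-farm households).

```stata
* Row percentage: share of male-headed households that are non-farm
display %5.2f 100*1673/4375

* Column percentage: share of non-farm households that are male-headed
display %5.2f 100*1673/2561

* Cell percentage: share of all households in this cell
display %5.2f 100*1673/5999
```

The first two results reproduce the 38.24 and 65.33 shown under the count of 1673; the third is the figure the cell option would report.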

• In one-way tables, Stata gives the count, the percentage, and the cumulative percentage (see first example in box).
• In two-way tables, Stata gives the count only, unless you ask for other statistics (see second example in box).
• col, row, and cell request Stata to include percentages in two-way tables.

by

This prefix goes before a command and asks Stata to repeat the command for each value of a variable. There is no equivalent command in SPSS. The general syntax is:

by varlist: command

The data must first be sorted by the variables in varlist, as in Example 5. Some examples of the by prefix are:

by sex: sum hhsize          for each sex of head of household, give stats on household size
by reg8: tab urban98        for each region, give the frequency table of urban/rural

Example 5. Using the “by” prefix

. sort sex
. by sex: sum hhsize

-> sex = 1

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
      hhsize |    4375    5.058286   1.852724          1         16

-> sex = 2

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
      hhsize |    1624    3.927956   1.982762          1         19

save

This command saves the data in memory. It is equivalent to “save outfile” in SPSS. The syntax is:

save [filename] [, replace ]        saves file

• If you do not give a file name, it will use the current name.
• You cannot write over an old file unless you specify “replace” (unlike in SPSS).
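A short sketch of how save and replace work together may be useful. It is an illustration only: the file name hhexp98n_reg1 is made up, and keep if (covered in Section 8) is used here just to change the data before saving.

```stata
use hhexp98n, clear

* Keep one region only, then save under a new (hypothetical) name
keep if reg7 == 1
save hhexp98n_reg1

* Saving to the same name again fails unless replace is specified
save hhexp98n_reg1, replace
```

Saving the subset under a new name leaves the original hhexp98n.dta untouched, which is a safe habit when modifying survey files.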

if
We have already seen several examples of using if to select certain records in carrying out a command. This is similar to the “process if” command in SPSS, except that in Stata it is not considered a separate command. The syntax is:

    command if exp

Examples include:

    . list hhid region income if income>12000         lists data if income is above 12000
    . tab region if income>10000 & income<20000       frequency table of region if income is in range
    . summarize income if region==1 | region==2       statistics on income for regions 1 and 2

Note that “if” statements always use == to test equality, not a single =. Also note that | indicates “or” while & indicates “and”.

in
We have also used in to select records based on the case number. The syntax is:

    command in exp

For example:

    . list in 10                list observation number 10
    . summarize in 10/20        summarize observations 10-20

help
The help command gives you information about any Stata command or topic.

    help [command]

For example,

    . help tabulate             gives a description of the tabulate command
    . help summarize            gives a description of the summarize command

set
The set command is used to control the Stata operating environment. There are 22 set commands, but many of them are rarely used. Some of the more common ones are:

set mem XXm sets memory for Stata at XX megabytes. If you get the error message “No room to add more observations”, this means the datafile is too big for the memory allocated to Stata. This command increases the memory allocated to Stata. You cannot set XX greater than the RAM memory in the computer.

set more off/on is used to turn on and off the continuous scrolling of output. Use “set more off” if you are not interested in the intermediate output, only the final result. Use “set more on” if you need to be able to read the early output. Remember that the Results Window only stores the most recent 300-600 lines of output. Unlike SPSS, Stata does not automatically store all of your output.
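To see how if and in select records, the commands above can be practiced on any dataset. The sketch below assumes the “auto” example dataset shipped with recent Stata releases; its variables (make, price, rep78, foreign) stand in for the VLSS variables and are not part of the course files.

```stata
* Sketch only: "auto" is Stata's shipped example dataset, used here as a stand-in
sysuse auto, clear
set more off

* "in" selects by observation number
list make price in 1/5

* "if" selects by a condition; == tests equality
summarize price if foreign==1

* & means "and", | means "or"
tabulate rep78 if price>5000 & price<10000
count if rep78==1 | rep78==2
```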

set scrollbufsize XX is used to change the amount of output that Stata will store. XX is expressed in bytes. The default is 32,000 (32k) and the maximum is 500,000 (500k).

Type “help set” for a list of other settings in Stata.

Exercises for exploring the VLSS
This section includes some questions that you can answer using the VLSS files provided on your computer and the commands described in this section (see note 1). Remember two tricks to make it easier to fix your mistakes:
• You can use PageUp to retrieve the most recent command.
• You can click on a variable in the Variable window to paste it into the Command window.

Summary file
The file hhexp98n contains summary variables calculated from various other data files. It is at the household level. Open the file by entering “use hhexp98n” in the Command window and pressing Return.
1. How many variables and how many records are in hhexp98n? (Answer: describe)
2. What percentage of households have female heads? (Answer: tab sex)
3. Is there a statistically significant difference between the percentage of female-headed households in urban and rural areas? (use the chi2 option)
4. What percentage of urban households are considered farm households? (use “if urban98==1” option)
5. What percentage of farm households are in urban areas?
6. How does the percentage of female-headed households vary by region?
7. What is the average size of a household?
8. What is the average size of an urban household in the Red River Delta? (reg8==1 refers to RRD)
9. How does household size vary across expenditure quintiles? (use quint98b for quintiles; you will need to sort and then use by)

Household members
The file scr01a2 contains information about each member of the household. It is at the individual level (each record is a person). You can answer the following questions using this file:
1. What percentage of the population is female? (Answer: tab s1aq02)
2. What percentage of the population over 80 years old is female? (use “tab … if …”)
3. What percentage of the population under 5 is female?

Note 1: To get the correct answers, we should use the sample weights which are described later. The weights compensate for the fact that some types of households are over-represented in the VLSS sample and others are under-represented. For example, urban households make up 29 percent of the sample, but only 24 percent of the population. Sampling weights are described in Section 6.

4. What percentage of women are married?
5. What percentage of the women over the age of 20 are married?
6. Does this percentage vary between urban and rural areas?
7. What percentage of the spouses of family members live in the household?
8. Is the percentage of spouses away greater for men or for women?

Housing characteristics
The file scr06b1 contains information about the characteristics of houses. Open the file and use “des” to obtain a list of variables. You can answer the following questions with this file.
1) What is the average value of the house, according to the respondent? (Answer: sum s6bq12)
2) What are the most important sources of water? (use “tab”)
3) Of those households that think their water is safe before boiling, what percentage boil their water before drinking?
4) What is the average value of the house among those who get their water from an inside private tap?
5) What is the average value of the house among those who get their water from a hand-dug well?
6) What is the average value of the house for each type of source of drinking water? (you will need to sort by drinking water type and then use the “by” option)

Food crops
The file scr09b2 contains information on production of food crops other than rice. The data are at the crop level, meaning that each record represents one crop for one household. Only crops that are grown by each household are included in the file. The crop codes are in the questionnaire on pages before and after the questions. The following questions can be answered from this file.
1. How many households in the sample grow maize? (Answer: tab s9b2cc)
2. Among maize growers, what was the average area with maize? (Answer: sum s9b2q03 if s9b2cc==8)
3. Among maize growers, what was the average amount of maize harvested, sold, and given to livestock?
4. Among farmers with more than 1 hectare of maize, what was the average amount of maize harvested, sold, and given to livestock? (you will need an “if” statement that selects both for maize and for area greater than 10,000 m2)
5. What is the average amount harvested and sold for each food crop other than rice? (you will need to sort and use “by s9b2cc”)
6. Farmers were asked what percentage of the normal harvest they got this year, so 100% means normal. What was the average response?
7. How much are the post-harvest losses in maize relative to the size of the harvest? Tomatoes?

SECTION 4: STORING COMMANDS AND OUTPUT

In this section, we discuss how to store commands and output for later use. First, we describe how to store commands in a program (Stata calls it a Do-file), how to edit the program, and how to run it. Second, we present different ways of saving and using the output generated by Stata. The following topics are covered:

    using the Do-file Editor
    log using
    log off
    log on
    log close
    set logtype
    moving tables from Stata to Word and Excel

Using the Do-file Editor
As mentioned in Section 2, the Do-file Editor allows you to store a program (a set of commands) so that you can edit it and execute it later. Why use the Do-file Editor?
• It makes it easier to check and fix errors,
• It allows you to run the commands later,
• It lets you show others how you got your result, and
• It allows you to collaborate with others on the analysis.

In general, any time you are running more than 10 commands to get a result, it is easier and safer to use a Do-file to store the commands.

To open the Do-file Editor, you can click on Windows/Do-file Editor or click on the envelope on the Tool Bar. Within the Do-file Editor, there is a menu bar and tool bar buttons to carry out a variety of editing functions. The menu bar is similar to the one in Microsoft Word:

    File/New            to open a new, blank Do-file
    File/Open           to open an existing Do-file
    File/Save           to save the current Do-file
    File/Save as        to save the current Do-file under a new name
    File/Insert file    to insert another file into the current one
    File/Print          to print the Do-file
    File/Close          to close the Do-file
    Edit/Undo           to undo the last command
    Edit/Cut            to delete or move the marked text in the Do-file
    Edit/Copy           to copy the marked text in the Do-file
    Edit/Paste          to insert the copied or cut text into the Do-file
    Search/Find         to find a word or phrase in the Do-file
    Search/Replace      to find and replace a word or phrase in the Do-file
    Tools/Do            to execute all the commands or the marked commands in the Do-file
    Tools/Run           to execute all the commands or the marked commands in the Do-file without showing any output in the Stata Results window

The tool bar buttons can be used to carry out some of these tasks more quickly. For example, there are buttons for File/New, File/Open, File/Print, Search/Find, Edit/Cut, Edit/Copy, Edit/Paste, Edit/Undo, Do, and Run. Probably the button you will use most is the second-to-last one that shows a

page with text on it. This is the “Do” button for executing the program or the marked part of the program.

Finally, the keyboard commands may be even quicker to use than the buttons. The most useful keyboard commands are:

    Control-O    Open file
    Control-S    Save file
    Control-C    Copy
    Control-X    Cut
    Control-V    Paste
    Control-Z    Undo
    Control-F    Find
    Control-H    Find and Replace

To run the commands in a Do-file, you can click on the Do button (the second-to-last one) or click on Tools/Do. If you want to run one or just a few commands rather than the whole file, mark the commands and click on the Do button. You do not have to mark the whole command, but at least one character in the command must be marked in order for the command to be executed (unlike SPSS, it is not enough to have the cursor on a command).

Although layout is a matter of personal preference, it may be useful to have the Stata Results window and the other windows on one side of the screen and the Do-file Editor window on the other. This makes it easy to switch back and forth. When you arrange the windows the way you like, you can save the layout by clicking Prefs/Save Windowing Preferences. Each time you open Stata, it will use your chosen layout.

Saving the output
As mentioned in Section 2, the Stata Results window does not keep all the output you generate. It only stores about 300-600 lines, and when it is full, it begins to delete the old results as you add new results. You can increase the amount of memory allocated to the Stata Results window (see “set scrollbufsize” in Section 3), but even this will probably not be enough for a long session with Stata. Thus, we need to use log to save the output.

There are four ways to control the log operations.
1. You can click on File/Log to get four options: Begin (log using), Close, Suspend (log off), and Resume (log on).
2. You can use the log button on the tool bar. It looks like a scroll.
3. You can use “log” commands in the Stata Command window.
4. You can use “log” commands in the Stata Do-file Editor.

In this section, we describe the commands, which can be used in the Stata Command window or in a do-file (program).

log using
This command creates a file with a copy of all the commands and output from Stata. The first time you open a log, you must give a name to the new file to be created. The syntax is:

    log using filename [, append replace [ text | smcl ] ]

where filename is the name you give the new file. The options are:

    append     adds the output to an existing file
    replace    replaces an existing file with the output
    text       tells Stata to create the log file in text (ASCII) format
    smcl       tells Stata to create the log file in SMCL format

Here are some examples:

    log using temp22                         saves output to a file called temp22
    log using temp20, replace                saves output to an existing file, replacing content
    log using regoutput, append              saves output to an existing file, adding to contents
    log using "d:\my data\myfile.txt"        saves output in specified file in specified folder

Several points should be remembered in using this command:
• if you use an existing file name but do not say “replace” or “append”, Stata will give an error message that the file already exists
• log files in text format can be opened with Wordpad, Notepad, the DOS editor, or any word processor, but the file does not have any formatting
• smcl files have formatting (bold, colors, etc) but can only be opened with Stata
• smcl format is the default

log off
This command temporarily turns off the logging of output, so that any subsequent output is not copied to the log file. This is useful if you want to save some of the output but not all. “Log off” only works after a “log using” command.

log on
This command is used to restart the logging, copying any new output to the log file that was already defined. “Log on” only works after a “log using” and a “log off” command.

log close
This command is used to turn off the logging and save the file. How are “log off” and “log close” different? “Log off” allows you to turn it back on easily with “log on,” continuing to use the same log file. After a “log close”, however, the only way to start logging again is with “log using.”

set logtype text
This command tells Stata to always save the log files in text (ASCII) format. It is the same as adding the “text” subcommand to every “log using” command, but it is easier. If you prefer text format log files (as I do), this is the best way to make sure all the log files are in this format.

set logtype smcl
This command tells Stata to always save log files in SMCL format. It is the same as adding the “smcl” subcommand to every “log using” command.

Example 6 shows how the log command can be used. First, the log is opened using the filename “temp1.” Since I did not specify a folder, it saved the file to the default folder which (in this case) was my desktop. The results from “tab urban98” are saved in the log file. Then the log is turned off, so the results of “sum hhsize” are not logged. Third, the log is turned on so the results from “sum age” are logged. Finally, the log is closed.
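The same logging sequence can also be stored in a do-file so that it can be rerun later. The following is a minimal sketch, not part of the course materials: it assumes the “auto” example dataset shipped with recent Stata releases, and “mylog” is a hypothetical log name.

```stata
* Sketch of a do-file that logs only part of its output
* Assumptions: shipped "auto" example data; "mylog" is a hypothetical file name
capture log close              // close any log left open; ignore the error if none
sysuse auto, clear

log using mylog, text replace  // open a text-format log, overwriting an old one
tabulate foreign               // logged
log off
summarize mpg                  // not logged
log on
summarize price                // logged
log close
```

Starting the do-file with “capture log close” is a common precaution, because “log using” stops with an error if a log is already open.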

Example 6: Using “log” to save output

    . log using temp1, text
    ----------------------------------------------------------------------------
          log:  D:\Documents and Settings\NICHOLAS\Desktop\temp1.log
     log type:  text
    opened on:  2 Aug 2002, 12:58:52

    . tab urban98

       1:urban |
           98, |
       0:rural |
            98 |      Freq.     Percent        Cum.
    ------------+-----------------------------------
          Rural |       4269       71.16       71.16
          Urban |       1730       28.84      100.00
    ------------+-----------------------------------
          Total |       5999      100.00

    . log off
          log:  D:\Documents and Settings\NICHOLAS\Desktop\temp1.log
     log type:  text
    paused on:  2 Aug 2002, 12:59:26
    ----------------------------------------------------------------------------

    . sum hhsize

        Variable |     Obs        Mean   Std. Dev.       Min        Max
    -------------+-----------------------------------------------------
          hhsize |    5999    4.752292   1.954292          1         19

    . log on
    ----------------------------------------------------------------------------
          log:  D:\Documents and Settings\NICHOLAS\Desktop\temp1.log
     log type:  text
   resumed on:  2 Aug 2002, 12:59:48

    . sum age

        Variable |     Obs        Mean   Std. Dev.       Min        Max
    -------------+-----------------------------------------------------
             age |    5999    48.01284    13.7702         16         95

    . log close
          log:  D:\Documents and Settings\NICHOLAS\Desktop\temp1.log
     log type:  text
    closed on:  2 Aug 2002, 13:00:00
    ----------------------------------------------------------------------------

Using the output
The easiest way to look at a log file is with File/Log/View, but there are several other ways to do it. You can:
• type “view [filename]” in the Stata Command window
• click on the Viewer button (it looks like an eye) and type “view [filename]”
• if it is in text format, you can open the Stata Do-file Editor (Windows/Do-file Editor) and open the log file with the Editor (File/Open)
• if it is in text format, you can open WordPad (Start/Programs/Accessories/WordPad) and then open the log file with WordPad (File/Open)

To print output from the Stata Results window, you can click File/Print Results.

To print output from a log file,
1) Open the log file with Stata Viewer (File/Log/View)
2) Click on File/Print Viewer

Unfortunately, it is not easy to copy Stata output to other software such as word processors and spreadsheets. It is best to copy tables from the Stata Viewer or from the Stata Results window using Edit/Copy Table.

To move tables from a log file to an Excel table,
1) Open the log file with Stata Viewer (File/Log/View)
2) Copy the table with Edit/Table Copy or Control-Shift C
3) Paste the table into Excel

To move tables from a log file to a Word table,
1) Open the log file with Stata Viewer (File/Log/View)
2) Copy the table with Edit/Table Copy or Control-Shift C
3) Paste the table into Word with Control-V
4) Mark the table and then click Table/Insert/Table

To move tables from the Stata Results window to Word or Excel, follow the above procedures starting with step #2.

However, one problem with these procedures is that there has to be a clear division between columns. If there is a heading that overlaps two columns, the two columns will be merged. To avoid this, you can exclude the heading when you copy the table.

Exercises for logging
1) Use the file hhexp98n and open a log file called “results” to save output. Then do a frequency table of region by urban. Close the log file.
2) Copy the table into a Word table.
3) Copy the table into an Excel table.

SECTION 5: CREATING NEW VARIABLES

In the previous sections, we described how to explore the data using existing variables. In this section, we discuss how to create new variables. When new variables are created, they are in memory and they will appear in the Data Browser, but they will not be saved on the hard-disk unless you use the save command. In this section, we will cover the following commands and options.

    generate
    replace
    tab …, generate
    operators
    functions
    recode
    xtile

generate
This command is used to create a new variable. It is similar to “compute” in SPSS. The syntax is:

    generate newvar = exp [if exp]

where “exp” is an expression like “price*quant” or “1000*kg”. Several points about this command:
• Unlike “compute” in SPSS, generate cannot be used to change the definition of an existing variable. If you want to change an existing variable, you need to use “replace.”
• You can use “gen” as an abbreviation for “generate”
• If the expression is an equality or inequality, the variable will take the value 0 if the expression is false and 1 if it is true
• If you use “if”, the new variable will have missing values when the “if” statement is false

For example,

    generate age2 = age*age                   create age squared variable
    gen yield = quant/area if area>0          create new yield variable if area is positive
    gen price = value/quant if quant>0        create new price variable if quant is positive
    gen highprice = (price>1000)              creates a dummy variable equal to 1 for high prices

replace
This command is used to change the definition of an existing variable. The syntax is the same:

    replace oldvar = exp [if exp] [in exp]

Some points to remember:
• Replace cannot be used to create a new variable. Stata will give an error message if the variable does not exist.
• There is no abbreviation for “replace.” Stata wants to make sure you really want to change the variable.
• If you use the “if” option, then the old values will be retained when the “if” statement is false
• You can use the period (.) to represent missing values

For example,

    replace price = avgprice if price > 100000      replaces high values with an average price
    replace income = . if income<=0                 replaces zero or negative income with missing value
    replace age = 25 in 1007                        replaces age with 25 in observation #1007

tabulate … generate
This command is useful for creating a set of dummy variables (variables with a value of 0 or 1) depending on the value of an existing categorical variable. The syntax is:

    tabulate oldvariable, generate(newvariable)

The old variable is a categorical (or discrete) variable. The new variables will take the form newvariable1, newvariable2, newvariable3, etc. Newvariablex will be equal to 1 if oldvariable=x and 0 otherwise.

It is easier to explain with an example. Reg8 is a variable that takes values of 1-8 for the different regions of Vietnam. We can create eight dummy variables as follows:

    tab reg8, gen(region)

This creates 8 new variables:

    region1=1 if reg8=1 and 0 otherwise
    region2=1 if reg8=2 and 0 otherwise
    …
    region8=1 if reg8=8 and 0 otherwise

In Example 7, notice that there are 1175 households in region 1 (Red River Delta) and the same number of households for which region1=1.

Example 7. Using “tab…, gen” to create dummy variables

    . tab reg8, gen(region)

      Code by 8 |
        regions |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |       1175       19.59       19.59
              2 |        731       12.19       31.77
              3 |        128        2.13       33.91
              4 |        708       11.80       45.71
              5 |        628       10.47       56.18
              6 |        276        4.60       60.78
              7 |       1241       20.69       81.46
              8 |       1112       18.54      100.00
    ------------+-----------------------------------
          Total |       5999      100.00

    . tab region1

        reg8==1 |
         1.0000 |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |       4824       80.41       80.41
              1 |       1175       19.59      100.00
    ------------+-----------------------------------
          Total |       5999      100.00
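The generate, replace, and tab…gen commands described above can be tried together on any dataset. This is a sketch under stated assumptions: it uses the “auto” example dataset shipped with recent Stata releases, and the new variable names (weight2, pricekg, expensive, rep) are invented for illustration.

```stata
* Sketch only: shipped "auto" example data; all new variable names are hypothetical
sysuse auto, clear

* generate creates new variables; "if" leaves the rest missing
generate weight2 = weight*weight
gen pricekg = price/weight if weight>0

* A relational expression gives a 0/1 dummy variable.
* Caution: Stata treats missing as larger than any number,
* so set the dummy to missing where price is missing.
gen expensive = (price>10000)
replace expensive = . if price==.

* tab ..., generate() creates one dummy per category: rep1, rep2, ..., rep5
tabulate rep78, generate(rep)
tabulate rep1
```

Observations where rep78 is missing get missing values in all of the rep dummies, not zeros.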

egen
This is an extended version of “generate” to create a new variable by aggregating the existing data. It is a powerful and useful command that does not exist in SPSS. To do the same thing in SPSS, you would need to create a new file with “aggregate” and merge it with the original file using “match files.” The syntax is:

    egen newvar = fcn(arguments) [if exp] [in range] , by(var)

where
    newvar      is the new variable to be created
    fcn         is one of numerous functions such as:
                count( )    max( )    min( )    mean( )
                median( )   rank( )   sd( )     sum( )
    argument    is normally just a variable
    var         in the by() subcommand must be a categorical variable

Suppose you want to estimate the demand for rice using household data. You calculate a price variable using household expenditure data, but some households do not buy rice. You can replace the missing values with provincial average prices as follows:

    egen avgprice = mean(price), by(province)

Here are some other examples:

    egen avg = mean(yield)                      creates variable of average yield over entire sample
    egen avg2 = median(income), by(sex)         creates variable of median income for each sex
    egen regprod = sum(prod), by(region)        creates variable of total production for each region

Example 8: Using egen to calculate averages

    . egen avgexp = mean(rlpcex2), by(vill)
    . gen aboveavg = (rlpcex2>avgexp)
    . list househol vill rlpcex2 avgexp aboveavg in 40/50

          househol   vill    rlpcex2     avgexp   aboveavg
     40.       305      3          …   6643.441          1
     41.       315      3          …   6643.441          1
     42.       306      3          …   6643.441          0
     43.       301      3          …   6643.441          1
     44.       310      3          …   6643.441          0
     45.       311      3          …   6643.441          1
     46.       405      4          …   8231.103          0
     47.       409      4          …   8231.103          1
     48.       407      4          …   8231.103          0
     49.       403      4          …   8231.103          0
     50.       401      4          …   8231.103          0

    (The row-by-row values of the rlpcex2 column are not legible in this copy.)
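The pattern in Example 8 (a group average repeated on every record, then a dummy comparing each record to its group) can be reproduced on other data. The sketch below assumes the “auto” example dataset shipped with recent Stata releases and uses “foreign” as the grouping variable in place of the village; the variable names avgprice and aboveavg are chosen for illustration.

```stata
* Sketch only: shipped "auto" example data; foreign stands in for the village variable
sysuse auto, clear

* Group mean of price, repeated on every car in the group
egen avgprice = mean(price), by(foreign)

* Dummy equal to 1 if the car costs more than its group average
gen aboveavg = (price > avgprice)

list make foreign price avgprice aboveavg in 1/5
```

Because egen writes the group average onto every record, no separate aggregate file or merge step is needed.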

In Example 8, we want to know which households have per capita expenditure (rlpcex2) above the village average. First, we calculate the average expenditure for each village with the “egen” command. Then we create a dummy variable based on the expression (rlpcex2 > avgexp). The list output shows how the village average is repeated for every household in the village and confirms that the dummy variable is correctly calculated.

operators
This is not a Stata command, but a topic related to creating new variables. Most of the operators are obvious, but some are not. Unlike SPSS, you cannot use words like “or”, “and”, “eq”, or “gt”.

    Arithmetic              Relational                   Logical
    +   addition            >    greater than            ~   not
    -   subtraction         <    less than               |   or
    *   multiplication      >=   more than or equal      &   and
    /   division            <=   less than or equal
    ^   power               ==   equal
                            ~=   not equal
                            !=   not equal

The most difficult rule to remember is when to use = and when to use ==.
• Use a single equal symbol (=) when defining a variable.
• Use a double equal symbol (==) when you are testing an equality, such as in an “if” statement and when creating a dummy variable.

Here are some examples to illustrate the use of these operators. Suppose you want to create a dummy variable indicating households in the Red River Delta. One way is to write:

    generate RRD = 0
    replace RRD = 1 if reg8==1

Or you can get exactly the same result with just one command:

    generate RRD = (reg8==1)

If the expression in parentheses is true, the value is set to 1. If it is false, the value is 0.

Logical operators are useful if you want to impose more than one condition. For example, suppose you want to create a dummy variable for farmers in the Red River Delta. In other words, a household must be both in the Red River Delta and be a farmer to be selected.

    gen RRDfarm = 0
    replace RRDfarm = 1 if reg8==1 & farm==1

or an easier way to do this would be:

    gen RRDfarm = (reg8==1 & farm==1)

Or suppose you wanted to create a dummy variable for households in the two deltas. This means a household can be in the Red River Delta or it can be in the Mekong River Delta to be selected. This variable can be created with:

    gen delta = 0
    replace delta = 1 if reg8==1 | reg8==8

or by one command:

    gen delta = (reg8==1 | reg8==8)

You can also combine conditions using parentheses. Suppose you wanted a dummy variable that indicates if a household is a poor farmer in one of the deltas. We will define poor as in the bottom 20 percent and use the variable quint98.

    gen PDF = ((reg8==1 | reg8==8) & farm==1 & quint98==1)

functions
Again, this is not a command, but a topic that is related to creating new variables. Here is a list of some of the more commonly-used functions.

    abs(x)          computes the absolute value of x
    exp(x)          calculates e to the x power.
    ln(x)           computes the natural logarithm of x
    log(x)          is a synonym for ln(x), the natural logarithm.
    log10(x)        computes the log base 10 of x.
    sqrt(x)         computes the square root of x.
    invnorm(p)      provides the inverse cumulative normal; invnorm(norm(z)) = z.
    normden(z)      provides the standard normal density.
    normden(z,s)    provides the normal density. normden(z,s) = normden(z)/s if s>0 and s not missing; otherwise, the result is missing.
    norm(z)         provides the cumulative standard normal.
    group(x)        creates a categorical variable that divides the data into x as nearly equal-sized subsamples as possible, numbering the first group 1, the second group 2, etc. It uses the current order of the data.
    int(x)          gives the integer obtained by truncating x.
    round(x,y)      gives x rounded into units of y.

Other functions can be found by typing “help functions” in the Stata Command window.
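The operators and a few of the functions listed above can be combined in generate statements. The lines below are a sketch only, assuming the “auto” example dataset shipped with recent Stata releases; every new variable name is hypothetical.

```stata
* Sketch only: shipped "auto" example data; all new variable names are hypothetical
sysuse auto, clear

* & requires both conditions, | requires at least one
gen cheapforeign = (foreign==1 & price<5000)
gen extreme = (mpg<15 | mpg>35)

* Parentheses group conditions, as in the poor-delta-farmer example
gen pick = ((rep78==4 | rep78==5) & foreign==1 & price<6000)

* A few of the functions listed above
gen lnprice  = ln(price)
gen rootwt   = sqrt(weight)
gen price100 = round(price, 100)
gen tons     = int(weight/2000)
```

Note that = assigns a value, while == inside the parentheses tests a condition.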

recode
This command changes the values of a categorical variable according to the rules specified. It is like the “recode” command in SPSS except that in Stata you do not use parentheses. The syntax is:

    recode varname old=new old=new … [if exp] [in range]

Here are some examples:

    recode x 1=2              changes all values of x=1 to x=2
    recode x 1=2 3=4          changes 1 to 2 and 3 to 4
    recode x 1=2 2=1          exchanges the values 1 and 2 in x
    recode x 1=2 *=3          changes 1 in x to 2 and all other values to 3
    recode x 1/5=2            changes 1 through 5 in x to 2
    recode x 1 3 4 5 = 6      changes 1, 3, 4 and 5 to 6
    recode x .=9              changes missing to 9
    recode x 9=.              changes 9 to missing

Notice that you can use some special symbols in the rules:

    *       means all other values
    .       means missing values
    x/y     means all values from x to y
    x y     means x and y

In Example 9, we create a new variable that indicates whether a household lives in the north, center, or south of Vietnam, using the reg8 variable.

Example 9. Using recode to define a new variable

    . tab reg8

      Code by 8 |
        regions |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |       1175       19.59       19.59
              2 |        731       12.19       31.77
              3 |        128        2.13       33.91
              4 |        708       11.80       45.71
              5 |        628       10.47       56.18
              6 |        276        4.60       60.78
              7 |       1241       20.69       81.46
              8 |       1112       18.54      100.00
    ------------+-----------------------------------
          Total |       5999      100.00

    . gen reg3 = reg8
    . recode reg3 1/3=1 4/6=2 7/8=3
    (4824 changes made)

    . tab reg3

           reg3 |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |       2034       33.91       33.91
              2 |       1612       26.87       60.78
              3 |       2353       39.22      100.00
    ------------+-----------------------------------
          Total |       5999      100.00
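The approach of Example 9 (copy the variable, then recode the copy so the original is preserved) can be sketched on other data as follows. This assumes the “auto” example dataset shipped with recent Stata releases, and rep3 is a hypothetical name. The rules are written in parentheses here because, as far as I know, that is the form documented in current Stata releases; the form without parentheses described above is the older syntax.

```stata
* Sketch only: shipped "auto" example data; rep3 is a hypothetical variable name
sysuse auto, clear

* Copy first, so that the original variable is not overwritten
gen rep3 = rep78

* Collapse five repair categories into three
recode rep3 (1/2=1) (3=2) (4/5=3)

tabulate rep78 rep3, missing
```

The two-way table of the old against the new variable is a quick check that each old category landed in the intended new one.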

xtile
This command creates a new variable that indicates which category a record falls into, when the sample is sorted by an existing variable and divided into n groups of equal size. It is probably easier to explain with examples. xtile can be used to create a variable that indicates which income quintile a household belongs to, which decile in terms of farm size, or which tercile in terms of coffee production. The syntax is:

    xtile newvar = variable [if exp] [in range] , nq(#)

where
    newvar      is the new categorical variable created
    variable    is the existing variable used to create the quantile (e.g. income, farm size)
    #           is the number of different categories (e.g. 5 for quintiles, 3 for terciles)

For example,

    xtile incquint = income, nq(5)
    xtile farmdec = farmsize, nq(10)
    xtile coffeeter = coffarea, nq(3)

Suppose we want to create a variable indicating the tercile of rice expenditure per capita.

Example 10. Using xtile to create categories

    . gen ricepc = ricexpd/hhsize
    . xtile riceterc = ricepc, nq(3)
    . tab riceterc

              3 |
      quantiles |
      of ricepc |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |       2000       33.34       33.34
              2 |       2003       33.39       66.73
              3 |       1996       33.27      100.00
    ------------+-----------------------------------
          Total |       5999      100.00

    . tab riceterc farm, col nof

             3 |  Type of HH (1:farm,
     quantiles |      0:nonfarm)
     of ricepc |  non farm       farm |     Total
    -----------+----------------------+----------
             1 |     46.70      23.39 |     33.34
             2 |     31.47      34.82 |     33.39
             3 |     21.83      41.80 |     33.27
    -----------+----------------------+----------
         Total |    100.00     100.00 |    100.00
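As a further hedged illustration, the sketch below creates price quintiles with xtile and then summarizes price within each quintile. It assumes the “auto” example dataset shipped with recent Stata releases; pricequint is a hypothetical variable name.

```stata
* Sketch only: shipped "auto" example data; pricequint is a hypothetical name
sysuse auto, clear

* Assign each car to one of five groups of nearly equal size, ordered by price
xtile pricequint = price, nq(5)
tabulate pricequint

* Check the ranges: each quintile should cover a higher price band than the last
sort pricequint
by pricequint: summarize price
```

With a sample size that is not a multiple of five, or with tied values, the groups cannot be exactly equal in size, which is also why the terciles in Example 10 differ slightly.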

Exercises for generating new variables
1) Use the file hhexp98n. Create a variable called “reg2” which indicates whether a household is in the north or the south of Vietnam based on reg8. Then do a frequency table of the new variable.
2) Using the same file, create a dummy variable called “rurfarm” that is equal to 1 if the household is a rural farm household and 0 otherwise. Create another variable called “upland” that is 1 if the household is in the Northwest, Northeast, or Central Highlands.
3) Using the same file, create a variable called “hhquint” that indicates the quintile of household size. Then do a frequency table on the new variable.
4) Create a new variable “avgexp” which is equal to the regional average of expenditure (rlpcex2) (hint: use egen). Then calculate a new variable equal to the difference between the household expenditure and the regional average expenditure.
5) Use the file scr01a2. Create a variable “hhsize” which is equal to the total number of household members. (use egen)
6) Using the same file, create a new variable “notmarry” which is 1 if the person is single, divorced, or separated and 0 otherwise.
7) Create a set of dummy variables called “relatxx” based on the relationship of the person to the household head. For example, relat01 is a dummy for being the head, relat02 is a dummy for being the spouse, relat03 for a child, and so on. (hint: use tab…gen)

SECTION 6: MAKING TABLES TO DESCRIBE DATA

In Section 3, we described some basic commands for exploring data. In this section, we introduce some more powerful and flexible commands for generating results from survey data. We begin with an explanation of how to label data in Stata. Then we describe three commands for generating tables. Finally, we will describe the use of sampling weights in analyzing survey data. These are the topics and commands covered in this section:

    label variable
    label define
    label values
    #delimit
    tabulate … summarize
    tabstat
    table
    using weights

label variable

This command is used to attach labels to variables in order to make the output easier to understand. For example, we know that reg8 indicates the number of the region where a household lives and that rlpcex2 means real per capita expenditure. But other people using our tables may not know this. So we may want to label the variables as follows:

    label variable reg8 Region
    label variable rlpcex2 "Per capita expenditure"

• You can use the abbreviation "label var"
• If there are spaces in the label, you must use double quotation marks, not single
• If there are no spaces, quotation marks are optional.
• The limit is 80 characters for a label, but any labels over 30 characters will probably not look good in a table.
• This command is like "variable label" in SPSS except that you can only label one variable per command and Stata uses double quotation marks.

label define

This command gives a name to a set of value labels. For example, instead of numbering the regions, we can assign a label to each region. Instead of numbering the different sources of water, we can give them labels. The syntax is:

    label define lblname # "label" # "label" # "label" [, add modify]

where
    lblname     is the name given to the set of value labels
    #           are the value numbers
    "label"     are the value labels
    add         means that you want to add these value labels to the existing set
    modify      means that you want to change these values in the existing set

Note that:
• You can use the abbreviation "label def"
• The double quotation marks are only necessary if there are spaces in the labels

• Stata will not let you define an existing label unless you say "modify" or "add"
• This command is similar to "value label" in SPSS except that in Stata you give the labels a name and later attach it to the variable, while in SPSS you attach it to the variable in the same command.

label values

This command attaches a named set of value labels to a categorical variable. The syntax is:

    label values varname lblname

where
    varname     is the categorical variable which will get the labels
    lblname     is a set of labels that have already been defined by label define

Here are some examples of labeling values in Stata.

    label variable yield "Yield (tons/hectare)"      gives label to variable yield
    label define yesno 0 no 1 yes                    defines set of labels called yesno
    label values electricity yesno                   attaches those labels to variable called electricity
    label define yesno 3 "maybe", add                adds new value label to existing set
    label define yesno 3 "perhaps", modify           modifies existing value label
    label define reglbl 1 RRD 2 NW 3 NE 4 NCC 5 SCC 6 CH 7 NES 8 MRD
    label values reg8 reglbl
    label define reglbl 7 "Southeast" 8 "Mekong Delta", modify

Some additional commands that may be useful in labeling:

    label dir           to request a list of existing label names
    label list          to request a list of all the existing value labels
    label drop          to delete one or more labels
    label save using    to save label definitions as a Do-file
    label data          to give a label to a data file

More information is available by typing "help label" in the Stata Command window.

Example 11 shows a frequency table with and without labels. The first table has no labels. Then a label var command is used to define the label "Region", a label define command creates a set of labels, and label values attaches those labels to the reg8 variable. The second table has both the variable label (in the upper left corner of the table) and the labels for the regions. Finally, we show how label list can be used to display the labels assigned to a label name.

Example 11. Using label to make tables more readable

. tab reg8

           reg8 |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |       1175       19.59       19.59
              2 |        731       12.19       31.77
              3 |        128        2.13       33.91
              4 |        708       11.80       45.71
              5 |        628       10.47       56.18
              6 |        276        4.60       60.78
              7 |       1241       20.69       81.46
              8 |       1112       18.54      100.00
    ------------+-----------------------------------
          Total |       5999      100.00

. label var reg8 Region
. label define reglbl 1 "Red River Delta" 2 "Northwest" 3 "Northeast" 4 "N.C Coast" 5 "S.C. Coast" 6 "Central Highlands" 7 "Southeast" 8 "Mekong Delta"
. label values reg8 reglbl
. tab reg8

               Region |      Freq.     Percent        Cum.
    ------------------+-----------------------------------
      Red River Delta |       1175       19.59       19.59
            Northwest |        731       12.19       31.77
            Northeast |        128        2.13       33.91
            N.C Coast |        708       11.80       45.71
           S.C. Coast |        628       10.47       56.18
    Central Highlands |        276        4.60       60.78
            Southeast |       1241       20.69       81.46
         Mekong Delta |       1112       18.54      100.00
    ------------------+-----------------------------------
                Total |       5999      100.00

. lab list reglbl
reglbl:
           1 Red River Delta
           2 Northwest
           3 Northeast
           4 N.C Coast
           5 S.C. Coast
           6 Central Highlands
           7 Southeast
           8 Mekong Delta

#delimit

In Example 11, you may have noticed that the label define command for the regions was too long to fit on one line. This is inconvenient when you are writing the command because, whether you are in the Do-file Editor or the Stata Command window, you have to scroll over to read the end of the command. The #delimit command solves this problem by allowing you to change the symbol used to indicate the end of the command. The default is a hard-return, called "cr" by Stata. The alternative is the semi-colon.

    #delimit ;      makes the semi-colon the indicator of the end of the command
    #delimit cr     makes the hard-return the indicator of the end of the command

Some facts about #delimit:
• It can only be used in a Do-file. It does not work in the Stata Command window.
• The semi-colon is useful if you have long commands
• The hard-return is more convenient if you have short commands

For example, the regional labels could be entered like this:

    label var reg7 "Region"
    #delimit ;
    label def reglb 1 "North Uplands" 2 "Red River Delta" 3 "NC Coast" 4 "SC Coast"
        5 "Central Highlands" 6 "Southeast" 7 "Mekong Delta" ;
    #delimit cr
    lab val reg7 reglb

An alternative way of dealing with long lines is:

    label def reglb 1 "North Uplands" /*
        */ 2 "Red River Delta" /*
        */ 3 "NC Coast" /*
        */ 4 "SC Coast" /*
        */ 5 "Central Highlands" /*
        */ 6 "Southeast" /*
        */ 7 "Mekong Delta"

The #delimit command and the /* symbols can be used with any command, but they are often used with value labels.

tabulate … summarize

This command creates one- and two-way tables that summarize continuous variables. The command tabulate by itself gives frequencies and percentages in each cell (cross-tabulations). With the "summarize" option, we can put means and other statistics of a continuous variable in each cell. The syntax is:

    tabulate varname1 varname2 [if exp] [in range], summarize(varname3) options

where
    varname1    is a categorical row variable
    varname2    is a categorical column variable (optional)
    varname3    is the continuous variable summarized in each cell
    options     can be used to tell Stata which statistics you want

Some notes regarding this command:
• The default statistics are the mean, the standard deviation, and the frequency.
• You can specify which statistics with options "means" "standard" and "freq"
• You can use the abbreviation "tab…sum( )"
• This command is similar to the Stata command "by var1: sum var3" except that the "tab…sum" output is more attractive and "tab…sum" allows two categorical variables
• This command is also similar to the SPSS command "means var3 by var1"
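The labeling commands and tab…sum can be tried together on a small scale. This is a minimal sketch, assuming the bundled auto example data (via sysuse) stands in for the household file; pricecat and pricelbl are made-up names used only for illustration.

```stata
* Sketch only: bundled auto data stands in for the VLSS household file.
sysuse auto, clear

* Build a categorical variable to label: 1 = low, 2 = middle, 3 = high price
xtile pricecat = price, nq(3)

* Variable label, then a named set of value labels, then attach the set
label variable pricecat "Price tercile"
label define pricelbl 1 "Low price" 2 "Middle price" 3 "High price"
label values pricecat pricelbl
label list pricelbl

* One-way table: mean, standard deviation, and frequency of mpg by tercile
tabulate pricecat, summarize(mpg)

* Two-way table showing only the means
tabulate pricecat foreign, summarize(mpg) means
```

Because the value labels are attached before the tables are made, both tables show "Low price", "Middle price", and "High price" instead of 1, 2, and 3.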

Some examples:

    tab reg8, sum(rlpcex2)          gives the mean, std deviation, and frequency of per capita expenditure for each region
    tab urban98, sum(hhsize) mean   gives the mean household size for urban and rural households
    tab farm urban98, sum(food)     gives the mean, std deviation, and frequency of food in each cell of a 2x2 table of farmers/nonfarmers and urban/rural

In Example 12, we give the output for three "tab…sum" commands.
• The first table is a one-way table (just one categorical variable) showing the mean, standard deviation, and frequency of per capita expenditure for each expenditure quintile.
• In the second table, we use the "mean" option so only mean per capita expenditure is shown.
• In the third table, we add a second categorical variable (urban98) making it a two-way table. Although we could have requested all the default statistics in the two-way table, it makes the table difficult to read so we do not advise it.

Example 12: Using tab…sum

. tab quint98, sum(rlpcex2)

                |  Summary of Expenditure per capita
      exp quint |        Mean   Std. Dev.       Freq.
    ------------+------------------------------------
              1 |   1180.4918   256.92477         917
              2 |   1738.2957   163.67476        1012
              3 |   2248.0415   209.72735        1158
              4 |   3080.1508   357.28157        1316
              5 |   6571.6628   3597.4445        1596
    ------------+------------------------------------
          Total |   3331.6804   2768.0741        5999

. tab quint98, sum(rlpcex2) mean

                | Summary of
                | Expenditure
                | per capita
      exp quint |        Mean
    ------------+------------
              1 |   1180.4918
              2 |   1738.2957
              3 |   2248.0415
              4 |   3080.1508
              5 |   6571.6628
    ------------+------------
          Total |   3331.6804

. tab quint98 urban98, sum(rlpcex2) mean

    Means of B.M&Reg price adj. pc exp

               | 1:urban 98, 0:rural 98
     exp quint |      Rural      Urban |     Total
    -----------+----------------------+----------
             1 |  1175.8359  1279.9701 | 1180.4918
             2 |  1738.5241  1735.8682 | 1738.2957
             3 |  2242.8277  2278.9809 | 2248.0415
             4 |  3056.2124   3138.038 | 3080.1508
             5 |  5260.4178  7253.5102 | 6571.6628
    -----------+----------------------+----------
         Total |  2477.9412  5438.3927 | 3331.6804

tabstat

This command gives summary statistics for a set of continuous variables for each value of a categorical variable. The syntax is:

    tabstat varlist [if exp] [in range], stat(statname [...]) by(varname)

where
    varlist     is a list of continuous variables
    statname    is a type of statistic
    varname     is a categorical variable

Some facts about this command:
• The default statistic is the mean.
• Optional statistics subcommands include mean, sum, max, min, range, sd (standard deviation), var (variance), skewness, kurtosis, median, and pn (nth percentile).
• Without the by() option, tabstat is like "summarize" except that it allows you to specify the list of statistics to be displayed.
• With the by() option, tabstat is like "tabulate … summarize" except that tabstat is more flexible in the statistics and format
• It is very similar to the SPSS command "means".

Examples

    tabstat farmsize hhsize, stats(mean max min)     gives mean, max, and min of farmsize & hhsize
    tabstat farmsize hhsize, by(reg8)                gives mean of two variables for each region
    tabstat farmsize, stats(median) by(reg8)         gives the median farmsize for each region

Example 13. Using tabstat to create tables

. tabstat rlpcex2, stats(p25 p50 p75 mean) by(reg8)

Summary for variables: rlpcex2
     by categories of: reg8 (Region)

                reg8 |       p25       p50       p75      mean
    -----------------+----------------------------------------
     Red River Delta |   1874.…         …         …  3392.663
           Northwest |   1460.…         …         …  2361.977
           Northeast |   1234.…         …         …  1831.876
           N.C Coast |   1568.…         …         …  2604.028
          S.C. Coast |   1813.…         …         …  3098.375
    Central Highland |   1158.…         …         …  2114.033
           Southeast |   2553.…         …         …  5034.884
        Mekong Delta |   1795.…         …         …  3073.823
    -----------------+----------------------------------------
               Total |   1772.…         …         …   3331.68
    ----------------------------------------------------------
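The two uses of tabstat described above (with and without by( )) can be tried as follows. This is a minimal sketch, assuming the bundled auto example data (via sysuse) rather than the VLSS files, so the numbers it prints have nothing to do with Vietnam.

```stata
* Sketch only: bundled auto data stands in for the VLSS household file.
sysuse auto, clear

* Without by(): like summarize, but with a chosen list of statistics
tabstat price mpg, stats(mean median min max)

* With by(): quartiles and mean of price for domestic and foreign cars
tabstat price, stats(p25 p50 p75 mean) by(foreign)

* format() controls the display, here whole numbers
tabstat price, stats(p25 p50 p75 mean) by(foreign) format(%9.0f)
```

The last command has the same structure as the one in Example 13, with foreign in place of reg8 and price in place of rlpcex2.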

table

This command creates a wide variety of tables. It is probably the most flexible and useful of all the table commands in Stata. The syntax is:

    table rowvar colvar [if exp] [in range], c(clist) [row col]

where
    rowvar      is the categorical row variable
    colvar      is the categorical column variable
    clist       is a list of statistics and variables
    row         is an option to include a summary row
    col         is an option to include a summary column

Some useful facts about this command:
• The default statistic is the frequency.
• Optional statistics are mean, sd, sum, rawsum (unweighted), count, max, min, median, and pn (nth percentile).
• The c( ) is short for contents of each cell.
• Like tab, it can be used to create one- and two-way frequency tables, but table cannot do percentages
• Like tab…sum, it can be used to calculate basic stats for each value of a categorical variable
• Its advantage over tab…sum is that it can do more statistics and it can take more than one continuous variable
• Like tabstat, it can be used to calculate advanced stats for each value of a categorical variable
• Its advantage over tabstat is that it can do two (and more) way tables, but its disadvantage is that it has fewer statistics.
• It is similar to "table" in SPSS, but easier to learn and less flexible in formatting

Here are some examples:

    . table reg8, row                                      table of frequencies by region with total row
    . table reg8, c(mean income)                           table of average income by region
    . table reg8, c(mean yield sd yield median yield)      table of yield statistics by region
    . table reg8, c(mean yield) format(%9.2f)              table of average yields by region with format
    . table reg8 sex, c(mean yield)                        table of average yield by region and sex
    . table reg8 sex, c(mean income mean yield)            table of avg yield & income by region & sex

Some output from table commands is shown in Example 14. The first table is a two-way table of average household size by region and urban/rural (the sample did not include any urban areas in the Central Highlands). Usually, in a two-way table, it is useful to have both row and column summaries. The second table is the same except that the format option has been added to reduce the size of the numbers. The option format(%4.1f) means fixed format with 4 digits and one to the right of the decimal point. The third table gives the average per capita expenditure for farm and non-farm households in each region. It uses format(%6.0f), which expresses expenditure as an integer. Also note that it has a summary column, but no summary row.

Example 14. Using table

. table reg8 urban98, c(mean hhsize) row col

    ---------------------------------------------------
                      |    1:urban 98, 0:rural 98
               Region |     Rural      Urban      Total
    ------------------+--------------------------------
      Red River Delta |  4.045977  4.0459184  4.0459574
            Northwest | 5.0815972  3.6580645  4.7797538
            Northeast | 5.3541667    4.09375  5.0390625
            N.C Coast | 4.6666667  4.6759259  4.6680791
           S.C. Coast | 4.7352941  5.0136364  4.8328025
    Central Highlands | 5.8478261             5.8478261
            Southeast | 5.1028571   4.780037  4.9621273
         Mekong Delta | 5.1373494  4.3971631  4.9496403
                      |
                Total | 4.8702272  4.4612717   4.752292
    ---------------------------------------------------

. table reg8 urban98, c(mean hhsize) row col format(%4.1f)

    ---------------------------------------
                      | 1:urban 98, 0:rural
                      |          98
               Region | Rural  Urban  Total
    ------------------+--------------------
      Red River Delta |   4.0    4.0    4.0
            Northwest |   5.1    3.7    4.8
            Northeast |   5.4    4.1    5.0
            N.C Coast |   4.7    4.7    4.7
           S.C. Coast |   4.7    5.0    4.8
    Central Highlands |   5.8           5.8
            Southeast |   5.1    4.8    5.0
         Mekong Delta |   5.1    4.4    4.9
                      |
                Total |   4.9    4.5    4.8
    ---------------------------------------

. table reg8 farm, c(mean rlpcex2) col format(%6.0f)

    ------------------------------------------------
                      |     Type of HH (1:farm,
                      |         0:nonfarm)
               Region | non farm      farm     Total
    ------------------+-----------------------------
      Red River Delta |     4738      2410      3393
            Northwest |     3534      1962      2362
            Northeast |     2528      1779      1832
            N.C Coast |     3466      2238      2604
           S.C. Coast |     4142      2180      3098
    Central Highlands |     2737      2044      2114
            Southeast |     5914      3316      5035
         Mekong Delta |     3657      2569      3074
    ------------------------------------------------

weights

What are sampling weights?

Sampling weights are used to compensate for under- or overrepresenting certain households in a sample and to allow extrapolation of the sample results to the population. The basic principle is that the sampling weight is the inverse of the probability of selection. Let's take a simple example:

• Suppose you wanted to estimate the total population of Hanoi by interviewing randomly 25% of the households. In your sample, there are h households and H people. Your estimate of the total population of Hanoi would be 4*H.
• Similarly, if you interview 10% of the households in Da Nang and find d households and D people, your estimate of the population of Da Nang would be 10*D.
• If you want an estimate of the population of the two cities together, you would calculate 4*H+10*D.
• If you wanted to estimate the average household size of the two cities, you would have to divide the estimated total population (4*H+10*D) by the estimated total number of households (4*h+10*d).

Because of clustering and sampling, virtually all random-sample surveys must use weights to make estimates that are valid for the whole population. Furthermore, the calculation of sums, averages, and percentages must take into account the sampling weights.

Sampling weights in the VLSS

The calculation of the sampling weights in the VLSS is much more complicated than the example given above, but the principle is the same. The GSO estimated the probability that each household would be selected and then calculated the sampling weight as the inverse of that probability. In the VLSS, the sampling weight is in hhexp98n.dta and the variable name is "wt" (see note 2). We can use the table command to generate some statistics about the VLSS weights.

. table reg8 urban98, c(mean wt) row col format(%7.0f)

    ---------------------------------------
                      | 1:urban 98, 0:rural
                      |          98
               Region | Rural  Urban  Total
    ------------------+--------------------
      Red River Delta |  3482   2439   3134
            Northwest |  3441   2330   3206
            Northeast |  3692   2087   3291
            N.C Coast |  3434   1803   3185
           S.C. Coast |  2281   1841   2127
    Central Highlands |  1320          1320
            Southeast |  1806   2208   1981
         Mekong Delta |  3093   2482   2938
                      |
                Total |  2869   2242   2689
    ---------------------------------------

The average weight is about 2689, meaning that each household in the VLSS sample represents (on average) about 2689 households in Vietnam. The new 2001 Vietnam Household Living Standards Survey will have a sample of about 75,000, so the weights will be much smaller, probably around 230.

Note 2: This weight is used for calculating averages in which every household has equal weight. Sometimes, we want to give each person an equal weight, such as when we want to calculate the percentage of people that are in households below the poverty line. For these calculations, it is better to use the variable wthhsize as a weight. This variable is simply wt*hhsize.

like tab.70 with the weights. the weights are turned on.Module 1: Using Stata to Analyze Survey Data IFPRI-IDS Poverty Mapping Project Using sampling weights in Stata The calculation of weighted sums and weighted averages would be very tedious. Here we will focus on pweights and fweights3.75 without the weights and 4.6 million. are weights that denote the inverse of the probability that the observation is included due to the sampling design. sum(rlpcex2) tabstat hhsize [fw=wt]. The syntax for using weights is: command . 4) iweights. In SPSS. or sampling weights. Notice that the number of observations in the second is 1. [weighttype=varname] . 3 N. or frequency weights. c(mean age) gives the weighted frequencies in each region gives the weighted mean household size gives table of weighted mean expenditure by sex of head of household gives the weighted average household size for urban and rural households gives the weighted mean age of heads by region Example 15 shows the effect of weights. This also means that using the raw. are weights that indicate the number of duplicated observations. you turn on the weights and weights are used in all calculations until you turn it off.. For a number of commands.. Stata does not allow pweight. The box also shows that weighted and unweighted means are different.. sum. 3) aweights. The average household size is 4. Stata allows four kinds of weights: 1) fweights. Minot Page 1-42 .. by(urban98) table reg8 [pw=wt]. Stata is different in that you tell Stata which commands should use weights. The first table gives the unweighted percentage of urban and rural households in each department. are weights that indicate the "importance" of the observation in some vague sense. or importance weights. Notice that the urban households represent almost 29 percent of the sample but just 24 percent of population. are weights that are inversely proportional to the variance of an observation. 
unweighted results would give too much weight to urban households relative to their share of the population. In the second table. or analytic weights. In the case of the VLSS. and tabstat. 2) pweights. but fweight gives the correct percentages and means. but fortunately survey software such as SPSS and Stata do this for us. This means that urban households were slightly over-represented in the original VLSS sample (you can verify in the table above that urban weights are slightly smaller). Type “help weights” in the Stata Command window for more information. This represents the extrapolated number of households. we will generally be using the following syntax: command … [pw=wt] … Here are some examples: tab reg8 [fw=wt] sum hhsize [fw=wt] tab sex [fw=wt].

Example 15. Using weights in generating tables

. tab reg8 urban98, nof row

                      | 1:urban 98, 0:rural
                      |          98
               Region |     Rural      Urban |     Total
    ------------------+----------------------+----------
      Red River Delta |     66.64      33.36 |    100.00
            Northwest |     78.80      21.20 |    100.00
            Northeast |     75.00      25.00 |    100.00
            N.C Coast |     84.75      15.25 |    100.00
           S.C. Coast |     64.97      35.03 |    100.00
    Central Highlands |    100.00       0.00 |    100.00
            Southeast |     56.41      43.59 |    100.00
         Mekong Delta |     74.64      25.36 |    100.00
    ------------------+----------------------+----------
                Total |     71.16      28.84 |    100.00

. tab reg8 urban98 [fw=wt], nof row

                      | 1:urban 98, 0:rural
                      |          98
               Region |     Rural      Urban |     Total
    ------------------+----------------------+----------
      Red River Delta |     74.03      25.97 |    100.00
            Northwest |     84.59      15.41 |    100.00
            Northeast |     84.15      15.85 |    100.00
            N.C Coast |     91.37       8.63 |    100.00
           S.C. Coast |     69.68      30.32 |    100.00
    Central Highlands |    100.00       0.00 |    100.00
            Southeast |     51.41      48.59 |    100.00
         Mekong Delta |     78.58      21.42 |    100.00
    ------------------+----------------------+----------
                Total |     75.95      24.05 |    100.00

. sum hhsize

        Variable |     Obs        Mean   Std. Dev.       Min        Max
    -------------+-----------------------------------------------------
          hhsize |    5999    4.752292   1.954292          1         19

. sum hhsize [fw=wt]

        Variable |     Obs        Mean   Std. Dev.       Min        Max
    -------------+-----------------------------------------------------
          hhsize | 1.6e+07    4.700221   1.908688          1         19
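The Hanoi and Da Nang arithmetic from the start of this discussion of weights can be checked on a toy data set. This is a minimal sketch with made-up numbers: three sampled households in Hanoi (weight 4, because 25% were sampled) and two in Da Nang (weight 10, because 10% were sampled). The variable wtpeople is a hypothetical name, and the last command assumes a Stata version that has the mean command (Stata 9 or later), which accepts pweights.

```stata
* Sketch only: five invented households, not VLSS data.
clear
input str8 city hhsize wt
"Hanoi"   3  4
"Hanoi"   5  4
"Hanoi"   4  4
"Da Nang" 6 10
"Da Nang" 4 10
end

* Unweighted mean household size: (3+5+4+6+4)/5 = 4.4
summarize hhsize

* Weighted mean: (4*12 + 10*10) / (4*3 + 10*2) = 148/32 = 4.625
summarize hhsize [fw=wt]

* Estimated total population: sum of wt*hhsize = 4*12 + 10*10 = 148
generate wtpeople = wt*hhsize
summarize wtpeople
display "estimated population: " r(sum)

* The same weighted mean with sampling weights
mean hhsize [pw=wt]
```

Here H=12 and h=3 for Hanoi, and D=10 and d=2 for Da Nang, so the weighted mean is (4*H+10*D)/(4*h+10*d). The weighted summarize reports 32 observations, the extrapolated number of households, in the same way that Example 15 reports 1.6e+07.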

SECTION 7: PRESENTING DATA WITH GRAPHS

This section provides a brief introduction to creating graphs. In Stata, all graphs are made with the graph command, but there are 8 types of charts and numerous subcommands for controlling the type and format of graph. In this section, we focus on four types of graph and a few options. These are the subcommands covered in this section:

    graph
    histogram
    twoway
    bar
    pie
    matrix
    xlabel
    ylabel
    connect( )
    symbol( )

graph

This command generates numerous types of graphs and diagrams. The syntax is:

    graph [varlist] [if exp] [in range], graphtype options

where
    varlist     is the list of variables to graph
    graphtype   is the type of graph
    options     are commands to control the look of the graph

The eight graph types are:

    histogram   Bar chart based on frequency
    oneway      Scatterplot with one variable
    twoway      Scatterplot with two variables
    matrix      Matrix of two-way scatterplot graphs
    box         Box-and-whisker plot
    star        Star chart
    bar         Bar chart of means or sums
    pie         Pie chart

The default graph type depends on the number of variables specified:
• The default graph type is histogram if only one variable is specified.
• The default graph type is two-way scatterplot if two or more variables are specified.

There are too many options to describe here, but we describe how to make some of the more common graphs. Some options are common to many graph types:

    title("text")   specifies the title to use on the graph
    b2("text")      specifies title on X axis (b for bottom)
    l2("text")      specifies title on Y axis (l for left)
    xlabel          uses "round" values to label x axis
    ylabel          uses "round" values to label y axis
    by(var1)        repeat graph for each value of var1

Some options for histograms:

    bin(#)      specifies that the histogram will have # bars
    freq        label Y axis in terms of frequency
    percent     label Y axis in terms of percent
    normal      draws a normal curve with the mean and SD of the variable

Some options for two-way scatterplots:

    connect( )  to specify how points are connected
    symbol( )   to specify what the markers look like

Some options for bar charts:

    means       graphs means of variables given
    stack       stack the bars for each variable rather than putting them side by side

Here are some examples of the graph command:

    graph x                             histogram of x
    graph x, bin(5) xlabel ylabel       histogram with 5 bars and rounded axis labels
    graph y1 y2 x                       scatter plot of y1 and y2 against x
    graph y x, by(region)               scatter plots of y against x for each region
    graph a b c, bar                    graph sums of a, b, and c as bars
    graph a b c, bar means              graph means of a, b, and c as bars

Example 16 shows the result of the command graph ricexpd hhsize, xlabel ylabel. It was inserted into Word by clicking Edit/Copy Graph in Stata and then Control-V in Word.

Example 16. Two-way scatterplot graph

In Example 17, a histogram was created with the command:

    graph rlpcex2 if rlpcex2<20000, xlabel ylabel normal bin(20)

Example 17. Histogram of per capita expenditure in Vietnam

In Example 18, the data were sorted by reg8, then the graph was created with:

    graph rlpcex2, bar means by(reg8) ylabel

Example 18. Bar chart of per capita expenditure by region
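The graph syntax in this section is that of the Stata version used in the course. In Stata 8 and later the graph command was redesigned, and, as far as we know, the older syntax remains available under the name graph7 (check "help graph7" in your version). The following is a minimal sketch of roughly equivalent commands in the newer syntax, assuming the bundled auto example data (via sysuse) in place of the VLSS file.

```stata
* Sketch only: newer graph syntax (Stata 8 and later), bundled auto data.
sysuse auto, clear

* Histogram with 20 bars, percent on the Y axis, and a normal curve
histogram price if price < 20000, bin(20) percent normal

* Two-way scatterplot, repeated for domestic and foreign cars
twoway scatter mpg weight, by(foreign)

* Bar chart of mean price for each category
graph bar (mean) price, over(foreign) title("Mean price by origin")
```

The first command corresponds to Example 17 and the last to Example 18; in the newer syntax the grouping variable of a bar chart goes in over( ) and the data do not need to be sorted first.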


SECTION 8: MODIFYING DATA FILES

This section describes a number of commands that are used to modify and combine data files in Stata. We begin with five simple commands and then move to five more complex ones.

    rename
    drop
    keep
    sort
    compress
    collapse
    merge
    append
    reshape
    fillin

rename

This command renames variables. The syntax and an example:

    rename oldname newname
    rename s1aq06y age

drop

This command deletes records or variables. Examples are:

    drop if age>140         deletes records in which age is greater than 140
    drop if area==.         deletes records in which area is missing
    drop temp1 temp2        deletes variables temp1 and temp2

keep

This command deletes everything but specified observations or variables. Examples include:

    keep if age <= 140              keeps only records in which age is 140 or under
    keep househol age rlpcex2       keeps only variables househol, age, and rlpcex2, deleting others

sort

This command sorts the records in the file according to the value of specified variables. Examples are:

    sort reg8 househol      sorts data file by reg8 and within each region by househol ID
    sort urban98            sorts by the dummy variable urban98

compress

This command reduces the size of the file by changing the data storage types. It will not make any changes that would cause Stata to lose data. It is normally used without options or arguments.
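The five simple commands can be tried in sequence. This is a minimal sketch, assuming the bundled auto example data (via sysuse) rather than the VLSS files; the new variable name repair is made up for illustration.

```stata
* Sketch only: bundled auto data stands in for a VLSS file.
sysuse auto, clear

* rename oldname newname
rename rep78 repair

* delete the records in which the repair record is missing
drop if repair == .

* keep only these five variables, deleting the others
keep make price mpg repair foreign

* sort by origin, and by price within each origin
sort foreign price

* shrink the storage types without losing data
compress

describe
```

After the drop command the file has 69 of the original 74 records, because five cars have a missing repair record.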


collapse

This command is used to create a new data file by aggregating the existing one. It allows you to change the level of the data file. Person-level data can be collapsed to the household level to calculate the size of the household. Crop-level data can be collapsed to the household level to calculate the value of agricultural production per household. The syntax is:

    collapse (stat) varlist1 (stat) varlist2, by(varlist3)

where
    stat        refers to one of the statistics
    varlist1    are the variables to be aggregated using the first statistic
    varlist2    are the variables to be aggregated using the second statistic
    varlist3    are the categorical variables which define the aggregation

Some points about the collapse command:
• The default statistic is mean
• Optional statistics are mean, sum, rawsum, count, max, min, median, and pn (the nth percentile, where n is between 1 and 100)
• The output file will have one record for each value of varlist3 in the by( ) option
• If no by( ) option is given, then the data will be collapsed to one record
• This is similar to "aggregate" in SPSS except Stata does not require you to define a new name for the aggregated variable (by default, it uses the old variable name).
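The points above can be sketched on a small scale. This example assumes the bundled auto example data (via sysuse); avgprice, medprice, and ncars are made-up names. New names are optional in general, but they are needed here because the same variable (price) is aggregated with three different statistics.

```stata
* Sketch only: bundled auto data, 74 car-level records.
sysuse auto, clear
count

* One record per value of foreign: mean price, median price, and number of cars
collapse (mean) avgprice=price (median) medprice=price (count) ncars=price, by(foreign)

* The file now has two records, one for domestic and one for foreign cars
list
```

The collapsed file replaces the data in memory, so save the original file first if it has unsaved changes.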

Examples of the collapse command:

    collapse age educ income, by(province)              creates a dataset of provincial means of age, education, and income
    collapse (median) income, by(province)              creates a dataset of provincial medians of income
    collapse (mean) age (median) income, by(reg8)       creates a dataset of regional means of age and regional medians of income
    collapse (mean) age educ (median) income            creates a dataset with overall means & medians

Example 19. Using collapse to calculate household size

. use scr01a2
. sum s1aq05y

        Variable |     Obs        Mean   Std. Dev.       Min        Max
    -------------+-----------------------------------------------------
         s1aq05y |   28069    1969.846   19.95489       1899       1998

. collapse (count) idcode if s1aq11==1, by(househol)
. gen hhsize = idcode
. sum hhsize

        Variable |     Obs        Mean   Std. Dev.       Min        Max
    -------------+-----------------------------------------------------
          hhsize |    6002    4.751583    1.95443          1         19


In Example 19, we use collapse to calculate household size from the person-level data. The first sum command shows that there are 28,069 records in the original person-level file. After the collapse, the second sum command indicates that there are just 6002 records, one per household. It also shows that the average household size (unweighted) is 4.75, the same figure we found in the hhexp98n file in Section 3.

merge

This command combines two files with different variables into one file. Until now, all the commands we have worked with used just one file. The VLSS has over one hundred files, however, and often we would like to combine data from different files. For example,
• to calculate expenditure, we need to combine the files for food expenditure and non-food expenditure
• to calculate school attendance rates, we need to combine the file with age and the file with school attendance
• to examine the relationship between the value of the house and housing characteristics, we need to combine several files
• to calculate the value of agricultural production, we need to combine the files for rice, other food crops, annual industrial crops, and permanent industrial crops.

Files can be combined vertically (top to bottom). In this case, the two files have different records and are linked by having the same variables. In the files below, the first file has crops 1-10, while the second file has crops 11-20. They can be combined with append as described later.

Two files before append

First file                              Second file
hhid  crop  area  quant  value          hhid  crop  area  quant  value
101     1                               101    16
101     4                               102    12
102     1                               102    13
102     7                               103    11
103     2                               103    16
                                        103    19

One file after append

hhid  crop  area  quant  value
101     1
101     4
102     1
102     7
103     2
101    16
102    12
102    13
103    11
103    16
103    19

Files can be combined horizontally (side to side). In this case, the two files have different variables and are linked by having the same observations (person, household, crop, etc.). The files below have different variables but the same records (households). The command merge will combine records with the same household identification number (hhid). This would allow an analysis of how housing value (in the second file) varies according to expenditure quintile (in the first).


Two files before merge

First file                                   Second file
hhid  region  urban  exppc  quint  farm      hhid  housetype  water  elect  value
101                                          101
102                                          102
103                                          103
201                                          201
202                                          202
203                                          203
204                                          204

One file after merge

hhid  region  urban  exppc  quint  farm  housetype  water  elect  value
101
102
103
201
202
203
204

The syntax for the merge command is:

merge [varlist] using filename

where
varlist     is the list of variables in common
filename    is the data file that the current data set will be merged with

Some notes about the merge command:

• Both the original file and the new file must be sorted by the common variable(s) before merging.
• A variable called _merge is created which indicates the source of each record.
        _merge=1 means it is from the original data set only
        _merge=2 means it is from the new data set only
        _merge=3 means it is from both data sets.
• It is a good idea to run a “tab _merge” command after every merge to check the merger.
• The merge command in Stata is similar to the “match files” command in SPSS.

Some examples:

use members                      opens file “members”
merge hhid perid using educ      merges file “members” with “educ” with hhid and perid as the common variables
use hhchar                       opens file “hhchar”
merge hhid using housing         merges “hhchar” and “housing” using hhid as the common variable

In Example 20, we merge the list of household members (scr01a2) with the education file (scr02a). We open the household member file, rename some variables, delete others, and then sort. Next, we merge the member file and the education file. After renaming and dropping more variables, the des command shows that we have age and sex from the first file and attend from the second.

Example 20. Using merge to calculate school attendance

. use "D:\Vietnam Pov Mapping\Training\SCR01A2.DTA", clear
. rename s1aq02 sex
. rename s1aq06y age
. keep househol idcode sex age
. sort househol idcode
. des

Contains data from D:\Vietnam Pov Mapping\Training\SCR01A2.DTA
  obs:        28,633
 vars:             4                          16 Dec 1999 15:00
 size:       343,596 (99.7% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
househol        long   %12.0g                 HOGIADINH
idcode          byte   %8.0g                  MA HIEU:
sex             byte   %8.0g                  2. Gioi tinh :
age             int    %8.0g                  6. [TEN] bao nhieu..?SO NAM:
-------------------------------------------------------------------------------
Sorted by:  househol
     Note:  dataset has changed since last saved

. merge househol idcode using scr02a
. gen attend=(s2aq03==1 | s2aq03==3)*100
. drop cluster-s2aq22
. des

Contains data from D:\Vietnam Pov Mapping\Training\SCR01A2.DTA
  obs:        28,633
 vars:             6                          16 Dec 1999 15:00
 size:       486,761 (99.5% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
househol        long   %12.0g                 HOGIADINH
idcode          byte   %8.0g                  MA HIEU:
sex             byte   %8.0g                  2. Gioi tinh :
age             int    %8.0g                  6. [TEN] bao nhieu..?SO NAM:
_merge          byte   %8.0g
attend          float  %9.0g
-------------------------------------------------------------------------------
Sorted by:
     Note:  dataset has changed since last saved

. sort age
. graph attend if age<25, bar means ylabel by(age)
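The merge step can be rehearsed on two tiny made-up files before working with the VLSS data. In the sketch below, the file name housing_demo and the variables hhid, value, and exppc are invented for illustration. It uses the older merge syntax taught in this module; recent Stata releases print a note about the old syntax and usually write the same step as merge 1:1 hhid using housing_demo.

```stata
* Toy files, made up for illustration only
clear
input hhid value
101 50
102 80
103 65
end
sort hhid
save housing_demo, replace

clear
input hhid exppc
101 1200
102 3400
104 2100
end
sort hhid

* Old-style merge, as used in this module; both files are sorted by hhid
merge hhid using housing_demo
tab _merge
list
```

Households 101 and 102 appear in both files (_merge=3), household 104 only in the file in memory (_merge=1), and household 103 only in housing_demo (_merge=2), which is exactly the kind of mismatch that “tab _merge” is meant to reveal.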

Example 21. Graph of school attendance by age

The graph in Example 21 shows the percentage attending school for each age from 0 to 24.

append

This command combines two files with different records but the same variables. This command is similar to “join files” in SPSS. The syntax is:

append using filename

where filename is the name of the file to be added to the current data set.

In the VLSS, the append command is useful in analyzing household expenditure and agricultural production. For example, the agricultural production data is found in six files:

scr09b1     rice production
scr09b2     other food production
scr09b3     annual industrial crops
scr09b4     permanent industrial crops
scr09b5     fruit crops
scr09b6     agro-forestry crops

In order to calculate the value of agricultural production, crop sales, or total income, it is necessary to combine these files. Since they have similar variables but refer to different observations (crops), we combine them with append. We will illustrate the method by combining the rice and other food files.

Because the variables are not quite the same, we need to rename the variables before combining the files. This will require more than 10 commands, so it is probably worth creating a Do-file by clicking on Window/Do-file Editor. In the file, we type the following commands:

use "D:\Vietnam Pov Mapping\Training\SCR09B1.DTA", clear
rename s9b1cc crop
rename s9b1q03 area
rename s9b1q04 prod
rename s9b1q061 saleq
replace saleq = saleq*(1/.67) if s9b1q062==2
rename s9b1q071 buyer
keep househol crop area prod saleq buyer
des, short
save riceprod
use "D:\Vietnam Pov Mapping\Training\SCR09B2.DTA", clear
rename s9b2cc crop
rename s9b2q03 area
rename s9b2q04 prod
rename s9b2q06 saleq
rename s9b2q071 buyer
keep househol crop area prod saleq buyer
des, short
save foodprod
use riceprod, clear
append using foodprod
save allfood
des, short
table crop, c(mean area mean prod mean saleq) format(%6.0f)

In Example 22 are selected results from the Stata Results window. The rice file (after modification) contained 8,720 records and 6 variables. The other food file (after modification) contained 10,541 records and 6 variables. The combined file has 19,261 records (8720+10541) and 6 variables.

Example 22. Using append to combine files

Contains data from D:\Vietnam Pov Mapping\Training\SCR09B1.DTA
  obs:         8,720
 vars:             6                          11 Jul 1999 16:13
 size:       261,600 (99.7% of memory free)
Sorted by:  househol crop
     Note:  dataset has changed since last saved

Contains data from D:\Vietnam Pov Mapping\Training\SCR09B2.DTA
  obs:        10,541
 vars:             6                          11 Jul 1999 15:45
 size:       316,230 (99.7% of memory free)
Sorted by:  househol
     Note:  dataset has changed since last saved

Contains data from allfood.dta
  obs:        19,261
 vars:             6                          4 Aug 2002 02:53
 size:       654,874 (99.4% of memory free)
Sorted by:
     Note:  dataset has changed since last saved
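A scaled-down version of the same steps may help before trying the full do-file. The sketch below builds two made-up crop files (the file names rice_demo and food_demo and all values are invented) that already share the same variable names, then stacks them with append.

```stata
* Toy files, made up for illustration only
clear
input hhid crop area
101 1 0.5
102 1 0.3
end
save rice_demo, replace

clear
input hhid crop area
101 16 0.1
103 12 0.2
103 19 0.4
end
save food_demo, replace

* Stack the second file underneath the first
use rice_demo, clear
append using food_demo
count
list
```

The combined file should have 5 records (2+3). If the variable names had differed between the two files, append would have created separate variables with missing values, which is why the do-file above renames the variables first.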

fillin

This command inserts additional records into a file so that all combinations of two or more variables are in the file. The syntax is easy:

fillin varlist

where varlist is the list of variables, every combination of which we want to exist in the file.

Again, it is easier to give an example than to describe it. Suppose we are working with crop production data. Data are collected on 5 crops, but most households only grow 2-4 of them. The records exist only for crops grown by the household, as shown below:

File in original form

hhid  crop  area  prod
  1     1     3     3
  1     3     1     1
  1     5     1     1
  2     1     1     1
  2     5     1     1
  3     1     2     2
  3     4     1     1
  3     5     4     4

If we calculate the average area for each crop, it will give the average area among those growing the crop. If we want the average area including the non-growers, it is not easy to calculate. Stata allows you to fill in the “missing” records of crops not grown by each household. Using our example above, the command would be

fillin hhid crop

Stata will look for all the values of hhid and all the values of crop in the file, then it will make sure every hhid-crop combination has a record. When it has to insert a record, the values of the other variables will be missing. Because fillin uses only the values that actually appear in the file, records for crop 2 are created only if at least one household in the full file grows crop 2. The new file would look like this:

File after fillin command

hhid  crop  area  prod
  1     1     3     3
  1     2     .     .
  1     3     1     1
  1     4     .     .
  1     5     1     1
  2     1     1     1
  2     2     .     .
  2     3     .     .
  2     4     .     .
  2     5     1     1
  3     1     2     2
  3     2     .     .
  3     3     .     .
  3     4     1     1
  3     5     4     4
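The effect of fillin is easy to check on a very small made-up file. The sketch below (two hypothetical households and three crop codes, all values invented) lists the file after the command has run.

```stata
* Toy crop data, made up for illustration only
clear
input hhid crop area
1 1 3
1 3 1
2 1 1
2 2 2
end

* Create a record for every hhid-crop combination found in the file
fillin hhid crop
sort hhid crop
list
```

The file should grow from 4 to 6 records (2 households x 3 crops). Stata also adds a variable called _fillin, equal to 1 for the records it inserted and 0 for the original ones, which makes it easy to see what was added.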

If we calculate the average area and production on this file, we will get the same answer as above. Because missing values are not counted, the result will be the average among growers. But if we replace the missing values with zeros:

recode area .=0
recode prod .=0

then the averages will include the zeroes. This is an extremely useful command, particularly for dealing with crop data and expenditure data. SPSS does not have a similar command.

reshape

The command changes a file from tall to wide or from wide to tall. What do we mean by “wide” and “tall”? A wide file stores additional information as separate variables, while a tall file stores this information using additional records. An example will be easier to understand. Suppose a household credit survey asks about the amount and source of the three most recent loans. One way to store this data is with a wide file, in which additional loans are stored in additional variables.

File in “wide” format

hhid  amount1  source1  amount2  source2  amount3  source3
  1
  2
  3
  4
  5

The other way to store the data is with a tall file, in which additional loans are stored as additional records.

File in “tall” format

hhid  loannbr  amount  source
  1      1
  1      2
  1      3
  2      1
  2      2
  2      3
  3      1
  3      2
  3      3
  4      1
  4      2
  4      3
  5      1
  5      2
  5      3

Notice that both files have the same number of data points (30) for loan amount and source of the loan; they are just arranged differently. The reshape command allows you to convert one type of file into the other. For information on how to implement reshape, type “help reshape” in the Stata Command window.
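As a rough illustration of the loan example, the sketch below builds a small wide file with made-up values for two households, converts it to tall format, and then converts it back. The variable names follow the tables above; the numbers are invented.

```stata
* Toy loan data in wide format, made up for illustration only
clear
input hhid amount1 source1 amount2 source2 amount3 source3
1 100 1 200 2  50 1
2 300 2 150 1  75 3
end

* Wide to tall: one record per household and loan number
reshape long amount source, i(hhid) j(loannbr)
list

* Tall back to wide: one record per household
reshape wide amount source, i(hhid) j(loannbr)
list
```

In the tall file there should be 6 records (2 households x 3 loans) with the variables hhid, loannbr, amount, and source. The i() option names the variable that identifies a record in the wide file, and j() names the variable that numbers the repeated items.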

SECTION 9: REGRESSION ANALYSIS

This section describes the use of Stata to do regression analysis. Regression analysis involves estimating an equation that best describes the data. One variable is considered the dependent variable, while the others are considered independent (or explanatory) variables. Stata is capable of many types of regression analysis and associated statistical tests. In this section, we touch on only a few of the more common commands and procedures. The commands described in this section are:

regress
test, testparm
predict
probit
ovtest
hettest

regress

This command carries out a regression analysis on the variables specified. The syntax is:

regress depvar varlist [if exp] [in range] [, options]

where
depvar      is the dependent variable
varlist     is the list of independent variables

The regress command has many options for specifying the type and format of the output. Type “help regress” for more information. Some examples of the command:

. regress y x1 x2 x3 x4 x5                   regress y with x’s as independent variables
. regress y x1 x2 x3 x4 x5 if region==1      same regression but only in one region
. by region: regress y x1 x2 region*         region* means all variables starting with region

predict

This command can be used to obtain predictions, residuals, etc. after regression analysis. The syntax is:

predict newvarname [if exp] [in range] [, options]

Two of the most common options are:

xb          predicted values of y are put in newvarname
resid       residuals of the regression are put in newvarname

For example:

. regress y x1 x2 x3
. predict yhat, xb                   creates variable yhat with predicted values
. predict e, resid                   creates variable e with residuals
. probit poverty age sex housing
. predict index, xb                  creates variable index with the value of sum of XB
. predict phat                       creates variable phat with the predicted probability

Example 23 presents the results of a regression analysis of the determinants of rice expenditure. The results indicate that rice expenditure is greater in larger households headed by older males.

Example 23. Using regress to examine determinants of rice expenditure

. use hhexp98n, clear
. tab reg8, gen(region)

  Code by 8 |
    regions |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |       1175       19.59       19.59
          2 |        731       12.19       31.77
          3 |        128        2.13       33.91
          4 |        708       11.80       45.71
          5 |        628       10.47       56.18
          6 |        276        4.60       60.78
          7 |       1241       20.69       81.46
          8 |       1112       18.54      100.00
------------+-----------------------------------
      Total |       5999      100.00

. gen age2 = age^2
. regress ricexpd hhsize age age2 sex rlpcex2 educyr98 urban98 region*

      Source |       SS       df       MS              Number of obs =    5999
-------------+------------------------------           F( 14,  5984) =  652.17
       Model |  4.7119e+09    14   336562752           Prob > F      =  0.0000
    Residual |  3.0881e+09  5984  516065.748           R-squared     =  0.6041
-------------+------------------------------           Adj R-squared =  0.6032
       Total |  7.8000e+09  5998  1300436.14           Root MSE      =  718.38

------------------------------------------------------------------------------
     ricexpd |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
      hhsize |   414.4978   5.358447    77.36   0.000    403.9933    425.0023
         age |   51.19595    4.82956    10.60   0.000    41.72827    60.66363
        age2 |  -.4741431   .0472076   -10.04   0.000   -.5666869   -.3815993
         sex |  -109.7768   22.93277    -4.79   0.000   -154.7333   -64.82034
     rlpcex2 |   .0187889   .0043095     4.36   0.000    .0103408     .027237
    educyr98 |  -6.659207   2.670341    -2.49   0.013   -11.89404   -1.424376
     urban98 |  -453.6721    24.4138   -18.58   0.000    -501.532   -405.8123
     region1 |   36.34716   49.75018     0.73   0.465   -61.18113    133.8754
     region2 |   312.9143   51.49933     6.08   0.000     211.957    413.8715
     region3 |   443.1319   77.56194     5.71   0.000    291.0825    595.1813
     region4 |  -21.75859   51.96463    -0.42   0.675    -123.628    80.11081
     region5 |  -147.9203   52.73708    -2.80   0.005    -251.304   -44.53664
     region6 |  (dropped)
     region7 |   37.15175   49.34054     0.75   0.451   -59.57349     133.877
     region8 |   55.16549   48.84751     1.13   0.259   -40.59324    150.9242
       _cons |   -777.223     123.29    -6.30   0.000   -1018.916     -535.53
------------------------------------------------------------------------------

Rice expenditure is positively related to per capita expenditure (though interestingly, the coefficient was negative if you exclude the urban dummy variable). Urban households consume significantly less rice than rural households, even after controlling for other factors. Note that Stata automatically dropped one of the regional dummy variables to avoid perfect multicollinearity. Compared to the Central Highlands (region6), households in the Northeast and Red River Delta spend more on rice.
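The regress and predict steps can be tried on a small made-up data set before working with hhexp98n. In the sketch below, the variables y, x1, and x2 and all values are invented; the last lines check that the residual from predict equals the actual value minus the predicted value.

```stata
* Toy data, made up for illustration only
clear
input y x1 x2
10 1 3
12 2 1
15 3 4
19 4 2
20 5 5
26 6 3
27 7 6
31 8 4
end

regress y x1 x2

* Predicted values and residuals from the regression just estimated
predict yhat, xb
predict e, resid

* The residual should equal y minus the predicted value
gen check = y - yhat
list y yhat e check
```

Note that predict always refers to the most recently estimated model, so it should be run before any other estimation command.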

test

This command tests linear hypotheses about the estimated parameters from the most recently estimated model. Examples include:

. test region1=region2                       test hypothesis that region1 coef = region2 coef
. test educ=.1                               test hypothesis that educ = 0.1
. test region4 = (region1+region2)/2         test hypothesis given by equation
. test region1 region2 region3 region4       test of hypothesis that four region dummies are zero

If you want to test the hypothesis that a set of related variables are all equal to zero, you can use the related testparm command:

. regress y age female educ region1 region2 region3 region4
. testparm region*                           test of hypothesis that all region* dummies are zero

probit

This command carries out a probit regression analysis of the specified variables. The syntax is:

probit depvar indepvars [if exp] [in range] [, options]

Probit analysis is used when the dependent variable is a categorical variable with only two values. An alternative is the dprobit command, which reports the derivative of the probability with respect to each independent variable instead of the coefficient. For example,

. probit y x1 x2 x3                          run a probit with y as dependent and x’s as independent
. probit y x1 x2 x3, robust                  run a “robust” probit (weaker assumptions about error)
. dprobit y x1 x2 x3 if reg8==1              run the probit in one region only

ovtest

Regression analysis generates the best linear unbiased estimates of the “true” coefficients provided that some assumptions are satisfied. One assumption is that there are no missing variables that are correlated with the error term. This command performs a Ramsey RESET to test for omitted variables (misspecification). The syntax is:

ovtest [, rhs]

This test amounts to estimating y = xb+zt+u and then testing t=0. If the rhs option is not specified, powers of the fitted values are used for z. Otherwise, the powers of the independent variables are used. Examples of the test are:

. regress y x1 x2 x3
. ovtest                                     tests significance of powers of predicted y
. ovtest, rhs                                tests significance of powers of x1, x2, and x3

hettest

Another assumption behind regression analysis is that the variance of the error term is constant across the sample. When this assumption is violated, the problem is called heteroskedasticity. This command tests for heteroskedasticity. The syntax is:

hettest [varlist]

This command tests t=0 in Var(e) = s^2 exp(zt). If varlist is not specified, the fitted values are used for z. If varlist is specified, the variables specified are used for z. This test is also known as the Breusch-Pagan test for heteroskedasticity. Examples are:

. regress y x1 x2 x3
. hettest                                    test whether variance related to predicted y
. hettest x3                                 test whether variance related to x3

Example 24 gives the result of some tests related to the regression analysis shown earlier. The test command tests the hypothesis that both age variables are zero. The hypothesis is rejected. This is not surprising since each is statistically significant on its own. The testparm command tests the hypothesis that all the region coefficients are equal to zero (that region does not influence rice expenditure), finding that the probability is very low (reported as 0.0000), so we can reject this hypothesis, meaning that the regional coefficients are jointly significant. The ovtest rejects the hypothesis that there are no omitted variables, indicating that we need to improve the specification (prices would be a good start). And finally, hettest indicates that there is heteroskedasticity which needs to be dealt with.

Example 24. Regression tests

. test age age2

 ( 1)  age = 0.0
 ( 2)  age2 = 0.0

       F(  2,  5984) =   59.xx
            Prob > F =    0.0000

. testparm region*

 ( 1)  region1 = 0.0
 ( 2)  region2 = 0.0
 ( 3)  region3 = 0.0
 ( 4)  region4 = 0.0
 ( 5)  region5 = 0.0
 ( 6)  region6 = 0.0
 ( 7)  region7 = 0.0
 ( 8)  region8 = 0.0
       Constraint 6 dropped

       F(  7,  5984) =   26.xx
            Prob > F =    0.0000

. ovtest

Ramsey RESET test using powers of the fitted values of ricexpd
       Ho:  model has no omitted variables
                 F(3, 5981) =      3.18
                  Prob > F =      0.0230

. hettest

Cook-Weisberg test for heteroskedasticity using fitted values of ricexpd
       Ho: Constant variance
         chi2(1)      =   1473.93
         Prob > chi2  =      0.0000
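The test and testparm commands can be rehearsed on a small made-up data set. In the sketch below, the variables y, x, reg1, and reg2 and all values are invented; reg1 and reg2 are dummy variables for two of three hypothetical regions, with the third region as the omitted base.

```stata
* Toy data, made up for illustration only
clear
input y x reg1 reg2
12 1 1 0
15 2 1 0
14 3 1 0
18 4 1 0
20 1 0 1
24 2 0 1
23 3 0 1
27 4 0 1
 9 1 0 0
11 2 0 0
12 3 0 0
15 4 0 0
end

regress y x reg1 reg2

* Is the coefficient on reg1 equal to the coefficient on reg2?
test reg1 = reg2

* Are the two region dummies jointly zero?
test reg1 reg2

* Same joint test, selecting the dummies by name pattern
testparm reg*
```

The last two commands test the same hypothesis, so they should report the same F statistic; testparm is simply more convenient when there are many dummies with a common prefix.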

SECTION 10: INTRODUCTION TO PROGRAMMING WITH STATA

This section provides a very quick introduction to the topic of programming with Stata. To fully describe Stata programming would require more space than is available here. Furthermore, I do not (yet) know enough about it to teach it. We touch on three topics:

• creating and using macros
• creating and using loops
• matrix algebra

The purpose here is not to provide a comprehensive description of how to program with Stata, but rather to give you an idea of the kinds of things that can be done with Stata.

macros

Macros assign a set of words or a number to a name. There are two types of macros.

• “Global” macros stay in memory until you leave Stata
• “Local” macros exist only within a program or a loop

The syntax is relatively simple:

global gmname = "expression"
local lmname = "expression"

To use these macros later, you must use special symbols to tell Stata they are macros:

$gmname
`lmname'

One use of the global macro is to store the name of the folder with the data.

global path = "d:\data\vlss\1998\household"
use "$path\scr09b2.dta"

In addition to saving you some time, this macro is useful if you share the program with others who have different names for the folders on their computer. By using the macro, your colleague can change the global command once rather than trying to change the path in every command that opens a file or saves a file. Local macros are used (among other places) in loops with the while command, so we will discuss them in the next section.

while

This command starts a loop, allowing groups of Stata commands to be repeated until some condition is met. The syntax is:

while exp {
        commands
}

where
exp         is an expression. Stata repeats the commands as long as the expression is true.
commands    are any Stata commands that you want to repeat
brackets    define the beginning and the ending of the commands to be repeated

This is an example of a loop that uses local macros to carry out a regression analysis of the determinants of housing value for each region:

tab reg8, gen(region)
local r = 1
while `r' <= 8 {
        regress housval roof floor wall room area water if region`r' == 1
        local r = `r' + 1
}

The tab command creates a dummy variable for each region (region1, region2, etc). The first local command creates a macro called “r” that is equal to 1. The while statement says that the commands in brackets will be repeated until the condition r<=8 is no longer true. On each loop, the regress command is carried out in one region (when r=3, the if statement is “if region3==1”). The second local command increases the value of r each time that the loop is completed. When r reaches 9, the loop stops because the while condition is no longer true. Then Stata goes on to the next command after the bracket.

matrix

Stata has a special set of commands for matrix algebra. These can be used to implement custom econometric procedures or for doing calculations on the output of regression analysis. This is a very short summary of a very long list of complex commands (type “help matrix” for more information).

1. Creating matrices by hand

Examples:

matrix mymat = (1,2\3,4)             commas separate elements, backslash indicates new row
matrix myvec = (1,5,3,1,3)           creates a row vector
matrix mycol = (1\5\3\1\3)           creates a column vector

2. Setting the maximum matrix size

For regular Stata, the default maximum matrix size is 40x40, but this can be increased up to 800x800 with the matsize command. For Stata SE, the default maximum is 400x400, but this can be increased up to 11,000x11,000. The maximum matrix size can be changed using

set matsize 500                      sets the maximum size for a matrix at 500x500

3. Manipulating matrices

Examples:

matrix D = B                         makes matrix D equal to matrix B
matrix beta = syminv(X'*X)*X'*y      calculates beta using regression equation
matrix C = (C+C')/2                  redefines C matrix in terms of old values
matrix sub = A[1..., 2..5]/2         defines matrix using sub-set of A matrix
matrix A[2,2] = B                    redefines subset of A matrix as equal to B

4. Matrix utilities

matrix dir                           lists the currently defined matrices
matrix list                          displays the contents of a matrix
matrix rename                        renames a matrix
matrix drop                          deletes a matrix

5. Using matrices created by Stata

Some Stata commands create matrices which can be retrieved and used. For example, all the regression commands create the following:

e(b)        coefficient vector
e(V)        variance-covariance matrix of the estimates

And these matrices can be used as follows:

matrix beta = e(b)                   creates a vector called beta with the estimated coefficients
matrix cov = e(V)                    creates a matrix called cov with the estimated covariances

6. Accumulating cross-product matrices

Most statistical computations involve matrix operations such as X'X or X'WX. In many cases, X may have a very large number of rows and a small number of columns. Stata has a special command for calculating cross-products in these cases. Type “help matacum” for more information.

7. Converting variables into matrices and vice versa

Variables can be converted into matrices and likewise matrices can be converted into variables. Type “help mkmat” for more information.
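The three topics of this section (macros, loops, and matrices) can be combined in one short do-file. The sketch below is only an illustration under stated assumptions: the data are made up, and the names y, x, region, xvars, b1, b2, A, and B are hypothetical. It is meant to be run as a do-file so that the local macro r stays defined throughout the loop.

```stata
* Toy data, made up for illustration only
clear
input y x region
10 1 1
13 2 1
15 3 1
20 4 1
 8 1 2
 9 2 2
13 3 2
14 4 2
end

* A global macro holding the list of independent variables
global xvars "x"

* Loop over the two regions, using the local macro r as a counter
local r = 1
while `r' <= 2 {
        regress y $xvars if region == `r'
        * Save the coefficient vector from this regression
        matrix b`r' = e(b)
        local r = `r' + 1
}

* Inspect the matrices created in the loop
matrix dir
matrix list b1
matrix list b2

* A matrix created by hand and a simple cross-product
matrix A = (1,2\3,4)
matrix B = A'*A
matrix list B
```

Each regression leaves its coefficients in e(b), which is overwritten by the next estimation command, so the loop copies it into b1 and b2 before moving on. The last matrix, B, should be the symmetric matrix (10,14\14,20).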
