
Department of Public Health and Clinical Medicine

Institute of Medicine, Sahlgrenska Academy

University of Gothenburg

Stata for Data Management and Analysis


Stata Exercise Modules

Professor Nawi Ng

Updated on 2 Oct 2019

Course: MPH/MGH 301

Epidemiology and Biostatistics with special reference to Social Epidemiology


Introduction to the Datasets
Download the SAGE zip file and extract it in your root directory. The SAGE folder contains two main
sub-folders, Datasets and Instruments. Within the Datasets folder, the Source Datasets folder holds the
source datasets, and the Working Datasets folder holds the working Stata data files together with the
Stata syntax files and other Stata outputs. The Instruments folder contains the SAGE individual-level
and household-level questionnaires.

In this Stata training, we will use two datasets from the World Health Organization’s Study on global
AGEing and adult health (WHO-SAGE) in Mexico. In brief, WHO SAGE is a multicountry
longitudinal study of adults aged 50 years and older from nationally representative samples in China,
Ghana, India, Mexico, the Russian Federation and South Africa. The baseline data collection (SAGE
Wave 1) was conducted during 2007-2010, the 2nd Wave during 2014-2015, and the 3rd Wave during
2018-2019. In some of the countries, the SAGE baseline was conducted among the participants in the
WHO World Health Survey (WHS) in 2002-2004, hence the WHS data are considered as Wave 0. This is
the case in Mexico.

The first dataset, labelled as “SAGE Mexico cross-sectional data V12.dta”, contains SAGE Wave 1 data
from 3265 older individuals in Mexico. The second dataset, labelled as “SAGE Mexico longitudinal data
long format V12.dta”, contains data from 1359 individuals who participated in Wave 0, Wave 1 and
Wave 2 of SAGE in Mexico. Refer to the SAGE questionnaires to review how each question was
formulated.

Please note that these datasets are abridged and should be used for training purposes only. The datasets
are not suitable for scientific research, as they do not contain the full data from all the SAGE participants.

Please visit the WHO SAGE website for further information about SAGE and to apply for access to the
full dataset for research (https://www.who.int/healthinfo/sage/en/)

Figure 1. The WHO Study on global AGeing and adult health (SAGE)
Module 1: Using Stata for Data Analysis
Download the Stata Version 16 software from the student portal (https://studentportal.gu.se/english/). The
software is available under the menu Services & Tools. When you run Stata for the first time, remember
to update the software. Choose Help and select Check for updates.

1.1. The Stata environment


The different Stata windows can be accessed through the menu Window at the top of the screen.

Stata users are strongly encouraged to learn how to conduct data management and statistical
procedures using syntax (commands). Commands can be typed in the COMMAND window, or typed and
saved in the DO-FILE EDITOR window (the extension of a Stata do-file is .do). Once a data file (the
extension of a Stata data file is .dta) is opened, its variable list is shown in the VARIABLES window,
and the whole dataset can be browsed or edited in the DATA EDITOR window. The RESULTS window
shows all the commands that were run and their results. Stata is equipped with complete help
references, which can be accessed by typing the command help. Once a help command is typed, the
VIEWER window pops up with detailed information on the commands available in Stata. Graphs in
Stata are shown in the GRAPH window.

Figure 2. Stata window menu

Figure 3. The Stata working environment

Module 1 - Page 1 of 14
1.2. The grammatical rules of Stata syntax

Figure 4. The grammatical rules of Stata syntax

Stata variables and commands are case-sensitive. The variables Sex and sex are two distinct variables.
The recommendation is to use lower-case variable names. Stata commands are followed by qualifiers
(such as a variable list) and options; a comma (,) separates the options from the rest of the command.
A command prefix precedes the command, and a colon (:) is placed between the command prefix and
the command that follows it.

tabulate smoker sex, column

• tabulate is a Stata command that produces tables of frequencies.
• smoker and sex are qualifiers in the Stata syntax, here the names of the variables to tabulate.
• column is an option in the Stata syntax. Options are placed after a comma and differ between
commands.
• Type help tabulate to see the other options available for the tabulate command.
• The syntax above asks Stata to cross-tabulate the variable smoker by the variable sex and to add
column percentages.

bysort sex: tabulate smoker

• bysort sex: is a command prefix, here applied to the command tabulate. It sorts the data by sex and
repeats the command for each category of sex. A colon must be placed after the command prefix.
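
As a minimal sketch of how these pieces combine, the lines below use auto.dta, an example dataset
shipped with Stata (not part of the SAGE material), so they can be run before any SAGE file is loaded.
The variables foreign, price and mpg belong to that example dataset.

* load the example dataset shipped with Stata
sysuse auto, clear
* prefix (bysort foreign:), command (summarize), varlist (price), if qualifier, option (detail)
bysort foreign: summarize price if mpg > 20, detail

The same pattern of prefix, command, variable list, qualifier and option applies to most Stata commands.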

Use the command help to view more detailed information on Stata commands.

Figure 5. The viewer window of help tabulate

1.3. Data types in Stata
The command help data types provides a quick reference for different data types allowed in Stata.

Figure 6. The data types in Stata
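
A short sketch of how storage types appear in practice, assuming the example dataset auto.dta shipped
with Stata. The variable name expensive and the cut-off of 10000 are illustrative choices only.

sysuse auto, clear
* store a 0/1 indicator in the smallest integer type
generate byte expensive = price > 10000 if !missing(price)
* show the storage type of each variable
describe price mpg make expensive
* let Stata switch each variable to the smallest type that holds its values without loss
compress

compress only changes how the values are stored, never the values themselves.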

1.4. Operators in Stata


The command help operators provides a quick reference for different operators (arithmetic, logical,
numeric and string operators) that can be used to write expressions in Stata.

Figure 7. Operators in Stata
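
The operators can be tried directly with the command display. The expressions below are illustrative
only and do not require a dataset.

* arithmetic: division and exponentiation
display 7/2
display 2^3 + 1
* relational and logical: a true expression evaluates to 1, a false one to 0
display (5 > 3) & (2 == 2)
display !(5 > 3)
* string: + joins two strings
display "Stata" + " " + "16"

Note that == tests for equality, whereas a single = assigns a value.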

1.5. Setting the working directories
Use the command global to store the path of a working directory in a global macro. This is more
efficient, because the path where a dataset is located need not be typed repeatedly.

The following command stores the directory where the original datasets are located in a global macro
named sourcedir. The path between the quotation marks indicates where the original SAGE datasets are
saved.

global sourcedir "/Users/nawing/Downloads/SAGE/Datasets/Source Datasets/"

The following command stores the directory where the working datasets will be saved. We name it
workdir.

global workdir "/Users/nawing/Downloads/SAGE/Datasets/Working Datasets/"
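
To check that the macros hold the intended paths, their contents can be displayed. This is a small
sketch that assumes both global commands above have been run and that the folders exist on your
computer; cd is optional and changes Stata's current working directory.

* show the stored paths
display "${sourcedir}"
display "${workdir}"
* optionally make the working folder Stata's current directory and confirm it
cd "${workdir}"
pwd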

1.6. Open and save Stata datasets


The commands use and save are used to open and save a Stata dataset, respectively.

1.6.1. To open the SAGE datafile in the Source datasets folder, use the menu File/Open.

Find the data file and click Open. Typing the following command in the Command window triggers
the same action. Including the option clear at the end of the command discards any data currently in
memory, so that the dataset is loaded into a clean memory.

Syntax to load Stata-format dataset:

use filename [, clear nolabel]

Syntax to load subset of Stata-format dataset:

use [varlist] [if] [in] using filename [, clear nolabel]

use "/Users/nawing/Downloads/SAGE/Datasets/Source Datasets/SAGE Mexico cross-sectional data V12.dta", clear

To open the dataset which is located in the sourcedir, type the following command.

use "${sourcedir}SAGE Mexico cross-sectional data V12.dta", clear

It is possible to open a subset of a dataset by specifying a varlist (to open selected variables) and the
qualifier in (to open selected observations). For example, to load only the variables q0002, quintile_c,
Ethnicity and Religion and the first 100 observations from the SAGE dataset, type the following
command in the Command window. Following the syntax shown above, the in qualifier is placed
before using.

use q0002 quintile_c Ethnicity Religion in 1/100 using "${sourcedir}SAGE Mexico cross-sectional data V12.dta", clear
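
The if qualifier works in the same way. The sketch below loads three variables for participants aged 60
years or older; the age cut-off is an arbitrary illustration. Because Stata treats a missing numeric value
as larger than any number, the condition also excludes missing ages explicitly.

use q0002 q1009 q1011 if q1011 >= 60 & !missing(q1011) using "${sourcedir}SAGE Mexico cross-sectional data V12.dta", clear
* confirm how many observations were loaded
count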

1.6.2. To save the SAGE datafile in the Working datasets folder, use the menu File/Save as.

Locate the folder, name the file and click on Save. Typing the following command in the Command
window triggers the same action. Including the option replace at the end of the command overwrites
an existing file of the same name with the dataset currently in memory.

save "${workdir}SAGE Mexico analytical dataset.dta", replace

1.7. Importing and exporting datasets in Stata


1.7.1. To import a SPSS dataset into Stata, use the command import spss.
. import spss using "${sourcedir}personality.sav", clear
(44 vars, 459 obs)

. save "${workdir}personality.dta", replace
(note: file /Users/nawing/Downloads/SAGE/Datasets/Working Datasets/personality.d
> ta not found)
file /Users/nawing/Downloads/SAGE/Datasets/Working Datasets/personality.dta save
> d

1.7.2. To export Stata data to Excel, use the command export excel.

Use the option firstrow(variables) to write the variable names in the first row of the Excel file.
. export excel using "${sourcedir}personality.xls", firstrow(variables) replace
file /Users/nawing/Downloads/SAGE/Datasets/Source Datasets/personality.xls saved

1.7.3. To import an Excel dataset into Stata, use the command import excel.

Use the option firstrow to indicate that the first row of the Excel sheet contains the variable names.
. import excel using "${sourcedir}personality.xls", firstrow clear
(44 vars, 459 obs)
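
Before importing, it can help to look at what the Excel file contains. A minimal sketch, assuming the
file personality.xls created in 1.7.2 is still in the source folder:

* list the worksheets and cell ranges in the workbook without loading any data
import excel using "${sourcedir}personality.xls", describe
* import the file, then show a brief summary of the data now in memory
import excel using "${sourcedir}personality.xls", firstrow clear
describe, short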

1.8. Exploring the Stata dataset


1.8.1. To record your commands and results during a Stata session, use the command log using.

The log file can be saved in either smcl or log (plain text) format. A smcl log file keeps the same
formatting as the Stata Results window, and its tables can be copied to Excel easily. A log-format file
can be opened in any word processor or text editor, but its results are not easy to copy to Excel.
. log using "${workdir}analysis results 2019-10-03.log", replace
--------------------------------------------------------------------------------
name: <unnamed>
log: /Users/nawing/Downloads/SAGE/Datasets/Working Datasets/analysis res
> ults 2019-10-03.log
log type: text
opened on: 2 Oct 2019, 1s:54:38
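
A log can also be written in smcl format and converted to plain text afterwards. The sketch below is an
illustration only: the log name demo and the file name demo log are hypothetical, and a named log is
used so that it does not interfere with the log opened above.

log using "${workdir}demo log.smcl", name(demo) replace
describe
log close demo
* convert the smcl log into a plain-text log
translate "${workdir}demo log.smcl" "${workdir}demo log.log", replace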

1.8.2. To become familiar with the dataset structure, use the command describe.

describe produces a summary of the dataset in memory or of the data stored in a Stata-format dataset.

Describe data in memory: describe [varlist] [, memory_options]

Describe data in file: describe [varlist] using filename [, file_options]

Using the drop-down menu: Select Data / Describe data / Describe data in memory. Leave the variables
box empty to display all variables. Selected variables can be viewed using the * (asterisk) and - (hyphen)
symbols. Using q20* will show all variables that start with q20, and q2001-q2049 will display all
variables from q2001 to q2049, in the order in which they are stored in the dataset.
. describe

Contains data from /Users/nawing/Downloads/SAGE/Datasets/Working Datasets/SAGE M
> exico analytical dataset.dta
obs: 3,265 SAGE Mexico Household Data 2010
vars: 115 2 Oct 2019 22:54
--------------------------------------------------------------------------------
storage display value
variable name type format label variable label
--------------------------------------------------------------------------------
q0002 long %10.0g q0002: household id number
q0104 double %10.0g q0104 q0104: urban/rural
q0401 double %10.0g q0401 q0401: total number of persons in
hh
hhweight float %9.0g post-stratified household weight
id str10 %10s individual case id 1st wave
q0101b long %12.0g q0101b: psu code
q1000cactual byte %10.0g q1000cactual
actual result code
q1009 byte %10.0g q1009 q1009: sex
q1011 int %10.0g q1011 q1011: age in yyys
q1012 byte %10.0g q1012 q1012: marital status
q1015 byte %10.0g q1015 q1015: ever schooled
q1016 byte %10.0g q1016 q1016: highest level

/part of the table is removed for brevity/

--------------------------------------------------------------------------------

. describe q0002 q0104 hhweight id q1011 q4001-q4060

storage display value
variable name type format label variable label
--------------------------------------------------------------------------------
q0002 long %10.0g q0002: household id number
q0104 double %10.0g q0104 q0104: urban/rural
hhweight float %9.0g post-stratified household weight
id str10 %10s individual case id 1st wave
q1011 int %10.0g q1011 q1011: age in yyys
q4001 byte %10.0g q4001 q4001: arthritis
q4010 byte %10.0g q4010 q4010: stroke
q4014 byte %10.0g q4014 q4014: angina
q4022 byte %10.0g q4022 q4022: diabetes
q4025 byte %10.0g q4025 q4025: chronic lung disease
q4033 byte %10.0g q4033 q4033: asthma
q4040 byte %10.0g q4040 q4040: depression diagnosis
q4060 byte %10.0g q4060 q4060: hypertension

TASK: Review section 1.3 on the different data types in Stata and discuss the appropriateness of the
choice of storage type in the dataset.

1.8.3. To become familiar with the contents of the dataset, use the command codebook.

codebook examines the variable names, labels, and data to produce a codebook describing the dataset.
codebook [varlist] [if] [in] [, options]

Using the drop-down menu: Select Data / Describe data / Describe data contents (codebook). Leave the
variables box empty to display all variables. Selected variables can be viewed by using the * (asterisk)
and - (hyphen) symbols.
. codebook q0002 id q4001 q2506

--------------------------------------------------------------------------------
q0002 q0002: household id number
--------------------------------------------------------------------------------
type: numeric (long)
range: [10002,321198] units: 1
unique values: 2,993 missing .: 0/3,265

mean: 159135
std. dev: 80233.1

percentiles: 10% 25% 50% 75% 90%
50551 100603 150577 220458 261140

--------------------------------------------------------------------------------
id individual case id 1st wave
--------------------------------------------------------------------------------
type: string (str10)
unique values: 3,265 missing "": 0/3,265
examples: "1409085201"
"1413056503"
"1418080701"
"1424083102"

--------------------------------------------------------------------------------
q4001 q4001: arthritis
--------------------------------------------------------------------------------
type: numeric (byte)
label: q4001, but label does not exist
range: [1,2] units: 1
unique values: 2 missing .: 1,668/3,265

tabulation: Freq. Value
173 1
1,424 2
1,668 .

--------------------------------------------------------------------------------
q2506 q2506: true height (cm)
--------------------------------------------------------------------------------
type: numeric (double)
label: q2506, but label does not exist
range: [128.8,998] units: .1
unique values: 384 missing .: 1,743/3,265

mean: 173.492
std. dev: 121.226

percentiles: 10% 25% 50% 75% 90%
143.9 149.1 155.2 163.2 169.8
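
For a more condensed overview, codebook has a compact option. The second command is a sketch of
how implausible values might be counted; the cut-off of 250 cm is an arbitrary assumption, not a rule
from the SAGE documentation.

* one line per variable: observations, unique values, mean, min, max and label
codebook q2506 q4001-q4060, compact
* count non-missing heights above the assumed cut-off
count if q2506 > 250 & !missing(q2506)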

1.8.4. To obtain summary statistics for the variables in the dataset loaded in memory, use the command summarize.

summarize calculates and displays a variety of univariate summary statistics. If no varlist is specified,
summary statistics are calculated for all the variables in the dataset.

summarize [varlist] [if] [in] [weight] [, options]


. summarize q0002 q0104 hhweight id q1011 q4001-q4060

Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
q0002 | 3,265 159135 80233.13 10002 321198
q0104 | 2,280 1.238596 .426319 1 2
hhweight | 3,255 5.142753 11.36429 0 162.3806
id | 0
q1011 | 1,670 63.49521 14.74605 -8 105
-------------+---------------------------------------------------------
q4001 | 1,597 1.891672 .3108917 1 2
q4010 | 1,597 1.96243 .1902145 1 2
q4014 | 1,597 1.981841 .1335682 1 2
q4022 | 1,597 1.809643 .3927057 1 2
q4025 | 1,597 1.946775 .2245518 1 2
-------------+---------------------------------------------------------
q4033 | 1,597 1.968691 .1742053 1 2
q4040 | 1,597 1.881027 .3238582 1 2
q4060 | 1,597 1.664371 .4723579 1 2

TASK: Review the results above. Do you see any problems with some of the variables?
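
The option detail adds percentiles and other statistics, and the results of the most recent command
can be retrieved afterwards. A brief sketch using the age variable:

summarize q1011, detail
* list the results that summarize stored
return list
* reuse a stored result, here the median
display "Median age: " r(p50)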

1.8.5. To recode values of a variable, use the command recode.

The variables q4001-q4060 are dichotomous variables denoting the presence of chronic diseases
diagnosed by a health care professional. In the dataset, the values are coded as 1 (Yes) and 2 (No). A
common way to treat a dichotomous variable is to code the values as 0 (No) and 1 (Yes). We can recode
the values of the variables q4001-q4060 to 0 and 1 using the command recode.
. recode q4001-q4060 (1=1)(2=0)
(q4001: 1424 changes made)
(q4010: 1537 changes made)
(q4014: 1568 changes made)
(q4022: 1293 changes made)
(q4025: 1512 changes made)
(q4033: 1547 changes made)
(q4040: 1407 changes made)
(q4060: 1061 changes made)

. summarize q4001-q4060

Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
q4001 | 1,597 .1083281 .3108917 0 1
q4010 | 1,597 .0375704 .1902145 0 1
q4014 | 1,597 .018159 .1335682 0 1
q4022 | 1,597 .1903569 .3927057 0 1
q4025 | 1,597 .0532248 .2245518 0 1
-------------+---------------------------------------------------------
q4033 | 1,597 .0313087 .1742053 0 1
q4040 | 1,597 .1189731 .3238582 0 1
q4060 | 1,597 .3356293 .4723579 0 1

TASK: What is the prevalence of each self-reported chronic disease in the study sample?
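
recode overwrites the original values. A more cautious variant, sketched below, writes the recoded
values to a new variable and leaves the original untouched. The variable name arthritis is a
hypothetical choice.

* recode into a new, labelled variable
recode q4001 (1=1 "Yes") (2=0 "No"), generate(arthritis)
* cross-check the old and new coding, including missing values
tabulate q4001 arthritis, missing

If q4001 has already been recoded in place as above, the rule (2=0) finds nothing to change and
arthritis is simply a labelled copy.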

1.8.6. To check the completeness of the values of the variables in the dataset, use the command misschk.

The command misschk is a user-written command, not a standard Stata command, so it has to be
installed from the web before its first use. Use the command search to find the relevant command on
the web, and the command ssc install to install it directly into your Stata.

search misschk

ssc install misschk

The dataset consists of 3265 observations. However, only about half of the sample completed the
questionnaire, so many of the variables in the dataset contain missing values.
. misschk q0002 q0104 hhweight id q1011 q4001-q4060

Variables examined for missing values

# Variable # Missing % Missing
--------------------------------------------
1 q0002 0 0.0
2 q0104 985 30.2
3 hhweight 10 0.3
4 q1011 1595 48.9
5 q4001 1668 51.1
6 q4010 1668 51.1
7 q4014 1668 51.1
8 q4022 1668 51.1
9 q4025 1668 51.1
10 q4033 1668 51.1
11 q4040 1668 51.1
12 q4060 1668 51.1

Warning: this output does not differentiate among extended missing.
To generate patterns for extended missing, use extmiss option.

Missing for |
which |
variables? | Freq. Percent Cum.
----------------+-----------------------------------
_2_45 67890 12 | 985 30.17 30.17
__3__ _____ __ | 10 0.31 30.47
___45 67890 12 | 610 18.68 49.16
____5 67890 12 | 73 2.24 51.39
_____ _____ __ | 1,587 48.61 100.00
----------------+-----------------------------------
Total | 3,265 100.00

Missing for |
how many |
variables? | Freq. Percent Cum.
------------+-----------------------------------
0 | 1,587 48.61 48.61
1 | 10 0.31 48.91
8 | 73 2.24 51.15
9 | 610 18.68 69.83
10 | 985 30.17 100.00
------------+-----------------------------------
Total | 3,265 100.00
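
Stata also has a built-in command, misstable, that gives a comparable overview without installing
anything. A minimal sketch for the numeric variables examined above:

* number of missing values per variable
misstable summarize q0104 hhweight q1011 q4001-q4060
* patterns of missing values across the variables
misstable patterns q0104 hhweight q1011 q4001-q4060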

1.8.7. To keep or drop observations in the dataset, use the command keep or drop.

It is important to read the manual of a secondary dataset to understand how the data were created and
structured. In the SAGE data, the variable q1000cactual records the result of the interview: completed
interviews are coded as 1, and all other codes (2 to 14 in the table below) denote interviews that were
not completed. Those who did not complete the interview should therefore be dropped from the
dataset. One can either drop those who did not complete the interview (drop if q1000cactual!=1) or
keep those who completed the interview (keep if q1000cactual==1).
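
Dropping observations cannot be undone once the dataset is saved, so it is worth checking the numbers
first. The sketch below counts the observations that would be removed and tries the selection on a
temporary basis with preserve and restore.

* how many observations would be dropped?
count if q1000cactual != 1
* try the selection, then return to the full dataset
preserve
keep if q1000cactual == 1
count
restore

Note that q1000cactual != 1 is also true for observations where the variable is missing.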

. tabulate q1000cactual

actual |
result code | Freq. Percent Cum.
------------+-----------------------------------
1 | 1,660 50.84 50.84
2 | 8 0.25 51.09
3 | 9 0.28 51.36
4 | 5 0.15 51.52
5 | 168 5.15 56.66
6 | 407 12.47 69.13
7 | 3 0.09 69.22
8 | 588 18.01 87.23
9 | 1 0.03 87.26
10 | 1 0.03 87.29
11 | 349 10.69 97.98
12 | 2 0.06 98.04
13 | 59 1.81 99.85
14 | 5 0.15 100.00
------------+-----------------------------------
Total | 3,265 100.00

. drop if q1000cactual!=1
(1,605 observations deleted)

. keep if q1000cactual==1
(0 observations deleted)

. misschk q0002 q0104 hhweight id q1011 q4001-q4060

Variables examined for missing values

# Variable # Missing % Missing
--------------------------------------------
1 q0002 0 0.0
2 q0104 0 0.0
3 hhweight 10 0.6
4 q1011 0 0.0
5 q4001 65 3.9
6 q4010 65 3.9
7 q4014 65 3.9
8 q4022 65 3.9
9 q4025 65 3.9
10 q4033 65 3.9
11 q4040 65 3.9
12 q4060 65 3.9

Warning: this output does not differentiate among extended missing.
To generate patterns for extended missing, use extmiss option.

Missing for |
which |
variables? | Freq. Percent Cum.
----------------+-----------------------------------
__3__ _____ __ | 10 0.60 0.60
____5 67890 12 | 65 3.92 4.52
_____ _____ __ | 1,585 95.48 100.00
----------------+-----------------------------------
Total | 1,660 100.00

Missing for |
how many |
variables? | Freq. Percent Cum.
------------+-----------------------------------
0 | 1,585 95.48 95.48
1 | 10 0.60 96.08
8 | 65 3.92 100.00
------------+-----------------------------------
Total | 1,660 100.00

TASK: What percentage of the values of the variables q4001-q4060 is missing?

1.8.8. To review the summary of a numeric variable, use the command inspect

The inspect command provides a quick summary of a numeric variable that differs from the summary
provided by summarize or tabulate. It reports the number of negative, zero, and positive values; the
number of integers and nonintegers; the number of unique values; and the number of missing values;
and it produces a small histogram. Its purpose is not analytical but to allow you to quickly gain
familiarity with unknown data.

inspect [varlist] [if] [in]

We can complement the results of inspect with the command tabulate. It seems that there are a few
individuals with an age below 0. We set all values from the minimum up to 0 to missing (represented
by a dot in Stata) using the command recode.
. inspect q1011

q1011: q1011: age in yyys Number of Observations
-------------------------- ---------------------------------------
Total Integers Nonintegers
| # Negative 3 3 -
| # Zero - - -
| # Positive 1,657 1,657 -
| # ----------- ----------- -----------
| # # Total 1,660 1,660 -
| . . # # . Missing -
+---------------------- -----------
-8 105 1,660
(76 unique values)

. tabulate q1011

q1011: age |
in yyys | Freq. Percent Cum.
------------+-----------------------------------
-8 | 3 0.18 0.18
22 | 2 0.12 0.30
24 | 2 0.12 0.42
/part of the results are removed/
96 | 2 0.12 99.88
97 | 1 0.06 99.94
105 | 1 0.06 100.00
------------+-----------------------------------
Total | 1,660 100.00

. recode q1011 (min/0=.)
(q1011: 3 changes made)
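
It is good practice to verify the result. In the sketch below, assert stops with an error if any age is still
negative; missing values pass the check because Stata treats them as larger than any number.

assert q1011 >= 0
summarize q1011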

1.8.9. To view the data in data browser, use the command browse.

To edit the data in data editor, use the command edit.

Using the drop-down menu: Select Data / Data browser. This opens a new data browser window in
which the user can only view data. The user can choose to display all variables or selected variables.

browse brings up a spreadsheet-style data browser for browsing existing data.

browse [varlist] [if] [in] [, nolabel]

Using the drop-down menu: Select Data / Data editor. This opens a new data editor window in which
the user can both view and edit data. The user can choose to display all variables or selected variables.

edit brings up a spreadsheet-style data editor for entering new data and editing existing data.

edit [varlist] [if] [in] [, nolabel]

1.8.10. To display the values of variables, use the command list.

list displays the values of variables. If no varlist is specified, the values of all the variables are displayed.
list [varlist] [if] [in] [, options]
. list q0002 q0104 hhweight id q1011 q4001-q4060 in 1/2

+-----------------------------------------------------------------------+
1. | q0002 | q0104 | hhweight | id | q1011 | q4001 | q4010 | q4014 |
| 10015 | 1 | .4142829 | 1401001501 | 67 | 0 | 0 | 0 |
|-----------------------------------------------------------------------|
| q4022 | q4025 | q4033 | q4040 | q4060 |
| 0 | 0 | 0 | 0 | 0 |
+-----------------------------------------------------------------------+

+-----------------------------------------------------------------------+
2. | q0002 | q0104 | hhweight | id | q1011 | q4001 | q4010 | q4014 |
| 10027 | 1 | .4142829 | 1401002701 | 75 | . | . | . |
|-----------------------------------------------------------------------|
| q4022 | q4025 | q4033 | q4040 | q4060 |
| . | . | . | . | . |
+-----------------------------------------------------------------------+
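
list accepts qualifiers and options in the usual way. Two illustrative variations:

* compact listing of the first five observations, without observation numbers
list id q1011 q4060 in 1/5, clean noobs
* list participants with a missing arthritis value
list id q1011 if missing(q4001)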

1.8.11. Use the command log close to close the log file at the end of the Stata session.

Find the log file that we created in the working directory folder, and try to open it. It is possible to copy
the result tables into Excel or Word for further exploration.
. log close
name: <unnamed>
log: /Users/nawing/Downloads/SAGE/Datasets/Working Datasets/analysis res
> ults 2019-10-03.log
log type: text
closed on: 2 Oct 2019, 10:54:38
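
The steps of this module can be collected in a do-file so that the whole session can be rerun. The
skeleton below is a sketch that reuses the paths and file names from this module; adjust them to your
own computer. capture in front of log close prevents an error when no log is open.

capture log close
clear all
global sourcedir "/Users/nawing/Downloads/SAGE/Datasets/Source Datasets/"
global workdir "/Users/nawing/Downloads/SAGE/Datasets/Working Datasets/"
log using "${workdir}analysis results 2019-10-03.log", replace
use "${sourcedir}SAGE Mexico cross-sectional data V12.dta", clear
* keep completed interviews and clean the variables
keep if q1000cactual == 1
recode q4001-q4060 (1=1)(2=0)
recode q1011 (min/0=.)
save "${workdir}SAGE Mexico analytical dataset.dta", replace
log close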
