University of Gothenburg
Professor Nawi Ng
In this Stata training, we will use two datasets from the World Health Organization’s Study on global
AGEing and adult health (WHO SAGE) in Mexico. In brief, WHO SAGE is a multicountry
longitudinal study of adults aged 50 years and older from nationally representative samples in China,
Ghana, India, Mexico, the Russian Federation and South Africa. The baseline data collection (SAGE Wave 1)
was conducted during 2007-2010, the 2nd Wave during 2014-2015, and the 3rd Wave during 2018-2019.
In some of the countries, the SAGE baseline was conducted among the participants in the WHO World
Health Survey (WHS) in 2002-2004; hence, the WHS data are considered Wave 0. This is the case in
Mexico.
The first dataset, labelled “SAGE Mexico cross-sectional data V12.dta”, contains SAGE Wave 1 data
from 3265 older individuals in Mexico. The second dataset, labelled “SAGE Mexico longitudinal data
long format V12.dta”, contains data from 1359 individuals who participated in Wave 0, Wave 1 and
Wave 2 of SAGE in Mexico. Refer to the SAGE questionnaires to review how each question was
formulated.
Please note that these datasets are abridged and should be used for training purposes only. The datasets are
not suitable for scientific research because they do not contain the full data from all SAGE participants.
Please visit the WHO SAGE website for further information about SAGE and to apply for access to the
full dataset for research (https://www.who.int/healthinfo/sage/en/)
Figure 1. The WHO Study on global AGeing and adult health (SAGE)
Module 1: Using Stata for Data Analysis
Download the Stata Version 16 software from the student portal (https://studentportal.gu.se/english/). The
software is available under the menu Services & Tools. When you run Stata for the first time, remember
to update the software. Choose Help and select Check for updates.
Figure 3. The Stata working environment
Module 1 - Page 1 of 14
1.2. The grammatical rules of Stata syntax
Stata variable names and commands are case-sensitive: Sex and sex are two distinct variables.
The recommendation is to use lower-case variable names. Stata commands are modified by qualifiers
(such as if and in) and by options; a comma (,) separates the options from the rest of the command. A
command prefix precedes the command, and a colon (:) should be placed between the command prefix
and the subsequent command.
• bysort sex: is a command prefix; in this case it is a command prefix for the command tabulate. A
colon should be placed after the command prefix.
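The general pattern can be sketched with the auto dataset that ships with Stata. It is used here only so that the example runs without the SAGE files; the variables rep78, price, mpg and foreign belong to that example dataset, not to SAGE.

```stata
* Load an example dataset shipped with Stata (not part of the SAGE material)
sysuse auto, clear

* Command, variable, if qualifier, and then an option after the comma
tabulate rep78 if price > 5000, missing

* Command prefix, colon, then the command: one summary per group of foreign
bysort foreign: summarize mpg
```

The same layout applies to the SAGE variables, for example a tabulate command preceded by a bysort sex: prefix.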
Use the command help to view more detailed information on Stata commands, for example help tabulate.
Figure 5. The Viewer window of help tabulate
1.3. Data types in Stata
The command help data types provides a quick reference for different data types allowed in Stata.
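As a small illustration, the sketch below creates one variable of each common storage type in an empty dataset and lets describe report them. The variable names and values are made up for this example and do not come from the SAGE files.

```stata
* Start from an empty dataset with three observations
clear
set obs 3

* Integer storage types, from smallest to largest
generate byte disease = 1
generate int year = 2010
generate long hhid = 321198

* Floating-point storage types (double holds more digits than float)
generate float weight = 0.41
generate double height = 170.5

* String of up to 10 characters
generate str10 caseid = "1409085201"

* The storage type of each variable appears in the second column
describe
```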
1.5. Setting the working directories, opening and saving Stata datasets
Use the command global to store each working directory in a global macro. This is more efficient because
the path to the folder where a dataset is located then need not be written repeatedly.
The following command will set the directory to where the original datasets are located, and name it
sourcedir.
The path in between the quotation marks indicates where the original SAGE datasets are saved.
The following command will set the directory in which the working datasets will be saved. We name it
workdir.
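A minimal sketch of the two definitions follows. The paths are taken from the log output shown later in this module and are specific to one computer, so replace them with the folders on your own machine. The trailing slash is kept because the commands in this module place the file name directly after ${sourcedir} or ${workdir}.

```stata
* Folder holding the original SAGE datasets (example path; adjust to your machine)
global sourcedir "/Users/nawing/Downloads/SAGE/Datasets/Source Datasets/"

* Folder where working datasets and log files will be saved (example path)
global workdir "/Users/nawing/Downloads/SAGE/Datasets/Working Datasets/"

* Display the stored paths to check them
display "${sourcedir}"
display "${workdir}"
```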
1.6.1. To open the SAGE datafile in the Source datasets folder, use the menu File/Open.
Find the datafile and click Open. Typing the following command in the Command window will trigger
the same action. Including the option clear at the end of the command ensures that any data currently
in memory are cleared before the dataset is loaded.
To open the dataset which is located in the sourcedir, type the following command.
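Assuming the global macro sourcedir has been defined as described in section 1.5 and ends with a slash, the command could look like this (the file name is that of the cross-sectional dataset introduced at the start of this module):

```stata
* Open the cross-sectional SAGE dataset, clearing any data already in memory
use "${sourcedir}SAGE Mexico cross-sectional data V12.dta", clear
```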
It is possible to open a subset of a dataset by specifying a varlist (to open selected variables) and the
in qualifier (to open selected observations). For example, to load only the variables q0002, quintile_c,
Ethnicity and Religion and the first 100 observations from the SAGE dataset, type the following command
in the Command window.
use q0002 quintile_c Ethnicity Religion using "${sourcedir}SAGE Mexico cross-sectional data V12.dta" in 1/100, clear
1.6.2. To save the SAGE datafile in the Working datasets folder, use the menu File/Save as.
Locate the folder, name the file and click on Save. Typing the following command in the Command
window will trigger the same action. Including the option replace at the end of the command will replace
the existing dataset on disk with the one currently in memory.
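A sketch of such a command, assuming that the global macro workdir from section 1.5 is defined and that a dataset is in memory. The file name is a placeholder chosen for this example.

```stata
* Save the data in memory to the working folder, overwriting an existing file
save "${workdir}SAGE Mexico working data.dta", replace
```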
1.7.2. To export Stata data to Excel, use the command export excel.
Use the option firstrow(variables) to write the variable names into the first row of the Excel file.
. export excel using "${sourcedir}personality.xls", firstrow(variables) replace
file /Users/nawing/Downloads/SAGE/Datasets/Source Datasets/personality.xls saved
1.7.3. To import an Excel dataset into Stata, use the command import excel.
Use the option firstrow to indicate that the first row of the Excel sheet contains the variable names.
. import excel using "${sourcedir}personality.xls", firstrow clear
(44 vars, 459 obs)
The log file can be saved in either smcl or log format. A smcl log file has the same formatting as the
output in the Stata Results window and can be copied to Excel easily. A log-format (plain text) file can be
opened in any word processor, but it is not easy to copy the results to Excel.
. log using "${workdir}analysis results 2019-10-03.log", replace
--------------------------------------------------------------------------------
name: <unnamed>
log: /Users/nawing/Downloads/SAGE/Datasets/Working Datasets/analysis res
> ults 2019-10-03.log
log type: text
opened on: 2 Oct 2019, 10:54:38
1.8.2. Get familiar with the dataset structure using the command describe
describe produces a summary of the dataset in memory or of the data stored in a Stata-format dataset.
Using the drop-down menu: Select Data / Describe data / Describe data in memory. Leave the variables
box empty to display all variables. We can view selected variables using the * (asterisk) and - (hyphen)
symbols. Using q20* will show all variables that start with q20, and q2001-q2049 will display all variables
stored between q2001 and q2049 in the dataset.
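The same selections can be typed in the Command window. The sketch below assumes the cross-sectional SAGE dataset is in memory and uses the chronic disease variables q4001 to q4060 that appear later in this module.

```stata
* All variables whose names start with q40
describe q40*

* All variables stored from q4001 through q4060, in dataset order
describe q4001-q4060
```

Note that a range such as q4001-q4060 follows the order in which the variables are stored, not the numbers in their names.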
. describe
--------------------------------------------------------------------------------
TASK: Review section 1.3 on the different data types in Stata and discuss the appropriateness of the
choice of storage type in the dataset.
1.8.3. To get familiar with the contents of the dataset, use the command codebook
codebook examines the variable names, labels, and data to produce a codebook describing the dataset.
codebook [varlist] [if] [in] [, options]
Using the drop-down menu: Select Data / Describe data / Describe data contents (codebook): Leave the
variables box empty to display all variables. Selected variables can be viewed by using the * (asterisk)
and - (hyphen) symbols.
. codebook q0002 id q4001 q2506
--------------------------------------------------------------------------------
q0002 q0002: household id number
--------------------------------------------------------------------------------
type: numeric (long)
range: [10002,321198] units: 1
unique values: 2,993 missing .: 0/3,265
mean: 159135
std. dev: 80233.1
--------------------------------------------------------------------------------
id individual case id 1st wave
--------------------------------------------------------------------------------
type: string (str10)
unique values: 3,265 missing "": 0/3,265
examples: "1409085201"
"1413056503"
"1418080701"
"1424083102"
--------------------------------------------------------------------------------
q4001 q4001: arthritis
--------------------------------------------------------------------------------
type: numeric (byte)
label: q4001, but label does not exist
range: [1,2] units: 1
unique values: 2 missing .: 1,668/3,265
--------------------------------------------------------------------------------
q2506 q2506: true height (cm)
--------------------------------------------------------------------------------
type: numeric (double)
label: q2506, but label does not exist
range: [128.8,998] units: .1
unique values: 384 missing .: 1,743/3,265
mean: 173.492
std. dev: 121.226
1.8.4. To summarise the variables in the dataset loaded in memory, use the command summarize.
summarize calculates and displays a variety of univariate summary statistics. If no varlist is specified,
summary statistics are calculated for all the variables in the dataset.
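A short sketch of both forms, assuming the cross-sectional SAGE dataset is in memory. The variables q1011 (age) and q2506 (height) are taken from other sections of this module.

```stata
* Summary statistics for every variable in memory
summarize

* Selected variables only; detail adds percentiles, skewness and kurtosis
summarize q1011 q2506, detail
```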
TASK: Review the results above. Do you see any problems with some of the variables?
1.8.5. To recode values of a variable, use the command recode.
The variables q4001-q4060 are dichotomous variables denoting the presence of chronic diseases diagnosed
by a health care professional. In the dataset, the values of these variables are coded as 1 (Yes) and 2 (No). A
common way to treat a dichotomous variable is to code the values as 0 (No) and 1 (Yes). We can recode the
values of the variables q4001-q4060 to 0 and 1 using the command recode.
. recode q4001-q4060 (1=1)(2=0)
(q4001: 1424 changes made)
(q4010: 1537 changes made)
(q4014: 1568 changes made)
(q4022: 1293 changes made)
(q4025: 1512 changes made)
(q4033: 1547 changes made)
(q4040: 1407 changes made)
(q4060: 1061 changes made)
. summarize q4001-q4060
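The recode above overwrites the original variables. A more cautious variant, sketched here on six made-up observations rather than on the SAGE data, writes the recoded values to a new variable with the generate() option so that the original can be cross-checked. The new variable name arthritis is chosen for this example only.

```stata
* Six illustrative observations: 1 = Yes, 2 = No, . = missing
clear
input byte q4001
1
2
2
1
2
.
end

* 1 stays 1 and 2 becomes 0; the result is stored in a new variable
recode q4001 (1=1) (2=0), generate(arthritis)

* Cross-check the original against the recoded variable, including missing values
tabulate q4001 arthritis, missing

* For a 0/1 variable, the mean is the proportion coded 1
summarize arthritis
```

Because the mean of a 0/1 variable is the proportion of ones, the Mean column of summarize can be read directly as a prevalence among those with non-missing values.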
TASK: What is the prevalence of the different self-reported chronic diseases in the study sample?
1.8.6. To check for the completeness of the values of the variables in the dataset, use the command
misschk
The command misschk is a user-written command, not a standard Stata command. Therefore, it has to be
installed from the Web before its first use. Use the command search to find the relevant
command on the Web. Use the command ssc install to install the command directly in your Stata.
search misschk
The dataset consists of 3265 observations; however, only about half of the sample filled in the
questionnaire, so many of the variables in the dataset contain missing values.
. misschk q0002 q0104 hhweight id q1011 q4001-q4060
Missing for |
which |
variables? | Freq. Percent Cum.
----------------+-----------------------------------
_2_45 67890 12 | 985 30.17 30.17
__3__ _____ __ | 10 0.31 30.47
___45 67890 12 | 610 18.68 49.16
____5 67890 12 | 73 2.24 51.39
_____ _____ __ | 1,587 48.61 100.00
----------------+-----------------------------------
Total | 3,265 100.00
Missing for |
how many |
variables? | Freq. Percent Cum.
------------+-----------------------------------
0 | 1,587 48.61 48.61
1 | 10 0.31 48.91
8 | 73 2.24 51.15
9 | 610 18.68 69.83
10 | 985 30.17 100.00
------------+-----------------------------------
Total | 3,265 100.00
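If misschk cannot be installed, Stata's built-in misstable command reports similar information. It is a different command from the one used in this module, so the layout of its output differs from the tables above. The sketch assumes the cross-sectional SAGE dataset is in memory and lists only the numeric variables from the misschk call.

```stata
* Number of missing values for each listed variable
misstable summarize q0002 q0104 hhweight q1011 q4001-q4060

* Patterns of variables that are missing together, shown as frequencies
misstable patterns q0002 q0104 hhweight q1011 q4001-q4060, frequency
```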
1.8.7. To keep or drop observations in the dataset, use the command keep or drop.
It is important to read the manual of a secondary dataset to understand how the data were created and
structured. In the SAGE data, the variable q1000cactual records the result of the interview: completed
interviews are coded as 1, and all other codes denote interviews that were not completed. Those who did
not complete the interview should therefore be dropped from the dataset. One can either drop those who
did not complete the interview (drop if q1000cactual!=1) or keep those who completed the interview
(keep if q1000cactual==1).
. tabulate q1000cactual
actual |
result code | Freq. Percent Cum.
------------+-----------------------------------
1 | 1,660 50.84 50.84
2 | 8 0.25 51.09
3 | 9 0.28 51.36
4 | 5 0.15 51.52
5 | 168 5.15 56.66
6 | 407 12.47 69.13
7 | 3 0.09 69.22
8 | 588 18.01 87.23
9 | 1 0.03 87.26
10 | 1 0.03 87.29
11 | 349 10.69 97.98
12 | 2 0.06 98.04
13 | 59 1.81 99.85
14 | 5 0.15 100.00
------------+-----------------------------------
Total | 3,265 100.00
. drop if q1000cactual!=1
(1,605 observations deleted)
. keep if q1000cactual==1
(0 observations deleted)
Missing for |
which |
variables? | Freq. Percent Cum.
----------------+-----------------------------------
__3__ _____ __ | 10 0.60 0.60
____5 67890 12 | 65 3.92 4.52
_____ _____ __ | 1,585 95.48 100.00
----------------+-----------------------------------
Total | 1,660 100.00
Missing for |
how many |
variables? | Freq. Percent Cum.
------------+-----------------------------------
0 | 1,585 95.48 95.48
1 | 10 0.60 96.08
8 | 65 3.92 100.00
------------+-----------------------------------
Total | 1,660 100.00
1.8.8. To review the summary of a numeric variable, use the command inspect
The inspect command provides a quick summary of a numeric variable that differs from the summary
provided by summarize or tabulate. It reports the number of negative, zero, and positive values; the
number of integers and nonintegers; the number of unique values; and the number of missing values; and it
produces a small histogram. Its purpose is not analytical but is to allow you to quickly gain familiarity
with unknown data.
inspect [varlist] [if] [in]
We can complement the results of inspect by using the command tabulate. It seems that there are a few
individuals with an age below 0. We set all values from the minimum up to and including 0 to missing
(represented by a dot in Stata). We use the command recode.
. inspect q1011
. tabulate q1011
q1011: age |
in yyys | Freq. Percent Cum.
------------+-----------------------------------
-8 | 3 0.18 0.18
22 | 2 0.12 0.30
24 | 2 0.12 0.42
/part of the results are removed/
96 | 2 0.12 99.88
97 | 1 0.06 99.94
105 | 1 0.06 100.00
------------+-----------------------------------
Total | 1,660 100.00
. recode q1011 (min/0=.)
(q1011: 3 changes made)
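The same cleaning step can also be written with replace. The sketch below uses five made-up ages instead of the SAGE variable q1011; the variable name age is an assumption for this example.

```stata
* Five illustrative ages, one of them negative and one already missing
clear
input int age
-8
22
67
105
.
end

* Same effect as recode age (min/0=.): values of 0 or below become missing
replace age = . if age <= 0

* Count the remaining valid ages
count if !missing(age)
```

Stata treats a missing value as larger than any number, so the condition age <= 0 leaves observations that are already missing untouched.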
1.8.9. To view the data in data browser, use the command browse.
Using the drop-down menu: Select Data / Data browser: This will open a new Data Browser window in
which the user can only view data. The user can choose to display all variables or selected variables.
Using the drop-down menu: Select Data / Data editor: This will open a new Data Editor window in which
the user can view and edit data. The user can choose to display all variables or selected variables.
edit brings up a spreadsheet-style data editor for entering new data and editing existing data.
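The browse command also accepts a variable list and the if and in qualifiers, which helps when only part of the data needs checking. The sketch assumes the cross-sectional SAGE dataset is in memory and that Stata is running with its graphical interface.

```stata
* View selected variables for the first 20 observations (read-only)
browse q0002 id q1011 in 1/20

* View only respondents with a recorded age above 90 years
browse id q1011 if q1011 > 90 & !missing(q1011)
```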
list displays the values of variables. If no varlist is specified, the values of all the variables are displayed.
list [varlist] [if] [in] [, options]
. list q0002 q0104 hhweight id q1011 q4001-q4060 in 1/2
+-----------------------------------------------------------------------+
1. | q0002 | q0104 | hhweight | id | q1011 | q4001 | q4010 | q4014 |
| 10015 | 1 | .4142829 | 1401001501 | 67 | 0 | 0 | 0 |
|-----------------------------------------------------------------------|
| q4022 | q4025 | q4033 | q4040 | q4060 |
| 0 | 0 | 0 | 0 | 0 |
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
2. | q0002 | q0104 | hhweight | id | q1011 | q4001 | q4010 | q4014 |
| 10027 | 1 | .4142829 | 1401002701 | 75 | . | . | . |
|-----------------------------------------------------------------------|
| q4022 | q4025 | q4033 | q4040 | q4060 |
| . | . | . | . | . |
+-----------------------------------------------------------------------+
1.8.11. Use the command log close to close the log file at the end of the Stata session.
Find the log file that we created in the working directory folder, and try to open it. It is possible to copy
the result tables into Excel or Word for further exploration.
. log close
name: <unnamed>
log: /Users/nawing/Downloads/SAGE/Datasets/Working Datasets/analysis res
> ults 2019-10-03.log
log type: text
closed on: 2 Oct 2019, 10:54:38