From the information given in Week 1, it is assumed that you already (1) know how to access Stata
and have practiced opening the program, and (2) have read the Week 1 handout on Getting Started in
Stata.
In this practical you will learn about the Stata layout and the structure of the dataset, derive a variable,
reshape your dataset and calculate some simple measures of Y using basic commands. At the end,
you will use these commands to answer some questions.
The detail: For practicals, we will use the same dataset. This gives you a chance to get familiar
with the data. The dataset is from the very famous Framingham Heart Study. The Framingham
Heart Study is a longitudinal cohort study that began in the 1940s in the town of Framingham, USA.
At the time, there was a poor understanding of what was causing the large increases in heart disease
in Americans. Since its commencement, more than 1000 papers have been published using data
from the Framingham study. It is most famous for producing the heart disease risk scores. At study
conception (baseline), data was collected from n=5209 Framingham residents aged 30–62 years.
Every few years the Framingham cohort has been followed up; some individuals have been lost to
follow-up and others have died. The aim of this practical is to get familiar with Stata and the
Framingham data, and to calculate various outcome measures using that data.
You will find the Framingham dataset and a Data Dictionary on MyUni (week 2 practical). The Data
Dictionary is essential for understanding the variables in the database.
If needed, re-familiarise yourself with the handout about Stata that was provided in Week 1.
[Figure: the main Stata screen, with the Results window, Variables window, Command history, Properties window and Command window labelled]
Variables window: Lists the variables in the dataset. It has a search function (top row) to use as a
shortcut for looking for a specific variable
Command window: This is where you can type in commands to tell Stata what to do
Command history: This window lists commands used during the current period of Stata use
Results window: Shows the results of commands that you have asked Stata to execute
Properties window: This area provides more information about variables in the dataset
Notice also that if, after a search, you click on a help file that is not about the command you
wanted, you can just press the back button and the search results will be displayed again.
4.2 Properties window
Let’s now explore the Properties window. To begin with, click on the RANDID variable in the Variables
window. After you select this variable, you’ll see that the Properties window now shows details
about the RANDID variable. This window shows that the RANDID variable is stored in ‘long’ form.
Variables in long form are made up of integers (whole numbers), and the ‘Format’ component shows that the
integer is displayed with up to 10 characters (the %10g tells Stata how to display the variable;
you can find more information if you type help format). If a variable is stored as a ‘string’ variable, it is
stored as text. This means that you will not be able to use it in analyses (it is not held in numerical
format). A ‘double’ variable is the most precise storage form of all, holding the value of the number
down to a precision of 1.414 x 10^-16. We don’t need that level of precision in epidemiological analyses,
so storing variables in double format just uses up storage space.
The Data Editor window shows how the data is structured. The top of the columns is the variable name
(see RANDID, SEX etc) and the contents of the column have the data. The Variables and Properties
windows are shown here. These can be closed by clicking on the pin icon in the top right corner of the
window. However, it’s useful having the Variables window open because you can filter or search for a
specific variable.
Each row holds data for a unique individual at a unique time. As explained in the Data Dictionary, the
RANDID variable uniquely identifies each individual. But see in the Data Editor how a RANDID can appear
across 1, 2 or 3 rows? In this dataset, there are 3 waves of data collected from 3 different periods. For
individuals with 3 rows of data, they had data collected at all 3 waves. However, where there are only 1
or 2 rows of data, individuals have not had data collected for an entire wave (or 2). Possible reasons for
missing data could include declining to participate in a particular wave, unable to be contacted (known as
lost to follow up), or the participant may have died. Missing data can also be shown in other ways. For
example, the column TOTCHOL has many values, but notice that there are some that show a ‘.’ (full stop
symbol). In this case, although the individual participated in the wave, the data specifically for TOTCHOL
(Serum Total Cholesterol) wasn’t collected. The full stop symbol ‘.’ indicates that the data point for that
person, wave, and variable is missing.
The variable SEX is stored as the numbers 1 and 2. How do we know which number is male and which
is female? We need to consult the Data Dictionary. However, to avoid having to consult the Data
Dictionary every time, we can attach labels to the values of a variable in Stata. Labelling the values makes
analyses easier to interpret. To label the values, type the following commands, pressing enter at the end of each
line. (For details see help label).
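The labelling commands themselves are not reproduced in this copy of the handout. A minimal sketch of what they could look like is given here, assuming the Framingham dataset is already open and using the coding 1 = Male, 2 = Female; the label name sexlab is a hypothetical choice.

```stata
* Assumes the Framingham dataset is open in memory.
* "sexlab" is a hypothetical name for the set of value labels.
label define sexlab 1 "Male" 2 "Female"

* Attach the value labels to the SEX variable
label values SEX sexlab

* The table now shows Male/Female in place of 1/2
tab SEX
```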
As you can see in the figure below, STATA now shows the tables with the labels in the place of the
numbers. This makes the interpretation of the analysis easier.
However, if after the labeling, you still want to see the numbers instead of the labels, you can accomplish
this at any time by typing:
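The command is missing from this copy; one way to do this, as a sketch, is the nolabel option of tabulate, which shows the underlying numeric codes without removing the labels from the dataset.

```stata
* Show the numeric codes (1, 2) instead of the value labels
tab SEX, nolabel
```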
Open the Data Dictionary and look at the variables list on page 2. According to the Data Dictionary, SEX
has 5022 men and 6605 women. This is what we see in our table, so we are confident that our data looks
right. You can do this with other variables too; take time to look around the dataset and familiarise
yourself with what it includes. This may save you time later in the pracs. We will come back to the tab
command later in this practical.
4.6 Do files
We are going to type and run commands in a ‘do’ file. We can save our do files to keep a record of what
we have done. This is very helpful because analyses get complex very quickly, and you can start to build up a
library of syntax to draw on later. Also, I will prepare answers to the pracs in do files, so
in order to get solutions to the pracs, you will need to know how to run do files.
Three different ways to open a do file (all have the same outcome):
1. Go to the Main Menu and click: Window > Do-File Editor > New Do-File
2. Press CTRL + 9
3. Click the pencil-on-paper icon on the toolbar
Place your cursor in the do file, type tab SEX, highlight the text and then press CTRL+D. This action
will execute the command tab SEX that you wrote in your do file. Go to the Results window and you’ll
see that it shows the same output generated previously by typing in the Command window. When you
execute a command in a do file (by highlighting the text and pressing CTRL+D), it is the same as if you
typed it and pressed ENTER in the Command window.
When you include an asterisk (*) at the beginning of a line, the text appears in green font and Stata
treats it as a comment. The green font indicates the text is non-executable (it is not asking Stata
to execute a command). Including comments is useful because it allows you to keep notes about
your analysis. For example, you can create a comment for the labeling of the variables you did before.
It’s a good idea to start a do file with the date, author, and purpose of the file. For example, I would
usually write something like this at the beginning of my do file:
* 20 February 2023
* Written by Angela Gialamas
* Purpose: Solution for Intro to Epi, Week 3 Prac
Now save the do file by clicking on File > Save then saving with a sensible filename and location. Later you
can open your do file and return to your work.
sum AGE
This is equivalent to summarize AGE. In the help file see that the first 2 letters are underlined
(summarize). This tells you that the summarize command can be shortened to su, or to any longer form such as sum.
      Percentiles      Smallest
 1%           36             32
 5%           40             33
10%           42             33       Obs              11,627
25%           48             33       Sum of Wgt.      11,627
In addition to the summarize command, to visually examine values of a variable, the histogram command
is useful (See help hist for details).
hist AGE
You can see that Stata opens a new window with the histogram. The histogram shows
how age is distributed. For example, you can see in the graph that most of the
Framingham study participants were aged between 50 and 60 years old. This agrees with what
we found earlier when we typed the summarize command, which showed a mean of 54.79 years.
Notice also that the histogram starts at the minimum of 32 years and goes to a maximum of 81
years, two values that the table generated with the summarize command had also shown us.
Tables of numbers can be tricky to interpret, but graphs assist us in making sense of the data!
In the new window in which the histogram is displayed, you can click on File > Save. Now go ahead and
click on Save as type:. You can see that STATA allows you to save the image in several file formats,
including the common image format PNG (*.png). Saving the graph into an image file is incredibly useful
since, later on, you can import the image directly to an assignment or paper.
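The same result can be obtained from the Command window or a do file with graph export, which saves the graph currently displayed. This is only a sketch: the filename age_hist.png is a hypothetical choice, and the file is written to Stata's current working directory.

```stata
* Draw the histogram, then save the graph currently in memory as a PNG.
* "age_hist.png" is a hypothetical filename; replace overwrites an existing file.
hist AGE
graph export "age_hist.png", replace
```

Keeping this line in your do file means the image is regenerated every time you re-run the analysis.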
You can see that the new graph overlays a normal distribution curve on the histogram. But why should we
bother? The normal distribution is a very famous distribution, as you might have heard before, and if the
variable is normally distributed there are specific statistical techniques that can be readily applied (more
on this later in this course). Therefore, researchers and statisticians often try to assess whether a variable is
normally distributed, and overlaying the normal distribution on the histogram is one way of
doing that.
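A graph like the one described above can be produced by adding the normal option to the histogram command (it can be abbreviated to norm). The sketch assumes the Framingham dataset is open.

```stata
* Histogram of AGE with a normal density curve overlaid
hist AGE, normal
```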
Check other variables, such as hist SYSBP, norm. Compared to AGE, which one looks more “normal” to
you?
Now let’s go back to the summarize command. The summarize command is used for continuous variables
but see what happens if you use summarize for a dichotomous variable. (A dichotomous variable has
only 2 values, e.g. male & female).
sum SEX
It does not make sense to have a mean value for SEX, which has been given arbitrary values (1 = Male; 2 =
Female). Stata is an incredibly powerful statistical package, but unfortunately it cannot judge whether a
command makes sense or not; it will execute your commands regardless of whether they are sensible.
So be thoughtful about what you ask Stata to do.
Summarize can also be used to view summary statistics for many variables at the same time.
summarize AGE SYSBP DIABP
Digest each piece of output and how changing the commands alters the output.
tab SEX: With only one variable (SEX) we see a tidy table with the frequency (i.e. number of males and
females), their percentages of the total and the cumulative percentage. The cumulative percentage is
more interesting when there are more than 2 categories.
tab SEX PREVCHD: With two variables (SEX PREVCHD) we generate a typical 2x2 table (also known as
contingency table). Time to check PREVCHD in the Data Dictionary… It is the prevalence of coronary
heart disease (CHD), and it has values of 0 (free of disease) or 1 (has disease). Back to the table, we can
see that the number of males without CHD is 4523 and the number with CHD is 499. You can look at the females
in the same way. Notice that the total number of males and females is the same across all tables; it is always
good to check whether we are losing any of our sample when we type a command.
tab SEX PREVCHD, row col: The final command adds in row and column percentages.
Here’s how to read the columns: Of the people with no CHD, 41.94% were males and 58.06% were
females.
Now you do the PREVCHD=1 column: Is there a higher proportion of men or women who have CHD?
To read the rows: Of the males, 9.94% had CHD. Of the females, 5.19% had CHD.
tab CURSMOKE
Check your numbers are right by comparing with the Data Dictionary.
We can select only the smokers by applying CURSMOKE==1 to our tab command. This will allow us to see
the proportion of men and women who currently smoke. Note that you need to use two equals symbols
‘==’.
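As a sketch of the if condition just described (assuming the Framingham dataset is open), the tabulation restricted to current smokers would be:

```stata
* Tabulate SEX among current smokers only.
* Note the double equals sign: == tests for equality, = assigns a value.
tab SEX if CURSMOKE==1
```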
With the bysort command, you are telling STATA that you want to tabulate SEX but sorting it by the
values of the variable CURSMOKE (0 = not current smoker; 1 = Current Smoker). Notice that the
frequencies displayed using this command are equivalent to when we used the command tab SEX
CURSMOKE, but now it is possible to see also the percentages and the cumulative percentages.
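The bysort command described above is not reproduced in this copy; a minimal sketch of it, assuming the Framingham dataset is open, is:

```stata
* Sort by CURSMOKE, then produce a separate one-way table of SEX
* for non-smokers (CURSMOKE==0) and for current smokers (CURSMOKE==1)
bysort CURSMOKE: tab SEX
```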
Both if and bysort can be used with many other commands, including summarize.
summarize AGE if SEX==1
In this case, the result will show you the mean age of men (54.5 years) instead of the mean age of all
participants (54.8 years).
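For comparison, a sketch of the bysort version, which reports the summary for men and for women in a single command (assuming the Framingham dataset is open):

```stata
* Summary statistics for AGE, shown separately for each value of SEX
bysort SEX: summarize AGE
```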
Reflection: You should be starting to see how we can build commands in Stata. Having a repertoire of
commands at the front of your mind becomes easier the more you use Stata.
8 Generate a new variable, then name it, label it, then look at it using tab
We need to learn how to generate new variables from other variables in our dataset. Here we are going
to generate a new variable in which the body mass index (BMI) will be categorised. The first step is to
look at the variable which is going to be recoded. Looking at the Data Dictionary, you’ll see that only the
minimum and maximum values are given.
summarize BMI
Shows frequency, mean, SD, minimum and maximum values. Check the min and max are the same as the
Data Dictionary.
hist BMI
What do you notice about the distribution? Would you say it is typical of an adult population of this era?
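The commands that create BMICAT are not reproduced in this copy. The sketch below shows one way to generate, name and label such a variable. The cut-offs (the usual WHO categories) and the label name bmilab are assumptions, not necessarily the categories used in the original practical.

```stata
* ASSUMPTION: standard WHO cut-offs; check them against your own practical notes.
* Start with every value missing, then fill in each category.
gen BMICAT = .
replace BMICAT = 1 if BMI < 18.5
replace BMICAT = 2 if BMI >= 18.5 & BMI < 25
replace BMICAT = 3 if BMI >= 25 & BMI < 30
* Stata treats missing (.) as larger than any number, so exclude it explicitly
replace BMICAT = 4 if BMI >= 30 & !missing(BMI)

* Name the variable and label its values ("bmilab" is a hypothetical label name)
label variable BMICAT "BMI categories"
label define bmilab 1 "Underweight" 2 "Normal" 3 "Overweight" 4 "Obese"
label values BMICAT bmilab
```

Without the !missing(BMI) condition, anyone with a missing BMI would be wrongly counted in the highest category.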
tab BMICAT
BMI categories       Freq.     Percent        Cum.
You should see the table above in the Results window. If you don’t, look at the error messages in the
Results window and try to figure out what went wrong. Recheck your commands. This might be a good
time to mention the ‘drop’ command. You can undo what you’ve typed by dropping the BMICAT variable
and having another go at generating the BMICAT variable. Important note: it is not possible to recover a
variable once it has been dropped. Your only hope is to re-open the original dataset that you
downloaded and re-run the syntax.
help drop
drop BMICAT
Look down the list of id numbers (RANDID). Individuals who appear in multiple waves have multiple
rows. For example, RANDID 2448 has two rows, one for PERIOD=1 and another for PERIOD=3. This
means that data was collected for the person with id number 2448 in waves 1 and 3. For some people,
you can see that they have prevalent CHD at baseline, or acquire CHD at a later wave (e.g. RANDID 12629
has CHD in period 2 but did not have CHD in period 1). Notice also that the participant RANDID 12629
doesn’t have information on PERIOD 3. This data structure is referred to as ‘long’ form.
In this section, we are going to manipulate the data into the wide form. In contrast to the long form, in
the wide form each individual has only one row. For example, although RANDID 2448 participated in two
waves and previously had two rows in the long form, in the wide form this individual will have a single
row. The difference is that in the wide form the variables are split by wave. So,
instead of the variable PREVCHD, we will have the variables PREVCHD1, PREVCHD2, and PREVCHD3,
containing the prevalent CHD status of the individual at each wave of the study.
This may seem confusing now, but it will become easier when you visualize the data. Before conducting
the reshaping, our first step is to make sure every participant has a row for each PERIOD. Watch what
happens to RANDID 2448 (which is missing PERIOD 2).
We are using the tsset command to tell Stata that we are using time-series data (e.g. PERIOD 1, 2 and 3
reflect different periods in time). The tsfill command fills in the gaps in the time variable PERIOD. If you
now look at RANDID 12629 you will see that a row for PERIOD 3 has been introduced, and all the data within
it is set to missing (“.”).
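The tsset and tsfill commands are not reproduced in this copy. A minimal sketch is given below, assuming the Framingham dataset is open in long form. The full option is included because, without it, tsfill only fills gaps between a participant's first and last observed period and would not add a trailing PERIOD 3 row.

```stata
* Declare the panel structure: RANDID identifies the person, PERIOD is time
tsset RANDID PERIOD

* Add a row for every missing RANDID/PERIOD combination (all other variables set to .)
tsfill, full

* Check two participants who previously had only two rows each
list RANDID PERIOD PREVCHD if RANDID == 2448 | RANDID == 12629
```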
Now let’s proceed with the reshaping. Read the help file to get an understanding of what you are trying
to do.
help reshape
reshape wide PREVCHD, i(RANDID) j(PERIOD)
When we type the reshape command we get an error message that some variables are not constant
within RANDID (“variable BMI not constant within RANDID”). You’ll notice that many of the other
variables in the dataset differ over time. For example, the variable BMI for participant RANDID 6238 has
the values of 28.73 for PERIOD=1, 29.34 for PERIOD=2 and 28.5 for PERIOD=3. This variable is not
constant and the reshape command won’t work when values are not constant over PERIOD. Therefore, if
a variable is not constant across all periods, we need to specify that we want to include that variable in
the reshaping command or we need to drop it.
So before reshaping, we are going to drop all of the variables that we don’t need to calculate incidence.
(An alternative to this is to use the ‘preserve’ command, which you can investigate if you wish). After
calculating incidence, we will not save the dataset so we will be able to ‘restore’ it later, by opening it
again without any loss of data.
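The command for this step is not reproduced in this copy. One way to do it, as a sketch, is keep, which drops everything not listed. The variable list here is taken from the reshape command used in this section, plus the identifier RANDID and the time variable PERIOD that reshape needs.

```stata
* Keep the identifier, the time variable and the variables to be reshaped;
* every other variable is dropped from memory (the file on disk is unchanged)
keep RANDID PERIOD CURSMOKE PREVCHD PREVAP PREVMI PREVSTRK PREVHYP

* Confirm which variables remain
describe
```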
(Here we have dropped all other variables from our dataset and are working only on those above. This is
an irrevocable loss of information. To get the lost data back you will have to reopen the dataset.)
reshape wide CURSMOKE PREVCHD PREVAP PREVMI PREVSTRK PREVHYP, i(RANDID) j(PERIOD)
Now check out what you have done in the Data Editor.
browse
We have kept the data for the variables in the list above (dropped everything else), plus you have the
values for each of these variables at each wave of the study.
If you want to get all of the data back, you will need to reopen the dataset. This is because we used the
command keep. Go ahead and type clear. STATA will close the dataset without saving it, so you can
reopen the file you saved from MyUni and the original dataset without any changes will be there.
Incidence
An easy way of looking at incidence (new cases of CHD) at PERIOD 2 is to do a simple cross-tabulation.
tab PREVCHD1 PREVCHD2, row
From this, we can see that the number of new cases at PERIOD 2 who did not have CHD at PERIOD 1 is
152, which is 4% of the cohort. We can also check that other aspects of the data look right. For example,
all the people who had CHD at PERIOD 1 must be coded as having CHD at PERIOD 2, as a diagnosis of CHD
cannot be revoked.
If we want to look at prevalent cases of CHD at PERIOD 2, these will include cases at baseline (when
PERIOD==1, so notice the number of 288 cases at the bottom of the table) as well as new cases in PERIOD
2. So in fact, you can see prevalence and incidence in this simple little table.
11 Age-standardisation (direct)
Before doing this section, read the supplementary slides on age standardisation. The idea is that we weight
the populations to make them similar, which allows us to compare death rates (or disease rates) across
populations with different age structures. In this example, we will use direct standardisation and
compare the age-standardised rates of CHD for men and women at baseline.
In the previous section, we did not keep the AGE variable so you first need to re-open the original dataset.
Next, you will keep only the baseline data (for simplicity) and look at the range of ages present at baseline.
keep if PERIOD==1
sum AGE if PERIOD==1
recode AGE (30/34 = 1 "30 - 34") (35/39 = 2 "35 - 39") (40/44 = 3 "40 - 44") (45/49 = 4 "45 - 49") (50/54 = 5 "50 - 54") (55/59 = 6 "55 - 59") (60/64 = 7 "60 - 64") (65/69 = 8 "65 - 69") (70/74 = 9 "70 - 74"), gen(agecat)
Notice that you need to type the code as a single line in the Command Window.
Be careful with the use of spaces: (40/44 = 3 "40 - 44") is different from (40/44=3"40 - 44"). Use plain
straight quotation marks around the labels. In addition, when you copy the command to your do file, you can
include ‘///’ at the end of a line to allow syntax to spread across multiple lines and still be executable.
Therefore, if you highlight the three lines and press “Execute selection (do)”, Stata will run them as a single command.
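As an illustration, the recode command could be laid out over three lines in a do file as sketched below. The line breaks are an arbitrary choice; /// works only in do files, not in the Command window.

```stata
* Each /// tells Stata that the command continues on the next line.
* Run either this or the one-line version, not both (agecat can only be created once).
recode AGE (30/34 = 1 "30 - 34") (35/39 = 2 "35 - 39") (40/44 = 3 "40 - 44") ///
    (45/49 = 4 "45 - 49") (50/54 = 5 "50 - 54") (55/59 = 6 "55 - 59") ///
    (60/64 = 7 "60 - 64") (65/69 = 8 "65 - 69") (70/74 = 9 "70 - 74"), gen(agecat)
```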
tab agecat
tab agecat SEX
The next step is to examine the dstdize help file in Stata. Examples of how to standardise for populations are
available online. If you want to explore further have a look at the example provided by stata here.
help dstdize
Usually, the dstdize command in Stata is used with aggregate data (e.g. data that has already been grouped
according to age, sex, and any other characteristic) and is compared against another population. Here we
have individual level data. So in order to use the dstdize command, we need to create a variable called pop
that tells Stata to use the current population as the population upon which to standardise.
gen pop=1
The syntax below generates a standardized rate for the prevalence of CHD (PREVCHD) at baseline
(PERIOD==1) and compares the rates of CHD by SEX.
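The dstdize command itself is not reproduced in this copy. A sketch consistent with the description, following the general form dstdize charvar popvar stratavars, by(groupvars), would be the following. It assumes only baseline rows are in memory and that agecat and pop have already been created as described in this section.

```stata
* PREVCHD = characteristic to standardise (0/1), pop = population count (1 per person),
* agecat  = strata to standardise over, SEX = groups to compare.
* With no using() option, the standard population is taken from the data in memory.
dstdize PREVCHD pop agecat, by(SEX)
```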
The table at the very bottom of the output summarizes the data in each table. You can see totals for the
sample size, crude and adjusted rates and their confidence intervals.
You can see that the age-standardised rate of CHD for men (0.43) is much higher than for women (0.28).
For simplicity, we have used the data within our dataset to standardise our population. This means that we
have used the age distribution within our population as the ‘standard’ population. We can do direct
standardisation with other population distributions, see dstdize help file for instructions if you ever want to
do this.
9. Practice questions: Use what you have learned to answer the following questions
Use the Data Dictionary (DD) to find the variables you need
You can complete this exercise using the commands given in this prac (sum, tab, if and bysort) but if
you happen to know other commands, by all means, go ahead and use them
1. Prevalence of hypertension at baseline for males and females separately
This time you’ll need to use the tab, if and bysort commands. (Hint: You could also do this using
if SEX==1 and then repeating with if SEX==2, but by using bysort you will get the same results while running
the command only once. Try to learn how to use the bysort command to complete this exercise.)
2. Incidence of coronary heart disease (CHD) from period 1 to period 2
Open the original dataset without the modifications made in this practical. For this exercise, you will
need to keep the relevant variables, then reshape the dataset, and then calculate incidence.
3. Cumulative incidence of diabetes up to period 3
Use the command clear and open the original dataset again. You will need to keep the relevant variables,
reshape the dataset and calculate the incidence from period 1 through to 3.
Good work!