
INTRODUCTION TO EPIDEMIOLOGY

PRACTICAL WEEK 2: Getting Started in Stata & Measures of Y

From the information given in Week 1, it is assumed that you already (1) know how to access Stata
and have practiced opening the program, and (2) have read the Week 1 handout on Getting Started in
Stata.

In this practical you will learn about the Stata layout and the structure of the dataset, derive a variable,
reshape your dataset and calculate some simple measures of Y using basic commands. At the end,
you will use these commands to answer some questions.

The detail: For all practicals, we will use the same dataset. This gives you a chance to get familiar
with the data. The dataset is from the very famous Framingham Heart Study. The Framingham
Heart Study is a longitudinal cohort study that began in the 1940s in the town of Framingham, USA.
At the time, there was a poor understanding of what was causing the large increases in heart disease
in Americans. Since its commencement, more than 1000 papers have been published using data
from the Framingham study. It is most famous for producing the heart disease risk scores. At study
conception (baseline), data was collected from n=5209 Framingham residents aged 30–62 years.
The Framingham cohort has been followed up every few years; over time, some individuals have been
lost to follow-up and others have died. The aim of this practical is to get familiar with Stata and the
Framingham data, and to calculate various outcome measures using the Framingham data.

You will find the Framingham dataset and a Data Dictionary on MyUni (week 2 practical). The Data
Dictionary is essential for understanding the variables in the database.

Here’s a summary of what to do:


1. Open Stata
2. Open the Data dictionary and read about the dataset
3. Learn about Stata view/layout
4. Open the Framingham dataset, the Data Editor, a Do file and help files
5. View data, learn a little about different data types
6. Write simple commands that show you how to label a variable, generate new variables,
reshape your dataset, calculate prevalence and incidence, and compare age-standardised
death rates for men and women
7. Complete some short answer questions to test some of what you have learned
1. Open Stata

If Stata is installed on the computer: Click on the Start menu button. Select All Programs, then Stata
from your list of programs.

If using ADAPT: Follow instructions here

If needed, re-familiarise yourself with the handout about Stata that was provided in Week 1.

2. Open the Data Dictionary


• Download and save the Data dictionary from MyUni (link provided at Week 2 Practical).
• Open the Data dictionary and familiarise yourself with the contents. You don’t need to
memorise everything. Just know what information is available to refer to when you need it.

What is a data dictionary? It’s a file that describes the content of the database, the variables and
how they are coded. It is good practice to have a data dictionary for datasets, and well-established
studies have one.

3. Open the dataset


• Download and save the dataset (link provided on MyUni, Week 2 Practical). Save it in a
sensible location with a sensible name (e.g. Framingham.dta), so that you can easily find it
now and for future pracs.
• To open the dataset, click File > Open, locate where you stored the dataset, click on the
dataset name, then click Open. After you open the dataset, you should see something similar to
the image below (notice that the variables of the Framingham study are now displayed in
the upper-right “Variables” window).
4. Stata layout

[Screenshot: the main Stata window, with the Results window, Variables window, Command history,
Properties window and Command window labelled]

Variables window: Lists the variables in the dataset. It has a search function (top row) to use as a
shortcut for looking for a specific variable
Command window: This is where you can type in commands to tell Stata what to do
Command history: This window lists commands used during the current period of Stata use
Results window: Shows the results of commands that you have asked Stata to execute
Properties window: This area provides more information about variables in the dataset

4.1 Help files


Place your cursor in the Command window and type help datatypes (then press enter).
A new window will open. This is a ‘help’ file. Stata has inbuilt help files that you can call on to help you
with writing commands. At first help files don’t seem all that helpful, as some knowledge about using
Stata is assumed. However, the more you use Stata, the more you will understand the help files. Help files
are available for every command used in Stata. If you don’t know the name of the command that you want,
you can use the search function, for example, by typing search data types. You’ll see that Stata can help
you find the appropriate help file for the command you’re looking for. In this case, help datatypes
already appears as the second option.

Notice also that, if you click on a help file after the search and it is not about the command that you
wanted, you can just press the back button and the search results will be displayed again.
4.2 Properties window
Let’s now explore the Properties window. To begin with, click on the RANDID variable in the Variables
window. After you select this variable, you’ll see that the Properties window now shows details
about the RANDID variable. It shows that RANDID is stored as type ‘long’.
Variables of type long hold integers (whole numbers), and the ‘Format’ component shows that the
value is displayed up to 10 characters wide (the %10g instructs Stata how to display the variable;
you can find more information if you type help format). If a variable is stored as a ‘string’ variable, it is
stored as text. This means that you will not be able to use it in numerical analyses (it is not held in
numerical format). A ‘double’ variable is the most precise storage form of all, holding the value of the
number down to 1.414 x 10^-16 level of precision. We don’t need that level of precision in epidemiological
analyses, so storing variables in double format just uses up storage space.
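
To see storage types without clicking through each variable, the describe command lists them in the Results window. This is a minimal sketch, assuming the Framingham dataset is open; the compress line is optional and is not needed for this practical.

```stata
* List the storage type, display format and variable label for a few variables
describe RANDID SEX TOTCHOL

* Optional: ask Stata to store each variable in the smallest type
* that holds its values without losing information
compress
```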

4.3 Variables window


In the Variables window click on SEX and then look at the Properties window. This tells you that the
variable named SEX has the label SEX, is stored as a byte (the smallest storage type), and shows its
display format. Click on some other variable names and see how their properties differ.
4.4. Data Editor
On the Main menu click on Data > Data Editor > Browse
The “Data Editor” window will open.

The Data Editor window shows how the data is structured. The top of the columns is the variable name
(see RANDID, SEX etc) and the contents of the column have the data. The Variables and Properties
windows are shown here. These can be closed by clicking on the pin icon in the top right corner of the
window. However, it’s useful having the Variables window open because you can filter or search for a
specific variable.

Each row holds data for a unique individual at a unique time. As explained in the Data Dictionary, the
RANDID variable uniquely identifies each individual. But see in the Data Editor how a RANDID can appear
across 1, 2 or 3 rows? In this dataset, there are 3 waves of data collected from 3 different periods. For
individuals with 3 rows of data, they had data collected at all 3 waves. However, where there are only 1
or 2 rows of data, individuals have not had data collected for an entire wave (or 2). Possible reasons for
missing data include declining to participate in a particular wave, being unable to be contacted (known as
lost to follow up), or death of the participant. Missing data can also be shown in other ways. For
example, the column TOTCHOL has many values, but notice that there are some that show a ‘.’ (full stop
symbol). In this case, although the individual participated in the wave, the data specifically for TOTCHOL
(Serum Total Cholesterol) wasn’t collected. The full stop symbol ‘.’ indicates that the data point for that
person, wave, and variable is missing.
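
Rather than scrolling through the Data Editor to find the full stops, you can ask Stata to count them. This is a sketch, assuming the dataset is open and the variable is named TOTCHOL as in the Data Dictionary.

```stata
* Count the rows where total cholesterol was not recorded
count if missing(TOTCHOL)

* Report how many values of TOTCHOL are missing and how many are not
misstable summarize TOTCHOL
```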

4.5. Using the command Window


Look at the SEX variable in the Data Editor. Notice the contents of the column – it only contains numbers
1 and 2. We will look at the SEX variable using the tab command, which ‘tabulates’ data (see help tab for
more information). Return to the main Stata display, place your cursor in the Command window, type
tab sex and press enter.
Uh oh! Stata is case sensitive. If you type tab sex you get an error message saying that the variable sex is
not found. Try typing tab SEX. You should see the following table in the Results window:

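        SEX |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      5,022       43.19       43.19
          2 |      6,605       56.81      100.00
------------+-----------------------------------
      Total |     11,627      100.00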

The variable SEX is stored as the numbers 1 and 2. How do we know which numbers are male and which
are female? We need to consult the Data Dictionary. However, to avoid having to consult the Data
Dictionary every time, we can label the values in Stata. Labelling makes analyses
easier to interpret. To label the values of SEX, type the following commands, pressing enter at the end of
each line. (For details see help label).

label define sexlabel 1 "Male" 2 "Female"
label values SEX sexlabel
tab SEX

As you can see in the table below, Stata now shows the labels in place of the
numbers. This makes the interpretation of the analysis easier.

        SEX |      Freq.     Percent        Cum.
------------+-----------------------------------
       Male |      5,022       43.19       43.19
     Female |      6,605       56.81      100.00
------------+-----------------------------------
      Total |     11,627      100.00

However, if after the labeling, you still want to see the numbers instead of the labels, you can accomplish
this at any time by typing:

tab SEX, nolabel

Open the Data Dictionary and look at the variables list on page 2. According to the Data Dictionary, SEX
has 5022 men and 6605 women. This is what we see in our table, so we are confident that our data looks
right. You can do this with other variables too; take time to look around the dataset and familiarise
yourself with what it includes. This may save you time later in the pracs. We will come back to the tab
command later in this practical.
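
Two further commands can help with this kind of checking. The sketch below assumes you have already defined the value label sexlabel as shown earlier; CURSMOKE is just one example of another variable to compare with the Data Dictionary.

```stata
* Show which number corresponds to which label
label list sexlabel

* Summarise the values and missing values of another variable
codebook CURSMOKE
```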

4.6 Do files
We are going to type and run commands in a ‘do’ file. We can save our do files to keep a record of what
we have done. This is very helpful, as analyses get complex very quickly, and you can start to build up a
library of syntax to draw on later. Also, I will prepare answers to pracs in do files, so
in order to get solutions to the pracs, you will need to know how to run do files.

Three different ways to open a do file (all have the same outcome):
1. Go to the Main Menu and click: Window > Do-File Editor > New Do-File
2. Press CTRL + 9
3. Click the pencil-on-paper icon on the toolbar
Place your cursor in the do file, type tab SEX, highlight the text and then press CTRL+D. This action
will execute the command tab SEX that you wrote in your do file. Go to the Results window and you’ll
see that it shows the same output generated previously by typing in the Command window. When you
execute a command in a do file (by highlighting the text and pressing CTRL+D), it is the same as if you
typed it and pressed ENTER in the Command window.

Now type the date in the do file as a comment: * 1 March 2022

When you include an asterisk (*) at the beginning of a line, the text appears in green font and Stata
treats it as a comment. The green font indicates the text is non-executable (it is not asking Stata
to execute a command). The inclusion of comments is useful because it allows you to keep notes about
your analysis. For example, you can create a comment for the labeling of the variables you did before.

It’s a good idea to start a do file with the date, author, and purpose of the file. For example, I would
usually write something like this at the beginning of my do file:

* 20 February 2023
* Written by Angela Gialamas
* Purpose: Solution for Intro to Epi, Week 3 Prac

Now save the do file by clicking on File > Save then saving with a sensible filename and location. Later you
can open your do file and return to your work.
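
Putting these pieces together, a short do file for this practical might start like the sketch below. It assumes the dataset was saved as Framingham.dta in Stata's current working directory; adjust the file name or path to wherever you saved it.

```stata
* 1 March 2022
* Written by: (your name)
* Purpose: Intro to Epi, Week 2 Prac

* Open the dataset (assumes Framingham.dta is in the current working directory)
use "Framingham.dta", clear

* Label SEX so that tables are easier to read
label define sexlabel 1 "Male" 2 "Female"
label values SEX sexlabel
tab SEX
```

Highlight all of the lines and press CTRL+D to run them in one go.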

5. Introducing the command ‘summarize’


We will now begin using the ‘summarize’ and ‘tab’ commands, and then introduce some conditional
commands like ‘if’ and ‘bysort’ to look at subgroups.

Type: help summarize


The help file for the summarize command will open. The summarize command gives summary statistics.
It is very handy. The simple version of summarize gives the number of observations, the mean, standard
deviation (SD), minimum and maximum values. The help file describes how to display other
statistics using the detail option. Keep this help file open (for the moment) to refer back to…
summarize AGE
You should see the output below in the Results window
. summarize AGE

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         AGE |     11,627    54.79281    9.564299         32         81

sum AGE
This is equivalent to summarize AGE. In the help file, see that the first 2 letters are underlined
(summarize). The underlining shows the shortest abbreviation that Stata accepts, so summarize can be
shortened to su, or to anything longer such as sum.

sum AGE, detail

Summarize can be expanded using the ‘detail’ option to display the median (the 50th centile), other
percentiles, and some further measures of spread and shape (variance, skewness, and kurtosis)
                             AGE
-------------------------------------------------------------
      Percentiles      Smallest
 1%           36             32
 5%           40             33
10%           42             33       Obs              11,627
25%           48             33       Sum of Wgt.      11,627

50%           54                      Mean           54.79281
                        Largest       Std. Dev.      9.564299
75%           62             80
90%           68             81       Variance       91.47582
95%           71             81       Skewness       .1422148
99%           76             81       Kurtosis       2.340903

In addition to the summarize command, to visually examine values of a variable, the histogram command
is useful (See help hist for details).

hist AGE
You can see that Stata will open a new window with the histogram. The histogram shows
how the ages are distributed. For example, you can see in the graph that most of the
Framingham study participants were aged between 50 and 60 years. This agrees with what
we found earlier when we typed the summarize command, which showed a mean of 54.79 years.
Notice also that the histogram starts at the minimum of 32 years and goes to a maximum of 81
years, the two values that the table generated with the summarize command also showed us.
Tables of numbers can be tricky to interpret, but graphs assist us in making sense of the data!
In the new window in which the histogram is displayed, you can click on File > Save. Now go ahead and
click on Save as type:. You can see that Stata allows you to save the image in several file formats,
including the common image format PNG (*.png). Saving the graph as an image file is incredibly useful
since, later on, you can import the image directly into an assignment or paper.
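
The same can be done from a do file, which keeps a record of how the image was made. This is a sketch only: the file name age_histogram.png is a made-up example, and the image is written to Stata's current working directory.

```stata
* Draw the histogram, then save the current graph as a PNG image
hist AGE
graph export "age_histogram.png", replace
```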

Go back to the Command window and type:

hist AGE, norm


What does adding the ‘norm’ option do?

You can see that the new graph plots the normal distribution in the background. But why should we
bother? The normal distribution is a very famous distribution, as you might have heard before, and if the
variable is normally distributed there are specific statistical techniques that can be readily applied (more
on this later in this course). Therefore, researchers and statisticians often try to assess whether a variable
is normally distributed, and plotting the normal distribution in the background is one way of
doing that.

Check other variables, such as hist SYSBP, norm. Compared to AGE, which one looks more “normal” to
you?

Now let’s go back to the summarize command. The summarize command is used for continuous variables
but see what happens if you use summarize for a dichotomous variable. (A dichotomous variable has
only 2 values, e.g. male & female).

sum SEX
It does not make sense to have a mean value for SEX, which has been given arbitrary values (1 = Male; 2 =
Female). Unfortunately, Stata is an incredibly powerful statistical package but it cannot judge whether a
command makes sense or not – it will execute your commands regardless of whether they are sensible.
So be thoughtful about what you ask Stata to do.

Summarize can also be used to view summary statistics for many variables at the same time.
summarize AGE SYSBP DIABP

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         AGE |     11,627    54.79281    9.564299         32         81
       SYSBP |     11,627    136.3241    22.79862       83.5        295
       DIABP |     11,627    83.03776    11.66014         30        150

Close the summarize help file if you wish.

7. Introducing ‘tab’ and combining with ‘if’ and ‘bysort’


There are many tricks in Stata that help with selecting relevant data for analysis. It is very common when
conducting research to investigate frequencies and proportions considering two (or more) variables
together. For example, the researcher might ask: how many men are current smokers? Among all men,
what is the percentage of smokers? Here we describe only the ‘if’ and ‘bysort’ commands, but there are
others such as ‘in’ and ‘tag’ that can also help in answering these questions. Remember that this is an
introductory course in epidemiology. A more comprehensive biostatistics course may increase your
repertoire of commands and tricks.

7.1 Command ‘tab’


tab SEX
tab SEX PREVCHD
tab SEX PREVCHD, row col

Digest each piece of output and how changing the commands alters the output.

tab SEX: With only one variable (SEX) we see a tidy table with the frequency (i.e. number of males and
females), their percentages of the total and the cumulative percentage. The cumulative percentage is
more interesting when there are more than 2 categories.

tab SEX PREVCHD: With two variables (SEX PREVCHD) we generate a typical 2x2 table (also known as
contingency table). Time to check PREVCHD in the Data Dictionary… It is the prevalence of coronary
heart disease (CHD), and it has values of 0 (free of disease) or 1 (has disease). Back to the table, we can
see that the number of males without CHD is 4523 and the number with CHD is 499. You can look at females.
Notice that the total number of males and females is the same across all tables; it is always good to
check whether we are losing any of our sample when we type a command.

tab SEX PREVCHD, row col: The final command adds in row and column percentages.
Here’s how to read the columns: Of the people with no CHD, 41.94% were males and 58.06% were
females.

Now you do the PREVCHD=1 column: Is there a higher proportion of men or women who have CHD?
To read the rows: Of the males, 9.94% had CHD. Of the females, 5.19% had CHD.
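
To see where a row percentage comes from, you can redo the arithmetic with the display command, which works like a calculator. The sketch below uses the counts for males quoted above (4523 without CHD and 499 with CHD); the result should round to the 9.94% shown in the table.

```stata
* Row percentage for males with CHD: 499 of the 4523 + 499 = 5022 males
display 499/5022*100
```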

7.2 Command ‘if’


Say we wanted to know the proportion of men and women who currently smoke. (Consult the Data
Dictionary if you need to.) We know the variable for men and women (SEX). The variable CURSMOKE has
values (0 = not current smoker; 1 = Current Smoker).

tab CURSMOKE
Check your numbers are right by comparing with the Data Dictionary.

We can select only the smokers by applying CURSMOKE==1 to our tab command. This will allow us to see
the proportion of men and women who currently smoke. Note that you need to use two equals symbols
‘==’.

tab SEX if CURSMOKE==1


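        SEX |      Freq.     Percent        Cum.
------------+-----------------------------------
       Male |      2,594       51.58       51.58
     Female |      2,435       48.42      100.00
------------+-----------------------------------
      Total |      5,029      100.00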

7.3 Command ‘bysort’


If you wanted to see the proportions of men and women among current smokers and non-smokers you
could use the following
bysort CURSMOKE: tab SEX

With the bysort command, you are telling Stata that you want to tabulate SEX separately for each
value of the variable CURSMOKE (0 = not current smoker; 1 = Current Smoker). Notice that the
frequencies displayed using this command are the same as those you would get from the command tab SEX
CURSMOKE, but now it is also possible to see the percentages and the cumulative percentages.

Both if and bysort can be used with many other commands, including summarize.
summarize AGE if SEX==1

In this case, the result will show you the mean age of men (54.5 years) instead of the mean age of all
participants (54.8 years).
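
As a sketch of how the two can be combined with summarize: the first line below gives the mean age for each sex in a single command, and the second restricts this to one wave. It assumes PERIOD is the variable that identifies the wave, as described in the Data Dictionary.

```stata
* Mean age for each sex in one command
bysort SEX: summarize AGE

* if and bysort together: mean age for each sex at the first wave only
bysort SEX: summarize AGE if PERIOD==1
```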

Reflection: You should be starting to see how we can build commands in Stata. Having a repertoire of
commands at the front of your mind becomes easier the more you use Stata.

8 Generate a new variable, then name it, label it, then look at it using tab
We need to learn how to generate new variables from other variables in our dataset. Here we are going
to generate a new variable in which the body mass index (BMI) will be categorised. The first step is to
look at the variable which is going to be recoded. Looking at the Data Dictionary, you’ll see that only the
minimum and maximum values are given.

summarize BMI
Shows the number of observations, mean, SD, minimum and maximum values. Check that the min and max are
the same as in the Data Dictionary.

hist BMI
What do you notice about the distribution? Would you say it is typical of an adult population of this era?

recode BMI (min/19.999 = 1) (20.000/24.999=2) (25.000/29.999=3) (30.000/max=4), gen(BMICAT)


label var BMICAT "Categories of BMI"
lab def BMICAT 1 "underweight" 2 "healthy" 3 "overweight" 4 "obese"
lab val BMICAT BMICAT

tab BMICAT
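        BMI |
 categories |      Freq.     Percent        Cum.
------------+-----------------------------------
Underweight |        534        4.61        4.61
    Healthy |      4,665       40.30       44.92
 Overweight |      4,808       41.54       86.45
      Obese |      1,568       13.55      100.00
------------+-----------------------------------
      Total |     11,575      100.00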

You should see the table above in the Results window. If you don’t, look at the error messages in the
Results window and try to figure out what went wrong. Recheck your commands. This might be a good
time to mention the ‘drop’ command. You can undo what you’ve typed by dropping the BMICAT variable
and having another go at generating the BMICAT variable. Important note: it is not possible to recover a
variable once it has been dropped. Your only hope is to re-open the original dataset that you
downloaded and re-run the syntax.

help drop
drop BMICAT
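
Having dropped BMICAT, one way to have another go is to build the categories step by step with generate and replace instead of recode. This is only a sketch: BMICAT2 is a hypothetical variable name, and the code reuses the value label BMICAT defined earlier. Note the !missing(BMI) condition on the last category. Stata treats a missing value as larger than any number, so without it the rows with no BMI would be coded as obese.

```stata
* Start with every row missing, then fill in each category
gen BMICAT2 = .
replace BMICAT2 = 1 if BMI < 20
replace BMICAT2 = 2 if BMI >= 20 & BMI < 25
replace BMICAT2 = 3 if BMI >= 25 & BMI < 30
replace BMICAT2 = 4 if BMI >= 30 & !missing(BMI)

* Attach the value label defined earlier and check the result
label values BMICAT2 BMICAT
tab BMICAT2, missing
```

For BMI values recorded to two decimal places, this should give the same categories as the recode command, and the missing option shows how many rows have no BMI category.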

9. Reshape: Manipulating the data structure from long to wide


Later on, we are going to calculate incidence but before doing so we will need to get our data into a
structure that will allow us to do this.

sort RANDID PERIOD


browse RANDID PERIOD PREVCHD

Look down the list of id numbers (RANDID). Individuals who appear in multiple waves have multiple
rows. For example, RANDID 2448 has two rows, one for PERIOD=1 and another for PERIOD=3. This
means that data was collected for the person with id number 2448 in waves 1 and 3. For some people,
you can see that they have prevalent CHD at baseline, or acquire CHD at a later wave (e.g. RANDID 12629
has CHD in period 2 but did not have CHD in period 1). Notice also that the participant RANDID 12629
doesn’t have information on PERIOD 3. This data structure is referred to as ‘long’ form.
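
To see how many people have 1, 2 or 3 rows without scrolling, you can count the rows per person. This is a sketch in which nwaves and onerow are hypothetical helper variables; they are dropped at the end so that they do not interfere with later steps.

```stata
* Number of rows (waves) for each person, keeping rows ordered by PERIOD
bysort RANDID (PERIOD): gen nwaves = _N

* Flag one row per person so that each person is counted once
egen onerow = tag(RANDID)
tab nwaves if onerow == 1

* Remove the helper variables
drop nwaves onerow
```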

In this section, we are going to manipulate the data into the wide form. In contrast to the long form, in
the wide form each individual has only one row. For example, although RANDID 2448 participated in two
waves and previously had two rows in the long form, in the wide form this individual will have a single
row. The difference is that in the wide form the variables are split by wave. So,
instead of the variable PREVCHD, we will have the variables PREVCHD1, PREVCHD2, and PREVCHD3,
containing the prevalent CHD status of the individual at each wave of the study.

This may seem confusing now, but it will become easier when you visualize the data. Before conducting
the reshaping, our first step is to include every PERIOD for each participant. Watch what happens to RANDID
2448 (which is missing PERIOD 2).

sort RANDID PERIOD


tsset RANDID PERIOD
tsfill, full

We are using the tsset command to tell Stata that we are using time-series data (e.g. PERIOD 1, 2 and 3
reflect different periods in time). The tsfill command fills in the gaps in the time variable PERIOD. If you
now look at RANDID 12629 you will see that PERIOD 3 has been introduced, and all the data within
it is set to missing (“.”).
Now let’s proceed with the reshaping. Read the help file to get an understanding of what you are trying
to do.

help reshape
reshape wide PREVCHD, i(RANDID) j(PERIOD)

When we type the reshape command we get an error message that some variables are not constant
within RANDID (“variable BMI not constant within RANDID”). You’ll notice that many of the other
variables in the dataset differ over time. For example, the variable BMI for participant RANDID 6238 has
the values of 28.73 for PERIOD=1, 29.34 for PERIOD=2 and 28.5 for PERIOD=3. This variable is not
constant and the reshape command won’t work when values are not constant over PERIOD. Therefore, if
a variable is not constant across all periods, we need to specify that we want to include that variable in
the reshaping command or we need to drop it.

So before reshaping, we are going to drop all of the variables that we don’t need to calculate incidence.
(An alternative to this is to use the ‘preserve’ command, which you can investigate if you wish). After
calculating incidence, we will not save the dataset so we will be able to ‘restore’ it later, by opening it
again without any loss of data.
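
For those who want to try the ‘preserve’ alternative, a sketch is shown below. It is not the approach used in the rest of this section, and it keeps only PREVCHD for brevity. Run the lines together as one selection in a do file: preserve stores a copy of the data, and restore brings it back unchanged after the calculation.

```stata
* Store a copy of the full dataset
preserve

* Work on a cut-down, reshaped version
keep RANDID PERIOD PREVCHD
reshape wide PREVCHD, i(RANDID) j(PERIOD)
tab PREVCHD1 PREVCHD2, row

* Bring back the full dataset
restore
```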

keep RANDID CURSMOKE PREVCHD PREVAP PREVMI PREVSTRK PREVHYP PERIOD

(Here we have dropped all other variables from our dataset and are working only on those above. This is
an irrevocable loss of information. To get the lost data back you will have to reopen the dataset.)

Now we can retry the reshape command.

reshape wide CURSMOKE PREVCHD PREVAP PREVMI PREVSTRK PREVHYP, i(RANDID) j(PERIOD)

Now check out what you have done in the Data Editor.
browse

We have kept the data for the variables in the list above (dropped everything else), plus you have the
values for each of these variables at each wave of the study.

If you want to return to long form you can type


reshape long

If you want to get all of the data back, you will need to reopen the dataset. This is because we used the
command keep. Go ahead and type clear. Stata will close the dataset without saving it, so you can
reopen the file you saved from MyUni and the original dataset without any changes will be there.

10 Prevalence and Incidence


Prevalence
If we look at who had disease at baseline (PERIOD 1), or at either of the other two periods, we are
estimating prevalence: the proportion of people who have the disease in a single wave.
tab1 PREVCHD*
The asterisk is a shortcut for all variables with the stem name PREVCHD. In other words, it includes
PREVCHD1, PREVCHD2, and PREVCHD3.

Incidence
An easy way of looking at incidence (new cases of CHD) at PERIOD 2 is to do a simple cross-tabulation.
tab PREVCHD1 PREVCHD2, row

From this, we can see that the number of new cases at PERIOD 2 who did not have CHD at PERIOD 1 is
152, which is 4% of the cohort. We can also check that other aspects of the data look right. For example,
all the people who had CHD at PERIOD 1 must be coded as having CHD at PERIOD 2, as a diagnosis of CHD
cannot be revoked.

If we wanted to look at prevalent cases of CHD at PERIOD 2, these will include cases at baseline (when
PERIOD==1, so notice the number of 288 cases at the bottom of the table) as well as new cases in PERIOD
2. So in fact, you can see both prevalence and incidence in this simple little table.
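
Another way to look at the new cases is to restrict the table to people who were free of CHD at PERIOD 1, since only they are at risk of becoming a new case. This is a sketch, assuming the data are in wide form with PREVCHD1 and PREVCHD2 present; people with missing PREVCHD2 are left out of the table.

```stata
* Among those without CHD at PERIOD 1, how many have CHD at PERIOD 2?
tab PREVCHD2 if PREVCHD1 == 0
```

The percentage shown for PREVCHD2 = 1 is the proportion of at-risk people with data at PERIOD 2 who developed CHD.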

11 Age-standardisation (direct)

Before doing this section, read the supplementary slides on age standardisation. The idea is that we weight
the populations to make them similar, which allows us to compare death rates (or disease rates) across
populations with different age structures. It is a process of ‘weighting’ the data. In this example, we will use
direct standardisation and compare the age-standardised death rates for men and women at baseline.

In the previous section, we did not keep the AGE variable so you first need to re-open the original dataset.
Next, you will keep only the baseline data (for simplicity) and look at the range of ages present at baseline.

keep if PERIOD==1
sum AGE if PERIOD==1

Then proceed and recode the age variable into categories.

recode AGE (30/34 = 1 "30 – 34") (35/39 = 2 "35 – 39") (40/44 = 3 "40 – 44") (45/49 = 4 "45 – 49") (50/54 = 5 "50 – 54") (55/59 = 6 "55 – 59") (60/64 = 7 "60 – 64") (65/69 = 8 "65 – 69") (70/74 = 9 "70 – 74"), gen(agecat)

Notice that you need to type the code as a single line in the Command window for the command to be
executed correctly. The labels must be enclosed in plain (straight) double quotes.

Be careful with the use of spaces: (40/44 = 3 "40 – 44") is different from (40/44=3"40 – 44"). In addition,
when you copy the command to your do file, you can include ‘///’ at the end of a line to allow the syntax to
spread across multiple lines and still be executable.
If you then highlight all of the lines and press “Execute selection (do)”, Stata will run the command
correctly again.
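
For example, the recode command could be laid out in a do file as in the sketch below. It assumes agecat has not already been created (if it has, drop agecat first). The ‘///’ continuation works in do files only, not in the Command window, and must be preceded by a space.

```stata
* One command spread over three lines with /// (do file only)
recode AGE (30/34 = 1 "30 – 34") (35/39 = 2 "35 – 39") (40/44 = 3 "40 – 44") ///
    (45/49 = 4 "45 – 49") (50/54 = 5 "50 – 54") (55/59 = 6 "55 – 59") ///
    (60/64 = 7 "60 – 64") (65/69 = 8 "65 – 69") (70/74 = 9 "70 – 74"), gen(agecat)
```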

Let’s move on and investigate the new variable:

tab agecat
tab agecat SEX

The next step is to examine the dstdize help file in Stata. Examples of how to standardise for populations are
available online. If you want to explore further have a look at the example provided by Stata here.
help dstdize

Usually, the dstdize command in Stata is used with aggregate data (e.g. data that has already been grouped
according to age, sex, and any other characteristic) and is compared against another population. Here we
have individual level data. So in order to use the dstdize command, we need to create a variable called pop
that tells Stata to use the current population as the population upon which to standardise.
gen pop=1

The syntax below generates age-standardised death rates (DEATH) at baseline
(PERIOD==1) and compares the rates by SEX.

dstdize DEATH pop agecat if PERIOD==1, by(SEX)


The output is divided by SEX. Let’s digest the output in each column separately.
Column 1, Stratum, is the age category variable.
Column 2, Population, is the total number of individuals in that age category, at baseline and by sex. For
example, there are 249 men aged 35-39 years at baseline.
Column 3, Cases, is the number of deaths for each sex and age category.
Column 4, Pop. Dist. refers to ‘population distribution’, which is the proportion of the population in each age
category. E.g. there are 249 Males aged 35-39, and therefore this sex and age group comprise 0.128 (12.8%)
of the total male population (249/1944=0.128). The total of this column will equal 1.
Column 5, Stratum Rates, is the age and sex specific death rate. E.g. for Males aged 35-39, the death rate is
30/249 = 0.1205.
Column 6, Std. Pop. Dist. [P] is the standardised distribution for the whole population. This is the number
of men AND women in each age category as a proportion of the total population. E.g. In age category 35-39
there are 249 men and 286 women, which comprise 12.1% of the total population of men and women
((249+286) / (1944+2490) = 0.121 or 12.1%). Notice that this is the same value in the tables for both men
and women.
Column 7 multiplies the stratum specific rate by the standardised distribution for the whole population.
The sum of this column is the age-standardised (adjusted) rate.
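
The worked figures for males aged 35-39 can be checked with the display command, which works like a calculator. This is a sketch using only the counts quoted above (249 men, 286 women and 30 male deaths in this age category; 1944 men and 2490 women in total).

```stata
* Column 4: share of all men who are aged 35-39 (about 0.128)
display 249/1944

* Column 5: death rate among men aged 35-39 (about 0.1205)
display 30/249

* Column 6: share of all men and women who are aged 35-39 (about 0.121)
display (249+286)/(1944+2490)

* Column 7: stratum rate multiplied by the standard population share
display (30/249) * ((249+286)/(1944+2490))
```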

The table at the very bottom of the output summarizes the data in each table. You can see totals for the
sample size, crude and adjusted rates and their confidence intervals.

Summary of Study Populations:

     SEX        N      Crude     Adj_Rate       Confidence Interval
 --------------------------------------------------------------------
       1     1944   0.433642     0.437039    [  0.417445,    0.456634]
       2     2490   0.283936     0.282133    [  0.266058,    0.298208]

You can see that the age-standardised death rate for men (0.44) is much higher than that for women (0.28).

For simplicity, we have used the data within our dataset to standardise our population. This means that we
have used the age distribution within our population as the ‘standard’ population. We can do direct
standardisation with other population distributions, see dstdize help file for instructions if you ever want to
do this.

12. Practice questions: Use what you have learned to answer the following questions

• Use the Data Dictionary (DD) to find the variables you need
• You can complete this exercise using the commands given in this prac (sum, tab, if and bysort) but if
you happen to know other commands, by all means, go ahead and use them

Calculation: Prevalence of hypertension at study baseline
Process & tips: Use the DD to find the variable describing hypertension (no/yes) and the variable that
describes the different waves of the study. Next, you’ll combine the tab and if commands to obtain the
prevalence of hypertension at the study baseline.

Calculation: Prevalence of hypertension at baseline for males and females separately
Process & tips: This time you’ll need to use the tab, if and bysort commands. (Hint: You could also do
this using if SEX==1 and then repeating with if SEX==2, but by using bysort you will get the same
results while running the command only once. Try to learn how to use the bysort command to complete
this exercise.)

Calculation: Incidence of coronary heart disease (CHD) from period 1 to period 2
Process & tips: Open the original dataset without the modifications made in this practical. For this
exercise, you will need to keep the relevant variables, then reshape the dataset, and then calculate
incidence.

Calculation: Cumulative incidence of diabetes up to period 3
Process & tips: Use the command clear and open the original dataset again. You will need to keep the
relevant variables, reshape the dataset and calculate the incidence from period 1 through to 3.

Good work!
