
Chapter 1-8.

Operators, Ifs, Dates, and Times

If Expressions

Recall that Stata commands take the general form of:

by byvarlist: command varlist if_expression in_range , options

Example:

list bweight lowbw if sex==2 in 1/5 , noobs

In this section we are going to become expert at if expressions, as well as using operators in
generate commands.

Operators

In Stata, there are four classes of operators: arithmetic, string, relational, and logical.

Arithmetic Operators

The arithmetic operators in Stata are:

+ (addition)
- (subtraction)
* (multiplication)
/ (division)
^ (raised to a power, or exponentiation)
- (negation)

If the arithmetic operation includes a missing value or impossible operation (such as division by
zero), the operation produces a missing value.
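As a quick check of this rule, here is a minimal sketch to run from a do-file (the // comments work only there). Each line should display a missing value, which Stata shows as a period.

```stata
* Each expression involves an impossible operation or a missing value
display 5/0      // division by zero
display 2 + .    // arithmetic with a missing value
```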

_____________________

Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University of Utah
School of Medicine, 2010.

Chapter 1-8 (revision 16 May 2010) p. 1


Operators are applied from left to right, except that operators of higher precedence are
applied first. The order of precedence is:

1st: ^ (exponentiation)
2nd: - (negation)
3rd: * , / (multiplication or division)
4th: - , + (subtraction or addition)

This left to right order can be modified with parentheses ( ), where the operation in parentheses
is done first.
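The precedence rules above can be checked with a few display commands. This is only an illustrative sketch, to be run from a do-file; the numbers are arbitrary.

```stata
display 2 + 3*4      // multiplication first: 2 + 12 = 14
display (2 + 3)*4    // parentheses first: 5*4 = 20
display -2^2         // exponentiation before negation: -(2^2) = -4
display (-2)^2       // parentheses first: 4
```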

Example The expression -(x+y^(x-y))/(x*y) denotes the formula

$$-\frac{x + y^{x-y}}{x \cdot y}$$

which evaluates to missing if x or y is missing or zero.

Example The formula for the odds ratio is

$$OR = \frac{a \times d}{b \times c}$$

The correct display command to compute this for (a=1,d=2,b=3,c=4) is

display (1*2)/(3*4)

or to add “OR = ” as text is

display "OR = " (1*2)/(3*4)

whereas, the following gives the wrong answer

display 1*2/3*4
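To see why the last command goes wrong, it may help to write out the grouping Stata applies. The sketch below (run from a do-file) shows that, without parentheses, multiplication and division are taken left to right, so the 4 ends up in the numerator.

```stata
* Without parentheses Stata works left to right: ((1*2)/3)*4, roughly 2.67
display 1*2/3*4
display ((1*2)/3)*4
* Parentheses force the denominator product first: 2/12, roughly 0.167
display (1*2)/(3*4)
```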

Exercise Kidney: Effective Renal Plasma Flow (Sykes et al, 1991):

“In the constant infusion methods, urine and plasma samples are taken at a time when the
rate of excretion equals the rate of infusion so that clearance (C)

$$C = \frac{U \times V}{P \times T}$$

where V is the volume of urine produced during time T, and U and P are the measured
concentrations of activity in urine and plasma respectively.”

Using the following data for the variables, U,V,P, and T,

clear
input u v p t
2 4 4 2
2 2 1 4
end
list

write a generate (gen) statement to compute the variable C, and then list the data with the
new variable computed. The solution should be:

+-------------------+
| u v p t c |
|-------------------|
1. | 2 4 4 2 1 |
2. | 2 2 1 4 1 |
+-------------------+

Exercise Calculating Initial Saline Infusion Rates (Ellison and Berl, 2007):

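In their Table 4, the authors give the following formula:

$$\text{Volume} = \frac{\text{TBW}\left(1 - \dfrac{\text{Na1} + 23.8}{\text{Na2} + 23.8}\right) + \text{Vinput} - \dfrac{\text{Einput} \times \text{Vinput}}{\text{Eurine}}}{\dfrac{\text{Einf}}{\text{Eurine}} - 1}$$
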
Using the following data for the variables, TBW, Na1, Na2, Vinput, Einput,
Eurine, Einf (note: these are just made-up unrealistic values),

clear
input tbw na1 na2 vinput einput eurine einf
25 33 49 50 100 5 44
end
list

write a generate (gen) statement to compute the variable volume, and then list the data
with the new variable computed.

Hint: If you get the following error message,

too many ')' or ']'

it is because you must have the same number of left and right parentheses.

The solution should be:


+---------------------------------------------------------------+
| tbw na1 na2 vinput einput eurine einf volume |
|---------------------------------------------------------------|
1. | 25 33 49 50 100 5 44 -121.0904 |
+---------------------------------------------------------------+

The generate command that gives this result is given at the end of this chapter, if
you cannot get it to work.

String Operators

The one string operator is:

+ (concatenation)

If the + occurs between two strings, Stata concatenates them. If + appears between two numeric
values, Stata adds them.

Example When strings are read with the input command, put “str#” in front of each string
variable name, where # is the maximum number of characters.
clear
input str4 a str4 b
this that
here now
end
gen both=a+b
list

+------------------------+
| a b both |
|------------------------|
1. | this that thisthat |
2. | here now herenow |
+------------------------+

To add a space between the words, use

gen both2=a+" "+b


list

+------------------------------------+
| a b both both2 |
|------------------------------------|
1. | this that thisthat this that |
2. | here now herenow here now |
+------------------------------------+
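Because + means concatenation for strings and addition for numbers, the two types cannot be mixed directly. The sketch below uses a hypothetical numeric variable n and a hypothetical new variable both3 (neither is part of the example above) to show one way around this, converting the number with the string( ) function first.

```stata
clear
input str4 a n
this 1
here 2
end
* a + n would stop with a type mismatch error: a is a string, n is numeric
gen both3 = a + " " + string(n)
list
```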

Relational Operators

The relational operators in Stata are:

> (greater than)
< (less than)
>= (greater than or equal)
<= (less than or equal)
== (equal)
!= or ~= (not equal)

Note: Stata does not understand “=>” or “=<”, which are not the standard way to write these
operators. You write them in the same order you learned to say them in elementary school.
That is, you say “greater than or equal to”, not “equal to or greater than”, and “less than
or equal to”, not “equal to or less than”.

It is natural to think of relational operators as evaluating to true or false. They actually evaluate
to numbers (1= true) (0=false).

Example

display 5>4
display 5<4

. display 5>4
1

. display 5<4
0

Example use of relational operators


clear
input age
20
18
25
26
end
gen underage = 1 if age<21
replace underage = 0 if age>=21
list

     +----------------+
     | age   underage |
     |----------------|
  1. |  20          1 |
  2. |  18          1 |
  3. |  25          0 |
  4. |  26          0 |
     +----------------+
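Since a relational expression already evaluates to 1 or 0, the generate and replace pair can be collapsed into a single line. The following is a sketch only; underage2 is a hypothetical variable name, and the if qualifier is there because a missing age would otherwise be compared as a very large number and coded 0.

```stata
clear
input age
20
18
25
26
end
* age<21 evaluates to 1 (true) or 0 (false); skip observations with missing age
gen underage2 = age<21 if !missing(age)
list
```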

Logical Operators

The logical operators in Stata are:

& (and)
| (or)
! or ~ (not)

The logical operators interpret any nonzero value (including missing) as true and zero as false.

Example
clear
input male age
1 20
0 18
1 25
0 26
end
gen underagemale = 1 if age<21 & male==1
list
replace underagemale = 0 if age>=21 & male==1
list
. list

+-----------------------+
| male age undera~e |
|-----------------------|
1. | 1 20 1 |
2. | 0 18 . |
3. | 1 25 . |
4. | 0 26 . |
+-----------------------+

. replace underagemale = 0 if age>=21 & male==1


(1 real change made)

. list

+-----------------------+
| male age undera~e |
|-----------------------|
1. | 1 20 1 |
2. | 0 18 . |
3. | 1 25 0 |
4. | 0 26 . |
+-----------------------+

A more efficient way to create this variable is with the condition function, which is shown
below.

Stata knew what to do in this case, without using parentheses. It used the order of evaluation
rule, or order of operations rule, for all operators, shown below.

Order of evaluations (order of operations), all operators

The order of evaluation (from first to last) of all operators is

! or ~ (not)
^ (exponentiation)
- (negation)
/ , * (division or multiplication)
- , + (subtraction or addition)
!= (or ~=) (not equal)
> , < (greater than or less than)
<= , >= (“less than or equal to” or “greater than or equal to”)
== (equal to)
& (and)
| (or)
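A few display commands illustrate how this ordering plays out when different classes of operators are mixed. This is a small sketch with arbitrary values, to be run from a do-file.

```stata
display 1 | 0 & 0      // & before |: 1 | (0 & 0) = 1
display (1 | 0) & 0    // parentheses first: 1 & 0 = 0
display 3 + 2 > 4      // arithmetic before relational: (3 + 2) > 4 = 1
```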

Boolean Arithmetic

The logical operators “and”, “or” and “not” (&, |, and ~ or !) follow precisely defined rules,
called Boolean arithmetic. This arithmetic is defined by the following truth tables.

P and Q

P   Q   P and Q
T   T   T
T   F   F
F   T   F
F   F   F

Thus both propositions, or expressions, P and Q, must be true for (P and Q) to be true.

P or Q

P   Q   P or Q
T   T   T
T   F   T
F   T   T
F   F   F

Thus at least one proposition, P or Q, must be true for (P or Q) to be true.

not P

P   not P
T   F
F   T

For any value of P, (not P) is its opposite.

Example
clear
input p q
1 1
1 0
0 1
0 0
end
gen PandQ=(p==1)&(q==1)
gen PorQ=(p==1)|(q==1)
gen notP=~(p==1)
list

+-----------------------------+
| p q PandQ PorQ notP |
|-----------------------------|
1. | 1 1 1 1 0 |
2. | 1 0 0 1 0 |
3. | 0 1 0 1 1 |
4. | 0 0 0 0 1 |
+-----------------------------+

Alternatively, we get the same result if we abbreviate the logical expressions to:

clear
input p q
1 1
1 0
0 1
0 0
end
gen PandQ=p&q
gen PorQ=p|q
gen notP=~p
list

+-----------------------------+
| p q PandQ PorQ notP |
|-----------------------------|
1. | 1 1 1 1 0 |
2. | 1 0 0 1 0 |
3. | 0 1 0 1 1 |
4. | 0 0 0 0 1 |
+-----------------------------+

but this abbreviated approach is not a good idea. A missing value is treated as true, since it
is stored as a “very large” nonzero number, whereas you might expect the result to be missing,
as it would be in an arithmetic generate.

clear
input p q
1 1
1 0
0 .
. 0
end
gen PandQ=p&q
gen PorQ=p|q
gen notP=~p
list

+-----------------------------+
| p q PandQ PorQ notP |
|-----------------------------|
1. | 1 1 1 1 0 |
2. | 1 0 0 1 0 |
3. | 0 . 0 1 1 |
4. | . 0 0 1 0 |
+-----------------------------+
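One possible safeguard, sketched below, is to write the comparisons out in full and add an if qualifier so that observations with a missing p or q are left missing. Whether a missing input should produce a missing result is an analysis decision, so treat this as one option rather than the rule.

```stata
clear
input p q
1 1
1 0
0 .
. 0
end
* Explicit comparisons, computed only where both p and q are observed
gen PandQ = (p==1) & (q==1) if !missing(p, q)
gen PorQ  = (p==1) | (q==1) if !missing(p, q)
list
```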

Exercise

By applying the rule for “and” in your head, and using the above output, which is replicated
below, fill in the added column:
+-----------------------------+
| p q PandQ PorQ notP | (P and Q) and (P or Q)
|-----------------------------| ----------------------
1. | 1 1 1 1 0 |
2. | 1 0 0 1 0 |
3. | 0 1 0 1 1 |
4. | 0 0 0 0 1 |
+-----------------------------+

Practice With Missing Data (Arithmetic Operators)

If arithmetic operators are used, generated variables are set to missing if any of the variables in
the arithmetic expression are missing.

Example

Let’s create a total score of two variables.

clear
input v1 v2
10 15
. 20
15 .
. .
5 4
end
gen tot = v1+v2
list

+---------------+
| v1 v2 tot |
|---------------|
1. | 10 15 25 |
2. | . 20 . |
3. | 15 . . |
4. | . . . |
5. | 5 4 9 |
+---------------+

Notice the total variable was set to missing in the expected fashion.
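If a total that ignores missing scores is wanted instead, the egen function rowtotal( ) is one alternative. The sketch below contrasts it with the arithmetic total; tot2 and tot3 are hypothetical variable names, and whether it is appropriate to treat a missing score as zero depends on the study.

```stata
clear
input v1 v2
10 15
. 20
15 .
. .
5 4
end
gen tot = v1 + v2                      // missing if either score is missing
egen tot2 = rowtotal(v1 v2)            // treats missing scores as 0
egen tot3 = rowtotal(v1 v2), missing   // missing only if all scores are missing
list
```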

Practice With Missing Data (Relational Operators)

With relational operators, the missing values affect the result differently.

Example

Let’s create an indicator, or dichotomous, variable for legal-aged males (males ≥ 21).

clear
input male age
1 20
. .
0 18
1 .
0 26
. 21
end
gen legalmale = 0
replace legalmale = 1 if age>=21 & male==1
list , abbrev(15)
+------------------------+
| male age legalmale |
|------------------------|
1. | 1 20 0 |
2. | . . 0 |
3. | 0 18 0 |
4. | 1 . 1 |
5. | 0 26 0 |
|------------------------|
6. | . 21 0 |
+------------------------+

We see that no results are missing, since missing is treated as a very large number when
evaluating relational operators.
A way to get around this is to add another line to set missing back to missing.

clear
input male age
1 20
. .
0 18
1 .
0 26
. 21
end
gen legalmale = 0
replace legalmale = 1 if age>=21 & male==1
replace legalmale = . if age==. | male==.
list , abbrev(15)
+------------------------+
| male age legalmale |
|------------------------|
1. | 1 20 0 |
2. | . . . |
3. | 0 18 0 |
4. | 1 . . |
5. | 0 26 0 |
|------------------------|
6. | . 21 . |
+------------------------+
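The same result can be reached in one step by combining the 0/1 value of the logical expression with an if qualifier. This is a sketch rather than the only approach; legalmale2 is a hypothetical variable name, and missing( ) with several arguments is true when any of them is missing.

```stata
clear
input male age
1 20
. .
0 18
1 .
0 26
. 21
end
display . >= 21    // 1: a missing value compares as larger than any number
* Code 1/0 only where both variables are observed; all other rows stay missing
gen legalmale2 = (age>=21 & male==1) if !missing(age, male)
list , abbrev(15)
```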

Condition function

The condition function, cond( ), is a quick way to create a categorical variable from a
continuous variable. It tests the condition in the first argument and returns the value of the
second argument if the condition is true, or the value of the third argument if it is false.

cond(condition, value if true, value if false) <- syntax

In other words, it replaces the “generate” and “replace” lines with one “generate” line.

Example

Again, creating an indicator, or dichotomous, variable for legal-aged males (males ≥ 21).

clear
input male age
1 20
. .
0 18
1 .
0 26
. 21
end
gen legalmale = cond(age>=21 & male==1,1,0)
list , abbrev(15)

+------------------------+
| male age legalmale |
|------------------------|
1. | 1 20 0 |
2. | . . 0 |
3. | 0 18 0 |
4. | 1 . 1 |
5. | 0 26 0 |
|------------------------|
6. | . 21 0 |
+------------------------+

Even with the cond( ) function, we must still set the result back to missing for observations
with missing data, using one more line:
replace legalmale = . if age==. | male==.
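Calls to cond( ) can also be nested to create more than two categories from a continuous variable. The sketch below is illustrative only: the cut points 21 and 25 and the variable name agegrp are assumptions made for the example, and the if qualifier keeps observations with missing age out of the top category.

```stata
clear
input age
20
.
18
.
26
21
end
* 1 = under 21, 2 = 21 to 24, 3 = 25 or older; missing ages stay missing
gen agegrp = cond(age<21, 1, cond(age<25, 2, 3)) if !missing(age)
list
```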

Dates and Times

There is a long list of functions for working with strings, dates, and times. You can see these by
searching for “functions” in Stata’s help, and then clicking on the “string functions” link or “date
and times” link.

A popular format for storing dates and times in hospital databases is the following:
clear
input str20 admit_date str20 infection_date
"7/22/1999 6:26:00" "7/25/1999 13:00:00"
"7/12/1999 9:35:00" "7/14/1999 10:30:00"
"2/25/2000 10:20:00" "2/28/2000 12:45:00"
end
list , abbrev(15)

+-----------------------------------------+
| admit_date infection_date |
|-----------------------------------------|
1. | 7/22/1999 6:26:00 7/25/1999 13:00:00 |
2. | 7/12/1999 9:35:00 7/14/1999 10:30:00 |
3. | 2/25/2000 10:20:00 2/28/2000 12:45:00 |
+-----------------------------------------+

To confirm that these dates and times are stored as string variables, we use

describe

Contains data
obs: 3
vars: 2
size: 132 (99.9% of memory free)
------------------------------------------------------------
storage display value
variable name type format label variable label
------------------------------------------------------------
admit_date str20 %20s
infection_date str20 %20s
------------------------------------------------------------

To be able to subtract the two dates, for example, we will have to convert them to numeric
variables. This is done by turning them into “elapsed dates”, which is the amount of time since
January 1, 1960.
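To make the idea of an elapsed date concrete, the following sketch (run from a do-file) displays a few dates as day counts, using the mdy( ) function to build each date from a month, day, and year.

```stata
display mdy(1, 1, 1960)     // 0: the base date
display mdy(1, 2, 1960)     // 1: one day later
display mdy(12, 31, 1959)   // -1: dates before the base are negative
display %d 14447            // the day count 14447 shown as a date: 22jul1999
```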

Converting the admit_date string variable into an elapsed date, which will be the number of days
since January 1, 1960, we use

capture drop admit_date2


gen admit_date2 = date(admit_date, "MDYhms")
list , abbrev(15)
format admit_date2 %d // display using a date format
list , abbrev(15)

. list , abbrev(15)

+-------------------------------------------------------+
| admit_date infection_date admit_date2 |
|-------------------------------------------------------|
1. | 7/22/1999 6:26:00 7/25/1999 13:00:00 14447 |
2. | 7/12/1999 9:35:00 7/14/1999 10:30:00 14437 |
3. | 2/25/2000 10:20:00 2/28/2000 12:45:00 14665 |
+-------------------------------------------------------+

. format admit_date2 %d

. list , abbrev(15)

+-------------------------------------------------------+
| admit_date infection_date admit_date2 |
|-------------------------------------------------------|
1. | 7/22/1999 6:26:00 7/25/1999 13:00:00 22jul1999 |
2. | 7/12/1999 9:35:00 7/14/1999 10:30:00 12jul1999 |
3. | 2/25/2000 10:20:00 2/28/2000 12:45:00 25feb2000 |
+-------------------------------------------------------+

Notice we had to tell Stata the order of the components in the date-time string using the
“MDYhms” mask, which stands for Month, Day, Year and hours, minutes, seconds.
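The mask has to match the layout of the string. As a sketch, the lines below translate the same calendar day written three ways, then show what happens when the mask does not fit; the date strings are made up for illustration.

```stata
display date("7/22/1999", "MDY")    // month first: 14447
display date("22/7/1999", "DMY")    // day first: 14447
display date("1999-07-22", "YMD")   // year first: 14447
display date("22/7/1999", "MDY")    // wrong mask: 22 is not a valid month, so missing
```

A column of unexpected missing values after a conversion is often a sign that the mask and the data disagree.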

If we want an admit year variable, we must first create a date variable and then use this as an
argument for the year function,

capture drop admit_date2


capture drop admit_year
gen admit_date2 = date(admit_date, "MDYhms")
gen admit_year = year(admit_date2)
list , abbrev(15)
format admit_date2 %d
list , abbrev(15)

+--------------------------------------------------------------------+
| admit_date infection_date admit_date2 admit_year |
|--------------------------------------------------------------------|
1. | 7/22/1999 6:26:00 7/25/1999 13:00:00 14447 1999 |
2. | 7/12/1999 9:35:00 7/14/1999 10:30:00 14437 1999 |
3. | 2/25/2000 10:20:00 2/28/2000 12:45:00 14665 2000 |
+--------------------------------------------------------------------+

. format admit_date2 %d

. list , abbrev(15)

+--------------------------------------------------------------------+
| admit_date infection_date admit_date2 admit_year |
|--------------------------------------------------------------------|
1. | 7/22/1999 6:26:00 7/25/1999 13:00:00 22jul1999 1999 |
2. | 7/12/1999 9:35:00 7/14/1999 10:30:00 12jul1999 1999 |
3. | 2/25/2000 10:20:00 2/28/2000 12:45:00 25feb2000 2000 |
+--------------------------------------------------------------------+

Notice that the admit_year variable is not an elapsed date, so it needed no format statement.
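The month( ) and day( ) functions work the same way as year( ). The sketch below enters the three elapsed dates from the listing above directly as numbers so that it runs on its own; admit_month and admit_day are hypothetical variable names.

```stata
clear
input admit_date2
14447
14437
14665
end
format admit_date2 %d
gen admit_month = month(admit_date2)   // 7, 7, 2
gen admit_day = day(admit_date2)       // 22, 12, 25
list , abbrev(15)
```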

For elapsed time, we create a variable that is the number of elapsed milliseconds since January 1,
1960 at midnight.

capture drop admit_year // done with this, so remove it from the dataset
*
capture drop admit_date2
gen admit_date2 = clock(admit_date, "MDYhms")
list , abbrev(15)
format admit_date2 %tc
list , abbrev(15)

. list , abbrev(15)

+-------------------------------------------------------+
| admit_date infection_date admit_date2 |
|-------------------------------------------------------|
1. | 7/22/1999 6:26:00 7/25/1999 13:00:00 1.25e+12 |
2. | 7/12/1999 9:35:00 7/14/1999 10:30:00 1.25e+12 |
3. | 2/25/2000 10:20:00 2/28/2000 12:45:00 1.27e+12 |
+-------------------------------------------------------+

. format admit_date2 %tc

. list , abbrev(15)

+--------------------------------------------------------------+
| admit_date infection_date admit_date2 |
|--------------------------------------------------------------|
1. | 7/22/1999 6:26:00 7/25/1999 13:00:00 22jul1999 06:26:46 |
2. | 7/12/1999 9:35:00 7/14/1999 10:30:00 12jul1999 09:34:12 |
3. | 2/25/2000 10:20:00 2/28/2000 12:45:00 25feb2000 10:20:09 |
+--------------------------------------------------------------+

Notice the times no longer match the original strings, which is due to loss of precision with
the default “float” storage type.

This is because a date and time variable, converted with the clock function, is stored as the
elapsed milliseconds from January 1, 1960 at midnight. That is a very large number that cannot
be stored exactly in a “float” variable, which holds only about 7 digits of precision.
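The size of the rounding error can be seen directly with the float( ) function, which rounds a number to float precision. This is a sketch to run from a do-file. For values of this size, neighboring float values are 2^17 = 131,072 milliseconds (about 131 seconds) apart, so a stored time can be off by up to roughly a minute.

```stata
* Exact number of milliseconds for the first admission time: 1248243960000
display %20.0f clock("7/22/1999 6:26:00", "MDYhms")
* The same value after rounding to float precision no longer matches
display %20.0f float(clock("7/22/1999 6:26:00", "MDYhms"))
```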

To preserve precision, we must specify the double precision format for the new variable.

Maintaining precision of the seconds,

capture drop admit_date2


gen double admit_date2 = clock(admit_date, "MDYhms")
list , abbrev(15)
format admit_date2 %tc
list , abbrev(15)

. list , abbrev(15)

+-------------------------------------------------------+
| admit_date infection_date admit_date2 |
|-------------------------------------------------------|
1. | 7/22/1999 6:26:00 7/25/1999 13:00:00 1.248e+12 |
2. | 7/12/1999 9:35:00 7/14/1999 10:30:00 1.247e+12 |
3. | 2/25/2000 10:20:00 2/28/2000 12:45:00 1.267e+12 |
+-------------------------------------------------------+

. format admit_date2 %tc

. list , abbrev(15)

+--------------------------------------------------------------+
| admit_date infection_date admit_date2 |
|--------------------------------------------------------------|
1. | 7/22/1999 6:26:00 7/25/1999 13:00:00 22jul1999 06:26:00 |
2. | 7/12/1999 9:35:00 7/14/1999 10:30:00 12jul1999 09:35:00 |
3. | 2/25/2000 10:20:00 2/28/2000 12:45:00 25feb2000 10:20:00 |
+--------------------------------------------------------------+
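With both columns converted, the two can finally be subtracted. The sketch below computes the number of calendar days from admission to infection; it uses the dofc( ) function to turn each date-time in milliseconds into an elapsed date in days. The variable names admit_dt, infection_dt, and days_to_infection are hypothetical.

```stata
clear
input str20 admit_date str20 infection_date
"7/22/1999 6:26:00" "7/25/1999 13:00:00"
"7/12/1999 9:35:00" "7/14/1999 10:30:00"
"2/25/2000 10:20:00" "2/28/2000 12:45:00"
end
gen double admit_dt = clock(admit_date, "MDYhms")
gen double infection_dt = clock(infection_date, "MDYhms")
* dofc() converts a date-time in milliseconds to an elapsed date in days
gen days_to_infection = dofc(infection_dt) - dofc(admit_dt)
list admit_date infection_date days_to_infection , abbrev(20)
```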

Date literal (or Date constant)

If we want to compute the number of days from July 1, 1999, which is perhaps the study start
date, we can use a date literal. Inside Stata’s d( ), the date is written as a day, followed by a
month, followed by a four-digit year, as in d(1jul1999).

capture drop admit_date2


capture drop daysfromstart
gen admit_date2 = date(admit_date, "MDYhms")
format admit_date2 %d
gen daysfromstart = admit_date2-d(1jul1999)
list , abbrev(15)

+-----------------------------------------------------------------------+
| admit_date infection_date admit_date2 daysfromstart |
|-----------------------------------------------------------------------|
1. | 7/22/1999 6:26:00 7/25/1999 13:00:00 22jul1999 21 |
2. | 7/12/1999 9:35:00 7/14/1999 10:30:00 12jul1999 11 |
3. | 2/25/2000 10:20:00 2/28/2000 12:45:00 25feb2000 239 |
+-----------------------------------------------------------------------+

An alternative to the date literal is the function “mdy(mm,dd,yyyy)”, which returns the same
elapsed date.


capture drop admit_date2
capture drop daysfromstart
gen admit_date2 = date(admit_date, "MDYhms")
format admit_date2 %d
gen daysfromstart = admit_date2-mdy(7,1,1999)
list , abbrev(15)

+-----------------------------------------------------------------------+
| admit_date infection_date admit_date2 daysfromstart |
|-----------------------------------------------------------------------|
1. | 7/22/1999 6:26:00 7/25/1999 13:00:00 22jul1999 21 |
2. | 7/12/1999 9:35:00 7/14/1999 10:30:00 12jul1999 11 |
3. | 2/25/2000 10:20:00 2/28/2000 12:45:00 25feb2000 239 |
+-----------------------------------------------------------------------+
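A quick way to confirm that the two forms agree is to display them side by side, as in this sketch (run from a do-file). In more recent Stata releases the same literal can also be written td(1jul1999).

```stata
display d(1jul1999)        // 14426
display mdy(7, 1, 1999)    // 14426
display %d d(1jul1999)     // 01jul1999
```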

Exercise

You have the idea that if you can show it takes longer for a third year resident to respond to a
beeper page than a second year resident, you could surely get this published in the New
England Journal of Medicine.

Data were collected using a computerized system, which records the date and time, down to the
second. The data are in the file residents.dta. Compute the elapsed minutes from page to
response and compare the two groups using an independent groups t test and a Wilcoxon-Mann-
Whitney test.

This is a rather difficult problem for most students. So here is most of the solution (also at the
bottom of the chapter8.do file).

1) All you need to do is fill in the missing line or lines, to create the timeminutes variable
(minutes between page and response).

Hint: there are 1000 milliseconds to one second. The timemilliseconds variable is elapsed time
in milliseconds. You have to get time in minutes.

capture drop page_time2


capture drop respond_time2
capture drop timemilliseconds
capture drop timeminutes
gen double page_time2 = clock(page_time, "MDYhms")
gen double respond_time2 = clock(respond_time , "MDYhms")
gen timemilliseconds = (respond_time2-page_time2)
* -- missing line(s), for you to fill in
ttest timeminutes ,by(resident_year) // t-test
ranksum timeminutes ,by(resident_year) // Wilcoxon-Mann-Whitney test
permtest2 timeminutes, by(resident_year)
// Fisher-Pitman permutation test for two independent samples

2) Add “if statements” to the test statistics on the last three rows to eliminate an outlier in one of
the groups. Do you think it would be a legitimate analysis strategy to do this?

References

Ellison DH, Berl T. (2007). The syndrome of inappropriate antidiuresis. N Engl J Med
356:2064-72.

Sykes MK, Vickers MD, Hull CJ, Winterburn PJ, Shepstone BJ. (1991). Principles of
Measurement and Monitoring in Anaesthesia and Intensive Care, 3rd ed, Oxford,
Blackwell Scientific Publications.

Solution to the Initial Saline Infusion Rates exercise

clear
input tbw na1 na2 vinput einput eurine einf
25 33 49 50 100 5 44
end
list
*
capture drop volume
gen volume=(tbw*(1-(na1+23.8)/(na2+23.8)) ///
+vinput-(einput*vinput)/eurine) ///
/(einf/eurine-1)
list

Note: The “///”, which tells Stata to continue the command on the next
line, only works in the do-file editor. If you are using Stata’s command window, you
must put everything on one line, omitting the two instances of “///”.
