NetCourse® 101

Introduction to Stata

Answers to Exercises in Lesson 3


1.
A survey of 10 yes or no questions was administered, and the data are

DATASET: survey.txt

1 male 1 0 0 1 1 0 0 1 1 1
2 female 1 1 0 . 0 1 0 1 1 0
3 female 0 1 1 0 1 0 0 1 . 0
4 male 1 1 0 0 1 1 1 0 0 1

The first data element is the respondent's identification number, the second is the respondent's sex, and the
remaining 10 are the respondent's answers to the 10 questions. Each question is coded 1=yes, 0=no, and .=irrelevant.
Show how to read these data using import delimited. Label the data so that the words yes and no appear
rather than 1 and 0 on tabulations. Make a tabulation of the responses to question 1 by sex.

Answer:

To import this text file, you need to use import delimited with a " " as the delimiter.

. clear

. import delimited id sex q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 using survey.txt, clear delimiter(" ")


(12 vars, 4 obs)

We can summarize our data to make sure the data were imported correctly:

. summarize

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          id |          4         2.5    1.290994          1          4
         sex |          0
          q1 |          4         .75          .5          0          1
          q2 |          4         .75          .5          0          1
          q3 |          4         .25          .5          0          1
-------------+---------------------------------------------------------
          q4 |          3    .3333333    .5773503          0          1
          q5 |          4         .75          .5          0          1
          q6 |          4          .5    .5773503          0          1
          q7 |          4         .25          .5          0          1
          q8 |          4         .75          .5          0          1
-------------+---------------------------------------------------------
          q9 |          3    .6666667    .5773503          0          1
         q10 |          4          .5    .5773503          0          1

Next, define a value label, attach the label to all the q variables in the dataset, and run tabulate.

. label define yesno 0 "no" 1 "yes"

. label values q* yesno


. tabulate q1 sex

| sex
q1 | female male | Total
-----------+----------------------+----------
no | 1 0 | 1
yes | 1 2 | 3
-----------+----------------------+----------
Total | 2 2 | 4

Everything looks fine, but if you tried to use sex as a variable in many of the statistical routines, you would get
an error message saying that there are no observations. The reason is that sex is still a string variable: we forgot
to encode it.

. encode sex, gen(sex2)

. tabulate q1 sex2

| sex2
q1 | female male | Total
-----------+----------------------+----------
no | 1 0 | 1
yes | 1 2 | 3
-----------+----------------------+----------
Total | 2 2 | 4

The output from tabulate looks the same, but sex is a string variable, while sex2 is a numeric variable with
value labels.

. codebook s*

-------------------------------------------------------------------------------
sex (unlabeled)
-------------------------------------------------------------------------------

type: string (str6)

unique values: 2 missing "": 0/4

tabulation: Freq. Value


2 "female"
2 "male"

-------------------------------------------------------------------------------
sex2 (unlabeled)
-------------------------------------------------------------------------------

type: numeric (long)


label: sex2

range: [1,2] units: 1


unique values: 2 missing .: 0/4

tabulation: Freq. Numeric Label


2 1 female
2 2 male

Last, let's drop the variable sex, rename sex2 to sex, and reorder our dataset.

. drop sex

. rename sex2 sex

. order id sex

2.
Take the survey dataset used in exercise 1, and write it to the Excel file survey.xlsx. Write the data to a new
sheet, Survey 1. Make sure to export the values, not the labels of the question variables, and start writing the
data in cell B4 of the Excel file. Also, make sure the first row of data contains the variable names.

Answer:

The command to use is:

. export excel survey.xlsx, sheet("Survey 1") cell(B4) nolabel firstrow(var)


file survey.xlsx saved

When you open the file survey.xlsx, you should see that the first row written, which holds the variable names,
starts in cell B4. Also, the q* columns contain the values 0 and 1, and the sex column contains 1 and 2. The only
sheet in the Excel file should be named Survey 1.
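
If you would rather check the result from within Stata than open the workbook, one possible sketch is to read the
sheet back in. The lines below are written as do-file lines (no dot prompt) and assume that survey.xlsx is in the
current working directory; preserve and restore keep the dataset in memory intact.

```stata
* Sketch: read the exported sheet back in to inspect what was written.
* Assumes survey.xlsx is in the current working directory.
preserve
import excel using survey.xlsx, sheet("Survey 1") cellrange(B4) firstrow clear
describe
list
restore
```

Because the file was written with nolabel, the listing should show numeric codes rather than the words yes and no.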

3.
In the discussion of dates, we offered the following dataset:

DATASET: hosp.txt

05721 "01/21/1952"
10322 "07/11/1948"
51331 "11/15/1968"

These data could be read and converted to Stata dates by typing

. clear

. import delimited patid strdate using hosp.txt, clear delimiter(" ")

. gen bdate = date(strdate,"MDY")

. format bdate %td

. drop strdate

Now consider the following data:

DATASET: hosp.csv

05721, "01/21/52 10:15:00 AM"
10322, "07/11/48 11:33:30 PM"
51331, "11/15/68 07:00:00 PM"

Assuming that all dates are in the twentieth century, how could these data be read and converted to Stata
dates? Assuming that all dates are in the nineteenth century, how would the answer to this question change?
(Hint: type help datetime_translation in the Stata Command window).

Answer:

What happens if we try to use the solution that worked for the hosp.txt data?

. clear

. import delimited patid strdate using hosp, clear


(2 vars, 3 obs)

. gen bdate = date(strdate,"MDY")
(3 missing values generated)

. format bdate %td

. list

+--------------------------------------+
| patid strdate bdate |
|--------------------------------------|
1. | 5721 "01/21/52 10:15:00 AM" . |
2. | 10322 "07/11/48 11:33:30 PM" . |
3. | 51331 "11/15/68 07:00:00 PM" . |
+--------------------------------------+

Notice the missing values. The date() function with "MDY" as the second argument refuses to accept a two-
digit year and a time.

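The clock() function lets you specify both a default century for two-digit years and a time. The two digits before
Y in the mask are the first two digits of the four-digit year: "MD19Yhms" reads 52 as 1952 (twentieth century),
"MD18Yhms" would read it as 1852 (nineteenth century), and "MD20Yhms" would read it as 2052.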
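
As a quick illustration of how the century prefix in the mask behaves, the do-file lines below (no dot prompt)
convert the same literal string under two different masks. The string is typed in by hand for this sketch rather
than read from hosp.csv.

```stata
* Same string, two different default centuries for the two-digit year.
display %tc clock("01/21/52 10:15:00 AM", "MD19Yhms")   // year read as 1952
display %tc clock("01/21/52 10:15:00 AM", "MD18Yhms")   // year read as 1852
```

Only the year should differ between the two results; the month, day, and time are parsed the same way.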

. clear

. import delimited patid strdate using hosp, clear


(2 vars, 3 obs)

. gen bdate = clock(strdate,"MD19Yhms")

. format bdate %tc

. list

+---------------------------------------------------+
| patid strdate bdate |
|---------------------------------------------------|
1. | 5721 "01/21/52 10:15:00 AM" 21jan1952 10:15:07 |
2. | 10322 "07/11/48 11:33:30 PM" 11jul1948 23:33:32 |
3. | 51331 "11/15/68 07:00:00 PM" 15nov1968 19:00:04 |
+---------------------------------------------------+

It looks like the times did not convert correctly. Why? Remember that Stata's default numeric storage type is float.
A float holds only about seven digits of accuracy, whereas a %tc value is a count of milliseconds since
01jan1960 00:00:00 and needs many more digits than that. You must use the double storage type:

. drop bdate

. gen double bdate = clock(strdate,"MD19Yhms")

. format bdate %tc

. list

+---------------------------------------------------+
| patid strdate bdate |
|---------------------------------------------------|
1. | 5721 "01/21/52 10:15:00 AM" 21jan1952 10:15:00 |
2. | 10322 "07/11/48 11:33:30 PM" 11jul1948 23:33:30 |
3. | 51331 "11/15/68 07:00:00 PM" 15nov1968 19:00:00 |
+---------------------------------------------------+
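
The loss of precision can also be seen on a single value. The following is a minimal sketch, written as do-file
lines, that uses Stata's float() function to round a clock value to float precision; the scalar name tval is
only an illustrative choice.

```stata
* A %tc value counts milliseconds since 01jan1960 00:00:00.
scalar tval = clock("21jan1952 10:15:00", "DMYhms")
display %15.0f tval           // value held in double precision
display %15.0f float(tval)    // same value rounded to float precision
display %tc float(tval)       // rounded value shown as a date and time
```

The first two numbers should differ in their trailing digits, which is what moves the displayed time by a few
seconds.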

4.
Take the hosp dataset imported in exercise 3, and write it to a file new_hosp.csv. Next, use the type command to
view the contents of the file. Was the bdate variable exported as a string or as a number? What happens when you
export the data to Excel format?

Answer:

To export the data, type

. export delimited new_hosp


file new_hosp.csv saved

. type new_hosp.csv
patid,strdate,bdate
5721, "01/21/52 10:15:00 AM",21jan1952 10:15:00
10322, "07/11/48 11:33:30 PM",11jul1948 23:33:30
51331, "11/15/68 07:00:00 PM",15nov1968 19:00:00

export delimited exports dates as strings. The reason is that another application would have no way of knowing
that a numeric value in new_hosp.csv is really a date; it would just see the value as a number. Most other
packages have date functions similar to Stata's clock() function that convert string dates to numeric values,
so it is safe to export a date as a string.

Now export this data to Excel:

. export excel new_hosp


file new_hosp.xls saved

Open this file, and look at the last column. export excel converts a numeric Stata date to a numeric Excel
date and applies an Excel date format to the cell, so you do not have to do anything special.

5.
Consider the following survey data reporting employee ID, birth date, and hire date:

DATASET: emp.txt

22132 01/21/52 7/85
28399 07/11/48 11/92
11752 10/28/62 5/90

Note that the hire dates are specified as a month and year only. Read the data, and calculate the approximate
age at hire assuming all persons were hired on the 15th of the month (to import this text file correctly, you will
need to use a delimiter() suboption).

Answer:

Let's import the data and use spaces as the delimiter.

. import delimited empid strdate shdate using emp.txt, delimiter(" ") clear
(8 vars, 3 obs)

. list

+--------------------------------------------------------------+
| empid strdate shdate v4 v5 v6 v7 v8 |
|--------------------------------------------------------------|
1. | 22132 . . 01/21/52 . . 7/85 |
2. | 28399 . . 07/11/48 . . 11/92 |
3. | 11752 . . 10/28/62 . . 5/90 |
+--------------------------------------------------------------+

What happened? If you look at the text file, you will see that there are multiple spaces between the values.
By default, import delimited treats every space as a delimiter, so each extra space creates an empty variable.
There is a solution: the collapse suboption of the delimiter() option treats multiple consecutive delimiters as
just one delimiter.

. import delimited empid strdate shdate using emp.txt, delimiter(" ", collapse) clear
(3 vars, 3 obs)

. list

+---------------------------+
| empid strdate shdate |
|---------------------------|
1. | 22132 01/21/52 7/85 |
2. | 28399 07/11/48 11/92 |
3. | 11752 10/28/62 5/90 |
+---------------------------+

Now that the data have been read in correctly, we can convert the date string variables. We can tell Stata to
convert the hire dates using date(shdate,"MY"), but because the day is missing, how do we tell Stata that the
hire date was the 15th of every month?

The quickest way (and easiest once you have seen it) to take care of the hire-date variable is by typing

. gen bdate = date(strdate,"MD19Y")

. drop strdate

. gen hdate = date(shdate, "M19Y")+14

. format bdate hdate %td

. list

+----------------------------------------+
| empid shdate bdate hdate |
|----------------------------------------|
1. | 22132 7/85 21jan1952 15jul1985 |
2. | 28399 11/92 11jul1948 15nov1992 |
3. | 11752 5/90 28oct1962 15may1990 |
+----------------------------------------+

Let me explain. The date() function simply assumes that we wanted the first of each month. That is why I then
add 14 so that we obtain the 15th of the month instead of the 1st.

Now let me show you how to get the same result with the help of some string functions. Here is the strategy:

1. Find the location of the / in shdate and break it into two numeric variables, mo and yr.

2. Use another Stata function, mdy(), and calculate mdy(mo,15,1900+yr). mdy() is not as smart as
date()—the years must include the century.

Now consider step 1. Let scol be the column of the / in shdate:

scol = strpos(shdate,"/")

Then the month is

substr(shdate, 1, scol-1)

and the year is

substr(shdate, scol+1, .)
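
To see what these functions return for a single value, the do-file lines below apply them to the literal string
"11/92" from the second row of the data. The slash position 3 is plugged in by hand here purely for illustration.

```stata
display strpos("11/92", "/")         // position of the slash: 3
display substr("11/92", 1, 3 - 1)    // month part: 11
display substr("11/92", 3 + 1, .)    // year part: 92
```

A missing value as the last argument of substr() means "through the end of the string", which is why the year
expression works whether the month has one digit or two.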

There is only one more detail: substr() returns a string, and I want the month and year as numeric variables;
therefore, I want to calculate

real(substr(...))

So, I type

. gen scol = strpos(shdate,"/")

. gen mo = real(substr(shdate, 1, scol-1))

. gen yr = real(substr(shdate, scol+1, .))

. gen hdate2 = mdy(mo, 15, 1900+yr)

. format hdate2 %td

. drop scol mo yr shdate

. list

+-------------------------------------------+
| empid bdate hdate hdate2 |
|-------------------------------------------|
1. | 22132 21jan1952 15jul1985 15jul1985 |
2. | 28399 11jul1948 15nov1992 15nov1992 |
3. | 11752 28oct1962 15may1990 15may1990 |
+-------------------------------------------+

I get the same answer as before.

Now I can calculate the age at hire:

. drop hdate2

. gen hireage = (hdate-bdate)/365.25

. list

+------------------------------------------+
| empid bdate hdate hireage |
|------------------------------------------|
1. | 22132 21jan1952 15jul1985 33.48118 |
2. | 28399 11jul1948 15nov1992 44.34771 |
3. | 11752 28oct1962 15may1990 27.54552 |
+------------------------------------------+
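
Dividing by 365.25 gives a fractional age. If completed years are wanted instead, one possible sketch (not part of
the original answer) compares the date components directly. It is written as do-file lines, and age_whole is a
hypothetical variable name chosen for this example.

```stata
* Sketch: completed years of age at hire, computed from date components.
* age_whole is a hypothetical variable name, not part of the original answer.
* Subtract 1 when the hire date falls before the birthday in the hire year.
gen byte age_whole = year(hdate) - year(bdate) ///
    - (month(hdate) < month(bdate) | ///
       (month(hdate) == month(bdate) & day(hdate) < day(bdate)))
list empid bdate hdate hireage age_whole
```

For these three employees, age_whole should equal the integer part of hireage.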

© Copyright 2019 StataCorp LP.