
Getting Started with STATA

By: Katie Droll

Embrace Stata!
Stata is your statistical buddy!
If you put in a bit of effort to learn the
basics, you should find the program quite
easy and very helpful.
Statistical software can be very
intimidating the first time around. Stay
patient!

Results window: this is where non-graphic output is printed
Review window: lists all commands; click on a command to rerun it
Variable window
Command window: enter commands here!
Graph window: click on the graph and copy it into a Word document

How do I enter data?


Retrieve data from stored data files:
EASY: Open .dta files from textbook CD-ROM
HARDER: Import ASCII data from .txt or .raw
But also useful outside the context of class

Manually enter variables & data values:


EASY: Use the data editor
HARDER: Use input command
Time consuming if there is a lot of data
Prone to errors: typos!

Where is the stored data?


Textbook CD-ROM
Datasets for the chapter examples are in the
appropriate chapter folder under Stata
Datasets for the homework problems in Appendix B of the
book should also be here, under exercise

On the course website


Under Statistical Computing Datasets
Save the .dta file on your computer

Retrieving .DTA files


Command line:
use "E:\Stata\exercise\nurshome.dta", clear
-OR-

Point and Click:


Go to File > Open, then select your CD drive
Go to Stata > exercise OR chapn

Importing .txt OR .raw data files


Remove the variable names and any other
symbols (such as *) from the top of the .txt file,
then save!
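Command:
infile str20 strvar1 numvar2 using "C:\Unicef.txt"
infile: the command that imports the data
str20: marks the next variable as a string and gives its length (20 characters)
strvar1, numvar2: variable names
using: command keyword that introduces the file
"C:\Unicef.txt": file pathname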
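After importing, it is worth checking that the variables came in as intended. The lines below are a minimal sketch that reuses the file path and variable names from the example above; the clear option is an addition that discards any data already in memory.

```stata
* Import the text file, replacing any data currently in memory
infile str20 strvar1 numvar2 using "C:\Unicef.txt", clear
* Show variable names and storage types (strvar1 should be str20)
describe
* Show how many observations were read
count
```

If count reports fewer or more observations than the file has rows, the usual suspects are leftover header lines or string values that contain spaces.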

Entering data using the editor


Go to Data > Data Editor
Enter your data similar to a spreadsheet program like Excel
Double-click on the variable names (var1) to edit them and add variable labels
Click Preserve, and then close out of the data editor window
You cannot run analyses on this data until you preserve the data and close the
data editor!

[Screenshot callouts: "Variable Name"; "This tells Stata the variable is a string"]
Thank your classmates for taking the time to figure this out!

Entering data using input


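input str18 name age
"Joe Smith" 15
"Ricky Bobby" 24
"Wilma Flintstone" 27
end

input starts data entry and end exits it. str18 gives the length of the string variable.
You must use quotation marks if there are any spaces in a string value.

If first and last names go in separate variables, no quotation marks are needed:

input str5 first str10 last age
Joe Smith 15
Ricky Bobby 24
Wilma Flintstone 27
end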

input year cigs


1900 54
1910 151
1920 665
1930 1485
1940 1976
1950 3522
1960 4171
1970 3985
1980 3851
1990 2828
end

Summarizing data
list
prints your dataset to the Results window
summarize variable
prints summary stats in the results window
summarize variable, detail
provides additional summary statistics
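
As a small illustration, the lines below apply these commands to the year and cigs data entered with input above. This is only a sketch and assumes that dataset is the one currently in memory.

```stata
* Print only the first three observations instead of the whole dataset
list in 1/3
* Number of observations, mean, standard deviation, minimum and maximum
summarize cigs
* Adds percentiles (including the median), variance, skewness and kurtosis
summarize cigs, detail
```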

Lab #1 Main Topics

Bar Charts
graph bar cigs, over(year) title("Cigarette Consumption Per Person, US")
b2(Year) ytitle("number of Cigarettes") ylabel(0(2000)4000)

[Figure: bar chart titled "Cigarette Consumption Per Person, US"; y-axis "Number of Cigarettes" (0 to 4,000); x-axis "Year" (1900 to 1990)]

Box plot
graph box cigs, title("Cigarette Consumption per Person, US")
ytitle("Number of Cigarettes")
graph box resident, medtype(line) box(1, fcolor(magenta)
lcolor(purple)) title(Box plot of Nursing Home Residents)

[Figure: box plot titled "Boxplot of Nursing Home Residents"; y-axis "nursing home residents" (20 to 80)]

Histogram
histogram resident, ytitle(Distribution of Residents)
xtitle(Number of residents)
title(Histogram of the Distribution of Residents)

[Figure: histogram titled "Histogram of the Distribution of Residents"; y-axis "Distribution of Residents" (.005 to .025); x-axis "Number of residents" (0 to 80)]

Save commands!
Open a do-file editor:
Window > Do-file Editor > New Do-file
Copy and paste commands into this file to save them for later
use
You can also copy and paste commands into a simple .txt
file or a Word file
Please include important output (results & graphs) in
your homework, along with the commands that produced
the included output.
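
A do-file is also the easiest place to keep long graph commands, because inside a do-file a command can be continued onto the next line with ///. The sketch below is illustrative only: the file name lab1.do is hypothetical, and the data are the first rows of the cigarette example.

```stata
* lab1.do (hypothetical file name)
clear
input year cigs
1900 54
1910 151
1920 665
end
* Inside a do-file, /// continues a command on the next line
graph bar cigs, over(year) ///
    title("Cigarette Consumption Per Person, US") ///
    ytitle("Number of Cigarettes")
```

Saved under that name, the file could be run from the Command window with do lab1.do, or from the Do-file Editor itself.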

Saving commands to a log file


At the start of your Stata session, give Stata the
following command:

log using "C:\Temp\myfile.log", noproc


After you are done entering your Stata commands, you can close the
log file by using the Log button located just below the Prefs menu (it
looks like a scroll with a traffic light next to it).
From within Stata, you can examine the contents of that log file with
the command:

type "C:\Temp\myfile.log"
To run that file as a program (referred to as a "do-file" in Stata), you
can simply issue the Stata command:

do "C:\Temp\myfile.log"
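
Some Stata versions may not accept the noproc option. In that case, cmdlog is a documented alternative that records only the commands you type, without output. The lines below are a sketch, and the file name mycommands.txt is hypothetical.

```stata
* Start recording typed commands (no output) in a plain-text file
cmdlog using "C:\Temp\mycommands.txt", replace
* ... your analysis commands go here, for example:
summarize cigs
* Stop recording
cmdlog close
```

An ordinary log opened with log using can likewise be closed by typing log close instead of using the Log button.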

Putting Stata output into homework


Simply highlight what you want from the
results window (including the command),
then copy [Ctrl-C] and paste [Ctrl-V] into
your homework document
To copy and paste graphs, just click on the
graph before copying it. You can use [Ctrl-C] or right-click > Copy

Lab #2 Main Topics

Labels
Save organ.dta from the website to your computer, and
open it in Stata
The names of the afflicted organs are just labels. To see
what the raw data look like, you can list them without the
labels as follows:

list, nolabel
You can see what the association of label and value is by
listing the labels:

label list

Summarizing data by categorical groups

If we want to do some exploratory analysis of our
data set, we can first produce some
descriptive statistics for the survival of each
organ. To do that we must sort the observations
by organ.

sort organ
Then we can summarize the data by organ as
follows:

by organ: summarize survival
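
The two steps can also be combined. As a brief sketch, the bysort prefix sorts the data by organ and then runs the command within each group:

```stata
* Equivalent to "sort organ" followed by "by organ: summarize survival"
bysort organ: summarize survival
```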

Side-by-side box plots

We can even generate side-by-side box plots of the survival from
diagnosis for each affected organ as follows:

graph box survival, by(organ) ytitle("Length of Survival (days from diagnosis)")

[Figure: box plots of survival, one panel per affected organ (Breast, Bronchus, Colon, Ovary, Stomach); y-axis "Length of Survival (days from diagnosis)" (1,000 to 4,000)]

Creating a new variable as a function of an existing variable

The first conclusion from the box plot is that women with breast
cancer have the longest survival. This is consistent with the
descriptive statistics produced by the summarize command.

Another conclusion is that the variability in the length of survival is
not the same in all cases, with breast and ovarian cancer having a
large variability (indicated by the length of the box) while the rest of
the cancers have very small variability. This will actually be a
problem later on, so we take a transformation of the original survival
times. A logarithmic transformation is usually a good bet. We do this
as follows:

generate lsurv=ln(survival)
label var lsurv "Log-transformed survival"
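
A quick check after a log transformation is sketched below. It assumes lsurv has just been created as above. Because ln() returns a missing value for zero or negative input, the count shows whether any observations were lost.

```stata
* Observations with a recorded survival time but no log value
count if missing(lsurv) & !missing(survival)
* Compare the original and transformed variables side by side
summarize survival lsurv
```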

Box plot of log survival

To include the overall box plot of survival in the
side-by-side box plots, you just add the option
total:

graph box lsurv, by(organ, total) ytitle("Log-transformed Survival (days from diagnosis)")

[Figure: box plots of log-transformed survival, panels Breast, Bronchus, Colon, Ovary, Stomach, Total; y-axis "Log-transformed Survival (days from diagnosis)"]

Histograms by group
We can also generate the histograms of
survival time (log-transformed) for each
type of cancer as well as total as follows:
hist lsurv, freq by(organ, total)

[Figure: histograms of log-transformed survival, panels Breast, Bronchus, Colon, Ovary, Stomach, Total; y-axis "Frequency" (0 to 15); x-axis "Log-transformed survival"]

Selecting groups to summarize


To get descriptive statistics within only the
breast and ovarian cancer groups, you
must use the if qualifier within the
summarize command:
by organ: summarize survival if organ==1 | organ==4, detail
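
The same selection can be written with the inlist() function, which stays readable when more groups are added. This is a sketch and assumes, as in the command above, that 1 and 4 are the codes for breast and ovary (label list shows the actual codes).

```stata
* Summarize survival in detail for organ codes 1 and 4 only
bysort organ: summarize survival if inlist(organ, 1, 4), detail
```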

Especially for Point-and-click People!

If you don't like entering commands, you can also use the menus in
Stata to point and click your way through the analyses.

To summarize data:
Data > Describe data > choose an option here
Graphs:
Graphics > Bar chart, Histogram, Box plot, and many other options

This is a great way to explore the program, and learn about the
various capabilities of Stata

Still, please remember to include the command from the Results
window in your homework
