Max. no. of observations                      2,147,483,647   2,147,483,647   1,000
Max. no. of characters for string variables   244             80              80
Order of the matrix                           1000 x 1000     800 x 800       40 x 40
Stata is user-friendly and flexible
It is easy to learn
Changes can be incorporated into Stata easily
Stata is both a command-driven and a menu-driven statistical
package
Typing commands is the more efficient way to execute tasks in
Stata, because it makes the move to writing programs
much faster
Although commands are efficient, they can be challenging
at first because Stata is case sensitive
Stata does not tolerate spelling, capitalization, or
spacing errors in commands or programs
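The case-sensitivity point can be checked directly. The lines below are a minimal sketch that assumes Stata's bundled auto dataset (loaded with sysuse) rather than the course data; capture is used only so that the deliberate mistakes do not stop the run.

```stata
* Assumption: the bundled auto dataset stands in for the course data
sysuse auto, clear

* Correct: command name and variable name are both lowercase
summarize price

* Wrong case in the command name: Stata does not recognize "Summarize"
capture Summarize price
display "return code after Summarize: " _rc

* Wrong case in the variable name: auto has price, not Price
capture summarize Price
display "return code after summarize Price: " _rc
```

A nonzero return code in _rc means the command failed; without capture, Stata would print the error message in red in the Results window and stop.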
2. The Stata Interface
The Stata interface comprises the Stata windows, the
menu bar, and the button (tool) bar
A. Stata Windows
The Stata windows give you key information about:
The Stata product that you are using
Recent commands and their output
They also give access to help on using Stata and report
any errors made while executing commands or
programs
Some windows open automatically when you
launch Stata; others can be opened from the Window
menu or with buttons on the toolbar
The Stata windows are:
1. Results: shows information about the Stata product
you are using, recent commands and their output
(except graphs), and error messages for commands
that failed (shown in red)
2. Command: to type a command for Stata to
execute
3. Variables: to see the names and descriptions of
the variables in the dataset currently in use
4. Review: to see the commands that you have
executed recently
5. Data Editor: to view the data in the current dataset
in a spreadsheet-like grid, to edit it, and to enter
raw data into Stata directly
6. Data Browser: to view the data in the current dataset
in a spreadsheet-like grid (the data cannot be
edited here)
7. Do-file Editor: to build a file of Stata
commands, to write and edit programs, to
execute one or several commands or programs
in a single run, and to write comments that
explain your work to other readers
8. Viewer: to get help on how to use Stata
9. Graph: to see the graph you have drawn
10. Variables Manager: to edit the names and
descriptions of the variables in a dataset
B. Menu bar
Stata's menu bar extends from left to right across the
top of the main window
It contains 8 drop-down menus from which
items that carry out different tasks in
Stata can be selected directly
These drop-down menus are generally a
substitute for typing the equivalent Stata
commands
The 8 drop-down menus, from left to right
on the menu bar, are:
File
I. Open: to open a dataset from a Stata data file
II. Save: to save the changes you made to the dataset
III. Save As: to save the dataset under a new name or in a new location
IV. Do: to open the Do-file Editor
V. Filename: to copy the file name of a dataset to
the command line
VI. Exit: to quit Stata
VII. Print: to print a log or graph
Edit
I. Copy/Paste: to copy text among the Command,
Results, and log windows
II. Copy Table: to copy a table from the Results window to
another file
III. Table Copy Options: to choose what is done with table lines
when using Copy Table
IV. Preferences: to set various
preferences
Data: to carry out data and variable
management tasks
Graphics: to draw different kinds of ordinary
and statistical graphs
Statistics: to produce statistical summaries
and analyses
User: to access new Stata commands from
the internet
Window: to open or switch to a Stata window
Help: to search for a Stata command or to get help
on how to use Stata
C. Button (tool) bar
The button bar contains buttons that
carry out routine tasks in Stata with a
single click
The buttons on the button bar, from left to
right, with their functions and equivalent
commands, are summarized in the table
below
Button    Function                     Equivalent command
Open      To open a Stata data file    use
Leading-zero formats
    %0#.#f    fixed numeric format                     %09.2f
    %0#s      string format                            %015s
Left-justified formats
    %-#.#g    general numeric format                   %-9.0g
    %-#.#f    fixed numeric format                     %-9.2f
    %-#.#e    exponential numeric format               %-10.7e
    %-d       default numeric elapsed date format      %-d
    %-d...    user-specified elapsed date format       %-dM/D/Y
    %-#s      string format                            %-15s
Centered formats
    %~#s      string format (special)                  %~15s
-----------------------------------------------------------------------------
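A few of the formats in the table can be tried as follows. This is a small sketch that assumes the bundled auto dataset; the variables price and make belong to that dataset, not to the course data.

```stata
* Assumption: the bundled auto dataset stands in for the course data
sysuse auto, clear

* Leading-zero fixed format: width 9, 2 decimal places
display %09.2f 3.14159

* Left-justified general numeric format applied to a variable
format price %-9.0g

* Centered string format (make is stored as str18 in auto)
format make %~18s

* Formats change how values are displayed, not the stored values
list make price in 1/3
describe make price
```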
ADVANCED DESCRIPTIVE STATISTICS
• tabulate … summarize
– This command creates one- and two-way tables
that summarize a continuous variable. With the
“summarize” option, each cell shows the mean and other
statistics of the continuous variable.
• The syntax is:
tabulate varname1 [varname2] [if exp] [in range],
summarize(varname3) options
• You can choose which statistics to display with the options “means”,
“standard” and “freq”
• You can use the abbreviation “tab … sum( )”
• Some examples:
• tab q1a, sum(cons)
gives the mean, standard deviation, and frequency of per capita
expenditure for each region
• tab q1b, sum(cons) mean
gives the mean consumption for each village
• tab q1a sexh, sum(food)
gives the mean, standard deviation, and frequency of food
consumption in each cell defined by region and sex of the
household head
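The same three patterns can be run on data that ships with Stata. This sketch assumes the bundled auto dataset, with foreign and rep78 standing in for the categorical variables (q1a, sexh) and price and mpg standing in for the continuous ones (cons, food).

```stata
* Assumption: auto dataset variables replace the survey variables
sysuse auto, clear

* Mean, standard deviation, and frequency of price for each car type
tabulate foreign, summarize(price)

* Means only
tabulate foreign, summarize(price) means

* Two-way table: statistics of mpg in each cell of car type by repair record
tabulate foreign rep78, summarize(mpg)
```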
• tabstat
– This command gives summary statistics for a set
of continuous variables for each value of a
categorical variable.
• The syntax is:
tabstat varlist [if exp] [in range], stat(statname [...]) by(varname)
• where
– varlist is a list of continuous variables
– statname is the name of a statistic
– varname is a categorical variable
Some facts about this command:
– The default statistic is the mean.
– Optional statistics include mean, sum, max,
min, range, sd (standard deviation), var (variance),
skewness, kurtosis, median, and selected percentiles such as p25 and p75.
– Without the by() option, tabstat is like “summarize” except
that it lets you specify the list of statistics to be
displayed.
– With the by() option, tabstat is like “tabulate … summarize”
except that tabstat is more flexible in its statistics and
format.
• Examples
– tabstat food hhsize, stats(mean max min)
gives the mean, max, and min of food and hhsize
– tabstat food hhsize, by(q1a)
gives the mean of the two variables for each region
– tabstat food, stats(median) by(q1a)
gives the median food consumption for each region
• The tabstat command displays summary statistics for
a series of numeric variables in a single table.
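A runnable version of these tabstat patterns, sketched under the assumption that the bundled auto dataset is used in place of the survey data (price and mpg for food and hhsize, foreign for q1a):

```stata
* Assumption: auto dataset variables replace the survey variables
sysuse auto, clear

* Several statistics for two variables in one table
tabstat price mpg, statistics(mean max min)

* Default statistic (the mean) for each category of foreign
tabstat price mpg, by(foreign)

* Median and quartiles of price for each category of foreign
tabstat price, statistics(median p25 p75) by(foreign)
```

The option name can be abbreviated (stat() or stats()), as in the examples above.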
• table
– This command creates a wide variety of tables. It
is probably the most flexible and useful of all the
table commands in Stata.
• The syntax (as used up to Stata 16; Stata 17 redesigned the command) is:
table rowvar [colvar] [if exp] [in range], c(clist) [row col]
• where
– rowvar is the categorical row variable
– colvar is the categorical column variable
– clist is a list of statistics and variables
– row is an option to include a summary row
– col is an option to include a summary column
• Some useful facts about this command:
– The default statistic is the frequency.
– Optional statistics are mean, sd, sum, rawsum (unweighted), count, max, min,
median, and pn (nth percentile).
– c( ) is short for contents( ), the contents of each cell.
– Like tab, it can be used to create one- and two-way frequency tables, but table
cannot show percentages.
– Like tab … sum, it can be used to calculate basic statistics for each value of a
categorical variable.
– Its advantage over tab … sum is that it offers more statistics and can take
more than one continuous variable.
– Like tabstat, it can be used to calculate advanced statistics for each value of a
categorical variable.
– Its advantage over tabstat is that it can produce two-way (and higher) tables; its
disadvantage is that it offers fewer statistics.
• Here are some examples:
– table q1a, row
table of frequencies by region with a total row
– table q1a, c(mean income)
table of average income by region
– table q1a, c(mean yield sd yield median yield)
table of yield statistics by region
– table q1a, c(mean yield) format(%9.2f)
table of average yields by region, displayed with two decimal places
– table q1a sexh, c(mean yield)
table of average yield by region and sex
– table q1a sexh, c(mean income mean yield)
table of average income and yield by region and sex
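The table examples above use the contents() syntax of Stata 16 and earlier. The sketch below assumes Stata 17 or later, where that older syntax is reached through the version prefix; in Stata 16 the prefix is harmless, and in earlier releases it should be dropped. The bundled auto dataset stands in for the survey data.

```stata
* Assumptions: Stata 16 or later; auto dataset replaces the survey variables
sysuse auto, clear

* Frequencies by car type with a total row
version 16: table foreign, row

* Several statistics of one variable, two decimal places
version 16: table foreign, contents(mean price sd price median price) format(%9.2f)

* Two-way table of mean mpg with summary row and column
version 16: table foreign rep78, contents(mean mpg) row col
```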
PRESENTING DATA WITH GRAPH
• The commands that draw graphs are
– graph twoway    scatterplots, line plots, etc.
– graph matrix    scatterplot matrices
– graph bar       bar charts
– graph dot       dot charts
– graph box       box-and-whisker plots
– graph pie       pie charts
• Examples
– graph twoway scatter cons food
• We can show the regression line predicting
cons from food with the lfit plot type.
– twoway lfit cons food
• The two plots can be overlaid like this
– twoway (scatter cons hhsize) (lfit cons hhsize)
• Labeling graphs
scatter read write, title("title") subtitle("subtitle")
xtitle("xtitle") ytitle("ytitle") note("note")
• Example
scatter ageh cons , title("title") subtitle("subtitle")
xtitle("xtitle") ytitle("ytitle") note("note")
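The overlay and labelling options can be combined in one command. This sketch assumes the bundled auto dataset, and the titles are illustrative text only. The /// line continuation works in a do-file, not when typed in the Command window.

```stata
* Assumption: auto dataset; run from a do-file because of the /// continuations
sysuse auto, clear

* Scatterplot with the fitted regression line overlaid, plus titles and a note
twoway (scatter price weight) (lfit price weight), ///
    title("Price and weight") subtitle("Scatter with linear fit") ///
    xtitle("Weight") ytitle("Price") ///
    note("Data: Stata's bundled auto dataset")
```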
NORMALITY AND OUTLIER
• Skewness and kurtosis (reported by the detail option of summarize)
sum rconspc
sum rconspc, detail
• Check the normality of a variable visually by
looking at some basic graphs
histogram rconspc
histogram rconspc, normal
histogram rconspc, normal bin(100)
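As a reference point, a normal distribution has skewness 0 and kurtosis 3 on the scale that summarize reports. The sketch below assumes the bundled auto dataset, with price standing in for rconspc, and reads the two statistics from the results that summarize leaves behind.

```stata
* Assumption: price from the auto dataset stands in for rconspc
sysuse auto, clear

* detail adds skewness, kurtosis, and percentiles to the output
summarize price, detail
display "skewness = " r(skewness) "   kurtosis = " r(kurtosis)

* Histogram with a normal curve overlaid for a visual check
histogram price, normal bin(20)
```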
• graph box draws vertical box plots
graph box rconspc, by(sexh)
– the y axis is numerical; by(sexh) draws a separate panel for each category of sexh
– the lower and upper edges of the box are the
25th and 75th percentiles
– the line within the box is the median
– the whiskers extend to the lower and upper adjacent values, the most
extreme observations within 1.5 times the interquartile range of the box
edges; values beyond them are plotted as individual points
• If rconspc is normal, the median will be in the
center of the box and the ends of the whiskers will be
roughly equidistant from the box
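The whisker limits of a box plot can be worked out by hand from the quartiles. This is a sketch under the assumption that the bundled auto dataset is used (price for rconspc, foreign for sexh); the scalar names are hypothetical, and over() is used here instead of by() to place the groups on one categorical axis.

```stata
* Assumption: auto dataset; scalar names are illustrative only
sysuse auto, clear

quietly summarize price, detail
scalar box_iqr     = r(p75) - r(p25)
scalar upper_fence = r(p75) + 1.5*scalar(box_iqr)
scalar lower_fence = r(p25) - 1.5*scalar(box_iqr)
display "IQR = " scalar(box_iqr)
display "fences: " scalar(lower_fence) " to " scalar(upper_fence)

* Observations beyond the upper fence would be drawn as individual points
count if price > scalar(upper_fence) & !missing(price)

* One box per category of foreign on a single categorical axis
graph box price, over(foreign)
```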
• The kdensity command with the normal option
kdensity rconspc, normal
– draws a kernel density estimate of the variable with a normal density
superimposed on the graph
– also useful after a regression for checking that the residuals are normally
distributed
• The pnorm command produces a P-P plot
pnorm rconspc
– The plot should be approximately linear if the variable follows
a normal distribution
• The qnorm command plots the quantiles of a variable
against the quantiles of a normal distribution
qnorm rconspc
– The closer the points in the Q-Q plot lie to the 45-degree
line, the closer the variable is to a normal distribution
• Both the P-P and Q-Q plots indicate that rconspc is not normal, with
a long tail to the right
• The qnorm plot is more sensitive to deviations from normality
in the tails of the distribution
• The pnorm plot is more sensitive to deviations near the mean
of the distribution
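The three graphical checks can be run one after another and kept in memory under separate names. The sketch assumes the bundled auto dataset, with price standing in for rconspc; the graph names are hypothetical.

```stata
* Assumption: price from the auto dataset stands in for rconspc
sysuse auto, clear

* Kernel density with a normal density overlaid
kdensity price, normal name(kd_price, replace)

* P-P plot: sensitive to departures in the middle of the distribution
pnorm price, name(pp_price, replace)

* Q-Q plot: sensitive to departures in the tails
qnorm price, name(qq_price, replace)
```

Naming each graph keeps the earlier ones from being replaced when the next command runs.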
• Dealing with outliers
– delete them from the analysis
– use measures that are not sensitive to them, such
as the median instead of the mean
– transform the data to be more normal
– replace them by imputation
/* Calculate number of standard deviations from median by sex of hh head */
egen median=median(rconspc), by(sexh)
egen sd=sd(rconspc), by(sexh)
* generate the ratio of the deviation from the median to the standard deviation
gen ratio=(rconspc-median)/sd
* (3 missing values generated)
* generate an outlier dummy: 1 if the value is more than 3 standard deviations above the median
gen outlier=1 if ratio>3 & ratio~=.
* (1414 missing values generated)
replace outlier=0 if outlier==. & ratio~=.
* (1411 real changes made)
tabulate outlier, missing
table sexh outlier, contents(mean rconspc) row col missing
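The same flagging logic can be tried on data that ships with Stata. This sketch assumes the bundled auto dataset (price for rconspc, foreign for sexh) and hypothetical variable names; as a variation on the code above, it builds the dummy in one step and flags both tails by using the absolute value of the ratio.

```stata
* Assumptions: auto dataset; med_price, sd_price, ratio, outlier are illustrative names
sysuse auto, clear

* Group median and standard deviation of price by car type
egen med_price = median(price), by(foreign)
egen sd_price  = sd(price), by(foreign)

* Distance from the group median in standard deviations
generate ratio = (price - med_price) / sd_price

* 1 if more than 3 standard deviations away in either direction, 0 otherwise,
* missing when the ratio itself is missing
generate byte outlier = abs(ratio) > 3 if !missing(ratio)

tabulate outlier foreign, missing
```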
• Listwise deletion (leave the flagged observations out of the analysis)
histogram rconspc if outlier==0, normal
• Data transformation
– a log transformation
gen lnrconspc=ln(rconspc)
histogram lnrconspc if rconspc~=., normal
• Imputation
– First, the analyst estimates a regression model whose
dependent variable is the variable with missing values
– Second, the estimated regression coefficients
are used to predict (impute) the missing values of that variable
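The two imputation steps can be sketched on the bundled auto dataset, where rep78 has a few missing values. The choice of regressors and the variable names rep78_hat and rep78_imp are assumptions made for illustration; a single linear regression is only the simplest form of imputation.

```stata
* Assumptions: auto dataset; regressors and new variable names are illustrative
sysuse auto, clear
count if missing(rep78)

* Step 1: regress the variable with missing values on fully observed variables
regress rep78 mpg weight foreign

* Step 2: predict for every observation, then fill in only the missing ones
predict rep78_hat
generate rep78_imp = rep78
replace rep78_imp = rep78_hat if missing(rep78)

count if missing(rep78_imp)
```

Keeping the imputed values in a new variable leaves the original untouched, so results with and without imputation can be compared.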
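* set the flagged outliers to missing, in levels and in logs
replace rconspc=. if outlier==1
replace lnrconspc=. if outlier==1
* estimate the model on the remaining observations, then predict for all of them
xi: regress lnrconspc i.q1a i.sexh i.poor hhsize ageh, robust
predict yhat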
• Or
• Options:
– tab : tab-delimited data
– comma : comma-delimited data
– delimiter("char"): use char as delimiter
– clear: replace data in memory
– names : variable names are included on the first line of the file
GOOD SITES TO LOOK AT!