Choosing a statistical software package is often a strategic decision that entails an
investment of time and money. One should consider the software's future development and
compatibility, the support available, and the opportunity to exchange experience with
peers.
1. Introduction
Stata is a powerful statistical package with smart data-management facilities, a wide array of
up-to-date statistical techniques, and an excellent system for producing publication-quality
graphs. The official website is http://www.stata.com/. Its main strengths are handling and
manipulating large data sets, and capabilities for handling panel and time-series regression analysis.
Stata is designed for command-line use, but it also offers a GUI that allows you to
work with menus. Command-line use has the advantage that all steps of the
analysis, and thus all results, are easily replicable.
Versions of Stata
Several versions of Stata are available: Stata/SE (Special Edition), Stata/IC (Intercooled
Stata) and Small Stata. The standard version is Stata/IC, which can handle up to 2,047
variables. Stata/SE can handle up to 32,767 variables.
When Stata starts up you see four or five docked windows, depending on the version; the
later versions have five: Review, Results, Command, Variables, and Properties.
Important to note:
✓ You can resize or even close some of these windows. Stata remembers its settings the next
time it runs.
✓ You can also save (and then load) named preference sets using the menu Edit|Preferences.
✓ You can also choose the font used in each window; just right click and select font from the
context menu.
✓ You can change the colour scheme. For earlier versions of Stata the preset scheme is classic,
the traditional black background.
From version 8 onwards, Stata's graphical user interface (GUI) allows you to select commands
and options from a menu and dialog system. The GUI is often the easier route;
however, the command language ensures reproducibility of your results. Commands can be
typed in a separate file, called a do file, or in the Command window. The GUI can be helpful
when you are starting to learn Stata, particularly because after you point and click on the
menus and dialogs, Stata types the corresponding command for you.
Typing Commands
Stata commands usually follow this syntax, where bold indicates keywords and square
brackets indicate optional elements:
[by varlist:] command [varlist] [=exp] [if exp] [in range] [weight] [using filename] [, options]
A variable list (varlist) is a list of variable names with spaces/blanks in between. There are a
number of shorthand conventions (use of wildcards) to reduce the amount of typing.
myvar*          All variables starting with myvar
*var            All variables ending with var
my*var          All variables starting with my and ending with var
my~var          A single variable starting with my and ending with var
my?var          All variables starting with my and ending with var, with exactly one other character between
myvar1-myvar5   All variables from myvar1 through myvar5, in the order shown in the Variables window
The ? character matches a single character. All variables matching the pattern are returned.
The - character indicates that all variables in the dataset, starting with the variable to the left
of the - and ending with the variable to the right of the -, are to be returned.
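As a minimal sketch of these shorthand conventions, the commands below use the auto dataset that ships with Stata. This dataset is an assumption for illustration only; it is not the data used in this guide. Its variables are, in order: make, price, mpg, rep78, headroom, trunk, weight, length, turn, displacement, gear_ratio, foreign.

```stata
sysuse auto, clear
describe m*            // make and mpg: all names starting with m
describe *t            // weight and displacement: all names ending in t
describe t*n           // turn: starts with t and ends with n
describe price-rep78   // price, mpg, rep78: a range in dataset order
```

The // comments work when the lines are run from a do file.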
Working Directory
The working directory is displayed at the bottom left hand corner of the window. Stata
displays the default directory. Any files you save without specifying a directory will
be saved here. To change your working directory, use the cd command:
cd directoryname
Allocating memory
set memory
Stata reads the whole dataset into the working memory, thus, sufficient memory has to be
reserved (or an error message will be displayed). Therefore, you should set the size of the
working memory reserved for Stata before loading a (big) dataset with the command set
memory:
Syntax: set memory Xm [, permanently]
e.g: set memory 200m
X is the number of megabytes, and the permanently option makes the
setting permanent. No more than ¾ of the computer's available memory
should be reserved for Stata, to guarantee good performance of the system.
From Stata 12 onwards memory is managed automatically and set memory is no longer needed.
Help in STATA
You may need to save your results and keep track of the commands that you have used.
By default all output goes to the Results window, where it can be viewed but
eventually disappears. To keep a record of your results, you should log your session, after
which Stata writes all results to both the Results window and the file you specify:
log using filename, text replace
where filename is the name of your log file. Note the use of two recommended options: text
and replace. The text option creates the log in plain text (ASCII) format (the file has extension
.log), which can be viewed in an editor such as Notepad or a word processor such as MS
Word; otherwise the log file can only be viewed using Stata's Viewer and has extension .smcl.
The replace option specifies that the file is to be overwritten if it already exists; the
alternative is append, which adds new output to what is already in the file.
The capture keyword tells Stata to run the command that follows and ignore any errors.
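A typical logging sequence might look like the sketch below. The file name mysession.log and the use of the auto dataset are assumptions made for illustration.

```stata
capture log close                      // close any open log; ignore the error if none is open
log using mysession.log, text replace  // mysession.log is a hypothetical file name
sysuse auto, clear                     // example dataset shipped with Stata
summarize price
log close                              // stop logging and close the file
```

Starting a do file with capture log close is a common habit, because log using fails if a log is already open.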
(i) You can use Stata's own built-in do-file Editor, which has the great advantage that you can
run your program directly from the editor by clicking on the run icon or selecting Tools|Run
from the menu. You can also select just a few commands and run them by selecting
Tools|Run Selection in the menu. To access Stata's do editor use Ctrl-8 in earlier versions or
Ctrl-9 in versions 12 and 13 or select Window|Do-file Editor|New Do-file Editor in the menu
system.
(ii) Alternatively, you can use an editor such as Notepad. Save the file using extension .do
and then execute it using the command do filename.
(iii) You could even use a word processor such as Word, but you would have to remember to
save the file in plain text format, not in Word document format.
It is always a good idea to annotate your do files with explanatory comments that provide the
explanation of what you are trying to do.
(i) In the Stata Command window you can start a line with a * to indicate that it is a comment,
not a command. This can be useful to annotate your output.
(ii) In a do file you can also use two other types of comments: // and /* */
(a) // indicates that everything that follows, to the end of the line, is a
comment and should be ignored by Stata.
(b) /* */ indicates that all the text between the opening /* and the
closing */, which may be a few characters or may span several lines, is a comment to
be ignored by Stata.
Continuation Lines
(i) To indicate to Stata that a command continues on the next line you use ///, which
says that everything else to the end of the line is a comment and that the command itself
continues on the next line.
(ii) Alternatively tell Stata to use a semi-colon instead of the enter key at the end of
the line to mark the end of a command, using #delimit ;, as in this example:
#delimit ;
graph twoway (scatter lexp loggnppc)
(lfit lexp loggnppc) ;
#delimit sets the character that marks the end of a command; here the character chosen
is the semicolon. This is helpful when writing do-files because otherwise each
command must be written on a single line (or continued with ///). After #delimit ; every
command must end with a semicolon. To return to using the carriage return as the delimiter use
#delimit cr
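The comment and continuation styles described above can be sketched together in a short do file. The auto dataset and its variables mpg and weight are assumptions chosen for illustration; the //, /// and #delimit forms are meant for do files rather than the Command window.

```stata
* a full-line comment starts with an asterisk
sysuse auto, clear              // example dataset shipped with Stata
/* a block comment
   can span several lines */
twoway (scatter mpg weight)     /// the command continues on the next line
       (lfit mpg weight)

#delimit ;
twoway (scatter mpg weight)
       (lfit mpg weight) ;
#delimit cr
```

Both twoway commands draw the same graph; only the way the line break is handled differs.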
NB:
We start the do file by specifying the version of Stata we are using with the version
command, in this case version 10. This helps ensure that future versions of Stata will
continue to interpret the commands correctly, even if Stata has changed.
2. Data Management
In this section we describe Stata data files, discuss how to read raw data into Stata in free and
fixed formats, how to create new variables, how to document a dataset by labelling the variables and
their values, and how to manage Stata system files.
Stata 11 introduced a variables manager that allows editing variable names, labels, types, formats,
and notes, as well as value labels, using an intuitive graphical user interface available under
Data|Variables Manager in the menu system. While the manager is certainly convenient, we still
prefer writing all commands in a do file to ensure research reproducibility.
Stata Files
Variable names can have up to 32 characters, but many commands print only 12, and shorter
names are easier to type. Stata names are case sensitive.
Variables can contain numbers or strings. Numeric variables can be stored as integers (bytes,
integers, or longs) or floating point (float or double). These types differ in the range or precision
of the values they can hold, type help datatype for details.
You usually don't need to be concerned about the storage mode; Stata does all calculations using
doubles, and the compress command will find the most economical way to store each variable in
your dataset, type help compress to learn more.
You do have to be careful with logical comparisons involving floating point types. If you store
0.1 in a float called x you may be surprised to learn that x == 0.1 is never true. The reason is that
0.1 is "rounded" to different binary numbers when stored as a float (x) or as a double (the
constant 0.1). This problem does not occur with integers or strings.
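The float comparison pitfall can be checked with a small sketch. The one-observation dataset is made up for illustration; float() rounds the constant to float precision before comparing.

```stata
clear
set obs 1
generate float x = 0.1
count if x == 0.1           // compares a float with a double constant
count if x == float(0.1)    // rounds the constant to float precision first
```

Only the second comparison finds the observation. Storing x as a double (generate double x = 0.1) is another way to avoid the problem.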
String variables can have varying lengths up to 244 characters in Stata 12, or up to two billion
characters in Stata 13, where you can use str1...str2045 to define fixed-length strings of up to
2045 characters, and strL to define a long string.
To convert string variables to numbers you use the command destring or the function real().
Alternatively, you can use encode to convert string data into a labelled numeric variable, or
decode to convert labelled numeric variables to strings.
Missing Values
Like other statistical packages, Stata distinguishes missing values. The basic missing value for
numeric variables is represented by a dot . Starting with version 8 there are 26 additional missing-
value codes denoted by .a to .z. These values are represented internally as very large numbers, so
valid_numbers< . < .a < ... < .z.
To check for missing you need to write var >= . (not var == .). Stata has a function that can do
this comparison, missing(varname) and I recommend it because it leads to more readable code,
e.g. I prefer list id if missing(age) to list id if age >= .
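A small sketch, using made-up values, shows why missing values need care in comparisons. The variable names id and age follow the example in the text.

```stata
clear
input id age
1 25
2 .
3 .a
4 61
end
list id if missing(age)              // lists ids 2 and 3
count if age > 60                    // also counts the two missing values
count if age > 60 & !missing(age)    // counts only the genuine value 61
```

Because . and .a are stored as very large numbers, the open-ended comparison age > 60 picks them up unless missing values are excluded explicitly.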
From Excel
If you have an Excel file, you need to save it as a CSV (comma-delimited) file. Then the
insheet command will read the file into Stata, e.g. insheet using "C:\myexcelfile.csv", comma
On some occasions you also need to include a variable list, so the command would read insheet
var1 var2 var3 … using "C:\myexcelfile.csv", comma
In this section we discuss how to read raw data files. If your data come from another statistical
package, such as SAS or SPSS, consider using a tool such as Stat/Transfer
(www.stattransfer.com) or DBMSCopy (www.dataflux.com).
Free Format
If your data are in free format, with variables separated by blanks, commas, or tabs, you can use
the infile command.
For an example of a free-format file, see the data available on the web at a given URL, e.g.
http://uz.stats/datasets
We can read a free-format data set into Stata using the infile command.
The infile command is followed by the names of the variables. Because var1 is a string rather
than a numeric variable, we precede the name with str20, which sets the type of the variable as a
string of up to 20 characters. All other variables are numeric, which is the default type.
The keyword using is followed by the name of the file, which can be a file on your computer, a
local network, or the internet. In this example we are reading the file directly off the internet. And
that's all there is to it. For more information on this command type help infile1. To see what we
got we can list a few cases
. list in 1/3
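Since the data file itself is not reproduced here, the following self-contained sketch first writes a tiny free-format file and then reads it back with infile. The file name hh_demo.raw, the variable names town, hhsize and income, and the values are all made up for illustration.

```stata
* write a tiny free-format file (made-up values), then read it back
file open fh using hh_demo.raw, write text replace
file write fh "Harare 3 120" _n
file write fh "Mutare 5 95" _n
file close fh

* str20 declares town as a string of up to 20 characters; the rest are numeric
infile str20 town hhsize income using hh_demo.raw, clear
list
```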
Survey data often come in fixed format, with one or more records per case and each variable in a
fixed position in each record.
infix str var1 2-19 var2 20-24 var3 25-29 var4 30-40 using http://uz.stats/datasets/hhdata.raw, clear
This says to read var1 from columns 2-19, var2 from columns 20-24, and so on. It is, of
course, essential to read the correct columns. We specified that var1 was a string variable
but didn't have to specify the width, which was clear from the fact that the data are in
columns 2-19. The clear option is used to overwrite the existing dataset in memory.
If you have a large number of variables you should consider typing the names and locations in a
separate file, called a dictionary, which you can then call from the infix command.
The infile command can also be used with fixed-format data and a dictionary. This is a very
powerful command that gives you a number of options not available with infix; for example it
lets you define variable labels right in the dictionary, but the syntax is a bit more complicated.
In most cases you will find that you can read free-format data using infile and fixed-format data
using infix. For more information on the trade-offs see help infiling.
Data can also be typed directly into Stata using the input command, see help input, or using the
built-in Stata data editor available through Data|Data editor on the menu system.
Stata lets you label your dataset using the label data command followed by a label of up to 80
characters (244 in Stata SE). You can also add notes of up to ~64K characters each using the
notes command followed by a colon and then the text:
Users of the data can type notes to see your annotation. Documenting your data carefully always
pays off.
You can (and should) label your variables using the label variable command followed by the
name of the variable and a label of up to 80 characters enclosed in quotes. With the infile
command you can add these labels to the dictionary, which is a natural home for them. Otherwise
you should prepare a do file with all the labels. Here's how to define labels for the three variables
in our dataset:
Stata also lets you add notes to specific variables using the command notes varname: text. Note
that the command is followed by a variable name and then a colon:
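As a sketch of these documentation commands, the example below labels the auto dataset shipped with Stata. The dataset and the label and note texts are assumptions chosen for illustration, not the data used in this guide.

```stata
sysuse auto, clear
label data "Example car data (illustration only)"
notes: labels below were added for illustration
label variable mpg "Mileage in miles per gallon"
label variable weight "Weight in pounds"
notes mpg: hypothetical note attached to the variable mpg
notes                       // display all notes stored with the data
describe mpg weight         // shows the variable labels
```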
Value Labels
You can also label the values of categorical variables, for example sexHHH.
Stata has a two-step approach to defining labels. First you define a named label set which
associates integer codes with labels of up to 80 characters (244 in Stata SE), using the label
define command. Then you associate the set of labels with a variable, using the label values
command. Often you use the same name for the label set and the variable, as we did in our
example.
One advantage of this approach is that you can use the same set of labels for several variables.
The canonical example is label define yesno 1 "yes" 0 "no", which can then be associated with all
0-1 variables in your dataset, using a command of the form label values variablename yesno for
each one. When defining labels you can omit the quotes if the label is a single word, but I prefer
to use them always for clarity.
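The two-step approach can be sketched as follows. The small dataset, the coding 1 = Male and 2 = Female for sexhhh, and the variable owner are hypothetical.

```stata
clear
input hhid sexhhh owner
1 1 1
2 2 1
3 2 0
end
label define sexhhh 1 "Male" 2 "Female"   // step 1: define a named label set
label values sexhhh sexhhh                // step 2: attach it to the variable
label define yesno 1 "yes" 0 "no"         // a reusable set for 0-1 variables
label values owner yesno
tabulate sexhhh owner
```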
The most important Stata commands for creating new variables are generate/replace and recode,
and they are often used together.
The generate command creates a new variable using an expression that may combine constants,
variables, functions, and arithmetic and logical operators. Let's start with a simple example: here
is how to create setting squared:
You can also use drop var_names to drop one or more variables from the dataset.
The following table shows the standard arithmetic, logical and relational operators you may use
in expressions:
Here's how to create an indicator variable for countries with high-effort programs:
This is a common Stata idiom, taking advantage of the fact that logical expressions take the value
1 if true and 0 if false. A common alternative is to write
. generate adult = 0
. replace adult = 1 if age > 17
The two strategies yield exactly the same answer. Both will be wrong if there are missing values,
which will be coded as adult because missing value codes are very large values. You should
develop the good habit of avoiding open-ended comparisons. My preferred approach is to use
. generate adult = age > 17 if !missing(age)
which gives true for age above 17, false for age less than or equal to 17, and missing when age is
missing. Logical expressions may be combined using & for "and" or | for "or". Here's how to
create an indicator variable for age between 5 and 15:
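Both indicators can be sketched on a few made-up ages. The variable name school_age is hypothetical.

```stata
clear
input age
4
10
17
30
.
end
* 1 if adult, 0 if not, missing when age is missing
generate adult = age > 17 if !missing(age)
* 1 if age is between 5 and 15 inclusive, combining two conditions with &
generate school_age = (age >= 5 & age <= 15) if !missing(age)
list
```

The if !missing(age) qualifier keeps the new variables missing where age is missing, rather than coding those cases as 0 or 1.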
Functions
Stata has a large number of functions, here are a few frequently-used mathematical functions,
type help mathfun to see a complete list:
min(x1,x2,...,xn) the minimum of x1, x2, ..., xn, ignoring missing values
These functions are automatically applied to all observations when the argument is a variable in
your dataset.
Stata also has a function to generate random numbers (useful in simulation), namely uniform(). It
also has an extensive set of functions to compute probability distributions (needed for p-values)
and their inverses (needed for critical values), including normal() for the normal cdf and
invnormal() for its inverse; see help density functions for more information. To simulate normally
distributed observations you can use
rnormal() // or invnormal(uniform())
There are also some specialized functions for working with strings, see help string functions, and
with dates, see help date functions.
Recoding Variables
The recode command is used to group a numeric variable into categories. Suppose for example a
survey has age in single years, and you would like to code it into 5-year age groups.
Each expression in parentheses is a recoding rule, and consists of a list or range of values, followed
by an equal sign and a new value. A range, specified using a slash, includes the two boundaries,
so 15/19 is 15 to 19, which could also be specified as 15 16 17 18 19 or even 15 16 17/19. You
can use min to refer to the smallest value and max to refer to the largest value, as in min/19 and
44/max. The parentheses can be omitted when the rule has the form range=value, but they usually
help make the command more readable.
Values are assigned to the first category where they fall. Values that are never assigned to a
category are kept as they are. You can use else (or *) as the last clause to refer to any value not
yet assigned. Alternatively, you can use missing and nonmissing to refer to unassigned missing
and nonmissing values; these must be the last two clauses and cannot be combined with else.
You can also specify value labels in each recoding rule. This is simpler and less error prone than
creating the labels in a separate statement. The option label(label_name) lets you assign a name
to the labels created (the default is the same as the variable name).
It is often a good idea to cross-tabulate original and recoded variables to check that the
transformation has worked as intended.
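A recode into 5-year age groups might look like the sketch below. The ages, the group boundaries and the variable name agegroup are made up for illustration; the /// continuation requires running the lines from a do file.

```stata
clear
input age
16
23
37
44
.
end
recode age (15/19 = 1 "15-19") (20/24 = 2 "20-24") (25/29 = 3 "25-29") ///
           (30/34 = 4 "30-34") (35/39 = 5 "35-39") (40/44 = 6 "40-44"), ///
           generate(agegroup) label(agegroup)
tabulate age agegroup, missing      // check the recoding against the original
```

Using generate() leaves the original age variable untouched, which makes the cross-tabulation check possible.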
The clear command deletes the data currently held in memory, along with any value labels you
might have. Stata also keeps other objects in memory, such as matrices, scalars, and stored
results. Typing clear all removes these objects as well, ensuring that you start
with a completely clean slate.
To delete cases and variables the drop command is used, e.g. drop varlist for deleting
variables and drop if exp or drop in range [if exp] for deleting observations
(cases).
Similarly, you may specify the variables or observations you want to keep, using the keep command,
e.g. keep varlist for keeping variables and keep if exp or keep in range [if
exp] for keeping observations (cases).
After importing data into Stata from external sources, variables containing only numbers
are sometimes saved in string format. Thus, no arithmetic operations can be performed
with such a variable. The conversion can then be performed with the destring command:
destring [varlist], {generate(newvarlist)|replace}
Variables that are already numeric will not be changed. If any non-numeric character is
found in a variable, that variable will not be changed either. It might be necessary to remove
or replace non-numeric characters before converting the string variable into numeric format
(e.g., a comma used as the decimal separator instead of a period).
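One way to handle stray characters is the ignore() option of destring, sketched here on made-up values in which the comma is a thousands separator. The variable names income_str and income are hypothetical.

```stata
clear
input str8 income_str
"1200"
"3,500"
"980"
end
* strip the commas, then convert; the original string variable is kept
destring income_str, generate(income) ignore(",")
list
```

Without ignore(","), destring would leave income_str unconverted because of the value "3,500".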
Data Merging
Adding cases
The append command joins another dataset to the end of the one currently in use. If the data
type of a variable differs between the two datasets, Stata promotes the data type, and it keeps
all variables when the two datasets have differing variable names:
Syntax: append using filename.dta [, options]
E.g.: append using newdata.dta
Adding variables
Datasets sharing the same kind of observations, but having different variables, can be
joined with the merge command. Then, the currently used dataset is extended with the
corresponding observations from one or more other files with the following options :
One-to-one merge
Syntax: merge 1:1 varlist using filename.dta [, options]
Many-to-one merge
Syntax: merge m:1 varlist using filename.dta [, options]
Many-to-many merge
Syntax: merge m:m varlist using filename.dta [, options]
The current/working and “using” datasets need to share at least one common variable, called
the primary key/matching variable. The match variable(s) is (are) defined in varlist.
The merge options handle the variables and observations in the incoming data
set.
The option keepusing(varlist) specifies the variables to be kept from the “using”
dataset. If keepusing() is not specified, all variables are kept.
The option keep(master match) causes merge to drop observations in the “using” dataset that
have no corresponding observation in the working data set (in the older, pre-Stata 11 merge
syntax this was the nokeep option).
After merging, Stata automatically generates a variable _merge which contains information
about the matching of the data with the following codes:
1 Observations only from active dataset
2 Observations only from “using” dataset(s)
3 Observations from active and “using” dataset(s)
4 Observations from both, missing values updated
5 Observations from both, conflicting nonmissing values
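A one-to-one merge can be sketched with two small made-up datasets that share the key hhid. All names and values here are hypothetical; run the lines together from a do file so that the temporary file name stays defined.

```stata
* the "using" dataset: incomes for households 1, 2 and 3
clear
input hhid income
1 680
2 2600
3 900
end
tempfile incomes
save `incomes'

* the working (master) dataset: towns for households 1, 2 and 4
clear
input hhid str10 town
1 "Harare"
2 "Bulawayo"
4 "Mutare"
end
merge 1:1 hhid using `incomes'
tabulate _merge
list, clean
```

Households 1 and 2 appear in both files (_merge equal to 3), household 4 only in the working dataset (1), and household 3 only in the using dataset (2).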
Data Exportation
Data can be exported from Stata to a tab-separated text file with outsheet using
filename.
Data can be exported from Stata to an Excel worksheet with export excel using
filename.
The putexcel command writes Stata expressions, matrices, tables, images, and returned results
to an Excel file. This allows you to automate exporting and formatting of, for example, Stata
estimation results. Learn more about the putexcel command using help putexcel.
The syntax of the graph commands is different from that of most other Stata commands.
A Stata command for creating graphs comprises two elements: the graph command and a
graph type.
The most common graphs in statistics are X-Y plots showing points or lines. These are available in Stata
through the twoway subcommand, which in turn has 31 sub-subcommands or plot types, the most
important of which are scatter and line.
For the twoway graph type, a plot type must also be specified. Here is an example with the
plot type scatter:
. graph twoway scatter hhsize income
For the twoway graph type, you can leave out graph to save typing, and for scatter and line
you can even leave out twoway. The following
commands are therefore identical to the one given above:
. twoway scatter hhsize income
. scatter hhsize income
The plot types of the twoway graph type can be overlaid. Here is an example with
scatter and lfit:
. graph twoway (scatter hhsize income) (lfit hhsize income)
Here both types are set in parentheses. However, you can also separate the plot types with ||
as in the following example, where we also leave out graph and twoway:
. scatter hhsize income || lfit hhsize income
Stata 10 introduced a graphics editor that can be used to modify a graph interactively.
Graph elements
The appearance of graphs is defined by a series of elements.
• Elements that control the display of data, including the shape, color, and size of
the “marker symbols”, as well as lines, bars, and other ways to display data.
• Elements that control the size and shape of the graph, including the “graph region”
and “plot region”. The graph region is the size of the entire graph, including titles,
legends, and surrounding text. The plot region is the space that can hold data
points. Just inside the plot region is the data region. The plot region’s size
Fitted Lines
Suppose we want to show the fitted regression line as well. In some packages you would need to
run a regression, compute the fitted line, and then plot it. Stata can do all that in one step using
the lfit plot type. (There is also a qfit plot type for quadratic fits.) This can be combined with the
scatter plot by enclosing each sub-plot in parentheses. (One can also combine plots using two
vertical bars, ||.)
Now suppose we wanted to put confidence bands around the regression line. Stata can do this
with the lfitci plot type, which draws the confidence region as a gray band. (There is also a
qfitci band for quadratic fits.) Because the confidence band can obscure some points we
draw the region first and the points later.
Note that this command doesn't label the y-axis but uses a legend instead. You could specify a
label for the y-axis using the ytitle() option, and omit the (rather obvious) legend using
legend(off).
graph twoway (lfitci yvar xvar) (scatter yvar xvar), ytitle("Y variable") legend(off)
There are options that apply to all two-way graphs, including titles, labels, and legends. Stata
graphs can have a title() and subtitle(), usually at the top, and a legend(), note()
and caption(), usually at the bottom, type help title_options to learn more. Usually a
title is all you need. Stata 11 allows text in graphs to include bold, italics, greek letters,
mathematical symbols, and a choice of fonts, type help graph text to learn more.
To add a legend identifying the linear fit and the 95% confidence interval, use the order(2
"linear fit" 1 "95% CI") suboption of legend() to label the second and first items, in that
order.
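In a do file the whole command can be written over several annotated lines. The sketch below assumes the auto dataset shipped with Stata and its variables mpg and weight, which are not the variables used in this guide.

```stata
sysuse auto, clear
twoway (lfitci mpg weight)                        /// confidence band and fitted line first
       (scatter mpg weight),                      /// then the points on top
       ytitle("Mileage (mpg)")                    ///
       legend(order(2 "linear fit" 1 "95% CI"))
```

Key 1 is the confidence band and key 2 the fitted line, both produced by lfitci; the scatter points (key 3) are left out of the legend simply by not listing them.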
The mlabel(varname) option labels the points using the text stored in another
variable.
The line plot allows you to specify more than one "y" variable, the order is y1, y2, ..., ym, x.
Alternatively, we could have used several line plots: (line y1 x) (line y2 x) etc.
Here we used three options, which as usual in Stata go after a comma: title, subtitle and
legend. The legend option has many suboptions; we used order to list the keys and their
labels, saying that the first line represents category1 and the second category2. To omit a key
you just leave it out of the list. To add text without a matching key use a hyphen (or minus sign)
for the key. There are many other legend options; see help legend_options to learn more.
Line Styles
Stata lets you control the line style in different ways. The clstyle() option lets you use a
named style, such as foreground, grid, yxline, or p1-p15 for the styles used by lines 1 to
15, see help linestyle. This is useful if you want to pick your style elements from a scheme,
as noted further below.
Alternatively, you can specify the three components of a style: the line pattern, width and color:
• Patterns are specified using the clpattern() option. The most common patterns are
solid, dash, and dot; see help linepatternstyle for more information.
• Line width is specified using clwidth(); the available options include thin, medium
and thick, see help linewidthstyle for more.
• Colors can be specified using the clcolor() option using color names (such as red,
white and blue, teal, sienna, and many others) or RGB values, see help
colorstyle.
Here's how to specify blue for whites and red for blacks:
Note that clcolor() is an option of the line plot, so we put parentheses round the line
command and inserted it there.
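A sketch of per-line colours follows. It assumes the uslifeexp dataset shipped with Stata and its variables le_wmale and le_bmale (life expectancy of white and black males); the text does not name the dataset, so this choice is an assumption.

```stata
sysuse uslifeexp, clear
twoway (line le_wmale le_bmale year, clcolor(blue red)),   ///
       title("U.S. male life expectancy")                  ///
       legend(order(1 "white" 2 "black"))
```

The colours in clcolor() are applied in the order of the y variables, so the first line is drawn in blue and the second in red.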
Scale Options
Graph Schemes
Stata uses schemes to control the appearance of graphs, see help scheme. You can set the
default scheme to be used in all graphs with set scheme scheme_name. You can also redisplay
the (last) graph using a different scheme with graph display, scheme(scheme_name).
To see a list of available schemes type graph query, schemes. Try s2color for screen
graphs, s1manual for the style used in the Stata manuals, and economist for the style used in
The Economist. Using the latter we obtain the graph shown at the start of this section.
Combining graphs
Stata allows you to combine as many graphs as you want into a joint graph. To do this,
you first save the individual graphs and then combine them using graph combine. We
will demonstrate this using a display of hhsize by income, shown separately for respondents
from any two districts:
. scatter hhsize1 income1, name(dist1, replace)
. scatter hhsize2 income2, name(dist2, replace)
. graph combine dist1 dist2
It is important to set the number of rows and columns in the combined graph. The individual
graphs are placed in the combined graph in rows and columns in a matrix-like fashion. The
positioning of the individual graphs depends on how many rows and columns the matrix has.
. graph combine dist1 dist2, rows(2)
The by() option displays separate graphs for each group defined by the variable in the
parentheses. If more than one variable is entered in the parentheses, graphs are provided
for every combination of the chosen variables.
. scatter hhsize income, by(sexhhh benstatus)
We use the command summarize followed by the names of the variables to get descriptive
statistics for our variables. Type the following command and note the difference:
. summarize
. summarize lexp gnppc
We compute a new variable using the generate command with a new variable name and an
arithmetic expression. For example to compute natural logs we use the built-in function log:
To run a linear regression of dependent variable on an independent variable use the regress
command, which lists the outcome followed by the predictors e.g
Stata omits observations that are missing the outcome or one of the predictors.
Following a regression or any estimation command you can retype the command with no
arguments to see the results again. Try typing reg.
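These steps can be sketched end to end with the lifeexp dataset shipped with Stata, which contains variables named lexp and gnppc. That this is the dataset behind the examples in the text is an assumption.

```stata
sysuse lifeexp, clear
summarize lexp gnppc
generate loggnppc = log(gnppc)     // natural log of GNP per capita
regress lexp loggnppc              // outcome first, then the predictor
regress                            // typing the command alone redisplays the results
```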
Post-Estimation Commands
Stata has a number of post-estimation commands that build on the results of a model fit. A
useful command is predict, which can be used to generate fitted values or residuals
following a regression.
. predict plexp
(option xb assumed; fitted values)
(5 missing values generated)
The command generates a new variable, plexp, that has the life expectancy predicted from
our regression equation.
A common task is to superimpose a regression line on a scatter plot to inspect the quality of
the fit. We could do this using the predictions we stored in plexp, but Stata's graph
command knows how to do linear fits on the fly using the lfit plot type, and can
superimpose different types of twoway plots. Try the command
It's hard not to notice the country on the bottom left of the graph, which has much lower life
expectancy than one would expect, even given its low GNP per capita. To find which country
it is we list the (names of the) countries where life expectancy is less than 55:
The keyword clean after the comma is an option which omits the borders on the listing.
Many Stata commands have options, and these are always specified after a comma.
Here we restricted the listing to cases where the value of the variable country was "United
States". Note the use of a double equal sign in a logical expression. In Stata x = 2 assigns the
value 2 to the variable x, whereas x == 2 checks to see if the value of x is 2.
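The post-estimation steps above might be run as in the sketch below, again assuming the lifeexp dataset shipped with Stata (variables country, lexp and gnppc). The choice of variables in the two list commands is illustrative.

```stata
sysuse lifeexp, clear
generate loggnppc = log(gnppc)
regress lexp loggnppc
predict plexp                                    // fitted values (option xb is the default)
twoway (scatter lexp loggnppc) (lfit lexp loggnppc)
list country lexp if lexp < 55, clean            // low life expectancy
list country gnppc lexp plexp if country == "United States", clean
```

Note the double equal sign in the last command, which tests for equality rather than assigning a value.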
PRACTICAL
                                                                  Sources of income
hhid  Sex of HHH  Town      Marital status  Opinion   Age of  Salary  Rentals  Petty  Other  Ownership  Weight  Height
                            of HHH                    HHH                      trade         of house
1     Male        Harare    Divorced        agree     28      160     75       680    1500   Yes        65      1.87
2     Female      Bulawayo  Married         disagree  45      640     160      2600   4500   Yes        73      1.74
3     Female      Mutare    Single          disagree  33      468     130      900    2800   Yes        57      1.64
4     Male        Bulawayo  Married         agree     38      537     140      1985   5000   No         66      1.52
5     Male        Bulawayo  Widowed         agree     51      615     180      2355   4800   Yes        82      1.48
6     Male        Mutare    Married         agree     37      512     490      1200   3000   No         94      1.81
7     Female      Harare    Married         disagree  41      530     100      720    1900   Yes        59      1.69
8     Female      Bulawayo  Divorced        agree     57      640     180      2100   5200   Yes        88      1.65
Create a do-file with all the commands for the following procedures:
(iii) Label all the variables and label values for all categorical variables
(vi) Construct bar charts for all the categorical variables, with bars showing percentages
(vii) Construct a stacked bar for the sources of income by town, inserting percentages in each
sector
(viii) Construct a scatter plot for age and salary and comment. Include the line of best fit with
95% confidence bands.
(ix) Calculate Pearson's product-moment correlation coefficient for age and salary and
comment
(x) Fit a linear regression between age and salary and comment.
(xi) Construct a clustered bar chart for town and marital status and comment
(xii) Split your data by marital status and produce a frequency table for opinion
(xiii) Determine the measures of central tendency for households whose household heads
(HHH) are aged between 40 and 60
(xv) Compute the total income and all the quartiles for total income and comment
(xvii) Recode age into a different variable with categories 21-30, 31-40, 41-50, 51-60,
61-70, 71-80 and produce a frequency distribution for this new variable.
(xviii) Compute the Body Mass Index (BMI) for each HHH, where $BMI = \frac{Weight}{Height^{2}}$
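As an illustration of the BMI formula only, the sketch below applies it to households 1 and 2 from the table. The variable names hhid, weight, height and bmi are hypothetical. The table does not state units; the values look like kilograms and metres, which is what the usual BMI definition expects, but that is an assumption.

```stata
* Enter weight and height for two households from the table (names are hypothetical)
clear
input hhid weight height
1 65 1.87
2 73 1.74
end
* BMI = weight divided by height squared
generate bmi = weight / height^2
list hhid weight height bmi, clean
```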
Rcd No.  Slsprn  Date  Make  Model  Amt paid  Rebates  Sales com
Q2.
A parent/guardian has decided to monitor how much their child spends during a certain
semester. The second semester will serve as the data collection period for studying expenditures.
The main aim is to learn from this semester and use that as a basis for the parent's advice to the
student (child) regarding future spending. During the semester under study, 60
expenditures were incurred. The parent's goal for the analysis is to determine how and
why expenditures occur over time.
1. What types of purchases are being made?
2. Are there spending patterns occurring during the week, month, and semester?
3. How are the payments of expenditures divided among the credit card and cash?
4. Finally, can some of the expenditures be identified as unnecessary?
The pair agree on 6 data fields to describe each observation:
(1) the number identifier of the week (1–15) for the 15-week semester in which
the expenditure occurs, (2) the date, (3) the weekday (Sunday = Sn, Monday = M,
etc.) corresponding to the date, (4) the amount of the expenditure in dollars,
(5) whether cash (C) or credit card (R) was used for payment, and finally, (6) one
of three categories of expenditure types: food (F), personal (P), and school (S).
Table 2.4 2nd semester university student expenses
Obs  Week  Date    Weekday  Amount ($)  Payment  Type
1    1     6-Jan   Sn       111.46      R        F
2    1     7-Jan   M        43.23       C        S
3    1     8-Jan   T        17.11       C        S
4    1     10-Jan  Th       17.67       C        P
5    2     13-Jan  Sn       107.00      R        F
7    2     14-Jan  M        33.91       C        P
8    2     17-Jan  Th       17.67       C        P
9    2     18-Jan  F        41.17       R        F
10   3     20-Jan  Sn       91.53       R        F
11   3     21-Jan  M        49.76       C        P
12   3     21-Jan  M        32.97       C        S
13   3     22-Jan  T        14.03       C        P
14   3     24-Jan  Th       17.67       C        P
15   3     24-Jan  Th       17.67       C        P
16   4     27-Jan  Sn       76.19       R        F
17   4     31-Jan  Th       17.67       C        P
18   4     31-Jan  Th       17.67       C        P
19   4     1-Feb   F        33.03       R        F
20   5     3-Feb   Sn       66.63       R        F
21   5     5-Feb   T        15.23       C        P
22   5     7-Feb   Th       17.67       C        P
23   6     10-Feb  Sn       96.19       R        F
24   6     12-Feb  T        14.91       C        P
25   6     14-Feb  Th       17.67       C        P
26   6     15-Feb  F        40.30       R        F
27   7     17-Feb  Sn       96.26       R        F
28   7     18-Feb  M        36.37       C        S
29   7     18-Feb  M        46.19       C        P
30   7     19-Feb  T        18.03       C        P
31   7     21-Feb  Th       17.67       C        P
32   7     22-Feb  F        28.49       R        F
33   8     24-Feb  Sn       75.21       R        F
34   8     24-Feb  Sn       58.22       R        F
35   8     28-Feb  Th       17.67       C        P
39   10    10-Mar  Sn       106.49      R        F
42   10    16-Mar  S        53.32       R        S
43   11    17-Mar  Sn       111.78      R        F
44   11    19-Mar  T        17.91       C        P
45   11    23-Mar  S        53.52       R        P
46   12    24-Mar  Sn       69.00       R        F
47   12    28-Mar  Th       17.67       C        P
48   13    31-Mar  Sn       56.12       R        F
49   13    1-Apr   M        48.24       C        S
50   13    4-Apr   Th       17.67       C        P
51   13    6-Apr   S        55.79       R        S
52   14    7-Apr   Sn       107.88      R        F
53   14    8-Apr   M        47.37       C        P
54   14    13-Apr  S        39.05       R        P
55   15    14-Apr  Sn       85.95       R        F
56   15    16-Apr  T        22.37       C        S
57   15    16-Apr  T        23.86       C        P
58   15    18-Apr  Th       17.67       C        P
59   15    19-Apr  F        28.60       R        F
60   15    20-Apr  S        48.82       R        S
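One possible way to get such data into Stata is to type the observations into a do-file with the input command. The sketch below enters only the first three rows of Table 2.4; the variable names (obs, week, date, weekday, amount, payment, category) are hypothetical, and the date is kept as text for simplicity.

```stata
* Enter the first three expenditures by hand (variable names are hypothetical)
clear
input obs week str6 date str2 weekday amount str1 payment str1 category
1 1 "6-Jan" "Sn" 111.46 "R" "F"
2 1 "7-Jan" "M" 43.23 "C" "S"
3 1 "8-Jan" "T" 17.11 "C" "S"
end
* Cross-tabulate payment method (C/R) by expenditure type (F/P/S)
tabulate payment category
* Summarize the amounts spent
summarize amount
```

With all rows entered, the same two commands give a first look at how payments are divided between cash and credit card and at the typical size of a purchase.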