20/09/23 13:59

*LDEMO2404 - Class 1

********************************
*PART 1: IMPORT AND EXPORT DATA
********************************

*1.1 Introduction to Stata


*************************

*do file example (pay attention to the different colours!)


********************************************************************
* with * you easily add a separate comment line
* with // you add a comment after the command
* comments should always be green

*1) clear the memory


*it is important to clear the memory before loading a data set into Stata
clear // clears the data from memory but keeps saved results
clear all // clears all information in memory
set more off // lets output scroll without pausing at --more--

*2) define directory


cd "path" // define directory where data set should be saved
*example: data path
cd "C:\Users\menasheoren\OneDrive - UCL\Bureau\"
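* sketch (not in the original class file): confirm which directory Stata is working in
pwd // displays the current working directory
dir // lists the files in that directory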

*3) open an example data set


sysuse lifeexp, clear // this command (sysuse) opens an example data set shipped with Stata
*to open all other datasets, we would write -- use filename.dta, clear --

*1.2 Check and format data: basics


*********************************
*check whether your data looks OK

describe // this command describes the variables in the data set


*we see that one variable (country) is a string variable, as it contains names.
*we see that some variables have notes. I do not recommend this feature, but let's check:
notes // contains a bit more information than what is in the variable names

browse // shows what the data look like; opens the Data Editor window
list // shows data in results window (only useful for small datasets or selected cases)
list in 1/10

summarize // gives mean, s.d., range


sum //abbreviated summarize command
sum, detail // gives additional summary statistics (incl. percentiles) for the variables
sum lexp // most often we do this only for one variable

help summarize // viewer file opens with information on sum command


search summarize //viewer file opens with search results

*label variables your own way


des // shows which variables need labels
label var safewater "pop access to safe water"

*label categorical data your own way


tab region // shows whether categories need (better) labels
label drop region //as the variable is already labelled, we first drop the labels
label define region 1 "Europe & C.Asia" 2 "North America" 3 "South America"
lab value region region
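* sketch (not in the original class file): verify that the new labels are attached
label list region // shows the value label named region
tab region // the table should now display the new category names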

https://moodle.uclouvain.be/pluginfile.php/516396/mod_resource/content/1/LDEMO2404_session1_dofile.do Page 1 sur 4


*1.3 Export the data


**************************************

*1) export data in Stata format


save lifeexp.dta, replace // saves data as a Stata data set in the directory we specified with 'cd'
/*alternatively*/ save "C:\Users\menasheoren\OneDrive - UCL\Bureau\lifeexp.dta", replace

*2) Export in excel format


*we can also save it in other formats, e.g. in Excel
export excel lifeexp.xlsx, firstrow(variables) replace
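* sketch (not in the original class file): a comma-separated text file is another common exchange format
* the file name lifeexp.csv is only an illustration
export delimited using lifeexp.csv, replace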

*1.4 Import the data


******************************

*now we can open this data in stata format from our directory
use lifeexp.dta, clear
/*OR*/ use "C:\Users\menasheoren\OneDrive - UCL\Bureau\lifeexp.dta", clear

*or we open the excel file again


import excel lifeexp.xlsx, firstrow clear

*import data in raw format


*if the file includes a string variable, precede the variable name with str# (# = maximum number of characters)
clear
/*example*/ infile str14 country setting effort change using "https://grodri.github.io/datasets/effort.raw"

***********************************
*Part 2 VARIABLE FORMATS AND LABELS
***********************************

* 2.1 Data type


*************************************
* open dataset: here we use an example dataset provided by Stata (="system")
* Note again: `sysuse' instead of `use' (no need of path)
sysuse lifeexp, clear

describe // storage types and labels of the variables


* Note: variables contain either numbers or strings
help datatype // to get more details about storage types

/*
EXTRA: Rounding matters! If variable contains integers with 9 digits or less
accuracy, store as longs
* Otherwise, store as doubles (16 digits of accuracy)
recast double id // changes storage type of identification number to double
* When in doubt (computing new variable), always store as double
* Then, save on space by reformatting all variables with the most economical format:
compress

* Display format is NOT the same as storage format


* Even if double, browse may display variable like this: 2.30470e+14. To display fully:
format id %20.0f //sets variable's output format. Now id is 230470003170001 */
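* sketch (not in the original class file): why float storage is risky for long identifiers
* float() rounds its argument to float precision, which holds only about 7 digits
display %12.0f float(123456789) // displays 123456792, not 123456789
display %12.0f 123456789 // calculations are done in double precision, so all digits are kept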

* 2.2 get fast overview over data


***********************
sysuse lifeexp, clear // open again the original data set
browse //shows data in editor window
list country lexp // like `browse', but in results window

*2.3 Structure data set


*************************
*we can sort the order of observations by variable(s)
*e.g. we can sort the countries by their life expectancy
list country lexp
sort lexp
list country lexp
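* sketch (not in the original class file): sort only orders ascending; gsort also allows descending order
gsort -lexp // highest life expectancy first
list country lexp in 1/5 // the five countries with the highest life expectancy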

*we can also re-order the variables in the dataset


*e.g. we want that country and life expectancy are the first variables
order country lexp, first

*2.4 Descriptive statistics


******************************
summarize lexp gnppc // gives observations, mean, s.d., range
summarize lexp gnppc, detail // gives also percentiles

mean lexp // gives the mean with its standard error and confidence interval


tabulate country //frequency table
tabulate country region //cross-tabulation

*commands can be abbreviated


sum lexp
tab country
* variable names can be abbreviated: v*
* or listed: v01-v05 (in the order of the dataset)

*Display a statistic for a subgroup


tab region
tab region, nolab //first step: get the category numbers
* Display categories with values AND labels
numlabel, add
tab region

* Warning! Missing values are not displayed in standard output (see later)
tab safewater
tab safewater, missing

mean popgrowth if region==1 // gives mean value for Europe and Central Asia (region 1)
mean popgrowth, over(region) //gives region-specific means
/*alternative*/
bysort region: summarize popgrowth, detail
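* sketch (not in the original class file): two more ways to get a compact table of statistics by group
tabstat popgrowth, statistics(mean sd n) by(region)
tabulate region, summarize(popgrowth)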

* IF restricts the scope of a command to those observations for which the logical expression is true
mean lexp if region==1 // gives mean value for Europe and Central Asia (region 1)
* IF is very often used!
* Display according to a condition
tab country lexp if lexp<70 // handy if crossing two categorical variables
list country lexp if lexp<70 // lists countries and their life expectancies
browse country lexp if lexp<70 // same as list, but in a separate window
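* sketch (not in the original class file): conditions can be combined with & (and) and | (or)
* the cut-off values below are arbitrary and only for illustration
list country lexp if lexp<70 & region==1 // low life expectancy within region 1
count if lexp<65 | lexp>78 // number of countries at either extreme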

* IN restricts the scope of the command to a specific observation range


list lexp in 1/10 //gives the values for the first ten observations
* Do not use this command, prefer browse

* Cross-tabulation of categorical variables


* Here we use a continuous variable (lexp) since the dataset has no categorical variable other than region
*Column, row, and cell percentages
tab lexp region, nof col // no absolute frequencies (nof), only column percents (col)
tab lexp region, nof row // no absolute frequencies (nof), only row percents (row)
tab lexp region, nof cell // no absolute frequencies (nof), only cell percents (cell)


*Missing values:
*******************************
*- basic missing values displayed as a dot "."
*- may distinguish different types with 26 additional codes denoted by .a to .z
*- Stata treats missing values as larger than any valid number: valid numbers < . < .a < ... < .z

tab gnppc, missing // check whether there are missing values in this variable
list country gnppc if missing(gnppc) // list countries with missing values in gnppc
/*OR (note: ==. matches only the system missing value ".", not .a to .z)*/
list country gnppc if gnppc==.
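* sketch (not in the original class file): because missing values count as larger than any number,
* a condition such as gnppc>20000 also selects observations where gnppc is missing
* the threshold 20000 is arbitrary and only for illustration
count if gnppc>20000 // includes missing values
count if gnppc>20000 & !missing(gnppc) // excludes missing values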

***********************
* Graphical association between 2 continuous variables
help graph twoway

* Scatter plot:
twoway scatter lexp gnppc
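* sketch (not in the original class file): the same scatter plot with a fitted line and axis titles
* the title texts are illustrative wording
twoway (scatter lexp gnppc) (lfit lexp gnppc), xtitle("GNP per capita") ytitle("Life expectancy at birth")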
