String functions

Search for a word in a string:
gen x = regexm(var1,"RSV")      // searches var1 for a direct match of "RSV": 1 if found, 0 if not

gen x1 = regexs(1) if regexm(name, "([a-zA-Z]+)[ ]*([a-zA-Z]+)")      // extract 1st word from string "name"

gen x2 = regexs(2) if regexm(name, "([a-zA-Z]+)[ ]*([a-zA-Z]+)")      // extract 2nd word from string "name" (an extra outer pair of brackets would shift the group numbers, so regexs(2) would return the 1st word)
gen x = strpos(var1,"Dr") > 0      // codes 1 if "Dr" appears in the variable, 0 if not
display strpos("stata","a")      // returns 3 (string position of the first letter a)
For tips on regexm code: http://www.ats.ucla.edu/stat/stata/faq/regex.htm : how to extract comp

Extract dates from strings (regexm works on strings, so the date must be held as text, eg "23jan2009")
gen yr  = regexs(0) if regexm(date,"[0-9]*$")      // extracts year from 23jan2009
gen mon = regexs(0) if regexm(date,"[a-zA-Z]+")    // extracts month from 23jan2009
gen dy  = regexs(0) if regexm(date,"^[0-9]+")      // extracts day from 23jan2009
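The regexm/regexs pattern can be tried on a tiny made-up dataset. This is only a sketch: the variable names (datestr, day, month, year, yearnum) and the two dates are invented for illustration, and the year pattern uses + instead of * so that it cannot match an empty string.

```stata
* Sketch: pull day, month and year out of a date held as a string
clear
input str12 datestr
"23jan2009"
"05mar2010"
end
gen day   = regexs(0) if regexm(datestr, "^[0-9]+")
gen month = regexs(0) if regexm(datestr, "[a-zA-Z]+")
gen year  = regexs(0) if regexm(datestr, "[0-9]+$")
* the extracted pieces are still strings; real() converts the numeric ones
gen yearnum = real(year)
list
```

If the aim is a proper Stata date rather than the separate pieces, date(datestr,"DMY") is usually the shorter route.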

Split a string into several variables at a defined split point, eg: "Diagnosis:cardiac" to "Diagnosis" and "cardiac"
split var1, p(:) limit(2)      // splits string "var1" at each colon ":", creating at most 2 new variables (set by "limit")

Extract letters from string (substr always takes three arguments: string, start, length)
substr(var1,1,1)      // first letter of string
substr(var1,1,2)      // first 2 letters of string
substr(var1,-1,1)     // last letter of string
substr(name,1,comma-1)      // extracts from start of name up to the first comma, where comma = strpos(name,",")
substr("abcdef",2,3) gives "bcd"
substr("abcdef",-3,2) gives "de"
substr("abcdef",2,.) gives "bcdef"
substr("abcdef",-3,.) gives "def"
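A short sketch of the comma example, combining strpos and substr. The names in the data and the new variable names (comma, surname, first1, last1) are made up for illustration; names that contain no comma would need separate handling.

```stata
* Sketch: take the text before the first comma, plus first and last letters
clear
input str20 name
"Smith, John"
"Jones, Mary"
end
gen comma   = strpos(name, ",")           // position of the first comma
gen surname = substr(name, 1, comma - 1)  // everything before the comma
gen first1  = substr(name, 1, 1)          // first letter
gen last1   = substr(name, -1, 1)         // last letter
list
```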

Isolate parts of string
gen x = strpos(diagnosis,"RSV") > 0      // codes 1 whenever the string "RSV" appears anywhere in the field, 0 otherwise
gen x = regexm(var1,"RSV")      // codes 1 if "RSV" appears in the var1 string

Join strings together (concatenate). concat() is a standard egen function, so egenmore.ado is not needed for it
egen initials = concat(x1 x2)
egen newvar = concat(x y z), punct(-)      // creates x-y-z; any punctuation character can go in the brackets of punct()

Convert String to number

encode ethnic_str, gen(ethnic)
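encode suits text categories: it numbers them alphabetically and keeps the text as value labels. For numbers that were merely read in as text, destring is the usual tool. A minimal sketch, where ethnic_str follows the line above and age_str and the data values are invented:

```stata
* Sketch: encode for categories, destring for numbers stored as text
clear
input str10 ethnic_str str5 age_str
"asian" "34"
"white" "51"
"asian" "27"
end
encode ethnic_str, gen(ethnic)    // asian = 1, white = 2, with value labels
destring age_str, gen(age)        // "34" becomes the number 34
list, nolabel
```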

Extract date (12/03/2009) from a long date format (12/03/2009 00:00:00)

gen x=substr(date,1,10)

Replace
replace myvar = myvar[_n-1] if missing(myvar)      // replace missing value with previous value
replace x = round(x,0.1)      // round a value to 1 decimal place (round(x,1) rounds to the nearest whole number)

replace x = 1 if _n <= 100      // replace first 100 values
replace x = 1 if (_N - _n) < 100      // replace last 100 values
replace x = 1 in 4      // replaces 4th value in x with 1
replace x = 1 in 5/25      // replaces values 5 to 25
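A small sketch of the carry-forward and rounding lines on invented data. Because replace works down the observations in order, a run of missing values is filled from the last non-missing value above it.

```stata
* Sketch: fill missing values from the row above, then round to 1 decimal
clear
input myvar x
1 2.345
. 3.14159
. 7.5
4 0.26
end
replace myvar = myvar[_n-1] if missing(myvar)   // rows 2 and 3 both become 1
replace x = round(x, 0.1)                       // 2.345 -> 2.3, 3.14159 -> 3.1
list
```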

Stats: Descriptive

summarize var1      // gives N, mean, SD, min and max
summarize var1, detail      // gives full breakdown of stats including percentiles
Values are stored and can be retrieved as r() results, eg r(mean), r(p75), r(p25), r(sd) (the percentiles need the detail option)
SEM = r(sd)/sqrt(r(N))
To see the return list type: return list
statsby mean=r(mean) sem=(r(sd)/sqrt(r(N))) size=r(N), by(group): summarize var1      // gives grouped stats; this replaces the data in memory
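The stored results can be inspected like this. The sketch uses the auto dataset that ships with Stata; the scalar name sem is an arbitrary choice.

```stata
* Sketch: use r() results left behind by summarize
sysuse auto, clear
summarize mpg, detail
return list                                 // shows everything that was stored
display "mean = " r(mean) "  sd = " r(sd) "  p25 = " r(p25) "  p75 = " r(p75)
scalar sem = r(sd)/sqrt(r(N))               // standard error of the mean
display "SEM = " sem
```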

Stats for all groups of categorical variable

table x, contents(freq mean var1 sd var1)
table x, contents(freq median var1 p25 var1 p75 var1)      // stats on categorical variable x including quartiles
table x y, contents(freq mean var1)      // stats on two categorical variables (x and y)
tabi x y \ z w, column chi2 exact      // quick way to get categorical stats, where x y z w are the four cell counts and \ separates the rows
table slow class ethnic, contents(mean days sd days) by(gender) format(%4.1f)

Stats on multiple variables grouped using by command

tabstat age weight height, statistics(n mean sd) by(sex) format(%4.1f)
tabstat x, by(group) stat(n)
tabstat x if y==1, by(year) stats(median min max)

How to make stats per quartile of a variable

xtile x = age, nq(4)      // x = 1 to 4 marks the quartile of age
tabstat age, statistics(n mean sd) by(x)
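A sketch of the quartile recipe on the auto dataset that ships with Stata. The variable name wq is made up; min and max are added so that the range of each quartile is visible.

```stata
* Sketch: split weight into quartiles, then summarise within each quartile
sysuse auto, clear
xtile wq = weight, nq(4)                                  // wq = 1, 2, 3 or 4
tabstat weight, statistics(n mean sd min max) by(wq) format(%9.1f)
```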

Populate a new variable with stats by a grouped categorical variable

bysort group: egen var_new = median(y)
bysort id: egen median_weight = median(weight)
bysort yr: egen x = rank(lov)
bysort group: egen var_new = count(y)
bysort group: egen var_new = total(y)      // sum (older versions call this sum(y))
bysort group: egen var_new = pctile(y), p(75)      // ie 3rd quartile
bysort yr: egen x = mean(y)
bysort yr: gen z = x - y if y==4      // plain arithmetic uses gen, not egen; the if condition cannot refer to the new variable z
bysort group: tabstat los, by(sex) statistics(count mean)
bysort yr: gen ranked_set = _n      // sequential case number by group eg year (gen, not egen)
(each line is a separate example: give var_new a different name each time)
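A sketch of the same idea on the auto dataset that ships with Stata, with foreign as the grouping variable. All new variable names are invented for illustration.

```stata
* Sketch: group-level stats written back onto every observation
sysuse auto, clear
bysort foreign: egen med_mpg = median(mpg)
bysort foreign: egen p75_mpg = pctile(mpg), p(75)
bysort foreign (mpg): gen seq = _n       // 1, 2, 3 ... within each group, ordered by mpg
gen delta = mpg - med_mpg                // deviation from the group median
list foreign mpg med_mpg p75_mpg seq delta in 1/5
```

bysort sorts first, so unlike by it does not need the data to be sorted beforehand.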

Make a new dataset of descriptive stats using collapse function

collapse (p25) x, by(group)
collapse (mean) x y
Statistics that can be used include mean, median, p50, p75, p25, sd, semean, sum, count, rawsum, min, max

Cumulative sum by group

bysort group: gen tot = sum(x)      // running total of x within each group
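A sketch of the running total on invented data. The helper variable order is an assumption added here: it fixes the order of observations inside each group, which sorting on group alone does not guarantee.

```stata
* Sketch: cumulative sum that restarts in every group
clear
input group x
1 5
1 3
1 2
2 10
2 1
end
gen order = _n                               // remember the original order
bysort group (order): gen tot = sum(x)       // group 1: 5 8 10; group 2: 10 11
list
```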

Highest value record

egen high = record(wage), by(group) order(yr)      // record() is an egenmore function

Count distinct values in a variable
bysort x y: generate count = (_n==1)
by x: replace count = sum(count)
by x: replace count = count[_N]      // number of distinct values of y within each x
Other method:
egen tag = tag(x y)
egen count = total(tag), by(x)
or (needs egenmore):
egen count = nvals(y), by(x)
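The tag method can be checked on a few invented rows. The name ndistinct is an arbitrary choice.

```stata
* Sketch: number of distinct y values within each x
clear
input x y
1 1
1 1
1 2
2 3
2 3
end
egen tag = tag(x y)                      // 1 for the first row of each x-y pair
egen ndistinct = total(tag), by(x)       // x = 1 has 2 distinct y, x = 2 has 1
list
```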

Stats: Basic stuff
Basic stats:
good reference: http://www.ats.ucla.edu/stat/stata/whatstat/whatstat.htm

summarize x
summarize x, detail      // this gives details such as percentiles

Stats by group:

bysort yr: tabstat bedays, by(mo) statistics(median p25 p75)
table mo yr, contents(median bedays)
table x, contents(mean variableX sd variableX count variableX)

GENERATE median value for each SUBGROUP

by subgrp, sort: egen medstay = median(los)

GENERATE the deviation from the median length of stay

generate deltalos = los - medstay

SUBGROUP AGGREGATES (eg by month)

by mnth, sort: egen monthmedn = median(daymax)
by mnth, sort: egen monthmax = max(daymax)
by mnth, sort: egen monthmin = min(daymax)
by mnth, sort: egen month25 = pctile(daymax), p(25)
by mnth, sort: egen month75 = pctile(daymax), p(75)

Frequency distributions

tab x y
tab x y, row      // gives % for row (can also use column)
tab x y, chi2     // for chi squared (use exact for Fisher's)

Fisher's or chi squared: use the cci command. cci 12 23 24 56, exact will give this output:

                                                         Proportion
                 |   Exposed   Unexposed  |      Total     Exposed
-----------------+------------------------+------------------------
           Cases |        12          23  |         35       0.3429
        Controls |        24          56  |         80       0.3000
-----------------+------------------------+------------------------
           Total |        36          79  |        115       0.3130
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
      Odds ratio |         1.217391       |    .4710021     3.04953 (exact)
 Attr. frac. ex. |         .1785714       |   -1.123133    .6720806 (exact)
 Attr. frac. pop |         .0612245       |
                 +-------------------------------------------------
                                  1-sided Fisher's exact P = 0.4023
                                  2-sided Fisher's exact P = 0.6669

tab x y, chi2 nof column row      // nof does not show frequencies; other stats include exact for Fisher's

tabi 54 43 \ 56 78, column chi2      // chi2 from the raw cell counts
tab1 x y z      // produces one-way frequency tables for multiple variables
tab1 varx-varz
by group, sort: tab x y, nofreq col chi2
tab3way is a good ado file for multi-column tables

Odds ratio
cci 21 16 1 4, exact

exactcci 21 16 1 4, exact
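The odds ratio in the cci output is the cross-product of the four cells, which can be checked by hand. The sketch below uses the 12 23 24 56 example from the output above and assumes cci leaves the odds ratio in r(or) and the 2-sided Fisher's P in r(p_exact); return list after cci shows what is actually stored.

```stata
* Sketch: compare the hand-computed odds ratio with the one cci reports
cci 12 23 24 56, exact
display "odds ratio by hand  = " (12*56)/(23*24)
display "odds ratio from cci = " r(or)
display "2-sided Fisher's exact P = " r(p_exact)
```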
          

Diagnostic tests (sensitivity, specificity, NPV, PPV): use module "diagt"

diagti 80 17 11 44

Showing means and descriptive stats in tables

tab x, summ(y)      // shows basic stats (mean sd freq) of y for groups in x
tab x1 x2, summ(y) means      // two-way table of x1 and x2 with means of y
table x1 x2, contents(mean y1 median y2)
In tabstat, stats(q) gives the quartiles (p25 p50 p75) and stats(iqr) gives the interquartile range

for tab command:

, cell      gives % for each cell
, expected      gives expected frequencies
, generate(new)      creates dummy variables, eg. tab x1, gen(dummy) produces dummy1, dummy2, ... for the categories of x1
, lrchi2      likelihood-ratio chi squared
, missing
, nofreq
, nolabel

Multi-table frequencies
table y x2 x3, by(x5 x6) contents(freq)
by x3, sort: tab x1 x2, exact      // this is a 3 way table

other stats for contents:

freq, mean, sd, sum, rawsum, count, n, max, min, median, iqr, p1, p99, p75
use format: table x y, contents(mean z median z)
tab died yr, summ(pim) means      // two-way table of means
table x, contents(mean z)

Table function with contents

tab group cs, sum(los) means
table group cs, contents(count los median los)
table inhouse yr, contents(count los median los)
table band2 yr, contents(count los)
table age yr, contents(count los)
table x y, by(z)

Confidence intervals
ci x, level(99)      // produces 99% confidence interval for the mean of x

TABSTAT
tabstat x, stats(….)

tabstat pop, stat(mean) by(size)
tabstat lov_days, by(yr) stat(mean sd min max) nototal long
tabstat lov_days, by(yr) stat(n q) nototal long
tabstat x, stat(mean count median) by(var2)
tabstat x, stat(count mean q) by(y)      // q gives the quartiles (p25 p50 p75)
by x: tabstat y if z==2 & q==0, statistics(n mean q)
record(x) is highest value of x (egenmore function)

these are the options between brackets:
mean        mean
count       count of nonmissing observations
n           same as count
sum         sum
max         maximum
min         minimum
range       range = max - min
sd          standard deviation
semean      standard error of mean = sd/sqrt(n)
skewness    skewness
kurtosis    kurtosis
median      median (same as p50)
p1          1st percentile
p5          5th percentile
p10         10th percentile
p25         25th percentile
p50         50th percentile (same as median)
p75         75th percentile
p90         90th percentile
p95         95th percentile
p99         99th percentile
iqr         interquartile range = p75 - p25
q           equivalent to specifying "p25 p50 p75"

OTHER BASIC STATS TESTS
Skewness test

sktest x
swilk x      // Shapiro-Wilk
ladder x     // tries the ladder of powers, testing each transform for normality
gladder x    // plots the distribution of x under the various transforms

Parametric One sample t-test:

ttest write=50 (does mean differ from 50)

Non Parametric One sample (Wilcoxon signed-rank test)

signrank write=50 (eg. does median differ from 50)

Binomial test

bitest female=.5 (eg. does proportion differ from 50%)

Parametric Two independent samples t-test

ttest x, by(group)

Non Parametric Mann-Whitney test

ranksum x, by(group)

Parametric Paired t test

ttest x = y

Non parametric Paired (Wilcoxon)

signrank x=y

Parametric One way anova: anova x y

Non Parametric Kruskal-Wallis: kwallis x, by(y)
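The tests above can be run side by side on the auto dataset that ships with Stata. This is only a sketch: the choice of mpg, foreign and rep78, and the comparison value 20, are arbitrary.

```stata
* Sketch: parametric tests next to their non-parametric counterparts
sysuse auto, clear
ttest mpg == 20                  // one sample: does mean mpg differ from 20
signrank mpg = 20                // one sample, non-parametric
ttest mpg, by(foreign)           // two independent samples
ranksum mpg, by(foreign)         // Mann-Whitney
oneway mpg rep78                 // one-way anova across the rep78 groups
kwallis mpg, by(rep78)           // Kruskal-Wallis across the same groups
```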

Date stuff

Import dates from excel in dd/mm/yyyy hh:mm format (eg 12/03/2009 12:33)

gen double dt = clock(datevariable,"DMYhm")
format dt %tc
label variable dt "Date"

To convert clock date (dd/mm/yyyy hh:mm) to dd/mm/yyyy (eg 12/03/2001)

gen new_date = dofc(dt)
format new_date %td
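Both steps can be tried on a couple of invented date strings. The variable names follow the lines above; the data values are made up. The clock value is stored as double because milliseconds since 1960 do not fit accurately in a float.

```stata
* Sketch: string -> clock value (%tc) -> daily date (%td)
clear
input str16 datevariable
"12/03/2009 12:33"
"25/12/2009 08:05"
end
gen double dt = clock(datevariable, "DMYhm")
format dt %tc
gen new_date = dofc(dt)          // drops the time of day
format new_date %td
list
```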

To subtract dates with result in hours: (dates in %tc format, since hours() works on milliseconds)

gen double hrs = hours(end_dt - start_dt)
gen los_days = hrs/24
format los_days %9.1f
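A worked example of the subtraction on one invented admission. The string variables start_s and end_s are hypothetical names for the raw fields.

```stata
* Sketch: length of stay in hours and days from two clock values
clear
input str16 start_s str16 end_s
"12/03/2009 08:00" "13/03/2009 20:00"
end
gen double start_dt = clock(start_s, "DMYhm")
gen double end_dt   = clock(end_s, "DMYhm")
gen double hrs = hours(end_dt - start_dt)    // 36 hours
gen los_days = hrs/24                        // 1.5 days
format los_days %9.1f
list hrs los_days
```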

To generate dd/mm/yyyy date:

gen birth = date(dob,"DMY")
format birth %td
label variable birth "Date of Birth"

To convert STRING date in long format (12/03/2009 00:00:00) to short format (12/03/2009)

gen x = substr(date,1,10)      // this keeps first 10 characters of string
then use date(x,"DMY") to turn the string into a %td date

Script to convert clock date to multiple formats
gen double date_tc = clock(x,"DMYhm")   // format structure is 12/03/2006 12:30
gen date_td = dofc(date_tc)             // convert to 12/03/2006 DAY/MONTH/YEAR format
gen date_tm = mofd(date_td)             // convert to 2006/03 YR/MONTH format

gen before = cond(hired_on < td(15jun2004), 1, 0) if hired_on < .

drop if admitted_on < tc(15jun2004 12:00:00)

Date stuff (stata 11)

Dates are counted from 1 jan 1960

Format   Description
%td      daily
%tw      weekly
%tm      monthly
%tq      quarterly
%th      half-yearly
%ty      yearly

generate y = date(doa, "DMY")
format y %td

Date field in "12/03/2009 23:34" format: use clock function (1 unit = 1 millisecond)
gen double dt_clock = clock(datevariable,"DMY hm")
To convert clock format to date format use dofc
gen newdate = dofc(dt_clock)      // ie 12/03/2009
Date in "12/03/2009" format: use date function (1 unit = 1 day)
gen y = date(x,"DMY")
format y %td      // format %tc gives hh:mm as well (for clock values)

if year is in format 09 instead of 2009, precede the Y by the century, eg "DM20Y"
if dates span two centuries (eg 98 for 1998 and 00 for 2000), give the latest possible year as a third argument, eg date(x,"DMY",2020): each 2-digit year becomes the latest matching year that is not after 2020

       

generate birthday = mdy(month,day,year)
generate m = month(birthday)
generate d = day(birthday)
generate y = year(birthday)
dow(x)      // day of week (0 = Sunday)
generate weeks = diff/7
generate months = diff/30.5
generate years = diff/365.25

Other functions (string to date; the mask letters are upper case):
weekly(x,"WY")
monthly(x,"MY")
quarterly(x,"QY")
halfyearly(x,"HY")
yearly(x,"Y")

If three columns for each day, month and year use:
gen y = mdy(month,day,year)
gen x = mdy(x,y,z)

Function               Gives
mdy(month,day,year)    daily
yw(year,week)          weekly
ym(year,month)         monthly
yq(year,quarter)       quarterly
yh(year,half-year)     half-yearly

Translating to %td dates (DD/MM/YYYY)

dofw()   weekly to daily
dofm()   monthly to daily
dofq()   quarterly to daily
dofy()   yearly to daily

Translate from %td dates:

wofd()   daily to weekly
mofd()   daily to monthly
qofd()   daily to quarterly
yofd()   daily to yearly

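To reference dates:
reg x y if wk == tw(1999w10)
sum salary if qtr == tq(1998q4)
tab sex if yr == 2007
(wk, qtr and yr stand for your own %tw, %tq and %ty variables; the notes originally used the older short forms w(), q() and y())

To reference a range of dates use the tin() and twithin() functions (the data must be tsset):
reg y x if tin(01feb1998,01jun1998)
sum income if twithin(1990-1,1990-3)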

tin() includes the beginning and end dates, twithin() excludes them
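The difference between the two functions shows up in a simple count. The sketch builds an invented daily series of 200 days starting on 1 jan 1998; the variable name date is arbitrary.

```stata
* Sketch: tin() keeps the end points, twithin() drops them
clear
set obs 200
gen date = td(01jan1998) + _n - 1        // one observation per day
format date %td
tsset date
count if tin(01feb1998, 01jun1998)       // 1 feb to 1 jun inclusive: 121 days
count if twithin(01feb1998, 01jun1998)   // without the two end dates: 119 days
```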

Stata: Visual date displays

Format   Function that builds the value
%tc      mdyhms(M, D, Y, h, m, s)
%tc      dhms(td, h, m, s)
%tc      hms(h, m, s)
%td      mdy(M, D, Y)
%tw      yw(Y, W)
%tm      ym(Y, M)
%tq      yq(Y, Q)
%th      yh(Y, H)
%ty      Y

clock values (%tc) for data in format "12/03/2009 23:34": hh(x) gives the hours, mm(x) the minutes and ss(x) the seconds, eg gen h = hh(x)

gen x = mdy(m,d,y)      // or mdyhms(m,d,y,h,min,s) for a clock value
gen bdayday = day(bdaynew)
gen bdaydow = dow(bdaynew)
gen bdaymo = month(bdaynew)
gen bdayyr = year(bdaynew)

Convert date times (NOTE %tc counts milliseconds, %td counts days, %tm counts months)

%tc to %td use dofc(x)      ie from "12/03/2009 23:34" to "12/03/2009"
%td to %tm use mofd( )
%td to %tq use qofd( )

then can apply to the %td value: year( ) month( ) day( ) doy( ) halfyear( ) quarter( ) week( ) dow( )
doy( ) is day of year from 1-366; dow( ) = 0 is Sunday

Conditional date arguments

gen before = cond(adm<td(15jun2006),1,0)
list if !inrange(x,2,10)      // lists if not in range 2 to 10
list age if inrange(population,200,5000)
gen byte flag = inlist(x,"one","two")      // 1 if x is "one" or "two"
egen x = rcount(v1-v4), cond(@>5 & @<15)      // egenmore: counts how many of v1-v4 lie between 5 and 15
by year, sort: egen y = total(died)

generate bdaynew = date(bday,"MDY",2010)      // if data as 02/03/07
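How two-digit years are resolved can be checked directly with display. The date strings below are invented; the masks and the third argument are the ones described in these notes.

```stata
* Sketch: two ways of telling date() which century a 2-digit year belongs to
display %td date("12/03/09", "DM20Y")        // century given in the mask: 12mar2009
display %td date("12/03/98", "DMY", 2020)    // latest year ending 98 up to 2020: 12mar1998
display %td date("12/03/09", "DMY", 2020)    // latest year ending 09 up to 2020: 12mar2009
display %td date("02/03/07", "MDY", 2010)    // month first: 03feb2007
```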

Date script
gen x = substr(date,1,10) //keeps the dd/mm/yyyy part of the string date

TYPICAL DATE SCRIPT (assume date in string format as dd/mm/yyyy hh:mm:ss) and you w

drop date
gen date = date(x,"DMY") //converts date to %td format
format date %td
label variable date "Date"
gen yr = year(date)
gen mo = month(date)
gen day = day(date)
gen month_yr = mofd(date) //convert to month yr format
format month_yr %tm

FOR FULL CLOCK FORMAT use (assume date in string format as dd/mm/yyyy hh:mm:ss; run this on the original string, before it is dropped)
gen double date2 = clock(date,"DMYhms")
format date2 %tc

Graph tips
Histograms

histogram x, by(group, total) percent bin(10)
histogram x, frequency title("Graph1") xlabel(15(10)30) ytick(1(2)10) start(10) width(2) norm gap(...)

start is where the first bar begins, width is the bin size, norm overlays a normal curve

Scatterplots

graph twoway scatter x y      // can use xlabel, xtick, xtitle and also msymbol
msymbol() can be: O, o, D, d, T, t, S, s, +, smplus, X, x (add "h" for hollow eg Oh = hollow big circle)

twoway scatter x y [fweight = age], by(group) msymbol(oh) mlabel(id)      // allows bubble plot with marker size set by the weight
to format axis numbers eg 1.33 to 1.3 use: ylabel(,format(%3.1f))

Lineplots

graph twoway line x y year, legend(label(1 "label a") label(2 "label b") position(2) ring(0) rows(2))
this plots lines x and y against time (year); the legend is placed inside the graph (ring(0)) in the top right
can use xtick, eg xtick(1960(2)1980) for every 2 yrs
ylabel(0(10)50, angle(horizontal)) plots the labels for the y axis horizontally
clpattern is type of line and can be: solid, dash, dot, dash_dot, shortdash, shortdash_dot, longdash, blank
if two lines then specify each in the plot, eg msymbol(T Oh) clpattern(dash solid)
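A sketch of a two-line plot with the legend inside the plot region, using the uslifeexp dataset that ships with Stata. The label text, legend position and axis ranges are arbitrary choices, and the line pattern is given as lpattern(), the name newer Stata releases document for clpattern(). The /// continuation lines work in a do-file, not when typed at the command line.

```stata
* Sketch: male and female life expectancy against year
sysuse uslifeexp, clear
graph twoway line le_male le_female year, ///
    legend(label(1 "Males") label(2 "Females") position(5) ring(0) rows(2)) ///
    xtick(1900(10)2000) ylabel(40(10)80, angle(horizontal)) ///
    lpattern(dash solid)
```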

Barplots

Can do summaries, eg graph bar (median) x, over(group) blabel(bar, size(medium)) bar(1, bcolor(gs1
note: blabel puts the value on top of each bar
bar labels can vary in size: size(small) or tiny, medsmall, medlarge, large
Stacked bar graph: graph bar (sum) x y z, over(group) stack
graph hbar for horizontal bar graph

Other graphs:

qqplot x y
quantile x
qnorm x, grid

Mean /SEM type graph:

graph twoway rcap xlow xhigh year || connected z_mean year, legend(off)
if z_mean is mean and eac... xhigh respectively (eg after collapse command).
Can use ANOVA command to create mean/SEM graphs by using "predict" command after ANOVA:

anova income year
predict income_mean      // generates mean value for income
predict SEincome, stdp   // generates standard error for income
then use serrbar scale(2) to plot inc...
serrbar income_mean SEincome year, scale(2) addplot(line income_mean year, clpatter...

or do following: **from Statistics with Stata by Hamilton
anova income year gender*year
predict aggmean
predict SEagg, stdp
gen agghigh = aggmean + 2 * SEagg
gen agglow = aggmean - 2 * SEagg

graph twoway connected aggmean year || rcap agghigh agglow year, by(gender, legend(off) note(" "))

Overlapping graphs

graph twoway lfitci x y || scatter x y, xlabel(10(2)20) ylabel(2(10)20, angle(horizontal)) legend(label(2 "regression line") rows(3) position(1) ring(0))

Combining graphs
graph twoway x y ..............., saving(fig1)
graph twoway z x..............., saving(fig2)
graph combine fig1.gph fig2.gph, imargin(vsmall) rows(2)      // rows is number of rows on the combined graph

Graph time-series

tsset date_x, format(%td)      // sets data as daily, where date_x is the date variable
tssmooth ma newvar = x, window(2 1 2)      // generates a 5 day moving average (2 lagged, current and 2 leading values)
graph twoway line admissions date_x      // plots line graph of admissions vs date but x-axis looks...
graph twoway tsline admissions, ylabel(10(10)100) ttitle(" ") tlabel(01jan1983 01mar1983, grid) clwidth(thin) clpattern(solid)
using tsline, because data is tsset, one doesn't reference time; suppress the title with ttitle("")

Other way to generate moving average is egen (egenmore):
egen moving_av = ma(x), nomiss t(3)      // gives 3 day moving average if data is tsset to daily format
tssmooth nl command is better if outliers present

tssmooth nl x_smooth = x, smoother(4253h,twice)      // smooths the running median by different s... moving average of span 3 according to Velleman

Date: convert date 12/03/2008 12:33 to clock value (%tc)
gen double addate1 = clock(doa,"DMY hm") // for PICU data where date field is in format 12/03/09 12:34
format addate1 %tc
*generate date format ie DDMMYYYY from clock value above
gen addate = dofc(addate1)
format addate %td

String to time conversion. http://www.sealedenvelope.com/stata/time.php
str2time tod, ge...      eg: tod (string) ...

Convert decimal time to HH:MM format (24hrs = 1): converts a numeric variable containing elapsed times to a string variable containing times in 24 hour clock format (HH:MM or HH:MM:SS). http://www.sealedenvelope.com/stata/time.php
time2str etod, g...

Stats on multiple tables of variables by group. Can be count data or any other statistic
table x y z, cont... by(group)

_n manipulations: sequential count of observations in subgroups (eg count from observation 1,2,3,4 ... for every year, with each year starting at 1 again)
by yr, sort: gen subgroup...

Reshape data between wide and long formats
i = identifier, j = ...      eg: i=patient, j=...
wide: divides d...
reshape wide N... format to use as...      eg patient with t...
to move betwee...
reshape long x, ...
reshape wide x, ...

Start up do file
make file "profile... put it in root of S... In this do-file pu...

Web pages
http://www.sealedenvelope.com/stata.php
modules listed there: hl (Hosmer-Lem...), reformat (conver... publication qual...), time (utilities to t... format to elapse...), xcount (count clu...), xfill (fill in static v...), xtab (tabulate lon...)
http://www.cpc.unc.edu/research/tools/data_analysis/statatutorial/index.html (University Caro...)

Plot ANOVA graph
anova v1 gender year gender*year
predict vmean
label v1mean "m...
predict SEv, std...      // predicted mean...
gen vhigh = v1m...
gen vlow = v1me...
graph twoway c... rcap vhigh vlow ... ")) ytitle("mean...

Plot ANOVA graphs with error bars
anova drink yea...
predict drinkme...
label variable dr...
predict SEdrink, ...
serrbar drinkme... addplot(line drin... (off)

Plot ANOVA graphs across time
xtline info: http://www.ats.ucla.edu/stat/stata/faq/xtline.htm http://www.ats.ucla.edu/stat/stata/faq/visualize_longitudinal.htm
if data in long fo... (id time y as var...: xtline y, t(time) i...
if data in wide fo... variable eg id tim...: profileplot time1...      info: http://www.ats.u...

Bar graphs by Group
graph bar (sum) ...
Can use mean, (or any function...; xline(5) or y line...
graph bar (med... bar(1,bcolor(gs1...
blabel puts num...; greyscale can...; size up labe...
For second bar start: graph bar x y, ...

CATPLOT, BEAMPLOT graphs
beamplot x, by(...
catplot bar rep7...
catplot hbar rep...

Boxplot
graph hbox x, ov... intensity(30) ma...

Boxplot notes: intensity shadin...; xline in hbox ar...

How to "collapse" data and then plot summary values:
collapse (mean) dep (sd) sddep=dep (count) n=dep, by(visit group)
error bar plot:
sort group
gen high = dep...
gen low = dep...
twoway (rarea lo... (connected dep... by(group) legen... depression"))

How to refer to scalar in a graph: title (CUSUM G...
How to show r2 regression result in title of regression graph (ereturn gives r2):
reg x y
local r2: display ...
twoway (lfitci x y...

Histograms
histogram x, fre... ylabel(0(2)10) ti...
options: bin(3) percent gap(5) addlabel. Adds...; discrete; onebar...; norm. Overlies n... if histogram of t...
Histogram by group: by(group,total): H...
Histogram and boxplot on same graph. Install histbox.ado
Add linear regression line to scatterplot
Graph descriptive data with one command: sixplot module
ERROR BAR GRAPH: Draw graph with time on x-axis and mean/SE error bars or median and IQ... Needs module XTGRAPH. USEFUL
Stacked bar graph
Fill in missing gaps in variable in bulk

Index of other notes
Statistics by group (eg median, means, IQR)
Decode all missing values to a fixed number (eg 999)
Generate random numbers and mix up the numbers randomly after this
Removing duplicated observations in an entire file
Counting by groups
Count cluster data. XCOUNT: net from http://www.sealedenvelope.com/
tab at cluster level (for panel data)
Subtract the previous value in a running calculation
bysort command for grouped stats on a variable. Does not need sorting before the command is run
Sorting orders within groups (_N _n values)
Table of means/SD/count across groups: use oneway test
Calculate difference in 1 value from next or sum of next value
Counting variables by group
Tabulate ranges of each quartile for a variable using "xtile" to split data in quartiles, and tabstat to...
Create a list with descriptive stats (mean, n, median etc) for a list of items in a group using EGEN c... which is percentile, sd, rank. USEFUL
Setting highest value (record value) in a variable
web: new modules from...
Goodness of fit
T tests
Mann Whitney, Kruskal Wallis
Adding formatting to decimal places after the tab command for basic stats
How to tabulate summary values (eg median) by group
Odds ratio with Fisher's exact test, knowing the numbers per group, eg: 21 and 16 vs 1 and 4
Diagnostic tests (specificity and sensitivity and predictive values)
Chi squared with known proportions or Fisher's exact
Grouped table of stats in multiple rows using tabstat (medians, means etc). VERY USEFUL
Generate parameter data (intercept, slope, standard errors for both) for clustered longitudinal data (... Generates parameters for each cluster (id).
XCONTRACT: great way to generate count/percentage, cumulative percentage stats on grouped... Need to download module first from SSC
Convert string to numeric
Cumulative count by variable (_n)
Make dummy variables from a categorical variable, eg variable "size" has size=0, size=1, size=2. Useful in regression
List and define missing variables: use "codebook"
Conditional argument for string: if variable is string eg "alive" or "dead" then convert to 1,0
Tabstat (for statistics by groups)
Split characters off a string in different positions
Split the first word of a string before a delimiter, eg "John:Smith" to John where delimiter is ":". De...
Summarise missing values for variables
One to one merging (if each uniqueid has a matched uniqueid in another file)
Conditional arguments for strings
Saving log and commands panel
Replace contents of a string. FDTA module: fdta checks the string variables, searching for st1. Wh...
Coding variables into categorical groups
Quick stats (T tests, Mann Whitney etc)
Recode continuous variable into defined categories with cut command
Cumulative sum
Replace variable with previous value if a value is missing
Missing value in if command
Find variables in list
Recode variable
Cumulative sum by group
Number of distinct observations by group
Separating lines in list
List of TOTAL values by a group using EGEN command. USEFUL
Current dates and times
Generate a numerical sequence in an empty variable eg. 0,5,10...
Convert a string to a number
MISSING VALUES: Generate a variable to define if a list of variables have missing values
GROUP COMMAND: Group string or continuous data into distinct groups (integers 1,2,3 etc); 1st distinct value coded as 1, next as 2 and so on
Number lists examples
REPLACE MISSING value with the preceding value
LAG value by previous or next value
RECODE
Referencing SCALARS in variables or in graph titles
DIAGNOSTIC TESTS: specificity, sensitivity, predictive values
Bulk RENAMING VARIABLES: renvars module
STATSBY: produce tables of descriptive stats. Re-writes data into new dta tables
Fill a sequence (use egen)
Convert TIME in HH:MM (eg 12:33) to minutes
Count unique distinct values in a string where multiple duplicates may occur. Eg each patient has... admitted many times)
Create variable containing the median length of stay for each diagnostic code
Stats for multiple groups using SORT
Profile.do template file
Moving average
Breaking a categorical variable into a set of binary variables (make each categorical value a separa...

ERROR BAR AREA GRAPH after collapsing data to basic stats
collapse (mean) dep (sd) sdd...
error bar plot of this:
sort group
gen high = dep + 2*sddep/sqrt(n)
gen low = dep - 2*sddep/sqrt(n)
twoway (rarea low high visit, ...) (connected dep visit, mcolo... clcolor(black)), by(group) ...

GROUP command to combine a number of variables into a unique group
egen newid = group(var1 va...

Grouping code
egen faminc = sum(income), b...
egen faminc = max(income), b...

Pyramid type plot for day vs night shifts. Shows how to change graph axis label size
twoway (line beds_AM day)(l... ylabel(0(5)30, angle(horizont... legend(label(1 Day) label(2 N...

_N count for only maximal value of a grouped variable list (eg making a variable with maximal value per year per patient)
by patientid: gen x = _n
egen y = max(x), by(patientid)
gen var1 = yr if x == y

Scatterplot with regression line and r2 coefficient (SPARL module)
sparl x y
regress...

eperiod module: TIME DIFFERENC... calculate difference between...

Centering data
Return lists:

following the summarize command, the following list returns are available: r(N), r(sum), r(mean), r(sd), r(min)
To standardise values:
summarize age
gen age_c = (age - r(mean))/r(sd)

Foreach loop to centre data: in this example the variables los and los1 are centered, creating los_c and los1_c
foreach var of varlist los los1 {
    summarize `var', meanonly
    generate `var'_c = `var' - r(mean)
    label variable `var'_c "`var' (centered)"
}
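The same loop shape can standardise rather than centre. A sketch on the auto dataset that ships with Stata; the _z suffix is an arbitrary naming choice. meanonly is left out here because r(sd) is only stored by the full summarize.

```stata
* Sketch: z-scores for several variables in one loop
sysuse auto, clear
foreach var of varlist mpg weight {
    quietly summarize `var'
    generate `var'_z = (`var' - r(mean)) / r(sd)
    label variable `var'_z "`var' (standardised)"
}
summarize mpg_z weight_z      // means close to 0, SDs close to 1
```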

Loops
Basic loops that can be modified to perform many tasks
foreach variable in v1 v2 v3 {
    list `variable'
}

foreach number of numlist 1 2 3 {
    display `number'
}
forvalues i=1/3 {
    display `i'
}
The same output can also be produced using while as follows:
local i = 1
while `i'<=3 {
    disp `i'
    local i = `i' + 1
}
Looping through a variable list
local xlist na k cl
foreach v of varlist `xlist' {
    summ `v'
    correlate sbe `v'
    scatter sbe `v'
}
Looping through a variable for duplicates
foreach var of varlist dupl_tag {
    sort dupl day h_hour h_ph sample
    by dupl: gen x = h_na[_n-1] - h_na
    drop if x<0.05 & dupl_tag == `var'
    drop x
}
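A runnable sketch of the loop forms on the auto dataset that ships with Stata. The variables chosen and the display text are arbitrary. Local macros open with a left quote (`) and close with an ordinary apostrophe (').

```stata
* Sketch: foreach over variables, forvalues over numbers
sysuse auto, clear
foreach v of varlist price mpg weight {
    quietly summarize `v'
    display "`v': mean = " %9.2f r(mean) "  sd = " %9.2f r(sd)
}
forvalues i = 1/3 {
    display "pass `i'"
}
```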
