
Data Cleaning Management.

P1 - Printed on 25-Sep-23 1:55:54 AM



*************** Data Cleaning and Management.P1 ******************

* Datasets: Math2002Excel.xlsx, Math2002.dta, Math2003.dta, Math2004.dta

******************************************************************
/* Set the working directory, which should be the folder
in which all of the data files are stored. */
cd "D:\......\data_cleaning"


* Create a log file

log using "D:\......\data_cleaning.smcl", replace



******************************************************************
* Getting the first dataset loaded and ready for action
******************************************************************
clear all // clear Stata of any stray data or results

* Load in the data.
* You can use the menu under File->Import->Excel spreadsheet and follow
* the directions to specify that the first row holds the variable names.

* Or, do the following:
import excel ///
    "/Users/..../Data Cleaning and Management/Math2002Excel.xlsx", ///
    sheet("Math2002csv.csv") firstrow


* Inspect the data -- several different methods:
*******************************************************************
* "browse" opens up the Data Browser (the spreadsheet)
browse

summarize
describe // produces a table that identifies how the variables are stored
codebook, compact // produces a table with #observations, Mean, Min, Max, & Label


/* Note that three of the variables are shown in red in the Data Browser,
which means they are stored as strings.
For these three variables (district schname mrawsc), summarize reports zero observations */


summarize district schname mrawsc

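* A minimal sketch of a check that does not rely on the Data Browser colors:
* "ds" with the has(type string) option lists only the variables stored as
* strings. For this file it should list the three variables noted above.
ds, has(type string)
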
/* The first two -- district and schname -- are useful only for labeling things.
But mrawsc -- math raw score -- is crucial.
Let's take a closer look by tabulating the values */

tabulate mrawsc

/* Stata has listed the values in a funky order (0, 1, 10, 11 etc.) because
mrawsc is stored as a string, so its values sort as text.
And at the bottom there's a value called "NS" (no score on the test).
We want to change "NS" to "." (Stata's code for a missing number),
just as other variables have blanks for missing values.
Here's one approach: */

replace mrawsc="." if mrawsc=="NS"
tabulate mrawsc


/* When we do the above-mentioned "replace", Stata reports "223 real changes made".
And now, if all of the rest of the values for mrawsc are numbers, then we can convert the
variable from a string variable into a numeric one */
destring mrawsc, replace

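* A quick check of the conversion (a sketch, not one of the original steps):
* describe should now show a numeric storage type for mrawsc, and the count of
* missing values should include the 223 former "NS" entries, plus any cells
* that were already empty.
describe mrawsc
count if missing(mrawsc)
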
* district holds text, so destring will not work; encode creates a new
* numeric variable whose value labels are the district names
encode district, generate(district_school)


* Save your efforts
save "practice_convert_to_Stata_data", replace



******************************************************************
* Labels, notes, and names
******************************************************************

* First, let's add a dataset label
label data "Data Cleaning file"
save "practice_convert_to_Stata_data", replace

* Close it and open again, to see what this does
use "practice_convert_to_Stata_data", clear

* Look at the codebook to see how we might be more rigorous
codebook, compact

* Because we imported this from an Excel spreadsheet, there's a lot to do.
* Let's try to clean up a few things. First, label the variables:

label variable system "District Number"
label variable district "School District"
label variable school "School Number"
label variable schname "School Name"
* etc.

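* The section heading also mentions notes. A minimal sketch, assuming we want
* to record where the data came from and what was recoded (the note text is
* illustrative, not from the original file):
notes: Imported from Math2002Excel.xlsx
notes mrawsc: NS (no score) was recoded to missing before destring
notes    // lists every note attached to the dataset and its variables
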

* Label the values associated with some variables.
* This is typically a two-step process. For example:
* Step 1: define the label
label define femalelabel 1 "Female" 0 "Male"
* Step 2: apply the label to a variable
label values female femalelabel

* Check to see if the codebook is complete for that variable
codebook female

/*---------------------------------------------------------------
* Your turn for race

* labeling the values for the "race" variable
tab race
* (note: for race, many old state datasets had:
* 1=Asian
* 2=Black
* 3=Hispanic
* 4=Native American
* 5=White)
*




*---------------------------------------------------------------*/
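
* One possible answer to the exercise above, as a sketch. It ASSUMES that race
* is numeric and that this dataset uses the 1-5 coding listed in the note,
* which should be confirmed with "tab race" first. The label name racelabel is
* made up for illustration.
label define racelabel 1 "Asian" 2 "Black" 3 "Hispanic" 4 "Native American" 5 "White"
label values race racelabel
tab race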

******************************************************************
* Testing for Normality
******************************************************************

* Open a fresh dataset, already cleaned up:
use "Math2002.dta", clear

* Inspect the gpa variable to see if it is normal and appropriate for analysis
summarize gpa, detail
histogram gpa, normal

* Test for normality
pnorm gpa    // graphical check: standardized normal probability plot
swilk gpa    // Shapiro-Wilk test; the null hypothesis is that gpa is normal

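* A small sketch of how to read the Shapiro-Wilk result. swilk leaves its
* p-value in r(p); a small p-value (below 0.05 is a common convention, used
* here as an assumption) is evidence against normality.
swilk gpa
display "Shapiro-Wilk p-value for gpa: " %6.4f r(p)
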
/* Transform gpa using Tukey's ladder of powers:
Tukey, J. W. 1977. Exploratory Data Analysis. Reading, MA: Addison-Wesley. */
ladder gpa
* Unfortunately, the interpretation of the test leaves something to be desired

* Stata also produces a graph matrix of nine possible transformations
gladder gpa

* None of these are particularly good transformations, but let's try the square anyway
gen gpa_squared = gpa^2
hist gpa_squared, normal

* Let's get rid of that transformed variable
drop gpa_squared



******************************************************************
* Subsetting Data
******************************************************************
* The main reason to subset data is that you have a lot of observations or
* variables that you do not need for your analysis.

* Always be careful when deleting rows or columns, because
* these edits cannot be undone in Stata.

* If you do make a mistake, you will need to revert to your original file
* and start over.

* The "drop" and "keep" commands are what you would use here.



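* A minimal sketch of both commands. The conditions are illustrative only (the
* grade value 8 is an assumption, not taken from the data). preserve/restore
* wraps the example so the dataset in memory is put back afterwards.
preserve
    keep if female == 1                  // keep rows: female students only
    drop if grade != 8                   // drop rows: everything except grade 8
    keep studid grade female gpa mrawsc  // keep columns: only these variables
    drop gpa                             // drop columns: remove one variable
    describe
restore
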
******************************************************************
* Appending datasets
******************************************************************

* The "append" command stacks datasets on top of each other,
* creating a new dataset with the combined information:

append using "Math2003.dta"


* Do not overwrite the original file. Rather, save a new "combined" dataset

save "Math02_03.dta", replace

* Let's add the third dataset we have available
append using "Math2004.dta"

* And again, save a combined dataset, now with all three years:
save "Math02_03_04_LONG.dta", replace

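* A quick sanity check on the stacked file (a sketch; it assumes each student
* should appear at most once per year). tabulate shows how many rows came from
* each year, and duplicates report flags any repeated studid-adminyear pairs.
tabulate adminyear
duplicates report studid adminyear
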
* Sort the data by student ID and year
sort studid adminyear

* List the information for the first 10 rows of the sorted data
list in 1/10
* This might be somewhat hard to read. Maybe browse instead?

* Or possibly, list only some of the variables
list schname studid adminyear grade mrawsc female race in 1/10


*-----------------------------------------------------------------*
* PROBLEM! There are no values for race for the 2004 data!

* Here's a way to look and see what's going on:
tabulate race adminyear, miss
* This crosstabulates the two variables and includes missing values in the cell counts

* One solution is to ASSUME that students who identified their race
* previously can have that earlier value assigned again where it is missing.

* How do we replace the value of a variable with the value from another row?

* If we are confident of the first value of race (by year, in this case),
* then the following command will replace each missing race with the value
* from the same student's previous row
by studid (adminyear), sort: replace race = race[_n-1] if race == .


* However, one should also check whether there are students who are missing on a
* variable (like race) for all years

tab race, miss
/* Note, here, that 22 of the students (and 66 entries) still do not have a value of race in the
dataset (We knew this from our previous tabulation) */

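* The fill above rests on the assumption that a student's race does not change
* across years. A sketch of one way to check it (race_min and race_max are
* temporary names made up for this example): a nonzero count means some
* students have conflicting values.
bysort studid: egen race_min = min(race)
bysort studid: egen race_max = max(race)
count if race_min != race_max & !missing(race_min)
drop race_min race_max
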
*-----------------------------------------------------------------
* Your turn:
* Here's another problem that we'll need to solve:
* What's wrong with this picture?
histogram mrawsc
* And how might we solve this?
* Hint: normalization for mrawsc


*-----------------------------------------------------------------*

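* One possible reading of the hint above, as a sketch rather than the intended
* answer: if raw scores are not comparable across grades and years, they can
* be standardized within each grade-year group. The names mraw_mean, mraw_sd
* and mraw_z are made up for this example.
bysort adminyear grade: egen mraw_mean = mean(mrawsc)
bysort adminyear grade: egen mraw_sd = sd(mrawsc)
gen mraw_z = (mrawsc - mraw_mean) / mraw_sd
histogram mraw_z, normal
drop mraw_mean mraw_sd
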
******************************************************************
* Merging datasets
******************************************************************
* The merge command adds the variables from another dataset as new columns
* (on the right side) of our existing dataset.
* This is useful if our analysis requires one row of data for each individual
* (vs. one row per observation and 3 rows for each individual).

*** Clear the appended dataset
clear

*** Re-open the original dataset
use "Math2002.dta"

*--------------------------------------------------------------*
* PROBLEM --> Variables with the same name in two years will clash: merge keeps
* the values already in memory and discards the other year's values!!!
* So we need to rename all of the time-varying variables so that each
* name identifies the year it belongs to
* SOLUTION --> looping!

foreach var of varlist adminyear grade mteststa mrawsc mscaleds studstat ///
    frlunch title1 lep sped gpa {
    rename `var' `var'02
}

* Let's save this altered dataset as version 2
save "Math2002v2.dta", replace

* Open, edit, and then save the 2003 data
use "Math2003.dta", clear
foreach var of varlist adminyear grade mteststa mrawsc mscaleds studstat ///
    frlunch title1 lep sped gpa {
    rename `var' `var'03
}
save "Math2003v2.dta", replace

* Open, edit, and then save the 2004 data
use "Math2004.dta", clear
foreach var of varlist adminyear grade mteststa mrawsc mscaleds studstat ///
    frlunch title1 lep sped gpa {
    rename `var' `var'04
}

save "Math2004v2.dta", replace
*-------------------------------------------------------------------*

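* The three rename blocks above differ only in the year. A sketch of the same
* steps written as one outer loop over the two-digit years (the macro names yr
* and timevars are made up for this example). It rebuilds the same three v2 files.
local timevars adminyear grade mteststa mrawsc mscaleds studstat ///
    frlunch title1 lep sped gpa
foreach yr in 02 03 04 {
    use "Math20`yr'.dta", clear
    foreach var of varlist `timevars' {
        rename `var' `var'`yr'
    }
    save "Math20`yr'v2.dta", replace
}
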
use "Math2002v2.dta", clear


*** Merge the 2003 data
merge 1:1 studid using "Math2003v2.dta"

drop _merge    // must be dropped before the next merge creates it again

*** Merge the 2004 data
merge 1:1 studid using "Math2004v2.dta"


list studid grade* gpa* mrawsc* female in 1/10

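* A short sketch of how to inspect the result. After each merge, _merge records
* where a row came from: 1 = only in the data in memory (master), 2 = only in
* the file being merged (using), 3 = matched in both.
tabulate _merge
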



******************************************************************
* Reshaping datasets
******************************************************************

*** Return to the long dataset:

use "Math02_03_04_LONG.dta", clear

*** Reshape the long dataset into a wide dataset.
*** Every variable that varies by year must be listed before the comma.
reshape wide grade mteststa mrawsc mscaleds race studstat gpa, ///
    i(studid) j(adminyear)

save "Math02_04_WIDE_v2.dta", replace

* Once you've gone through the process once (in either direction),
* you can easily flip back and forth

reshape long
reshape wide
*....



******************************************************************
* Generating new variables
******************************************************************
* Create a dichotomous variable called regtest from the mteststa variable.
** A value of mteststa of 1 indicates a regular testing situation.
** All other values of mteststa indicate a non-standard testing situation

use "Math2002.dta", clear
*** Before conducting the transformations, take a look at the original variable:
tab mteststa

*** Generate the new variable
gen regtest=.

*** Make the rules that determine what the values of the new variable will be.
*** Stata treats missing as larger than any number, so a condition such as
*** mteststa>1 would also code missing values as 0; exclude them explicitly:
replace regtest=1 if mteststa==1
replace regtest=0 if mteststa!=1 & !missing(mteststa)

*** Check the frequencies of the new variable
tab regtest, missing
tab mteststa regtest, missing
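
* The same kind of variable can be built in one line (a sketch; regtest2 is a
* made-up name so that regtest is left untouched). The logical expression
* evaluates to 1 or 0, and the "if" leaves regtest2 missing wherever mteststa
* is missing.
gen byte regtest2 = (mteststa == 1) if !missing(mteststa)
tab regtest regtest2, missing
drop regtest2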
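
* A log opened at the top of a do-file should be closed at the end. capture
* keeps the command from stopping the do-file if no log happens to be open.
capture log close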
