Do_file_quan Ly Va Lam Sach Du Lieu

Data Cleaning Management.
P1 - Printed on 25-Sep-23 1:55:54 AM

*************** Data Cleaning and Management.P1 ******************

* Datasets: Math2002Excel.xlsx, Math2002.dta, Math2003.dta, Math2004.dta

******************************************************************
/* Set the working directory, which should be the folder
within which all of the data files are stored. */
cd "D:\......\data_cleaning"

* Create a log file
log using "D:\......\data_cleaning.smcl", replace

******************************************************************
* Getting the first dataset loaded and ready for action
******************************************************************
clear all // clear Stata of any stray data or results

* Load in data.
* You can use the menu under File->Import->Excel spreadsheet and follow the
* directions to specify that the first row contains variable names.

* Or, do the following:
import excel ///
    "/Users/..../Data Cleaning and Management/Math2002Excel.xlsx", ///
    sheet("Math2002csv.csv") firstrow

* Inspect the data -- Three different methods:
*******************************************************************
* "browse" opens up the Data Browser (the spreadsheet)
browse

summarize
describe // produces a table that identifies how the variables are stored
codebook, compact // produces a table with #observations, Mean, Min, Max, & Label

/* Note that three of the variables are shown in red in the Data Browser.
These three variables (district schname mrawsc) are stored as strings,
so summarize reports zero observations for them. */

summarize district schname mrawsc

/* The first two -- district and schname -- are useful only for labeling things.
But mrawsc -- math raw score -- is crucial.
Let's take a closer look by tabulating the values. */

tabulate mrawsc

/* Stata has listed the values in string sort order (0, 1, 10, 11 etc.).
And at the bottom there's a value called "NS" (no score on the test).
We want to change "NS" to missing (there are blanks for other variables).
Here's one approach: */

replace mrawsc="." if mrawsc=="NS"
tabulate mrawsc
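* Optional check, offered as a sketch: list any values of mrawsc that Stata
* cannot read as numbers. real() returns missing for strings it cannot parse,
* so if every other value is numeric this table should contain only ".".
tabulate mrawsc if missing(real(mrawsc))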

/* When we run the "replace" above, Stata reports "223 real changes made".
Now, if all of the remaining values of mrawsc are numbers, we can convert the
variable from a string variable into a numeric one. */
destring mrawsc, replace

* district is also a string variable; encode creates a labeled numeric version
encode district, generate(district_school)

* Save your efforts
save "practice_convert_to_Stata_data", replace

******************************************************************
* Labels, notes, and names
******************************************************************

* First, let's add a dataset label
label data "Data Cleaning file"
save "practice_convert_to_Stata_data", replace

* Close it and open it again to see what this does
use "practice_convert_to_Stata_data", clear

* Look at the codebook to see how we might be more rigorous
codebook, compact

* Because we imported this from an Excel spreadsheet, there's a lot to do.
* Let's try to clean up a few things. First, label the variables:

label variable system "District Number"
label variable district "School District"
label variable school "School Number"
label variable schname "School Name"
* etc.
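* The section heading mentions notes, so here is a minimal sketch of how they work.
* The note text below is illustrative wording, not part of the original file.
notes mrawsc: Values of NS (no score) were recoded to missing before destring.
notes list mrawsc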

* Label the values associated with some variables.
* This is typically a two-step process. For example:
* Step 1: define the label
label define femalelabel 1 "Female" 0 "Male"
* Step 2: apply the label to a variable
label values female femalelabel

* Check to see if the codebook is complete for that variable
codebook female

/*---------------------------------------------------------------
* Your turn for race

* Labeling the values for the "race" variable
tab race
* (Note: for race, many old state datasets had:
* 1=Asian
* 2=Black
* 3=Hispanic
* 4=Native American
* 5=White)

*---------------------------------------------------------------*/
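* One possible answer to the exercise above, assuming race in this file uses
* the coding listed in the note (check the output of "tab race" first).
* The label name racelabel is a hypothetical choice.
label define racelabel 1 "Asian" 2 "Black" 3 "Hispanic" 4 "Native American" 5 "White"
label values race racelabel
tabulate race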

******************************************************************
* Testing for Normality
******************************************************************

* Open a fresh dataset, already cleaned up:
use "Math2002.dta", clear

* Inspect the gpa variable to see if it is normally distributed and appropriate for analysis
summarize gpa, detail
histogram gpa, normal

* Test for normality
pnorm gpa // standardized normal probability plot
swilk gpa // Shapiro-Wilk test; the null hypothesis is that gpa is normally distributed
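* Two further checks that could be added here (a sketch, not part of the original file):
* qnorm plots quantiles of gpa against normal quantiles and is more sensitive to
* departures in the tails, whereas pnorm is more sensitive in the middle of the distribution.
qnorm gpa
* sktest tests normality based on skewness and kurtosis
sktest gpa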

/* Transform gpa using Tukey's ladder of powers: Tukey, J. W. 1977.
Exploratory Data Analysis. Reading, MA: Addison-Wesley. */
ladder gpa
* Unfortunately, the interpretation of the test leaves something to be desired

* Stata also produces a graph matrix of nine possible transformations
gladder gpa

* None of these are particularly good transformations, but here is one to try:
generate gpa_squared = gpa^2
histogram gpa_squared, normal

* Let's get rid of that transformed variable
drop gpa_squared

******************************************************************
* Subsetting Data
******************************************************************
* The main reason to subset data is that you have a lot of observations or
* variables that you do not need for your analysis.

* You should always be careful when deleting rows or columns because
* these edits cannot be undone in Stata.

* If you do make a mistake, you will need to revert to your original file
* and start over.

* The "drop" and "keep" commands are what you would use here.
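* A minimal sketch of subsetting, assuming the variables named below (which
* appear elsewhere in this file) exist in the data in memory.
* preserve/restore is one safeguard: restore brings back the data
* as they were when preserve was issued.
preserve
keep studid grade mrawsc female race    // keep only these variables (columns)
drop if missing(mrawsc)                 // drop observations (rows) with no raw score
summarize mrawsc
restore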

******************************************************************
* Appending datasets
******************************************************************

* The "append" command stacks datasets on top of each other,
* creating new datasets with the combined information:

append using "Math2003.dta"

* Rather than overwriting Math2002.dta, save the result as a new "combined" dataset
save "Math02_03.dta", replace

* Let's add the third dataset we have available
append using "Math2004.dta"

* And again, save a combined dataset, now with all three years:
save "Math02_03_04_LONG.dta", replace

* Sort the data by student ID and year
sort studid adminyear

* List the information for the first 10 rows of the sorted data
list in 1/10
* This might be somewhat hard to read. Maybe browse instead?

* Or possibly, list only some of the variables
list schname studid adminyear grade mrawsc female race in 1/10

*-----------------------------------------------------------------*
* PROBLEM! There are no values for race for the 2004 data!

* Here's a way to look and see what's going on:
tabulate race adminyear, missing
* This cross-tabulates the two variables and includes missing values in the cell counts.

* One solution is to ASSUME that students who identified their race previously
* can have their previous value assigned again if it is missing.

* How do we replace the value of a variable with the value from another row?

* If we are confident of the first value of race (by year, in this case),
* then the following command will replace each missing value of race with the
* value from the same student's previous year:
by studid (adminyear), sort: replace race = race[_n-1] if race == .

* However, one should also check whether there are students who are missing
* on a variable (like race) for all years

tabulate race, missing
/* Note, here, that 22 of the students (and 66 entries) still do not have a value of race in the
dataset (we knew this from our previous tabulation). */
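* A sketch for counting the students who have no race value in any year.
* n_race and one_per_student are hypothetical helper names.
bysort studid: egen n_race = count(race)     // number of nonmissing race values per student
egen one_per_student = tag(studid)           // flags exactly one row per student
count if n_race == 0 & one_per_student == 1
drop n_race one_per_student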

*-----------------------------------------------------------------
* Your turn:
* Here's another problem that we'll need to solve:
* What's wrong with this picture?
histogram mrawsc
* And how might we solve this?
* Hint: normalization for mrawsc

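* One possible approach, assuming the problem is that raw scores from different
* grades and years are not on a common scale: convert mrawsc to z-scores
* within each year and grade. The names mraw_mean, mraw_sd, and mraw_z are hypothetical.
bysort adminyear grade: egen mraw_mean = mean(mrawsc)
bysort adminyear grade: egen mraw_sd = sd(mrawsc)
generate mraw_z = (mrawsc - mraw_mean) / mraw_sd
histogram mraw_z
drop mraw_mean mraw_sd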

*-----------------------------------------------------------------*

******************************************************************
* Merging datasets
******************************************************************
* The merge command joins a dataset to the end (right side) of our existing dataset.
* This is useful if our analysis requires one row of data for each individual
* (vs. one row per observation and 3 rows for each individual).

*** Clear the appended dataset
clear

*** Re-open the original dataset
use "Math2002.dta"

*--------------------------------------------------------------*
* PROBLEM --> Values from one year will override the other years!!!
* So we need to rename all of the time-varying variables to identify
* the year each variable is associated with.
* SOLUTION --> looping!

foreach var of varlist adminyear grade mteststa mrawsc mscaleds studstat ///
    frlunch title1 lep sped gpa {
    rename `var' `var'02
}

* Let's save this altered dataset as version 2
save "Math2002v2.dta", replace

* Open, edit, and then save the 2003 data
use "Math2003.dta", clear
foreach var of varlist adminyear grade mteststa mrawsc mscaleds studstat ///
    frlunch title1 lep sped gpa {
    rename `var' `var'03
}
save "Math2003v2.dta", replace

* Open, edit, and then save the 2004 data
use "Math2004.dta", clear
foreach var of varlist adminyear grade mteststa mrawsc mscaleds studstat ///
    frlunch title1 lep sped gpa {
    rename `var' `var'04
}
save "Math2004v2.dta", replace

*-------------------------------------------------------------------*

use "Math2002v2.dta", clear

*** Merge the 2003 data
merge 1:1 studid using "Math2003v2.dta"

drop _merge

*** Merge the 2004 data
merge 1:1 studid using "Math2004v2.dta"
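* It is worth inspecting _merge after each merge (a brief sketch):
* 1 = observation only in the data in memory (master), 2 = only in the using file,
* 3 = matched in both.
tabulate _merge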

list studid grade* gpa* mrawsc* female in 1/10


******************************************************************
* Reshaping datasets
******************************************************************

*** Return to the long dataset:
use "Math02_03_04_LONG.dta", clear
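* reshape wide needs studid and adminyear to identify each row uniquely.
* A quick check, offered as a sketch: isid exits with an error if they do not
* (or if either variable contains missing values).
isid studid adminyear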

*** Reshape the long dataset into a wide dataset:
reshape wide grade mteststa mrawsc mscaleds race studstat gpa, ///
    i(studid) j(adminyear)

save "Math02_04_WIDE_v2.dta", replace

* Once you've gone through the process once (in either direction),
* you can easily flip back and forth

reshape long
reshape wide
*....

******************************************************************
* Generating new variables
******************************************************************
* Create a dichotomous variable called regtest from the mteststa variable.
** A value of mteststa of 1 indicates a regular testing situation.
** All other values of mteststa indicate a non-standard testing situation.

*** Before conducting the transformations, take a look at the original variable:
tabulate mteststa

*** Generate the new variable
generate regtest=.

*** Make the rules that determine what the values of the new variable will be:
replace regtest=1 if mteststa==1
* Missing values sort above every number in Stata, so exclude them explicitly
replace regtest=0 if mteststa>1 & !missing(mteststa)

*** Check the frequencies of the new variable
tabulate regtest
tabulate mteststa regtest
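* A compact alternative, shown as a sketch (regtest2 is a hypothetical name).
* The expression (mteststa == 1) evaluates to 1 or 0, so every nonmissing value
* other than 1 is coded 0; the "if" leaves regtest2 missing wherever mteststa is missing.
generate regtest2 = (mteststa == 1) if !missing(mteststa)
tabulate regtest regtest2, missing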
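* Close the log file opened at the top of this do-file (an addition, not in the original).
* capture suppresses the error if no log happens to be open.
capture log close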