___ ____ ____ ____ ____(R)
/__ / ____/ / ____/
___/ / /___/ / /___/
Statistics/Data Analysis
User: Danish Naeem Khan (19160015) / Aleeshay Obaid (19160002)
Project: SDA
name: <unnamed>
log: C:\Users\19160015\Desktop\Stata Homework (Final Submission).smcl
log type: smcl
opened on: 3 Dec 2019, 20:03:54
1 . do "C:\Users\19160015\Desktop\Stata Homework (Final Submission).do"
2 . import excel "C:\Users\19160015\Desktop\Stata+Homework.xls", sheet("Sheet1") fi
> rstrow
(43 vars, 4,360 obs)
3 . ******************** Question‐1 ********************
4 . ********** Part‐a **********
5 . lab var nr "Person Identifier"
6 . lab var agric "=1 if in agriculture"
7 . lab var black "=1 if black"
8 . lab var construc "=1 if in construction"
9 . lab var exper "labor mkt experience"
10 . lab var hisp "=1 if Hispanic"
11 . lab var poorhlth "=1 in poor health"
12 . lab var hours "annual hours worked"
13 . lab var manuf "=1 if in manufacturing"
14 . lab var married "=1 if married"
15 . lab var nrthcen "=1 if north central"
16 . lab var nrtheast "=1 if northeast"
17 . lab var south "=1 if south"
18 . lab var educ "years of schooling"
19 . lab var union "=1 if in union"
20 . lab var lwage "log(wage)"
21 . lab var d81 "=1 if year==1981"
22 . lab var d82 "=1 if year==1982"
23 . lab var d83 "=1 if year==1983"
24 . lab var d84 "=1 if year==1984"
25 . lab var d85 "=1 if year==1985"
Stata Homework Tuesday December 3 20:28:39 2019 Page 2
26 . lab var d86 "=1 if year==1986"
27 . lab var d87 "=1 if year==1987"
28 . lab var expersq "experience squared"
29 . ********** Part‐b **********
30 . gen year=1 if d81==1
(3,815 missing values generated)
31 . replace year=2 if d82==1
(545 real changes made)
32 . replace year=3 if d83==1
(545 real changes made)
33 . replace year=4 if d84==1
(545 real changes made)
34 . replace year=5 if d85==1
(545 real changes made)
35 . replace year=6 if d86==1
(545 real changes made)
36 . replace year=7 if d87==1
(545 real changes made)
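The seven statements above can be collapsed into a loop over the year dummies. The following is only a sketch meant for a do-file: year_alt is a hypothetical variable name that does not appear in the original do-file, and coding the observations that have no year dummy set as 0 rests on the assumption that they belong to the base year, which the log does not confirm.

```stata
* Sketch only: year_alt is a hypothetical name, not part of the original do-file
gen byte year_alt = 0                          // assumption: no dummy set = base year
forvalues y = 81/87 {
    replace year_alt = `y' - 80 if d`y' == 1   // d81 -> 1, d82 -> 2, ..., d87 -> 7
}
tab year_alt, missing                          // check that no observation is left uncoded
```

Starting from 0 rather than from a missing value keeps every observation in later tabulations.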
37 . ********** Part‐c **********
38 . lab def year 1"Year‐1981" 2"Year‐1982" 3"Year‐1983" 4"Year‐1984" 5"Year‐1985" 6
> "Year‐1986" 7"Year‐1987"
39 . lab val year year
40 . tab year
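       year |      Freq.     Percent        Cum.
------------+-----------------------------------
  Year-1981 |        545       14.29       14.29
  Year-1982 |        545       14.29       28.57
  Year-1983 |        545       14.29       42.86
  Year-1984 |        545       14.29       57.14
  Year-1985 |        545       14.29       71.43
  Year-1986 |        545       14.29       85.71
  Year-1987 |        545       14.29      100.00
------------+-----------------------------------
      Total |      3,815      100.00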
41 . lab var year "1980 to 1987"
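42 . * The variable label for year (asked for in Part-a) is defined here, because year is
> only created in Part-b. Note that tab reports 3,815 of the 4,360 observations: the
> 545 observations with no year dummy set are left missing, so the coded categories
> run from 1981 to 1987.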
43 . ********** Part‐d **********
44 . gen ethnicity=1 if black==0
(504 missing values generated)
45 . replace ethnicity=2 if black==1
(504 real changes made)
46 . lab def race 1"white" 2"black"
47 . lab val ethnicity race
48 . ******************** Question‐2 ********************
49 . ********** Part‐a **********
50 . sum hours
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       hours |      4,360    2191.257    566.3523        120       4992
51 . return list
scalars:
r(N) = 4360
r(sum_w) = 4360
r(mean) = 2191.257339449541
r(Var) = 320754.9289435966
r(sd) = 566.3523010844016
r(min) = 120
r(max) = 4992
r(sum) = 9553882
52 . ***** (i) *****
53 . di (3200-r(mean))/r(sd)
1.7811222
54 . di (3800 - r(mean))/r(sd)
2.8405335
55 . di 0.99774-0.96246
.03528
56 . * Assuming hours is normally distributed, the probability that annual hours worked
> lies between 3200 and 3800, i.e. P[3200<X<3800], is P[Z<2.84] - P[Z<1.78] =
> 0.99774 - 0.96246 = 0.03528
57 . ***** (ii) *****
58 . di (950-r(mean))/r(sd)
-2.19167
59 . di 1-0.98574
.01426
60 . * Assuming normality, the probability that annual hours worked is less than 950, i.e.
> P[X<950] = P[Z<-2.19] = 1 - 0.98574, is 0.01426
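The table look-ups in (i) and (ii) can be cross-checked with Stata's built-in normal() function, which returns the cumulative standard normal probability. This is a sketch that assumes the dataset is still in memory; it re-runs summarize so that r(mean) and r(sd) are current and copies them into scalars (mu and sigma are hypothetical names). The results should be close to, but not identical with, 0.03528 and 0.01426, because the table values use z rounded to two decimals.

```stata
* Sketch only: mu and sigma are hypothetical scalar names
quietly summarize hours
scalar mu    = r(mean)
scalar sigma = r(sd)

* (i) P[3200 < X < 3800] under the normality assumption
display normal((3800 - mu)/sigma) - normal((3200 - mu)/sigma)

* (ii) P[X < 950] under the normality assumption
display normal((950 - mu)/sigma)
```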
61 . ***** (iii) *****
62 . * To answer this, we need the z-value whose lower-tail probability is approximately
> 0.1379. The Z table gives 0.8621 at z = 1.09, and 1 - 0.8621 = 0.1379, so the
> required z-value is -1.09
63 . di (-1.09*r(sd))+r(mean)
1573.9333
64 . * So, the value of annual hours worked for probability 0.1379 is 1573.933
65 . ***** (iv) *****
66 . * To answer this, we need the z-value whose lower-tail probability is approximately
> 0.0655. The Z table gives 0.9345 at z = 1.51, and 1 - 0.9345 = 0.0655, so the
> required z-value is -1.51
67 . di (-1.51*r(sd))+r(mean)
1336.0654
68 . * So, the value of annual hours worked for probability 0.0655 is 1336.0654
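The z-values in (iii) and (iv) were read from a printed table. As a cross-check, Stata's invnormal() function returns the z-value for a given lower-tail probability. The sketch below assumes, as the log does, that 0.1379 and 0.0655 are lower-tail probabilities and that hours is treated as normally distributed; the results should be close to the values above, with small differences from rounding z to two decimals.

```stata
quietly summarize hours

* (iii) z-value with lower-tail probability 0.1379, then the matching hours value
display invnormal(0.1379)
display r(mean) + invnormal(0.1379)*r(sd)

* (iv) the same steps for probability 0.0655
display invnormal(0.0655)
display r(mean) + invnormal(0.0655)*r(sd)
```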
69 . ********** Part‐b **********
70 . gen norm_hrs=((hours-r(mean))/r(sd))
71 . tabstat norm_hrs, s(mean sd)
    variable |      mean        sd
-------------+--------------------
    norm_hrs | -9.38e-11         1
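The standardization above relies on r(mean) and r(sd) still holding the results of the earlier summarize command. A sketch of an alternative that does not depend on stored results uses the std() function of egen; norm_hrs_alt is a hypothetical name introduced only for this comparison.

```stata
* Sketch only: norm_hrs_alt is a hypothetical name used for this check
egen double norm_hrs_alt = std(hours)

* Both variables should show a mean of about 0 and a standard deviation of 1
summarize norm_hrs norm_hrs_alt
```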
72 . ********** Part‐c **********
73 . ***** (i) *****
74 . count if norm_hrs>1.7811 & norm_hrs<2.840
160
75 . * So, there are 160 observations in norm_hrs between 1.7811 and 2.840
76 . ***** (ii) *****
77 . count if norm_hrs < -2.19
121
78 . * So, there are 121 observations in norm_hrs which are less than -2.19
79 . ***** (iii) *****
80 . di (160/4360)
.03669725
81 . *This value is close to the value obtained in part‐a(i)
82 . di (121/4360)
.02775229
83 . *This value is not close to the value obtained in part‐a(ii)
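The two proportions above were typed in by hand from the count results. A sketch that avoids retyping uses the r(N) result stored by count together with _N, the number of observations in the dataset; it assumes norm_hrs has no missing values, which matches the 4,360 observations reported for hours.

```stata
* Share of observations between the two z-values from part-a(i)
quietly count if norm_hrs > 1.7811 & norm_hrs < 2.840
display r(N)/_N

* Share of observations below the z-value from part-a(ii)
quietly count if norm_hrs < -2.19
display r(N)/_N
```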
84 . ***** (iv) *****
85 . kdensity norm_hrs, norm
86 . graph save "Graph" "C:\Users\19160015\Desktop\k‐density graph [Q 2‐c(iv)].gph"
(file C:\Users\19160015\Desktop\k‐density graph [Q 2‐c(iv)].gph saved)
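87 . * The answers in [Q 2-a(ii)] and [Q 2-c(ii)] differ because [Q 2-a(ii)] assumes that
> hours is normally distributed and reads the probability from the Z table, whereas
> [Q 2-c(ii)] counts the observations actually in the data. The kernel density plot
> compares norm_hrs with the normal curve; since the data are not exactly normal, the
> observed share below -2.19 (0.0278) need not match the normal probability (0.0143).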
88 . ******************** Question‐3 ********************
89 . ********** Part‐a **********
90 . generate wage = exp(lwage)
91 . ********** Part‐b **********
92 . scatter wage educ
93 . graph save "Graph" "C:\Users\19160015\Desktop\Scatter plot [Q 3‐b].gph"
(file C:\Users\19160015\Desktop\Scatter plot [Q 3‐b].gph saved)
94 . * Yes, the scatter plot matches our expectation: higher education is associated with
> a higher wage rate
95 . ********** Part‐c **********
96 . corr wage educ
(obs=4,360)
wage educ
wage 1.0000
educ 0.2643 1.0000
97 . * The correlation between wage and education is 0.2643, which indicates a positive but
> fairly weak linear relationship. The correlation coefficient ranges between -1 and 1
> and equals the covariance of the two variables divided by the product of their
> standard deviations.
98 . corr wage educ, cov
(obs=4,360)
wage educ
wage 10.2542
educ 1.47807 3.04915
99 . * The covariance of wage and educ is positive (1.47807), which shows that the two
> variables tend to move in the same direction.
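The relationship between the covariance and the correlation described above can be checked numerically. The sketch below divides the covariance by the square root of the product of the two variances, first with the results stored by correlate and then with the rounded values printed in the log; both should reproduce the reported correlation of 0.2643 up to rounding.

```stata
* Using the results stored by correlate with the covariance option
quietly correlate wage educ, covariance
display r(cov_12) / sqrt(r(Var_1) * r(Var_2))

* Using the rounded values printed in the covariance matrix above
display 1.47807 / sqrt(10.2542 * 3.04915)
```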
100 . ********** Part‐d **********
101 . reg wage educ
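      Source |       SS           df       MS      Number of obs   =     4,360
-------------+----------------------------------   F(1, 4358)      =    327.38
       Model |  3123.18017         1  3123.18017   Prob > F        =    0.0000
    Residual |  41575.0738     4,358  9.53994352   R-squared       =    0.0699
-------------+----------------------------------   Adj R-squared   =    0.0697
       Total |   44698.254     4,359   10.254245   Root MSE        =    3.0887

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .4847476   .0267911    18.09   0.000     .4322235    .5372718
       _cons |    .215163   .3187013     0.68   0.500    -.4096536    .8399795
------------------------------------------------------------------------------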
102 . * Here, the dependent variable is wage and the independent variable is education. The
> coefficient on educ shows that one additional year of schooling is associated with a
> wage that is higher by 0.4847476 on average. The intercept (_cons) is the predicted
> average wage when education is 0, namely 0.215163; it is not statistically different
> from zero (p = 0.500).
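To make the interpretation concrete, the fitted line can be evaluated at a chosen level of education. The sketch below uses 12 years of schooling, an arbitrary value picked for illustration and not taken from the assignment. With the coefficients reported above, the predicted wage is 0.215163 + 0.4847476 x 12, or about 6.03.

```stata
quietly regress wage educ

* Predicted wage at 12 years of schooling (12 is an arbitrary illustrative value)
display _b[_cons] + _b[educ]*12
```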
103 . ********** Part‐e **********
104 . * R‐Squared (the value of goodness of fit) is 0.0699 which tells us that out of
> total (100%) variation in dependent variable (wage), 6.99% of the variation is
> explained by our independent variable (education).
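R-squared can be recomputed from the sums of squares in the regression table as Model SS divided by Total SS, here 3123.18017 / 44698.254. The sketch below does this with the results stored by regress; in a regression with a single regressor the value should also agree with the squared correlation between wage and educ.

```stata
quietly regress wage educ
display e(mss) / (e(mss) + e(rss))   // model SS over total SS
display e(r2)                        // R-squared as stored by regress

quietly correlate wage educ
display r(rho)^2                     // squared correlation (simple regression only)
```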
105 . ********** Part‐f **********
106 . scatter wage educ || lfit wage educ
107 . graph save "Graph" "C:\Users\19160015\Desktop\Scatter plot [Q 3‐f].gph"
(file C:\Users\19160015\Desktop\Scatter plot [Q 3‐f].gph saved)
108 . ********** Part‐g **********
109 . predict predicted_wages
(option xb assumed; fitted values)
110 . predict predicted_errors, res
111 . ********** Part‐h **********
112 . ***** (i) *****
113 . tabstat predicted_errors, s(sum)
variable sum
predicted~rs .000066
114 . * The sum of the predicted_errors gives us the value of 0.000066 which is appro
> ximately equal to zero
115 . ***** (ii) *****
116 . g actual_wage= predicted_wages+predicted_errors
117 . * if both the variables (wage and actual_wage) are equal then the sum of the di
> fference of both the variables must also be equal to zero.
118 . g difference_wage= wage‐actual_wage
119 . tabstat difference_wage, s(sum)
variable sum
difference~e .0000199
120 . * The sum of difference_wage is approximately, rather than exactly, zero. By construction,
> wage = predicted_wages + predicted_errors for every observation, so the difference
> should be zero throughout. The small nonzero sum (.0000199) most likely reflects
> floating-point rounding, since these variables are stored in single precision (float)
> by default.
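One way to probe the rounding explanation is to store the residuals in double precision and sum them again. This is a sketch only: resid_d is a hypothetical name, and because wage itself was generated as a float the sum should move closer to zero without necessarily reaching it exactly.

```stata
quietly regress wage educ

* resid_d is a hypothetical name; double requests double-precision storage
predict double resid_d, residuals

quietly summarize resid_d
display r(sum)
```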
121 . ***** (iii) *****
122 . di "19160015 ‐ 19160002"
19160015 ‐ 19160002
123 . ******************** Question‐4 ********************
124 . ********** Part‐a **********
125 . decode year, gen(Year)
126 . * decode has created a new string variable Year that contains the value labels of
> year in place of the numeric codes
127 . ********** Part‐b **********
128 . encode Year, gen(YEAR)
129 . * encode has converted the string variable Year into a new numeric variable YEAR, with
> the original strings attached as value labels.
130 . * No, the coding of year and YEAR does not differ. encode assigns codes in alphabetical
> order of the strings, and Year-1981 comes before Year-1982 alphabetically, so
> Year-1981 is encoded as 1, Year-1982 as 2, and so on. This matches the codes we
> assigned when we generated year.
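The claim that year and YEAR carry the same codes can be checked directly rather than argued from alphabetical order. The sketch below lists both value labels (encode names its label after the new variable, so it is called YEAR) and cross-tabulates the underlying numeric codes; all observations should fall on the diagonal.

```stata
* Show the code-to-label mapping behind each variable
label list year YEAR

* Cross-tabulate the raw numeric codes, including missing values
tabulate year YEAR, nolabel missing
```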
131 . ********** Part‐c **********
132 . tostring married, gen (Married)
Married generated as str1
133 . destring Married, gen (MARRIED)
Married: all characters numeric; MARRIED generated as byte
134 . ******************** Question‐5 ********************
135 . ********** Part‐a **********
136 . tabstat wage hours if (educ>=10 & educ<=15), s(p50)
stats wage hours
p50 5.453574 2080
137 . * For individuals whose years of education are between 10 and 15 (both inclusive), the
> median wage is 5.453574 and the median hours worked is 2080.
138 . tabstat wage hours if (educ>=2 & educ<=6), s(p50)
stats wage hours
p50 4.521925 2080
139 . * For individuals whose years of education are between 2 and 6 (both inclusive), the
> median wage is 4.521925 and the median hours worked is 2080.
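The inclusive range conditions above can be written more compactly with the inrange() function, which is true when its first argument lies between the two bounds, endpoints included. The sketch below should return the same medians as the two tabstat commands above.

```stata
* inrange(educ, a, b) is equivalent to educ >= a & educ <= b
tabstat wage hours if inrange(educ, 10, 15), statistics(p50)
tabstat wage hours if inrange(educ, 2, 6), statistics(p50)
```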
140 . ********** Part‐b **********
141 . tab married ethnicity, col
Key
frequency
column percentage
=1 if ethnicity
married white black Total
142 . * Of all black people, 25.79% are married
143 . tab ethnicity married, col
Key
frequency
column percentage
=1 if married
ethnicity 0 1 Total
144 . * Of all unmarried people, 84.71% are White
145 . ********** Part‐c **********
146 . sum wage, d
                            wage
-------------------------------------------------------------
      Percentiles      Smallest
 1%     1.004084       .0279014
 5%     2.299114       .2424242
10%     2.874092       .2836879       Obs               4,360
25%     3.860191        .296195       Sum of Wgt.       4,360

50%     5.318244                      Mean           5.919175
                        Largest       Std. Dev.      3.202225
75%     7.323994       31.48034
90%     9.601383       32.22385       Variance       10.25425
95%     11.14366       43.70629       Skewness       3.007379
99%     16.33186       57.50431       Kurtosis       29.37572
147 . return list
scalars:
r(N) = 4360
r(sum_w) = 4360
r(mean) = 5.91917510948853
r(Var) = 10.25424501323122
r(sd) = 3.202225009775424
r(skewness) = 3.007379490602033
r(kurtosis) = 29.37571706953641
r(sum) = 25807.60347736999
r(min) = .0279013924300671
r(max) = 57.50430679321289
r(p1) = 1.004084467887878
r(p5) = 2.299114227294922
r(p10) = 2.874091982841492
r(p25) = 3.860190629959106
r(p50) = 5.318243980407715
r(p75) = 7.323993921279907
r(p90) = 9.601382732391357
r(p95) = 11.14365768432617
r(p99) = 16.33186340332031
148 . di r(p99)/r(p1)
16.265428
149 . * The ratio of P99 to P1 (P99:P1) is 16.265428
150 . di r(p90)/r(p10)
3.3406665
151 . * The ratio of P90 to P10 (P90:P10) is 3.3406665
152 . di r(p75)/r(p25)
1.8973141
153 . * The ratio of P75 to P25 (P75:P25) is 1.8973141
154 . di r(p90)/r(p50)
1.8053671
155 . * The ratio of P90 to P50 (P90:P50) is 1.8053671
156 .
end of do‐file
157 . log close
name: <unnamed>
log: C:\Users\19160015\Desktop\Stata Homework (Final Submission).smcl
log type: smcl
closed on: 3 Dec 2019, 20:08:37