Ton Duc Thang University
Machine Learning
Report on Job Salary Prediction
using LaTeX
Members:
Trn ng Trnh - 5100324
Nguyen Thi My Dung - 51003238
Supervisor:
Dr. Nguyen Thanh Hien
Monday, 09/05/2016
Contents
1 INTRODUCTION TO THE R LANGUAGE
1.1 Concepts
1.1.1 Introduction
1.1.2 What is R?
1.2 Downloading and Installing R
1.3 Syntax of the R Language
1.3.1 Naming in R
1.3.2 Help in R
2 INTRODUCTION TO KAGGLE
3 INTRODUCTION TO THE JOB SALARY PREDICTION COMPETITION
4 JOB SALARY PREDICTION IN R
4.1 Reading the data
4.2 Building Top Sources
4.2.1 Training set
4.2.2 Test set (same procedure as for the training set)
4.3 Building Top Location
4.3.1 Training set
4.3.2 Test set
4.4 Building the Model
Chapter 1
INTRODUCTION TO THE R LANGUAGE
1.1 Concepts
1.1.1 Introduction
- Data analysis and graphics are usually carried out with popular software packages such as SAS, SPSS, Stata, Statistica, and S-Plus. These packages were developed by software companies and brought to market over roughly the past three decades, and universities, research centers, and industrial companies around the world use them for teaching and research. But because they are relatively expensive to use (sometimes up to hundreds of thousands of dollars per year), some universities in developing countries (and even in some developed countries) cannot afford to use them over the long term. For that reason, statistical researchers around the world collaborated to develop a new, open-source package that everyone in the statistics and mathematics communities could use in a uniform way and entirely free of charge.
- In 1996, in an influential paper on statistical computing, two statisticians, Ross Ihaka and Robert Gentleman of the University of Auckland, New Zealand, outlined a new language for statistical analysis that they named R. Many statisticians around the world endorsed the initiative and joined the development of R.
- In the years since, more and more statisticians, mathematicians, and researchers in every field have switched to R for analyzing scientific data. Worldwide there is a network of more than one million R users, and the number is growing very quickly.
1.1.2 What is R?
In short, R is a software package for statistical analysis and graphics. In essence, however, R is a general-purpose computer language that can be used for many different goals, from simple calculations, recreational mathematics, and matrix computation to complex statistical analyses. Because it is a language, R can also be used to build specialized software for a particular computational problem.
1.2 Downloading and Installing R
- To use R, we first have to install it on our own computer. To do so, we go online and visit the website of the Comprehensive R Archive Network (CRAN):
[Link]
- Once R has been downloaded, the next step is to install (set up) it on the computer. We simply click on the downloaded file and follow the installation instructions on the screen. This step is very simple, and the installation can be finished in about one minute.
1.3 Syntax of the R Language
- The general syntax of R is a command, that is, a function. A function takes arguments, so the function name is followed by the arguments we have to supply:
object <- function(argument 1, argument 2, ..., argument n)
- To find out which arguments a function takes, we use the command args(x) (args is short for arguments), where x is the function we want to know about.
- Some symbols commonly used in R are:
x == 5 : x equals 5
x != 5 : x does not equal 5
y < x : y is less than x
x > y : x is greater than y
z <= 7 : z is less than or equal to 7
p >= 1 : p is greater than or equal to 1
[Link](x) : is x a missing value?
A & B : A and B (AND)
A | B : A or B (OR)
! : not (NOT)
- In R, any text or command that follows the # symbol has no effect, because # is the symbol users use to add comments.
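The conventions above can be tried directly at the R prompt. The following is a minimal sketch; the variable names (x, y, v) are illustrative, and is.na() is assumed here to be the missing-value test that the list refers to.

```r
# Everything after '#' is a comment and is ignored by R.

# General form: object <- function(argument 1, ..., argument n)
x <- sum(1, 2, 2)      # x is now 5
y <- sqrt(16)          # y is now 4

# args() shows the arguments a function expects
args(sd)

# Comparison operators return TRUE or FALSE
x == 5                 # TRUE
x != 5                 # FALSE
y < x                  # TRUE
x >= 1                 # TRUE

# Logical operators combine conditions
(x == 5) & (y < x)     # TRUE  (AND)
(x != 5) | (y > x)     # FALSE (OR)
!(x == 5)              # FALSE (NOT)

# is.na() tests for missing values, element by element
v <- c(1, NA, 3)
is.na(v)               # FALSE TRUE FALSE
```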
1.3.1 Naming in R
- Naming an object or a variable in R is quite flexible, because R has fewer restrictions than other software. An object name must be written as one unbroken word (that is, it must not contain a space). For example, R accepts myobject but does not accept my object.
- A name such as myobject can be hard to read, so it is better to separate the words with a dot, as in [Link].
- One important point to remember is that R distinguishes between upper-case and lower-case letters. So [Link] is different from [Link].
1.3.2 Help in R
Besides args(), R provides the help() command so that users can learn the syntax of each function. For example, to find out which arguments the function lm takes, we simply type:
> help(lm) or > ?lm
Chapter 2
INTRODUCTION TO KAGGLE
- Founded in 2010, Kaggle is an online platform for hosting data mining and predictive modeling competitions. A company can work with Kaggle to publish a data set together with a problem statement, and the site's community of data scientists proposes solutions.
- An important point is that the "contestants" may revise their solutions repeatedly, which pushes them and the community to keep looking for better solutions right up to the deadline.
- Companies of all kinds, such as MasterCard, Pfizer, Allstate, Facebook, and even NASA, have hosted competitions on Kaggle. For example, General Electric sponsored a competition to write software that plans more efficient flight routes for airlines, and Practice Fusion (a health technology company) sponsored another competition to identify patients with type 2 diabetes from their medical records.
- Prizes for winning solutions range from 3,000 to 250,000 USD. One exceptional prize, worth up to 3 million USD, was offered by Heritage Provider Network.
- Everyone has a chance. Any contestant, however remote their location, can measure their talent against the leaders in the field. Moreover, contestants can exchange ideas and sharpen their skills in the Kaggle forums. A good programmer can climb the rankings quickly by scoring well in two or three competitions.
- To some extent, Kaggle is a form of "crowdsourcing": it taps the global brain to solve a large problem. This way of harnessing the crowd has been around for a decade or more, at least since Wikipedia (or earlier, since Linux and similar projects). Companies such as TaskRabbit and oDesk have been creating jobs for the crowd for years. But Kaggle goes further. First, Kaggle participants do not work only out of goodwill: they want to win and to improve their ranking so as to have better opportunities in the job market. Second, Kaggle does not just create jobs; it creates a new job market for experts. Unlike traditional casual workers, Kaggle members are stars.
- The Kaggle ranking has become an important yardstick in the data science community. Companies such as American Express and the New York Times have begun listing Kaggle rank as a required credential in their recruitment advertisements. It is not just a badge but an indicator of ability, one that carries more weight than traditional measures of qualification and expertise. Degrees from prestigious universities and work experience at well-known companies such as IBM may matter less than a Kaggle score. In other words, measurable work and your rank in the market are worth more than where you work. Will the CV (curriculum vitae) eventually become unnecessary?
- Kaggle creates a new kind of labor market, one in which skills are separated from the unreliable credentials of degrees and résumés. This is truly a major change.
Chapter 3
INTRODUCTION TO THE JOB SALARY PREDICTION COMPETITION
- When posting job vacancies, employers often leave out the salary. For a job seeker this creates a dilemma: either risk wasting valuable time on a job that pays poorly, or skip the advert and risk missing an excellent opportunity.
- Adzuna is a classified-ads company in the UK, and most of its adverts are for jobs. More than half of those adverts do not list a salary. To provide a better service, Adzuna wants to offer a salary estimate for jobs where the employer has not listed one. To this end, Adzuna organized a Kaggle competition with the goal of improving job salary prediction.
- A successful model will combine some analysis of the effect of including different keywords or phrases with the use of structured data fields such as location, time, or company. Some of the structured data shown is inferred by Adzuna's own processes, based on where an advert came from or on its content; it may not be correct, but it is representative of the real data.
- Participants are given a training data set for building models, which includes all of the variables (including the salary). A second data set is used to provide feedback on the public leaderboard. After about 6 weeks, Kaggle releases a final data set to participants that does not include the salary field. Participants are then asked to submit their salary predictions for each job in that set for evaluation.
Chapter 4
JOB SALARY PREDICTION IN R
4.1 Reading the data
Kaggle provides all of the data in .csv format, so we read it into R with the [Link] function.
To check the column names in the data, we use names().
To check how often each value occurs, we use table().
In the table() output we see:
1. full_time, part_time, contract, and permanent are values that occur in the training data.
2. The numbers are the frequencies with which each of those values occurs.
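A minimal sketch of these three steps follows. So that it runs on its own, it first writes a tiny stand-in CSV to a temporary file. The rows, the use of read.csv, and the column names ContractType and Id are assumptions made for illustration, not taken from the report.

```r
# Stand-in for the Kaggle training file (hypothetical rows, illustration only)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Id,ContractType,ContractTime,SalaryNormalized",
  "1,full_time,permanent,25000",
  "2,part_time,contract,18000",
  "3,full_time,permanent,32000",
  "4,,permanent,27000"
), csv_path)

# Read the CSV; treat empty strings as missing values
train <- read.csv(csv_path, na.strings = c("", "NA"), stringsAsFactors = FALSE)

# Column names
names(train)

# Frequency of each value; useNA = "ifany" also counts missing values
type_counts <- table(train$ContractType, useNA = "ifany")
time_counts <- table(train$ContractTime, useNA = "ifany")
print(type_counts)
print(time_counts)
```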
4.2 Building Top Sources
4.2.1 Training set
The summary() command gives an accurate and complete overview of the data.
Select the 10 Sources with the highest frequency.
Assign the frequencies to Top Sources.
Create an additional value, Other.
Move the frequency of the NA value into Other.
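The steps above can be sketched as follows. This is an illustration under assumptions, not the report's original code, which is not reproduced in the text: the column name SourceName, the new column TopSources, the toy data, and the use of a top 2 instead of a top 10 are all chosen here to keep the example small.

```r
# Toy training data; SourceName is an assumed column name
train <- data.frame(
  SourceName = c("siteA", "siteA", "siteA", "siteB", "siteB", "siteC", NA),
  stringsAsFactors = FALSE
)

n_top <- 2  # the report keeps the 10 most frequent sources

# 1. Count each source and keep the n_top most frequent ones
source_counts <- sort(table(train$SourceName), decreasing = TRUE)
top_sources <- names(source_counts)[seq_len(min(n_top, length(source_counts)))]

# 2. Keep the source name only if it is a top source; everything else becomes NA
train$TopSources <- factor(train$SourceName, levels = top_sources)

# 3. Add an extra level called "Other"
levels(train$TopSources) <- c(levels(train$TopSources), "Other")

# 4. Move the NA entries (rare or missing sources) into "Other"
train$TopSources[is.na(train$TopSources)] <- "Other"

table(train$TopSources)
```

Collapsing rare sources into a single level keeps the number of model coefficients small, at the cost of treating all rare sources alike.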
4.2.2 Test set (same procedure as for the training set)
Assign the frequencies to Top Sources.
Create an additional value, Other.
Move the frequency of the NA value into Other.
4.3 Building Top Location
4.3.1 Training set
Count the frequency of each location value and store it in [Link]
Select the top 10 locations with the highest frequency.
Assign the frequencies to the Top Locations.
Create the value Other.
Move the frequency of the NA value into Other.
Add the frequency of the UK value to Other.
4.3.2 Test set
Assign the frequencies to the Top Locations.
Create the value Other.
Move the frequency of the NA value into Other.
Add the frequency of the UK value to Other.
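For the test set, one reasonable approach is to reuse the top locations found on the training set, so that both sets share the same factor levels. The sketch below assumes this, along with a hypothetical column name LocationNormalized, a new column TopLocation, and toy values; top_locations stands for the vector computed on the training set.

```r
# Top locations as they might come out of the training set (toy values)
top_locations <- c("UK", "London", "Manchester")

# Toy test data; LocationNormalized is an assumed column name
test <- data.frame(
  LocationNormalized = c("London", "UK", "Leeds", "Manchester", NA, "UK"),
  stringsAsFactors = FALSE
)

# Assign the top locations; values outside the list become NA
test$TopLocation <- factor(test$LocationNormalized, levels = top_locations)

# Create the "Other" level and move the NA entries into it
levels(test$TopLocation) <- c(levels(test$TopLocation), "Other")
test$TopLocation[is.na(test$TopLocation)] <- "Other"

# Fold "UK" into "Other" (assumed rationale: "UK" is too coarse to be informative)
test$TopLocation[test$TopLocation == "UK"] <- "Other"
test$TopLocation <- droplevels(test$TopLocation)  # remove the now-empty "UK" level

table(test$TopLocation)
```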
4.4 Building the Model
Build a model of the salary (SalaryNormalized) based on: job category (Category), contract duration (ContractTime), Top Location, and Top Sources.
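A minimal sketch of such a model, using lm() on synthetic data. The data, the seed, the column names TopLocation and TopSources, and the choice to model the logarithm of the salary are assumptions made here; only Category, ContractTime, and SalaryNormalized are names given in the report.

```r
set.seed(1)  # reproducible toy data
n <- 200

# Synthetic stand-in for the prepared training set
train <- data.frame(
  Category     = factor(sample(c("IT Jobs", "Teaching Jobs", "Sales Jobs"), n, replace = TRUE)),
  ContractTime = factor(sample(c("permanent", "contract", "Other"), n, replace = TRUE)),
  TopLocation  = factor(sample(c("London", "Manchester", "Other"), n, replace = TRUE)),
  TopSources   = factor(sample(c("siteA", "siteB", "Other"), n, replace = TRUE))
)
# Salaries with a category effect plus noise (purely illustrative numbers)
train$SalaryNormalized <- exp(10 + 0.3 * (train$Category == "IT Jobs") + rnorm(n, sd = 0.4))

# Linear model of log salary on the four categorical predictors
model <- lm(log(SalaryNormalized) ~ Category + ContractTime + TopLocation + TopSources,
            data = train)

summary(model)  # residuals, residual standard error, R-squared
anova(model)    # Df, Sum Sq, Mean Sq, F value, Pr(>F) for each term
```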
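1. Residuals: the differences between the observed values and the predicted values. We want them to be close to 0 for the model to be accurate, but they still range from Min to Max.
2. Residual Standard Error: computed as \(\sqrt{0.1627} \approx 0.40\), where 0.1627 is the residual mean square; 244664 is the number of residual degrees of freedom for the training set.
3. Multiple R-squared: the model terms account for 32.69% of the total variation in the response.
\[ R^2 = \frac{4559 + 6788 + 7990}{4559 + 6788 + 7990 + 39810} = 0.3269 \]
4. Adjusted R-squared: shows the improvement of the model over the total variance \(s^2\) of the response, taking the degrees of freedom into account.
\[ s^2 = \frac{4559 + 6788 + 7990 + 39810}{8 + 10 + 85 + 244664} = 0.2416 \]
\[ R^2_{\mathrm{adj}} = \frac{s^2 - 0.1627}{s^2} = 0.3266 \]
5. Df (degrees of freedom): the degrees of freedom of each term.
6. Sum Sq: the sum of squares.
7. Mean Sq: the mean square, that is, Sum Sq divided by Df.
8. F value: the F statistic, computed as the mean square of a term divided by the residual mean square, for example
\[ F = \frac{569.91}{0.1627} \approx 3502.58 \]
9. Pr(>F): the p-value of the F test.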
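The figures in items 2 to 4 can be reproduced from the sums of squares and degrees of freedom quoted above. The sketch below only redoes that arithmetic; the variable names are chosen here for illustration.

```r
# Sums of squares and degrees of freedom quoted in the text
ss_terms <- c(4559, 6788, 7990)   # one entry per model term
df_terms <- c(8, 10, 85)
ss_resid <- 39810
df_resid <- 244664

ms_resid <- ss_resid / df_resid          # residual mean square, about 0.1627
rse      <- sqrt(ms_resid)               # residual standard error, about 0.40

ss_total <- sum(ss_terms) + ss_resid
r2       <- sum(ss_terms) / ss_total     # multiple R-squared, about 0.3269

s2_total <- ss_total / (sum(df_terms) + df_resid)   # total variance, about 0.2416
r2_adj   <- (s2_total - ms_resid) / s2_total        # adjusted R-squared, about 0.3266

ms_terms <- ss_terms / df_terms          # mean square of each term
f_values <- ms_terms / ms_resid          # F value of each term

round(c(rse = rse, r2 = r2, r2_adj = r2_adj), 4)
```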
Predict the salaries.
Build the output.
Write the output to a file.
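A minimal sketch of these three steps. It fits a small stand-in model on made-up data so that the block runs on its own; the Id column, the layout of the output, and the file name are assumptions, not the report's actual code.

```r
# Made-up training and test data (illustration only)
train <- data.frame(
  ContractTime     = factor(c("permanent", "contract", "permanent",
                              "contract", "permanent", "contract")),
  SalaryNormalized = c(30000, 22000, 34000, 20000, 28000, 25000)
)
test <- data.frame(
  Id           = c(101, 102),
  ContractTime = factor(c("contract", "permanent"), levels = levels(train$ContractTime))
)

# Stand-in model (the report's model uses more predictors)
model <- lm(SalaryNormalized ~ ContractTime, data = train)

# Predict the salaries for the test set
predicted <- predict(model, newdata = test)

# Build the output: one row per job advert
output <- data.frame(Id = test$Id, SalaryNormalized = predicted)

# Write the output to a CSV file
out_path <- file.path(tempdir(), "submission.csv")
write.csv(output, out_path, row.names = FALSE)
```

If the model is fitted on log(SalaryNormalized) instead, exp() has to be applied to the predictions before they are written out.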
THE END.