Machine Learning Course Report

This report describes using the R programming language to tackle the job salary prediction problem on Kaggle. It introduces R, Kaggle, and the specific competition, then walks through reading the data, building the top job sources and top locations, and building a prediction model in R.

Uploaded by

Phạm Thanh Tú

Ton Duc Thang University

Machine Learning

Report on Job Salary Prediction

using LaTeX

Members:
Trn ng Trnh - 5100324
Nguyn Th M Dung - 51003238

Supervisor:
TS. Nguyn Thanh Hin

Monday, 09/05/2016
Contents

1 INTRODUCTION TO THE R LANGUAGE
1.1 Concepts
1.1.1 Introduction
1.1.2 What is R?
1.2 Downloading and Installing R
1.3 R Language Syntax
1.3.1 Naming in R
1.3.2 Help in R

2 INTRODUCTION TO KAGGLE

3 INTRODUCTION TO THE JOB SALARY PREDICTION COMPETITION

4 JOB SALARY PREDICTION IN R
4.1 Reading the Data
4.2 Building Top Sources
4.2.1 Training set
4.2.2 Test set (same as the training set)
4.3 Building Top Location
4.3.1 Training set
4.3.2 Test set
4.4 Building the Model

Chapter 1

INTRODUCTION TO THE R LANGUAGE

1.1 Concepts


1.1.1 Introduction
- Data analysis and charting are usually carried out with popular software packages such as SAS, SPSS, Stata, Statistica, and S-Plus. These packages were developed by software companies and brought to market over roughly the past three decades, and they are used for teaching and research by universities, research centers, and industrial companies around the world. However, because they are fairly expensive to use (sometimes up to hundreds of thousands of dollars per year), some universities in developing countries (and even in some developed countries) cannot afford to use them over the long term. For this reason, statistical researchers around the world collaborated to develop a new, open-source software package that everyone in the statistics and mathematics communities can use in a uniform way and completely free of charge.
- In 1996, in an important paper on statistical computing, two statisticians, Ross Ihaka and Robert Gentleman of the University of Auckland, New Zealand, outlined a new language for statistical analysis that they named R. The initiative was endorsed by many statisticians around the world, who joined in developing R.
- In the two decades since, more and more statisticians, mathematicians, and researchers in every field have switched to R for analyzing scientific data. Worldwide there is a network of more than one million R users, and the number is growing rapidly.

1.1.2 What is R?
In short, R is a software package for statistical analysis and charting. In essence, though, R is a general-purpose computer language that can serve many different goals, from simple calculations and recreational mathematics to matrix computation and complex statistical analyses. Because it is a language, R can also be used to build specialized software for a particular computational problem.

1.2 Downloading and Installing R
- To use R, the first step is to install it on our own computer. To do this, we go online and visit the website of the Comprehensive R Archive Network (CRAN):

[Link]

- Once R has been downloaded, the next step is to install (set up) it on the computer. To do so, we simply click on the downloaded file and follow the on-screen installation instructions. This is a very simple step; the installation can be finished in about a minute.

1.3 R Language Syntax
- The general syntax of R is a command, or function. A function takes arguments, so the function name is followed by the arguments we must supply:

object <- function(argument 1, argument 2, ..., argument n)

- To find out which arguments a function requires, we use the command args(x) (args is short for arguments), where x is the function we want to know about.
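As a small illustration of this call pattern, the sketch below stores the result of a built-in function in an object and then inspects a function's arguments. The object names `average` and `arg_names` and the sample values are made up for the example; `formals()` is shown as a companion to `args()` that returns the arguments as a named list.

```r
# Call a function and store its result in an object (name chosen for illustration)
average <- mean(c(2, 4, 6))
print(average)

# args() shows the argument list of a function
print(args(sd))

# formals() exposes the same argument list as a named list
arg_names <- names(formals(sd))
print(arg_names)
```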

- Some symbols commonly used in R are:

x == 5 : x is equal to 5
x != 5 : x is not equal to 5
y < x : y is less than x
x > y : x is greater than y
z <= 7 : z is less than or equal to 7
p >= 1 : p is greater than or equal to 1
[Link](x) : whether x is a missing value
A & B : A and B (AND)
A | B : A or B (OR)
! : not (NOT)

- In R, any text or command that follows the # symbol on a line has no effect, because # is the symbol reserved for user comments.
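The operators listed above can be tried out directly. The following is a minimal sketch with invented values; the variable names are hypothetical, and the missing-value check uses base R's `is.na()`.

```r
x <- 5
y <- 3
z <- NA  # NA marks a missing value

is_five  <- x == 5               # equality test
not_five <- x != 5               # inequality test
missing  <- is.na(z)             # TRUE when the value is missing
both     <- (x > y) & (y >= 1)   # AND: both conditions must hold
either   <- (x < y) | (y <= 7)   # OR: at least one condition must hold
negated  <- !(x == 5)            # NOT: flips TRUE and FALSE

# Everything after '#' on a line is a comment and is not executed
print(c(is_five, not_five, missing, both, either, negated))
```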

1.3.1 Naming in R

- Naming an object or a variable in R is quite flexible, because R has fewer restrictions than other software. An object name must be written as one unbroken word (that is, it must not contain a space). For example, R accepts myobject but does not accept my object.
- A name like myobject can be hard to read, so we should separate its parts with a dot, as in [Link].

- One important point to note is that R distinguishes between uppercase and lowercase letters, so [Link] is different from [Link].
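These naming rules can be checked with a short sketch. The names `my.object` and `My.object` are hypothetical examples chosen here, not names taken from the report.

```r
# A dot is allowed inside a name and makes it easier to read
my.object <- 10

# Names are case-sensitive, so this creates a second, separate object
My.object <- 20
same_value <- identical(my.object, My.object)

# A name containing a space cannot even be parsed
bad <- try(parse(text = "my object <- 1"), silent = TRUE)
space_rejected <- inherits(bad, "try-error")

print(c(same_value = same_value, space_rejected = space_rejected))
```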

1.3.2 Help in R
Besides args(), R also provides the help() command so that users can look up the syntax of each function. For example, to find out which arguments the lm function takes, we simply type:
> help(lm) or > ?lm

Chapter 2

INTRODUCTION TO KAGGLE

- Founded in 2010, Kaggle is an online platform for hosting data mining and predictive modeling competitions. A company can work with Kaggle to post a dataset together with a problem statement so that the site's community of data scientists can propose solutions.

- An important point is that the "contestants" may revise their solutions repeatedly, which pushes them and the community to keep searching for better solutions right up to the deadline.
- Companies of all kinds, such as MasterCard, Pfizer, Allstate, Facebook, and even NASA, have hosted competitions on Kaggle. For example, General Electric sponsored a competition to write software that plans more efficient flight routes for airlines, and Practice Fusion (a health technology company) sponsored another to identify patients with type 2 diabetes from their medical records.
- Prizes for winning solutions range from 3,000 to 250,000 USD. One exceptional prize, worth up to 3 million USD, was offered by Heritage Provider Network.
- Everyone has a chance. Any contestant, however remote their location, can measure their talent against the leaders in the field. In addition, contestants can exchange ideas and sharpen their skills in the Kaggle forums. A good programmer can climb the rankings quickly by scoring well in two or three competitions.

- To some extent, Kaggle is a form of "crowdsourcing" that taps a global pool of brainpower to solve a large problem. This way of drawing on the crowd has existed for a decade or more, at least since Wikipedia (or further back, since Linux and similar projects). Companies such as TaskRabbit and oDesk have been creating jobs for the crowd for years. But Kaggle goes further. First, Kaggle participants do not work purely out of goodwill: they want to win and to improve their ranking in order to get better opportunities in the job market. Second, Kaggle does not only create jobs, it also creates a new job market for experts. Unlike traditional temporary workers, Kaggle members are stars.

- The Kaggle ranking has become an important yardstick in the data science community. Companies such as American Express and the New York Times have begun listing a Kaggle ranking as a required credential in their recruitment advertisements. It is not just a badge but an indicator of ability, one that carries more weight than traditional standards of qualification and expertise. Degrees from prestigious universities and work experience at well-known companies such as IBM may count for less than a Kaggle score. In other words, measurable work and your ranking in the market are worth more than where you work. Will the CV (curriculum vitae) eventually become unnecessary?
- Kaggle creates a new kind of labor market in which skills are separated from unreliable credentials such as degrees and resumes. This is truly a major change.

Chapter 3

INTRODUCTION TO THE JOB SALARY PREDICTION COMPETITION

- When posting job vacancies, employers often leave out the salary. For a job seeker this creates a dilemma: either risk wasting valuable time on a job with a low salary, or skip the advertisement and risk missing an excellent opportunity.
- Adzuna is a classified-advertising company in the UK, and most of its advertisements are for jobs. More than half of these advertisements do not list a salary. To provide a better service, Adzuna wants to offer a salary estimate for jobs where the employer has not listed one. To this end, Adzuna organized a Kaggle competition with the goal of improving the prediction of job salaries.
- A successful model will likely combine some analysis of the effect of including different keywords or phrases with the use of structured data fields such as location, contract time, or company. Some of the structured data shown is inferred by Adzuna's own processes, based on where an advertisement came from or on its content, and may not be correct, but it is representative of the real data.

- Participants are given a training dataset for building their model, which includes all the variables (including the salary). A second dataset is used to provide feedback on the public leaderboard. After about 6 weeks, Kaggle releases a final dataset that does not include the salary field. Participants are then asked to submit their salary predictions for each job in it for evaluation.

Chapter 4

JOB SALARY PREDICTION IN R

4.1 Reading the Data
Kaggle provides all of the data in .csv format, so we read it into R with the [Link] function.

To check the column names in the data, we use names().

To check the frequency of values in the data, we use the table() function.

From the table() output we see:
1. full_time, part_time, contract, and permanent are attribute values present in the training data.
2. The numbers show how often each of these values occurs.
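The report's own code for this step is not reproduced in the text, so the following is only a minimal sketch of the three calls. It assumes base R's `read.csv()` and uses a tiny invented stand-in file written to a temporary path; the column names `ContractType`, `ContractTime`, and `SalaryNormalized` and all values are assumptions for illustration.

```r
# Toy stand-in for the competition's training file (values invented)
toy <- data.frame(
  ContractType = c("full_time", "part_time", "full_time", NA),
  ContractTime = c("permanent", "contract", "permanent", "permanent"),
  SalaryNormalized = c(25000, 18000, 32000, 27000)
)
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# Read the CSV file into a data frame
train <- read.csv(csv_path, stringsAsFactors = FALSE)

# Column names of the data
col_names <- names(train)
print(col_names)

# Frequency of each value; useNA = "ifany" also counts missing entries
type_counts <- table(train$ContractType, useNA = "ifany")
time_counts <- table(train$ContractTime)
print(type_counts)
print(time_counts)
```

With the real file, `csv_path` would simply point at the downloaded training CSV.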

4.2 Building Top Sources


4.2.1 Training set
The summary() command gives accurate and complete summary information about the data.

Keep the 10 sources with the highest frequency.

Assign the frequencies to Top Sources.

Create an additional attribute, Other.

Move the frequency of the NA attribute into Other.
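The steps above can be sketched as follows. This is not the report's original code (which is missing from the text) but one plausible way to do it in base R; the vector `source_name`, the site names, and the choice of `n_top <- 2` are invented so the example stays small, whereas the report keeps the top 10.

```r
# Toy source column (names invented); NA marks postings with no recorded source
source_name <- c("siteA", "siteB", "siteA", "siteC", NA,
                 "siteA", "siteB", "siteD", NA)

# Frequency of each source, sorted from most to least frequent
source_freq <- sort(table(source_name), decreasing = TRUE)
print(source_freq)

# Keep the n most frequent sources (the report uses 10)
n_top <- 2
top_sources <- names(source_freq)[seq_len(min(n_top, length(source_freq)))]

# Keep a posting's source if it is a top source; everything else,
# including missing sources (NA), is counted under "Other"
top_source <- ifelse(!is.na(source_name) & source_name %in% top_sources,
                     source_name, "Other")
top_source <- factor(top_source, levels = c(top_sources, "Other"))

top_source_counts <- table(top_source)
print(top_source_counts)
```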

4.2.2 Test set (same as the training set)


Assign the frequencies to Top Sources.

Create an additional attribute, Other.

Move the frequency of the NA attribute into Other.
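The report only says the test set is handled in the same way. One design choice worth considering, shown here as an assumption rather than as the authors' method, is to reuse the list of top sources found on the training set, so that both sets end up with identical factor levels when the model is later applied. The helper `to_top_levels()` and all values are hypothetical.

```r
# Hypothetical helper: keep values found in top_levels, send the rest (and NA) to "Other"
to_top_levels <- function(values, top_levels) {
  kept <- ifelse(!is.na(values) & values %in% top_levels, values, "Other")
  factor(kept, levels = c(top_levels, "Other"))
}

# Assumed to be the top sources found on the training set
top_sources <- c("siteA", "siteB")

# Toy test-set sources, including an unseen site and a missing value
test_source <- c("siteB", "siteZ", NA, "siteA")
test_top_source <- to_top_levels(test_source, top_sources)
print(test_top_source)
```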

4.3 Building Top Location


4.3.1 Training set
Count the frequency of each location attribute and store the result in [Link].

Keep the top 10 locations with the highest frequency.

Assign the frequencies to the Top Locations.

Create the attribute Other.

Move the frequency of the NA attribute into the Other attribute.

Add the frequency of the UK attribute to the Other attribute.
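A minimal sketch of these location steps, under the same caveats as before: the data, the variable names, and the cutoff of 3 are invented, and the report's own code is not available in the text. The extra step compared with the sources is that the generic location "UK" is folded into Other even though it is frequent.

```r
# Toy location column (values invented); NA marks a missing location
location <- c("London", "UK", "London", "Leeds", NA, "UK",
              "Manchester", "London", "Leeds", "UK", "London")

# Frequency of each location, most frequent first
location_freq <- sort(table(location), decreasing = TRUE)

# Keep the most frequent locations (the report keeps 10; 3 here)
top_locations <- names(location_freq)[seq_len(min(3, length(location_freq)))]

# "UK" says little about where the job actually is, so it is dropped from
# the top list and its postings are counted under "Other"
specific_top <- setdiff(top_locations, "UK")

top_location <- ifelse(!is.na(location) & location %in% specific_top,
                       location, "Other")
top_location <- factor(top_location, levels = c(specific_top, "Other"))

location_counts <- table(top_location)
print(location_counts)
```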

4.3.2 Test set


Assign the frequencies to the Top Locations.

Create the attribute Other.

Move the frequency of the NA attribute into the Other attribute.

Add the frequency of the UK attribute to the Other attribute.

4.4 Building the Model


Build a model of the salary (SalaryNormalized) based on the job category (Category), the contract term (ContractTime), Top Location, and Top Sources.
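The statistics discussed in this section (residuals, Sum Sq, F value, R-squared) are what R reports for a linear model, so the sketch below assumes the model was fitted with `lm()` and inspected with `anova()` and `summary()`. The data are randomly generated, and the column names `TopLocation` and `TopSource` are assumptions; only `SalaryNormalized`, `Category`, and `ContractTime` are named in the report.

```r
set.seed(1)
n <- 200

# Synthetic training data (all values invented for illustration)
train <- data.frame(
  Category     = factor(sample(c("IT Jobs", "Teaching Jobs", "Sales Jobs"), n, replace = TRUE)),
  ContractTime = factor(sample(c("permanent", "contract"), n, replace = TRUE)),
  TopLocation  = factor(sample(c("London", "Leeds", "Other"), n, replace = TRUE)),
  TopSource    = factor(sample(c("siteA", "siteB", "Other"), n, replace = TRUE))
)
train$SalaryNormalized <- 25000 +
  5000 * (train$Category == "IT Jobs") +
  3000 * (train$TopLocation == "London") +
  rnorm(n, sd = 4000)

# Linear model of salary on the four categorical predictors
model <- lm(SalaryNormalized ~ Category + ContractTime + TopLocation + TopSource,
            data = train)

# ANOVA table: Df, Sum Sq, Mean Sq, F value, Pr(>F) for each term
model_anova <- anova(model)
print(model_anova)

# Residual summary, residual standard error, R-squared, adjusted R-squared
fit_summary <- summary(model)
print(fit_summary)
```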


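1. Residuals: the difference between the actual value and the predicted value. We expect them to be close to 0 for the model to be accurate, but in practice they still range from Min to Max.

2. Residual Standard Error: computed as $\sqrt{0.16} = 0.4$, the square root of the residual mean square (0.1627 before rounding). The number 244664 is the residual degrees of freedom of the model fitted on the training set.
3. Multiple R-squared: shows that the model's terms account for 32.69% of the total variation in the response.

\[
R^2 = \frac{4559 + 6788 + 7990}{4559 + 6788 + 7990 + 39810} = 0.3269
\]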

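4. Adjusted R-squared: R-squared adjusted for the degrees of freedom, which reflects the improvement offered by the model. Here $s^2$ is the total sum of squares divided by the total degrees of freedom, and 0.1627 is the residual mean square.

\[
s^2 = \frac{4559 + 6788 + 7990 + 39810}{8 + 10 + 85 + 244664} = 0.2416
\]

\[
R^2_{\text{adj}} = \frac{s^2 - 0.1627}{s^2} = 0.3266
\]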

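5. Df (degrees of freedom): the degrees of freedom.

6. Sum Sq: the sum of squares.

7. Mean Sq: the mean square.
8. F value: the mean square of a term divided by the residual mean square, for example

\[
F = \frac{569.91}{0.1627} \approx 3502.58
\]

9. Pr(>F): the p-value of the F test.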
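The arithmetic in the list above can be re-checked in a few lines of R. The sketch uses only the sums of squares, degrees of freedom, and mean square quoted in the report; the variable names are chosen here for illustration. Because the inputs are rounded, the F value it prints differs slightly from the reported 3502.58.

```r
# Figures quoted in the report
ss_terms <- c(4559, 6788, 7990)   # sums of squares of the model terms
df_terms <- c(8, 10, 85)          # their degrees of freedom
ss_resid <- 39810                 # residual sum of squares
df_resid <- 244664                # residual degrees of freedom

ms_resid <- ss_resid / df_resid                          # residual mean square
rse      <- sqrt(ms_resid)                               # residual standard error
r2       <- sum(ss_terms) / (sum(ss_terms) + ss_resid)   # multiple R-squared
s2_total <- (sum(ss_terms) + ss_resid) / (sum(df_terms) + df_resid)
r2_adj   <- (s2_total - ms_resid) / s2_total             # adjusted R-squared
f_first  <- 569.91 / ms_resid                            # F value of the first term

print(round(c(ms_resid = ms_resid, rse = rse, r2 = r2,
              s2 = s2_total, r2_adj = r2_adj), 4))
print(f_first)
```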

Run the salary prediction step.

Create the output.

Write out the output.
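These last three steps can be sketched with base R's `predict()` and `write.csv()`. Everything here is a toy setup under stated assumptions: the data are invented, the model uses a single predictor to keep the example short, and the output layout (an `Id` column next to the predicted `SalaryNormalized`) is an assumed submission format rather than one confirmed by the report.

```r
# Toy training and test data (values invented)
train <- data.frame(
  Id = 1:6,
  ContractTime = factor(c("permanent", "contract", "permanent",
                          "contract", "permanent", "contract")),
  SalaryNormalized = c(30000, 22000, 31000, 21000, 29000, 23000)
)
test <- data.frame(
  Id = 101:103,
  ContractTime = factor(c("contract", "permanent", "permanent"),
                        levels = levels(train$ContractTime))
)

# Fit on the training set, then predict salaries for the test set
model <- lm(SalaryNormalized ~ ContractTime, data = train)
predicted <- predict(model, newdata = test)

# Create the output table
output <- data.frame(Id = test$Id, SalaryNormalized = predicted)

# Write the output to a CSV file (temporary path used for the example)
out_path <- tempfile(fileext = ".csv")
write.csv(output, out_path, row.names = FALSE)
print(output)
```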


THE END.

