
Faculty of Computer Science & Engineering

Ho Chi Minh City University of Technology

Chapter 3: Data Regression

Data Mining

Semester 1, 2009-2010

Contents

3.1. Overview of regression
3.2. Linear regression
3.3. Nonlinear regression
3.4. Applications
3.5. Issues with regression
3.6. Summary

References

[1] Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques, Second Edition, Morgan Kaufmann Publishers, 2006.

[2] David Hand, Heikki Mannila, Padhraic Smyth, Principles of Data Mining, MIT Press, 2001.

[3] David L. Olson, Dursun Delen, Advanced Data Mining Techniques, Springer-Verlag, 2008.

[4] Graham J. Williams, Simeon J. Simoff, Data Mining: Theory, Methodology, Techniques, and Applications, Springer-Verlag, 2006.

[5] ZhaoHui Tang, Jamie MacLennan, Data Mining with SQL Server 2005, Wiley Publishing, 2005.

[6] Oracle, Data Mining Concepts, B28129-01, 2008.

[7] Oracle, Data Mining Application Developer's Guide, B28131-01, 2008.

3.0. Scenario 1

What will the STB share price be tomorrow?

3.0. Scenario 2

[Figure: plot of y against x showing the line y = x + 1, with axis marks X1 and Y1]

What model describes the distribution of y with respect to x?

3.0. Scenario 3

Market basket analysis: what associations exist among the items?

3.0. Scenario 4

Survey of the factors that affect the tendency to use online advertising in Vietnam

Perceived entertainment (+0.209)

Information quality (+0.261)

Perceived information quality (+0.199)

Perceived irritation (-0.175)

Perceived credibility

Attitude toward privacy

Interactivity (+0.373)

Subjective norm (+0.254)

Perceived behavioral control (+0.377)

3.0. Scenarios

Regression

Predictive data mining

Which scenarios?

Descriptive data mining

Which scenarios?

3.1. Overview of regression

Definition: regression

J. Han et al. (2001, 2006): Regression is a statistical technique for predicting continuous (numeric) values.

Wiki (2009): Regression (regression analysis) is a statistical technique for estimating the relationships among variables.

R. D. Snee (1977): Regression (regression analysis) is a statistical technique, used in data analysis and in building models from empirical data, in which the discovered regression model serves for prediction, for control, or for learning the mechanism that generated the data.

3.1. Overview of regression

Regression model: a model that describes the relationship between a set of predictor variables (independent variables) and one or more responses (dependent variables).

Classification

Linear and nonlinear regression

Single (one predictor) and multiple (several predictors) regression

Parametric, nonparametric, and semiparametric regression

Symmetric and asymmetric regression

3.1. Overview of regression

Regression equation: Y = f(X, β)

X: the predictor (independent) variables

Y: the responses (dependent variables)

β: the regression coefficients

X is used to explain the variation in the responses Y.

Y is used to describe the phenomena of interest that are to be explained.

The relationship between Y and X is expressed as a functional dependence of Y on X.

β describes the influence of X on Y.

3.1. Overview of regression

Classification

Linear and nonlinear regression
Linear in parameters: Y is formed by a linear combination of the parameters
Nonlinear in parameters: Y is formed by a nonlinear combination of the parameters

Single and multiple regression
Single: X = (X1)
Multiple: X = (X1, X2, ..., Xk)

Parametric, nonparametric, and semiparametric regression
Parametric: a regression model with a finite number of parameters
Nonparametric: a regression model with an infinite number of parameters
Semiparametric: a regression model with a finite number of parameters of interest

Symmetric and asymmetric regression
Symmetric: descriptive regression models (e.g., log-linear models)
Asymmetric: predictive regression models (e.g., generalized linear models)

3.2. Linear regression

Simple (single-variable) linear regression

Multiple linear regression

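3.2.1. Simple linear regression

Given N observed objects, the simple linear regression model takes one of the following forms, where the error term $\varepsilon_i$ captures the part of the variation in the response Y that X does not explain:

- Straight line: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, $i = 1, \dots, N$

- Parabola: $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i$, $i = 1, \dots, N$

The parabola still counts as linear regression because the model is linear in the parameters β.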

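3.2.1. Simple linear regression

$Y = \beta_0 + \beta_1 X_1$, for example $Y = 0.636 + 2.018 X$

The sign of $\beta_1$ indicates the direction of the influence of X on Y: when $\beta_1$ is positive, Y increases with X; when it is negative, Y decreases as X increases.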

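3.2.1. Simple linear regression

Estimating the parameters $(\beta_0, \beta_1)$ to obtain the simple linear regression model

Residual: $e_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$

Sum of squared residuals, to be minimized: $SSR = \sum_{i=1}^{N} e_i^2$

Estimated values of β: $\hat{\beta}_1 = \dfrac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}$, $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$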
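The least-squares estimates for the single-variable case can be sketched in a few lines of Python. This is a minimal illustration and not the course's own code: the function name `fit_simple_ols` and the five data points are hypothetical, chosen only to show how the slope, the intercept, the residuals, and the sum of squared residuals are computed.

```python
def fit_simple_ols(xs, ys):
    """Return (b0, b1) minimizing the sum of squared residuals of y = b0 + b1*x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b0 = y_mean - b1 * x_mean
    return b0, b1


# Hypothetical observations, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_simple_ols(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
ssr = sum(e * e for e in residuals)

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, SSR = {ssr:.4f}")
```

For these made-up points the fitted slope is positive, so the fitted Y increases with X.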

3.2.2. Multiple linear regression

Multiple linear regression analyzes the relationship between a dependent variable (response) and two or more independent variables:

$y_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \dots + b_k x_{ik}$

i = 1..n, where n is the number of observed objects
k = number of independent variables (attributes, criteria, or factors)
Y = dependent variable
X = independent variables
b0 = value of Y when all X = 0
b1..bk = values of the regression coefficients

3.2.2. Multiple linear regression

Estimated value of Y, followed by the estimated value of the parameter vector b:

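$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$

$b = (X^{T} X)^{-1} X^{T} Y$

where

$Y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}, \quad X = \begin{bmatrix} 1 & x_{1,1} & x_{1,2} & \cdots & x_{1,k} \\ 1 & x_{2,1} & x_{2,2} & \cdots & x_{2,k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n,1} & x_{n,2} & \cdots & x_{n,k} \end{bmatrix}, \quad b = \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_k \end{bmatrix}$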

3.2.2. Multiple linear regression

Example: A sales manager at Tackey Toys needs to predict sales of Tackey products in selected market areas. He believes that advertising expenditures and the population of each market area can be used to predict sales. He gathered a sample of toy sales, advertising expenditures, and population figures, shown below. Find the multiple linear regression equation that best fits the data.

3.2.2. Multiple linear regression


Sample data, one row per market area:

Advertising expenditures    Population     Toy sales
(thousands of dollars)      (thousands)    (thousands of dollars)
x1                          x2             y
 1.0                        200            100
 5.0                        700            300
 8.0                        800            400
 6.0                        400            200
 3.0                        100            100
10.0                        600            400

3.2.2. Multiple linear regression

$\hat{y} = 6.3972 + 20.4921 x_1 + 0.2805 x_2$
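As a check on the fitted equation, the following Python sketch solves the normal equations (X^T X) b = X^T Y for the six rows of the Tackey Toys data by Gaussian elimination. The helper name `fit_multiple_ols` is an assumption made for illustration; a real analysis would more likely rely on a numerical library than on hand-written elimination.

```python
def fit_multiple_ols(rows, ys):
    """Solve the normal equations (X^T X) b = X^T y by Gaussian elimination."""
    # Design matrix: a leading column of ones for the intercept b0.
    X = [[1.0] + [float(v) for v in row] for row in rows]
    p = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(r[i] * r[j] for r in X) for j in range(p)]
        + [sum(r[i] * y for r, y in zip(X, ys))]
        for i in range(p)
    ]
    # Forward elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, p):
            factor = A[r][col] / A[col][col]
            for c in range(col, p + 1):
                A[r][c] -= factor * A[col][c]
    # Back substitution.
    b = [0.0] * p
    for i in reversed(range(p)):
        tail = sum(A[i][j] * b[j] for j in range(i + 1, p))
        b[i] = (A[i][p] - tail) / A[i][i]
    return b


# Tackey Toys data: (advertising x1, population x2) and toy sales y.
predictors = [(1.0, 200), (5.0, 700), (8.0, 800), (6.0, 400), (3.0, 100), (10.0, 600)]
sales = [100, 300, 400, 200, 100, 400]

b0, b1, b2 = fit_multiple_ols(predictors, sales)
print(f"y = {b0:.4f} + {b1:.4f}*x1 + {b2:.4f}*x2")
```

The computed coefficients agree with the equation on the slide to the precision shown there.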

3.3. Nonlinear regression

Y = f(X, β)

Y is a nonlinear function of the parameters β.

Examples: exponential, logarithmic, and Gaussian functions, among others

Finding the optimal parameter set β: optimization algorithms

Local optimization

Global optimization of the sum of squared residuals
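One possible local optimization scheme is Gauss-Newton with step halving, sketched below for the exponential model y = a*exp(b*x). The slides do not prescribe a particular algorithm, so the choice of method, the function names, the starting point, and the noise-free data (generated from a = 2, b = 0.5) are all assumptions made for illustration.

```python
import math


def sum_squared_residuals(a, b, xs, ys):
    return sum((y - a * math.exp(b * x)) ** 2 for x, y in zip(xs, ys))


def fit_exponential(xs, ys, a=1.0, b=0.3, max_iter=100):
    """Fit y = a*exp(b*x) by Gauss-Newton steps, halving any step that fails to reduce the SSR."""
    for _ in range(max_iter):
        e = [math.exp(b * x) for x in xs]
        r = [y - a * ei for y, ei in zip(ys, e)]        # residuals
        ja = e                                          # d(model)/da
        jb = [a * x * ei for x, ei in zip(xs, e)]       # d(model)/db
        # 2x2 normal equations (J^T J) delta = J^T r, solved by Cramer's rule.
        saa = sum(u * u for u in ja)
        sab = sum(u * v for u, v in zip(ja, jb))
        sbb = sum(v * v for v in jb)
        ga = sum(u * ri for u, ri in zip(ja, r))
        gb = sum(v * ri for v, ri in zip(jb, r))
        det = saa * sbb - sab * sab
        if det == 0.0:
            break
        da = (sbb * ga - sab * gb) / det
        db = (saa * gb - sab * ga) / det
        # Step halving: accept the step only if it lowers the SSR.
        current = sum_squared_residuals(a, b, xs, ys)
        step = 1.0
        while step > 1e-10 and sum_squared_residuals(
            a + step * da, b + step * db, xs, ys
        ) >= current:
            step /= 2.0
        if step <= 1e-10:
            break  # no improving step found: treat as converged
        a += step * da
        b += step * db
    return a, b


# Synthetic, noise-free data generated from a = 2, b = 0.5 (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]

a_hat, b_hat = fit_exponential(xs, ys)
print(f"a = {a_hat:.6f}, b = {b_hat:.6f}")
```

Because this is a local method, the result depends on the starting point; a global strategy would typically restart it from several initial guesses and keep the fit with the smallest sum of squared residuals.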

3.4. Applications

The data mining process

Data preprocessing stage

Data mining stage

Descriptive data mining

Predictive data mining

Application areas: biology, agriculture, social issues, economics, business, and others

3.5. Issues with regression

The assumptions that accompany a regression problem.

The volume of data to be processed.

Evaluating the regression model.

Advanced techniques for regression:

Artificial Neural Network (ANN)

Support Vector Machine (SVM)
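The slides do not say which evaluation measures are intended. Two common choices are the coefficient of determination R^2 and the root-mean-square error (RMSE). The sketch below computes both for the Tackey Toys equation from section 3.2.2, using the coefficients and data given there; the function name `evaluate_fit` is hypothetical.

```python
import math


def evaluate_fit(ys, predictions):
    """Return (R^2, RMSE) for observed values ys and model predictions."""
    n = len(ys)
    mean_y = sum(ys) / n
    # Sum of squared residuals: variation the model leaves unexplained.
    sse = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    # Total sum of squares: variation of y around its mean.
    sst = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - sse / sst
    rmse = math.sqrt(sse / n)
    return r_squared, rmse


# Tackey Toys data and the fitted equation from section 3.2.2.
predictors = [(1.0, 200), (5.0, 700), (8.0, 800), (6.0, 400), (3.0, 100), (10.0, 600)]
sales = [100, 300, 400, 200, 100, 400]
predicted = [6.3972 + 20.4921 * x1 + 0.2805 * x2 for x1, x2 in predictors]

r_squared, rmse = evaluate_fit(sales, predicted)
print(f"R^2 = {r_squared:.4f}, RMSE = {rmse:.2f} (thousands of dollars)")
```

Both numbers are computed on the same six rows used for fitting, so they describe goodness of fit and not predictive accuracy on new market areas.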

3.6. Summary

Regression

A statistical technique, applied to continuous attributes (features)

Has a long history of development

Simple but very useful, and widely applied

Shows the significant contribution of statistics to the field of data mining

Types of regression model: linear/nonlinear, single/multiple, parametric/nonparametric/semiparametric, symmetric/asymmetric

Questions & Answers
