Summer Course: Data Mining
Regression Analysis
Presenter: Georgi Nalbantov
Regression analysis: definition and examples
Classical Linear Regression
LASSO and Ridge Regression (linear and nonlinear)
Nonparametric (local) regression estimation:
kNN for regression, Decision trees, Smoothers
Support Vector Regression (linear and nonlinear)
Variable/feature selection (AIC, BIC, adjusted R^2)
[Figure: three example tasks shown as scatter plots in the (X1, X2) plane.]
• Clustering: kth Nearest Neighbour, Parzen Window, Unfolding, Conjoint Analysis, CatPCA
• Classification: Linear Discriminant Analysis, QDA, Logistic Regression (Logit), Decision Trees, LSSVM, NN, VS
• Regression: Classical Linear Regression, Ridge Regression, NN, CART
Given data on n explanatory variables and 1 explained variable, where the explained variable can take real values in $\mathbb{R}$, find a function that gives the "best" fit:

Given: $(x_1, y_1), \ldots, (x_m, y_m) \in \mathbb{R}^n \times \mathbb{R}$

Find: $f: \mathbb{R}^n \to \mathbb{R}$

"Best function" = one for which the expected error on unseen data $(x_{m+1}, y_{m+1}), \ldots, (x_{m+k}, y_{m+k})$ is minimal.
Classical Linear Regression (OLS)
Explanatory and Response Variables are Numeric
The relationship between the mean of the response variable and the level of the explanatory variable is assumed to be approximately linear (a straight line).

Model: $Y = \beta_0 + \beta_1 x + \varepsilon$, where $\varepsilon \sim N(0, \sigma^2)$

• $\beta_1 > 0$: Positive Association
• $\beta_1 < 0$: Negative Association
• $\beta_1 = 0$: No Association
Classical Linear Regression (OLS)
• $\beta_0$: mean response when $x = 0$ (the y-intercept)
• $\beta_1$: change in the mean response when $x$ increases by 1 unit (the slope)
• $\beta_0$, $\beta_1$ are unknown parameters (like $\mu$)
• $\beta_0 + \beta_1 x$: mean response when the explanatory variable takes on the value $x$

Task: minimize the sum of squared errors. The fitted line is $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$, and

$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$
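As a quick illustration (not from the slides), the minimizer has the closed form $\hat{\beta}_1 = S_{xy}/S_{xx}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$, which takes only a few lines of numpy; the data here are made up:

```python
import numpy as np

# Toy data: one explanatory variable x, one response y (illustrative values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form OLS estimates: beta1_hat = S_xy / S_xx, beta0_hat = ybar - beta1_hat * xbar
x_bar, y_bar = x.mean(), y.mean()
S_xx = np.sum((x - x_bar) ** 2)
S_xy = np.sum((x - x_bar) * (y - y_bar))
beta1_hat = S_xy / S_xx
beta0_hat = y_bar - beta1_hat * x_bar

# Fitted values and the minimized sum of squared errors (SSE)
y_hat = beta0_hat + beta1_hat * x
sse = np.sum((y - y_hat) ** 2)
print(f"beta0 = {beta0_hat:.3f}, beta1 = {beta1_hat:.3f}, SSE = {sse:.3f}")
```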
Classical Linear Regression (OLS)
Parameter: slope in the population model ($\beta_1$)
Estimator: least squares estimate $\hat{\beta}_1$
Estimated standard error: $SE(\hat{\beta}_1) = s / \sqrt{S_{xx}}$
Methods of making inference regarding the population:
• Hypothesis tests (2-sided or 1-sided)
• Confidence intervals
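As a sketch of this inference step, scipy's `linregress` (assuming scipy is available) returns the slope, its estimated standard error $s/\sqrt{S_{xx}}$, and the two-sided p-value for $H_0: \beta_1 = 0$; the confidence interval then follows from the $t$ distribution with $n-2$ degrees of freedom:

```python
import numpy as np
from scipy import stats

# Same toy data as before (illustrative)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

res = stats.linregress(x, y)                 # OLS fit plus slope inference
t_crit = stats.t.ppf(0.975, df=len(x) - 2)   # 2-sided 95% critical value
ci = (res.slope - t_crit * res.stderr, res.slope + t_crit * res.stderr)

# res.stderr is the estimated standard error of the slope, s / sqrt(S_xx)
print(f"slope = {res.slope:.3f}, SE = {res.stderr:.4f}, p = {res.pvalue:.4g}")
print(f"95% CI for beta1: ({ci[0]:.3f}, {ci[1]:.3f})")
```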
Classical Linear Regression (OLS)
Coefficient of determination (r ^{2} ) :
proportion of variation in y “explained” by the regression on x.
$r^2 = \dfrac{S_{yy} - SSE}{S_{yy}}, \qquad 0 \le r^2 \le 1$

where $S_{yy} = \sum (y - \bar{y})^2$ and $SSE = \sum (y - \hat{y})^2$.
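The decomposition translates directly to code; a short continuation of the earlier toy example:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# OLS fit (closed form, as before)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
y_hat = beta0 + beta1 * x

S_yy = np.sum((y - y.mean()) ** 2)   # total variation in y
SSE = np.sum((y - y_hat) ** 2)       # variation left unexplained
r2 = (S_yy - SSE) / S_yy             # proportion explained, between 0 and 1
print(f"r^2 = {r2:.4f}")
```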
Classical Linear Regression (OLS):
Multiple regression
• Numeric response variable ($y$)
• $p$ numeric predictor variables

Model: $Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon$

Partial regression coefficients: $\beta_i$ = effect (on the mean response) of increasing the $i$-th predictor variable by 1 unit, holding all other predictors constant.
Classical Linear Regression (OLS):
Ordinary Least Squares estimation
• Population model for the mean response:

$E(Y \mid x_1, \ldots, x_p) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$

• Least squares fitted (predicted) equation, minimizing SSE:

$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_p x_p, \qquad SSE = \sum (Y - \hat{Y})^2$
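A minimal multiple-regression sketch (made-up data, illustrative names): stack a column of ones for the intercept and let `numpy.linalg.lstsq` minimize the SSE:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))                        # p numeric predictors
y = 0.7 + X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=n)

# Design matrix with an intercept column; lstsq minimizes ||y - A b||^2 = SSE
A = np.column_stack([np.ones(n), X])
beta_hat, residuals, rank, _ = np.linalg.lstsq(A, y, rcond=None)

# residuals holds the minimized SSE (length-1 array when A has full column rank)
print("beta_hat (intercept first):", np.round(beta_hat, 3))
print("SSE:", residuals)
```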
Classical Linear Regression (OLS):
Ordinary Least Squares estimation
• Model: $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_p x_p$

• OLS estimation: $\min \ SSE = \sum (Y - \hat{Y})^2$

• LASSO estimation: $\min \ \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 + \lambda \sum_{j=1}^{p} |\hat{\beta}_j|$

• Ridge regression estimation: $\min \ \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 + \lambda \sum_{j=1}^{p} \hat{\beta}_j^2$
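A hedged sketch of the three fits with scikit-learn (assuming it is installed); `alpha` plays the role of $\lambda$, though note that scikit-learn's Lasso scales the squared-error term by $1/(2n)$:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
beta = np.zeros(10)
beta[:3] = [2.0, -1.5, 1.0]                 # only 3 of 10 predictors matter
y = X @ beta + rng.normal(scale=0.5, size=100)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)          # L1 penalty: lambda * sum |beta_j|
ridge = Ridge(alpha=1.0).fit(X, y)          # L2 penalty: lambda * sum beta_j^2

print("OLS:  ", np.round(ols.coef_, 2))
print("LASSO:", np.round(lasso.coef_, 2))   # irrelevant coefficients set exactly to 0
print("Ridge:", np.round(ridge.coef_, 2))   # all coefficients shrunk, none exactly 0
```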
Nonparametric (local) regression estimation:
kNN, Decision trees, smoothers
How to choose $k$ or $h$?
• When $k$ or $h$ is small, single instances matter: bias is small, variance is large (undersmoothing), high complexity.
• As $k$ or $h$ increases, we average over more instances, so variance decreases but bias increases (oversmoothing), low complexity.
Cross-validation is used to fine-tune $k$ or $h$.
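To make the role of $k$ concrete, here is a minimal 1-D kNN regressor in plain numpy (everything here is illustrative):

```python
import numpy as np

def knn_regress(x_train, y_train, x_query, k):
    """Predict at x_query as the mean response of the k nearest training points."""
    dists = np.abs(x_train - x_query)       # 1-D case: distance is |x_i - x|
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 10, 40))
y = np.sin(x) + rng.normal(scale=0.2, size=40)

# Small k: low bias, high variance; large k: high bias, low variance
for k in (1, 5, 20):
    pred = knn_regress(x, y, x_query=5.0, k=k)
    print(f"k={k:2d}: prediction at x=5.0 is {pred:.3f} (true value {np.sin(5.0):.3f})")
```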
Linear Support Vector Regression
[Figure: the same scatter of points (x-axis: Expenditures) fitted with three "tubes" of different widths.]
• Biggest area: "lazy case" (underfitting)
• Small area: "suspiciously smart case" (overfitting)
• Middle-sized area: "compromise case", SVR (good generalisation); the points lying on the tube are the "support vectors"
The thinner the "tube", the more complex the model.
Nonlinear Support Vector Regression
Map the data into a higher-dimensional space:
[Figure: scatter of points (x-axis: Expenditures) that a linear tube fits after the nonlinear mapping.]
Nonlinear Support Vector Regression:
Technicalities
The SVR function: $f(x) = \sum_{i=1}^{m} (\alpha_i - \alpha_i^*) \, K(x_i, x) + b$

To find the unknown parameters of the SVR function, solve:

$\min_{w, b, \xi, \xi^*} \ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} (\xi_i + \xi_i^*)$

Subject to:

$y_i - \langle w, \phi(x_i)\rangle - b \le \varepsilon + \xi_i, \qquad \langle w, \phi(x_i)\rangle + b - y_i \le \varepsilon + \xi_i^*, \qquad \xi_i, \xi_i^* \ge 0$

How to choose $C$, $\varepsilon$, $\sigma$? With $K$ = RBF kernel, $K(x_i, x) = \exp\!\left(-\|x_i - x\|^2 / \sigma^2\right)$:

Find $C$, $\varepsilon$, and $\sigma$ from a cross-validation procedure.
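As a usage sketch, scikit-learn's `SVR` implements this formulation, with the RBF kernel parameterized by `gamma` rather than $\sigma$ (gamma corresponds to $1/\sigma^2$); the data and parameter values below are made up:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(3)
X = np.sort(rng.uniform(0, 10, 60)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=60)

# epsilon sets the width of the "tube"; C trades off flatness against tube violations
model = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=0.5).fit(X, y)

print("number of support vectors:", len(model.support_))
print("prediction at x = 5:", model.predict([[5.0]]))
```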
SVR Technicalities: Model Selection
Do 5-fold cross-validation to find $C$ and $\varepsilon$ for several fixed values of $\sigma$.

[Figure: cross-validated mean squared error (CV-MSE) surface over the parameter grid.]
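The slide's search can be sketched with `GridSearchCV`, scoring by negative MSE so that the best score corresponds to the lowest CV-MSE; the grid values below are illustrative, not the ones behind the plot:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(4)
X = rng.uniform(0, 10, (80, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=80)

param_grid = {
    "C": [0.1, 1.0, 10.0, 100.0],
    "epsilon": [0.01, 0.05, 0.1, 0.2],
    "gamma": [0.1, 0.5, 1.0],        # each gamma corresponds to one fixed sigma
}
search = GridSearchCV(SVR(kernel="rbf"), param_grid,
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)

print("best parameters:", search.best_params_)
print("CV-MSE at the optimum:", -search.best_score_)
```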
Variable selection for regression
Akaike Information Criterion (AIC). Final prediction error:

$\mathrm{AIC} = -2 \ln \hat{L} + 2p$

which for a linear model with Gaussian errors reduces (up to a constant) to $n \ln(SSE/n) + 2p$, where $p$ is the number of estimated parameters.
Variable selection for regression
Bayesian Information Criterion (BIC), also known as the Schwarz criterion. Final prediction error:

$\mathrm{BIC} = -2 \ln \hat{L} + p \ln n$

BIC tends to choose simpler models than AIC, since its penalty $p \ln n$ exceeds AIC's $2p$ whenever $n > e^2 \approx 7.4$.
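Under Gaussian errors both criteria reduce to $n \ln(SSE/n)$ plus their penalty term (constants dropped), so comparing nested models is a few lines; a hedged sketch on made-up data:

```python
import numpy as np

def aic_bic(y, y_hat, p):
    """AIC and BIC for a Gaussian linear model with p estimated coefficients.
    Additive constants are dropped, so only differences between models matter."""
    n = len(y)
    sse = np.sum((y - y_hat) ** 2)
    return n * np.log(sse / n) + 2 * p, n * np.log(sse / n) + p * np.log(n)

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(size=100)   # only 2 predictors matter

# Fit models using the first p predictors plus an intercept
for p in (1, 2, 5):
    A = np.column_stack([np.ones(100), X[:, :p]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    aic, bic = aic_bic(y, A @ beta, p + 1)
    print(f"p={p}: AIC={aic:8.2f}  BIC={bic:8.2f}")
```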
Conclusion / Summary / References
• Classical Linear Regression: any introductory statistical/econometric book
• LASSO and Ridge Regression (linear and nonlinear): http://www-stat.stanford.edu/~tibs/lasso.html; Bishop, 2006
• Nonparametric (local) regression estimation (kNN for regression, Decision trees, Smoothers): Alpaydin, 2004; Hastie et al., 2001
• Support Vector Regression (linear and nonlinear): Smola and Schoelkopf, 2003
• Variable/feature selection (AIC, BIC, adjusted R^2): Hastie et al., 2001; any statistical/econometric book