• Linear regression: linear models, a bi-dimensional example, linear regression and higher dimensionality, Ridge, Lasso and Elastic Net, robust regression with random sample consensus (RANSAC), polynomial regression, isotonic regression.
• Logistic regression: linear classification, logistic regression, implementation and optimizations, stochastic gradient descent algorithms, finding the optimal hyperparameters through grid search, classification metrics, ROC curve.
Linear Regression
• Linear regression is a linear model, i.e. a model that assumes a linear relationship between the input variables (x) and the single output variable (y).
• More specifically, it assumes that y can be calculated from a linear combination of the input variables (x).
• When there is a single input variable (x), the method is referred to as simple linear regression.
• When there are multiple input variables, the statistics literature often refers to the method as multiple linear regression.
Simple Linear Regression
• In a simple regression problem (a single x and a single y), the form of the model would be:

y = b0 + b1 * x1

where b0 is the constant (intercept) and b1 is the coefficient (slope).
• Example: SALARY = b0 + b1 * EXPERIENCE, where b1 is the salary increase (e.g. +10K) for each additional year of experience (+1 yr).
Simple Linear Regression
ANALYZING THE DATASET
[Figure: scatter plot of the dataset, with the independent variable (IV) on the x-axis and the dependent variable (DV) on the y-axis.]
Simple Linear Regression
• LET'S CODE! (see the sketch below)
• Prepare your data preprocessing template
• Import the dataset
• No need to handle missing data
• Split into training and testing sets
• Keep feature scaling in the template, though it is least preferred here
• Correlate salaries with experience
• Then carry out prediction
• Verify the predicted values
• Prediction on the TEST SET
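• A minimal sketch of these steps with scikit-learn; the file name Salary_Data.csv and its column names are assumptions, not given in the slides:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Import the dataset (hypothetical file and column names)
dataset = pd.read_csv('Salary_Data.csv')
X = dataset[['YearsExperience']].values   # IV: experience
y = dataset['Salary'].values              # DV: salary

# Split into training and testing sets; no missing-data handling or
# feature scaling is needed for a single unscaled feature
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit the model on the training set
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predict on the test set and inspect the learned line
y_pred = regressor.predict(X_test)
print(regressor.intercept_, regressor.coef_)   # b0, b1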
Example 2
• Let’s make this concrete with an example.
Imagine we are predicting weight (y) from
height (x). Our linear regression model
representation for this problem would be:
y = B0 + B1 * x1
or
weight = B0 + B1 * height
• Where B0 is the bias coefficient and B1 is the coefficient for
the height column. We use a learning technique to find a
good set of coefficient values. Once found, we can plug in
different height values to predict the weight.
• For example, let's use B0 = 0.1 and B1 = 0.5. Let's plug them in and calculate the weight (in kilograms) for a person with a height of 182 centimeters.
weight = 0.1 + 0.5 * 182
weight = 91.1
• You can see that the above equation could be plotted as a line in two dimensions. B0 is our starting point (the intercept) regardless of what height we have.
• We can run through a bunch of heights from 100 to 250 centimeters, plug them into the equation, and get weight values, creating our line.
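• A quick sketch of this worked example, reusing B0 = 0.1 and B1 = 0.5 from the text:

import numpy as np

b0, b1 = 0.1, 0.5
heights = np.arange(100, 251, 10)   # heights from 100 to 250 cm
weights = b0 + b1 * heights         # predicted weights in kg: our line
print(b0 + b1 * 182)                # 91.1, as computed above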
Multiple Linear Regression

y = b0 + b1 * x1 + b2 * x2 + ... + bn * xn

where b0 is the constant, b1 ... bn are the coefficients, x1 ... xn are the independent variables (IVs), and y is the dependent variable (DV).
Multiple linear regression analysis
makes several key assumptions:
• Multivariate Normality–Multiple regression assumes that the
residuals are normally distributed.
• No Multicollinearity—Multiple regression assumes that the
independent variables are not highly correlated with each
other. This assumption is tested using Variance Inflation
Factor (VIF) values.
• Homoscedasticity–This assumption states that the variance of the error terms is similar across the values of the independent variables. A plot of standardized residuals versus predicted values can show whether points are equally distributed across all values of the independent variables.
• Multiple linear regression requires at least
two independent variables, which can be
nominal, ordinal, or interval/ratio level
variables.
• A rule of thumb for the sample size is that regression analysis
requires at least 20 cases per independent variable in the
analysis.
• First, multiple linear regression requires the
relationship between the independent and
dependent variables to be linear.
• The linearity assumption can best be tested with
scatterplots. The following two examples depict a
curvilinear relationship (left) and a linear relationship
(right).
[Figure: scatterplots of a curvilinear relationship (left) and a linear relationship (right).]
• Second, the multiple linear regression analysis
requires that the errors between observed and
predicted values (i.e., the residuals of the regression)
should be normally distributed.
• This assumption may be checked by looking at a
histogram or a Q-Q-Plot. Normality can also be
checked with a goodness of fit test (e.g., the
Kolmogorov-Smirnov test), though this test must be
conducted on the residuals themselves.
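• A minimal sketch of the Kolmogorov-Smirnov check on residuals; the data is synthetic and the use of scipy here is an illustration, not the slides' own code:

import numpy as np
from scipy import stats

# Synthetic residuals from a hypothetical fitted regression
rng = np.random.default_rng(0)
residuals = rng.normal(scale=0.5, size=200)

# K-S test against a normal distribution fitted to the residuals
# (a Q-Q plot could also be drawn with stats.probplot(residuals, plot=plt))
stat, p = stats.kstest(residuals, 'norm',
                       args=(residuals.mean(), residuals.std()))
print(p)   # a large p-value gives no evidence against normality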
• Third, multiple linear regression assumes that there
is no multicollinearity in the data.
• Multicollinearity occurs when the independent
variables are too highly correlated with each other.
• Multicollinearity may be checked multiple ways:
• 1) Correlation matrix – When computing a matrix of
Pearson’s bivariate correlations among all independent
variables, the magnitude of the correlation coefficients
should be less than .80.
• 2) Variance Inflation Factor (VIF) – The VIFs of the linear
regression indicate the degree that the variances in the
regression estimates are increased due to multicollinearity.
VIF values higher than 10 indicate that multicollinearity is a
problem.
• If multicollinearity is found in the data, one possible solution is to center the data: subtract the mean score from each observation for each independent variable. However, the simplest solution is to identify the variables causing multicollinearity issues (i.e., through correlations or VIF values) and remove those variables from the regression. (A sketch of both checks follows.)
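• A sketch of both multicollinearity checks using pandas and statsmodels; the DataFrame and its columns are made-up illustration data:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = pd.DataFrame({'x1': [1, 2, 3, 4, 5, 6],
                  'x2': [2, 1, 4, 3, 6, 5],
                  'x3': [1, 3, 2, 5, 4, 6]})

# 1) Correlation matrix: flag coefficients with magnitude >= .80
print(X.corr())

# 2) VIF: values above 10 indicate a multicollinearity problem
Xc = sm.add_constant(X)   # VIFs are computed with an intercept included
vifs = {col: variance_inflation_factor(Xc.values, i)
        for i, col in enumerate(Xc.columns) if col != 'const'}
print(vifs)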
• A scatterplot of residuals versus predicted values is a good way to check for homoscedasticity. There should be no clear pattern in the distribution; if there is a cone-shaped pattern (as in the sketch below), the data is heteroscedastic.
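• A sketch of such a residual plot with matplotlib; the data is synthetic and deliberately heteroscedastic so the cone shape appears:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
y_pred = np.linspace(1, 10, 200)              # hypothetical predictions
residuals = rng.normal(scale=0.3 * y_pred)    # variance grows: cone shape

plt.scatter(y_pred, residuals, s=10)
plt.axhline(0, color='red', linewidth=1)
plt.xlabel('Predicted values')
plt.ylabel('Standardized residuals')
plt.show()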
Multiple Linear Regression
DUMMY VARIABLES
• A categorical variable must be encoded as dummy variables before it can enter the regression.
Multiple Linear Regression
DUMMY VARIABLES
• Example encoding for a two-category variable (New York / California):

D1 (New York)   D2 (California)
1               0
0               1
0               1
0               1

• Only one dummy enters the model:

y = b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * D1
Multiple Linear Regression
DUMMY VARIABLE TRAP

NEW YORK    CALIFORNIA
D1 = 1      D2 = 0
D1 = 0      D2 = 1

• Since D2 = 1 - D1, including both dummies, as in

y = b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * D1 + b5 * D2

makes the model perfectly multicollinear. Always OMIT one dummy variable (see the sketch below).
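• A minimal sketch of dummy encoding that avoids the trap, using pandas; the State values mirror the example above:

import pandas as pd

df = pd.DataFrame({'State': ['New York', 'California',
                             'California', 'New York']})

# drop_first=True omits one dummy column; the omitted category is
# implied whenever all remaining dummies are 0
dummies = pd.get_dummies(df['State'], drop_first=True)
print(dummies)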
Building A Model
STEP BY STEP
Building A Model
• METHODS OF BUILDING A MODEL
• Forward Selection
Building A Model
• METHODS OF BUILDING A MODEL
• ALL-IN: throw in every variable, when
• you have prior knowledge,
• the values are known to matter, or
• you are preparing for backward elimination.
Building A Model
• BACKWARD ELIMINATION (the best model among these methods)
• Step 1: Select a significance level to stay in the model (SL = 0.05)
• Step 2: Fit the full model with all possible predictors
• Step 3: Consider the predictor with the highest P-value; if P > SL, go to Step 4, otherwise go to FIN: MODEL BUILT
• Step 4: Remove that predictor
• Step 5: Fit the model without this predictor, then return to Step 3
• A sketch of this loop follows.
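• A sketch of this loop with statsmodels OLS; the library choice and the synthetic data are assumptions, since the slides do not show an implementation:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=100)

SL = 0.05                                    # Step 1: significance level
X_opt = sm.add_constant(X)                   # include an intercept column
cols = list(range(X_opt.shape[1]))

while True:
    model = sm.OLS(y, X_opt[:, cols]).fit()  # Step 2/5: fit the model
    worst = int(np.argmax(model.pvalues))    # Step 3: highest P-value
    if model.pvalues[worst] <= SL:
        break                                # FIN: MODEL BUILT
    del cols[worst]                          # Step 4: remove the predictor

print(model.summary())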
Bi-Dimensional Example
>>> scores.mean()
-36.859219426420601
>>> scores.std()
45.704973900600457
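• These look like 10-fold cross-validation scores under scikit-learn's negated mean squared error (negated so that greater is always better). A sketch of how such scores could be produced, with synthetic data standing in for the example's bi-dimensional dataset:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(-10, 10, size=(100, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=5.0, size=100)

scores = cross_val_score(LinearRegression(), X, y,
                         scoring='neg_mean_squared_error', cv=10)
print(scores.mean(), scores.std())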
• Another very important metric used in regression is the coefficient of determination, or R2. It measures the proportion of the variance in the target that is explained by the model (see the sketch below).
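• A sketch of evaluating the same kind of model with R2 via cross-validation (synthetic data again):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(-10, 10, size=(100, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=5.0, size=100)

# R2 close to 1 means the model explains most of the variance
print(cross_val_score(LinearRegression(), X, y,
                      scoring='r2', cv=10).mean())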
• Example output: sparsity (percentage of zero coefficients) and accuracy score for L1 vs. L2 penalties at two regularization strengths C:

C      Penalty   Sparsity   Score
0.10   L1        28.12%     0.9026
0.10   L2        4.69%      0.9021
0.01   L1        84.38%     0.8625
0.01   L2        4.69%      0.8898
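• A sketch of where numbers like these come from: sparsity is the percentage of coefficients driven exactly to zero. The digits dataset and the liblinear solver are assumptions, since the slides do not name them:

import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)

for C in (0.10, 0.01):
    for penalty in ('l1', 'l2'):
        clf = LogisticRegression(C=C, penalty=penalty,
                                 solver='liblinear').fit(X, y)
        sparsity = np.mean(clf.coef_ == 0) * 100   # % of zero weights
        print(f'C={C} {penalty}: sparsity {sparsity:.2f}%, '
              f'score {clf.score(X, y):.4f}')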
Let the Code Continue (PPT 102)
>>> import multiprocessing
>>> from sklearn.datasets import load_iris
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.model_selection import GridSearchCV, cross_val_score

>>> iris = load_iris()

# param_grid must be defined before the search; this grid over the
# penalty and regularization strength C is a plausible assumption
# (the exact grid behind this output is not shown):
>>> param_grid = [{'penalty': ['l1', 'l2'], 'C': [0.5, 1.0, 1.5, 2.0]}]

>>> gs = GridSearchCV(estimator=LogisticRegression(),
...                   param_grid=param_grid, scoring='accuracy', cv=10,
...                   n_jobs=multiprocessing.cpu_count())
>>> gs.fit(iris.data, iris.target)
>>> gs.best_estimator_
LogisticRegression(C=1.5, class_weight=None, dual=False,
    fit_intercept=True, intercept_scaling=1, max_iter=100,
    multi_class='ovr', n_jobs=1, penalty='l1', random_state=None,
    solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
>>> cross_val_score(gs.best_estimator_, iris.data,
...                 iris.target, scoring='accuracy', cv=10).mean()
0.96666666666666679
Basic Grid Search and Cross-Validated Estimators
• By default, GridSearchCV uses a 3-fold cross-validation.
• However, if it detects that a classifier is passed, rather than a regressor, it uses a stratified 3-fold.
• The default will change to 5-fold cross-validation in scikit-learn version 0.22.
• E.g.:
>>> clf.score(X_digits[1000:], y_digits[1000:])
0.943...
• Nested cross-validation:
>>> cross_val_score(clf, X_digits, y_digits)
array([0.938..., 0.963..., 0.944...])
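• A self-contained sketch of this example; the original clf is not shown in the slides, so the SVC estimator and the parameter grid here are assumptions modeled on the scikit-learn tutorial:

import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, cross_val_score

X_digits, y_digits = datasets.load_digits(return_X_y=True)

# Grid search over the regularization strength C (assumed grid)
Cs = np.logspace(-6, -1, 10)
clf = GridSearchCV(estimator=svm.SVC(kernel='linear'),
                   param_grid=dict(C=Cs))

# Fit on the first 1000 samples, score on the held-out rest
clf.fit(X_digits[:1000], y_digits[:1000])
print(clf.score(X_digits[1000:], y_digits[1000:]))

# Nested cross-validation: the grid search itself is cross-validated
print(cross_val_score(clf, X_digits, y_digits))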
• Next, we find the best parameters of an SGDClassifier trained with perceptron loss. [Figure: scatter plot of the dataset.]
SGDClassifier
• This estimator implements regularized linear models with stochastic gradient descent (SGD) learning: the gradient of the loss is estimated one sample at a time.
• This implementation works with data represented as dense or sparse arrays of floating-point values for the features.
• The model it fits can be controlled with the loss parameter; by default, it fits a linear support vector machine (SVM).
• The regularizer is a penalty added to the loss function that shrinks model parameters towards the zero vector, using either the squared Euclidean norm (L2), the absolute norm (L1), or a combination of both (Elastic Net).
• If the parameter update crosses the 0.0
value because of the regularizer, the
update is truncated to 0.0 to allow for
learning sparse models and achieve
online feature selection.
>>> import multiprocessing
>>> from sklearn.linear_model import SGDClassifier
>>> from sklearn.model_selection import GridSearchCV

>>> param_grid = [{
...     'penalty': ['l1', 'l2', 'elasticnet'],
...     'alpha': [1e-5, 1e-4, 5e-4, 1e-3, 2.3e-3, 5e-3, 1e-2],
...     'l1_ratio': [0.01, 0.05, 0.1, 0.15, 0.25, 0.35, 0.5, 0.75, 0.8]
... }]

>>> sgd = SGDClassifier(loss='perceptron', learning_rate='optimal')

# X, Y are the dataset plotted in the figure above
>>> gs = GridSearchCV(estimator=sgd, param_grid=param_grid,
...                   scoring='accuracy', cv=10,
...                   n_jobs=multiprocessing.cpu_count())
>>> gs.fit(X, Y)
>>> gs.best_score_
0.89400000000000002
>>> gs.best_estimator_
SGDClassifier(alpha=0.001, average=False, class_weight=None,
    epsilon=0.1, eta0=0.0, fit_intercept=True, l1_ratio=0.1,
    learning_rate='optimal', loss='perceptron', n_iter=5, n_jobs=1,
    penalty='elasticnet', power_t=0.5, random_state=None,
    shuffle=True, verbose=0, warm_start=False)
Classification metrics
• A classification task can be evaluated in many different ways to achieve specific objectives. Of course, the most important metric is accuracy, often expressed as:

accuracy = (number of correct predictions) / (total number of samples)
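• A one-line sketch with scikit-learn's accuracy_score:

from sklearn.metrics import accuracy_score

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print(accuracy_score(y_true, y_pred))   # 4 correct out of 5 -> 0.8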