You are on page 1of 17

DS Unit 2 Essay Answers

1. Explain the least square methods for estimation of coefficients of linear multiple regression.

Answer : 

Linear Multiple Regression


Linear Multiple Regression (LMR) model is a statistical model that establishes existence of a linear relationship
(association) between a dependent variable and several independent variables
Y = a + b1X1 + b2X2 + b3X3 + ... + bnXn
Where:
Y = the variable that we are trying to predict (dependent variable),
X = the variable that we are using to predict Y (independent variable),
a = the intercept,
b=(b1 , b2 ,b3 ,...,bn ) = the slope.
Eg :  Blood Sugar Level vs Weight and Age
Method
To estimate the parameters a and b=(b1 , b2 ,b3 ,...,bn ) using the principle of least squares, form the sum of squared
deviations of the observed Yi ’s from the regression line: 

Solution
2. Estimate R2 and adjusted R2 value for the following data set

Answer : 
3. Define Regression. Design a Multiple Linear regression model to find final score for given dataset of student
placement marks and perform R2 , Adjusted R2 to test the performance of the model

Answer :

Regression is a statistical technique that is used to model the relationship of a dependent variable with respect to one or
more independent variables
4. Define Regression. Design a Multiple Linear regression model to find final score for given dataset of student
placement marks and Find the Mean Square Error (MSE) of the model. -> Mid Question

Answer :

Regression is a statistical technique that is used to model the relationship of a dependent variable with respect to one or
more independent variables
5. Explain Logistic regression with example and How to implement in Python?

Answer :

Logistic Regression
Logistic regression is a Machine Learning algorithms, which comes under the Supervised Learning technique. It is used
for predicting the categorical dependent variable using a given set of independent variable
Logistic regression is used for solving the classification problems
In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic function, which predicts two
maximum values (0 or 1)
Code:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix
LR=LogisticRegression()
LR.fit(x_train,y_train)
pred=LR.predict(x_test)
print("Logistic Regression accuracy::","\n",accuracy_score(y_test,pred))
print("\n")
print(confusion_matrix(y_test,pred))
print("\n")
print(classification_report(y_test,pred))

6. Define Classification. Discuss the procedure of the KNN-Classifier to classify a Person X ( sugar level 190, Age 45) from
given case study of diabetic patients. -> Mid Question

Answer :

Classification
The method of arranging data into homogenous classes according to common features present in the data is known as
classification
Types of Classification

KNN Classifier
K nearest neighbors (KNN) is a simple algorithm that stores all available cases and classifies new cases based on a
similarity measure (distance function)
KNN Algorithm
Problem

7. Explain random forest with example and How to implement in Python?

Answer :

Random forests
is a supervised learning algorithm. It can be used both for classification and regression.
Random forests creates decision trees on randomly selected data samples, gets prediction from each tree and selects
the best solution by means of voting
RF are ensemble methods used to boost the performance of DT 
Random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction
Example - did not get
Code
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt
def train(model, X, y):
    # train the model
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.7,random_state=0)
    model.fit(x_train, y_train)
    
    # predict the training set
    pred = model.predict(x_test)
    
    print("Model Report")
    print("RMSE:",sqrt(mean_squared_error(y_test, pred)))

from sklearn.ensemble import RandomForestRegressor


model3 = RandomForestRegressor()
train(model3, X, y)

8. Explain ridge regression with example and How to implement in Python?

Answer :

Ridge regression
is a model tuning method that is used to analyse any data that suffers from multicollinearity. This method performs L2
regularization
When the issue of multicollinearity occurs, least-squares are unbiased, and variances are large, this results in predicted
values to be far away from the actual values. By adding a degree of bias to the regression estimates, ridge regression
reduces the standard errors.
Example - did not get
Code
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(indp_data,target,test_size=0.25,random_state=3405)
from sklearn.linear_model import Ridge
import warnings
warnings.filterwarnings("ignore")
# Model initialization
regression_model = Ridge(normalize=True,random_state =100)
#Fit the data(train the model)
model= regression_model.fit(X_train, Y_train)
# Predict
y_predicted = model.predict(X_train)
# model evaluation
rmse = mean_squared_error(Y_train, y_predicted)
r2 =model.score(X_train, Y_train)
# printing values
print('Slope:' ,model.coef_)
print('Intercept:', model.intercept_)
print('Root mean squared error: ', rmse)
print('R2 score: ', r2)

9. What is Multi collinearity issue? How can you address this issue?

Answer :

Multicollinearity, or collinearity, is the existence of nearlinear relationships among the independent variables
occurs when independent variables in a regression model are correlated
Issue
create inaccurate estimates of the regression coefficients, inflate the standard errors of the regression coefficients,
deflate the partial t-tests for the regression coefficients, give false, nonsignificant, pvalues, and degrade the
predictability of the model
How can it be measured ?
1. Tolerance
is percentage of variance in the independent variable that is not accounted for by other independent variables

2. Variance Inflation Factor


indicates degree to which standard errors are inflated due to the levels of collinearity
VIIF= reciprocal of tolerance

How to address it
1. Remove highly correlated predictors from the model.  If you have two or more factors with a high VIF, remove one
from the model. Because they supply redundant information, removing one of the correlated factors usually doesn't
drastically reduce the R-squared.  Consider using stepwise regression, best subsets regression, or specialized
knowledge of the data set to remove these variables. Select the model that has the highest R-squared value. 
2. Use Partial Least Squares Regression (PLS) or Principal Components Analysis, regression methods that cut the number
of predictors to a smaller set of uncorrelated components.
3. Obtaining more data on an expanded range would cure multicollinearity problem caused due to Data collection (ie
data collected from a narrow subspace of the independent variables(
4. situation of Over-defined model(there are more variables than observations) should be avoided
5. outlier-induced multicollinearity can be corrected by removing the outliers before ridge regression is applied
6. If Strutural multicollinearity
centering variables is efficient solution
7. Data multicollinearity
remove some highly corelated independent variables
Linearly combine correlated Independent variables , add them together
use LASSO or ridge regression
8. First check to see if one of predictor variable is a duplicate
9. Remove a redundant variable
10. Aggregate similar variables
11. Increase sample size

10. Explain SVM with example and How to implement in Python?

Answer :

Support Vector Machines (SVM)


It is Supervised Learning algorithms, which is used for Classification as well as Regression problems.
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space
into classes so that we can easily put the new data point in the correct category in the future. This best decision
boundary is called a hyperplane. 
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called as
support vectors, and hence algorithm is termed as Support Vector Machine
Example - did not get
Code
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix
svm=SVC()
svm.fit(x_train,y_train)
pred=svm.predict(x_test)
print("svm accuracy::","\n",accuracy_score(y_test,pred))
print("\n")
print(confusion_matrix(y_test,pred))
print("\n")
print(classification_report(y_test,pred))

11. Explain ANOVA with example and How to implement in Python?

Answer :

ANOVA ( Analysis of variance )


is a procedure for testing the difference among different of data for homogeneity
ANOVA help us to figure out if there is need to reject the null hypothesis or accept the alternate hypothesis
ANOVA is two types
1. One way ANOVA 
only one factor is investigate
one independent variable (with 2 levels) 
2. Two Way ANOVA 
investigate two factors at the same time.
two independent variables (can have multiple levels).
1. Two way ANOVA without replication
2. Two way ANOVA with replication
Example
refer below question
Code
did not get

12. What is ANOVA. Find the significance of the noise on solving questions from dataset given below

Answer :

ANOVA ( Analysis of variance )


is a procedure for testing the difference among different of data for homogeneity
ANOVA help us to figure out if there is need to reject the null hypothesis or accept the alternate hypothesis
ANOVA is two types
1. One way ANOVA 
only one factor is investigate
one independent variable (with 2 levels) 
2. Two Way ANOVA 
investigate two factors at the same time.
two independent variables (can have multiple levels).
1. Two way ANOVA without replication
2. Two way ANOVA with replication
Problem

You might also like