1. Explain the least squares method for estimating the coefficients of a multiple linear regression model.
Answer :
Solution
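A minimal sketch of the least squares idea: the coefficients are chosen to minimise the sum of squared residuals, which leads to the normal equations beta = (X'X)^-1 X'y. The small arrays below are placeholder values for illustration, not data from the question.
import numpy as np

# Placeholder design data: 4 observations of 2 predictors and a target vector
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([3.0, 4.0, 8.0, 9.0])

# Prepend a column of ones so the first coefficient is the intercept
X_design = np.column_stack([np.ones(len(X)), X])

# Least squares estimate via the normal equations: beta = (X'X)^-1 X'y
beta_hat = np.linalg.inv(X_design.T @ X_design) @ X_design.T @ y
print("Estimated coefficients (intercept first):", beta_hat)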
2. Estimate the R2 and adjusted R2 values for the following dataset
Answer :
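The dataset for this question is not reproduced here, so as a generic sketch: R2 = 1 - SS_res / SS_tot, and adjusted R2 = 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The arrays below are placeholders.
import numpy as np

# Placeholder actual and predicted values; p = number of predictors in the model
y_true = np.array([10.0, 12.0, 15.0, 18.0, 20.0])
y_pred = np.array([11.0, 12.5, 14.0, 17.5, 21.0])
n, p = len(y_true), 2

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares

r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print("R2:", r2, "Adjusted R2:", adj_r2)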
3. Define Regression. Design a multiple linear regression model to find the final score for the given dataset of student placement marks, and compute R2 and adjusted R2 to test the performance of the model
Answer :
Regression is a statistical technique used to model the relationship between a dependent variable and one or more independent variables.
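A sketch of such a model with scikit-learn, assuming hypothetical placement-mark columns (aptitude, technical) and a final_score target in place of the actual dataset, which is not reproduced here:
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for the student placement-marks dataset
df = pd.DataFrame({
    "aptitude":    [65, 70, 80, 55, 90, 75],
    "technical":   [60, 72, 85, 50, 88, 70],
    "final_score": [63, 71, 84, 52, 90, 73],
})
X = df[["aptitude", "technical"]]
y = df["final_score"]

# Fit the multiple linear regression model
model = LinearRegression().fit(X, y)

# R2 and adjusted R2 as performance measures
n, p = X.shape
r2 = model.score(X, y)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print("Coefficients:", model.coef_, "Intercept:", model.intercept_)
print("R2:", r2, "Adjusted R2:", adj_r2)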
4. Define Regression. Design a multiple linear regression model to find the final score for the given dataset of student placement marks, and find the mean squared error (MSE) of the model. -> Mid Question
Answer :
Regression is a statistical technique used to model the relationship between a dependent variable and one or more independent variables.
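As with the previous question, the dataset is not reproduced here; a minimal sketch with placeholder marks showing how the MSE of a fitted model is computed:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Placeholder placement marks (two predictors) and final scores
X = np.array([[65, 60], [70, 72], [80, 85], [55, 50], [90, 88], [75, 70]])
y = np.array([63, 71, 84, 52, 90, 73])

model = LinearRegression().fit(X, y)

# MSE = mean of the squared residuals between actual and predicted scores
mse = mean_squared_error(y, model.predict(X))
print("MSE:", mse)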
5. Explain Logistic regression with an example. How is it implemented in Python?
Answer :
Logistic Regression
Logistic regression is a Machine Learning algorithm that comes under the Supervised Learning technique. It is used for predicting a categorical dependent variable from a given set of independent variables.
Logistic regression is used for solving classification problems.
In logistic regression, instead of fitting a straight regression line, we fit an "S"-shaped logistic (sigmoid) function, whose output is squeezed between the two extreme values 0 and 1.
Code:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix

# x_train, x_test, y_train, y_test are assumed to come from an earlier
# train_test_split of the dataset
LR = LogisticRegression()
LR.fit(x_train, y_train)            # train the classifier
pred = LR.predict(x_test)           # predict class labels for the test set

print("Logistic Regression accuracy:", "\n", accuracy_score(y_test, pred))
print("\n")
print(confusion_matrix(y_test, pred))
print("\n")
print(classification_report(y_test, pred))
6. Define Classification. Discuss the procedure of the KNN classifier to classify a Person X (sugar level 190, age 45) from the given case study of diabetic patients. -> Mid Question
Answer :
Classification
The method of arranging data into homogeneous classes according to common features present in the data is known as classification.
Types of Classification
KNN Classifier
K nearest neighbors (KNN) is a simple algorithm that stores all available cases and classifies new cases based on a
similarity measure (distance function)
KNN Algorithm
Problem
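The case-study table is not reproduced here. The usual KNN procedure is: choose K, compute the distance (e.g. Euclidean) from Person X to every patient in the table, take the K nearest patients, and assign Person X the majority class among them. A sketch with hypothetical patient records (sugar level, age) and labels 1 = diabetic, 0 = non-diabetic standing in for the given data:
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical (sugar level, age) records and diabetic labels standing in for
# the case-study table
X = [[180, 50], [200, 60], [120, 30], [150, 40], [210, 55], [130, 35]]
y = [1, 1, 0, 0, 1, 0]

knn = KNeighborsClassifier(n_neighbors=3)   # K = 3, Euclidean distance by default
knn.fit(X, y)

# Classify Person X (sugar level 190, age 45) by majority vote of the 3 nearest neighbours
print("Predicted class for Person X:", knn.predict([[190, 45]]))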
Answer :
Random forests
Random forest is a supervised learning algorithm that can be used for both classification and regression.
Random forest creates decision trees on randomly selected data samples, gets a prediction from each tree, and selects the best solution by means of voting.
Random forests are ensemble methods used to boost the performance of decision trees.
Random forest builds multiple decision trees and merges them together to obtain a more accurate and stable prediction.
Example - did not get
Code
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt
def train(model, X, y):
    # split the data (30% train, 70% test) and train the model
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.7, random_state=0)
    model.fit(x_train, y_train)
    # predict on the held-out test set and report the error
    pred = model.predict(x_test)
    print("Model Report")
    print("RMSE:", sqrt(mean_squared_error(y_test, pred)))
Answer :
Ridge regression
Ridge regression is a model-tuning method used to analyse data that suffers from multicollinearity. It performs L2 regularization.
When multicollinearity occurs, the least-squares estimates are unbiased but their variances are large, so the predicted values can be far from the actual values. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors.
Example - did not get
Code
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from math import sqrt
import warnings
warnings.filterwarnings("ignore")

# indp_data (independent variables) and target are assumed to be loaded already
X_train, X_test, Y_train, Y_test = train_test_split(indp_data, target, test_size=0.25, random_state=3405)

# Model initialization (recent scikit-learn versions removed normalize=True;
# scale the features beforehand, e.g. with StandardScaler, if needed)
regression_model = Ridge(random_state=100)

# Fit the data (train the model)
model = regression_model.fit(X_train, Y_train)

# Predict on the training data
y_predicted = model.predict(X_train)

# Model evaluation
rmse = sqrt(mean_squared_error(Y_train, y_predicted))
r2 = model.score(X_train, Y_train)

# Printing values
print('Slope:', model.coef_)
print('Intercept:', model.intercept_)
print('Root mean squared error:', rmse)
print('R2 score:', r2)
9. What is the multicollinearity issue? How can you address it?
Answer :
Multicollinearity, or collinearity, is the existence of near-linear relationships among the independent variables; it occurs when the independent variables in a regression model are correlated.
Issue
Multicollinearity creates inaccurate estimates of the regression coefficients, inflates the standard errors of the regression coefficients, deflates the partial t-tests for the regression coefficients, gives false non-significant p-values, and degrades the predictability of the model.
How can it be measured?
1. Tolerance: the percentage of variance in an independent variable that is not accounted for by the other independent variables.
2. Variance Inflation Factor (VIF): the reciprocal of tolerance (1 / tolerance); large values indicate strong multicollinearity (see the sketch below).
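A hedged sketch of measuring multicollinearity with VIF using statsmodels; the DataFrame below uses made-up predictor columns (x2 is nearly 2 * x1, so its VIF comes out large):
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Hypothetical predictors; replace with the model's actual independent variables
X = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.1, 4.0, 6.2, 7.9, 10.1, 12.0],   # nearly 2 * x1, i.e. collinear with x1
    "x3": [1.0, 0.5, 3.0, 2.0, 5.0, 4.0],
})
X_const = add_constant(X)

# VIF for each predictor; values above roughly 5-10 signal multicollinearity
for i, col in enumerate(X_const.columns):
    if col != "const":
        print(col, "VIF:", variance_inflation_factor(X_const.values, i))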
How to address it
1. Remove highly correlated predictors from the model. If you have two or more factors with a high VIF, remove one from the model. Because they supply redundant information, removing one of the correlated factors usually doesn't drastically reduce the R-squared. Consider using stepwise regression, best-subsets regression, or specialized knowledge of the dataset to remove these variables. Select the model that has the highest R-squared value.
2. Use Partial Least Squares (PLS) regression or Principal Components Analysis, regression methods that reduce the number of predictors to a smaller set of uncorrelated components.
3. Obtaining more data over an expanded range can cure multicollinearity caused by data collection (i.e. data collected from a narrow subspace of the independent variables).
4. Avoid an over-defined model (one with more variables than observations).
5. Outlier-induced multicollinearity can be corrected by removing the outliers before ridge regression is applied.
6. For structural multicollinearity, centering the variables is an efficient solution.
7. For data multicollinearity:
remove some highly correlated independent variables;
linearly combine the correlated independent variables (e.g. add them together);
use LASSO or ridge regression.
8. First check whether one of the predictor variables is a duplicate of another.
9. Remove redundant variables.
10. Aggregate similar variables.
11. Increase the sample size.
Answer :
Answer :
12. What is ANOVA? Find the significance of the noise on solving questions from the dataset given below
Answer :
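ANOVA (Analysis of Variance) is a statistical test that checks whether the means of two or more groups differ significantly, by comparing the variance between groups with the variance within groups. The dataset referred to in the question is not reproduced here; a generic one-way ANOVA sketch with placeholder scores for three noise levels:
from scipy.stats import f_oneway

# Placeholder question-solving scores under three noise levels
low_noise    = [85, 88, 90, 86, 87]
medium_noise = [80, 82, 84, 79, 81]
high_noise   = [70, 72, 75, 68, 71]

# One-way ANOVA: a small p-value (e.g. < 0.05) means noise has a significant effect
f_stat, p_value = f_oneway(low_noise, medium_noise, high_noise)
print("F statistic:", f_stat, "p-value:", p_value)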