SUBMITTED BY
ANKITA MISHRA
Data Dictionary:
1. sales: Sales (in millions of dollars).
6. sp500: Membership of firms in the S&P 500 index. The S&P 500 is a stock market index that measures
the stock performance of 500 large companies listed on stock exchanges in the United States.
7. tobinq: Tobin's q (also known as q ratio and Kaldor's v) is the ratio between a physical asset's
market value and its replacement value.
Project 1
Problem 1.1:
1.1) Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values,
data types, shape, EDA). Perform Univariate and Bivariate Analysis.
NULL Values:
Info:
The first step is to know our data and get familiar with it. What answers are we trying
to get from the data? What variables are we using, and what do they mean? How does the
data look from a statistical perspective? Is it formatted correctly? Do we have missing
values? Duplicates? What about outliers? All of these questions can be answered step by
step as below:
Step 1: Import a) all the necessary libraries and b) the data.
Step 2: Describe the data after loading it: check the datatypes, the number of rows and
columns, and the number of missing values, and describe the min, max and mean of each
variable. Depending on the requirement, drop the missing values or replace them.
Step 3: Review the new dataset and draw inferences.
Price and depth have many outliers, as can be inferred from the plots. The dataset has
three datatypes: int, float, and object.
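The EDA steps above can be sketched as follows. This is a minimal illustration on a small synthetic frame standing in for the actual dataset (the column names come from the data dictionary; the values are made up):

```python
# Sketch of the EDA steps: load, inspect shape/dtypes, count nulls, describe.
# Synthetic stand-in data; in the report this comes from the project CSV.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "sales": [120.5, 89.2, np.nan, 240.1],   # in millions of dollars
    "tobinq": [1.2, np.nan, 0.8, 2.5],       # Tobin's q ratio
    "sp500": ["yes", "no", "no", "yes"],     # S&P 500 membership (object dtype)
})

print(df.shape)           # rows x columns
print(df.dtypes)          # mix of float and object dtypes
print(df.isnull().sum())  # null count per column
print(df.describe())      # min, max, mean, quartiles
```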
Correlation analyses using pair plot and heat map:
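The correlation analysis behind the pair plot and heat map can be sketched as below. The numeric columns here are synthetic (correlated by construction, not the report's actual figures); the seaborn plotting calls are shown as comments so the snippet runs headless:

```python
# Compute the correlation matrix with pandas; seaborn would render the
# pair plot and heat map from the same frame.
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
capital = rng.normal(100, 20, 200)
sales = 0.4 * capital + rng.normal(0, 5, 200)  # correlated by construction
df = pd.DataFrame({"capital": capital, "sales": sales})

corr = df.corr()
print(corr.round(2))

# import seaborn as sns
# sns.pairplot(df)               # pairwise scatter plots
# sns.heatmap(corr, annot=True)  # correlation heat map
```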
Problem 1.2:
1.2) Impute null values if present? Do you think scaling is necessary in this case?
SCALING:
The scaled data shows that the range of the data is between -2 and +3, and most of the
variables are ordinal, so there is no need for scaling.
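The z-score check behind this observation can be sketched as below (a synthetic column stands in for the real one; after standardising, the values fall in the -2 to +3 band noted above):

```python
# Z-score scaling: subtract the mean, divide by the standard deviation.
import numpy as np

x = np.array([10., 12., 15., 18., 22., 30.])
z = (x - x.mean()) / x.std()

print(z.round(2))
print("min:", z.min().round(2), "max:", z.max().round(2))
```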
Observation:
As can be seen from the info output in 1.1, the dataset has 759 rows, while the independent
variable tobinq shows 738 entries, meaning 21 null values, which are replaced with the
median value of tobinq.
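The median imputation described above can be sketched as follows (a made-up tobinq column for illustration):

```python
# Fill nulls in tobinq with the column median.
import pandas as pd
import numpy as np

df = pd.DataFrame({"tobinq": [1.2, np.nan, 0.8, 2.5, np.nan]})
df["tobinq"] = df["tobinq"].fillna(df["tobinq"].median())

print(df["tobinq"].isnull().sum())  # no nulls remain
print(df["tobinq"].tolist())
```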
1.3) Encode the data (having string values) for Modelling. Data Split: Split the data into
test and train (70:30). Apply linear regression. Performance Metrics: Check the
performance of Predictions on Train and Test sets using R-square and RMSE.
Use SALES as the target with the Train and Test predictors, and encode SP500 as a
categorical Y/N variable.
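The encoding and the 70:30 split can be sketched as below, on synthetic data with the document's column names (sales, capital, employment, sp500); the values are made up:

```python
# Encode the Y/N sp500 column as 1/0 and split 70:30 into train and test.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "capital": rng.normal(100, 20, 100),
    "employment": rng.normal(50, 10, 100),
    "sp500": rng.choice(["yes", "no"], 100),
})
df["sales"] = 0.42 * df["capital"] + rng.normal(0, 5, 100)

df["sp500"] = df["sp500"].map({"yes": 1, "no": 0})  # Y/N -> 1/0

X = df.drop(columns="sales")
y = df["sales"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
print(X_train.shape, X_test.shape)  # 70:30 split
```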
Stats Model - Apply Linear Regression
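The report fits the model with statsmodels; a minimal equivalent with scikit-learn, reporting the R-square and RMSE asked for in 1.3, is sketched below. The data is synthetic, constructed so the true coefficients echo the 0.42 and 80.33 quoted later (an illustrative assumption, not the report's actual fit):

```python
# Linear regression with R-square and RMSE on train and test sets.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))                        # e.g. capital, employment
y = 0.42 * X[:, 0] + 80.33 * X[:, 1] + rng.normal(0, 1, 200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=2)
model = LinearRegression().fit(X_tr, y_tr)

for name, Xs, ys in [("train", X_tr, y_tr), ("test", X_te, y_te)]:
    pred = model.predict(Xs)
    rmse = np.sqrt(mean_squared_error(ys, pred))
    print(name, "R2:", round(r2_score(ys, pred), 3), "RMSE:", round(rmse, 3))

print("coefficients:", model.coef_.round(2))
```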
Scatter plot Pre- and Post-Scaling with Z-Score
Pre-Scaling:
Post-Scaling:
Coefficients of the Scaled Data Set:
1.4) Inference: Based on these predictions, what are the business insights and
recommendations.
The investment criteria for any new investor are mainly based on the capital invested in
the company by the promoters, and investors favour firms where the capital investment is
good, as is also reflected in the scatter plot.
To generate capital, the company should have a combination of attributes such as value,
employment, sales and patents.
The highest contributing attribute is employment, followed by patents.
When employment increases by 1 unit, sales increase by 80.33 units, keeping all other
predictors constant.
When capital increases by 1 unit, sales increase by 0.42 units, keeping all other
predictors constant.
Project 2:
Problem 2
1. dvcat: factor with levels (estimated impact speeds) 1-9km/h, 10-24, 25-39, 40-54, 55+
2. weight: Observation weights, albeit of uncertain accuracy, designed to account for varying
sampling probabilities. (The inverse probability weighting estimator can be used to demonstrate
causality when the researcher cannot conduct a controlled experiment but has observational data
to model.)
11. abcat: Did one or more (driver or passenger) airbag(s) deploy? This factor has levels deploy,
13. deploy: a numeric vector: 0 if an airbag was unavailable or did not deploy; 1 if one or more
bags deployed.
15. caseid: character, created by pasting together the population sampling unit, the case
number, and the vehicle number. Within each year, this uniquely identifies the vehicle.
2.1) Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition check,
write an inference on it. Perform Univariate and Bivariate Analysis. Do exploratory data analysis.
List of Categorical Columns
Apply Median for the Categorical Data
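The step above applies the median to categorical data. On raw string labels the mode is the usual choice; the median becomes meaningful once the labels are encoded as numeric codes. Both variants are sketched below on a made-up column (the labels are illustrative, not from the dataset):

```python
# Two ways to impute a categorical column:
#   - mode on the raw labels
#   - median on the label-encoded codes
import pandas as pd
import numpy as np

s = pd.Series(["yes", "no", np.nan, "yes"])

mode_filled = s.fillna(s.mode()[0])              # mode of the raw labels

codes = s.astype("category").cat.codes           # NaN becomes -1
codes = codes.replace(-1, np.nan)                # restore the missing marker
median_filled = codes.fillna(codes.median())     # median of the encoded codes

print(mode_filled.tolist())
print(median_filled.tolist())
```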
2.2) Encode the data (having string values) for Modelling. Data Split: Split the
data into train and test (70:30). Apply Logistic Regression and LDA (linear
discriminant analysis).
Before Encoding:
After Encoding:
Data Split: Split the data into train and test (70:30)
Value Counts of the Test and Train Data of the Y Values of the Survive Column
Predicting on the Training and Test datasets with LDA (Linear Discriminant Analysis)
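Fitting both models on the same 70:30 split can be sketched as follows. A synthetic binary classification problem stands in for the actual accident data:

```python
# Fit Logistic Regression and LDA on the same train/test split.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=5, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=3)

logit = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)

print("logistic test accuracy:", round(logit.score(X_te, y_te), 3))
print("LDA test accuracy:     ", round(lda.score(X_te, y_te), 3))
```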
Confusion Matrix:
2.3) Performance Metrics: Check the performance of Predictions on Train and Test sets
using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model.
Compare both the models and write inferences, which model is best/optimized.
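The metrics listed in 2.3 can be computed as below for one model; the same pattern applies to the other. The data is synthetic, so the printed numbers are illustrative only:

```python
# Accuracy, confusion matrix and ROC_AUC on the test set.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=5, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=4)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

pred = model.predict(X_te)
proba = model.predict_proba(X_te)[:, 1]          # scores for the ROC curve

print("accuracy:", round(accuracy_score(y_te, pred), 3))
print("confusion matrix:\n", confusion_matrix(y_te, pred))
print("ROC_AUC:", round(roc_auc_score(y_te, proba), 3))
# RocCurveDisplay.from_estimator(model, X_te, y_te) would plot the ROC curve
```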
2.4) Inference: Based on these predictions, what are the insights and recommendations.
Inference:
The scores of both the train and test data are close to each other.
The Linear Discriminant Analysis model gives better recall and precision in comparison to
Logistic Regression.
Hence, the LDA model can be considered, further upgrading it using SMOTE, whereby its
predictive ability gets further enhanced.
Conclusion:
The model accuracy of logistic regression on both the training and testing data is
almost the same, i.e., 97%.
Similarly, the AUC of logistic regression for the training and testing data is also similar.
The other confusion-matrix parameters of logistic regression are likewise similar across
train and test, so we can presume that the model generalises well rather than overfitting.
We therefore applied Grid Search CV to hyper-tune our model, after which the F1
score on both the training and test data was 97%.
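The Grid Search CV step mentioned above can be sketched as follows; the parameter grid here is an illustrative assumption, not the one used in the report:

```python
# Hyper-tune logistic regression with GridSearchCV, scoring by F1.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5, random_state=5)

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},  # illustrative grid
    scoring="f1",
    cv=5,
).fit(X, y)

print(grid.best_params_, round(grid.best_score_, 3))
```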
In the case of LDA, the AUC for the testing and training data is also the same at 97%;
besides this, the other confusion-matrix parameters of the LDA model were also similar,
which shows that this model generalises well too.
Overall, we can conclude that the logistic regression model is best suited for this data
set given its level of accuracy, even though the Linear Discriminant Analysis model
performs comparably.
END OF PROJECT