IES’s Management College and Research Centre, Mumbai

(Final Examination)
Date: 11/10/2023 Day: Wednesday Time: 10.30am to 1.30pm Duration: 3 hrs.
Program: PGDM (Analytics) Term: IV Course: Predictive Modelling Max Marks: 60

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Instructions:
1) Use RStudio to answer the question paper.
2) Read the questions carefully and answer the questions.
3) Your R File needs to be submitted as your answer paper. Make sure you write your Analysis within your R script itself.
4) The file needs to be named as follows: PGDM_XX_PM.R
Replace XX with your roll number.

------------------------------------------------------------------------------------------------------------------------------------------------------
Q. No. (Questions) (Marks) COs
1 Solve any two out of three 20 CO1
A A firm has HR data collected for individuals who apply to their company. It
contains information like: Duration to Accept an offer, Notice Period details,
Percent hike requested, percent hike granted when job is offered, where the
candidate is sourced from, etc. All this is provided to you in: hr_data.csv
You are an HR analyst in this company and you’ve been asked to do the
following tasks:
i) Create a boxplot with the variable Status on the x-axis and
Duration to accept offer on the y-axis. Discuss what you see on the
chart and explain the results to your manager. 5
ii) Are the variables correlated? Specify which ones are. Draw charts to
prove your point, and report your interpretations of these charts. 5

B A firm has HR data collected for individuals who apply to their company. It
contains information like: Duration to Accept an offer, Notice Period details,
Percent hike requested, percent hike granted when job is offered, where the
candidate is sourced from, etc. All this is provided to you in: hr_data.csv
Your manager wants to predict Duration to accept offer
using Percent hike expected in CTC
and Percent hike offered in CTC.
i) Create a predictive model for the same. Is the model a good fit or
not? 5
ii) What are the assumptions that need to be met for a linear
model? 3
iii) Check whether the model you created satisfies all the assumptions that
are needed. 2
C A firm has HR data collected for individuals who apply to their company. It
contains information like: Duration to Accept an offer, Notice Period details,
Percent hike requested, percent hike granted when job is offered, where the
candidate is sourced from, etc. All this is provided to you in: hr_data.csv
Your manager has asked you to create the best model you can to predict
Duration to accept offer.
i) Which variables did you identify as significant predictors? 5
ii) Label the model that you created as the “Best Model” and find its MAE. 1
iii) Find the RMSE for this model. 1
iv) What business insight would you derive from the MAE and RMSE?
Report this to your manager. 3

2 Solve any two out of three 20 CO2


A The Baltimore Police Department has recorded the number of crimes
committed daily in the city from 2011 to 2016. This is given to you in:
CrimeData.csv
i) On which day did the city witness the most crime? 1
ii) Find the moving averages. 3
iii) Use MAE and RMSE to discuss whether moving averages are a good
technique to analyze the trend or not. 6
Report all your findings in a way that is useful to the Baltimore Police
Department.

B The Baltimore Police Department has recorded the number of crimes
committed daily in the city from 2011 to 2016. This is given to you in:
CrimeData.csv
i) Create a trend chart (time series) and see if you can view any
pattern in the data. 2
ii) Create an interactive plot to discuss the trend of crime. 2
iii) Make predictions using Exponential Smoothing. 2
iv) Make predictions using ARIMA. 2

Use your analysis to tell the Baltimore Police Department how they can use a
data analyst on their team. 2
C The NBA (National Basketball Association) is a professional basketball league
in North America, widely considered the premier men's basketball league in
the world. In the dataset “nba.csv” , we have a list of 125 players and a few of
their statistics, namely: No. of games they played, minutes they play, points
per game.
If the career length of a player >= 5 years, he has had a long career. If the
career length of a player < 5 years, he hasn’t had a long career.
TARGET_5Yrs is the variable with 1 for a long career and 0 for a short career.

You are a sports performance analyst.
You need to predict whether a player will have a long career or a short one.
i) Which model will you fit for the same? Use that model on the given
data. 2
ii) Use the model to make predictions. 3
iii) Use a confusion matrix to discuss whether your model is good or
not. 3
As an analyst, are there 2 more variables that you can suggest to be added to
the dataset to help improve the model? 2

3 Solve any two out of three 20 CO3

A A media company has data about Bollywood movies that is given in the dataset
“Bollywood.csv”.
They’re writing an article about movies and require certain information to
be given to them by their analyst. The questions they need answered are given
below.

i) Which movies have the highest budget? List the top 10. 2
ii) How many movies belong to each genre? 2
iii) How would you assess the popularity of movies? 2
iv) If I spend more on making a movie, will it generate more revenue? 2
v) Is there a relationship between when a movie is released and how
much is collected at the box office? 2

B The Iris dataset is a well-known and widely used dataset in the field of
machine learning and statistics. The dataset contains measurements of 150
iris flowers from three different species: setosa, versicolor, and virginica.
There are four features (also called attributes) in the Iris dataset, which are
all numeric:
Sepal length in centimeters
Sepal width in centimeters
Petal length in centimeters
Petal width in centimeters
The target variable is the species of the iris flower, which is a categorical
variable with three possible classes: setosa, versicolor, and virginica.
The iris dataset is inbuilt in R. Use this dataset and predict the Species of
flower based on all the variables in the dataset.
i) Which model are you fitting on the dataset, and why? 2
ii) Which variables are significant predictors of Species? 3
iii) Are variables correlated? Use VIF to discuss if they are. 1
iv) Write interpretations of the coefficients of the model. 4

C What are the differences between a Binary Logistic Regression Model and a
Multiple Linear Regression Model? 5
Give examples for the same (one for multiple regression and one for logistic
regression). 2
Discuss how you check model performance for multiple regression and for
binary logistic regression. 3

*************************
