DDMB Project Report

Data-driven decision making in business (DDMB) project report
Submitted by: Group-5

Group Name: Data Decoders
Members of Group-5
1. Dwaipayan Chatterjee
2. Lokesh Devpura
3. Narendra Reddy
4. Nivedita Sharma
5. Vishal Methi
Table of Contents
1 Introduction
2 Abbreviations
3 Identification of dataset
4 Description of the dataset
5 Set “Objective”
6 The Hypothesis being tested
7 Multiple regression modelling to identify prediction expression
8 Scatterplot to analyse correlation among variables
9 Highlights/ conclusions
DDMB project report submission against dataset “Salary of university professor” by Group-5
1 Introduction
This report summarises our group's analysis of the selected dataset. The professor initially provided a set of
candidate datasets; we selected one of them and analysed it using multiple regression modelling.
2 Abbreviations
Abbreviation    Description
DDMB            Data-driven decision making in business
VIF             Variance inflation factor
p-value         Probability of obtaining results at least as extreme as those observed, assuming that the null hypothesis is true
3 Identification of dataset
We chose the dataset “Salary of the University Professors” based on the following factors:
a) Availability of both categorical and numerical variables, which allows a deeper analysis

b) A clearly identifiable response variable, “Salary”, which depends on multiple variables
4 Description of the dataset

The dataset is “Salaries of the University Professors”, which records each salary together with the corresponding
rank, discipline, years since Ph.D., years of service, and gender. It covers 397 professors, associate
professors and assistant professors (collectively referred to as Professors in this report).
a) Response: We treat salary as the response.

b) Predictors: We treat rank, discipline, years since Ph.D., years of service, and gender as predictors of
salary.
5 Set “Objective”
To identify the predictors that influence a professor's salary (the response variable) at a university, using
correlation and multiple regression modelling, and to examine the relationships among the variables.
6 The Hypothesis being tested

Our null hypothesis (H0) is that there is no relationship between the response (salary) and the predictors (rank,
discipline, years since Ph.D., years of service and gender).
Analysis: Because the dataset contains categorical variables as well as numerical ones, we ran “Fit model” and
used the “Indicator function parameterization” output rather than “parameter estimates”. We then compared each
p-value with the level of significance (0.05): we failed to reject the null hypothesis for a predictor when its
p-value was greater than 0.05 and rejected it when the p-value was lower than 0.05.
Outcomes:
• Based on its p-value, we found no statistically significant relationship between gender and salary, so this
variable is excluded from further analysis.
• All other variables are retained as predictors based on their p-values.
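The decision rule used above can be sketched in a few lines of Python. This is only an illustration of the comparison against the 0.05 level of significance: the function name `decide` and the term names and p-values in the example are hypothetical placeholders, not the values obtained in our analysis.

```python
ALPHA = 0.05  # level of significance used in this report


def decide(p_value, alpha=ALPHA):
    """Return the test decision for one predictor, given its p-value."""
    if p_value < alpha:
        return "reject H0"          # evidence of a relationship with salary
    return "fail to reject H0"      # no significant relationship detected


# Hypothetical p-values, for illustration only (NOT results from the report).
hypothetical_p_values = {"term_1": 0.001, "term_2": 0.20}

for term, p in hypothetical_p_values.items():
    print(f"{term}: p = {p} -> {decide(p)}")
```

A predictor whose decision is “fail to reject H0” is a candidate for exclusion, which is how gender was handled in the outcomes above.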
7 Multiple regression modelling to identify prediction expression

In our project, we used a multiple regression model. It is a statistical technique for analysing the relationship
between a response and several predictors, and we used it to predict the response (salary).
A generalized prediction expression is written as:

Response = Intercept + (slope1 X predictor1) + (slope2 X predictor2) + (slope3 X predictor3) + ….
Considering both the categorical and the numerical variables, we arrive at the following prediction expression:
Salary = 129661.85 - (32456.15 X Assoc. Prof) - (45287.69 X Asst. Prof) - (14505.15 X Discipline A) +
(534.63 X Years since Ph.D.) - (476.72 X Years in service)
Here Assoc. Prof, Asst. Prof and Discipline A are indicator variables that take the value 1 when the category
applies and 0 otherwise, so the intercept corresponds to a Prof in the other discipline.
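As a minimal sketch, the prediction expression above can be written as a small Python function. The coefficients are copied from the expression; the function name `predict_salary` and its argument names are our own illustrative choices, and the categorical terms are assumed to be 0/1 indicators with Prof and the non-A discipline as the baseline.

```python
# Coefficients copied from the prediction expression in this section.
INTERCEPT = 129661.85
RANK_EFFECT = {"Prof": 0.0, "Assoc. Prof": -32456.15, "Asst. Prof": -45287.69}
DISCIPLINE_A_EFFECT = -14505.15
PER_YEAR_SINCE_PHD = 534.63
PER_YEAR_IN_SERVICE = -476.72


def predict_salary(rank, discipline_a, years_since_phd, years_in_service):
    """Evaluate the prediction expression for one professor.

    rank             -- "Prof", "Assoc. Prof" or "Asst. Prof"
    discipline_a     -- True if the professor belongs to Discipline A
    years_since_phd  -- years since Ph.D.
    years_in_service -- years in service
    """
    return (
        INTERCEPT
        + RANK_EFFECT[rank]
        + (DISCIPLINE_A_EFFECT if discipline_a else 0.0)
        + PER_YEAR_SINCE_PHD * years_since_phd
        + PER_YEAR_IN_SERVICE * years_in_service
    )


# Made-up example: an Asst. Prof in Discipline A, 5 years since Ph.D.,
# 3 years in service.
print(round(predict_salary("Asst. Prof", True, 5, 3), 2))
```

For the made-up example the expression evaluates to 129661.85 - 45287.69 - 14505.15 + (534.63 X 5) - (476.72 X 3), i.e. about 71112.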
8 Scatterplot to analyse correlation among variables

We plotted both numeric variables, Years since Ph.D. and Years in service, against the response (salary) on a
scatterplot matrix (see the figure below) and found that the two variables are strongly correlated with each other.
To verify this, we derived the VIF (variance inflation factor) values. The VIF values for both variables are
higher than 5 but less than 10, which indicates a moderate degree of multicollinearity, so we needed to decide
whether to drop these variables.
We therefore derived the prediction expression in both cases, i.e. with both variables included and with both
dropped, and predicted the salary values using each expression. Comparing the average, standard deviation and
minimum/maximum of the predicted salaries, we found that keeping both variables is a better approach than
dropping both.
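The idea behind the VIF can be illustrated with a short Python sketch. It uses the simplified case of a model with only two predictors, where the VIF of each equals 1 / (1 - r²) and r is the correlation between the two. The VIF values in our analysis came from the full model, so this sketch is not expected to reproduce them. The function names, the sample numbers, and the labels “low”, “moderate” and “high” are illustrative assumptions built around the 5 and 10 thresholds used above.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def vif_two_predictors(x, y):
    """VIF shared by both predictors when the model has exactly two."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r ** 2)


def interpret_vif(vif):
    """Rule-of-thumb reading of a VIF value (thresholds 5 and 10)."""
    if vif < 5:
        return "low"
    if vif < 10:
        return "moderate"
    return "high"


# Made-up values for illustration only (NOT taken from the dataset).
years_since_phd = [2, 5, 8, 12, 15, 20, 25, 30]
years_in_service = [1, 3, 7, 9, 14, 16, 24, 27]

vif = vif_two_predictors(years_since_phd, years_in_service)
print(f"VIF = {vif:.2f} ({interpret_vif(vif)} multicollinearity)")
```

A VIF of 1 means the predictor is uncorrelated with the others; the closer the correlation is to 1 in absolute value, the larger the VIF becomes.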
Option-1: Predicted values when both variables are considered
Prediction expression: Salary = 129661.85 - (32456.15 X Assoc. Prof) - (45287.69 X Asst. Prof) -
(14505.15 X Discipline A) + (534.63 X Years since Ph.D.) - (476.72 X Years in service)

Option-2: Predicted values when both variables are rejected
Prediction expression: Salary = 133549.12 - (34082.3 X Assoc. Prof) - (47843.84 X Asst. Prof) -
(13760.96 X Discipline A)

Statistic       Actual dataset    Option-1      Option-2
Avg. salary     113706.5          113706.5      113706.5
Std. Dev.       30289.04          20423.81      20204.87
Min. salary     57800             66763.55      71944.33
Max. salary     231545            142676.8      133549.1
R square        -                 45.25%        44.49%
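From the table above, we conclude that option-1 (considering both variables) is a better approach than option-2
(rejecting both variables): it has the slightly higher R square (45.25% against 44.49%), and the standard
deviation and range of its predicted salaries are closer to those of the actual dataset.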
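The Option-2 figures in the table can be cross-checked with a short Python sketch. With only categorical predictors, the Option-2 expression can produce at most six distinct predictions (three ranks by two disciplines), so its minimum and maximum follow directly from the coefficients. The sketch assumes that every rank and discipline combination occurs in the dataset; the variable names are our own.

```python
from itertools import product

# Coefficients copied from the Option-2 prediction expression.
INTERCEPT = 133549.12
RANK_EFFECT = {"Prof": 0.0, "Assoc. Prof": -34082.3, "Asst. Prof": -47843.84}
DISCIPLINE_EFFECT = {"A": -13760.96, "other": 0.0}

# One prediction per (rank, discipline) combination.
predictions = {
    (rank, disc): INTERCEPT + rank_effect + disc_effect
    for (rank, rank_effect), (disc, disc_effect) in product(
        RANK_EFFECT.items(), DISCIPLINE_EFFECT.items()
    )
}

lowest = min(predictions.values())
highest = max(predictions.values())

for combo, value in sorted(predictions.items(), key=lambda item: item[1]):
    print(combo, round(value, 2))
print("min:", round(lowest, 2), "max:", round(highest, 2))
```

The maximum is the intercept itself (a Prof outside Discipline A) and the minimum is the prediction for an Asst. Prof in Discipline A. Both agree with the Max. salary and Min. salary reported for Option-2 to within the rounding of the printed coefficients.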
9 Highlights/ conclusions
1. Four variables affect the response (salary): two numerical (Years since Ph.D. and Years in service) and two
categorical (rank and discipline).
2. The two numerical variables, Years since Ph.D. and Years in service, are strongly correlated with each other.
3. The VIF values for both of these variables are above 5 but below 10. This indicates multicollinearity, but
we still kept both variables in the prediction expression because the resulting predictions are closer to the
original dataset and the R square value is higher.
4. R-square value: the linear model gives an R-square value of 0.4525 (45.25%).
5. Salary depends mainly on the following variables:
• Rank: Salary increases with rank. It is highest for Prof and lowest for Asst. Prof.
• Years since Ph.D.: Salary increases slightly as the number of years since Ph.D. increases.
• Years in service: Salary decreases slightly as the years in service increase. This may be related to the
multicollinearity noted in point 3 and needs further detailed analysis and exploration.