Employee Satisfaction Report

DATA 603 Project:
Regression Modelling of Employee Satisfaction
Ali Campbell (30024201), Crystal Wai (30233867), Niza Ngwira (30064557)
Group 20 (L03)
DATA 603: Statistical Modeling with Data
Instructor: Qingrun Zhang
December 8th, 2023

TABLE OF CONTENTS
1. INTRODUCTION
1.1. MOTIVATION
1.1.1. Context
1.1.2. Problem
1.2. OBJECTIVES
1.2.1. Overview
1.2.2. Goals & Research Questions
2. METHODOLOGY
2.1 Data
2.2 Approach
2.3 Workflow
2.4 Workload Distribution
3. MAIN RESULTS OF THE ANALYSIS
3.1 Multiple Regression Assumptions
3.2 Results
4. CONCLUSION AND DISCUSSION
4.1 Approach
4.2 Future Work
5. REFERENCES
6. APPENDIX
1. INTRODUCTION
1.1. MOTIVATION
1.1.1. Context
In this project, we investigate an Employee Satisfaction Survey dataset from Kaggle. The
dataset offers a set of variables related to employee experiences. The question we investigate
is: how do these work attributes impact employee satisfaction? “An extensive study into
happiness and productivity has found that workers are 13% more productive when happy”
(Bellet et al., 2023), and “Satisfied employees are more committed, engaged, productive,
resulting in lower turnover rates and higher overall performance levels” (Long, 2023). How
happy people are at work is therefore reflected in their work. The project aims to identify
the variables that are significant in predicting employee satisfaction.
1.1.2. Problem
“The average person will spend 90,000 hours at work”, which amounts to one third of
people's lives (Naber, n.d.). The fulfillment, or lack thereof, derived from the work
environment extends beyond the work day. Satisfaction at work can negatively or positively
affect a person's mental health and well-being. We investigate the significance of workplace
characteristics for employee satisfaction to determine the factors that make a satisfied and
happy employee. These factors include an employee's score on their last evaluation, the
number of projects they take part in, their average monthly hours, the length of time they
have been at the company, whether they have been in any work accidents, whether they have had
a promotion in the last 5 years, their department, and their salary. Additionally, from the
employer's point of view, “Companies with high worker satisfaction outperform low satisfaction
companies by 202%” (Apollo Technical, 2022). If a company understands the attributes that
matter to its employees' overall satisfaction, it can improve employee performance.
1.2. OBJECTIVES
1.2.1. Overview
The overall intent of the project is to better understand how different characteristics of
work affect an employee's satisfaction level. Understanding the elements that make a
satisfied employee can provide actionable insights for cultivating working conditions that
lead to satisfied employees. If the assumptions of the linear regression model are met,
companies could use the model to predict the satisfaction of their employees and make
changes where necessary.
1.2.2 Goals & Research Questions

The goals of the project are to fit a model with significant predictor variables, the
largest possible adjusted R-squared and the lowest possible root mean square error (RMSE),
and to use visualizations and tests to determine whether the assumptions of linear regression
modeling are met.
To check linearity, a residuals versus fitted values plot will be used. To check equal
variance, the residuals versus fitted values plot will be used together with the
Breusch-Pagan test. Independence will be assessed from how the data were collected. Outliers
will be checked with Cook's distance and removed if found. A Q-Q plot and a histogram of the
residuals will be used together with the Shapiro-Wilk test to check the normality assumption.
The variance inflation factor (VIF) will be used to check for multicollinearity between
variables. If the model meets these assumptions, we expect that it could be used to predict
the satisfaction level of employees with different profiles.
The expectation is that the model will be defined based on techniques learned in class and
will meet the assumptions of linearity, equal variance, normality and independence. If one of
the assumptions is not met, transformations such as Box-Cox or log will be applied. A check
for outliers will be done, and any outliers found will be removed. Lastly, a test for
multicollinearity between variables will be done. If collinearity is found among a group of
variables, all but one of the collinear variables will be removed.
This topic is important because we want to analyze and understand what creates the
highest satisfaction level within a work environment. When looking for jobs in the future, we
could use the insights gained to decide which workplace attributes to prioritize.
Additionally, this evidence can help companies cultivate an environment that fosters
productivity. We explore the subject by building a multiple regression model with the data
provided.
2 METHODOLOGY
2.1 Data
The dataset has 10 columns: employee ID, 8 characteristics of the employee's work
environment, and satisfaction level, which is our response variable. The data were collected
through employees answering an online survey. There are 12 783 entries. Table 1 below
describes the dataset and its variables.
The dataset was retrieved from Kaggle under the Apache License, Version 2.0
(https://www.apache.org/licenses/LICENSE-2.0), which allows free and non-exclusive use of the
data.
Employee ID will not be used in the model to predict employee satisfaction, as it serves
purely as an identifier for the employees and not as a predictor of satisfaction.
Variable                 Type                        Measurement         Description
Emp ID                   Categorical (Qualitative)   -                   Employee ID
satisfaction level       Numerical (Quantitative)    Score from 0-1      Employee's self-reported job satisfaction level
last evaluation          Numerical (Quantitative)    Score from 0-1      Employee's most recent performance evaluation score
number project           Numerical (Quantitative)    Project #           Number of projects the employee is currently working on
avg monthly hours        Numerical (Quantitative)    Hours               Average number of hours worked per month by the employee
time spend company       Numerical (Quantitative)    Years               Number of years the employee has spent with the company
Work accident            Categorical (Qualitative)   1: yes, 0: no       Indicates whether the employee has experienced a work accident
promotion last 5 years   Categorical (Qualitative)   1: yes, 0: no       Indicates whether the employee has received a promotion in the last 5 years
dept                     Categorical (Qualitative)   department          The department or division in which the employee works
salary                   Categorical (Qualitative)   low, medium, high   Employee's salary level (e.g., low, medium)
Table 1. Employee Satisfaction Dataset Variables

2.2 Approach
The approach we use to address the problem of workplace satisfaction is systematic and
based on multiple linear regression as taught in DATA 603. Initially, we build a first order
model and, using two selection methods (individual t-tests and the stepwise regression
procedure, each with an alpha value of 0.05), reduce the model to its significant predictors.
We then include interaction terms and higher order terms if necessary. We believe this
systematic approach will work because it shows which predictors are consistently chosen as
significant, which allows us to choose the best possible model. Once the best model is
chosen, we use it to check the assumptions of linear regression and to check that no
influential outliers or multicollinearity are present, in order to confirm that the model is
valid.
2.3 Workflow
Initially, a first order model is created using all of the potential predictor
variables. Individual t-tests are performed to determine which coefficients are significant.
The stepwise regression procedure is then used to check the results of the t-tests. Next, an
interaction model is created and, as before, t-tests are used to determine which interaction
terms are significant. Once the model including interaction terms is known, each predictor
variable in the model is plotted against the response variable to determine whether any
higher order terms should be included. The final model consists of the predictor variables
and interaction terms deemed significant, plus any necessary higher order terms.
Once the model is created, six checks are carried out: the linearity, independence,
normality and equal variance assumptions, multicollinearity, and outliers. Outliers found to
be influential (Cook’s distance greater than 1) will be removed. If the equal variance or
normality assumption is not met, a Box-Cox or log transformation can be applied to the
response variable.
2.4 Workload Distribution

The workload was distributed evenly among team members. Working together, each of us
shared our screen for a period of time and completed specific tasks before handing off to
the next person to complete a new set of tasks. This method worked well, as everyone was
able to contribute and be actively engaged in all parts of the project. It also improved the
speed at which tasks were completed, as those not sharing their screen could find information
needed for specific parts and write out any tedious code or descriptions related to the
section being worked on.
There are three group members, so the project was divided equally between three
roles. The first role created the model using the t-tests and stepwise regression, followed
by checking for higher order terms and interaction terms. The second and third roles checked
the regression assumptions. Six checks were covered in DATA 603: the linearity assumption,
independence assumption, equal variance assumption, normality assumption, multicollinearity
and outliers. The first three checks were completed by the person in role 2 and the
remaining three by the person in role 3. Each group member was responsible for documenting
their assigned role in the report. The remaining sections of the report were worked on
together, with each person taking ownership of sections as needed.
3 MAIN RESULTS OF THE ANALYSIS
3.1 Results
Because the sample size exceeds the limit of the Shapiro-Wilk test, we resampled our data
down to n = 5000 observations and computed our results on this sample.
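The resampling step can be sketched as follows. This is a minimal Python illustration, not the project's own R code (which is in the appendix RMD file). The helper name `draw_subsample`, the fixed seed, and the choice of sampling without replacement are assumptions made for the example; the figures 12 783 and 5000 come from the text.

```python
import random

def draw_subsample(n_rows, n_keep, seed=603):
    """Return sorted row indices for a simple random sample without replacement."""
    rng = random.Random(seed)  # hypothetical fixed seed so the sample is reproducible
    return sorted(rng.sample(range(n_rows), n_keep))

# 12 783 survey entries reduced to the 5000 rows used for the analysis
rows = draw_subsample(12783, 5000)
print(len(rows), len(set(rows)))
```

Fixing the seed means every later test and plot is computed on the same 5000 rows.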
An alpha value of 0.05 is used for all significance testing.
Variable Selection Procedures

The first step of our model creation involved building a first order model with all possible
predictors.
First Order Model:
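$$
\begin{aligned}
Y ={} & \beta_0 + \beta_1 X_{\text{last eval}} + \beta_2 X_{\text{num project}} + \beta_3 X_{\text{avg mon hrs}} + \beta_4 X_{\text{time spend}} + \beta_5 X_{\text{work accidents}} \\
& + \beta_6 X_{\text{promotions}} + \beta_7 X_{\text{dept}} + \beta_8 X_{\text{salary}}
\end{aligned}
$$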

To begin the selection procedure, we decided to use both the t-test and stepwise regression
analysis. This allowed for a systematic and comprehensive approach to selecting predictors.
t-Test
$H_0: \beta_i = 0$
$H_A: \beta_i \neq 0$
Applying the t-test with the hypotheses above, we removed the insignificant variables.
First, the p-value of 0.55 for average_monthly_hours is greater than 0.05, so we fail to
reject the null hypothesis and drop the variable, as it is not significant in the model.
Second, the p-value of 0.127 for promotion_last_5years is also greater than 0.05, so we fail
to reject the null hypothesis and drop this variable as well. The department level IT came
close to the significance level of 0.05, with a p-value of 0.075. Based on the test, we fail
to reject the null hypothesis and the dept variable could be removed. However, because the
p-value is close to the alpha value, we keep dept in the model for now to test whether it is
involved in significant interaction terms.
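The decision rule behind these conclusions can be sketched in a few lines. This is a minimal Python illustration rather than the project's R code; the helper name `decide` is an assumption, while the p-values and the alpha level are the ones quoted in the text.

```python
ALPHA = 0.05  # significance level stated in the report

# p-values quoted in the text for the first order model
p_values = {
    "average_monthly_hours": 0.55,
    "promotion_last_5years": 0.127,
    "dept (IT level)": 0.075,
}

def decide(p_value, alpha=ALPHA):
    """Individual t-test decision: reject H0 (beta_i = 0) only when p < alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

decisions = {name: decide(p) for name, p in p_values.items()}
for name, decision in decisions.items():
    print(f"{name}: p = {p_values[name]} -> {decision}")
```

All three p-values exceed 0.05, so none of these coefficients is shown to differ from zero; dept is retained only provisionally, for the interaction tests.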
Stepwise Regression Selection
The stepwise regression selection procedure produced a best model that dropped
average_monthly_hours, dept, and promotion_last_5years.
To conclude, since average_monthly_hours and promotion_last_5years were dropped by both the
t-test method and the stepwise regression method, we did not keep these variables in our
final model. Since the department variable was only marginally significant, we kept it in
order to test for significant interaction terms. If no significant interaction term involves
the department, it will be dropped from the final model.
Higher Order Terms
After reviewing the various scatter plots of employee satisfaction versus each predictor
variable, it was concluded that none of the variables displayed patterns that would deem
higher order terms necessary.
Interaction Terms
Hypothesis Statement for Individual T-tests (Interaction Terms):
$H_0: \beta_i = 0$
$H_A: \beta_i \neq 0$
After testing for interaction terms it was determined that there were 9 interactions that are
significant to the model:
● last_evaluation:number_project
● last_evaluation:time_spend_company
● last_evaluation:factor(Work_accident)
● last_evaluation:factor(dept)
● number_project:time_spend_company
● number_project:factor(Work_accident)
● number_project:factor(salary)
● time_spend_company:factor(dept)
● factor(dept):factor(salary)
These 9 interactions all have p-values less than the alpha value of 0.05. We therefore
reject the null hypothesis and conclude that their coefficients are not equal to 0, and we
include these 9 terms in our final model. Additionally, since the dept variable appears in
significant interaction terms, it is kept in the final model.
Final Model
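$$
\begin{aligned}
Y ={} & \beta_0 + \beta_1 X_{\text{last eval}} + \beta_2 X_{\text{num project}} + \beta_3 X_{\text{time spend}} + \beta_4 X_{\text{work accidents}} \\
& + \beta_5 X_{\text{dept}} + \beta_6 X_{\text{salary}} + \beta_7 X_{\text{last eval} \times \text{num project}} + \beta_8 X_{\text{last eval} \times \text{time spend}} \\
& + \beta_9 X_{\text{last eval} \times \text{work accidents}} + \beta_{10} X_{\text{last eval} \times \text{dept}} + \beta_{11} X_{\text{num project} \times \text{time spend}} + \beta_{12} X_{\text{num project} \times \text{work accidents}} \\
& + \beta_{13} X_{\text{num project} \times \text{salary}} + \beta_{14} X_{\text{time spend} \times \text{dept}} + \beta_{15} X_{\text{dept} \times \text{salary}}
\end{aligned}
$$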
$R^2_{adj} = 0.1998$: this value indicates that 19.98% of the variation in the response
variable, satisfaction level, is explained by the final model containing the predictors and
the interaction terms.
$RMSE = 0.2218$: this value indicates that the standard deviation of the unexplained
variation in the estimate of the response variable, satisfaction level, is 0.2218.
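For reference, both statistics follow the standard definitions sketched below. The symbols are not defined in the report and are introduced here as assumptions: $n$ is the number of observations, $p$ the number of estimated coefficients excluding the intercept, $SSE$ the residual sum of squares and $SST$ the total sum of squares.

```latex
R^2_{adj} = 1 - \frac{SSE/(n-p-1)}{SST/(n-1)},
\qquad
RMSE = \sqrt{\frac{SSE}{n-p-1}}
```

Because both quantities divide $SSE$ by $n-p-1$, adding a term improves them only if it reduces the residual sum of squares by more than the degree of freedom it costs.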
Interpretation of Coefficients:
The final model has an intercept and fifteen coefficient terms; the categorical terms (such
as dept and salary) each expand into one coefficient per non-reference level. The terms are
interpreted below in relation to employee satisfaction levels.
𝛃0: Intercept - This is the baseline satisfaction level when all numerical predictors are
set to zero and all categorical predictors are at their reference levels. The reference
levels for this model are department (accounting), work_accidents (0) and salary (high).
𝛃1: last_evaluation - Represents the change in satisfaction level for each unit increase in
the last evaluation score, assuming other factors are held constant.
𝛃2: number_project - Represents the change in satisfaction level for each additional
project, assuming other factors are held constant.

𝛃3: time_spend_company - The coefficient shows the change in satisfaction level for each
additional year spent in the company, assuming other factors are held constant.
𝛃4: Work_accident - The coefficient tells us the difference in employee satisfaction level
between employees who have not (0) (the base) and who have (1) been in a workplace
accident.
𝛃5: dept - The coefficients for different departments show the difference in satisfaction levels
compared to the baseline department of accounting, assuming other factors are held constant.
𝛃6: salary: The coefficients for different salary levels indicate how satisfaction varies across
salary levels compared to the baseline salary level of high, assuming other factors are held
constant.
𝛃7: last_evaluation:number_project - This interaction term is the product of last_evaluation
and number_project. Its coefficient describes how the effect of one variable depends on the
other: for each additional project an employee takes on, satisfaction changes by a constant
plus the last evaluation multiplied by 𝛃7; equivalently, for each one-unit increase in the
last evaluation score, satisfaction changes by a constant plus the number of projects
multiplied by 𝛃7.
𝛃8: last_evaluation:time_spend_company - This term is the product of last_evaluation and
time_spend_company. Its coefficient indicates how the effect of one variable on satisfaction
depends on the other. For instance, for every additional year an employee spends at the
company, satisfaction changes by a constant plus the last evaluation multiplied by 𝛃8, and
vice versa.
𝛃9:last_evaluation:factor(Work_accident) - By multiplying last_evaluation with the
categorical variable Work_accident, the coefficient of this interaction tells the difference in
the effect of the last evaluation on satisfaction from those who have not vs. have been in a
workplace accident.
𝛃10:last_evaluation:factor(dept) - This term is the product of last_evaluation and the dept.
The coefficient of this interaction tells the difference in the effect of the last evaluation on
satisfaction from those in accounting (the base) vs those in other departments.
𝛃11: number_project:time_spend_company - This term is the product of number_project and
time_spend_company. For each additional project an employee takes on, satisfaction changes
by a constant plus time_spend_company multiplied by 𝛃11; equivalently, for each additional
year spent at the company, satisfaction changes by a constant plus the number of projects
multiplied by 𝛃11.
𝛃12: number_project:factor(Work_accident) - This interaction, formed by multiplying
number_project with Work_accident, investigates whether the impact of the number of projects
on satisfaction is different for those who have experienced a work accident. Its coefficient
is the difference in the effect of the number of projects on satisfaction between those who
have and those who have not been in a workplace accident.
𝛃13: number_project:factor(salary) - By multiplying number_project with salary, this term
assesses whether the effect of the number of projects on satisfaction varies across the
salary levels low, medium and high, with high being the base. Its coefficients give the
difference in the effect of the number of projects on satisfaction between those with a high
salary and those with other salaries.
𝛃14: time_spend_company:factor(dept) - This interaction, created by multiplying
time_spend_company with the dept variable, explores how the impact of tenure at the company
on satisfaction varies across departments. Its coefficients give the difference in the effect
of the years spent at the company on satisfaction between those in accounting (the base) and
those in other departments.
𝛃15: factor(dept):factor(salary) - This term is the product of dept and salary. It examines
whether the relationship between department and satisfaction differs by salary level. Its
coefficients give the difference in the effect of department (relative to accounting, the
base) on satisfaction between those with other salaries and those with a high salary (the
base) and, equivalently, the difference in the effect of salary level on satisfaction between
those in other departments and those in the accounting department.
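As a worked illustration of the "constant plus" wording used for the interaction terms, the final model can be differentiated with respect to the last evaluation score. This is a sketch under two assumptions that the report does not state explicitly: each interaction regressor is the product of its two variables, and the categorical terms $X_{\text{dept}}$ and $X_{\text{work accidents}}$ are shorthand for their dummy variables.

```latex
\frac{\partial Y}{\partial X_{\text{last eval}}}
  = \beta_1
  + \beta_7 X_{\text{num project}}
  + \beta_8 X_{\text{time spend}}
  + \beta_9 X_{\text{work accidents}}
  + \beta_{10} X_{\text{dept}}
```

The effect of a one-unit increase in the last evaluation score is therefore not a single number: it shifts with the employee's number of projects, tenure, accident history and department.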
3.1 Multiple Regression Assumptions
Linearity Assumption
The residuals versus fitted values plot does not support the linearity assumption: the
fitted blue line is not straight, and the residuals show a pattern rather than random
scatter. We therefore conclude that the model does not fulfill the linearity assumption.
Independence Assumption
The data was collected by employees individually answering a survey, which is not related to
time, space, or group. Therefore each entry is separate and does not influence the results of
another. This characteristic satisfies the independence assumption.
Normality Assumption
$H_0$: The sample data are normally distributed
$H_A$: The sample data are not normally distributed
From the histogram of the residuals, we note that the distribution is skewed to the right.
The points on the normal Q-Q plot deviate from the line in the tail as x increases. To
quantify this, the Shapiro-Wilk test was also performed. The p-value was 2.2e-16 < 0.05;
therefore, we reject the null hypothesis and conclude that the residuals are not normally
distributed.
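The construction behind a normal Q-Q plot can be sketched as below. This is a minimal Python illustration, not the project's R code; the helper name `qq_points`, the plotting positions (i - 0.5)/n and the example residuals are all assumptions made for illustration.

```python
from statistics import NormalDist, mean, stdev

def qq_points(residuals):
    """Pair each sorted standardized residual with its theoretical normal quantile."""
    n = len(residuals)
    m, s = mean(residuals), stdev(residuals)
    sample = sorted((r - m) / s for r in residuals)
    # plotting positions (i - 0.5) / n keep the probabilities strictly inside (0, 1)
    theory = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theory, sample))

# hypothetical right-skewed residuals, for illustration only
example = [-0.9, -0.7, -0.6, -0.5, -0.4, -0.2, 0.0, 0.3, 1.0, 2.0]
points = qq_points(example)
for t, s in points:
    print(f"{t:6.2f} {s:6.2f}")
```

For normally distributed residuals the pairs lie close to the line y = x; in a right-skewed sample such as this one, the largest standardized residuals exceed their theoretical quantiles, which appears as the upper tail leaving the line.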
Equal Variance Assumption

$H_0$: Heteroscedasticity is not present (homoscedasticity)
$H_A$: Heteroscedasticity is present
To test visually for homoscedasticity, we produced a residuals versus fitted values plot,
and to test statistically we performed the Breusch-Pagan test. The residuals versus fitted
values plot shows a rectangular pattern rather than an even random scatter, which we took as
an indication that our model is heteroscedastic. The Breusch-Pagan test gave a p-value of
2.2e-16 < 0.05; therefore, we reject the null hypothesis and conclude that heteroscedasticity
is present, so the equal variance assumption is not fulfilled.
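The idea behind the Breusch-Pagan test can be sketched for the simplest case of one predictor. This is a Python illustration under stated assumptions, not the project's R code: it uses the studentized (Koenker) form of the statistic, LM = n times the R-squared of a regression of squared residuals on the predictor, and the function names and example data are hypothetical.

```python
import math

def ols_fit(x, y):
    """Simple least squares fit y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b

def r_squared(x, y):
    """Coefficient of determination of the simple regression of y on x."""
    a, b = ols_fit(x, y)
    my = sum(y) / len(y)
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - my) ** 2 for yi in y)
    return 1 - sse / sst

def breusch_pagan_1d(x, y):
    """Studentized Breusch-Pagan statistic and p-value for one predictor."""
    a, b = ols_fit(x, y)
    sq_resid = [(yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)]
    lm = len(x) * r_squared(x, sq_resid)    # LM = n * R^2 of the auxiliary regression
    p_value = math.erfc(math.sqrt(lm / 2))  # chi-square upper tail, 1 degree of freedom
    return lm, p_value

# hypothetical data whose error spread grows with x
x = list(range(1, 21))
y = [2 + 0.5 * xi + (-1) ** xi * 0.1 * xi for xi in x]
lm_stat, p_value = breusch_pagan_1d(x, y)
print(f"LM = {lm_stat:.2f}, p = {p_value:.4g}")
```

A small p-value means the squared residuals are systematically related to the predictor, which is the evidence of heteroscedasticity the test looks for.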
Multicollinearity Tests
To test for multicollinearity in our model, we computed the variance inflation factors
(VIF) to determine which variables should remain in our best fitted model. All VIF values
fell in the range 1 ≤ VIF ≤ 5, which suggests moderate collinearity that does not require
corrective measures.
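The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the others. With only two predictors this R² is the squared correlation between them, which gives the minimal Python sketch below. It is an illustration rather than the project's R code; the function names and the example values are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor model: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

projects = [2, 3, 4, 5, 6, 7]  # hypothetical number_project values
years = [3, 2, 4, 3, 5, 4]     # hypothetical time_spend_company values
vif = vif_two_predictors(projects, years)
print(round(vif, 2))
```

A VIF of 1 means the predictors are uncorrelated, and larger values mean more of a predictor's variance is explained by the others.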
Influential Points and Outliers

To identify influential cases, we plotted residuals versus leverage. In this plot, all
cases lie well inside the Cook’s distance lines.
We also plotted Cook’s distance for each observation, which measures the overall influence
each point has on the regression. The points flagged as potential outliers were observations
2227, 2360, and 3309. However, their Cook’s distance values are less than 0.5, so they are
not influential.
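Cook's distance combines the size of a residual with the leverage of the observation. The sketch below computes it for a simple one-predictor regression in Python; it is an illustration, not the project's R code, and the function name and example data are hypothetical. The 0.5 cutoff is the one mentioned in the text.

```python
def cooks_distances(x, y):
    """Cook's distance for each point of a simple regression y = a + b*x."""
    n, p = len(x), 2  # p = number of fitted coefficients (a and b)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - p)  # residual variance estimate
    distances = []
    for xi, e in zip(x, resid):
        h = 1 / n + (xi - mx) ** 2 / sxx      # leverage of this observation
        distances.append(e * e / (p * s2) * h / (1 - h) ** 2)
    return distances

# hypothetical data: the last point sits far below the trend of the others
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.8, 8.1, 9.0, 4.0]
d = cooks_distances(x, y)
flagged = [i for i, v in enumerate(d) if v > 0.5]
print(flagged)
```

In this made-up example only the last point is flagged, because it has both a large residual and high leverage; in the report, no observation reached the 0.5 level.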
Transformation
Since our model violates the normality and equal variance assumptions, we conducted a
Box-Cox transformation of the response, because this transformation can help stabilize the
variance and bring the residuals closer to a normal distribution. After the transformation,
the model still failed the constant variance and normality assumptions. To further try to
correct our model, we also attempted a log transformation. However, the log-transformed
model still failed the constant variance and normality assumptions.
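For reference, the Box-Cox family of transformations can be written as a short function. This is a Python sketch, not the project's R code; the function name `box_cox` and the example values are hypothetical, and the value of lambda would normally be chosen from the data rather than fixed by hand.

```python
import math

def box_cox(values, lam):
    """Box-Cox transform: (y**lam - 1) / lam, or log(y) when lam is 0."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]

satisfaction = [0.38, 0.80, 0.11, 0.72]  # hypothetical satisfaction levels
print(box_cox(satisfaction, 0.5))
print(box_cox(satisfaction, 0))          # lam = 0 is the log transformation
```

The log transformation is the lambda = 0 member of the same family, so the two attempts described above are closely related. Both require a strictly positive response, so any satisfaction scores of exactly 0 would need to be shifted before transforming.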
3.2 Discussion
In all, our model does not meet all of the assumptions required to be a viable model
for predicting employee satisfaction. To test the linearity assumption, we plotted the
residuals versus fitted values and found that the blue fitted line was not straight, which
violates the linearity assumption. For the normality assumption, we plotted a histogram and
a normal Q-Q plot and computed the Shapiro-Wilk test. The histogram showed that the
residuals were right skewed and the normal Q-Q plot showed points leaving the line; both
plots indicated visually that the residuals were not normally distributed. Additionally, the
Shapiro-Wilk test gave a p-value < 0.05, so we rejected the null hypothesis of normality. To
investigate the equal variance assumption, we conducted the Breusch-Pagan test, which gave a
p-value < 0.05, so we rejected the null hypothesis of equal variance; the model therefore
violates the equal variance assumption. To attempt to improve the model, we conducted a
Box-Cox transformation. However, this did not improve our model, as it still violated the
linearity, normality, and equal variance assumptions. To test for multicollinearity, we
computed the VIF values. All variables in the model had values in the range 1 ≤ VIF ≤ 5,
which shows moderate collinearity that did not require corrective measures.
The failure of these assumptions is an important part of our findings. Since our model
did not meet the assumptions, we cannot ensure the reliability or validity of its
predictions. Without meeting these assumptions, we risk drawing false conclusions and
interpretations from our data. Moreover, if comparison with other models is desired, we
would not be able to produce a fair comparison using an invalid model. In all, these
assumptions exist to ensure that the model has no result-altering biases.
4 CONCLUSION AND DISCUSSION
4.1 Approach
The overall approach we took is promising. We used a systematic approach and
confirmed our results through multiple visual and statistical methods. To verify results, we
prioritized the findings from quantitative tests such as Shapiro-Wilk and Breusch-Pagan. The
precise results of these tests were less ambiguous and left less room for interpretation
errors than plots alone, so we consider them the most reliable basis for our conclusions. A
variant of this approach would be to include more visual evidence in our checks. This would
help us understand visually exactly why our model fails, and we would be better equipped to
troubleshoot and to decide on transformation methods.
4.2 Future Work

One opportunity for follow-up work is to find a transformation method that allows our model
to meet the assumptions. This may involve transformations that we have yet to learn. Future
work could also look at different employee satisfaction data; there is room to expand the
data with variables such as employee benefits. It would also be interesting to look at the
difference in satisfaction between those who work remotely, hybrid and in the office. Lastly,
looking at different industries would be a good way to continue this work.
5 REFERENCES
Apollo Technical. (2022). 11 surprising job satisfaction statistics. Retrieved from
https://www.apollotechnical.com/job-satisfaction-statistics/
Bellet, C. S., De Neve, J. E., & Ward, G. (2023). Does employee happiness have an impact
on productivity? Management Science.
Long, R. (2023, September 19). Why employee satisfaction matters more than happiness.
Recruiting Resources: How to Recruit and Hire Better.
https://resources.workable.com/tutorial/employee-satisfaction-happiness#:~:text=Satisfied%20employees%20are%20more%20committed,and%20higher%20overall%20performance%20levels.
Naber, A. (n.d.). One third of your life is spent at work. Retrieved from
https://www.gettysburg.edu/news/stories?id=79db7b34-630c-4f49-ad32-4ab9ea48e72b#:~:text=The%20average%20person%20will%20spend%2090%2C000%20hours%20at%20work%20over%20a%20lifetime.
6 APPENDIX
1. RMD File with all R Code has been submitted in dropbox.
