Regression and Neural Network Based Prediction Model For The Participation of Female Employment in Bangladesh

Department of Computer Science and Engineering

Pabna University of Science and Technology, Pabna-6600
Course Title: Project/Thesis

Course Code: CSE 4100 and CSE 4200
A thesis submitted to the Department of Computer Science and Engineering
in fulfillment of the requirements for the degree of Bachelor of Science
in Computer Science and Engineering
Submitted By:
Sharthok Ghosh
Roll Number:160119
Registration Number: 101699
Session: 2015-16
Supervised By:
Md. Mahmudul Hasan
Assistant Professor, Department of Computer Science and Engineering
Pabna University of Science and Technology
January, 2022
DECLARATION
In accordance with the rules and regulations of Pabna University of Science
and Technology, the following declaration is made:
I hereby declare that this thesis has been done by me under the supervision
of Md. Mahmudul Hasan, Assistant Professor, Department of Computer Science
and Engineering, Pabna University of Science and Technology, Pabna-6600.
Signature of the Examinee

CERTIFICATE
I am pleased to certify that Sharthok Ghosh, Roll No: 160119, Reg No:
101699, Session: 2015-16, performed a thesis work entitled “Regression
and Neural Network Based Prediction Model for the Participation of
Female Employment in Bangladesh” under my supervision for the requirement
of the completion of the course entitled ‘Project/Thesis’. As far as I am
aware, this is an original thesis that has been carried out over one
year in the Department of Computer Science and Engineering, Pabna
University of Science and Technology, Pabna-6600, Bangladesh.
To the best of my knowledge, this thesis has not been duplicated from any
other work or submitted elsewhere prior to its submission to the department.
Md. Mahmudul Hasan
Assistant Professor,
Department of Computer Science and Engineering
Pabna University of Science and Technology, Pabna-6600.
Bangladesh.
ACKNOWLEDGEMENT
All praise to God, who has created us and given us the greatest status
among all his creations. First of all, I express my gratefulness to the
Almighty God for enabling me to perform this task successfully. I
would like to express my deepest sense of gratitude to my honorable
supervisor Md. Mahmudul Hasan, Assistant Professor, Department
of Computer Science and Engineering (CSE), Pabna University of
Science and Technology (PUST), for his scholastic supervision, valuable
guidance, adequate encouragement, and helpful discussion throughout
the progress of this work. I am highly grateful to him for allowing me
to pursue this study under his supervision.
I am deeply thankful to my honorable supervisor, Md. Mahmudul
Hasan, and all the respectable teachers of the Department of Computer
Science and Engineering, Pabna University of Science and Technology,
Pabna-6600, Bangladesh, for their encouragement and help in
the last few months that enabled me to acquire a lot of knowledge
relevant to my research work.
Finally, I am very grateful to my family members, especially
my parents, and to all of my friends and well-wishers for their
encouragement and support.
January, 2022
Author
ABSTRACT
Participation of females in employment along with men serves as a key
indicator of the progress of any country. The employment of females in
Bangladesh has increased since the nineties, and it is important to
identify the influential factors involved in this evolution. This thesis
builds predictive models by identifying the factors responsible for female
employment in Bangladesh. The World Bank data repository based on the “World
Development Indicator (WDI) 2020” has been analyzed. We identify significant
factors, build multiple stepwise linear regression models under data mining,
and evaluate model performance using MAE, MRE, and RMSE under the k-fold
cross-validation technique. The result is then compared with other methods
such as the Gaussian Process, Decision Table, and Random Forest. To obtain
better accuracy, we also build a model using a neural network algorithm.
From the final prediction model, we identify the significant factors that
are responsible for increasing the participation of females in employment.
Contents
1 Introduction
1.1 Overview
1.2 Background
1.3 Thesis Objective
1.4 Thesis Outcome
1.5 Summary
2 Literature Review
2.1 Data Mining
2.2 Data Mining Methods and Techniques
2.2.1 Techniques of Data Mining
2.2.2 Data Mining Tools
2.2.3 Related Method
2.2.3.1 ANOVA Test
2.2.3.2 Linear Regression Model
2.2.3.3 Residuals
2.2.3.4 QQ plot
2.2.3.5 Cross Validation
2.2.3.6 Gaussian Process
2.2.3.7 Random Forest
2.2.3.8 Decision Table
2.3 Related Works
2.4 Summary
3 System Architecture
3.1 Proposed System
3.1.1 Literature Review
3.1.2 Source of Data
3.1.3 Data Preparation
3.1.4 ANOVA Test
3.1.5 Identifying Significant and Relevant Factors
3.1.6 Building Regression Models
3.1.7 Performance Improvement
3.2 Summary
4 Implementation
4.1 Implementation Step
4.2 Summary
5 Result and Discussion
5.1 Result
5.2 Performance Improvement
5.3 Discussion
5.4 Summary
6 Limitations and Future Works
6.1 Limitations
6.2 Future Works
6.3 Summary
References
List of Figures
3.1 Proposed System Model
4.1 Stepwise Indicators Selection
4.2 Vulnerable Employed
4.3 Self Employed
4.4 Agriculture Employed
4.5 Industry Employed
5.1 Residual vs Fitted Graph
5.2 Normal Q-Q Plot
5.3 Boxplot of MAE
5.4 Boxplot of MRE
5.5 Boxplot of RMSE
5.6 Line Chart of Actual VS Predicted Participation
5.7 Neural Network Model
5.8 Scatterplot of Neural Network Model
5.9 Line Chart of Actual VS Neural Predicted Participation
5.10 Line Chart of Actual VS Neural Predicted Participation (For 80% Training Data Set)
5.11 Line Chart of Actual VS Neural Predicted Participation (For 20% Testing Data Set)
5.12 Line Chart of Actual VS Regression Predicted VS Neural Predicted Participation
List of Tables
4.1 Significant and Relevant Factor
4.2 ANOVA Test Result for Model Data
4.3 Association between identifying factors
5.1 Sample Division for K-fold Cross Validation
5.2 5 Fold Cross Validation Result
5.3 Performance Comparison with Other Classifying Technique
5.4 5 Fold Cross Validation Result with Neural Network
Chapter 1
Introduction
This chapter introduces the overview, background, objectives, and outcome
of our thesis. Section 1.1 gives an overview of the work; section 1.2
briefly describes the background of the problem; section 1.3 states the
thesis objectives; section 1.4 presents the outcome, namely four significant
factors and a prediction model; and section 1.5 summarizes the chapter.
1.1 Overview
Female employment is a vital factor for the economic development of Bangladesh,
and it can be increased in many ways. In our study, we try to find the
significant factors responsible for rising female employment in Bangladesh.
For this purpose, we use a stepwise linear regression algorithm. We use
k-fold cross-validation to evaluate the performance of our model. Other
approaches such as the Gaussian Process, Random Forest, and Decision Table
are used to test the validity of our model. We also use a neural network
algorithm. To identify factors, we use the data from the “World Development
Indicator (WDI-2020)” for Bangladesh for the period 1991-2020.
1.2 Background
Female employment is an essential target of the Sustainable Development Goals.
The Sustainable Development Goals were adopted by the United Nations in 2015
as a universal call to end poverty and to achieve peace and prosperity for
all by 2030. There are 17 Sustainable Development Goals, one of which is
“Gender Equality”. Ending all inequalities against women is not only an
essential goal but also crucial for sustainable development. Female
empowerment has been shown to contribute to economic growth and development,
and female employment plays a significant role in advancing female
empowerment [1].
Employment participation is the share of the working-age population, aged
16 to 64, that is currently employed or looking for work in the economy.
People who are still studying, housewives, and people over the age of 64 are
excluded from the labor force. Both men and women must participate in the
workforce for a country’s economy to progress. However, in underdeveloped
nations such as Bangladesh, men are given greater weight in the workplace
than women. Females, on the other hand, face numerous barriers to employment,
which has a negative influence on the country’s economy. Researchers have been
particularly interested in the female labor market in Bangladesh, as well as
other aspects of gender and development, in the previous two decades. In the
context of emerging concerns about gender inequality and its adverse effects
on society and the economy, the issue of female contribution to the national
economy has become a focus of discussion in Bangladesh as well as in most
countries [2]. Integrating the contribution of women has become essential for
any economy based on equity and efficiency. It is now widely recognized that
female participation in the labor market improves women’s relative economic
position and also improves the performance of the economy from a broader
perspective. In Bangladesh, female contribution to the national economy is
much lower due to low participation in the labor market [3]. While females
make significant contributions to non-market activities, such as household
chores and caring for children and the elderly at home, an important factor
in ensuring inclusive economic progress is ensuring women’s greater
participation in market-based productive activities. Involving more women in
mainstream economic activities is important not only for economic efficiency
but also for greater equity and overall growth [4].
In view of the above analysis, we attempt to identify the factors behind
female labor participation in order to aid economic growth.
1.3 Thesis Objective

The main objectives of our study are described below:
1. Identify Factors: The main objective of our study is to identify factors that
are responsible for the progress of female employment.
2. Prediction Model: We try to create a predictive model that can predict
the future employment of women based on the input given to the model.
3. Measure Accuracy Level: We measure the accuracy level of our model
using k-fold cross-validation and other classification techniques.
1.4 Thesis Outcome

The outcomes of this thesis are:
1. We have found that linear regression performs best among the classifying
techniques compared.
2. A neural network works efficiently to minimize the error of the
regression-based model.
3. We have found the significant indicators which have positive and negative
effects on the participation of female employment in Bangladesh.
4. The linear regression technique has helped us to find the significant factors.
5. According to the prediction model, variables such as “Self employed”
and “Industry employed” have a positive effect on the participation of female
employment. Variables such as “Vulnerable employed” and “Agriculture
employed” have a negative effect on the participation of female employment.
1.5 Summary
In this chapter, we discussed the overview, background, objectives, and
outcome of our thesis. The overview section described the whole thesis work,
the background section described our problem, the objective section stated
the main objectives of our study, and the outcome section presented our
results. The next chapter is the literature review, where we discuss data
mining, related methods, and related work so that our work can be understood.
Chapter 2
Literature Review
In this chapter, we discuss related studies, related methods, and related
work. In related studies, we discuss what data mining is, its advantages and
disadvantages, and the methods and tools used in data mining, so that we can
choose the methods and tools for our work. In the related methods section,
we describe the techniques used in our work. The ANOVA test helps us identify
which factors are significant for our study. The linear regression model
helps us find a prediction model for female employment. The residual versus
fitted graph helps us check whether a linear model is appropriate. The Q-Q
plot helps us check how closely the sample data follow the assumed
distribution. We also cover cross-validation, the Gaussian Process, Random
Forest, and Decision Table, which we use to measure the performance of our
model. In the related work section, we discuss previous work related to ours.
Section 2.1 discusses data mining, section 2.2 the methods related to our
study, section 2.3 the related work, and section 2.4 summarizes the chapter.
2.1 Data Mining

To understand our work better, it is important to know about data mining and
the applications, advantages, disadvantages, tools, and methods used in data
mining-based analysis. We discuss each of these in this section.
Data mining is a process used to extract useful information from a larger set
of raw data. It implies analyzing data patterns in large batches of data using
one or more software programs [6]. It is the process of analyzing massive
volumes of data to discover business intelligence that helps organizations
solve problems, mitigate risks, and seize new opportunities [7].
Application of Data Mining

Data mining has become more popular in the past few years. The purpose of the
data mining process is to extract information from a data set and transform it
into an understandable structure for further use [8]. Data mining helps
businesses build a strong relationship between customer and retailer
and improve customer satisfaction. Some of the businesses that apply
data mining techniques include retail, telecommunication, sales and marketing,
healthcare, insurance, finance, manufacturing, and so on [9]. Data mining is
also applied in several other areas including customer relationship
management, business analysis and management, risk analysis and management,
bioinformatics, computer security, and so on [10].
Advantages and Disadvantages of Data Mining

Data mining is used effectively in many areas such as advertising, retail,
finance, banking, production, climate prediction, medicine, transportation,
healthcare, insurance, and government [11]. Data mining assists marketing
businesses in developing models based on historical data to predict who would
respond to new marketing initiatives such as direct mail and online
marketing. It provides similar advantages to retail businesses [12].
Data mining is used for several constructive purposes in advertising and
retail, finance and banking, manufacturing, and so on [13]. It is likewise
used by governments for diverse purposes. But it has its hazards. It raises
questions about privacy, security, and misuse of information. As the internet
booms with social networks, e-commerce, forums, blogs, and so on, most people
are afraid that their personal information is collected and used in an
unethical manner that could cause them trouble [28].
2.2 Data Mining Methods and Techniques
2.2.1 Techniques of Data Mining
Data mining tasks can generally be classified into two types based on what
a particular task attempts to achieve: descriptive tasks and predictive
tasks [14]. A predictive task uses some variables to predict unknown or
future values of other variables. It may determine what might happen in the
future. A descriptive task finds human-interpretable patterns that describe
the data. It describes what happened in the past [15].
The techniques of data mining are-
• Association
• Classification
• Decision Tree
• Clustering
• Prediction
2.2.2 Data Mining Tools
Data mining tools help us perform quick analyses. They spare us the pain of
implementing a well-known algorithm from scratch while still giving us the
flexibility to adjust the tool’s code to our requirements. Each of the tools
mentioned below has its own peculiarities in terms of implementation, and
each has its own merits [16]. The tools are-
• Rapid Miner
• WEKA
• R
• Teradata
• Python
• Orange
• Kaggle etc.
2.2.3 Related Method
To find the related factors and build a prediction model, we use the multiple
stepwise linear regression method. We find the significant factors through
the ANOVA test and examine the correlation among the factors. To measure
performance, we use k-fold cross-validation and compute the Mean Absolute
Error (MAE), Mean Relative Error (MRE), and Root Mean Square Error (RMSE).
Apart from that, we test the performance of our model against other
techniques such as the Gaussian Process, Random Forest, and Decision
Table. For a more realistic model, we employ a neural network technique.
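The three error measures named above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and sample values are hypothetical, and MRE is assumed here to be the mean of |error| / |actual|, a common definition that this section does not spell out.

```python
import math

def mean_absolute_error(actual, predicted):
    """MAE: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mean_relative_error(actual, predicted):
    """MRE (assumed definition): average of |actual - predicted| / |actual|.
    Every actual value must be non-zero."""
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

def root_mean_square_error(actual, predicted):
    """RMSE: square root of the average squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical participation rates (percent), for illustration only.
actual = [30.0, 32.0, 35.0, 36.0]
predicted = [29.0, 33.0, 35.0, 38.0]
print(mean_absolute_error(actual, predicted))
print(mean_relative_error(actual, predicted))
print(root_mean_square_error(actual, predicted))
```

RMSE penalizes large errors more heavily than MAE, while MRE expresses the error relative to the size of the observed value.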
2.2.3.1 ANOVA Test
An ANOVA test is a way to find out whether survey or experiment results are
significant. In other words, it helps you decide whether to reject the null
hypothesis or accept the alternative hypothesis [17].
Types of analysis of variance:
Analysis of variance is of two types. One-way ANOVA and two-way ANOVA.
One-way or two-way refers to the number of independent variables (IVs) in
your Analysis of Variance test.
• One-way has one independent variable.
• Two-way has two independent variables (it can have multiple levels).
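As a minimal sketch of the one-way case, the F statistic is the ratio of between-group to within-group mean squares. The function name and toy groups below are hypothetical. Turning F into a p-value additionally needs the F distribution, which is not computed here.

```python
def one_way_anova_f(groups):
    """Return the F statistic of a one-way ANOVA over a list of groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    # Between-group sum of squares: distance of each group mean from the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of values around their own group mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Toy data: three levels of a single independent variable.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
print(one_way_anova_f(groups))  # approximately 13.0
```

A large F means the group means differ by more than the within-group variation would explain.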
2.2.3.2 Linear Regression Model
Linear regression attempts to model the relationship between two variables by
fitting a linear equation to observed data. One variable is considered to be
an explanatory variable, and the other is considered to be a dependent
variable. Before attempting to fit a linear model to observed data, a modeler
should first determine whether or not there is a relationship between
the variables of interest. If there appears to be no association between the
proposed explanatory and dependent variables, then fitting a linear regression
model to the data probably will not provide a useful model [18].
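A minimal single-variable sketch of both steps, checking the association and then fitting the line, is given below. The thesis itself uses multiple stepwise regression, so this only illustrates the idea. The function names and data are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: a quick check for a linear association."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical explanatory (x) and dependent (y) values.
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]
print(pearson_r(x, y))  # close to 1: strong linear association
print(fit_line(x, y))   # intercept close to 1, slope close to 2
```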
2.2.3.3 Residuals
Once a regression model has been fitted to a group of data, examination
of the residuals (the deviations from the fitted line to the observed values)
allows the modeler to investigate the validity of his or her assumption that
a linear relationship exists. Plotting the residuals on the y-axis against
the explanatory variable on the x-axis reveals any possible non-linear
relationship among the variables.
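The following sketch shows how residuals expose non-linearity. It fits a straight line to data that is actually quadratic, and the residuals then follow a U-shaped pattern instead of scattering randomly around zero. The function name and data are hypothetical.

```python
def residuals_of_line_fit(x, y):
    """Fit y = intercept + slope * x by least squares and return observed - fitted."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [b - (intercept + slope * a) for a, b in zip(x, y)]

# Quadratic data deliberately fitted with a straight line.
x = [0, 1, 2, 3, 4]
y = [v ** 2 for v in x]
print(residuals_of_line_fit(x, y))  # → [2.0, -1.0, -2.0, -1.0, 2.0]
```

Positive residuals at both ends and negative ones in the middle are the kind of systematic pattern that a residual plot is meant to reveal.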
2.2.3.4 QQ plot
The Q-Q plot, or quantile-quantile plot, is a graphical tool to help us check
visually whether a set of data plausibly came from some theoretical
distribution. For instance, if we run a statistical analysis that assumes our
dependent variable is normally distributed, we can use a Normal Q-Q plot to
test that assumption. A Q-Q plot is a scatterplot created by plotting two
sets of quantiles against each other. If both sets of quantiles came from the
same distribution, we should see the points forming a roughly straight line.
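The points of a Normal Q-Q plot can be computed with the standard library alone, as sketched below. The function name is hypothetical, and the plotting positions (i + 0.5) / n are one common convention among several.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(sample):
    """Return (theoretical_quantile, sample_quantile) pairs for a Normal Q-Q plot."""
    ordered = sorted(sample)
    n = len(ordered)
    # Normal distribution with the sample's own mean and standard deviation.
    dist = NormalDist(mean(ordered), stdev(ordered))
    # Positions (i + 0.5) / n stay strictly between 0 and 1, where inv_cdf is defined.
    return [(dist.inv_cdf((i + 0.5) / n), value) for i, value in enumerate(ordered)]

# Hypothetical sample; for normally distributed data the pairs lie near the line y = x.
for theoretical, observed in normal_qq_points([2.0, 4.0, 6.0, 8.0, 10.0]):
    print(round(theoretical, 3), observed)
```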
2.2.3.5 Cross Validation
In k-fold cross-validation, the original sample is randomly partitioned into k
equal-sized subsamples. Of the k subsamples, a single subsample is retained
as the validation data for testing the model, and the remaining k − 1
subsamples are used as training data [19]. The cross-validation process is
then repeated k times, with each of the k subsamples used exactly once as the
validation data. The k results can then be averaged to produce a single
estimate. It is a popular technique because it is simple to understand and
because it generally results in a less biased or less optimistic estimate of
the model skill than other methods, such as a simple train/test split [20].
The general procedure is as follows:
1. Shuffle the dataset randomly.
2. Split the dataset into k groups.
3. For each unique group:
I Take the group as a holdout or test data set
II Take the remaining groups as a training data set
III Fit a model on the training set and evaluate it on the test set.
IV Retain the evaluation score and discard the model.
4. Summarize the skill of the model using the sample of model evaluation scores.
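The procedure above can be sketched as follows. To keep the example self-contained, the "model" is a placeholder that simply predicts the mean of the training values, not the regression model of this thesis. The function names and data are hypothetical, and the sketch assumes there are at least k samples.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # step 1: shuffle
    folds = [indices[i::k] for i in range(k)]   # step 2: split into k groups
    for i, test_idx in enumerate(folds):        # step 3: each group is the holdout once
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def cross_validate_mean_model(y, k=5, seed=0):
    """Score a placeholder 'predict the training mean' model with MAE on each fold."""
    scores = []
    for train_idx, test_idx in k_fold_splits(len(y), k, seed):
        prediction = sum(y[i] for i in train_idx) / len(train_idx)           # fit
        mae = sum(abs(y[i] - prediction) for i in test_idx) / len(test_idx)  # evaluate
        scores.append(mae)                                                   # retain the score
    return scores

# Hypothetical target values, for illustration only.
y = [30.0, 31.0, 33.0, 32.0, 35.0, 36.0, 34.0, 37.0, 38.0, 36.0]
scores = cross_validate_mean_model(y, k=5)
print(sum(scores) / len(scores))  # step 4: summarize with the average score
```

Fixing the seed makes the shuffle reproducible, so repeated runs give the same folds.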
2.2.3.6 Gaussian Process
A Gaussian process is a stochastic process (a collection of random variables
indexed by time or space), such that every finite collection of those random
variables has a multivariate normal distribution, i.e. every finite linear
combination of them is normally distributed. The distribution of a Gaussian
process is the joint distribution of all those (infinitely many) random
variables, and as such, it is a distribution over functions with a continuous
domain, e.g. time or space [21].
2.2.3.7 Random Forest
Random forests are an ensemble learning method for classification,
regression, and other tasks that operate by constructing a multitude of
decision trees at training time and outputting the class that is the mode of
the classes (classification) or the mean prediction (regression) of the
individual trees. Random decision forests correct for decision trees’ habit
of overfitting to their training set [21].
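The sketch below illustrates only the final aggregation step of a random forest, that is, how the outputs of the individual trees are combined. It does not train any trees. The function names and the example votes are hypothetical.

```python
from collections import Counter
from statistics import mean

def forest_classify(tree_votes):
    """Classification: return the most common class among the trees' votes."""
    return Counter(tree_votes).most_common(1)[0][0]

def forest_regress(tree_predictions):
    """Regression: return the mean of the trees' numeric predictions."""
    return mean(tree_predictions)

# Hypothetical outputs of three already-trained trees.
print(forest_classify(["rise", "fall", "rise"]))  # → rise
print(forest_regress([30.0, 32.0, 34.0]))         # → 32.0
```

Averaging over many trees, each trained on a different random subset of the data, is what reduces the variance of a single overfitted tree.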
2.2.3.8 Decision Table
Decision tables are a concise visual representation for specifying which
actions to perform depending on given conditions. They are algorithms whose
output is a set of actions. The information expressed in decision tables can
also be represented as decision trees or in a programming language as a
series of if-then-else and switch-case statements [21].
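The equivalence between a decision table and if-then-else logic can be illustrated with a small sketch. The conditions and actions below are hypothetical and chosen only for illustration.

```python
# Hypothetical decision table: (has_training_data, has_missing_values) -> action.
DECISION_TABLE = {
    (True, False): "fit model",
    (True, True): "impute then fit model",
    (False, False): "collect data",
    (False, True): "collect data",
}

def decide_from_table(has_training_data, has_missing_values):
    """Look up the action for a combination of conditions."""
    return DECISION_TABLE[(has_training_data, has_missing_values)]

def decide_if_else(has_training_data, has_missing_values):
    """The same rules written as if-then-else statements."""
    if not has_training_data:
        return "collect data"
    if has_missing_values:
        return "impute then fit model"
    return "fit model"

print(decide_from_table(True, True))  # → impute then fit model
```

The table form lists every combination of conditions explicitly, which makes missing or contradictory rules easier to spot.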
2.3 Related Works

In this section, we review work related to our study. We searched for papers
on Google Scholar, Elsevier, ACM Digital Library, Wiley, and so on.
After studying related topics, we identified the factors related to the
evolution of female employment. Some related studies are discussed below:
In 2009, Angela Luci published a paper on economic growth and female labor
market participation. This paper examined the impact of female participation
in the labor market on the economy. It indicated that relying on
the equalizing effects of economic growth is not enough in the short term to
increase females’ access to the labor force. In the interest of overall
economic growth, active labor market policies are needed to promote women’s
participation in the labor market, especially in developing countries [22].
A critical research gap is the lack of empirical information on the influence
of growth on female labor market participation. Empirical evidence for the
“feminisation U” theory, which predicts that growth has a mixed impact on
women’s labor market activities, suggests that explicitly improving women’s
economic prospects is necessary to boost a country’s long-term economic
potential.
In 2013, Stéfanie André, Maurice Gesthuizen, and Peer Scheepers published
a paper on female labor market participation, policy models, and gender
differences. The paper examined different explanations for differences in
support for traditional female roles among 32 countries. The highly educated,
the employed, and those who are not religious are the least supportive. The
higher the participation of women in the labor market, the less traditional
the average citizen; this effect is stronger for women than for men.
Public child care spending did not affect the average level of support for
traditional female roles [23]. The study used data from the International
Social Survey Programme (ISSP), an annual program of large-scale
cross-national research on themes essential to the social sciences (ISSP
2010). Except for quota sampling, the nations in the ISSP used a variety of
sampling strategies, including (multi-)stage sampling, cluster sampling, and
probability sampling. The work proceeds by hypothesis testing.
In 2014, Angela Cipollone, Eleonora Patacchini, and Giovanna Valenti published
a paper on female labor market participation in Europe. The paper used 20
years of individual data from 15 countries. The research showed that the
observed trends in women’s participation differ significantly both across
countries and across different groups of women. The authors explored such
differences in trends by looking at the impact of policy and institutional
factors in the labor market on the participation of women in different
households. Labor market institutions and family-oriented policies account
for about 25% of the actual increase in labor force participation for young
women and more than 30% for highly educated women. Surprisingly, changes in
institutional and policy settings contributed less to explaining the
participation of low-skilled women [24]. The authors created a unique dataset
of similar household and individual level characteristics across countries
and over time by combining microdata from two separate sources: the ECHP
(European Community Household Panel) and the EU-SILC (European Union
Statistics on Income and Living Conditions). The ECHP microdata is a
household survey with a standard framework that was carried out across the
EU-15 Member States under Eurostat’s supervision. The ECHP was an eight-year
panel that ran from 1994 to 2001.
In 2015, Eleni T. Stavrou, Wendy J. Casper, and Christiana Ierodiakonou
published an article on support for female employment and the gender
empowerment of labor market conditions. The article examines both
characteristics of the organizational environment and organization-level
variables related to women’s employment, using a multi-source data set
collected across eight European countries. It also found that organizations
that support part-time work options are more likely to employ women. One
reason for this may be that offering part-time employment in high-GEM
(Gender Empowerment Measure) countries is a way for an organization to signal
support for work-life balance, which makes it more attractive to women [25].
2.4 Summary
This chapter is the base chapter of this study. Data mining, related
methods, and related work are described here. From this chapter, we gain a
broad idea of data mining, its advantages and disadvantages, its methods and
tools, and many other things. We also learn about the ANOVA test, the
regression model, the Q-Q plot, the Gaussian Process, Random Forest, Decision
Table, and so on. In the next chapter, we discuss our system architecture
model, which describes the way we solve our problem.
Chapter 3
System Architecture
In the previous chapter, we discussed data mining, related work, and related
methods. In this chapter, we discuss the basic model of our proposed system.
Section 3.1 discusses the proposed system architecture. First, we study
papers related to our work. Then we select a data set; we use the World
Development Indicator (WDI-2020) for our work. Then we prepare our data. In
the next step, we apply the ANOVA test to identify significant factors. Then
we build a regression-based model and apply various performance measurement
criteria to it. Section 3.1.1 discusses the literature review; section 3.1.2
the source of data; section 3.1.3 data preparation; section 3.1.4 the ANOVA
test; section 3.1.5 identifying significant and relevant factors; section
3.1.6 building the regression model; and section 3.1.7 the performance
measurement criteria.
3.1 Proposed System

The proposed system’s major goal is to identify the characteristics that
contribute to the advancement of female employment. We identify key and
relevant factors from the existing work, evaluate data for significance tests,
and develop an estimation model.
Figure 3.1: Proposed System Model
3.1.1 Literature Review
We look at several related papers on female employment participation that are
available through Google Scholar, IEEE, ACM Digital Library, Springer Open, and other
sources. From these, we select the relevant papers by reviewing each paper's title,
abstract, and keywords. We then study the selected papers to identify the factors that
have significant effects on the female labor market.
3.1.2 Source of Data
After reading the related work, we need a reliable data source for our work. For this
study, we use data extracted from the “World Development Indicator, 2020 (WDI-2020)”
database for Bangladesh [26]. The WDI represents accurate development data for each
country, and for this research the database includes up to 61 years of data, from 1960
to 2020.
3.1.3 Data Preparation
Once we have a reliable data set, the next step is data preparation. The data for this
study span up to 61 years, from 1960 to 2020. Not all indicators have values for all
of these years, so we select indicators based on the number of data points available
in the data set. The selected indicators provide data for the period 1991-2020.
3.1.4 ANOVA Test
After preparing the records, we run an ANOVA test on the data. After filtering the
indicators, we investigate the significant factors of female employment
participation. We use the F-statistic to examine the results of the Analysis of
Variance (ANOVA). For this significance check, we require a p-value of less than
0.05. Following the ANOVA test, we get a list of significant and suitable factors to
use in the model construction; this is how we determine which factors are important
for this study.
3.1.5 Identifying Significant and Relevant Factors
We obtain significant and relevant factors for the female labor market through the
ANOVA test. In the ANOVA test, we apply linear regression to identify the significant
factors. For this research, we draw on performance measurement, the literature
review, and the collected data set. Some factors found to be significant do not
appear in the literature; these factors are significant but not relevant, so we
ignore them. We identify the relevant factors through the literature review. Finally,
we keep the factors that are both significant and relevant to the female labor
market.
3.1.6 Building Regression Models
After finding the significant factors, we use the linear regression algorithm from the
classification methods to build the model. Classification is a process of data
analysis and model building; it is used to assign a set of data to one of a set of
predefined classes. A regression algorithm, in turn, is used for predicting outcomes
based on the independent variables. Step-wise linear regression is used to check
whether the whole model together is significant or not. This produces a large number
of candidate models. We then determine which model has higher accuracy by considering
some predefined performance measurement criteria [20]. A regression equation is shown
in equation (1).
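For reference, a multiple linear regression model with n independent variables is
conventionally written in the general form below. The symbols are the usual textbook
ones and are assumed here: y is the dependent variable, x_1 to x_n are the independent
variables, \beta_0 is the intercept, \beta_1 to \beta_n are the coefficients, and
\varepsilon is the error term.

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon
```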
3.1.7 Performance Improvement
Measuring performance is a difficult task. From our search, we identify some widely
used performance measurement criteria. To measure performance, we use Mean Absolute
Error (MAE), Relative Absolute Error (RAE), and Root Mean Squared Error (RMSE). To
validate our model, we also apply various classification algorithms, such as Gaussian
Process, Decision Table, and Random Forest, in the WEKA tool on the same data set.
The formulas for MAE, RAE, and RMSE are shown below in equations (2), (3), and (4).
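In their standard textbook form, these three measures can be written as follows. The
summation index $i$ and the number of observations $n$ are notation assumed here;
$y_a$ is the actual value, $\bar{y}_a$ is the mean of the actual values, and $y_b$ is
the predicted value.

```latex
\begin{align*}
\mathrm{MAE}  &= \frac{1}{n} \sum_{i=1}^{n} \left| y_{a,i} - y_{b,i} \right| \\
\mathrm{RAE}  &= \frac{\sum_{i=1}^{n} \left| y_{a,i} - y_{b,i} \right|}
                      {\sum_{i=1}^{n} \left| y_{a,i} - \bar{y}_a \right|} \\
\mathrm{RMSE} &= \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_{a,i} - y_{b,i} \right)^2}
\end{align*}
```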
In the above equations, $y_a$ denotes the actual value from the data set, $\bar{y}_a$
is the average of the actual values, and $y_b$ is the predicted value generated from
the model. Moreover, we observe the model's R-squared and Adjusted R-squared values to
judge the accuracy of the model. We also run linear regression in WEKA to obtain MAE,
RAE, and RMSE, so that we can easily compare the values we get from the equations
with the values we get from WEKA.
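The measures described above can be sketched in a few lines of Python. This is a
minimal illustration under the standard definitions, not the code used in this work;
the function names and the toy numbers are assumptions made for the example.

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error: average absolute difference."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rae(actual, predicted):
    """Relative Absolute Error: total absolute error relative to the
    total absolute deviation of the actual values from their mean."""
    mean_actual = sum(actual) / len(actual)
    numerator = sum(abs(a - p) for a, p in zip(actual, predicted))
    denominator = sum(abs(a - mean_actual) for a in actual)
    return numerator / denominator


def rmse(actual, predicted):
    """Root Mean Squared Error."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def r_squared(actual, predicted):
    """Share of the variance of the actual values explained by the model."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n, k):
    """Adjust R-squared for n observations and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Toy values for illustration only (not taken from the data set).
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
print(mae(actual, predicted), rae(actual, predicted), rmse(actual, predicted))
```

As a consistency check, `adjusted_r_squared(0.9827, n=30, k=4)` evaluates to about
0.9799, which agrees with the R-squared pair reported later in Table 4.3 when the 30
yearly observations (1991-2020) and four factors are assumed.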
3.2 Summary
In this chapter, we discussed the steps of our proposed system model. First, we choose
the data set, prepare the data, and apply the ANOVA test to find the significant
factors. Then, with the significant factors, we build a prediction model and apply
various performance measurements. In the next chapter, we discuss the implementation.
Chapter 4
Implementation
In the previous chapter, we discussed our proposed system model; in this chapter, we
discuss its implementation. This is the most important chapter. The implementation
has several steps. First, we prepare our data. Then we apply analysis of variance to
the data and identify the significant factors. Then we build a regression model. To
find a better model, we apply a neural network and obtain higher accuracy. Section
4.1 discusses the implementation steps: section 4.1.1, data preparation; section
4.1.2, identifying significant factors; section 4.1.3, the analysis of variance; and
section 4.1.4, building the model. In the next chapter, we discuss the results of our
study and improve its performance.
4.1 Implementation Step

To implement our thesis, we need some tools. These tools are essential because we
cannot implement our thesis without them. The tools used here are the R-Studio IDE
and WEKA.
Data Preparation
The WDI-2020 dataset contains 1447 indicators and incorporates up to 61 years of data,
from 1960 to 2020. The dataset covers many different types of data, such as
education, finance, GDP, tourism, health, population, electricity, mortality rate,
employment, land area, business, banking, national resources, etc. From the data, we
extract almost 650 indicators that have 30 or more data points each. We do not count
data with missing information. We set aside indicators that have very few data points
and that are expected to have less significance in the construction of significant
models.
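The selection rule described above (keep an indicator only if it has at least 30
available values) can be sketched as follows. The table layout, the function name,
and the toy data are hypothetical and chosen only for illustration; they are not the
WDI-2020 file format.

```python
def select_indicators(table, min_points=30):
    """Return the indicators that have at least `min_points` non-missing values.

    `table` maps an indicator name to a {year: value} dict, where a missing
    observation is stored as None (an assumed layout for this sketch).
    """
    selected = {}
    for name, series in table.items():
        available = sum(1 for value in series.values() if value is not None)
        if available >= min_points:
            selected[name] = series
    return selected


# Toy data, not taken from WDI-2020.
years = range(1991, 2021)
toy_table = {
    "indicator_complete": {year: 1.0 for year in years},
    "indicator_sparse": {year: (1.0 if year >= 2010 else None) for year in years},
}
kept = select_indicators(toy_table)
print(sorted(kept))
```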
Identify Significant Factors
Figure 4.1: Stepwise Indicators Selection
We avoid indicators that have few data points and little significance in building
significant models. After data preparation, we get 438 indicators for which data are
available for the period 1991-2020. After the significance analysis, we get 32
indicators. We apply the ANOVA test to the dataset for the period 1991-2020 to
identify significant and relevant factors. After significance testing with a p-value
< 0.05, we identify some factors that do not exist in the literature but have
significance for female employment. Following the method of the proposed system, we
remove these significant but irrelevant factors from the set. The remaining
significant factors are added to the list of factors identified in the literature.
We get six factors that are both significant and relevant, as Table 4.1 shows:
self-employed, industry employed, vulnerable employed, agriculture employed,
employers, and service employed. We apply a linear regression algorithm to these six
factors. From the linear regression model we see that four factors are correlated
with each other, and we can build a prediction model from those four factors. Table
4.2 describes the factors of our final model.
Factor name           Adjusted R-squared  Multiple R-squared  P-value    Star
Self Employed         0.7767              0.7844              7.853e-11  ***
Industry Employed     0.9582              0.9597              <2.2e-16   ***
Vulnerable Employed   0.7824              0.7899              5.456e-11  ***
Agriculture Employed  0.8741              0.8784              2.444e-14  ***
Employers             0.8266              0.8326              2.214e-12  ***
Service Employed      0.7705              0.7784              1.156e-10  ***
Table 4.1: Significant and Relevant Factor
Analysis of Variance
Using the R-Studio IDE, we calculate the analysis of variance of our data set. The
general form of the ANOVA call is as follows:

anova(object, ...)
The headers of the ANOVA output are described below:
Df - Stands for degrees of freedom. It is the number of independent values or
quantities that can be assigned to a statistical distribution. Total degrees of
freedom, (n-1) = (m-1) + (n-m).
(1) If there are n total data points collected, then there are n-1 total degrees of
freedom.
(2) If m groups are being compared, then there are m-1 degrees of freedom associated
with the factor of interest.
(3) If there are n total data points collected and m groups being compared, then
there are n-m error degrees of freedom.
Sum Sq - Stands for sum of squares. It is defined as the sum, over all observations,
of the squared differences of each observation from the overall mean. The total sum
of squares is the sum of squares between the n data points and the grand mean. As the
name suggests, it quantifies the total variability in the observed data.
Total Sum Sq = SS (Between) + SS (Error).
SS (Between) is the sum of squares between the group means and the grand mean. It
quantifies the variability between the groups of interest. SS (Error) is the sum of
squares between the data and the group means. It quantifies the variability within
the groups of interest.
Mean Sq - The mean squares (MS) column contains the average sum of squares for the
Factor and the Error, that is, each sum of squares divided by its degrees of freedom.
F-statistic - The F-value tests whether the means of the populations being compared
are significantly different. It is the ratio of the mean square of the factor to the
mean square of the error.
Pr(>F) - The significance probability (p-value) associated with the F value.
Stars - The stars indicate how significant a factor is, and therefore how suitable it
is for further analysis. In R's output, *** marks p < 0.001, ** marks p < 0.01, and *
marks p < 0.05. If the number of stars is 3, we consider that the data can be used
for further analysis. On the other hand, if the number of stars is 1 or 0, we
consider the data not appropriate for further analysis.
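The column definitions above can be made concrete with a small one-way ANOVA computed
by hand. This is a sketch of the group-comparison form described in the definitions,
with made-up numbers; it is not the R call used in this work, and the function name
is hypothetical.

```python
def one_way_anova(groups):
    """Compute Df, Sum Sq, Mean Sq, and F for a one-way ANOVA.

    `groups` is a list of lists of numbers, one inner list per group.
    """
    n = sum(len(g) for g in groups)          # total data points
    m = len(groups)                          # number of groups
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]

    # SS (Between): group means against the grand mean.
    ss_between = sum(len(g) * (gm - grand_mean) ** 2
                     for g, gm in zip(groups, group_means))
    # SS (Error): each data point against its own group mean.
    ss_error = sum((x - gm) ** 2
                   for g, gm in zip(groups, group_means) for x in g)

    df_between, df_error = m - 1, n - m
    ms_between = ss_between / df_between
    ms_error = ss_error / df_error
    return {
        "df": (df_between, df_error, n - 1),
        "sum_sq": (ss_between, ss_error, ss_between + ss_error),
        "mean_sq": (ms_between, ms_error),
        "F": ms_between / ms_error,
    }


# Toy groups for illustration only.
result = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(result)
```

The p-value in the Pr(>F) column would then be read from the F distribution with
(m-1, n-m) degrees of freedom, which a statistics package normally provides.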
No  Factors               DF  Sum of sq  Mean of sq  F-value  Pr(>F)     Star
1   Vulnerable employed   1   238.05     238.05      105.25   5.456e-11  ***
2   Self employed         1   236.352    236.35      101.86   7.852e-11  ***
3   Agriculture employed  1   264.693    264.693     202.33   2.444e-14  ***
4   Industry employed     1   289.166    289.166     665.96   <2.2e-16   ***
Table 4.2: ANOVA Test Result for Model Data
Histograms of the factors are described below:
Figure 4.2: Vulnerable Employed
Figure 4.2 is the histogram of vulnerable employed. Its highest frequency is 15 and
its lowest frequency is 2.
Figure 4.3: Self Employed
Figure 4.3 is the histogram of self employed. Its highest frequency is 14 and its
lowest frequency is 1.
Figure 4.4: Agriculture Employed
Figure 4.4 is the histogram of agriculture employed. Its highest frequency is 7 and
its lowest frequency is 1.
Figure 4.5: Industry Employed
Figure 4.5 is the histogram of industry employed. Its highest frequency is 8 and its
lowest frequency is 1.
Building Model
After using the stepwise regression technique to analyze the data, we get a model with
four factors: Vulnerable employed (VE), Self-employed (SE), Agriculture employed
(AE), and Industry employed (IE). Our output is Female employment participation
(FEP). The set of coefficients determined from the regression model is shown in Table
4.3, and the regression model is given by the following equation.
FEP = 20.68140 + (-7.22920*VE) + (7.30043*SE) + (-0.10162*AE) + (0.36365*IE)
No.  Factors               Estimate  Std. error  T value  Pr(>|t|)  Sign star
1    Intercept             20.68140  4.63235     4.465    0.000149  ***
2    Vulnerable employed   -7.22920  1.48063     -4.883   5.05e-05  ***
3    Self employed         7.30043   1.51565     4.817    5.99e-05  ***
4    Agriculture employed  -0.10162  0.03286     -3.093   0.004829  **
5    Industry employed     0.36365   0.11853     3.068    0.005125  **
Multiple R-squared: 0.9827    Adjusted R-squared: 0.9799
Table 4.3: Association between identifying factors
Predicted Value Calculated = k + a1*fVE + a2*fSE + a3*fAE + a4*fIE
where k = intercept, a1 = coefficient of VE, a2 = coefficient of SE, a3 = coefficient
of AE, and a4 = coefficient of IE; fVE is the actual value of VE, fSE is the actual
value of SE, fAE is the actual value of AE, and fIE is the actual value of IE.
We use the coefficients and the intercept obtained from the regression equation. The
predicted value is used when we calculate Mean Absolute Error (MAE), Relative
Absolute Error (RAE), and Root Mean Square Error (RMSE). Error calculation is needed
to quantify the inaccuracy of a model.
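As an illustration, the fitted equation can be applied in code as sketched below. The
intercept and coefficients are the ones reported in Table 4.3; the function name and
the input values in the usage line are hypothetical and are not observations from the
data set.

```python
# Intercept and coefficients as reported in Table 4.3.
INTERCEPT = 20.68140
COEFFICIENTS = {
    "VE": -7.22920,  # vulnerable employed
    "SE": 7.30043,   # self-employed
    "AE": -0.10162,  # agriculture employed
    "IE": 0.36365,   # industry employed
}


def predict_fep(ve, se, ae, ie):
    """Predicted female employment participation for one year's factor values."""
    return (INTERCEPT
            + COEFFICIENTS["VE"] * ve
            + COEFFICIENTS["SE"] * se
            + COEFFICIENTS["AE"] * ae
            + COEFFICIENTS["IE"] * ie)


# Hypothetical factor values, for illustration only.
print(predict_fep(ve=1.0, se=1.0, ae=1.0, ie=1.0))
```

The predictions from such a function, paired with the actual values, are the inputs
to the MAE, RAE, and RMSE calculations.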
4.2 Summary
In this chapter, we discussed the implementation of our work. First, we prepare our
data set, then identify factors by applying the ANOVA test, and build a prediction
model with those factors. In the next chapter, we discuss the results of our work and
the performance improvement.
Chapter 5
Result and Discussion
In the previous chapter, we discussed our implementation process; in this chapter, we
discuss our results. We present a residual versus fitted graph that describes the
linearity of our model: if the points follow a straight line, the model is linear;
otherwise it is non-linear. The Q-Q plot shows whether two sets of data come from the
same distribution. We employ k-fold cross-validation to assess the performance of our
model and quantify MAE, RAE, and RMSE. In addition, we apply other classification
techniques, such as the Gaussian process, Random Forest, and decision tree methods.
We also improve the results of our research.
5.1 Result
A residual is the difference between the observed value of the dependent variable (y)
and the predicted value (ŷ). The residuals versus fitted plot tests the assumptions
that the relationship between the variables is linear (i.e. linearity) and that there
is equal variance along the regression line. The plot shows the residuals on the
y-axis and the fitted values on the x-axis. We use this plot to detect linearity. The
residual vs fitted graph below shows how well a linear model suits the data. In a
good residual vs. fitted plot, the points are distributed around the horizontal 0
line, with no outliers and no particularly large residuals. If the points lie around
a horizontal line without any outliers, this indicates a linear relationship between
the dependent and independent variables; otherwise the relationship is non-linear.
Figure 5.1 shows the residual versus fitted graph of our model. From the figure, we
see that the relationship between the input factors and the output factor is not
linear.
Figure 5.1: Residual vs Fitted Graph
A Q-Q plot is a plot of the quantiles of the first data set against the quantiles of
the second data set. A 45-degree reference line is also plotted. If the two sets come
from a population with the same distribution, the points should fall approximately
along this reference line. The greater the departure from this reference line, the
greater the evidence for the conclusion that the two data sets have come from
populations with different distributions.
Figure 5.2 shows the Q-Q plot of our model. From Figure 5.2 we see that the points lie
close to the reference line and are not widely scattered. From this, we can assume
that the sample follows an approximately normal distribution.
Figure 5.2: Normal Q Q Plot
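The points of a normal Q-Q plot can be computed as sketched below: the sorted,
standardized sample values are paired with the quantiles of a standard normal
distribution. The plotting positions (i - 0.5)/n are one common convention and are an
assumption here, as are the function name and the toy sample.

```python
from statistics import NormalDist, mean, stdev


def normal_qq_points(sample):
    """Return (theoretical quantile, standardized sample quantile) pairs."""
    n = len(sample)
    centre, spread = mean(sample), stdev(sample)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for i, value in enumerate(sorted(sample), start=1):
        theoretical = standard_normal.inv_cdf((i - 0.5) / n)
        points.append((theoretical, (value - centre) / spread))
    return points


# Toy sample for illustration only.
for theoretical, observed in normal_qq_points([-1.0, 0.0, 1.0]):
    print(round(theoretical, 3), round(observed, 3))
```

Points that stay close to the 45-degree line (observed roughly equal to theoretical)
support the normality assumption.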
We utilize k-fold cross-validation with k=5 to measure the performance of our model.
We also assess which testing set gives the highest accuracy for its training set.
Using the training and testing sets shown in Table 5.1, we generate five different
models, one for each case. In each case, six consecutive years of data are used as
the testing set, while the remaining 24 years are used as the training set. For
example, in case 1 the first six years of data, from 1991 to 1996, form the testing
set, and the remaining 24 years, from 1997 to 2020, form the training set.
Case no  Training set          Testing set
1        1997-2020             1991-1996
2        1991-1996, 2003-2020  1997-2002
3        1991-2002, 2009-2020  2003-2008
4        1991-2008, 2015-2020  2009-2014
5        1991-2014             2015-2020
Table 5.1: Sample Division for K-fold Cross Validation
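The division in Table 5.1 uses contiguous six-year blocks rather than a random
partition. A minimal sketch that reproduces these five cases is shown below; the
function name is hypothetical, and the sketch assumes the number of years is divisible
by k.

```python
def contiguous_folds(first_year=1991, last_year=2020, k=5):
    """Split the years into k contiguous blocks; each block is the testing
    set once, and the remaining years form the training set."""
    years = list(range(first_year, last_year + 1))
    fold_size = len(years) // k  # assumes len(years) is divisible by k
    folds = []
    for case in range(k):
        test = years[case * fold_size:(case + 1) * fold_size]
        train = [year for year in years if year not in test]
        folds.append((train, test))
    return folds


for case_no, (train, test) in enumerate(contiguous_folds(), start=1):
    print(case_no, "test:", test[0], "-", test[-1], "| training years:", len(train))
```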
To check the performance of our base model, we compare it with different
classification algorithms. We use WEKA to calculate MAE, RMSE, and RAE for Linear
Regression, Decision Table, Gaussian Process, and Random Forest, and display the
results in Table 5.3. From Table 5.2 we see that linear regression gives higher
accuracy.
No.  Model                    Mean Absolute  Relative Absolute  Root Mean Square
                              Error (MAE)    Error (RAE)        Error (RMSE)
1    Base                     0.46%          16.85%             0.58%
2    5-fold Cross Validation  0.08%          2.68%              0.09%
                              0.05%          16.65%             0.08%
                              0.02%          3.19%              0.01%
                              0.02%          2.31%              0.01%
                              0.01%          1.39%              0.01%
Table 5.2: 5 Fold Cross Validation Result
No  Algorithm         Mean Absolute  Relative Absolute  Root Mean Square
                      Error (MAE)    Error (RAE)        Error (RMSE)
1   Decision tree     1.25%          45.73%             1.47%
2   Gaussian process  0.58%          21.43%             0.82%
3   Random Forest     0.07%          2.66%              0.09%
Table 5.3: Performance Comparison with Other Classifying Technique
A boxplot is a graph that gives a good indication of how the values in the data are
spread out. Although boxplots may seem primitive in comparison to a histogram or
density plot, they have the advantage of taking up less space, which is useful when
comparing distributions between many groups or datasets. The error boxplots are given
below: Figure 5.3 shows the boxplot of MAE, Figure 5.4 shows the boxplot of MRE, and
Figure 5.5 shows the boxplot of RMSE.
Figure 5.3: Boxplot of MAE
Figure 5.3 shows the boxplot of MAE, whose lowest error is -0.1 and highest error is
0.2.
Figure 5.4: Boxplot of MRE
Figure 5.4 shows the boxplot of MRE, whose lowest error is above 0.012 and highest
error is below 0.016.
Figure 5.5: Boxplot of RMSE
Figure 5.5 shows the boxplot of RMSE, whose lowest error starts from 0.075 and whose
highest error is below 0.23.
Figure 5.6: Line Chart of Actual VS Predicted Participation
Figure 5.6 shows the relationship between the actual participation and the predicted
participation of female employment. We can conclude from this graph that the actual
and predicted values are nearly identical.
5.2 Performance Improvement

There is a small anomaly in the outcome of the regression model. We attempt to
overcome this anomaly as far as possible by applying an optimization method. There
are various optimization algorithms for this purpose; the neural network model is one
of them. Neural networks are a set of algorithms, modeled loosely after the human
brain, that are designed to recognize patterns. They interpret sensory data through a
kind of machine perception, labeling, or clustering raw input. The patterns they
recognize are numerical, contained in vectors, into which all real-world data, be it
images, sound, text, or time series, must be translated [27].
We apply a neural network to our data set to find a model with better accuracy. Figure
5.7 shows our neural network model. The black lines show the connections with their
weights, and the blue lines show the bias terms. Our model has four input neurons,
three hidden neurons, and one output neuron. Our neural network model has a lower
error, which is 0.02359.
Figure 5.7: Neural Network Model
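To make the 4-3-1 structure concrete, the sketch below computes one forward pass
through a network with four inputs, three hidden neurons, and one output. The logistic
activation in the hidden layer, the linear output, and all weight and bias values are
assumptions for illustration; the trained weights of the model in Figure 5.7 are not
reproduced here.

```python
import math


def sigmoid(x):
    """Logistic activation (assumed for the hidden layer)."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """One forward pass: inputs -> hidden layer -> single linear output."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias


# Placeholder parameters: 3 hidden neurons, each with 4 input weights.
hidden_weights = [
    [0.1, -0.2, 0.3, 0.0],
    [0.0, 0.4, -0.1, 0.2],
    [-0.3, 0.1, 0.2, 0.1],
]
hidden_biases = [0.0, 0.1, -0.1]
output_weights = [0.5, -0.5, 1.0]
output_bias = 0.2

# Hypothetical (scaled) values for VE, SE, AE, and IE.
print(forward([0.2, 0.4, 0.6, 0.8], hidden_weights, hidden_biases,
              output_weights, output_bias))
```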
Figure 5.8 shows the scatterplot of the neural network model which compares
the predicted output with the real output.
Figure 5.8: Scatterplot of Neural Network Model
From Figure 5.8, we can say that the actual output and the neural predicted output
are linearly related.
No.  Model                    Mean Absolute  Relative Absolute  Root Mean Square
                              Error (MAE)    Error (RAE)        Error (RMSE)
1    Base                     0.12%          4.67%              0.18%
2    5-fold Cross Validation  0.06%          2.27%              0.09%
                              0.07%          2.26%              0.09%
                              0.01%          2.14%              0.01%
                              0.01%          1.52%              0.013%
                              0.01%          2.21%              0.022%
Table 5.4: 5 Fold Cross Validation Result with Neural Network
The 5-fold cross-validation results with the neural network are shown in Table 5.4.
The resulting error is considerably smaller after using the neural network algorithm.
Figure 5.9: Line Chart of Actual VS Neural Predicted Participation
Figure 5.9 shows the relationship between the actual participation and the neural
predicted participation of female employment using the full training set. We can tell
from this figure that the actual and neural predicted values are about the same.
We now use 80% of the total data set as training data and 20% as testing data to
generate a graphical view of the relationship between them.
Figure 5.10: Line Chart of Actual VS Neural Predicted Participation (For 80% Training
Data Set)
With the 80 percent training data set, the predicted values are shown alongside the
actual values in Figure 5.10. The performance improves slightly when the 80 percent
training data set is used instead of the full training data set.
Figure 5.11: Line Chart of Actual VS Neural Predicted Participation (For 20% Testing
Data Set)
For the 20% testing data set, Figure 5.11 shows that the predicted values move away
from the actual values.
The two figures above, 5.10 and 5.11, use the 80 percent training data set and the
20% testing data set, respectively. Training on 80% of the data improves performance
compared with using the entire data set as training data. On the 20% testing data
set, however, the predictions move away from the actual values, meaning that
performance decreases on the testing data.
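An 80/20 division of the 30 yearly records can be sketched as follows. Whether the
split in this work was chronological or random is not stated, so the chronological
cut used here is an assumption, and the function name is hypothetical.

```python
def split_80_20(records, train_fraction=0.8):
    """Split ordered records into a leading training part and a trailing
    testing part (chronological cut assumed)."""
    cut = int(round(len(records) * train_fraction))
    return records[:cut], records[cut:]


years = list(range(1991, 2021))          # 30 yearly records
train_years, test_years = split_80_20(years)
print(len(train_years), len(test_years))
```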
5.3 Discussion
According to the model, self-employed and industry employed have a significantly
positive effect on female work participation, whereas vulnerable employed and
agriculture employed have a negative effect. The contribution of the female labor
force is found to be very large in comparison with the total labor force. We found
three regression models, of which the base model is the best. The other two models
also consist of four factors each: one has vulnerable employed, self-employed,
service employed, and industry employed, and the other has service employed,
self-employed, agriculture employed, and vulnerable employed. Vulnerable employed and
self-employed are common to all three models. We therefore analyze the models and our
datasets to find what effect or relation self-employed and vulnerable employed have
in our models.
Figure 5.12: Line Chart of Actual VS Regression Predicted VS Neural Predicted
Participation
Figure 5.12 shows a graphical illustration of the actual values, the linear
regression model, and the neural network model. From this figure, we can say that the
neural network model gives better results than the linear regression model. We aimed
to increase the performance of our model using the neural network model, which we
were able to do at least to some extent.
5.4 Summary
In this chapter, we discussed the residual versus fitted graph, the Q-Q plot, k-fold
cross-validation, the performance comparison with other classifying techniques, the
boxplots of MAE, MRE, and RMSE, the performance improvement, and so on. In the next
chapter, we discuss the limitations and future work of our study.
Chapter 6
Limitations and Future Works
In the previous chapter, we discussed our results; in this chapter, we identify our
limitations and future work. We tried to create a system with as few errors as
possible, but the system still has some limitations. These limitations could be
addressed with the help of other methods, as described in this chapter. Section 6.1
discusses the limitations of our system, and section 6.2 discusses future work to
address them.
6.1 Limitations
Limitations are the aspects of the design or procedures that influenced the
interpretation of our research findings. Stating limitations can be a valuable tool
for identifying new gaps in the literature and indicating the need for additional
studies.
To start with, the sample size we used was insufficient. Because the sample size is so
small, finding a pattern and a meaningful association in our data is difficult.
Secondly, although there are other probable factors, we do not include them due to a
lack of data.
Thirdly, because of the lack of prior research or studies on Bangladesh, it may be
required to develop an entirely new research typology, which seems to be very
difficult.
Fourthly, other methods (association, clustering, and decision tree) may also fit
this research, but they are not checked in this implementation.
Fifthly, with a bigger dataset more folds could be formed, which would make the
cross-validation more exact.
Sixthly, not all fractional digits are taken into consideration, and we take
fractional values as approximate values; thus the error rate may slightly increase.
6.2 Future Works

To overcome our limitations we have some plans.
Firstly, we removed indicators of low significance, and for that reason we may not
have obtained our preferred result; in the future we will try to analyze all
indicators, whether of low or high significance.
Secondly, we will try to find a model that fits both linear and non-linear
relationships.
Thirdly, the cross-validation should be improved in future studies.
Fourthly, we can use other algorithms and techniques to obtain a better model.
6.3 Summary
In this chapter, we highlighted the limitations of our work; our main limitation is
the lack of data on possible factors. We have also outlined steps to overcome these
limitations. Various approaches, including association, clustering, and decision
trees, should be explored. We also need to improve the accuracy of the
cross-validation results.
References
[1] UNDP, “https://www.undp.org/sustainable-development-goals,” 2021 [Online].
[2] S. Raihan and S. H. Bidisha, “Female employment stagnation in Bangladesh,” 2018.
[3] M. Chowdhury, M. Hossain, et al., “Determinants of unemployment in Bangladesh: A case study,” Developing Country Studies, vol. 4, no. 3, 2014.
[4] V. Motkuri, “Caste and rural youth in India: Education, skills and employment,” 2013.
[5] M. Hedayat, S. M. Kahn, and J. Hanafi, “Factors affecting the unemployment (rate) of female art graduates in Iran,” Educational Research and Reviews, vol. 8, no. 9, pp. 546–552, 2013.
[6] Economictimes, “https://economictimes.indiatimes.com/definition/data-mining,” 2021 [Online].
[7] Talend, “https://www.talend.com/resources/what-is-data-mining/,” [Online].
[8] D. J. Hand, “Principles of data mining,” Drug safety, vol. 30, no. 7, pp. 621–622, 2007.
[9] F. Gorunescu, Data Mining: Concepts, models and techniques, vol. 12. Springer Science & Business Media, 2011.
[10] R. P, “https://bigdata-madesimple.com/14-useful-applications-of-data-mining/,” Aug 20 2014 [Online].
[11] D. Enke and S. Thawornwong, “The use of data mining and neural networks for forecasting stock market returns,” Expert Systems with applications, vol. 29, no. 4, pp. 927–940, 2005.
[12] Zentut, “www.zentut.com/data-mining/advantages-and-disadvantages-of-data-mining/,” [Online].
[13] H. W. Ian and F. Eibe, “Data mining: Practical machine learning tools and techniques,” 2005.
[14] Talend, “https://www.talend.com/resources/data-mining-techniques/,” [Online].
[15] T. A. Kumbhare and S. V. Chobe, “An overview of association rule mining algorithms,” International Journal of Computer Science and Information Technologies, vol. 5, no. 1, pp. 927–930, 2014.
[16] J. D. Rodriguez, A. Perez, and J. A. Lozano, “Sensitivity analysis of k-fold cross validation in prediction error estimation,” IEEE transactions on pattern analysis and machine intelligence, vol. 32, no. 3, pp. 569–575, 2009.
[17] G. A. Churchill, “Using ANOVA to analyze microarray data,” Biotechniques, vol. 37, no. 2, pp. 173–177, 2004.
[18] D. C. Montgomery, E. A. Peck, and G. G. Vining, Introduction to linear regression analysis. John Wiley & Sons, 2021.
[19] J. Brownlee, “A gentle introduction to k-fold cross-validation,” Machine Learning Mastery, vol. 2019, 2018.
[20] J. D. Rodriguez, A. Perez, and J. A. Lozano, “Sensitivity analysis of k-fold cross validation in prediction error estimation,” IEEE transactions on pattern analysis and machine intelligence, vol. 32, no. 3, pp. 569–575, 2009.
[21] G. Kesavaraj and S. Sukumaran, “A study on classification techniques in data mining,” in 2013 fourth international conference on computing, communications and networking technologies (ICCCNT), pp. 1–7, IEEE, 2013.
[22] A. Luci, “Female labour market participation and economic growth,” International Journal of Innovation and Sustainable Development, vol. 4, no. 2-3, pp. 97–108, 2009.
[23] S. André, M. Gesthuizen, and P. Scheepers, “Support for traditional female roles across 32 countries: Female labour market participation, policy models and gender differences,” Comparative Sociology, vol. 12, no. 4, pp. 447–476, 2013.
[24] A. Cipollone, E. Patacchini, and G. Vallanti, “Female labour market participation in Europe: novel evidence on trends and shaping factors,” IZA Journal of European Labor Studies, vol. 3, no. 1, pp. 1–40, 2014.
[25] E. T. Stavrou, W. J. Casper, and C. Ierodiakonou, “Support for part-time work as a channel to female employment: The moderating effects of national gender empowerment and labour market conditions,” The International Journal of Human Resource Management, vol. 26, no. 6, pp. 688–706, 2015.
[26] Worldbank, “https://data.worldbank.org/country/bangladesh, December 2020,” 2020 [Online].
[27] H. Lu, R. Setiono, and H. Liu, “Effective data mining using neural networks,” IEEE transactions on knowledge and data engineering, vol. 8, no. 6, pp. 957–961, 1996.
[28] Zentut, “https://www.zentut.com/data-mining/advantages-and-disadvantages-of-data-mining/,” 2021 [Online].