Data definition
Data preparation
Classification
Naïve Bayes
BayesNet
J48
CHAPTER 3 METHODOLOGY
Methodology
The success rate of any institution can be improved by understanding why students drop out.
In the present study, “Prediction of Dropout Students from Government Schools using
Educational Data Mining (EDM)”, the primary data consist of features extracted from school
census data, covering school profile, geography, infrastructure, human resources and
students’ personal information. The work is a classification task: we try to build an accurate
model that assigns each student to one of two classes, dropout or continue, using a composite
sample of 858,013 students of the Government Schools of Balochistan (KG to Matric) for the
year 2015-2016. Here, dropout explicitly denotes a student’s exit from the education system.
The study also provides insight into dropout behaviour and the factors affecting dropout from
a policy, planning and decision-making perspective.
Nowadays, the deployment of Student Information Systems at the institutional level provides
an appropriate infrastructure for organizing and storing student data, as well as for data
acquisition and deeper analysis. These data can help model the behaviour of dropouts and
predict future dropouts, giving counselors a chance to advise and guide students towards
success. The demand for education in Pakistan has increased as more and more children now
attend school, but the education system has many problems that cause students to abandon
their studies. Schools lack good infrastructure and quality teachers, and course content is
poorly delivered, which leads students to drop out. Students commonly report that they do not
have easy access to educational institutions. The problem is especially acute for students
who migrate to other places because of family circumstances: they face difficulties obtaining
transfer certificates, school leaving certificates and other such formalities. “It is our
educational system that is not encouraging people, creating more and more formalities for the
migrant’s students.” Because of all the formalities to be fulfilled, shifting jobs looks
easier than shifting schools or colleges, and once a child has been out of school for too
long, admission becomes even more difficult.
The primary focus of this study is to compare prediction techniques and identify the one that
gives the best results. Secondly, the research applies data mining techniques to predict
dropout using data collected from secondary schools in Balochistan. Several factors are
considered, such as gender, district, session, shifts, level, school building and its
structure, class, laboratory, toilets, teachers and many others, to determine which factors
affect student dropout the most.
Techniques of data analysis
Two kinds of methods were used to collect the datasets. Collecting real-world data enabled
the researcher to understand the situation and to make realistic predictions that can help
schools support their own development and that of their students. The original data for the
years 2015-2016, concerning secondary-level education in Balochistan, were therefore used.
Dropout rates have been measured differently across the literature and across schools. The
situation in Balochistan is similar, and no appropriate technique exists to predict dropout
within its schools that could have helped the authorities provide better education. Several
studies suggest possible reasons for dropout, including socio-economic factors and academic
failure. However, no adequate prediction has been provided for the schools of Balochistan.
This research is therefore important for understanding student dropout there: applying the
best available prediction techniques can give schools valuable information.
The primary technique used to collect the data was observation, together with the gathering
of critical information about students from the government database. This technique helped
ensure the validity of the information collected. It was done so that the reasons behind
dropout can be better understood and predicted, and schools can take appropriate action
beforehand. The data were carefully transformed into datasets suitable for prediction.
Weka is a powerful tool that performs 90% of the work in this study. It provides several
algorithms, including classification methods and decision trees, for identifying the factors
that negatively affect student retention. Decision tree and other classification methods
were specifically used to predict the causes and the actual rate of dropout that occurs every
year in secondary-level schools in Balochistan. Several classifiers were built in Weka, and a
comparative study was carried out to find an effective classifier that can be used for
prediction in the future. Attributes were analysed in the feature selection step using
techniques such as information gain and correlation-based feature selection, and the most
useful attributes were retained to give better prediction of dropout and of the aspects
affecting it. Lastly, association rule mining was also considered, to find relationships
between the factors recorded in the large database. It would help discriminate between
instances and classes so that prediction can be done appropriately.
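The information gain criterion mentioned above can be illustrated with a minimal sketch. It assumes nominal attributes and uses a few hypothetical toy records (not taken from the study's data); the function names `entropy` and `information_gain` are illustrative and are not Weka identifiers.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on `values`."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy records for illustration only.
status = ["Dout", "Dout", "Continue", "Continue"]
b_wall = ["No", "No", "Yes", "Yes"]   # separates the two classes completely
gender = ["B", "G", "B", "G"]         # carries no information about status here

# Rank the attributes from most to least informative.
ranking = sorted(
    {"B_Wall": information_gain(b_wall, status),
     "Gender": information_gain(gender, status)}.items(),
    key=lambda item: item[1],
    reverse=True,
)
```

In this toy example the boundary-wall attribute ranks first because splitting on it leaves no uncertainty about the class, whereas gender leaves the class distribution unchanged.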
We identify the factors through Data Mining (DM). Furthermore, prediction is done at the
student level to find out which students will drop out in the next academic session, and the
causes of dropout are studied; both belong to the process of knowledge discovery and data
mining. This information will help management reduce the dropout rate in the districts. To
achieve the above-mentioned objectives, the following steps were followed (Fig. 1):
Fig. 1: Raw data → Data preparation/gathering → Data pre-processing → Data selection/transformation → Feature extraction → Classification → Result evaluation
The raw data were obtained from the EMIS PPIOU department and are based on the student and
school profiles for the year 2015-2016.
The data come from two databases, one holding school-level data (the school profile) and the
other student-level data (the student profile).
GENDER: BLANKS 1; BOY 506216; GIRL 351797
CLASS: BLANKS 1; 10TH 12100; 11TH 100; 12TH 101; 2ND 25586; 3RD 20985; 4TH 17754; 5TH 18552; 6TH 15302; 7TH 7347; 8TH 9644; 9TH 8814; KACCHI 174415; PAKKI 41053; UN ADMITTED 44
DISTRICT
STATUS: SELECT ALL 44; BLANKS 1; CONTINUE 22; NEW ADMISSION 22
SESSION: SELECT ALL 22; BLANKS 1; 2015 22
VCSHOOLNAME: SELECT ALL 22; BLANKS 1
SHIFTS: BLANKS 5757
LEVEL: BLANKS 1; HIGH 248368; MIDDLE 140802; PRIMARY 431091
TOTAL AREA: BLANKS 2028
SCHOOL BUILDING: SELECT ALL 858013; BLANKS 4469; NO 43325; YES 810219
OWNERSHIPOFBUILDING: BLANKS 47860; DONATED 261980; RENTED 12193
STRUCTUREOFBUILDING: BLANKS 1; KACCHA 7610; MIX 1234; PAKKA 3349
CONDITIONOFBUILDING: BLANKS 1; ADEQUATE 1435
WATERFACILITYINLATRIN: BLANKS 4469; NO 614851; YES 238693
BOUNDARYWALL: BLANKS 4469; NO 283586; YES 569958
CONDITIONOFBW: BLANKS 537824; COMPLETE 213574; UNCOMPLETE 106615
ELECTRICITYINAREA: BLANKS 4140; NO 134239; YES 719634
ELECTRICITYINSCHOOL: BLANKS 1; NO 347572; YES 372062
GASINAREA: BLANKS 1; NO 208881; YES 163181
GASINSCHOOL: NO 82449; YES 80732
ANYWATERSCHEME: BLANKS 4140; NO 441004; YES 412869
SOURCEOFWATER: BLANKS 4140; NAHAR 67277; STREAM 29190; TAP 240452; WELL 138923
WATER TANK: BLANKS 1; NO 103215; YES 35708
SCIENCE LAB: BLANKS 1; NO 21232; YES 14476
CONDITIONOFSCLAB: BLANKS 682982; ADEQUATE 27853
PROVISIONOFSCLABITEMS: BLANKS 1
COMPLAB: BLANKS 1; NO 753312; YES 94407
CONDITIONOFCOMPLAB: BLANKS 56; ADEQUATE 753312; NEED REPAIR 45
COMPUTERSAREAVAILABLE: BLANKS 34; NO 9000; YES 56789
COMPAREINUSE: BLANKS 23; NO 78990; YES 94407
IFCOMPLABISNOTFUNCTIONAL: BLANKS 7; OTHER 886
LIBRARY IS AVAILABLE: BLANKS 75; NO 7654; YES 08876
CONDITIONOFLIB: BLANKS 64; ADEQUATE 8754
BLANKS 22
YEARLY 900
SEPRATEROOMAVAILABLEFORKACCHI: NO 597142; YES 240912
IFSEPCLASSISNOTAVAILABLE: BLANKS 7654; BARAMDA 23344; OTHER 987; SAHAN 876
SEPARATETEACHERFORKACCHI: BLANKS 54956; NO 448396; YES 354661
PTSMCISAVAILABLE: BLANKS 6836; NO 384224; YES 466953
YEARPTSMCFORMED: BLANKS 52808
BLANKS 1985
BLANKS 1985
BLANKS 1
BLANKS 1
BLANKS 95
NO 6
YES 617
BLANKS 95
EXAMINATION HALL: BLANKS 6602; NO 359; YES 359
PLAY GROUND: BLANKS 4016; NO 2145; YES 1159
BLANKS 7320
BLANKS 7320
In the data preparation process we integrated the entire database using a unique key, the
EMIS code. The data covered two years, 2015-2016. Furthermore, we deleted records with
missing data. The raw data initially carried no dropout label, so we added one. The data used
in this study were prepared from the census records of the Secretariat of Balochistan. The
dataset was constructed on theoretical and empirical grounds concerning the factors affecting
students’ performance and the causes of dropout. The data included socio-demographic
indicators (age, date of birth, geographical location), educational factors (performance in
primary, middle and secondary school, location of schooling, type of examination board,
medium of study, etc.), parental attitudes, causes of dropout, institutional factors, etc.
Before the initial visit to review the records, a coding system was created for each variable
to be documented (e.g., rural=0, urban=1). It was important to document not only dropout
status but also all withdrawal reasons for the students.
The dataset is formulated for applying data mining techniques. For further processing, the
traditional pre-processing methods, consisting of data cleaning, transformation of variables
and data partitioning, have to be applied. Other methods, such as attribute selection and
re-balancing of the data, were also applied to resolve the problems of high dimensionality
and class imbalance that typically arise in such datasets. The data were taken from the
Educational Management Information System (EMIS) for the year 2015-2016 and comprise 858,013
student records with the following fields.
Name of features
Students id | Name | Computer are available | Science lab
Gender | Class | If computer lab is not functional | Provision of science lab
District | Status | Condition of library | Condition of computer lab
Session | Vchschoolname | Separate room available | Computer are in use
Vchdistrict | Vchtehsil | Separate teacher for kacchi class | Library is available
Shifts | Expr1 | Year PTSMC formed | Provision of books
Level | Total area | Total pakka rooms | If separate class is not available
Schoolbuildin | Ownership of building | Total pakka toilet | PTSMC is available
Structure of building | Condition of building | Examination hall | Total kaccha rooms
Q27 | Water facility in area | Sanctioned teacher | Total kaccha toilet
Boundary wall | Condition of boundary wall | Space for new rooms |
Electricity in area | Electricity in school | Play ground |
Gas in area | Gas in school | Appointed teacher |
Table 1: School profile for the year 2015-2016
After data cleaning, the following fields were left for data processing. Student age was
labeled under the attribute DUMMY AGE: ages 5 to 6 as A1, 7 to 8 as A2, 9 to 10 as A3, 11 to
12 as A4, 13 to 14 as A5 and 15 to 16 as A6. The GENDER attribute was coded as Boy=B and
Girl=G, and codes were defined in the same way for the remaining fields; in the status field,
Dropout=Dout and New admission=NEWADM. Certain fields were merged into one attribute to avoid
duplication: total pakka toilet and total kaccha toilet were merged into total toilets, and
total kaccha rooms and total pakka rooms were merged into total rooms. The other codes were:
in the shifts field, Single Shift Schools=SSS and Double Shift Schools=DSS; in the ownership
of building field, Education department=EDUD; in the condition of building field, Need major
repair=NMJR and Need Minor Repair=NMR; in the "if separate class is not available" field,
Sits with other class students=SWOCS; tube well=TW; no water source=NWS; sanctioned teaching
staff=Staff; appointed teaching staff=AT staff; examination hall=EXAM HALL; new rooms
required=Required; space for new rooms=SForNRooms. The attributes were labeled before the
data transformation to make analysis possible, and missing values were removed.
The data were used to predict dropout students and the factors behind dropout. Each student
in the 2015 data was checked against the following year to decide whether the student was a
repeater, promoted or a dropout: a student found in the same class as in the previous year is
considered a repeater; a student found in a different class is considered promoted; and a
student enrolled in 2015 who is not in any class in the following year is considered a
dropout, whatever the circumstances or reasons. In short, students present in 2015 and absent
in 2016 were considered dropouts.
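The age coding and the year-over-year labeling rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the student IDs and the dictionary layout are hypothetical, and the actual matching in the study was done on the EMIS data rather than on in-memory dictionaries.

```python
def dummy_age(age):
    """Map an age in years to the DUMMY AGE code A1..A6 (None outside 5-16)."""
    bands = [(5, 6, "A1"), (7, 8, "A2"), (9, 10, "A3"),
             (11, 12, "A4"), (13, 14, "A5"), (15, 16, "A6")]
    for low, high, code in bands:
        if low <= age <= high:
            return code
    return None


def label_status(class_2015, class_2016):
    """Label a 2015 student by comparing with the 2016 enrolment.

    class_2016 is None when the student is absent from the 2016 data.
    """
    if class_2016 is None:
        return "Dout"        # absent in the following year: dropout
    if class_2016 == class_2015:
        return "Repeater"    # same class as last year
    return "Promoted"        # enrolled in a different class


# Hypothetical enrolment records keyed by student ID.
enrol_2015 = {"S1": "5TH", "S2": "5TH", "S3": "6TH"}
enrol_2016 = {"S1": "6TH", "S2": "5TH"}

labels = {sid: label_status(cls, enrol_2016.get(sid))
          for sid, cls in enrol_2015.items()}
```

Here student S3 appears in 2015 but not in 2016 and is therefore labeled Dout, which is the rule stated in the text.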
3.4 DATA SELECTION AND TRANSFORMATION
Only the attributes that were required for the data mining process were selected. All the
predictor and response variables are shown in Table 1 for reference.
Stud_Id | Student ID |
PTSMCIsAvailable | | Yes, No
SForNRooms | | Yes, No
NR_Required | | 0, 1, 2, 3, 4, 5, 6, 10, 26
R_Rooms | Required rooms | Few, Many, None
The source of the data is students profile and school profile and the total number of data is
The fields above are coded as follows for the data analysis experiments in WEKA. In the
gender field, Boy=B and Girl=G, and codes are defined in the same way for the other fields:
in the status field, Dropout=Dout and New admission=NEWADM; in the shifts field, Single Shift
Schools=SSS and Double Shift Schools=DSS; in the ownership of building field, Education
department=EDUD; in the condition of building field, Need major repair=NMJR and Need Minor
Repair=NMR; in the "if separate class is not available" field, Sits with other class
students=SWOCS.
The data were used to predict dropout students and the factors behind dropout. Each student
in the 2015 data is checked to decide whether the student is a repeater, promoted or a
dropout: a student found in the same class as in the previous year is considered a repeater;
a student found in a different class is considered promoted; and a student of the year 2015
who is not in any class in the following year is considered a dropout, whatever the
circumstances or reasons.
3.7 SPLITTING OF DATA
To split the data, we construct training and test sets using the following protocol:
1. Randomly select 75% of the students of each district for the training data and the remaining 25% for the test data.
2. Stratify, ensuring that each data set contains enough examples of each class.
We then use the training set to train the classification model and the test set to evaluate it.
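The splitting protocol above can be sketched as below. The sketch makes two assumptions that the text does not spell out: stratification is implemented by sampling 75% within each (district, status) group, and the record keys `district` and `status` are hypothetical names chosen for illustration.

```python
import random
from collections import defaultdict


def split_by_district(records, train_fraction=0.75, seed=0):
    """Split records into train/test, sampling within each (district, status) group."""
    rng = random.Random(seed)            # fixed seed so the split is repeatable
    groups = defaultdict(list)
    for record in records:
        groups[(record["district"], record["status"])].append(record)

    train, test = [], []
    for key in sorted(groups):           # deterministic group order
        group = groups[key][:]
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Hypothetical toy data: 2 districts x 2 classes x 4 students each.
records = [
    {"id": f"{district}-{status}-{i}", "district": district, "status": status}
    for district in ("Quetta", "Zhob")
    for status in ("Continue", "Dout")
    for i in range(4)
]
train, test = split_by_district(records)
```

Sampling inside each group keeps every district and every class represented in both sets, which a single global random split would not guarantee for small groups.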
Data mining is the process of discovering patterns in large data sets involving methods at
the intersection of machine learning, statistics, and database systems (Christopher, 2010).
Data definition
The information was collected from the Educational Management Information System (EMIS) of
the education department of Balochistan. The data cover two years and include information on
several aspects of the students' education, including gender and school building details.
The collected data also include geographical location, age, student ID, toilets, education
system, provision of books, library, type of examination, causes of dropout and many others.
The coding system was created before the variables were documented and before prediction
began. The data were prepared from the information already available in the schools' database
about secondary-level students in Balochistan.
Attribute | Description | Values
Stud_Id | Student ID |
Gender | Student’s Gender | Male, Female
Class | Student’s class grade from Kachi to Paki. | Kachi, Pakki, 1,2,3,4,5,6,7,8,9,10
Age | Student’s Age | { >5, 5-20, <20}
Dummy Age | | {A1-A6}
District | Student’s Location | Awaran, Barkhan, Chaghi, Dera Bugti, Gwadar, Harnai, Jafar Abad, Kachhi, Kech, Kharan, Khuzdar, Killa Saifullaf, Kohlu, Lasbela, Loralai, Musakhel, Naseer Abad, Pishin, Quetta, Sherani, Sohbat Pur, Washuk, Zhob, Ziarat.
Status | Student’s Status | Continue, Dropout, New Admission, and Repeater.
Session | Student enrolled in which year | 2015-2016
Level | Student’s Class Level | Primary, Middle, High
School Building | Building is available for school or not. | Yes, No
Owner_Of_Build | Ownership of Building | Donated, EDUD, rented
Struct_Of_Build | Structure Of Building | Kachha, Pakka, Mix
Cond_of_Building | Condition of Building | Adequate, NMOR, NMR, Blanks.
W_F_ In_washroom | Water Facility In washroom | Yes, No
B_Wall | Boundary Wall | Yes, No
Elect_In _A | Electricity In Area | Yes, No
Elect_In_School | Electricity In School | Yes, No
G_In_Area | Gas In Area | Yes, No
G_In_School | Gas In School | Yes, No
A_W_Scheme | Any Water Scheme | Yes, No
Source_of_Water | Source Of Water | Nahar, NWH, Stream, Tap, TW, Well
W_Tank | Water Tank | Yes, No
Sci_Lab | Science Lab | Yes, No
Comp_Lab | Computer Lab | Yes, No
Lib_Is_Avail | Library Is Available | Yes, No
Sep_R_Avail_for Kacchi | Separate Room Available for Kacchi | Yes, No
If Sep_Class_Is_Not_Avail | If Separate Class Is Not Available | Baramda, Sahan, SWOCS, other.
SeprateTeacherForKacchi | Separate Teacher For Kacchi | Yes, No
PTSMCIsAvailable | | Yes, No
T_K_Room | Total Kacha Room | 0,1, 2, 3,4
K_ Rooms | Kaccha Rooms | Few, Many, More
T_P_Room | Total Pakka Room | 0, 1, 2, 3, 4, 5, 7, 17
P_Rooms | Pakka Rooms | Few, Many, None
T_K_Toilet | Total Kaccha Toilet | 0,1
K_TOILET | Kaccha toilet | Few, Many, None
T_P_Toilet | Total Pakka Toilet | 0, 1, 2, 3, 4, 5, 6
P_TOILET | Pakka toilet | Few, Many, None
SForNRooms | | Yes, No
NR_Required | | 0, 1, 2, 3, 4, 5, 6, 10, 26
R_Rooms | Required rooms | Few, Many, None
E_Hall | Exam Hall | Yes, No
P_Ground | Play Ground | Yes, No
ST_Staff | | 0, 1, 2, 3, 4, 6, 8, 12, 13, 14, 18, 19, 35
AT_Staff | | 0, 1, 2, 3, 4, 6, 7, 8, 11, 13, 29
T_Status | Teachers status | 0, 1, 2, 4, 6, 7
Dmmy_T_status | Dummy teachers status | Sufficient, In Sufficient, Under pressure
Status | Status | Continue, DOUT
Dmmy_status | Dummy Status | Repeater, DOUT
Data preparation
Since the data were not in the required format, appropriate datasets were formed using the
student ID and other information. All missing values and unneeded features and attributes
were deleted from the dataset in this step. The data were integrated so that the
preprocessing techniques would give appropriate results, and data fields were transformed and
combined for this reason. The dataset originally included student records such as name,
education and other attributes relevant to dropout. A new dropout attribute was then created
as a dummy variable to answer the research questions.
Data preprocessing and feature selection
After all the required data had been collected, the dataset was prepared using data mining
tools and techniques. Several attributes were arranged so that the data could be classified
and organized in the form of a dataset. Students aged 5 and 6 were labeled A1, and those aged
7 and 8 were labeled A2; A3, A4, A5 and A6 denote the age groups 9-10, 11-12, 13-14 and 15-16
respectively. These codes form the Dummy age attribute. The gender attribute was given two
values, B and G, for boys and girls respectively. Furthermore, several fields were merged
into single attributes so that the information would not be duplicated. For example, the
attribute total toilets combines total kaccha toilet, total pakka toilet and others. Many
field values were renamed: NEWADM (New Admission), SSS (Single Shift Schools), DSS (Double
Shift Schools), EDUD (Education Department), NMJR (Need major repair), NMR (Need Minor
Repair), SWOCS (Sits with other class students), TW (tube well), NWS (no water source), Staff
(sanctioned teaching staff), AT staff (appointed teaching staff), EXAM HALL (examination
hall), Required (new rooms required) and SForNRooms (space for new rooms). The data were
studied and arranged in this way to predict the student dropout rate and the factors that
affect it the most.
Before the models were examined and applied, the data were passed through a series of
preprocessing steps, applying filters to the dataset so that results could be obtained faster
and more clearly. The attributes needed for data mining were chosen and the others were
eliminated. Preprocessing also included removal of missing values, removal of irrelevant
values, smoothing of noisy data, identification or removal of outliers, and resolution of
data inconsistencies. Irrelevant parameters and variables, such as mother tongue, state of
domicile, category, birth and marital status, were also removed. Since the students attend
Pakistani schools and share similar birth and marital status, these parameters were
considered unnecessary for this study.
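The cleaning steps described above (dropping irrelevant attributes, removing records with missing values, merging the toilet counts) can be sketched as follows. The study performed these steps with Weka filters; this stand-alone sketch only mirrors the logic. The key names `mother_tongue`, `domicile`, `category`, `birth`, `marital_status` and `total_toilets` are hypothetical, while `T_K_Toilet` and `T_P_Toilet` follow the attribute table.

```python
# Attributes treated as irrelevant in this sketch (hypothetical key names).
IRRELEVANT = {"mother_tongue", "domicile", "category", "birth", "marital_status"}


def preprocess(records):
    """Drop irrelevant attributes, skip incomplete records, merge toilet counts."""
    cleaned = []
    for record in records:
        row = {key: value for key, value in record.items() if key not in IRRELEVANT}
        if any(value is None or value == "" for value in row.values()):
            continue  # remove records that still contain missing values
        # Merge the two toilet counts into a single attribute.
        row["total_toilets"] = row.pop("T_K_Toilet") + row.pop("T_P_Toilet")
        cleaned.append(row)
    return cleaned


# Hypothetical toy records for illustration only.
raw = [
    {"Stud_Id": "S1", "Gender": "B", "mother_tongue": "x", "T_K_Toilet": 0, "T_P_Toilet": 2},
    {"Stud_Id": "S2", "Gender": "", "mother_tongue": "y", "T_K_Toilet": 1, "T_P_Toilet": 1},
    {"Stud_Id": "S3", "Gender": "G", "mother_tongue": None, "T_K_Toilet": 1, "T_P_Toilet": 3},
]
cleaned = preprocess(raw)
```

Irrelevant attributes are removed before the missing-value check, so a blank in a discarded column (as for S3) does not cause an otherwise complete record to be dropped.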
Feature selection was then carried out to identify the features and variables that have the
greatest effect on the output and prediction. Its primary objective is to reduce the number
of attributes so that prediction becomes easier without harming the reliability of the
classification. The procedure removes irrelevant variables and features that make effective
prediction difficult once the models are applied. In practice, datasets often contain many
attributes, including irrelevant ones, which introduce difficulties and redundant
information. A range of feature selection algorithms was therefore considered to avoid such
situations.
Correlation-based feature selection (CFS) was used in this research to eliminate irrelevant
features that carry no predictive information about dropout. It finds feature subsets that
are highly correlated with the class (dropout) while having low correlation among themselves.
The best-first search was used: it starts with an empty set of features and generates new
subsets at every iteration by adding a single feature to the subset with the highest
evaluation. If expanding a subset brings no improvement, that subset is set aside, and the
search goes back to the best unexpanded subset and continues adding features from there.
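The merit of a feature subset S is given by the standard CFS formula
\[ M_S = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}} \]
where M_S is the merit of the subset S containing k of the features in the database,
\overline{r_{cf}} is the mean feature-class correlation and \overline{r_{ff}} is the mean
feature-feature inter-correlation.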
With the best-first method, the search for the best features among relevant and irrelevant
ones begins with an empty set of features. The subset with the highest merit is retained,
which reduces the dimensionality of the training and test data and supports effective
prediction. When expanding a subset brings no improvement in merit over five further
iterations, it is set aside so that other attributes can be considered. After this procedure,
the reduced dataset is passed to the machine learning step, where a classifier is built to
predict dropout students and the factors that most affect dropout.
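The CFS merit and the best-first search can be sketched as below. This is an illustration under stated assumptions, not the study's implementation: the correlations are supplied as precomputed, hypothetical numbers; the merit is the standard CFS formula, k times the mean feature-class correlation divided by the square root of k + k(k-1) times the mean feature-feature correlation; and the search stops after five consecutive expansions without improvement, which is one common reading of the five-iteration rule.

```python
import heapq
from itertools import combinations
from math import sqrt


def cfs_merit(subset, r_cf, r_ff):
    """CFS merit: k * mean(r_cf) / sqrt(k + k*(k-1) * mean(r_ff))."""
    k = len(subset)
    if k == 0:
        return 0.0
    mean_cf = sum(r_cf[f] for f in subset) / k
    pairs = list(combinations(sorted(subset), 2))
    mean_ff = sum(r_ff[frozenset(p)] for p in pairs) / len(pairs) if pairs else 0.0
    return k * mean_cf / sqrt(k + k * (k - 1) * mean_ff)


def best_first_cfs(features, r_cf, r_ff, max_stale=5):
    """Best-first forward search from the empty set over feature subsets."""
    start = frozenset()
    best, best_merit = start, 0.0
    open_heap = [(0.0, 0, start)]        # (-merit, tie-breaker, subset)
    visited = {start}
    counter, stale = 1, 0
    while open_heap and stale < max_stale:
        _, _, subset = heapq.heappop(open_heap)   # most promising unexpanded subset
        improved = False
        for feature in features:
            child = subset | {feature}
            if feature in subset or child in visited:
                continue
            visited.add(child)
            merit = cfs_merit(child, r_cf, r_ff)
            heapq.heappush(open_heap, (-merit, counter, child))
            counter += 1
            if merit > best_merit:
                best, best_merit, improved = child, merit, True
        stale = 0 if improved else stale + 1
    return best, best_merit


# Hypothetical correlations: A and B are predictive, C is nearly irrelevant.
r_cf = {"A": 0.8, "B": 0.7, "C": 0.1}
r_ff = {frozenset("AB"): 0.2, frozenset("AC"): 0.1, frozenset("BC"): 0.1}
selected, merit = best_first_cfs(["A", "B", "C"], r_cf, r_ff)
```

With these toy numbers the search keeps A and B and leaves out C, because adding a feature that is weakly correlated with the class lowers the merit of the subset.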
Once the subset has been chosen and the inter-correlations have been calculated, the next
step can proceed. These correlations are critical because they indicate how well one
attribute predicts another in the dataset. The quality of an attribute can only be measured
once irrelevant attributes and features have been removed from the final dataset. When
individual instances are considered, a feature selection method of this kind deserves due
attention, because instances differ from one another in their feature and attribute values.
A decision tree predicts the class from attribute values, and feature selection helps it do
so effectively.
Training, validation, and testing set
The training and test sets are equally critical for developing an accurate prediction model.
The training set is a set of examples drawn from the overall dataset and used to fit the
parameters of the classifier. The validation set is used to tune the hyperparameters of the
classifier; in artificial neural networks, for example, the number of hidden units is a
hyperparameter. The primary aim of the validation set is to help avoid overfitting. Lastly,
the test set is independent of the training set but follows the same probability
distribution. If a model fits both the training and the test sets well, it can be said to
show little overfitting. The test set is therefore equally critical for obtaining accurate
results from the predictive model.
The training set was formed by randomly drawing 75% of the database that had been filtered in
the preprocessing step. It is used to determine which model fits best, so that the accuracy
of the results can be confirmed. The remaining records were assigned to the test set, which
is used to check the accuracy of the results.
Classification
Data mining offers a range of algorithms for classifying a dataset to obtain accurate
predictions, including decision trees, rule induction, artificial neural networks,
evolutionary algorithms and instance-based learning. This study uses a decision tree (J48),
Naïve Bayes and BayesNet. These methods allow the researcher to make decisions directly from
the results produced by the algorithms. Among them, a decision tree classifies an instance by
testing its attributes starting from the root of the tree. The following classifiers were
used so that prediction can be done accurately.
Naïve Bayes
This classification technique has been popular since the 1950s, and it was introduced under a
different name in 1960. Naïve Bayes classifiers belong to the family of probabilistic
classifiers and are based on applying Bayes' theorem with a strong (naïve) assumption that
the features are independent given the class. Researchers commonly use the algorithm to build
classifier models for predicting a given outcome. There is no single algorithm for training
such classifiers; nevertheless, the classifier can provide effective prediction of dropout.
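A minimal sketch of a Naïve Bayes classifier for nominal attributes is given below to make the idea concrete. It is not Weka's NaiveBayes implementation: it assumes categorical values only, uses add-one (Laplace) smoothing as a stated choice, and is trained on hypothetical toy records; the class name `CategoricalNaiveBayes` is illustrative.

```python
from collections import Counter, defaultdict
from math import log


class CategoricalNaiveBayes:
    """Naive Bayes for nominal attributes with add-one (Laplace) smoothing."""

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        self.value_counts = defaultdict(Counter)  # (attribute, class) -> value counts
        self.values = defaultdict(set)            # attribute -> values seen in training
        for row, label in zip(rows, labels):
            for attribute, value in row.items():
                self.value_counts[(attribute, label)][value] += 1
                self.values[attribute].add(value)
        return self

    def predict(self, row):
        total = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            score = log(n / total)                # log prior of the class
            for attribute, value in row.items():
                count = self.value_counts[(attribute, label)][value]
                # Smoothed conditional probability P(value | class).
                score += log((count + 1) / (n + len(self.values[attribute])))
            scores[label] = score
        return max(scores, key=scores.get)


# Hypothetical toy training data for illustration only.
rows = [
    {"B_Wall": "No", "Elect_In_School": "No"},
    {"B_Wall": "No", "Elect_In_School": "Yes"},
    {"B_Wall": "Yes", "Elect_In_School": "Yes"},
    {"B_Wall": "Yes", "Elect_In_School": "Yes"},
    {"B_Wall": "Yes", "Elect_In_School": "No"},
]
labels = ["Dout", "Dout", "Continue", "Continue", "Continue"]
model = CategoricalNaiveBayes().fit(rows, labels)
prediction = model.predict({"B_Wall": "No", "Elect_In_School": "No"})
```

The class score is the log prior plus the sum of the log conditional probabilities of each attribute value, which is exactly the independence assumption applied to Bayes' theorem.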
BayesNet
BayesNet is a Bayesian network classifier: a probabilistic graphical model that lets the
researcher compute the probability of the class given the attributes. In several studies its
predictive performance has been close to that of the J48 model. Various authors regard it as
an essential classifier to use after the decision tree for accurate results.
J48
The decision tree algorithm is critical for developing the predictive model. Several
scholars have recognized the importance of the J48 decision tree for tasks such as predicting
student dropout. A decision tree can be read as a set of IF-THEN rules that give a simplified
representation of the data in the database. Without a decision tree or another classifier,
teachers would struggle to recognize the reasons behind dropout.
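The reading of a decision tree as IF-THEN rules can be illustrated with a short sketch. The tree below is hypothetical and is not a result of this study; it only shows how each path from the root to a leaf becomes one rule. J48 is Weka's implementation of the C4.5 algorithm, and the nested-tuple representation used here is an assumption made for illustration.

```python
def tree_to_rules(node, conditions=()):
    """Flatten a decision tree into IF-THEN rules, one rule per leaf."""
    if isinstance(node, str):                     # leaf: a class label
        premise = " AND ".join(conditions) or "TRUE"
        return [f"IF {premise} THEN Status = {node}"]
    attribute, branches = node                    # internal node: test on one attribute
    rules = []
    for value, child in branches.items():
        rules.extend(tree_to_rules(child, conditions + (f"{attribute} = {value}",)))
    return rules


# Hypothetical tree: (attribute, {value: subtree or class label}).
toy_tree = ("B_Wall", {
    "Yes": "Continue",
    "No": ("Elect_In_School", {"Yes": "Continue", "No": "DOUT"}),
})

rules = tree_to_rules(toy_tree)
for rule in rules:
    print(rule)
```

Each printed rule corresponds to one leaf, so the number of rules equals the number of leaves in the tree.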
The success rate of any institution can be improved by understanding why students drop out.
In the present study, the primary data on “Prediction of Dropout Students from Government
Schools using Educational Data Mining (EDM)” are based on various parameters collected
through a school census, which includes school profile, geographical, infrastructure and
students’ personal information. It is a classification task based on a composite sample of
858,013 students of Government Schools (KG to Matric), provided by Asad Ullah (Software
Engineer) of the PPIOU department in the Secretariat, for the year 2015-2016. Predicting
whether students continue their studies or drop out requires many parameters, such as
personal, academic, social and environmental variables.
Classification techniques are based on machine learning. They assign each record in a dataset
to one of a set of predefined classes. To classify data in a database, mathematical
techniques such as neural networks, decision trees, statistics and linear programming are
used. With the help of classification techniques we can predict which students may drop out
in the near future. We can also classify students according to their academic performance, so
that an accurate model can be built from these data.