Chapter 3

Methodology
Techniques of data analysis
Weka tool and data mining
Data definition
Data preparation
Data preprocessing and feature selection
Training, validation, and testing set
Classification
Naïve Bayes
BayesNet
J48
CHAPTER 3 METHODOLOGY
Methodology
The success rate of any institute can be improved by understanding why students drop out.
The present study, “Prediction of Dropout Students from Government Schools using
Educational Data Mining (EDM)”, is based on features extracted from school census data
covering school profile, geography, infrastructure, human resources, and students’ personal
information. It is a classification task: we try to build an accurate model that assigns each
student to one of two classes, dropout or continue, using a composite sample of 858,013
students of Government Schools of Balochistan (KG to Matric) for the year 2015-2016. Here,
dropout explicitly denotes a student’s exit from the education system. The study also
provides insight into dropout behaviour and the factors affecting dropout from a policy,
planning, and decision-making perspective.
Classification techniques are based on machine learning. They assign each record in a
dataset to one of a set of predefined classes, using mathematical techniques such as neural
networks, decision trees, statistics, and linear programming. With the help of classification
techniques, we can predict which students are likely to drop out in the near future. We can
also classify students according to their academic performance so that an accurate model can
be built from the data.
Nowadays, the deployment of Student Information Systems at the institutional level provides
an appropriate infrastructure for organizing and storing student data, as well as for data
acquisition and deeper analysis. This data can help model the behavior of dropouts and
predict future dropouts, giving counselors a chance to advise and guide students towards
success. The demand for education in Pakistan has increased as more and more children
attend school, but many problems in the education system cause students to abandon their
studies. Schools lack good infrastructure and quality teachers, and course content is poorly
delivered, which causes students to drop out. Students commonly report that they do not have
easy access to educational institutions. This problem is especially acute for students who
migrate to different places because of family circumstances. They face difficulties obtaining
transfer certificates, school leaving certificates, and other such formalities. “It is our
educational system that is not encouraging people, creating more and more formalities for the
migrant’s students.” Because of all these formalities, it looks easier to shift jobs than to shift
schools or colleges, and once a child has been out of school for too long, admission becomes
even more difficult.
The primary focus of this study is to compare prediction techniques and identify the one that
provides the best results. Secondly, the research focuses on using data mining techniques to
predict dropout, based on data collected from the secondary schools in Balochistan. Several
factors are considered, such as gender, district, session, shifts, level, school building and its
structure, class, laboratory, toilets, teachers, and many others, to determine which factors
affect student dropout the most.
Techniques of data analysis
Two kinds of methods were used to collect the datasets appropriately. Collecting real data
enabled the researcher to understand the situation and to produce predictions that can help
the schools support the development of their students. The original data for the years
2015-2016 concerning secondary-level education in Balochistan was therefore used.
The dropout rate has been measured differently across the literature and across schools. The
situation in Balochistan is similar, and there is no established technique for predicting
dropout within the schools, which could have enabled the authorities to provide better
education. Several studies suggest possible reasons for dropout, including socio-economic
reasons and academic failure. However, no appropriate prediction has been provided for the
schools of Balochistan. This research is therefore important with respect to the student
dropout rate, and the use of the best prediction techniques can give the schools valuable
information.
The primary technique used to collect the data was observation, together with the gathering
of key student information from the government database. This technique helped ensure the
validity of the information collected. It was done so that the reasons behind dropout can be
better understood and predicted and the schools can take appropriate action beforehand. The
data was carefully transformed into datasets suitable for prediction.
Weka tool and data mining
Weka is a powerful tool that performs about 90% of the work done for this study. It provides
several algorithms, including classification methods and decision trees, that were used to
predict the factors that negatively affect student retention. Decision tree and classification
methods were specifically used to predict the causes and the actual rate of dropout that
occurs every year in the secondary-level schools of Balochistan.
Weka was used to build an effective classification model. Several classifiers were applied,
and a comparative study was carried out to find an effective classifier that can be used for
prediction in the future. In the feature selection step, the attributes were analyzed using
techniques such as information gain and correlation-based feature selection, and the most
useful attributes for predicting dropout and the aspects affecting it were selected. Lastly,
association rule mining was also considered for finding relationships between the factors in
the large database. It would help discriminate between instances and classes so that
prediction can be done appropriately.
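The information gain criterion mentioned above can be illustrated with a short sketch. This is not the Weka implementation used in the study; it is a minimal pure-Python version that assumes nominal attributes, and the toy records and the attribute names COMPLAB and GENDER are used only for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(values, labels):
    """Drop in class entropy obtained by splitting on one nominal attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the class inside each attribute value.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy records: (COMPLAB, GENDER, STATUS). Not real study data.
rows = [
    ("NO", "B", "DOUT"), ("NO", "G", "DOUT"), ("NO", "B", "DOUT"),
    ("YES", "G", "CONTINUE"), ("YES", "B", "CONTINUE"), ("YES", "G", "CONTINUE"),
    ("NO", "B", "CONTINUE"), ("YES", "G", "DOUT"),
]
status = [r[2] for r in rows]
gain_complab = information_gain([r[0] for r in rows], status)
gain_gender = information_gain([r[1] for r in rows], status)
```

In this toy sample the computer-lab attribute separates the two classes better than gender, so it receives the higher gain and would be ranked first.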
We identify the factors through Data Mining (DM). Furthermore, prediction is done at the
student level to find out which students will drop out in the next academic session and to
study the causes of dropout, which belongs to the process of knowledge discovery and data
mining. This information will help the management reduce the dropout rate in the districts.
To achieve the above-mentioned objectives, the following steps were followed (Fig. 1):
[Figure: flowchart of the stages RAW DATA, DATA PREPARATION/GATHERING, DATA PRE-PROCESSING, DATA SELECTION/TRANSFORMATION, FEATURE EXTRACTION, DATA MINING AND EDM, CLASSIFICATION, RESULT EVALUATION]
Figure 1. Work Methodology

3.1 The Raw data set
The raw data was taken from the EMIS PPIOU department and is based on the student and
school profiles for the year 2015-2016.
The data comes from two databases, one holding school-level data (the school profile) and
one holding student-level data (the student profile).
TABLE OF RAW DATA

NAME OF FEATURES NO.OF RECORDS
GENDER
SELECT ALL 858013
BLANKS 1
BOY 506216
GIRL 351797
CLASS
SELECT ALL 351797
BLANKS 1
10TH 12100
11TH 100
12TH 101
2ND 25586
3RD 20985
4TH 17754
5TH 18552
6TH 15302
7TH 7347
8TH 9644
9TH 8814
KACCHI 174415
PAKKI 41053
UN ADMITTED 44
DISTRICT
STATUS
SELECT ALL 44
BLANKS 1
CONTINUE 22
NEW ADMISSION 22
SESSION
SELECT ALL 22
BLANKS 1
2015 22
VCSHOOLNAME
SELECT ALL 22
BLANKS 1
GBPS KILLI KOKAR 1
GGPS UNIVERSITY COLONY 21
SHIFTS
SELECT ALL 858013
BLANKS 5757
DOUBLE SHIFTS SCHOOLS 16508
SINGLE SHIFTS SCHOOLS 835748
LEVEL
SELECT ALL 835748
BLANKS 1
HIGH 248368
HIGH SECONDARY 15487
MIDDLE 140802
PRIMARY 431091
TOTAL AREA
SELECT ALL 431091
BLANKS 2028
SCHOOL BUILDING
SELECT ALL 858013
BLANKS 4469
NO 43325
YES 810219
OWNERSHIPOFBUILDING
SELECT ALL 858013
BLANKS 47860
DONATED 261980
EDUCATION DEPARTMENT 536980
RENTED 12193
STRUCTUREOFBUILDING
SELECT ALL 12193
BLANKS 1
KACCHA 7610
MIX 1234
PAKKA 3349
CONDITIONOFBUILDING
SELECT ALL 3349
BLANKS 1
ADEQUATE 1435
NEED MAJOR REPAIR 160
NEEDS MINNOR REPAIR 1754
WATERFACILITYINLATRIN
SELECT ALL 850813
BLANKS 4469
NO 614851
YES 238693
BOUNDARYWALL
SELECT ALL 858013
BLANKS 4469
NO 283586
YES 569958
CONDITIONOFBW
SELECT ALL 858013
BLANKS 537824
COMPLETE 213574
UNCOMPLETE 106615
ELECTRICITYINAREA
SELECT ALL 858013
BLANKS 4140
NO 134239
YES 719634
ELECTRICITYINSCHOOL
SELECT ALL 719634
BLANKS 1
NO 347572
YES 372062
GASINAREA
SELECT ALL 372062
BLANKS 1
NO 208881
YES 163181
GASINSCHOOL
SELECT ALL 163181
BLANKS 1
NO 82449
YES 80732
ANYWATERSCHEME
SELECT ALL 858013
BLANKS 4140
NO 441004
YES 412869
SOURCEOFWATER
SELECT ALL 858013
BLANKS 4140
NAHAR 67277
NO WATER SOURCE 308793
STREAM 29190
TAP 240452
TUBE WEL 69238
WELL 138923
WATER TANK
SELECT ALL 138923
BLANKS 1
NO 103215
YES 35708
SCIENCE LAB
SELECT ALL 35708
BLANKS 1
NO 21232
YES 14476
CONDITIONOFSCLAB
SELECT ALL 858013
BLANKS 682982
ADEQUATE 27853
NEED REPAIR 82453
NEEDS EXPENTION 64725
PROVISIONOFSCLABITEMS
SELECT ALL 858013
BLANKS 1
COMPLAB
SELECT ALL 858013
BLANKS 1
NO 753312
YES 94407
CONDITIONOFCOMPLAB
SELECT ALL 858013
BLANKS 56
ADEQUATE 753312
NEED EXPANTION 9568
NEED REPAIR 45
COMPUTERSAREAVAILABLE
BLANKS 34
NO 9000
YES 56789
COMPAREINUSE
SELECT ALL 858013
BLANKS 23
NO 78990
YES 94407
IFCOMPLABISNOTFUNCTIONAL
SELECT ALL 858013
BLANKS 7
ELECTRICITY IS NOT AVAILABLE IN LAB 765432
IT TEACHER IS NOT AVAILABLE 235689
OTHER 886
LIBRARY IS AVAILABLE
SELECT ALL 858013
BLANKS 75
NO 7654
YES 08876
CONDITIONOFLIB
SELECT ALL 858013
BLANKS 64
ADEQUATE 8754
NEED EXPANTION 12445
NEED REPAIR 78899
PROVISIONOF BOOKSFOR LIB
SELECT ALL 858013
BLANKS 22
EVERY 2 YEARS 67788
MORE THEN 2 YEAR 98776
YEARLY 900
SEPRATEROOMAVAILABLEFORKACCHI
SELECT ALL 858013
BLANKS 19959
NO 597142
YES 240912
IFSEPCLASSISNOTAVAILABLE
SELECT ALL 345666
BLANKS 7654
BARAMDA 23344
OTHER 987
SAHAN 876
SITS WITH OTHER CLASS STUDENTS
SEPARATETEACHERFORKACCHI
SELECT ALL 858013
BLANKS 54956
NO 448396
YES 354661
PTSMCISAVAILABLE
SELECT ALL 858013
BLANKS 6836
NO 384224
YES 466953
YEARPTSMCFORMED
SELECT ALL 858013
BLANKS 52808
TOTAL KACCHA ROOMS
SELECT ALL 858013
BLANKS 1985
TOTAL PAKKA ROOMS
SELECT ALL 858013
BLANKS 1985
TOTAL KACCHA TOILET
SELECT ALL 718
BLANKS 1
TOTAL PAKKA TOILET
SELECT ALL 718
BLANKS 1
SPACE FOR NEW ROOMS
SELECT ALL 718
BLANKS 95
NO 6
YES 617
NEW ROOMS REQUIRED
SELECT ALL 718
BLANKS 95
EXAMINATION HALL
SELECT ALL 7320
BLANKS 6602
NO 359
YES 359
PLAY GROUND
SELECT ALL 7320
BLANKS 4016
NO 2145
YES 1159
SANCTIONED TEACHING STAFF
SELECT ALL 858013
BLANKS 7320
APPOINTED TEACHING STAFF
SELECT ALL 858013
BLANKS 7320
3.2 DATA PREPARATION/INTEGRATION
In the data preparation process, we integrated the entire database using the unique EMIS
code. The data covered the two years 2015-2016. Furthermore, records with missing data were
deleted. The raw data initially had no dropout label, so we added one. The data used in this
study was obtained from the Secretariat of Balochistan on a census basis. It was constructed
on theoretical and empirical grounds about the factors affecting students’ performance and
the causes of dropout. The data included socio-demographic indicators (age, date of birth,
geographical location), educational factors (performance in primary, middle, and secondary
school, location of schooling, type of examination board, medium of study, etc.), parental
attitudes, causes of dropout, and institutional factors.
Before the initial visit to review the records, a coding system was created for each variable to
be documented (e.g., rural=0, urban=1). It was important to document not only dropout
status but also all withdrawal reasons for the students.
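The integration step can be sketched as a join of the student-level and school-level records on the shared EMIS code, followed by the numeric coding of a variable. This is a minimal illustration only: the field names (`emis_code`, `student_id`, `level`, `location`) and the sample records are hypothetical, and the study's actual tooling is not specified in the text.

```python
# Coding system example taken from the text: rural=0, urban=1.
LOCATION_CODES = {"rural": 0, "urban": 1}

def integrate(students, schools):
    """Attach school-profile fields to each student record via the EMIS code."""
    school_by_code = {school["emis_code"]: school for school in schools}
    merged = []
    for student in students:
        school = school_by_code.get(student["emis_code"])
        if school is None:
            # No matching school profile: skip the incomplete record.
            continue
        record = {**school, **student}  # student fields win on name clashes
        record["location"] = LOCATION_CODES[record["location"]]
        merged.append(record)
    return merged

# Hypothetical sample records, not real study data.
students = [
    {"student_id": 1, "emis_code": "S01", "gender": "BOY"},
    {"student_id": 2, "emis_code": "S99", "gender": "GIRL"},  # no school profile
]
schools = [{"emis_code": "S01", "level": "PRIMARY", "location": "rural"}]
merged = integrate(students, schools)
```

Keying the school profiles by EMIS code in a dictionary keeps the join linear in the number of students, which matters for a dataset of several hundred thousand rows.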
3.3 DATA PRE-PROCESSING
The dataset was prepared for the application of data mining techniques. Traditional
pre-processing methods, consisting of data cleaning, transformation of variables, and data
partitioning, were applied first. Other methods, such as attribute selection and re-balancing of
the data, were also applied to address the problems of high dimensionality and class
imbalance that typically arise in such datasets. The data was taken from the Educational
Management Information System (EMIS) for the year 2015-2016 and comprises 858,013
student records with the following fields.
NAME OF FEATURES
Students id | Name | Computer are available | Science lab
Gender | Class | If computer lab is not functional | Provision of science lab
District | Status | Condition of library | Condition of computer lab
Session | Vchschoolname | Separate room available | Computer are in use
Vchdistrict | Vchtehsil | Separate teacher for kacchi class | Library is available
Shifts | Expr1 | Year PTSMC formed | Provision of books
Level | Total area | Total pakka rooms | If separate class is not available
Schoolbuildin | Ownership of building | Total pakka toilet | PTSMC is available
Structure of building | Condition of building | Examination hall | Total kaccha rooms
Q27 | Water facility in area | Sanctioned teacher | Total kaccha toilet
Boundary wall | Condition of boundary wall | Space for new rooms |
Electricity in area | Electricity in school | Play ground |
Gas in area | Gas in school | Appointed teacher |
Table 1: School profile of the year 2015-2016
After data cleaning, the following fields were left for data processing. We then labelled the
age of each student in a DUMMY AGE attribute: ages 5 to 6 as A1, 7 to 8 as A2, 9 to 10 as
A3, 11 to 12 as A4, 13 to 14 as A5, and 15 to 16 as A6. The GENDER attribute was coded as
Boy=B and Girl=G, and codes were defined for the remaining fields in the same way; in the
status field, Dropout=Dout. Certain fields were merged into one attribute to avoid
duplication: total pakka toilet and total kaccha toilet were merged into total toilets, and total
kaccha rooms and total pakka rooms were merged into total rooms. Other values were coded
as follows: in the status field, New admission=NEWADM; in the shifts field, Single Shift
Schools=SSS and Double Shift Schools=DSS; in the ownership of building field, Education
department=EDUD; in the condition of building field, Need major repair=NMJR and Need
Minor Repair=NMR; in the if separate class is not available field, Sits with other class
students=SWOCS; and further, tube well=TW, no water source=NWS, sanctioned teaching
staff=Staff, appointed teaching staff=AT staff, examination hall=EXAM HALL, new rooms
required=Required, and space for new rooms=SForNRooms. Before the data transformation,
the attributes were labelled for analysis and the missing values were removed.
The data was used to predict dropout students and the factors behind dropout. The 2015 data
was used to check whether each student was enrolled, a repeater, promoted, or a dropout. If
the student is in the same class as in the previous year, the student is considered a repeater. If
the student is in a different class from the previous year, the student is considered promoted.
If a student of the year 2015 is not in any class in the following year, the student is
considered a dropout, whatever the circumstances or reasons. In short, students who were
present in 2015 and not present in 2016 were considered dropouts.
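The labelling rule described above can be sketched as a small function that compares a student's class in two consecutive years. The student IDs and class values below are hypothetical, and the label spellings follow the codes used in this chapter; this is an illustration of the rule, not the procedure actually run on the EMIS data.

```python
def label_status(class_2015, class_2016):
    """Return the status implied by a student's class in two consecutive years.

    class_2016 is None when the student has no record in the following year.
    """
    if class_2016 is None:
        return "DOUT"        # present in 2015, absent in 2016
    if class_2016 == class_2015:
        return "REPEATER"    # same class in both years
    return "PROMOTED"        # different class in the following year

# Hypothetical enrolment snapshots keyed by student ID.
enrol_2015 = {"S1": "5TH", "S2": "5TH", "S3": "KACCHI"}
enrol_2016 = {"S1": "6TH", "S2": "5TH"}

labels = {
    student_id: label_status(cls, enrol_2016.get(student_id))
    for student_id, cls in enrol_2015.items()
}
```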
3.4 DATA SELECTION AND TRANSFORMATION
Only the attributes that were required for the data mining process were selected. All the
predictor and response variables are shown in Table 2 for reference.
FEATURES | DESCRIPTION | POSSIBLE VALUES
Stud_Id | Student ID |
Gender | Student’s Gender | Male, Female
Class | Student’s class grade from Kachi to Paki. | Kachi, Pakki, 1,2,3,4,5,6,7,8,9,10
Age | Student’s Age | { >5, 5-20, <20}
Dummy Age | | {A1-A6}
District | Student’s Location | Awaran, Barkhan, Chaghi, Dera Bugti, Gwadar, Harnai, Jafar Abad, Kachhi, Kech, Kharan, Khuzdar,Killa Saifullaf, Kohlu, Lasbela, Loralai, Musakhel, Naseer Abad, Pishin, Quetta, Sherani, Sohbat Pur, Washuk, Zhob, Ziarat.
Status | Student’s Status | Continue, Dropout, New Admission, and Repeater.
Session | Student enrolled in which year | 2015-2016
Level | Student’s Class Level | Primary, Middle, High
School Building | Building is available for school or not. | Yes, No
Owner_Of_Build | Owner Ship of Building | Donated, EDUD,rented
Struct_Of_Build | Structure Of Building | Kachha, Pakka, Mix
Cond_of_Building | Condition of Building | Adequate, NMOR, NMR, Blanks.
W_F_ In_washroom | Water Facility In washroom | Yes, No
B_Wall | Boundary Wall | Yes, No
Elect_In _A | Electricity In Area | Yes, No
Elect_In_School | Electricity In School | Yes, No
G_In_Area | Gas In Area | Yes, No
G_In_School | Gas In School | Yes, No
A_W_Scheme | Any Water Scheme | Yes, No
Sour_of_Water | Source Of Water | Nahar, NWH, Stream, Tap, TW, Well
W_Tank | Water Tank | Yes, No
Sci_Lab | Science Lab | Yes, No
Comp_Lab | Computer Lab | No
Lib_Is_Avail | Library Is Available | Yes, No
Sep_R_Avail_for Kacchi | Separate Room Available for Kacchi | Yes, No
If Sep_Class_Is_Not_Avail | If Separate Class Is Not Available | Baramda, Sahan, SWOCS, other.
SeprateTeacherForKacchi | Seprate Teacher For Kacchi | Yes, No
PTSMCIsAvailable | | Yes, No
T_K_Room | Total Kacha Room | 0,1, 2, 3,4
K_ Rooms | Kaccha Rooms | Few, Many, More
T_P_Room | Total Pakka Room | 0, 1, 2, 3, 4, 5, 7, 17
P_Rooms | Pakka Rooms | Few, Many, None
T_K_Toilet | Total Kaccha Toilet | 0,1
K_TOILET | Kaccha toilet | Few, Many, None
T_P_Toilet | Total Pakka Toilet | 0, 1, 2, 3, 4, 5, 6
P_TOILET | Pakka toilet | Few, Many, None
SForNRooms | | Yes, No
NR_Required | | 0, 1, 2, 3, 4, 5, 6, 10, 26
R_Rooms | Required rooms | Few, Many, None
E_Hall | Exam Hall | Yes, No
P_Ground | Play Ground | Yes, No
ST_Staff | | 0, 1, 2, 3, 4, 6, 8, 12, 13, 14, 18, 19, 35
AT_Staff | | 0, 1, 2, 3, 4, 6, 7, 8, 11, 13, 29
T_Status | Teachers status | 0, 1, 2, 4, 6, 7
Dmmy_T_status | Dummy teachers status | Sufficient, In Sufficient, Under pressure
Status | Status | Continue, DOUT
Dmmy_status | Dummy Status | Repeater, DOUT
Table 2: Student Related Variables
3.6 FEATURE EXTRACTION/SELECTION

In statistics and machine learning, feature selection is also called attribute selection, variable
selection, or variable subset selection. It is the method of selecting relevant features or
predictors for building the student model. Accurate classification depends on feature
selection. For this purpose, we studied and analyzed the dropout records and the factors that
affect students’ dropout. We extracted the following features for our student model (Table
4.1): CLASS, DUMMY AGE, DISTRICT, COMPLAB, and IF SEPCLASS IS NOT
AVAILABLE.
3.5 THE GENERATION OF TRAINING

To answer the questions regarding student dropout, educational data mining (EDM)
techniques with various classification methods were applied in order to obtain a predictive
model that can identify, reliably and precisely, which students are dropping out, taking into
consideration the data about the preliminary department of the specified course.
3.6 TRAINING SET /DATA SET

The data for the year 2015 was taken from EMIS, an online data management system. The
sources of the data are the student profile and the school profile. The dataset consists of
856,994 rows with the following fields:
ATTRIBUTES | ATTRIBUTES | ATTRIBUTES | ATTRIBUTES
GENDER | CLASS | DISTRICT | STATUS
UNION COUNCIL | SHIFTS | LEVELS | TOTAL AREA
SCHOOL BUILDING | OWNERSHIP OF BUILDING | STRUCTURE OF BUILDING | CONDITION OF BUILDING
WATER FACILITY IN LATRINE | BOUNDARY WALL | CONDITION OF BOUNDARY WALL | ELECTRICITY IN SCHOOL
GAS IN AREA | GAS IN SCHOOL | ANY WATER SCHEME | SOURCE OF WATER
WATER TANK | SCIENCE LAB | CONDITION OF SCIENCE LAB | PROVISION OF SCIENCE LAB ITEMS
COMPUTER LAB | CONDITION OF COMPUTER LAB | COMPUTERS ARE AVAILABLE | COMPUTER ARE IN USE
IF COMPUTER LAB IS NOT FUNCTIONAL | LIBRARY IS AVAILABLE | CONDITION OF LIBRARY | PROVISION OF BOOKS FOR LIBRARY
SEPRATE ROOM AVAILABLE FOR KACCHI | SEPARATE TEACHER FOR KACCHI | YEAR PTSMC FORMED | TOTAL KACHA ROOM
TOTAL PAKKA ROOM | TOTAL KACHA TOILET | TOTAL PAKKA TOILET | SPACE FOR NEW ROOMS
NEW ROOMS REQUIRED | EXAMINATION HALL | PLAY GROUND | SANCTIONED TEACHING STAFF
APPOINTTED TEACHING STAFF | | |
Table 3: Data after transformation
The fields above are further coded for the data analysis experiments in WEKA. For example,
in the gender field Boy=B and Girl=G, and codes were defined for the remaining fields in the
same way: in the status field, Dropout=Dout and New admission=NEWADM; in the shifts
field, Single Shift Schools=SSS and Double Shift Schools=DSS; in the ownership of building
field, Education department=EDUD; in the condition of building field, Need major
repair=NMJR and Need Minor Repair=NMR; and in the if separate class is not available
field, Sits with other class students=SWOCS.
The data was used to predict dropout students and the factors behind dropout. The 2015 data
was used to check whether each student was enrolled, a repeater, promoted, or a dropout. If
the student is in the same class as in the previous year, the student is considered a repeater. If
the student is in a different class from the previous year, the student is considered promoted.
If a student of the year 2015 is not in any class in the following year, the student is
considered a dropout, whatever the circumstances or reasons.
3.7 SPLITTING OF DATA
To split the data, we constructed training and test sets using the following protocol:
1. Randomly select 75% of the students of each district for the training data and place
the remainder in the test data.
2. Stratify the split to ensure that each data set contains enough examples of each class,
gender, and grade.
Fig. 2. Diagram of data splitting in the predictive model
We then chose the classification algorithm, used the training set to train the model, and tested
it on the test set for evaluation.
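The splitting protocol can be sketched as a per-group random split. The function below is a minimal illustration that assumes records are dictionaries with a `district` field (a hypothetical name); passing further fields such as status, gender, or class in `strata_fields` stratifies on them as well. The fixed seed is only there to make the example repeatable.

```python
import random
from collections import defaultdict

def stratified_split(records, strata_fields=("district",), train_fraction=0.75, seed=0):
    """Randomly send train_fraction of every stratum to train, the rest to test."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        key = tuple(record[field] for field in strata_fields)
        strata[key].append(record)
    train, test = [], []
    for key in sorted(strata):              # fixed order keeps the split repeatable
        group = strata[key][:]
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test

# Hypothetical records: eight students in each of two districts.
records = [
    {"student_id": f"{district}-{i}", "district": district}
    for district in ("Quetta", "Pishin")
    for i in range(8)
]
train, test = stratified_split(records)
```

Splitting inside each stratum, rather than over the whole dataset at once, guarantees that small districts are represented in both sets in the same 75/25 proportion.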
3.8 DATA MINING
Data mining is the process of discovering patterns in large data sets involving methods at
the intersection of machine learning, statistics, and database systems (Christopher, 2010).
Data definition
The information was collected from the Educational Management Information System
(EMIS) of the education department of Balochistan. The data covered two years and
included several aspects of the students' education, as well as gender and building
information. It also included geographical location, age, student ID, toilets, education
system, provision of books, library, type of examination, causes of dropout, and many
others. Furthermore, a coding system was created for the variables before they were
documented and before prediction began. The data preparation used the information already
available in the school database about secondary-level students in Balochistan.
FEATURES | DESCRIPTION | POSSIBLE VALUES
Stud_Id | Student ID |
Gender | Student’s Gender | Male, Female
Class | Student’s class grade from Kachi to Paki. | Kachi, Pakki, 1,2,3,4,5,6,7,8,9,10
Age | Student’s Age | { >5, 5-20, <20}
Dummy Age | | {A1-A6}
District | Student’s Location | Awaran, Barkhan, Chaghi, Dera Bugti, Gwadar, Harnai, Jafar Abad, Kachhi, Kech, Kharan, Khuzdar,Killa Saifullaf, Kohlu, Lasbela, Loralai, Musakhel, Naseer Abad, Pishin, Quetta, Sherani, Sohbat Pur, Washuk, Zhob, Ziarat.
Status | Student’s Status | Continue, Dropout, New Admission, and Repeater.
Session | Student enrolled in which year | 2015-2016
Level | Student’s Class Level | Primary, Middle, High
School Building | Building is available for school or not. | Yes, No
Owner_Of_Build | Ownership of Building | Donated, EDUD,rented
Struct_Of_Build | Structure Of Building | Kachha, Pakka, Mix
Cond_of_Building | Condition of Building | Adequate, NMOR, NMR, Blanks.
W_F_ In_washroom | Water Facility In washroom | Yes, No
B_Wall | Boundary Wall | Yes, No
Elect_In _A | Electricity In Area | Yes, No
Elect_In_School | Electricity In School | Yes, No
G_In_Area | Gas In Area | Yes, No
G_In_School | Gas In School | Yes, No
A_W_Scheme | Any Water Scheme | Yes, No
Source_of_Water | Source Of Water | Nahar, NWH, Stream, Tap, TW, Well
W_Tank | Water Tank | Yes, No
Sci_Lab | Science Lab | Yes, No
Comp_Lab | Computer Lab | Yes, No
Lib_Is_Avail | Library Is Available | Yes, No
Sep_R_Avail_for Kacchi | Separate Room Available for Kacchi | Yes, No
If Sep_Class_Is_Not_Avail | If Separate Class Is Not Available | Baramda, Sahan, SWOCS, other.
SeprateTeacherForKacchi | Separate Teacher For Kacchi | Yes, No
PTSMCIsAvailable | | Yes, No
T_K_Room | Total Kacha Room | 0,1, 2, 3,4
K_ Rooms | Kaccha Rooms | Few, Many, More
T_P_Room | Total Pakka Room | 0, 1, 2, 3, 4, 5, 7, 17
P_Rooms | Pakka Rooms | Few, Many, None
T_K_Toilet | Total Kaccha Toilet | 0,1
K_TOILET | Kaccha toilet | Few, Many, None
T_P_Toilet | Total Pakka Toilet | 0, 1, 2, 3, 4, 5, 6
P_TOILET | Pakka toilet | Few, Many, None
SForNRooms | | Yes, No
NR_Required | | 0, 1, 2, 3, 4, 5, 6, 10, 26
R_Rooms | Required rooms | Few, Many, None
E_Hall | Exam Hall | Yes, No
P_Ground | Play Ground | Yes, No
ST_Staff | | 0, 1, 2, 3, 4, 6, 8, 12, 13, 14, 18, 19, 35
AT_Staff | | 0, 1, 2, 3, 4, 6, 7, 8, 11, 13, 29
T_Status | Teachers status | 0, 1, 2, 4, 6, 7
Dmmy_T_status | Dummy teachers status | Sufficient, In Sufficient, Under pressure
Status | Status | Continue, DOUT
Dmmy_status | Dummy Status | Repeater, DOUT
Data preparation
Since the data was not in the required format, appropriate datasets were formed using the
student ID and other information. All the missing values, features, and attributes were
deleted from the dataset in this step. The data was integrated so that the preprocessing
techniques would produce appropriate outcomes, and the data fields were transformed and
combined for this reason. The dataset originally included student records such as name,
education, and other aspects relevant to dropout. A new dropout attribute was then created
as a dummy variable to answer the research questions.
Data preprocessing and feature selection
After all the required data had been collected, the dataset was prepared using data mining
tools and techniques. Several attributes were arranged so that the data could be classified
and organized in the form of a dataset. Students aged 5 and 6 were labelled A1, and those
aged 7 and 8 were labelled A2. Likewise, A3, A4, A5, and A6 correspond to the age groups
9-10, 11-12, 13-14, and 15-16 respectively. These labels form the Dummy age attribute. The
gender attribute was given two values, B and G, for boys and girls respectively. Furthermore,
several fields were merged into single attributes to avoid duplicated information. For
example, the total toilets attribute combines total kaccha toilet, total pakka toilet, and others.
Many field values were also recoded: New Admission as NEWADM, Single Shift Schools as
SSS, Double Shift Schools as DSS, Education Department as EDUD, Need major repair as
NMJR, Need Minor Repair as NMR, Sits with other class students as SWOCS, tube well as
TW, no water source as NWS, sanctioned teaching staff as Staff, appointed teaching staff as
AT staff, examination hall as EXAM HALL, new rooms required as Required, and space for
new rooms as SForNRooms. This data was thoroughly studied and arranged to predict student
dropout and the factors that affect it the most.
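The recoding described above can be sketched as follows. The age bands and short codes are taken from the text; the record field names, the upper-case raw spellings used as dictionary keys, and the choice to merge the two toilet counts by summing them are assumptions made for illustration.

```python
# Age bands and labels as given in the text (ages 5 to 16 only).
AGE_BANDS = [(5, 6, "A1"), (7, 8, "A2"), (9, 10, "A3"),
             (11, 12, "A4"), (13, 14, "A5"), (15, 16, "A6")]

def dummy_age(age):
    """Map an integer age to its Dummy age label, or None outside 5-16."""
    for low, high, label in AGE_BANDS:
        if low <= age <= high:
            return label
    return None

# Short codes listed in the text; the raw spellings used as keys are assumed.
VALUE_CODES = {
    "BOY": "B", "GIRL": "G", "DROPOUT": "DOUT", "NEW ADMISSION": "NEWADM",
    "SINGLE SHIFT SCHOOLS": "SSS", "DOUBLE SHIFT SCHOOLS": "DSS",
    "EDUCATION DEPARTMENT": "EDUD", "NEED MAJOR REPAIR": "NMJR",
    "NEED MINOR REPAIR": "NMR", "SITS WITH OTHER CLASS STUDENTS": "SWOCS",
    "TUBE WELL": "TW", "NO WATER SOURCE": "NWS",
}

def recode(value):
    """Return the short code for a raw value, leaving unknown values unchanged."""
    return VALUE_CODES.get(value.strip().upper(), value)

def prepare(record):
    """Apply the age labelling, value recoding, and field merge to one record."""
    out = dict(record)
    out["dummy_age"] = dummy_age(record["age"])
    out["gender"] = recode(record["gender"])
    out["shifts"] = recode(record["shifts"])
    # Assumption: the merged attribute is the sum of the two original counts.
    out["total_toilets"] = record["total_kaccha_toilet"] + record["total_pakka_toilet"]
    return out

# Hypothetical record, not real study data.
example = prepare({"age": 9, "gender": "Girl", "shifts": "Single Shift Schools",
                   "total_kaccha_toilet": 1, "total_pakka_toilet": 2})
```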
Before the models were examined and applied, the data went through a series of
preprocessing steps in which filters were applied to the dataset so that the end result could
be obtained faster and more clearly. The attributes needed for data mining were retained,
whereas the others were eliminated. The preprocessing also included removal of missing
values, removal of irrelevant values, smoothing of noisy data, identification or removal of
outliers, and resolution of data inconsistencies. Furthermore, irrelevant parameters and
variables such as mother tongue, state of domicile, category, birth, and marital status were
removed. Since the students belonged to Pakistani schools and had similar birth and marital
status, these parameters were considered unnecessary for the study.
Furthermore, feature selection was undertaken to identify the features and variables that
have the greatest effect on the output and the prediction. Its primary objective is to reduce
the number of attributes so that prediction becomes easier without affecting the
classification technique or its reliability. This procedure removes the irrelevant variables and
features that can make effective prediction difficult once the models are applied. In practice,
datasets often contain many attributes, including irrelevant ones, which creates difficulties
and redundant information. Therefore, a range of algorithms was selected for feature
selection so that such situations can be avoided by the researchers and the schools involved.
Therefore, correlation-based feature selection (CFS) was used in this research to eliminate
irrelevant features that would contribute no predictive information about dropout. It enabled
the research to find feature subsets that are highly correlated with the dropout outcome and
its causes. The best-first search method was used. It starts with an empty set of features and
generates new candidate subsets at every iteration by adding a single feature to the subset
with the highest evaluation. If expanding a subset shows no improvement, that subset is set
aside, the search goes back to the best unexpanded subset, and the addition of features
continues.
The equation used for this purpose is as follows:
$$M_S = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}$$
where $M_S$ is the merit of a feature subset $S$ containing $k$ of the total features in the
database, $\overline{r_{cf}}$ is the mean feature-class correlation, and $\overline{r_{ff}}$
is the average feature-feature inter-correlation.
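A minimal sketch of the merit computation, assuming the standard CFS formulation in which the merit grows with the mean feature-class correlation and is penalized by the average feature-feature inter-correlation (the formula is written out in the docstring). The function name and the numeric inputs are illustrative only.

```python
import math

def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """CFS merit of a subset of k features.

    merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff)
    where r_cf is the mean feature-class correlation and r_ff is the
    average feature-feature inter-correlation of the subset.
    """
    numerator = k * mean_feature_class_corr
    denominator = math.sqrt(k + k * (k - 1) * mean_feature_feature_corr)
    return numerator / denominator

single = cfs_merit(1, 0.5, 0.0)        # one feature on its own
independent = cfs_merit(4, 0.5, 0.0)   # four uncorrelated features
redundant = cfs_merit(4, 0.5, 1.0)     # four perfectly inter-correlated features
```

With these inputs, four uncorrelated features double the merit of a single feature, while four perfectly inter-correlated features add nothing over one of them, which is the behaviour that lets CFS discard redundant attributes.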
With the best-first method selected, the search for the best features among the relevant and
irrelevant ones begins with an empty feature set. The subset with the highest merit is
retained, which reduces the dimensionality of the training and testing data for the benefit of
effective prediction. When a subset shows no improvement in CFS merit over five further
iterations, it is set aside so that the remaining attributes are given a chance to contribute
to the prediction. After this procedure, the reduced dataset is passed to the machine learning
algorithms so that a classifier can be built to predict dropout students and the factors that
most strongly affect dropout.
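The search procedure described above can be sketched as follows. This is a minimal illustration, not the implementation used in the study: it assumes the standard CFS merit, treats five consecutive non-improving expansions as the stopping criterion (the default behaviour of Weka's BestFirst search), and uses invented feature names and correlation values.

```python
import heapq
import math
from itertools import combinations

# Hypothetical correlations for illustration only (not values from the study).
FEATURE_CLASS = {"distance": 0.6, "water": 0.5, "shift": 0.1, "toilets": 0.4}
FEATURE_FEATURE = {frozenset(("water", "toilets")): 0.9}  # unlisted pairs: 0.0

def merit(subset):
    """CFS merit of a subset; the empty subset scores 0."""
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = sum(FEATURE_CLASS[f] for f in subset) / k
    pairs = list(combinations(sorted(subset), 2))
    r_ff = 0.0
    if pairs:
        r_ff = sum(FEATURE_FEATURE.get(frozenset(p), 0.0) for p in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

def best_first(features, max_stale=5):
    """Forward best-first search; stops after max_stale non-improving expansions."""
    start = frozenset()
    best, best_merit = start, merit(start)
    # Max-heap via negated merit; the sorted name list breaks ties deterministically.
    open_heap = [(-best_merit, sorted(start), start)]
    visited = {start}
    stale = 0
    while open_heap and stale < max_stale:
        _, _, subset = heapq.heappop(open_heap)  # best unexpanded subset
        improved = False
        for feature in features:
            child = subset | {feature}
            if feature in subset or child in visited:
                continue
            visited.add(child)
            child_merit = merit(child)
            heapq.heappush(open_heap, (-child_merit, sorted(child), child))
            if child_merit > best_merit + 1e-12:
                best, best_merit, improved = child, child_merit, True
        stale = 0 if improved else stale + 1
    return best, best_merit

selected, selected_merit = best_first(["distance", "water", "shift", "toilets"])
print(sorted(selected), round(selected_merit, 4))
```

On these toy values the search keeps "distance" and "water" and leaves out "toilets", which is predictive but largely redundant with "water", and "shift", which is barely correlated with the class.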
Once the feature-class correlations and the inter-correlations have been calculated, the next
step can proceed. These calculations are critical because they indicate how well each
attribute predicts another attribute in the dataset. The quality of an attribute can only be
measured once the irrelevant attributes and features have been removed from the final dataset.
When pure instances are to be considered, the researcher should give due consideration to a
feature selection method such as this one, because each instance differs from the others in
its features and attribute values. A decision tree predicts the class from the attribute
values, and feature selection makes that prediction easier to measure effectively.
Training, validation, and testing set
The training and testing sets are equally critical for developing an accurate prediction
model. The training set is a set of examples drawn from the dataset and used to fit the
parameters of the classifier. The validation set is used to tune the hyperparameters of the
classifier; in artificial neural networks, for example, the number of hidden units is a
hyperparameter. The primary aim of having a validation set is to avoid overfitting. Lastly,
the testing set is kept independent of the training set while following the same probability
distribution. If a model fits both the training and testing sets well, it can be said that
little overfitting has taken place. The testing set is therefore equally critical for ensuring
accurate results from the predictive model.
The training set was formed by randomly selecting 75% of the original database after it had
been drawn and filtered in the preprocessing step. This was done so that the training set
could be used to fit the candidate models and determine which of them fits best. The
remaining 25% of the records were assigned to the testing set, which was used to check the
accuracy of the results.
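A random 75/25 split of this kind can be sketched as follows. The function name, the fixed seed, and the stand-in record IDs are assumptions made for illustration; the study itself does not state how the random split was seeded.

```python
import random

def train_test_split(records, train_fraction=0.75, seed=42):
    """Shuffle a copy of the records and cut it into training and testing parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

students = list(range(1000))  # stand-in record IDs, not the real census data
train, test = train_test_split(students)
print(len(train), len(test))
```

Every record ends up in exactly one of the two sets, so the testing set stays independent of the data used to fit the model.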
Classification
Data mining offers a range of algorithms for classifying a dataset in order to obtain accurate
predictions. These include decision trees, inductive rule learning, artificial neural
networks, evolutionary algorithms, and instance-based learning. This study deals with a
decision tree (J48), Naïve Bayes, and BayesNet. These classifiers allow the researcher to make
decisions directly from the results produced by the algorithms. Among them, a decision tree
resolves each case by testing conditions starting from the root of the tree. The following
classifiers were used so that the prediction could be made accurately.
Naïve Bayes
This classification technique has been popular since the 1950s, and it was introduced under a
different name in 1960. Naïve Bayes belongs to the family of probabilistic classifiers and is
based on applying Bayes' theorem under the assumption that the features are independent of
one another given the class. Researchers commonly use it to construct classifier models for
prediction. There is no single algorithm for training such classifiers; nevertheless, the
classifier is able to provide effective predictions concerning dropout.
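To make the idea concrete, the sketch below trains a Naïve Bayes classifier on categorical attributes by counting, and predicts the class with the highest posterior score. It is a minimal illustration under stated assumptions: add-one (Laplace) smoothing is used, and the attribute values and labels are invented rather than taken from the study's dataset.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    attribute_values = defaultdict(set)  # attribute index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            attribute_values[i].add(value)
    return class_counts, value_counts, attribute_values

def predict(model, row):
    """Return the class with the highest log-posterior under the naive assumption."""
    class_counts, value_counts, attribute_values = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        log_p = math.log(n / total)  # class prior
        for i, value in enumerate(row):
            # Add-one smoothing keeps unseen values from zeroing the product.
            numerator = value_counts[(i, label)][value] + 1
            denominator = n + len(attribute_values[i])
            log_p += math.log(numerator / denominator)
        scores[label] = log_p
    return max(scores, key=scores.get)

# Hypothetical toy records: (distance to school, water availability).
rows = [("far", "no_water"), ("far", "no_water"), ("near", "water"),
        ("near", "water"), ("far", "water"), ("near", "no_water")]
labels = ["dropout", "dropout", "continue", "continue", "continue", "dropout"]

model = train_naive_bayes(rows, labels)
print(predict(model, ("far", "no_water")), predict(model, ("near", "water")))
```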
BayesNet
BayesNet is a Bayesian network classifier: a probabilistic graphical model that the researcher
can use to compute the prediction of the class. In several studies in the literature on
prediction models, the classifier has been observed to perform closely to the J48 model.
Various authors regard it as an essential classifier to apply after the decision tree for
accurate results.
J48
The decision tree algorithm is critical for developing the predictive model. Several scholars
have recognized the importance of the J48 decision tree for tasks such as predicting student
dropout. A decision tree can be read as a set of IF-THEN rules that gives a simplified
representation of the data in the database. Without a decision tree or another classifier
algorithm, teachers would struggle to recognize the reasons behind the dropout rate.
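The IF-THEN reading of a decision tree can be illustrated by walking a small tree from the root to each leaf. The sketch below is an assumption-laden example: the tree is hand-written rather than learned by J48, and the attribute names and outcomes are hypothetical.

```python
def tree_to_rules(node, conditions=()):
    """Turn a nested-dict decision tree into a list of IF-THEN rule strings."""
    if not isinstance(node, dict):  # a leaf holds the predicted class
        premise = " AND ".join(conditions) or "TRUE"
        return [f"IF {premise} THEN {node}"]
    rules = []
    for value, child in node["branches"].items():
        test = f"{node['attribute']} = {value}"
        rules.extend(tree_to_rules(child, conditions + (test,)))
    return rules

# Hypothetical tree for illustration; not a tree produced in the study.
toy_tree = {
    "attribute": "water_source",
    "branches": {
        "none": "dropout",
        "available": {
            "attribute": "school_shift",
            "branches": {"single": "continue", "double": "dropout"},
        },
    },
}

rules = tree_to_rules(toy_tree)
for rule in rules:
    print(rule)
```

Each path from the root to a leaf becomes one rule, which is what makes a decision tree easy for teachers and planners to inspect.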
The success rate of any institute can be improved by knowing the reasons why students drop
out. In the present study, the primary data on “Prediction of Dropout Students from
Government Schools using Educational Data Mining (EDM)” are based on various parameters
collected through a school census, which includes school profile, geographical,
infrastructure, and students' personal information. It is a classification task based on a
composite sample of 858,013 students of Government Schools (KG to Matric) for the year
2015-2016, provided by Asad Ullah (Software Engineer) of the PPIOU department in the
Secretariat. Predicting whether students will continue their studies or drop out requires
many parameters, such as personal, academic, social, and environmental variables.
Nowadays, however, the deployment of Student Information Systems at the institutional level
provides an appropriate infrastructure for organizing and storing student data as well as for
data acquisition and deeper analysis. These data can help model the behavior of dropouts and
predict future dropouts, giving counselors a chance to advise students and guide them to
success. The demand for education in Pakistan has increased as more and more children now
attend school. However, many problems in the education system cause students to abandon
their studies. A lack of good infrastructure and quality teachers, together with poor
delivery of course content in India, causes people to drop out. Students commonly report that
they do not have easy access to educational institutions. This problem is especially true for
students who migrate to different places because of family problems. They face difficulties
in obtaining transfer certificates, school leaving certificates, and other such formalities.
“It is our educational system that is not encouraging people, creating more and more
formalities for the migrant’s students”. Because of all the formalities that must be
fulfilled, it seems easier to change jobs than to change schools or colleges, and once a child
has been out of school for too long, admission becomes even more difficult.
