
Methodology

Techniques of data analysis

Weka tool and data mining

Data definition

Data preparation

Data preprocessing and feature selection

Training, validation, and testing set

Classification

Naïve Bayes

BayesNet

J48

CHAPTER 3 METHODOLOGY

Methodology

The success rate of any institute can be improved by understanding why students drop out.
In the present study, the primary data on “Prediction of Dropout Students
from Government Schools using Educational Data Mining (EDM)” is based on
features/variables extracted from school census data, including school profile,
geographical, infrastructure, human resource and students’ personal information. This is a
classification task: we try to build an accurate model that assigns each student to one of
two classes, dropout or continue, from a composite sample of 858,013 students of
Government Schools of Balochistan (KG to Matric) for the year 2015-2016. Here, dropout
explicitly denotes a student’s exit from the education system. The study also
provides more insight into dropout behaviour and the factors affecting students’ dropout from
a policy, planning and decision-making perspective.

Classification techniques are based on machine learning. These techniques assign each
record in a dataset to one of a set of predefined classes. To classify data, mathematical
techniques such as neural networks, decision trees, statistics, and linear programming are
used. With the help of classification techniques, we can predict which students may drop
out in the near future. We can also classify students according to their academic
performance, so that an accurate model can be built from this data.

Nowadays, the deployment of Student Information Systems at the institutional level
provides an appropriate infrastructure for organizing and storing student data, as well as for
data acquisition and deeper analysis. This data can help model the behavior of dropouts and
predict future dropouts, giving counselors a chance to advise and guide students
towards success. The demand for education in Pakistan has increased as more and more children
now attend school. However, many problems in the education system
cause students to abandon their studies. Schools lack good infrastructure and quality
teachers, and course content is poorly delivered, which causes students to drop out.
Students commonly report that they do not have easy access to educational institutions.
This problem is especially acute for students who migrate to different places because of
family circumstances. They face difficulties with the issue of transfer certificates, school
leaving certificates and other such formalities. “It is our educational system that is not
encouraging people, creating more and more formalities for the migrant’s students.” Because of
all the formalities that must be fulfilled, it seems easier to change jobs than to change
schools or colleges, and once a child has been out of school for too long, admission becomes
even more difficult.

The primary focus of this study is to analyze prediction techniques and identify the
one that provides the best results. Secondly, the research focuses on using data mining
techniques to predict dropout by collecting data from secondary schools in Balochistan.
Several factors are considered, such as gender, district, session, shifts, level, school
building and its structure, class, laboratory, toilets, teachers, and many others, to
determine which factors affect student dropout rates the most.
Techniques of data analysis
Two kinds of methods were used to collect the datasets. Collecting real data enabled the
researcher to understand the situation and provide realistic predictions that can help
schools support their own development and that of their students. The original data for
the years 2015-2016 on secondary-level education in Balochistan was therefore used.
Dropout rate has been measured differently across the literature and across schools.
The situation in Balochistan is similar: there is no appropriate technique for predicting
dropout rates within schools that would enable the authorities to provide better
education. Several studies give possible reasons for dropout, including socio-economic
reasons and academic failure. However, no suitable prediction has been provided for the
schools of Balochistan. This research is therefore important with respect to the student
dropout rate, and the use of the best prediction techniques can give schools valuable
information.

The primary technique used to collect the data was observation and the gathering of
information about students from the actual government database. This technique helped the
researcher ensure the validity of the information collected. It was done so that the reasons
behind dropout can be better understood and predicted, allowing schools to take
appropriate action beforehand. The data was carefully transformed into datasets
that enabled the researcher to make predictions easily.

Weka tool and data mining

Weka is a powerful tool that performs about 90% of the work done for this study.
Several algorithms, classification methods, decision trees and other techniques were used
to predict the factors that negatively affect the retention rate of students.
Decision trees and classification methods were specifically used to predict the causes and
the actual rate of dropout that occurs every year in the secondary-level schools of Balochistan.

The Weka tool was used to build an effective classification model. Several classifiers
were used to improve prediction, and a comparative study was carried out to find an
effective classifier that can be used for prediction in the future. Attributes were
analyzed in the feature selection step using techniques such as information gain and
correlation-based feature selection, in order to find the best method for prediction. The
most useful attributes and features were selected, namely those that could provide better
prediction and an effective means of finding dropout rates and the aspects affecting them.
Lastly, association rule mining was also considered for finding relationships between the
factors in the large database. It would help discriminate between instances and classes
so that prediction can be done appropriately.
We identify the factors through Data Mining (DM). Furthermore, prediction is done at the
student level to find out which students will drop out in the next academic session and to
study the causes of dropout, which belongs to the process of knowledge discovery and data
mining. This information will help management reduce the dropout rate in the
districts. To achieve the above-mentioned objectives, the following steps were
followed (Fig. 1):

RAW DATA → DATA PREPARATION/GATHERING → DATA PRE-PROCESSING → DATA SELECTION/TRANSFORMATION → FEATURE EXTRACTION → DATA MINING AND EDM → CLASSIFICATION → RESULT EVALUATION

Figure 1. Work Methodology


3.1 The Raw data set

The raw data was obtained from the EMIS PPIOU department and is based on student and
school profiles for the year 2015-2016.

The data was taken from two databases, one holding school-level data (school profile) and
the other student-level data (student profile).

TABLE OF RAW DATA


NAME OF FEATURES NO.OF RECORDS

GENDER

SELECT ALL 858013

BLANKS 1

BOY 506216

GIRL 351797

CLASS

SELECT ALL 351797

BLANKS 1

10TH 12100

11TH 100

12TH 101

2ND 25586

3RD 20985

4TH 17754

5TH 18552

6TH 15302

7TH 7347

8TH 9644

9TH 8814

KACCHI 174415

PAKKI 41053

UN ADMITTED 44

DISTRICT

STATUS

SELECT ALL 44

BLANKS 1
CONTINUE 22

NEW ADMISSION 22

SESSION

SELECT ALL 22

BLANKS 1

2015 22

VCSHOOLNAME

SELECT ALL 22

BLANKS 1

GBPS KILLI KOKAR 1

GGPS UNIVERSITY COLONY 21

SHIFTS

SELECT ALL 858013

BLANKS 5757

DOUBLE SHIFTS SCHOOLS 16508

SINGLE SHIFTS SCHOOLS 835748

LEVEL

SELECT ALL 835748

BLANKS 1

HIGH 248368

HIGH SECONDARY 15487

MIDDLE 140802

PRIMARY 431091

TOTAL AREA

SELECT ALL 431091

BLANKS 2028

SCHOOL BUILDING
SELECT ALL 858013

BLANKS 4469

NO 43325

YES 810219

OWNERSHIPOFBUILDING

SELECT ALL 858013

BLANKS 47860

DONATED 261980

EDUCATION DEPARTMENT 536980

RENTED 12193

STRUCTUREOFBUILDING

SELECT ALL 12193

BLANKS 1

KACCHA 7610

MIX 1234

PAKKA 3349

CONDITIONOFBUILDING

SELECT ALL 3349

BLANKS 1

ADEQUATE 1435

NEED MAJOR REPAIR 160

NEEDS MINNOR REPAIR 1754

WATERFACILITYINLATRIN

SELECT ALL 850813

BLANKS 4469

NO 614851

YES 238693
BOUNDARYWALL

SELECT ALL 858013

BLANKS 4469

NO 283586

YES 569958

CONDITIONOFBW

SELECT ALL 858013

BLANKS 537824

COMPLETE 213574

UNCOMPLETE 106615

ELECTRICITYINAREA

SELECT ALL 858013

BLANKS 4140

NO 134239

YES 719634

ELECTRICITYINSCHOOL

SELECT ALL 719634

BLANKS 1

NO 347572

YES 372062

GASINAREA

SELECT ALL 372062

BLANKS 1

NO 208881

YES 163181

GASINSCHOOL

SELECT ALL 163181


BLANKS 1

NO 82449

YES 80732

ANYWATERSCHEME

SELECT ALL 858013

BLANKS 4140

NO 441004

YES 412869

SOURCEOFWATER

SELECT ALL 858013

BLANKS 4140

NAHAR 67277

NO WATER SOURCE 308793

STREAM 29190

TAP 240452

TUBE WEL 69238

WELL 138923

WATER TANK

SELECT ALL 138923

BLANKS 1

NO 103215

YES 35708

SCIENCE LAB

SELECT ALL 35708

BLANKS 1

NO 21232

YES 14476
CONDITIONOFSCLAB

SELECT ALL 858013

BLANKS 682982

ADEQUATE 27853

NEED REPAIR 82453

NEEDS EXPENTION 64725

PROVISIONOFSCLABITEMS

SELECT ALL 858013

BLANKS 1

COMPLAB

SELECT ALL 858013

BLANKS 1

NO 753312

YES 94407

CONDITIONOFCOMPLAB

SELECT ALL 858013

BLANKS 56

ADEQUATE 753312

NEED EXPANTION 9568

NEED REPAIR 45

COMPUTERSAREAVAILABLE

BLANKS 34

NO 9000

YES 56789

COMPAREINUSE

SELECT ALL 858013

BLANKS 23
NO 78990

YES 94407

IFCOMPLABISNOTFUNCTIONAL

SELECT ALL 858013

BLANKS 7

ELECTRICITY IS NOT AVAILABLE IN LAB 765432

IT TEACHER IS NOT AVAILABLE 235689

OTHER 886

LIBRARY IS AVAILABLE

SELECT ALL 858013

BLANKS 75

NO 7654

YES 08876

CONDITIONOFLIB

SELECT ALL 858013

BLANKS 64

ADEQUATE 8754

NEED EXPANTION 12445

NEED REPAIR 78899

PROVISIONOF BOOKSFOR LIB

SELECT ALL 858013

BLANKS 22

EVERY 2 YEARS 67788

MORE THEN 2 YEAR 98776

YEARLY 900

SEPRATEROOMAVAILABLEFORKACCHI

SELECT ALL 858013


BLANKS 19959

NO 597142

YES 240912

IFSEPCLASSISNOTAVAILABLE

SELECT ALL 345666

BLANKS 7654

BARAMDA 23344

OTHER 987

SAHAN 876

SITS WITH OTHER CLASS STUDENTS

SEPARATETEACHERFORKACCHI

SELECT ALL 858013

BLANKS 54956

NO 448396

YES 354661

PTSMCISAVAILABLE

SELECT ALL 858013

BLANKS 6836

NO 384224

YES 466953

YEARPTSMCFORMED

SELECT ALL 858013

BLANKS 52808

TOTAL KACCHA ROOMS

SELECT ALL 858013

BLANKS 1985

TOTAL PAKKA ROOMS


SELECT ALL 858013

BLANKS 1985

TOTAL KACCHA TOILET

SELECT ALL 718

BLANKS 1

TOTAL PAKKA TOILET

SELECT ALL 718

BLANKS 1

SPACE FOR NEW ROOMS

SELECT ALL 718

BLANKS 95

NO 6

YES 617

NEW ROOMS REQUIRED

SELECT ALL 718

BLANKS 95

EXAMINATION HALL

SELECT ALL 7320

BLANKS 6602

NO 359

YES 359

PLAY GROUND

SELECT ALL 7320

BLANKS 4016

NO 2145

YES 1159

SANCTIONED TEACHING STAFF


SELECT ALL 858013

BLANKS 7320

APPOINTED TEACHING STAFF

SELECT ALL 858013

BLANKS 7320

3.2 DATA PREPARATION/INTEGRATION

In the data preparation process we integrated the entire database using a unique code, the
EMIS code. The data covered two years, 2015-2016. Furthermore, records with missing data
were deleted. The raw data initially contained no dropout label, so we labeled it. The data
used in this study was prepared from census records obtained from the Secretariat of
Balochistan. The data was constructed on theoretical and empirical grounds about the
factors affecting students’ performance and the causes of dropout. It included
socio-demographic indicators (age, date of birth, geographical location), educational
factors (performance in primary, middle and secondary school, location of schooling, type
of examination board, medium of study, etc.), parental attitudes, causes of dropout,
institutional factors, etc.

Before the initial visit to review the records, a coding system was created for each variable to
be documented (e.g., rural=0, urban=1). It was important to document not only dropout status
but also all withdrawal reasons for the students.

3.3 DATA PRE-PROCESSING

The dataset is formulated for applying data mining techniques. Before further processing, the
traditional pre-processing methods, which consist of data cleaning, transformation of variables
and data partitioning, have to be applied. Other methods, such as attribute selection and
re-balancing of the data, were also applied to resolve the problems of high dimensionality and
class imbalance that typically arise in such datasets. The data was taken from the Education
Management Information System (EMIS) for the year 2015-2016 and comprises 858,013 student
records with the following fields.
NAME OF FEATURES
Students id             Name                         Computer are available              Science lab
Gender                  Class                        If computer lab is not functional   Provision of science lab
District                Status                       Condition of library                Condition of computer lab
Session                 Vchschoolname                Separate room available             Computer are in use
Vchdistrict             Vchtehsil                    Separate teacher for kacchi class   Library is available
Shifts                  Expr1                        Year PTSMC formed                   Provision of books
Level                   Total area                   Total pakka rooms                   If separate class is not available
Schoolbuildin           Ownership of building        Total pakka toilet                  PTSMC is available
Structure of building   Condition of building        Examination hall                    Total kaccha rooms
Q27                     Water facility in area       Sanctioned teacher                  Total kaccha toilet
Boundary wall           Condition of boundary wall   Space for new rooms
Electricity in area     Electricity in school        Play ground
Gas in area             Gas in school                Appointed teacher
Table 1: School profile of the year 2015-2016
After data cleaning, the following fields were left for data processing. We labeled student
age in a DUMMY AGE attribute: ages 5 to 6 as A1, 7 to 8 as A2, 9 to 10 as A3, 11 to 12 as
A4, 13 to 14 as A5, and 15 to 16 as A6. The GENDER attribute was coded as Boy=B and Girl=G,
and codes were defined in the same way for the remaining fields; in the status field,
Dropout=Dout. Certain fields were merged into one attribute because of duplication: total
pakka toilet and total kaccha toilet were merged into total toilets, and total kaccha rooms
and total pakka rooms were merged into total rooms. Other values were renamed as follows:
New admission=NEWADM; in the shifts field, Single Shift Schools=SSS and Double Shift
Schools=DSS; in the ownership of building field, Education department=EDUD; in the condition
of building field, Need major repair=NMJR and Need Minor Repair=NMR; in the "if separate
class is not available" field, Sits with other class students=SWOCS; tube well=TW; no water
source=NWS; sanctioned teaching staff=Staff; appointed teaching staff=AT staff; examination
hall=EXAM HALL; new rooms required=Required; space for new rooms=SForNRooms. Before the data
transformation, the attributes were labeled for analysis and missing values were removed.

The data was used to predict dropout students and the factors behind dropout. The 2015 data
was checked to determine whether each student is enrolled, a repeater, promoted or a dropout.
If the student is in the same class as in the previous year, the student is considered a
repeater; if the student is not in the same class as in the previous year, the student is
promoted; and if a student of the year 2015 is not in any class in the following year, the
student is a dropout due to various circumstances or reasons. In short, students who were
present in 2015 and not present in 2016 were considered dropouts.
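
The labeling rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `label_status`, the record layout (a dictionary from student id to class name per year) and the sample ids are hypothetical and are not taken from the EMIS database.

```python
def label_status(class_2015, class_2016):
    """Return the status implied by a student's class in two consecutive years.

    class_2016 is None when the student has no record in the following year.
    """
    if class_2016 is None:
        return "DOUT"        # absent in the following year: dropout
    if class_2016 == class_2015:
        return "Repeater"    # same class in both years
    return "Promoted"        # present, but in a different class


# Hypothetical records keyed by student id (values are class names).
records_2015 = {"S1": "5TH", "S2": "5TH", "S3": "KACCHI"}
records_2016 = {"S1": "6TH", "S2": "5TH"}

labels = {sid: label_status(cls, records_2016.get(sid))
          for sid, cls in records_2015.items()}
print(labels)
```

Here S3 has no 2016 record and is therefore labeled DOUT, S2 stays in the same class and is a repeater, and S1 is promoted.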
3.4 DATA SELECTION AND TRANSFORMATION

Only the attributes required for the data mining process were selected. All
predictor and response variables are shown in Table 2 for reference.

FEATURES                   DESCRIPTION                                POSSIBLE VALUES
Stud_Id                    Student ID
Gender                     Student’s Gender                           Male, Female
Class                      Student’s class grade from Kachi to Paki.  Kachi, Pakki, 1,2,3,4,5,6,7,8,9,10
Age                        Student’s Age                              { >5, 5-20, <20}
Dummy Age                                                             {A1-A6}
District                   Student’s Location                         Awaran, Barkhan, Chaghi, Dera Bugti, Gwadar, Harnai,
                                                                      Jafar Abad, Kachhi, Kech, Kharan, Khuzdar, Killa Saifullaf,
                                                                      Kohlu, Lasbela, Loralai, Musakhel, Naseer Abad, Pishin,
                                                                      Quetta, Sherani, Sohbat Pur, Washuk, Zhob, Ziarat.
Status                     Student’s Status                           Continue, Dropout, New Admission, and Repeater.
Session                    Student enrolled in which year             2015-2016
Level                      Student’s Class Level                      Primary, Middle, High
School Building            Building is available for school or not.   Yes, No
Owner_Of_Build             Owner Ship of Building                     Donated, EDUD, rented
Struct_Of_Build            Structure Of Building                      Kachha, Pakka, Mix
Cond_of_Building           Condition of Building                      Adequate, NMOR, NMR, Blanks.
W_F_ In_washroom           Water Facility In washroom                 Yes, No
B_Wall                     Boundary Wall                              Yes, No
Elect_In _A                Electricity In Area                        Yes, No
Elect_In_School            Electricity In School                      Yes, No
G_In_Area                  Gas In Area                                Yes, No
G_In_School                Gas In School                              Yes, No
A_W_Scheme                 Any Water Scheme                           Yes, No
Sour_of_Water              Source Of Water                            Nahar, NWH, Stream, Tap, TW, Well
W_Tank                     Water Tank                                 Yes, No
Sci_Lab                    Science Lab                                Yes, No
Comp_Lab                   Computer Lab                               No
Lib_Is_Avail               Library Is Available                       Yes, No
Sep_R_Avail_for Kacchi     Separate Room Available for Kacchi         Yes, No
If Sep_Class_Is_Not_Avail  If Separate Class Is Not Available         Baramda, Sahan, SWOCS, other.
SeprateTeacherForKacchi    Seprate Teacher For Kacchi                 Yes, No
PTSMCIsAvailable                                                      Yes, No
T_K_Room                   Total Kacha Room                           0,1, 2, 3,4
K_ Rooms                   Kaccha Rooms                               Few, Many, More
T_P_Room                   Total Pakka Room                           0, 1, 2, 3, 4, 5, 7, 17
P_Rooms                    Pakka Rooms                                Few, Many, None
T_K_Toilet                 Total Kaccha Toilet                        0,1
K_TOILET                   Kaccha toilet                              Few, Many, None
T_P_Toilet                 Total Pakka Toilet                         0, 1, 2, 3, 4, 5, 6
P_TOILET                   Pakka toilet                               Few, Many, None
SForNRooms                                                            Yes, No
NR_Required                                                           0, 1, 2, 3, 4, 5, 6, 10, 26
R_Rooms                    Required rooms                             Few, Many, None
E_Hall                     Exam Hall                                  Yes, No
P_Ground                   Play Ground                                Yes, No
ST_Staff                                                              0, 1, 2, 3, 4, 6, 8, 12, 13, 14, 18, 19, 35
AT_Staff                                                              0, 1, 2, 3, 4, 6, 7, 8, 11, 13, 29
T_Status                   Teachers status                            0, 1, 2, 4, 6, 7
Dmmy_T_status              Dummy teachers status                      Sufficient, In Sufficient, Under pressure
Status                     Status                                     Continue, DOUT
Dmmy_status                Dummy Status                               Repeater, DOUT

Table 2: Student Related Variables

3.6 FEATURE EXTRACTION/SELECTION


In statistics and machine learning, feature selection is also called attribute selection,
variable selection, or variable subset selection. It is the method of selecting relevant
features or predictors for building the student model. Accurate classification depends on
feature selection; for this purpose we studied and analyzed the dropout logs and the factors
that affect student dropout. We extracted the following features for our student model
(table 4.1): CLASS, DUMMY AGE, DISTRICT, COMPLAB and IF SEPCLASS IS NOT AVAILABLE.
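
Information gain, one of the attribute evaluation techniques named in this chapter, scores a nominal attribute by how much it reduces the entropy of the class. The sketch below is a minimal illustration in plain Python, not the Weka implementation; the function names and the four sample records are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in class entropy obtained by splitting on a nominal attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical sample of four students.
status = ["DOUT", "DOUT", "Continue", "Continue"]
comp_lab = ["No", "No", "Yes", "Yes"]   # separates the two classes perfectly
gender = ["B", "G", "B", "G"]           # carries no information about status

print(information_gain(comp_lab, status), information_gain(gender, status))
```

In this toy sample the computer lab attribute has the maximum gain of one bit, while gender has a gain of zero, so a ranking by information gain would keep the former and drop the latter.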

3.5 THE GENERATION OF TRAINING


To answer the questions regarding student dropout, educational data mining (EDM) with
various classification techniques was applied, in order to obtain a predictive model that
can identify, accurately and precisely, which students are dropping out, taking into
consideration the data about the preliminary department of the specified course.

3.6 TRAINING SET /DATA SET


The data for the year 2015 was taken from EMIS, an online data management system.

The sources of the data are the student profile and the school profile. The dataset
consists of 856,994 rows and comprises the following fields:

ATTRIBUTES                          ATTRIBUTES                    ATTRIBUTES                   ATTRIBUTES
GENDER                              CLASS                         DISTRICT                     STATUS
UNION COUNCIL                       SHIFTS                        LEVELS                       TOTAL AREA
SCHOOL BUILDING                     OWNERSHIP OF BUILDING         STRUCTURE OF BUILDING        CONDITION OF BUILDING
WATER FACILITY IN LATRINE           BOUNDARY WALL                 CONDITION OF BOUNDARY WALL   ELECTRICITY IN SCHOOL
GAS IN AREA                         GAS IN SCHOOL                 ANY WATER SCHEME             SOURCE OF WATER
WATER TANK                          SCIENCE LAB                   CONDITION OF SCIENCE LAB     PROVISION OF SCIENCE LAB ITEMS
COMPUTER LAB                        CONDITION OF COMPUTER LAB     COMPUTERS ARE AVAILABLE      COMPUTER ARE IN USE
IF COMPUTER LAB IS NOT FUNCTIONAL   LIBRARY IS AVAILABLE          CONDITION OF LIBRARY         PROVISION OF BOOKS FOR LIBRARY
SEPRATE ROOM AVAILABLE FOR KACCHI   SEPARATE TEACHER FOR KACCHI   YEAR PTSMC FORMED            TOTAL KACHA ROOM
TOTAL PAKKA ROOM                    TOTAL KACHA TOILET            TOTAL PAKKA TOILET           SPACE FOR NEW ROOMS
NEW ROOMS REQUIRED                  EXAMINATION HALL              PLAY GROUND                  SANCTIONED TEACHING STAFF
APPOINTED TEACHING STAFF
Table 3: Data after transformation

The fields above are further coded for the data analysis experiments in WEKA. For example,
in the gender field, Boy=B and Girl=G; codes were defined in the same way for the remaining
fields. In the status field, Dropout=Dout and New admission=NEWADM; in the shifts field,
Single Shift Schools=SSS and Double Shift Schools=DSS; in the ownership of building field,
Education department=EDUD; in the condition of building field, Need major repair=NMJR and
Need Minor Repair=NMR; in the "if separate class is not available" field, Sits with other
class students=SWOCS.

The data was used to predict dropout students and the factors behind dropout. The 2015 data
was checked to determine whether each student is enrolled, a repeater, promoted or a dropout.
If the student is in the same class as in the previous year, the student is considered a
repeater; if the student is not in the same class as in the previous year, the student is
promoted; and if a student of the year 2015 is not in any class in the following year, the
student is a dropout due to various circumstances or reasons.
3.7 SPLITTING OF DATA

In splitting the data, we construct the training and test sets using the following
splitting protocol for this type of problem:

1. Randomly select 75% of the students of each district for the training data and place
the remainder in the test data.

2. Stratify, ensuring that each data set contains enough examples of each class, gender
and grade.

Fig. 2. DIAGRAM OF DATA SPLITTING IN PREDICTIVE MODEL

We select a classification algorithm, use the training set to train the model, and then
test it on the test set for evaluation.
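
The splitting protocol above can be sketched as a per-group random split. The code below is a minimal illustration using only the Python standard library; the function name `stratified_split`, the record layout and the fixed seed are assumptions made for the example, not details of the study.

```python
import random
from collections import defaultdict


def stratified_split(records, key, train_fraction=0.75, seed=0):
    """Split records into (train, test), sampling train_fraction within each key group."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    train, test = [], []
    for members in groups.values():
        rng.shuffle(members)                        # random order within the group
        cut = round(len(members) * train_fraction)  # size of the training part
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


# Hypothetical sample: 8 students in each of two districts.
students = [{"id": f"{d}-{i}", "district": d}
            for d in ("Quetta", "Pishin") for i in range(8)]
train, test = stratified_split(students, key=lambda s: s["district"])
print(len(train), len(test))
```

To also balance class, gender and grade, the `key` function could return a tuple such as (district, status, gender). Very small groups may then fall entirely into one of the two sets, which is one reason the protocol asks for enough examples of each.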

3.8 DATA MINING

Data mining is the process of discovering patterns in large data sets involving methods at
the intersection of machine learning, statistics, and database systems (Christopher, 2010).
Data definition
The information was collected from the Educational Management Information System (EMIS) of
the education department of Balochistan. The data covered two years and included several
aspects of the students' education, as well as gender and building information. Other
information in the collected data included geographical location, age, students’ id,
toilets, education system, provision of books, library, type of examination, causes of
dropout, and many others. Furthermore, the coding system was created before the variables
were documented and before the prediction system was initiated. The data preparation used
the information already available in the school database about secondary-level students in
Balochistan.

FEATURES DESCRIPTION POSSIBLE VALUES

Stud_Id Student ID
Gender Student’s Gender Male, Female
Class Student’s class grade from Kachi, Pakki,
Kachi to Paki. 1,2,3,4,5,6,7,8,9,10
Age Student’s Age { >5, 5-20, <20}
Dummy Age {A1-A6}
District Student’s Location Awaran, Barkhan, Chaghi,
Dera Bugti, Gwadar, Harnai,
Jafar Abad, Kachhi, Kech,
Kharan, Khuzdar,Killa
Saifullaf, Kohlu, Lasbela,
Loralai, Musakhel, Naseer
Abad, Pishin, Quetta, Sherani,
Sohbat Pur, Washuk, Zhob,
Ziarat.
Status Student’s Status Continue, Dropout, New
Admission, and Repeater.
Session Student enrolled in which 2015-2016
year
Level Student’s Class Level Primary, Middle, High
School Building Building is available for Yes, No
school or not.
Owner_Of_Build Ownership of Building Donated, EDUD,rented
Struct_Of_Build Structure Of Building Kachha, Pakka, Mix
Cond_of_Building Condition of Building Adequate, NMOR, NMR,
Blanks.
W_F_ In_washroom Water Facility In washroom Yes, No
B_Wall Boundary Wall Yes, No
Elect_In _A Electricity In Area Yes, No
Elect_In_School Electricity In School Yes, No
G_In_Area Gas In Area Yes, No
G_In_School Gas In School Yes, No
A_W_Scheme Any Water Scheme Yes, No
Source_of_Water Source Of Water Nahar, NWH, Stream, Tap,
TW, Well
W_Tank Water Tank Yes, No
Sci_Lab Science Lab Yes, No
Comp_Lab Computer Lab Yes, No
Lib_Is_Avail Library Is Available Yes, No
Sep_R_Avail_for Kacchi Separate Room Available for Yes, No
Kacchi
If Sep_Class_Is_Not_Avail If Separate Class Is Not Baramda, Sahan, SWOCS,
Available other.
SeprateTeacherForKacchi Separate Teacher For Kacchi Yes, No
PTSMCIsAvailable Yes, No
T_K_Room Total Kacha Room 0,1, 2, 3,4
K_ Rooms Kaccha Rooms Few, Many, More
T_P_Room Total Pakka Room 0, 1, 2, 3, 4, 5, 7, 17
P_Rooms Pakka Rooms Few, Many, None
T_K_Toilet Total Kaccha Toilet 0,1
K_TOILET Kaccha toilet Few, Many, None
T_P_Toilet Total Pakka Toilet 0, 1, 2, 3, 4, 5, 6
P_TOILET Pakka toilet Few, Many, None
SForNRooms Yes, No
NR_Required 0, 1, 2, 3, 4, 5, 6, 10, 26
R_Rooms Required rooms Few, Many, None
E_Hall Exam Hall Yes, No
P_Ground Play Ground Yes, No
ST_Staff 0, 1, 2, 3, 4, 6, 8, 12, 13, 14, 18,
19, 35
AT_Staff 0, 1, 2, 3, 4, 6, 7, 8, 11, 13, 29
T_Status Teachers status 0, 1, 2, 4, 6, 7
Dmmy_T_status Dummy teachers status Sufficient, In Sufficient, Under
pressure
Status Status Continue, DOUT
Dmmy_status Dummy Status Repeater, DOUT

Data preparation
Since the data was not in the required format, appropriate datasets were formed using the
student ID and other information. All missing values, features, and attributes were deleted
from the dataset in this step. The data was integrated so that the preprocessing techniques
would give appropriate outcomes, and data fields were transformed and combined for this
reason. The dataset originally included student records such as name, education, and other
aspects that can inform the outcomes concerning dropout rates. A new dropout attribute was
then created as a dummy variable to answer the research questions.
Data preprocessing and feature selection
After all the required data had been collected, the dataset was prepared using data mining
tools and techniques. Several attributes were arranged so that the data could be classified
and organized in the form of a dataset. Students aged 5 and 6 were labeled A1, and those
aged 7 and 8 were labeled A2. Likewise, A3, A4, A5, and A6 correspond to the age groups
9-10, 11-12, 13-14, and 15-16 respectively. These values make up the Dummy age attribute.
The gender attribute was given two values, B and G, for boys and girls respectively.
Furthermore, many fields were merged into single attributes so that the information would
not be duplicated. For example, the total toilets attribute combines total kaccha toilet,
total pakka toilet, and others, so that prediction can be done effectively without
duplication. Many field values were renamed: New Admission (NEWADM), Single Shift Schools
(SSS), Double Shift Schools (DSS), Education Department (EDUD), Need major repair (NMJR),
Need Minor Repair (NMR), Sits with other class students (SWOCS), tube well (TW), no water
source (NWS), sanctioned teaching staff (Staff), appointed teaching staff (AT staff),
examination hall (EXAM HALL), new rooms required (Required), and space for new rooms
(SForNRooms). This data was thoroughly studied and arranged to predict the student dropout
rate and the factors that affect it the most.

Before the examination and application of the appropriate models, the data was passed
through a series of preprocessing steps by applying filters to the dataset, so that the end
result could be attained faster and more clearly. The attributes needed for data mining
were chosen and the others were eliminated. Preprocessing also included removal of missing
values, removal of irrelevant values, smoothing of noisy data, removal or identification of
outlier values, and resolution of data inconsistencies. Irrelevant parameters and variables
such as mother tongue, state of domicile, category, birth, and marital status were also
removed. Since the students belonged to Pakistani schools and had similar birth and marital
status, these parameters were considered unnecessary for the paper.
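
The age grouping, value renaming and field merging described in this section can be illustrated with a short Python sketch. The helper names `dummy_age` and `recode`, the dictionary `VALUE_CODES` and the sample record are hypothetical; only the codes themselves (A1 to A6, B, G, SSS, DSS, EDUD) come from the text, and the sketch assumes ages are whole numbers of years.

```python
def dummy_age(age):
    """Map an age in whole years to the labels A1..A6 (two-year bins from 5 to 16)."""
    if 5 <= age <= 16:
        return f"A{(age - 5) // 2 + 1}"
    return None  # ages outside 5-16 are not covered by the labels in the text


# Raw value -> short code, following the renaming listed in the text.
VALUE_CODES = {
    "BOY": "B",
    "GIRL": "G",
    "SINGLE SHIFTS SCHOOLS": "SSS",
    "DOUBLE SHIFTS SCHOOLS": "DSS",
    "EDUCATION DEPARTMENT": "EDUD",
}


def recode(record):
    """Return a cleaned copy of one raw record (a hypothetical dictionary layout)."""
    return {
        "GENDER": VALUE_CODES.get(record["GENDER"], record["GENDER"]),
        "SHIFTS": VALUE_CODES.get(record["SHIFTS"], record["SHIFTS"]),
        "DUMMY AGE": dummy_age(record["AGE"]),
        # Kaccha and pakka toilets are merged into a single attribute.
        "TOTAL TOILETS": record["TOTAL KACCHA TOILET"] + record["TOTAL PAKKA TOILET"],
    }


raw = {"GENDER": "GIRL", "SHIFTS": "SINGLE SHIFTS SCHOOLS", "AGE": 9,
       "TOTAL KACCHA TOILET": 1, "TOTAL PAKKA TOILET": 2}
print(recode(raw))
```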
Furthermore, feature selection was carried out to identify the features and variables that
have the greatest effect on the output and prediction. The primary objective is to reduce
the number of attributes so that prediction can be done easily without affecting the
overall classification technique and its reliability. This procedure removes irrelevant
variables and features that would make effective prediction difficult once the models are
applied. In practice, datasets often contain many attributes, including irrelevant ones,
which introduce difficulties and redundant information. A range of algorithms was therefore
considered for feature selection so that such situations can be avoided by the researchers
and the schools involved.
The correlation-based feature selection (CFS) method was therefore used in this research
to eliminate irrelevant features that carry no predictive information about dropout. It
enabled the research to find feature subsets that are highly correlated with the class
(dropout) while having low correlation with one another. The best-first search method was
used: it starts with an empty set of features and generates new candidate subsets at every
iteration by adding a single feature to the subset with the highest evaluation. If
expanding a subset shows no improvement, that subset is set aside, the search goes back to
the best unexpanded subset, and feature addition continues from there.

The equation used for this purpose is as follows:

$$M_S = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}$$

where $M_S$ is the merit of a feature subset $S$ containing $k$ features,
$\overline{r_{cf}}$ is the mean feature-class correlation, and
$\overline{r_{ff}}$ is the average feature-feature inter-correlation.
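
As a small numerical illustration, the CFS merit is commonly written as k times the mean feature-class correlation, divided by the square root of k + k(k-1) times the mean feature-feature correlation. The sketch below evaluates that expression; the function name `cfs_merit` and the sample correlation values are hypothetical.

```python
from math import sqrt


def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """Merit of a subset of k features under correlation-based feature selection."""
    return (k * mean_feature_class_corr) / sqrt(
        k + k * (k - 1) * mean_feature_feature_corr
    )


# Two hypothetical three-feature subsets with the same relevance to the class
# but different redundancy among the features.
low_redundancy = cfs_merit(3, 0.4, 0.1)
high_redundancy = cfs_merit(3, 0.4, 0.8)
print(low_redundancy, high_redundancy)
```

With equal feature-class correlation, the subset whose features are less correlated with each other receives the higher merit, which is why redundant attributes tend to be dropped.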

Once the researcher selects the best-first method, the search for the best features among the
relevant and irrelevant ones begins with an empty set of features. The subset with the highest
merit is retained, which reduces the dimensionality of the training and testing data for the
benefit of effective prediction. When five consecutive iterations yield no improvement in the
CFS merit, the search terminates and the highest-merit subset found so far is kept. After this
procedure, the reduced dataset is passed to the machine learning algorithms so that a classifier
can be built to predict the dropout students and the factors that affect dropout most strongly.
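The search described above can be sketched as follows. This is a simplified illustration, not the Weka implementation: it assumes that the correlations are already available as dictionaries, that the search stops after five consecutive expansions without improvement, and that the feature names and correlation values are hypothetical.

```python
import math

def merit(subset, class_corr, pair_corr):
    """CFS merit of a tuple of feature names."""
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = sum(class_corr[f] for f in subset) / k
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = sum(pair_corr[frozenset(p)] for p in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

def best_first_cfs(features, class_corr, pair_corr, max_stale=5):
    """Forward best-first search starting from the empty feature set."""
    start = ()
    open_list = [(0.0, start)]          # subsets waiting to be expanded
    visited = {start}
    best_subset, best_merit = start, 0.0
    stale = 0                           # consecutive non-improving expansions
    while open_list and stale < max_stale:
        open_list.sort(key=lambda item: item[0], reverse=True)
        _, current = open_list.pop(0)   # expand the highest-merit subset
        improved = False
        for f in features:
            if f in current:
                continue
            child = tuple(sorted(current + (f,)))
            if child in visited:
                continue
            visited.add(child)
            m = merit(child, class_corr, pair_corr)
            open_list.append((m, child))
            if m > best_merit + 1e-12:
                best_subset, best_merit = child, m
                improved = True
        stale = 0 if improved else stale + 1
    return best_subset, best_merit

# Hypothetical correlations for four made-up attributes.
features = ["distance", "teachers", "classrooms", "noise"]
class_corr = {"distance": 0.6, "teachers": 0.5, "classrooms": 0.45, "noise": 0.0}
pair_corr = {
    frozenset(("distance", "teachers")): 0.1,
    frozenset(("distance", "classrooms")): 0.1,
    frozenset(("teachers", "classrooms")): 0.9,
    frozenset(("distance", "noise")): 0.0,
    frozenset(("teachers", "noise")): 0.0,
    frozenset(("classrooms", "noise")): 0.0,
}
selected, selected_merit = best_first_cfs(features, class_corr, pair_corr)
print(selected, round(selected_merit, 3))
```

In this toy example the uninformative attribute is never selected, and only one of the two strongly inter-correlated attributes is kept alongside the most predictive one.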
Once the feature subset has been decided and the inter-correlations have been calculated, the
next step can proceed. These calculations are critical because they indicate how predictive
each attribute is with respect to the others in the dataset. The quality of an attribute can
only be measured once the irrelevant attributes and features have been removed from the final
dataset. Instances differ from one another in their features and attribute values, so a
feature selection method of this kind deserves due consideration when individual instances are
examined. The decision tree predicts the class from the attribute values, and feature selection
helps it to do so effectively.
Training, validation, and testing set
The training and testing sets are equally critical for developing an accurate prediction model
from the dataset. The training set is a set of examples drawn from the full dataset and used to
fit the parameters of the classifier. The role of the validation set is to tune the
hyperparameters of the classifier; in artificial neural networks, for example, the number of
hidden units is a hyperparameter. The primary aim of having a validation set is to avoid
overfitting. Lastly, the testing set is kept independent of the training set while following
the same probability distribution. If a model fits both the training and the testing sets well,
it can be said to show little overfitting. The testing set is therefore equally critical for
ensuring accurate results from the predictive model.

The training set consisted of a random 75% of the original database after it had been drawn and
filtered in the preprocessing stage. It was used to determine which model gave the best fit, so
that the accuracy of the results could be confirmed. The remaining 25% was assigned to the
testing set, which was used to evaluate the fitted models.
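A random 75/25 split of this kind can be sketched as follows. The helper name `train_test_split`, the fixed seed, and the placeholder records are assumptions made for illustration; the study itself does not state how the random division was seeded or implemented.

```python
import random

def train_test_split(records, train_fraction=0.75, seed=42):
    """Shuffle a copy of the records and cut it into training and testing parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Placeholder for the preprocessed student records.
records = list(range(1000))
train_set, test_set = train_test_split(records)
print(len(train_set), len(test_set))
```

Every record ends up in exactly one of the two sets, so the testing set stays independent of the data used to fit the model.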
Classification
Data mining offers a variety of algorithms for classifying a dataset in order to obtain
accurate predictions. These include decision trees, rule induction, artificial neural networks,
evolutionary algorithms, and instance-based learning. This study deals with three of them: the
decision tree, Naïve Bayes, and BayesNet. These classifiers allow the researcher to make
decisions directly from the results that the algorithms produce. Among them, a decision tree
reaches its decision by testing conditions starting from the root of the tree. The following
classifiers were used so that the prediction could be done accurately.
Naïve Bayes

This classification technique has been popular since the 1950s and was introduced under a
different name in the early 1960s. Naïve Bayes belongs to the family of probabilistic
classifiers and is based on applying Bayes' theorem with the assumption that the features are
independent of one another given the class. Researchers commonly use it to construct classifier
models for predicting a class. There is no single algorithm for training such classifiers.
Nevertheless, the classifier is able to provide effective predictions of student dropout.
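To make the idea concrete, the following is a minimal sketch of a Naïve Bayes classifier for categorical attributes with add-one (Laplace) smoothing. It is not the Weka implementation used in the study, and the two attributes and eight records are invented purely for illustration.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (attribute index, class) -> value counts
    values_seen = defaultdict(set)        # attribute index -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen

def predict(model, row):
    """Return normalized posterior probabilities for each class."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total                      # prior P(class)
        for i, value in enumerate(row):
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(values_seen[i])
            score *= numerator / denominator       # smoothed P(value | class)
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

# Hypothetical records: (distance to school, drinking water available).
rows = [("far", "no"), ("far", "no"), ("far", "yes"), ("near", "yes"),
        ("near", "yes"), ("near", "no"), ("far", "yes"), ("near", "yes")]
labels = ["dropout", "dropout", "dropout", "continue",
          "continue", "continue", "continue", "continue"]

model = train_naive_bayes(rows, labels)
posterior = predict(model, ("far", "no"))
print(max(posterior, key=posterior.get))
```

The prior for each class is multiplied by the smoothed probability of each attribute value given that class, and the class with the largest product is predicted.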

BayesNet
BayesNet is a Bayesian network classifier, that is, a probabilistic graphical model that
enables the researcher to compute the probability of a given class. In several studies its
predictive performance has been observed to be close to that of the J48 model. Various authors
regard it as an essential classifier to apply after the decision tree for accurate results.

J48
The decision tree algorithm is critical for the development of the predictive model. Several
scholars have recognized the importance of the J48 decision tree for tasks such as predicting
student dropout. A decision tree can be read as a set of IF-THEN rules, which gives a simplified
representation of the data in the database. Without a decision tree or another classifier
algorithm, teachers would find it difficult to recognize the reasons behind the dropout rate.
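The reading of a decision tree as IF-THEN rules can be illustrated with a short sketch. The tree below is hand-written and hypothetical (it was not learned by J48 from the study data), and the attribute names are invented for the example.

```python
def tree_to_rules(node, conditions=()):
    """Turn a nested-dictionary decision tree into a list of IF-THEN rules."""
    if not isinstance(node, dict):                 # a leaf holds the class label
        premise = " AND ".join(conditions) if conditions else "TRUE"
        return [f"IF {premise} THEN class = {node}"]
    rules = []
    attribute = node["attribute"]
    for value, child in node["branches"].items():
        rules.extend(tree_to_rules(child, conditions + (f"{attribute} = {value}",)))
    return rules

# Hypothetical tree with made-up attributes.
toy_tree = {
    "attribute": "distance_to_school",
    "branches": {
        "near": "continue",
        "far": {
            "attribute": "has_drinking_water",
            "branches": {"yes": "continue", "no": "dropout"},
        },
    },
}

rules = tree_to_rules(toy_tree)
for rule in rules:
    print(rule)
```

Each path from the root to a leaf becomes one rule, so a tree with three leaves yields three rules.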
The success rate of any institute can be improved by knowing the reasons why students drop out.
In the present study, the primary data available for “Prediction of Dropout Students from
Government Schools using Educational Data Mining (EDM)” are based on various parameters
collected through a school census, which includes school profile, geographical, infrastructure,
and students' personal information. It is a classification task based on a composite sample of
858,013 students of Government Schools (KG to Matric) for the year 2015-2016, provided by Asad
Ullah (Software Engineer) of the PPIOU department in the Secretariat. Predicting whether
students will continue their studies or drop out requires many parameters, such as personal,
academic, social, and environmental variables.

Nowadays, however, the deployment of Student Information Systems at the institutional level
provides an appropriate infrastructure for organizing and storing student data, as well as for
data acquisition and deeper analysis. These data can help model the behavior of dropouts and
predict future dropouts, giving counselors a chance to advise and guide students toward
success. The demand for education in Pakistan has increased as more and more children now
attend school. However, the education system has many problems that cause students to abandon
their studies. A lack of good infrastructure and quality teachers, together with poor delivery
of course content in India, causes people to drop out. Students commonly say that they do not
have easy access to educational institutions. This problem is especially real for students who
migrate to different places because of family problems. They face difficulties in obtaining
transfer certificates, school leaving certificates, and other such formalities. “It is our
educational system that is not encouraging people, creating more and more formalities for the
migrant’s students”. Because of all the formalities that must be fulfilled, it seems easier to
change jobs than to change schools or colleges, and once a child has been out of school for too
long, admission becomes even more difficult.
