Data Mining Report PDF

SQIT 3033
KNOWLEDGE ACQUISITION IN DECISION MAKING
A182, Semester 2, Sesi 2018/19
Group A
FINAL GROUP PROJECT
Prepared by: Matrix No.
Lim Kui Chin 240760

Lio Yi Kee 253068
Ng Hwee Wei 253191
Chin Pei Yi 256778
Lecturer: Dr. Jastini Binti Mohd. Jamil
Submission Date: 22 May 2019

Table of Contents
No. Topic Page

1.0 Introduction 1
1.1 Background of Problem 1
1.2 Problem Statement 2
1.3 Aims and Objectives 3
1.4 Significance of Work 3
2.0 Literature Review 4
2.1 Introduction 4
2.2 Application of large data mining technology in Colleges and 4
Universities
2.3 Real World Implementation of Data Modelling Techniques 4
2.3.1 An Analysis of Student Representation, Representative Features 4
and Classification Algorithms to Predict Degree Dropout
2.3.2 Analyze and Predict Student Dropout from Online Programs 5
2.3.3 Survival Analysis based Framework for Early Prediction of 5
Student Dropouts
2.4 Data Mining In Education 6
2.5 Summary 6
3.0 Methodology 8
3.1 Process Flow 8
3.2.1 Sample 9
3.2.2 Explore 19
3.2.2.1 Stat Explore 20
3.2.3 Modify 20
3.2.3.1 Data Partition 20
3.2.3.2 Data Cleaning 20
3.2.3.3 Data Reduction 21
3.2.4 Model 22
3.2.5 Assess 25
4.0 Results and Discussion 27
4.1 Data Mining Technique One: Regression 27
4.1.1 Logistic Regression without Imputation (Default Selection) 27
4.1.2 Logistic Regression with Imputation (Default Selection) 30
4.1.3 Logistic Regression with Imputation and Transformation 33
(Backward Selection)
4.1.5 Logistic Regression with Imputation and and Transformation 40
(Forward Selection)
(Stepwise Selection)
4.1.7 Logistic Regression Models Comparison 49
4.2 Data Mining Technique Two: Decision Tree 49
4.2.1 Scoring ranking overlay 50
4.2.2 Decision Tree: 2 Branches with Chi - Square as Target Criterion 52
4.2.3 Decision Tree: 2 Branches with Gini as Target Criterion 54
4.2.4 Decision Tree: 2 Branches with Entropy as Target Criterion 56
4.2.5 Decision Tree: 3 Branches with Chi - Square as Target Criterion 58
4.2.6 Decision Tree: 3 Branches with Gini as Target Criterion 60
4.2.7 Decision Tree: 3 Branches with Entropy as Target Criterion 62
4.2.8 Decision Tree Models Comparison 64
4.3 Data Mining Technique Three: Neural Network 65
4.3.1 Neural Network with 2 Hidden Units 66
4.3.2 Neural Network with 3 Hidden Units 70
4.3.3 Neural Network Models Comparison 74
5.0 Model Comparison 76
6.0 Conclusion 78
References 79
Appendixes 80
1.0 INTRODUCTION
1.1 Background of Problem
In recent years, university students’ dropout rate has been drastically increasing which
leads to a serious issue among the community. Next to impaired wellbeing, students’ dropout
is a rising problem. The problem of university students’ dropout has been recognized in many
countries around the world. Given that academic success is an asset in the job market, these
problems deserve our attention. Therefore, the aim of the present study was to contribute to the
understanding of the relationship between student dropout rate under various conditions.
Predicting student dropout in high school is an important issue in education because it
concerns too many students in individual schools and institutions over the entire world, and it
usually results in overall financial loss, lower graduation rates and an inferior school reputation
in the eyes of all involved. The definition of dropout differs among researchers, but in any
event, if an institution loses a student by whatever means, the institution has a lower retention
rate. The early identification of vulnerable students who are prone to drop their courses is
crucial for the success of any school retention strategy. And, in order to try to reduce the
aforementioned problem, it is necessary to detect students who are at risk as early as possible
and thus provide some care in order to prevent these students from quitting their studies and
intervene early to facilitate student retention.
High school dropout has been associated with negative outcomes, including increased
rates of unemployment, incarceration, and mortality. Dropout rates vary significantly
depending on individual and environmental factors. Researchers, practitioners, and
policymakers have emphasized the importance of improving high school graduation rates
because school dropout is associated with various negative outcomes. Students who dropout of
school experience decreased employment rates, lower income levels, increased criminal
involvement and incarceration (Bjerk, 2012), and higher mortality rates than students who
graduate. In fact, increasing high school graduation rates could decrease health disparities and
save more than 1.3 million lives over 6 years, eight times as many lives as those saved by
medical advancements over the same time period. In economic terms, high school dropouts
between the ages of 20 – 67 contri4bute $50 billion less in federal and state income taxes than
do individuals with a high school diploma (Rouse, 2005). The negative health, societal, and
economic outcomes make dropout a critical public health and economic concern.
Calculating dropout can be problematic due to differences in definitions and criteria.
Because measuring dropout longitudinally is often prohibitively expensive, dropout is typically
reported as an event or status rate. Event rate refers to the proportion of students who drop out
1
in a single year without completing high school. Status rate is the proportion of students,
typically between 16 and 24, who have not completed high school and who are not, at a given
point in time, enrolled in a high school program. Dropout event rates typically yield a smaller
statistic than status event rates.
In an attempt to identify students at risk for dropout, create targeted interventions, and
implement relevant public policy to improve graduation rates, researchers have sought to
identify factors associated with dropout. While many studies have focused on either student or
environmental variables, we examined both levels concurrently using comparative analysis to
better understand factors associated with dropout.
Comparative analysis of several classification methods (neural networks, decision trees,
logistic regression) was used in order to develop early models of students who are most likely
to drop out. The dataset of this study come from Universiti Teknologi Petronas which is located
in Perak, Malaysia.
1.2 Problem Statement

Currently, Universiti Teknologi Petronas (UTP) do not have a proper system to
determine the preconditions or traits among its students who are likely to dropout. As a result,
there is no specific way for professors to tell whether which students fulfills the underlying
trait to drop out in order to give attention and helps to clear the barrier before problem solidifies.
To identify the relationship between different environmental variables (age, number of
dependents, type of school, area of living, class grade, SPM grade, income category, state,
program, and sponsorship) and status of study, Universiti Teknologi Petronas (UTP) generated
a dataset among its students’ record.
To date, there is no specific model on any system that can help professors in UTP to
monitor its students. Therefore, the existence of this study is to determine the conditions and
relationships based on its dataset using three different data mining techniques.
2
1.3 Objectives
The aim of this study is to construct a model of student dropout pattern of Universiti
Teknologi Petronas (UTP). In order to ensure that the generated output aligns with the purpose
of the study, three of the objectives has to be fulfilled:
1. To identify the most significant factor that causes student to drop out.
2. To gain insights on the pattern among the factor between students who drop out and
students who endures.
3. To determine the best model to predict the behavior and precondition of a student before
dropping out of university.
1.4 Significance of Work

The result from the study has produced a set of rules which represents the pattern of
dropout among students in Universiti Teknologi Petronas (UTP). Based on the pattern
produced, a predictive modeling technique can be developed to identify the traits of student
who tend to dropout from university. The findings of the study contribute guidance to
knowledge based system developer to develop an application based on the rules found in this
study.
The study will also assist the university’s management and lecturers to identify the
underlying problem beforehand. As a result, they can provide attention and care towards the
student as well as encouraging them to move forward.
3
2.0 LITERATURE REVIEW
2.1 Introduction
Data mining is a process of discovering useful information, hidden pattern or rules in large
quantities of data. It is also known as a knowledge discovery, knowledge extraction,
information discovery, information harvesting and data analytics. The purpose of this technique
is to discover meaningful knowledge like commercially valuable, exploitable pattern and novel
from data. Once the patterns are found, it can be further used to help in decision making
(Bharati, n.d.).
2.2 Application of large data mining technology in Colleges and Universities

With the enlargement of the enrollment scale and the flexibility of the educational methods,
almost every year colleges and universities are confronted with the contradiction between the
sharp increase in the number of students and the increasingly tense teaching resources. At the
same time, some colleges and universities are constantly reforming and changing. All these
have brought unprecedented development and challenges to the management of colleges and
universities. Under such circumstances, how to achieve the greatest development at the least
cost has become a new issue to be solved.
The study conducted by Wang Deze (2018) applied data mining technology to teaching
management in colleges and universities, extracts useful information from the data collected
by educational administration management system, and provided correct and powerful data
support and guarantee for college teaching managers to make relevant decisions.
2.3 Real World Implementation of Predictive Modelling

2.3.1 An Analysis of Student Representation, Representative Features and Classification
Algorithms to Predict Degree Dropout
Identifying and monitoring students who are likely to dropout is a vital issue for
universities. Early detection allows institutions to intervene, addressing problems and retaining
students. Prior research into the early detection of at-risk students has opted for the use of
predictive models, but a comprehensive assessment of the suitability of different algorithms
and approaches is complicated by the large number of variable features that constitute a
student's educational experience. Predictive models vary in terms of their amplitude,
temporality and the learning algorithms employed. While amplitude refers to the ability of the
model to operate on multiple degrees, temporality is often considered due to the natural
temporal aspect of the data.
4
In the absence of a comparative framework of learning algorithms, the aim of the study
conducted by Rubén Manrique, Bernardo Pereira Nunes, Olga Marino, Marco Antonio
Casanova, and Terhi Nurmikko-Fuller (2019) has been to provide such an analysis, based on a
proposed classification of strategies for predicting dropouts in Higher Education Institutions.
Three different student representations are implemented (namely Global Feature-Based, Local
Feature-Based, and Time Series) in conjunction with the appropriate learning algorithms for
each of them. A description of each approach, as well as its implementation process, are
presented in this paper as technical contributions.
An experiment based on a dataset of student information from two degrees, namely
Business Administration and Architecture, acquired through an automated management system
from a university in Brazil is used.
The paper’s findings can be summarized as: (i) of the three proposed student
representations, the Local Feature-Based was the most suitable approach for predicting dropout.
In addition to providing high quality results, the Local Feature-Based representations are
simple to build, and the construction of the model is less expensive when compared to more
complex ones; (ii) as a conclusion of the results obtained via Local Feature-Based, dropout can
be said to be accurately predicted using grades of a few core courses, so there is no need for a
complex features extraction process; (iii) considering temporal aspects of the data does not
seem to contribute to the prediction performance although it increases computational costs as
the model complexity increases.
2.3.2 Analyze and Predict Student Dropout from Online Programs

Increasing student retention rates in higher education is an important goal because of
its pertinence to the institution's core mission and its financial well-being. However, the
consistent low graduation rates during the past half-century demonstrate the persistence of this
challenge. Online programs in higher educations are generally afflicted with even lower
retention rates than on-campus programs. With increasing availability of institutional data in
the past decade, we can apply data mining approaches to analyze those data to help with the
retention problems. In the context of retaining more students in online programs, we develop
an educational data mining framework to analyze the institutional data and predict the potential
students who might leave online programs before the new term begins (Kyehong Kang, Sujing
Wang, 2018).
The goal of the project is to provide the administrators, instructors, and staff members
an opportunity to take actions to intervene in the students' dropout process before it takes place
5
to improve the online program retention rate. With student enrollment data and academic
performance information, the researchers able to build prediction models using logistic
regression method. In addition, we apply other classifiers including k nearest neighbor (kNN),
decision tree, naïve Bayes, support vector machine, and random forest to predict the dropout
for comparison purposes.
2.3.3 Survival Analysis based Framework for Early Prediction of Student Dropouts
Retention of students at colleges and universities has been a concern among educators
for many decades. In the paper conducted by Sattar Ameri, Mahtab J. Fard, Ratna B. Chinnam,
Chandan K. Reddy (2016), the researchers developed a survival analysis framework for early
prediction of student dropout using Cox proportional hazards model (Cox). Other than that,
they also applied time-dependent Cox (TD-Cox), which captures time-varying factors and can
leverage those information to provide more accurate prediction of student dropout.
The model of the research utilizes different groups of variables such as demographic,
family background, financial, high school information, college enrollment and semester-wise
credits.
The proposed framework has the ability to address the challenge of predicting dropout
students as well as the semester that the dropout will occur. This study enabled the researchers
to perform proactive interventions in a prioritized manner where limited academic resources
are available. This is critical in the student retention problem because not only correctly
classifying whether a student is going to dropout is important but also when this is going to
happen is crucial for a focused intervention. The method are evaluated on real student data
collected at Wayne State University. Results show that the proposed Cox-based framework can
predict the student dropouts and semester of dropout with high accuracy and precision
compared to the other state-of-the-art methods.
6
2.4 Data Mining In Education
According to a study in Malaysia, it indicated that data mining techniques are widely
used in higher educational system in order to increase the effectiveness of the traditional
method to provide a guideline to improve the decision-making process. Data mining techniques
were used to analyze the existing work, to identify existing gaps and to plan future works
(Beikzadeh & Delavari, 2004).
In the educational sector, data mining is often defined as the process of converting raw
data to useful information in order to extract crucial insights. Data mining is an analytic
approach that capitalizes on the advances of technology and the extreme of richness of data in
higher education to improve research and decision making through uncovering hidden trends
and patterns that lead to predictive modelling using a combination of explicit knowledge base.
When data mining systems were compared with other education systems, it emphasizes
the role of expert to interpret the findings obtained from analyzing the data retrieved from the
course. The results show that data mining systems help improve exercise, scheduling the course,
and identifying potential dropouts at an early phase (Hamalainen, 2004).
2.5 Summary
In this research, we will focus on developing a classification model similar covered in
section 2.3. As a whole, three data mining technique will be used. The techniques used are
decision tree, logistic regression, and neural network. Decision tree model for our study are
separated into 2 parts, which consists of 2 and 3 branches. To further the insights, we set 2
different maximum branches of 2 and 3. The benchmark of the study is to acquire the best
predictive model with the lowest misclassification rate to identify the underlying factor of
dropout among students.
7
3.0 METHODOLOGY
3.1 Process Flow
Data mining analysis involves a series of processes. it's essential to follow the quality processes
in order that data mining analysis is conducted in an exceedingly consistently manner. There
are several approaches to carry out data mining such as CRISP-DM (Cross – Industry Standard
Process for Data Mining) and SEMMA. In this study, SEMMA which is Sample, Explore,
Modify, Model and Assess is applied to conduct the data mining analysis on the student dropout
from Universiti Teknologi Petronas by using the data mining software which is the SAS
Enterprise Miner Workstation 15.1. SEMMA is well – known data mining methodology
developed by the SAS Institute. A pictorial representation of the process flow of SEMMA is
summarized as shown in Figure 1 below.
Figure 1. Process flow of data mining analysis
8
3.2.1 Sample
A historical data on the list of a total of 7606 students (7606 observations) for the
student dropout from Universiti Teknologi Petronas is collected. Figure 2 and Figure 3 shows
an overview of the historical data. Based on Figure 4, the dataset consists of a total of 15
variables which are BIL, UMUR, JANTINA, PROGRAM, NEGERI,
KATEGORI_KAWASAN_TINGGAL, PENDAPATAN_KELUARGA,
KUMPULAN_PENDAPATAN_KELUARGA, BIL_TANGGUNGAN, TAJAAN,
KELAS_GRADE_PELAJAR, KELAS_GRADE_SPM, JENIS_SEKOLAH, STATUS, and
STATUS_NEW.
Figure 2. Overview of the historical data 1
Figure 3. Overview of the historical data 2
9
Figure 4. Sample statistics of data imported
These 15 variables can be categorized into three main model roles : input attribute,
output attribute (target), and ID. From the 15 variables, the variables that are categorized as the
input attributes are UMUR, JANTINA, PROGRAM, NEGERI,
KATEGORI_KAWASAN_TINGGAL, PENDAPATAN_KELUARGA,
KUMPULAN_PENDAPATAN_KELUARGA, BIL_TANGGUNGAN, TAJAAN,
KELAS_GRADE_PELAJAR, KELAS_GRADE_SPM, JENIS_SEKOLAH, and STATUS.
And the output attribute or the target is STATUS_NEW. This output variable (target variable)
namely Status New refers to whether the students have dropout from the University Teknologi
Petronas.
In this phase before data exploration, the main idea is to acquire related and specific
scope from the large dataset by using appropriate sampling technique to select the suitable size
of sample so that the process of knowledge discovery or data mining analysis can speed up.
However in this study, all the 7606 observations in the dataset are considered and included in
the data mining analysis since the dataset is not very large. The variables along with their model
role, measurement level, and description are shown in the table below.
10
Variable Model Measurement Level Description
Name Role
BIL Rejected Interval Number of students.
UMUR Input Interval Age of students that measured in years.
JANTINA Rejected Binary Gender of students that is either Male
or Female.
PROGRAM Input Nominal Program of students taking.
NEGERI Input Nominal State of students living.
KATEGORI_ Input Interval Category living area of students.
KAWASAN_
TINGGAL
PENDAPATA Rejected Interval Income of student’s family measure in
N_KELUARG RM.
A
KUMPULAN Input Ordinal Category of student’s family income.
_PENDAPAT
AN_KELUAR
GA
BIL_TANGG Input Interval Family burden.
UNGAN
TAJAAN Input Nominal Sponsorship of the students.
KELAS_GRA Input Ordinal Student’s grade.
DE_PELAJAR
KELAS_GRA Input Ordinal Student’s SPM grade.
DE_SPM
JENIS_SEKO Input Nominal Type of school.
LAH
STATUS Rejected Nominal Students’ Status.
STATUS_NE Target Binary Students’ status.
W
Table 2. The model role, measurement level, and description of the variables.
11
Figure 5. Bil_Tanggungan
Based on figure 5, the first bar which consists of 213 data is considered as missing
data. The highest Frequency of this data is 3197 between the range of 3.9 and 5.2. The lowest
frequency is 11 between the range 11.7 and 13.
Figure 6. Jantina
Figure 6 above refers to the gender variable which consists of male and female. From
the figure above, it is obvious that the number of female students is more than male students
who are 4242 female students and 3364 male students.
12
Figure 7. Jenis Sekolah
Figure 7 above shows that type of school SK (Sekolah Kebangsaan) has the highest
frequency which is 4551 students, while the lowest is SB (Sekolah Berasrama) and SBK
(Sekolah Bantuan Kebangsaan) which only have one student. It also has 2 missing value on the
variable Jenis Sekolah.
Figure 8. Kategori_Kawasan_Tinggal
Figure 8 shows that there are 3707 students that stay in Bandar and Pinggir Bandar.
3479 students stay at area Luar Bandar and few students who stay at Bandar and Luar Negara.
13
Figure 9. Kelas_Grade_Pelajar
The figure above shows that the dataset has 2887 students who get great grades. It is
also have 786 students who get fail on their class performance.
Figure 10. Kelas_Grade_SPM

Figure 10 shows the Kelas Grade SPM, there have 3055 students who achieve a higher
great. Meanwhile, there also a few 717 students who get minimal grade for SPM. The bar chart
above also consists of 94 missing values.
14
Figure 11. Kumpulan_Pendapatan_Keluarga
The figure above shows that 5591 student’s family income have below RM40,000.
Figure 12. Negeri

Based on the figure above, most of the students live in Perak area which is 5373 students.
15
Figure 13. Pendapatan_Keluarga
Figure 13 shows the family income, most of the family having the income between
RM0 to RM34355. It is also have 240 missing values.
Figure 14. Program

Based on the figure above, most of the student study Diploma Pengajian Islam which
have 1965 students. And the least students study program is Diploma Sains Komputer dan
Rangkaian which have only 75 students.
16
Figure 15. STATUS_NEW
Figure 15 shows the status that students still continue study or dropout from study.
From the chart above there have 2319 students dropout from university, and 5287 students still
continue study.
Figure 16. Status

Figure above shows 2965 students status is AKTIF, which means these students still
continue study. Then follow by 2322 students which is ALUMNI, 1225 students BERHENTI,
828 students GAGAL and 266 students who TANGGUH PENGAJIAN.
17
Figure 17. Tajaan
Figure above shows most of the students get their sponsorship from PTPTN which have
5901 students. It also have 1052 students did not get any sponsorship.
Figure 18. Umur

Based on the figure above, 7464 students are aged between 18 to 22. While others
students are aged between 22 to 26.
18
3.2.2 Explore
Figure 19. Statistics table on the variables

The table above shows the simple descriptive statistics of the variables computed by
the SAS Enterprise Miner after the dataset collected has been imported from Microsoft Excel
into the software via “File Import”. By referring to the statistics table, several significant
descriptive statistics in which minimum value, maximum value, mean, standard deviation,
percentage of missing values, number of levels, skewness, and kurtosis of the variables in the
input data are investigated.
According to the statistics table shown in the above, it is observed that there are five
variables that consist of missing values in the dataset records: the variable Bil_Tanggungan
(2.80%), Jenis_Sekolah (0.03%), Kelas_Grade_SPM (1.24%), Negeri (0.39%) and
Pendapatan_Keluarga (3.16%). From these five variables, the variable Pendapatan_Keluarga
has the highest percentage of missing values indicates it has the highest number of missing
values in the dataset records.
In term of skewness for the interval input variables, the skewness of variable Bil,
Bil_Tanggungan, Kategori_Kawasan_Tinggal, Pendapatan_Keluarga, and Umur is 0.001347,
0.256349, 0.006337, 85.82285 and 10.52424 respectively. The acceptable range of skewness
is between -3 and 3 for normal distributed data. It is observed that the skewness of the three
variables Bil, Bil_Tanggungan, and Kategori_Kawasan_Tinggal are within the range between
-3 and 3. Therefore, it can be concluded that indicates that the data records consisted in the
three variables Bil, Bil_Tanggungan, and Kategori_Kawasan_Tinggal are normally distributed.
Since another two variables Pendapatan_Keluarga, and Umur are out of the range -3 and 3, so
we use transform variable. Overall, there is no any unusual minimum or maximum values.
19
3.2.2.1 Stat Explore
In order to further explore the dataset given, Stat Explore node had been executed. The
variable worth plot shows orders the input variables by their worth in predicting the target
variable. Results obtained shows that Kelas_Grade_Pelajar rank the highest which followed by
Tajaan and Kelas_Grade_SPM. These three variables are the most important factor that affects
the output of whether a student will dropout from university or retain. According to the output
report, none of the variables have relatively large standard deviation.
3.2.3 Modify
3.2.3.1 Data Partition
The “Data Partition” node at the Sample tab in SAS Enterprise Miner is used for dataset
allocations. The dataset is allocated into two types: training data and testing data. The training
data is used for model construction whereas the testing data is used for model evaluation. To
construct the model for data mining analysis on the student dropout from Universiti Teknologi
Petronas, 70% of the dataset is allocated for data training. And the rest of 30% of the dataset is
allocated for data testing to evaluate the generated model.
Figure 20. Dataset allocations
3.2.3.2 Data Cleaning

The existence of missing values for the four variables Bil_Tanggungan, Jenis_Sekolah,
Kelas_Grade_SPM, and Negeri can be due to improper method of data collection been used.
To overcome this problem by replacing the missing records for each of the variables, the
“Impute” node at the Modify tab in SAS Enterprise Miner is applied for handling the
incomplete data using imputation method. Replacing the missing data can improve the
accuracy of the data mining analysis on the student dropout from Universiti Teknologi Petronas.
The figure 21 below shows the result of imputation been conducted. Since we rejected the
“pendapatan keluarga”, so it didn’t show out after imputation.
20
Figure 21. Imputation Summary
According to the Figure 21, the impute method for replacing the 148 missing values in
interval input variables Bil_Tanggungan is by using median method in which the missing
values will be replaced with the 50th percentile, which is either the middle value or the
arithmetic mean of the two middle values for a set of numbers arranged in ascending order.
And the impute method for replacing the 2 missing values in nominal input variable
Jenis_Sekolah, 21 missing values in nominal input variable Negeri and 68 ordinal input
variable Kelas_Grade_Value is by using count method and the missing value is replaced by the
most frequently occurring value for the variable. In this case, mean is the preferred statistic for
replacing the missing values for interval input variable Bil_Tangguangan since the variables’
values are at least approximately symmetric (input variables Bil_Tanggungan has no skewed
distribution).
3.2.3.3 Data Reduction

In data reduction, the purpose is to increase the efficiency of data mining analysis by
reducing the dimensionality of the data since the data mining analysis will become harder as
the dimensionality of data increases. In this study, one of the common methods is used which
is through attribute selection or elimination of irrelevant variables to reduce the complexity of
data dimension. According to Figure 22 below, in this case one input variable namely Bil and
Jantina is set as rejected to be part of the predictive model development since the student
dropout from university does not depend on the number of students and gender of students.
The input variables Pendapatan_Keluarga and Status is also rejected because it’s have already
transform to another category.
21
Figure 22. Overview on the variable roles and levels
3.2.4 Model
In this study, three types of predictive modelling methods for data mining analysis on
the student dropout from Universiti Teknologi Petronas are applied. The three types of
prediction methods are Regression, Decision Tree, and Neural Network. Firstly, for regression,
the suitable regression model used is the logistic regression since the target variable,
STATUS_NEW is a binary variable that will classify or predict whether a student will student
dropout from Universiti Teknologi Petronas.
Several logistic regression models are developed to investigate and determine which
one will be resulting in the best and desired outcome. In this study, the logistic regression
models are initially divided into two categories which are logistic regression with imputation
and without imputation. For the logistic regression with imputation, it is then divided into
another four categories which are logistic regression with all input’s selection, backward
selection, forward selection, and stepwise selection. The following Figure 23 illustrates the
choices of model selection available in SAS Enterprise Miner for regression modelling.
22
Figure 23. Model selection for regression modelling
Based on the Figure 23, the model selection method for “Backward” is referred to the
variables selection that start with all variables consisted in the model and then systematically
eliminate the variables that are not significantly associated with the target until no other
variables in the model meets the significance level of 0.05 (5%).
For “Forward” method of model selection, it is referred to the selection of variables that
start with none variables in the model and then systematically add the chosen variables into the
model that are significantly associated with the target until no other variables in the model
meets the significance level of 0.05 (5%).
For “Stepwise” model selection method, variable selection is started with none variable
in the model and then systematically adds variables that are significantly associated with the
target. After a variable is added to the model, it can be removed if it is deemed that the variable
is no longer significantly associated with the target.
The model selection method for “None” is referred to all inputs (input variables) will
be selected (default selection) and included in the final model to fit the regression model if it
is chosen. In overall, a total of five logistic regression models are built in this study. The overall
conceptual framework for the logistic regression models is clearly summarized in the Figure
24 below.
Figure 24. Conceptual framework for logistic regression models

Next, for decision tree, a total of 6 decision tree models are developed in this study.
The construction of the 6 decision tree models for data mining analysis on the student dropout
23
from Universiti Teknologi Petronas is based on two main splitting rules which are in terms of
maximum branch and nominal target criterion.
Maximum branch refers to the maximum number of branches or subset to be split in a
decision tree whereas nominal target criterion refers to the method of searching for and
evaluating candidate splitting rules in the presence of a nominal target. The nominal target
criterion can be categorized into three main types which are Entropy, Gini, and Classification
Error (ProbChisq).
In this study, the decision tree model is firstly divided into two categories which are
decision tree of 2 branches and 3 branches. And then for each number of maximum branches,
the decision tree models are divided into three types of nominal target criterion which are
decision tree with Entropy as target criterion, decision tree with Gini as target criterion, and
decision tree with Classification Error (ProbChisq) as target criterion. The overall conceptual
framework for the decision tree models is clearly summarized in the Figure 25 below.
Figure 25. A conceptual framework for decision tree models

For a neural network, a total of two different neural network models are constructed in
this study for data mining analysis on the students will dropout from Universiti Teknologi
Petronas. The first one is a neural network with 2 hidden units and the second one is a neural
network with 3 hidden units. For each type of the neural network model, the output will be
translated into one of the most popular neural network architectures which is the Multilayer
Perceptron (MLP) which is a feed-forward network that composed of an input layer, hidden
layers, and an output layer.
24
Figure 26. Conceptual framework for neural network models.
3.2.5 Assess
In this study, firstly misclassification rate is assessed for comparing how well each of
the models constructed under each type of predictive modelling (Logistic Regression, Decision
Tree, and Neural Network) since the target variable STATUS_NEW is a binary variable. After
that, second comparison is conducted by using the “Model Comparison” node at the Assess tab
in SAS Enterprise Miner to compare which type of predictive modelling method is the best one
for accurate prediction on the student dropout from Universiti Teknologi Petronas. Figure 27
below illustrates the overview diagram of all the model constructed.
25
Figure 27. Overview diagram of all model constructed.
26
4.0 RESULTS AND DISCUSSION
4.1 Data Mining Technique One: Regression
There are five types of regression analysis for data mining analysis on the students’
background from one of the universities in Perak for prediction of students dropout through the
regression node contained in the Model tab in SAS Enterprise Miner as shown in Figure 28
below. In this section, the results of the predictive modelling by regression analysis in terms of
the score rankings overlay for target variable, fit statistics and analysis of maximum likelihood
estimates are illustrated and explained.
Figure 28. Regression node that can be accessed from the Model tab
4.1.1 Logistic Regression without Imputation (Default Selection)

Logistic regression without imputation refers to the regression modelling that involves
with the dataset that consists of missing values without taking any action on it.
Figure 29. Scoring rankings overlay of logistic regression without imputation (default
selection) model at depth = 60%
27
Figure 30. Scoring rankings overlay of logistic regression with imputation (default selection)
model at depth = 100%
Figure 29. and above shows the outputs of scoring ranking overlay. According to the
scoring ranking overlay in Figure 29 , it is observed that at depth = 60% in which around 4564
students study in the university from the total 7606 observations being included, the logistic
regression model predicts that 50.78% of the 4564 students dropout from the university. While
according to the scoring ranking overlay in Figure 30, it is observed that at depth = 100% in
which all the 7606 observations are included, the model successfully predicts that 30.48% of
all the 7606 students study in the university. When comparing to the actual data in which 2319
out of 7606 that is equivalent to 30.49% of the students dropout from the university which is
close to the percentage of students dropout from the university being predicted by the model,
it is concluded that the model is accurate and therefore it is useful to perform prediction.
28
Figure 31. Fit statistics table for logistic regression without imputation (default selection)
Figure 32. Analysis of maximum likelihood estimates for logistic regression without
imputation (default selection) 1
29
Figure 33. Analysis of maximum likelihood estimates for logistic regression without
imputation (default selection) 2
Figure 31 shows the fit statistics table of the model. It is observed that the
misclassification rate is 11.48% for train error and 9.98% for test error. Figure 32, 33 shows
the output of analysis of maximum likelihood estimates. Based on Figure 32, 33, it is observed
that there are several variables which have the significant values that are less than the
significance level of 5% which indicates that these variables are significant for the prediction
of the students will dropout from the university. These variables are BIL_TANGGUNGAN
and UMUR. It is noticed that the coefficient value for variable BIL_TANGGUNGAN is greater
than the coefficient value for variable UMUR indicates that the age of a student is significantly
bring more impact on the prediction of the dropout of the student than the family burden.
Therefore, the equation for this logistic regression that is without imputation and
transformation the model selection method is set as default is:
𝑒 0.1398(𝐵𝐼𝐿𝑇𝐴𝑁𝐺𝐺𝑈𝑁𝐺𝐴𝑁 )+0.0687(𝑈𝑀𝑈𝑅)−2.9657
𝐸𝑞𝑢𝑎𝑡𝑖𝑜𝑛 =
1 + 𝑒 0.1398(𝐵𝐼𝐿𝑇𝐴𝑁𝐺𝐺𝑈𝑁𝐺𝐴𝑁 )+0.0687(𝑈𝑀𝑈𝑅)−2.9657
4.1.2 Logistic Regression with Imputation (Default Selection)

Logistic regression with imputation refers to the regression modelling that involves the
use of dataset that consists of zero missing values after taking the action on it which is the
imputation to replace the missing value via the “Impute” node in SAS Enterprise Miner. In this
case, the model selection method is set as default.
30
Figure 34 and Figure 35 shows the outputs of scoring ranking overlay. According to
the scoring ranking overlay in Figure 34 , it is observed that at depth = 60% in which around
4564 students study in the university from the total 7606 observations being included, the
logistic regression model predicts that 50.78% of the 4564 students dropout from the university.
While according to the scoring ranking overlay in Figure 35, it is observed that at depth = 100%
in which all the 7606 observations are included, the model successfully predicts that 30.48%
of all the 7606 students study in the university. When comparing to the actual data in which
31
2319 out of 7606 that is equivalent to 30.49% of the students dropout from the university which
is close to the percentage of students dropout from the university being predicted by the model,
Figure 36. Fit statistics table for logistic regression with imputation (default selection)
Figure 37. Analysis of maximum likelihood estimates for logistic regression with imputation
(default selection) 1
32
(default selection) 2
misclassification rate is 8.98% for train error and 7.18% for test error. Figure 37 and 38 shows
the output of analysis of maximum likelihood estimates. Based on Figure 37 and 38, it is
observed that there are several variables which have the significant values that are less than the
of the students dropout from the university. These variables are KELAS_GRADE_PELAJAR
and UMUR. It is noticed that the coefficient value for variable UMUR is greater than the
coefficient value for variable KELAS_GRADE_PELAJAR indicates that the age of students
is significantly bring more impact on the prediction of the students dropout from the university
compared the students’ performance. Therefore, the equation for this logistic regression that is
with imputation the model selection method is set as default is:
𝑒 0.000575(𝐾𝐸𝐿𝐴𝑆_𝐺𝑅𝐴𝐷𝐸_𝑃𝐸𝐿𝐴𝐽𝐴𝑅)+0.0039(𝑈𝑀𝑈𝑅)−3.1062
1 + 𝑒 0.000575(𝐾𝐸𝐿𝐴𝑆_𝐺𝑅𝐴𝐷𝐸_𝑃𝐸𝐿𝐴𝐽𝐴𝑅)+0.0039(𝑈𝑀𝑈𝑅)−3.1062
4.1.3 Logistic Regression with Imputation and Transformation (Default Selection)

Logistic regression with imputation and transformation refers to the regression
modelling that involves the use of dataset that consists of zero missing values after taking the
action on it which is the imputation to replace the missing value via the “Impute” node and the
minimum memory allocation via the “Transform Variable” in SAS Enterprise Miner. In this
case, the model selection method is set as default.
33
Figure 39. Scoring rankings overlay of logistic regression with imputation and
transformation (default selection) model at depth = 60%
transformation (default selection) model at depth = 100%
Figure 40 and above shows the outputs of scoring ranking overlay. According to the
scoring ranking overlay in Figure 39, it is observed that at depth = 60% in which around 4564
34
Figure 41. Fit statistics table for logistic regression with imputation and transformation
(default selection)
and transformation (default selection) 1
35
and transformation (default selection) 2
of the students dropout. These variables are KELAS_GRADE_PELAJAR and UMUR. It is
noticed that the coefficient value for variable UMUR is greater than the coefficient value for
variable KELAS_GRADE_PELAJAR and indicates that the age of students is significantly
bring more impact on the prediction of the students dropout from university compared the
students’ performance. Therefore, the equation for this logistic regression that is with
imputation and transformation and the model selection method is set as default is:
𝑒 0.000575(𝐾𝐸𝐿𝐴𝑆_𝐺𝑅𝐴𝐷𝐸_𝑃𝐸𝐿𝐴𝐽𝐴𝑅)+0.0039(𝑈𝑀𝑈𝑅)−3.1062
1 + 𝑒 0.000575(𝐾𝐸𝐿𝐴𝑆_𝐺𝑅𝐴𝐷𝐸_𝑃𝐸𝐿𝐴𝐽𝐴𝑅)+0.0039(𝑈𝑀𝑈𝑅)−3.1062
4.1.4 Logistic Regression with Imputation and Transformation (Backward Selection)

In this case, the model selection method is set as “Backward” at the Model Selection
property table in SAS Enterprise Miner in order to execute the new model built for logistic
regression with imputation and transformation.
36
transformation (backward selection) model at depth = 60%
transformation (backward selection) model at depth = 100%
Figure 45 and above shows the outputs of scoring ranking overlay. According to the
scoring ranking overlay in Figure 44 , it is observed that at depth = 60% in which around 4564
37
(backward selection)
Figure 47. Summary of backward elimination for logistic regression with imputation and
transformation (backward selection) model
38
and transformation (forward selection) model 1
and transformation (backward selection) model 2
of the students dropout from the university. These variables are IMP_BIL_TANGGUNGAN,
and KELAS_GRADE_PELAJAR. And the rest of the input variables that have significance
39
value greater than 0.05 as shown in Figure 48 and 49 are eliminated. It is noticed that the
coefficient value for variable KELAS_GRADE_PELAJAR is greater than the coefficient value
for variable IMP_BIL_TANGGUNGAN indicates that the students’ performance is
significantly bring more impact on the prediction of the student dropout from the university
compared the family and. Therefore, the equation for this logistic regression that is with
imputation and transformation and the model selection method is set as backward is:
0.0001(𝐼𝑀𝑃𝐵𝐼𝐿𝑇𝐴𝑁𝐺𝐺𝑈𝑁𝐺𝐴𝑁 )+0.000625(𝐾𝐸𝐿𝐴𝑆𝐺𝑅𝐴𝐷𝐸𝑃𝐸𝐿𝐴𝐽𝐴𝑅 )−2.8216
𝑒
0.0001(𝐼𝑀𝑃𝐵𝐼𝐿𝑇𝐴𝑁𝐺𝐺𝑈𝑁𝐺𝐴𝑁 )+0.000625(𝐾𝐸𝐿𝐴𝑆𝐺𝑅𝐴𝐷𝐸𝑃𝐸𝐿𝐴𝐽𝐴𝑅 )−2.8216
1+𝑒
4.1.5 Logistic Regression with Imputation and Transformation (Forward Selection)

In this case, the model selection method is set as “Forward” at the Model Selection
transformation (forward selection) model at depth = 60%
40
transformation (forward selection) model at depth = 100%
Figure 51 above shows the outputs of scoring ranking overlay. According to the scoring
ranking overlay in Figure 50, it is observed that at depth = 60% in which around 4564 students
study in the university from the total 7606 observations being included, the logistic regression
model predicts that 50.78% of the 4564 students dropout from the university. While according
to the scoring ranking overlay in Figure 51, it is observed that at depth = 100% in which all the
7606 observations are included, the model successfully predicts that 30.48% of all the 7606
students study in the university. When comparing to the actual data in which 2319 out of 7606
that is equivalent to 30.49% of the students dropout from the university which is close to the
percentage of students dropout from the university being predicted by the model, it is concluded
that the model is accurate and therefore it is useful to perform prediction.
41
(forward selection)
Figure 53. Summary of forward selection for logistic regression with imputation and
transformation (forward selection) model
42
43
of the student dropout from the university. These variables are IMP_BIL_TANGGUNGAN,
KELAS_GRADE_PELAJAR and UMUR. And the rest of the input variables that have
significance value greater than 0.05 as shown in Figure 54 and 55 are eliminated. It is noticed
that the coefficient value for variable UMUR is greater than the coefficient value for variable
IMP_BIL_TANGGUNGAN and KELAS_GRADE_PELAJAR indicates that the age of
students is significantly bring more impact on the prediction of the students dropout from the
university compared the family burden and the students’ performance. Therefore, the equation
for this logistic regression that is with imputation and transformation and the model selection
method is set as forward is:
0.0001(𝐼𝑀𝑃𝐵𝐼𝐿𝑇𝐴𝑁𝐺𝐺𝑈𝑁𝐺𝐴𝑁 )+0.0006(𝐾𝐸𝐿𝐴𝑆𝐺𝑅𝐴𝐷𝐸𝑃𝐸𝐿𝐴𝐽𝐴𝑅 )+0.0034(𝑈𝑀𝑈𝑅)−3.5469
𝑒
0.0001(𝐼𝑀𝑃𝐵𝐼𝐿𝑇𝐴𝑁𝐺𝐺𝑈𝑁𝐺𝐴𝑁 )+0.0006(𝐾𝐸𝐿𝐴𝑆𝐺𝑅𝐴𝐷𝐸𝑃𝐸𝐿𝐴𝐽𝐴𝑅 )+0.0034(𝑈𝑀𝑈𝑅)−3.5469
1+𝑒
4.1.6 Logistic Regression with Imputation and Transformation (Stepwise Selection)

In this case, the model selection method is set as “Stepwise” at the Model Selection
transformation (stepwise selection) model at depth = 60%
44
transformation (stepwise selection) model at depth = 100%
Figure 56 and 57 above shows the outputs of scoring ranking overlay. According to the
scoring ranking overlay in Figure 56, it is observed that at depth = 60% in which around 4564
45
(stepwise selection)
Figure 59. Summary of stepwise selection for logistic regression with imputation and
transformation (stepwise selection) model
46
and transformation (stepwise selection) model 1
47
misclassification rate is 8.94% for train error and 7.22% for test error. Figure shows the output
of analysis of maximum likelihood estimates. Based on Figure 60, 61 and 62, it is observed
that there are several variables which have the significant values that are less than the
of the students dropout from the university. These variables are KELAS_GRADE_PELAJAR
and IMP_BILANGAN_TANGGUNGAN. And the rest of the input variables that have
significance value greater than 0.05 as shown in Figure 60 and 61 are eliminated. It is noticed
that the coefficient value for variable KELAS_GRADE_PELAJAR is greater than the
coefficient value for variable IMP_BILANGAN_TANGGUNGAN indicates that the students’
performance is significantly bring more impact on the prediction of the students dropout from
the university compared the family burden. Therefore, the equation for this logistic regression
that is with imputation and transformation and the model selection method is set as stepwise is:
0.0001(𝐼𝑀𝑃_𝐵𝐼𝐿𝑇𝐴𝑁𝐺𝐺𝑈𝑁𝐺𝐴𝑁 )+0.0006(𝐾𝐸𝐿𝐴𝑆𝐺𝑅𝐴𝐷𝐸𝑃𝐸𝐿𝐴𝐽𝐴𝑅 )−2.8216
𝑒
𝐸𝑞𝑢𝑎𝑡𝑖𝑜𝑛 = 0.0001(𝐼𝑀𝑃_𝐵𝐼𝐿𝑇𝐴𝑁𝐺𝐺𝑈𝑁𝐺𝐴𝑁 )+0.0006(𝐾𝐸𝐿𝐴𝑆𝐺𝑅𝐴𝐷𝐸𝑃𝐸𝐿𝐴𝐽𝐴𝑅 )−2.8216
1+𝑒
48
4.1.7 Logistic Regression Models Comparison
Regression Model Model Selection Misclassification Rate
Method Train (%) Test (%)
Logistics Regression Default 11.48 9.98
without Imputation
Logistic Regression Default 8.98 7.18
with Imputation
Logistic Regression Default 8.98 7.18
with Imputation and Backward 8.98 7.22
Transformation Forward 8.98 7.18
Stepwise 8.94 7.22
Table 3. Comparison between different types of logistic regression models in term of
misclassification rate.
Table 3 above shows a comparison is conducted between different types of logistic
regression models in term of misclassification rate. According to the table, it is observed that
the lowest value of test error is 7.18%. Therefore, logistic regression without imputation
(default selection) model and logistic regression with imputation (default selection) model are
no longer in our consideration of choosing the best regression model for prediction of the
students will dropout from Universiti Teknologi Petronas. Meanwhile, it is observed that
logistic regression with imputation model for the three types of model selection: backward,
forward, and stepwise share the same misclassification rate.
4.2 Data Mining Technique Two: Decision Tree

In this study, there are six different decision tree modeling conducted based on different
splitting rules in terms of maximum branch and nominal target criterion for data mining
analysis on background from Universiti Teknologi Petronas (UTP) for predicting the status of
students dropout.
In this case, the decision tree models are constructed by using the decision tree node
contained in the Model tab in SAS Enterprise Miner as shown in figure below. In this section,
the results of the predictive modeling through classification decision tree analysis in terms of
the score ranking overlays, entire decision tree, splitting criteria and the number of splits are
examined and analyzed
49
Figure 63. Decision tree node that can be accessed from the Model tab.
4.2.1 Scoring ranking overlay
Figure 64. Scoring rankings overlay of decision tree model at depth= 60%
50
Figure 65. Scoring rankings overlay of decision tree model at depth= 100%
Figure 64 and 65 above shows the outputs of scoring ranking overlay for decision tree
model. The output of the scoring ranking overlay for each decision tree no matter what is the
maximum branches and the nominal target criterion is same. According to the scoring ranking
overlay in Figure 64, it 4.2.2 Decision Tree: 2 branches with Chi-Square as target criterionis
observed that at depth = 60% in which around 4565 students study in the university from the
total 7606 observation been included, the decision tree model predicts that 50.78% of the 4564
students dropout from the university . While according to the score ranking overlay in Figure
65, it is observed that at depth = 100% in which all the 7606 observation are included, the
model successfully predicts that 30.48% of all the 7606 students study in the university. When
comparing to the actual data in which 2319 out of 7606 that is equivalent to 30.49% of the
students dropout from the university which is close to the percentage of students dropout from
the university being predicted by the model, it is conclude that the model is accurate and
therefore it is useful to perform prediction.
51
4.2.2 Decision Tree: 2 branches with Chi-Square as target criterion
Figure 66. Decision tree diagram of 2 branches with Chi-Square as target criterion
Figure 66 above shows the overall decision tree diagram of 2 branches with Chi-
Square as target criterion. Based on the decision tree model above, the parent node or root
node is the variable is Tajaan. The total of the leaf nodes which is also known as terminal
nodes where it has exactly one incoming edge and no outgoing edge is 19 nodes. The set of
the rules that is represented by this decision model will be presented in Appendix 1.
Figure 67. Fit statistics table for decision tree model of 2 branches with Chi-Square as target
criterion.
The Figure 67 above show the fit statistics table for decision tree model of 2 branches
with Chi-Square as target criterion. Due to the decision tree is binary, then we will look for the
misclassification rate. The misclassification plot displays how many observations were
52
correctly and incorrectly classified. A significant number of misclassification might indicate
that the model does not fit the data. From the table above, we can notice that the
misclassification rate is 0.0748. It’s mean that the error rate for the 2 branches with Chi-Square
as target criterion only has 7.48%.
Figure 68. Variable important of the input variable for the decision tree model of 2 branches
with Chi-Square as target criterion
For Figure 68, it illustrates the number of splitting rules and important of the input
variables. It is observed that the variable Kelas_Grade_Pelajar has the highest importance
where the value of important equals to 1 indicates that the variable is the most significant in
this model and the numbered of splitting rules is equals to 8. This denotes that the variable is
split into 5 types of rules in this model which are shown in the Figure 68, first, less than berhenti
and greater or equal to berhenti or missing; second, less than gagal or missing and greater equal
to gagal; third, less than sederhana and greater equal to sederhana or missing; fourth, less than
gagal and grater equal to gagal or missing; fifth, less than berhenti or missing and greater equal
to berhenti; sixth, less than sederhana or missing and greater or equal to serderhana; and the
seventh and eighth are same which is less than cemerlang and greater equal to sederhana or
missing.
The second importance variable is Tajaan where the value of importance is equals to
0.4515, and the number of splitting rules is 2. The third importance variable is Bil_Tanggungan
with the value of importance equals to 0.1230 and the number of splitting rules equals to 2.
The fourth importance variable is Kategori_Kawasan_Tinggal where the importance value is
53
0.1056, and the number of splitting rules is 3. Next, Umur is the fifth importance variable with
the value of importance equals to 0.0795 and the number of splitting rules is 2. Lastly, the
Kumpulan_Pendapatan_Keluarge has the importance value of 0.0681 with the number of
splitting rules equals to 1. From the table, we can observed that the variable Program,
Kelas_Grade_SPM, Negeri and Jenis_Sekolah has the number equals to 0 in part number of
splitting and importance value. It means that, these variables are not importance to this decision
model of 2 branches with Chi-Square as target criterion.
4.2.3 Decision Tree: 2 branches with Gini as target criterion
Figure 69. Decision tree diagram of 2 branches with Gini as target criterion
Figure 69 above shows the overall decision tree diagram of 2 branches with Gini as
target criterion. Based on the decision tree model above, the parent node or root node is the
variable is Tajaan. The total of the leaf nodes which is also known as terminal nodes where it
has exactly one incoming edge and no outgoing edge is 19 nodes. The set of the rules that is
represented by this decision model will be presented in Appendix 2
Figure 70. Fit statistics table for decision tree model of 2 branches with Gini as target
criterion
54
with Gini as target criterion. Due to the decision tree is binary, then we will look for the
misclassification rate is 0.0814. It’s mean that the error rate for the 2 branches with Gini as
target criterion only has 8.14%.
with Gini as target criterion
split into 8 types of rules in this model which are shown in the Figure 71; first, less than berhenti
missing.
The fourth importance variable is Program where the importance value is 0.0906, and the
55
number of splitting rules is 2. Next,the fifth importance variable is
Kumpulan_Pendapatan_Keluarga with the value of importance equals to 0.0681 and the
number of splitting rules is 1. The sixth importance variable is Katergori_Kawasan_Tinggi
which has the importance value of 0.0512 and the number of splitting rules equals to 1. The
rest following importance variable is Kelas_Grade_SPM , Umur, and Jenis Sekolah with the
importance value of 0.0489, 0.0468 ,0.0324, and the number of splitting rules is equals to 1.
From the table, we can observed that the variable Negeri has the number equals to 0 in part
number of splitting and importance value. It means that, this variable is not significant to this
decision model of 2 branches with Gini as target criterion.
4.2.4 Decision Tree: 2 branches with Entropy as target criterion
Figure 72. Decision tree diagram of 2 branches with Entropy as target criterion
Figure 72 above shows the overall decision tree diagram of 2 branches with Entropy as
variable is Tajaan. The total of the leaf nodes which is also known as terminal nodes where it
has exactly one incoming edge and no outgoing edge is 21 nodes. The set of the rules that is
represented by this decision model will be presented in Appendix 3
56
Figure 73. Fit statistics table for decision tree model of 2 branches with Entropy as target
criterion
with Entropy as target criterion. Due to the decision tree is binary, then we will look for the
misclassification rate is 0.0858. It’s mean that the error rate for the 2 branches with Entropy
with Entropy as target criterion
split into 8 types of rules in this model which are shown in the Figure 74; first, less than berhenti
57
missing.
number of splitting rules is 3. Next, the fifth importance variable is
Kategory_Kawasan_Pendapatan with the value of importance equals to 0.0785 and the number
of splitting rules is 2. The rest following importance variable is
Kumpulan_Pendapatan_Keluarga, Kelas_Grade_SPM, and Umur with the importance value of
0.0681, 0.0489, 0.0468, and the number of splitting rules is equals to 1. From the table, we can
observed that the variable Jenis_Sekolah and Negeri has the number equals to 0 in part number
of splitting and importance value. It means that, these variables are not significant to this
decision model of 2 branches with Entropy as target criterion.
4.2.5 Decision Tree: 3 branches with Chi-Square as target criterion
Figure 75. Decision tree diagram of 3 branches with Chi-Square as target criterion
Figure 75 above shows the overall decision tree diagram of 3 branches with Chi-Square
as target criterion. Based on the decision tree model above, the parent node or root node is the
variable is Kelas_Grade_Pelajar. The total of the leaf nodes which is also known as terminal
58
nodes where it has exactly one incoming edge and no outgoing edge is 19 nodes. The set of the
rules that is represented by this decision model will be presented in Appendix 4.
Figure 76. Fit statistics table for decision tree model of 3 branches with Chi-Square
as target criterion.
misclassification rate is 0.0748. It’s mean that the error rate for the 3 branches with Chi-Square
with Chi-Square as target criterion
split into 2 types of rules in this model which are shown in the Figure 76; first, less than gagal
59
or missing and greater or equal to gagal but less than sederhana; second, greater or equal to
sederhana.
The fourth importance variable is Katergory_Kawasan_Tinggal where the importance value is
0.0864, and the number of splitting rules is 2. Next, the fifth importance variable is Umur with
the value of importance equals to 0.0800 and the number of splitting rules is 2.
Kumpulan_Pendapatan_Keluarga as the sixth importance variable with the important value
equals to 0.0634 with the number of splitting rules equal to 1. The following variables which
are Program, Kelas_Grade_SPM, Negeri, and Jenis_Sekolah have no significant to this
decision tree model of 3 branches with Chi-Square as target criterion. It is because the value of
important and number of splitting rules for these variables is all equals to 0.
4.2.6 Decision Tree: 3 branches with Gini as target criterio
Figure 78. Decision tree diagram of 3 branches with Gini as target criterion
Figure 78 above shows the overall decision tree diagram of 3 branches with Gini as target
criterion. Based on the decision tree model above, the parent node or root node is the variable
is Kelas_Grade_Pelajar. The total of the leaf nodes which is also known as terminal nodes
where it has exactly one incoming edge and no outgoing edge is 19 nodes. The set of the rules
that is represented by this decision model will be presented in Appendix 5
60
Figure 79. Fit statistics table for decision tree model of 3 branches with Gini as target
criterion
misclassification rate is 0.0783. It’s mean that the error rate for the 3 branches with Gini as
with Gini as target criterion
split into 2 types of rules in this model which are shown in the Figure 78; first, greater or equal
61
to cemerlang or missing; second, less than gagal or missing and greater or equal to gagal but
less than sederhana.
0.1737, and the number of splitting rules is 3. The third importance variable is Program with
the value of importance equals to 0.1504 and the number of splitting rules equals to 6. The
fourth importance variable is Bil_Tanggungan where the importance value is 0.1495, and the
number of splitting rules is 5. Next, the fifth importance variable is Umur with the value of
importance equals to 0.0995 and the number of splitting rules is 5.
Kumpulan_Pendapatan_Keluarga is the sixth importance variable with the important value
are Kelas_Grade_SPM, and Katergori_Kawasan_Tinggal is the seventh and eighth importance
variable where the importance value equals to 0.0729, 0.0688 and the number of splitting rules
equals to 4 and 2. Based on the table we can found that the importance value and the number
of splitting rules for variable Negeri and Jenis_Sekolah is equals to 0. It means that these two
variables have no significant to this decision tree model of 3 branches with Gini as target
criterion.
4.2.7 Decision Tree: 3 branches with Entropy as target criterion
Figure 81. Decision tree diagram of 3 branches with Entropy as target criterion
Figure 81 above shows the overall decision tree diagram of 3 branches with Entropy as
variable is Kelas_Grade_Pelajar. The total of the leaf nodes which is also known as terminal
62
nodes where it has exactly one incoming edge and no outgoing edge is 19 nodes. The set of the
rules that is represented by this decision model will be presented in Appendix 6.
Figure 82. Fit statistics table for decision tree model of 3 branches with Entropy as target
criterion.
with Entropy as target criterion. Due to the decision tree is binary, then we will look for the
misclassification rate is 0.0805. It’s mean that the error rate for the 3 branches with Entropy as
with Entropy as target criterion
63
split into 2 types of rules in this model which are shown in the Figure 81; first, greater or equal
to cemerlang or missing; second, less than gagal or missing and greater or equal to gagal but
less than sederhana.
number of splitting rules is 5. Next, the fifth importance variable is Umur with the value of
importance equals to 0.0992 and the number of splitting rules is 5.
Kumpulan_Pendapatan_Keluarga is the sixth importance variable with the important value
are Kelas_Grade_SPM, and Katergori_Kawasan_Tinggal is the seventh and eighth importance
variable where the importance value equals to 0.0676, 0.0599 and the number of splitting rules
equals to 3 and 1. Based on the table we can found that the importance value and the number
of splitting rules for variable Negeri and Jenis_Sekolah is equals to 0. It means that these two
variables have no significant to this decision tree model of 3 branches with Entropy as target
criterion.
4.2.8 Decision Tree Model Comparison

Decision Tree Splitting Rules Misclassification Rate
Maximum Nominal Target Train (%) Test (%)
Branch Criterion
1 With 2 branches Chi-Square 8.75 7.48
2 With 2 branches Gini 8.83 8.14
3 With 2 branches Entropy 8.81 8.58
5 With 3 branches Gini 8.06 7.83
6 With 3 branches Entropy 8.23 8.05
Table 4. Comparison between different types of decision tree models in terms of
misclassification rate.
Table 4 above show the comparison between different types of decision tree models in term of
misclassification rate after all the models are executed and analyzed. If a comparison is made
between the model among the decision tree with 2 branches only, based on the percentage f
64
testing error, we can observed that the decision tree model of 2 branches with Chi-Square as
target criterion has the lowest percent of testing error which is 7.48%. On the other hand, if a
comparison is made between the model among the decision tree with 3 branches only based on
the percentage of testing error, we can noticed that, the lowest percent of testing error which is
7.48% belongs to the decision tree model of 3 branches with Chi-Square as target criterion
which is same as the comparison among 2 branches decision tree. Based on the result
comparing the percentage of testing, we can notice that the decision tree model with Chi-Square
as target criterion has the lowest percent of testing error which is 7.48%, no matter the decision
tree is in 2 branches or 3 branches which will shown in the table 5 below:
Decision Tree Splitting Rules Misclassification Rate
Maximum Nominal Target Train (%) Test (%)
Branch Criterion
Table 5. Comparison between the decision tree models
Due to the same percentage of testing, we will look at the train to find the more preferred
decision tree. Based on the Table 5 above, we can noticed that the train for 2 branches with
Chi-Square as target criterion is 8.75%, however, the train for 3 branches with Chi-Square as
target criterion is 8.86%. By comparing these two decision tree models, the decision tree model
of 2 branches with Chi-Square as target criterion is more preferred as shown in Table 6 below.
Decision Tree Decision Tree Misclassification Rate Preferred
Model Train (%) Test (%)
1 With 2 8.75 7.48 ✓
branches
2 With 3 8.86 7.48
branches
Table 6. Comparison between the chosen decision tree models
4.3 Data Mining Technique Three: Neural Network

In this section, two neural network models are constructed with different number of
hidden units to gather output from the analysis. To construct a neural network, we have to pick
the neural network node above the Model tab in SAS Enterprise Miner below. The results of
65
the predictive modelling through classification by neural network analysis in terms of the score
ranking overlays, fit statistics, and Multilayer Perceptron (MLP) are analysed.
Figure 84. Neural Network node inside Model tab

4.3.1 Neural Network with 2 Hidden Units
By default, the neural network model are pre set in 3 hidden units. To change the
number of hidden units, simply go to Property table to click at the Train section as shown below.
Figure 85. Number of hidden units in Property Table
66
Figure 86. Score rankings overlay of 2 hidden units at depth = 30%
Figure 87. Score rankings overlay of 2 hidden units at depth = 100%

Figure 86 and above shows the outputs of score rankings overlay. According to Figure
86, it is observed that at depth = 30% in which around 2282 out of 7606 students drop out from
Universiti Teknologi Petronas (UTP). According to Figure 87, we can tell that at depth = 100%
in which all 7606 students are included, the model successfully classifies that 30.48% of all the
7606 students recorded drop out of Universiti Teknologi Petronas (UTP). Due to the close
value of actual data being recorded, the model is concluded and can be used to perform
predictions.
67
Figure 88. Multilayer perceptron (MLP) of neural network model with 2 hidden
units.
Figure 88 above shows the neural network architecture which is the MLP for neural
network model. Referring to the figure above, there are three layers which consists of the input
layer, hidden layer, and output layer. As default, the hidden layer is set to 1. However, the
number of nodes in the first layer are determined by the number of input variables listed inside
the model. In this model, there are 10 nodes which consists of UMUR, PROGRAM, NEGERI,
KATEGORI_KAWASAN_TINGGAL, KUMPULAN_PENDAPATAN_KELUARGA,
BIL_TANGGUNGAN, TAJAAN, KELAS_GRADE_PELAJAR, KELAS_GRADE_SPM,
JENIS_SEKOLAH, and STATUS. The hidden layer that consists of node i and node j is
68
represented by a set of 2 hidden units: hidden unit 1 and hidden unit 2. Lastly, output layer
consists of one node which is node k. Each node has a weighted connection to the other nodes
in adjacent layers as shown in Figure 89 below which the estimates represent the weight values.
The output values are computed by default.
Figure 89. Optimization Result of Neural Network Model with 2 hidden units
Figure 89 above shows the optimization results of the neural network model with 2
hidden units. The Estimates column indicates the value of weight for each of the connected
nodes. For example, IMP_Bil_Tanggungan_H11 (node 1) shows that the imputed variable
BIL_TANGGUNGAN is connected to the first hidden node (node i) with the weight of 0.063
whereas IMP_Bil_Tanggungan_H12 (node 1) implies that the imputed variable
BIL_TANGGUNGAN is connected to another hidden unit where it is located in hidden layer
(node j) with the weight of about 0.023. Figure 90 below shows the fit statistics table for neural
69
network model of 2 hidden units. It can be concluded that the misclassification rate is 9.4% for
Train dataset and 7.6% for Test dataset.
Figure 90. Fit statistics table for neural network model with 2 hidden units
4.3.2 Neural Network Model with 3 Hidden Units

To build a neural network model with 3 hidden units, the number of hidden unit is set
as 3 (default) at the Train section of the Property table in SAS Enterprise Miner as shown in
Figure 91 below in order to execute the new model build for neural network model with 3
hidden units.
Figure 91. Setting the number of Hidden Units to “3”
70
Figure 92. Score Rankings Overlay of Neural Network Model with 3 Hidden Units at Depth
= 30%
Figure 93. Score Rankings Overlay of Neural Network Model with 3 Hidden Units at Depth
= 100%
Figure 92 and 93 above shows the output of score rankings overlay of the neural
network model constructed. According to the figure, it is observed that at depth = 30% in which
around 2282 out of 7606 students, the model classifies that as much as 82.97% of the students
recorded drop out from university. On the second figure, it can be seen that at depth = 100% in
which all 7606 observations are being made, the model successfully classify that 30.48% of
71
the 7606 students recorded drop out from university. In comparison to the actual data, we can
conclude that the model is accurate and hence is useful to perform prediction.
Figure 94. Multilayer perceptron (MLP) of neural network model with 3 hidden units
Figure 94 above demonstrates the neural network architecture which is the multilayer
perceptron (MLP) for neural network model with 3 hidden units in the hidden layer. Referring
to the figure above, there are three layers which consists of the input layer, hidden layer, and
output layer. As default, the hidden layer is set to 1. However, the number of nodes in the first
72
layer are determined by the number of input variables listed inside the model. In this model,
there are 10 nodes which consists of UMUR, PROGRAM, NEGERI,
KATEGORI_KAWASAN_TINGGAL, KUMPULAN_PENDAPATAN_KELUARGA,
BIL_TANGGUNGAN, TAJAAN, KELAS_GRADE_PELAJAR, KELAS_GRADE_SPM,
JENIS_SEKOLAH, and STATUS. The hidden layer that consists of node h, node i, and node
j is represented by a set of 3 hidden units: hidden unit 1, hidden unit 2 and hidden unit 3. Lastly,
output layer consists of one node which is node k. Each node has a weighted connection to the
other nodes in adjacent layers as shown in Figure 95 below which the estimates represent the
weight values. The output values are computed by default.
Figure 95. Optimization Result of Neural Network Model with 3 hidden units
73
Figure 95 above shows the optimization result as part of the output of the neural
network model with 3 hidden units. The Estimates column represents the values of weighted
for each of the connected nodes. For example, IMP_Bil_Tanggungan_Hl1 (node 1) implies
that the imputed variable BIL_TANGGUNGAN is connected to the first hidden units (node h)
with the weight value of approximately -0.04 whereas IMP_Bil_Tanggungan_Hl2 (node 1)
implies that the imputed variable BIL_TANGGUNGAN is connected to another hidden units
which is the second hidden units in the hidden layer (node i) with the weight value of
approximately -0.06. And then for IMP_Bil_Tanggungan_Hl3 (node 1) implies that the
imputed variable BIL_TANGGUNGAN is again connected but to the third hidden units (node
j) with the weight value of approximately -0.15. Figure 96 below shows the fit statistics table
for neural network model of 2 hidden units. It is observed that the misclassification rate is
12.17% for train error and 23.76% for test error.
Figure 96. Fit statistics table for neural network model of 3 hidden units
4.3.3 Neural Network Mode

Misclassification Rate
Neural Network Preferred
Train (%) Test (%)
with 2 Hidden Units 9.40% 7.70%
with 3 Hidden Units 9.10% 7.50% ✓
Table 6. Comparison between two neural network models in term of misclassification rate
Table 6 above highlights the comparison between two neural network models in term
of misclassification rate after the model is being executed. Based on the table 6, the rate of
misclassification of neural network model with 3 hidden units are lower than neural network
model with 2 hidden units in terms of Train and Test dataset. Therefore, we can conclude that
74
neural network model with 3 hidden units is preferable when compared to neural network
model with 2 hidden units in this particular case study.
75
5.0 MODEL COMPARISON
Model Misclassification Rate
Training(%) Testing(%)
1. Logistics Logistic Regression without 11.48 9.98
Regression Imputation (Default Selection)
2. Logistic Regression with Imputation 8.98 7.18
(Default Selection)
and Transformation (Default
Selection)
and Transformation (Backward
Selection)
and Transformation (Forward
Selection)
and Transformation (Stepwise
Selection)
7. Decision Decision Tree with Chi-Square 2 8.75 7.48
Tree Branch
8. Decision Tree with Gini 2 Branch 8.83 8.14
9. Decision Tree with Entropy 2 Branch 8.81 8.58
10. Decision Tree with Chi-Square 3 8.86 7.48
Branch
11. Decision Tree with Gini 3 Branch 8.06 7.83
12. Decision Tree with Entropy 3 Branch 8.23 8.05
13. Neural Neural Network 2 Hidden Layer 9.43 7.66
14. Network Neural Network 3 Hidden Layer 9.06 7.53
Table 7. Overview on the misclassification rate for all the models constructed.
Table 7 above shows overall view on the misclassification rate for all the models
constructed for logistic regression, decision tree, and neural network. Model comparison in
term of misclassification rate refers to the comparison between models constructed by
76
assessing the percentage level of both the train error and test error. First, choose the model with
the least percent of test error which it has the lowest percentage level will be considered as the
best model. If more than one model has the same percent of test error, then the percentage level
of train error of those models needed to be compared. In this study, two phases of comparisons
between models constructed is to determine the best model for Universiti Teknologi Petronas
to know a more accurate student dropout from the university. At the first phase, a comparison
is conducted within each data mining prediction method: logistic regression, decision tree, and
neural network, and then the preferred models are selected to be proceeded for the next
comparison, the best model from all the 14 models will be selected. In this case, since three of
the model have the same error of test and training so there are three best model that has the
least percent of test error which are Logistic Regression with Imputation (Default Selection),
Logistic Regression with Imputation and Transformation (Default Selection) and Logistic
Regression with Imputation and Transformation (Forward Selection).
77
6.0 CONCLUSION
After going through the process of identifying, data processing, trial-and-error, and
discovery, the proposed methodology has shown to be valid for predicting university student’s
dropout. To carry out this research, 3 data mining technique has been chosen to execute and
examine the behaviour of university students’ dropout, it consists of 1) Regression, 2) Decision
Tree, and 3) Neural Network. Diving deeper to each of the techniques, several sub-methods
had been implemented in order to cover a wider perspective and gain useful insights.
To obtain the best data mining technique for this particular case study, ‘Model
Comparison’ function has been executed. As a result, there are three technique that has the
equal least percentage of misclassification of 7.18% rate under Test dataset which makes it
qualified as the best model for this study. They are: 1) Logistic Regression with Imputation
(Default selection), 2) Logistic Regression with Imputation and Transformation (Default
selection), and 3) Logistic Regression with Imputation and Transformation (Forward selection).
To put things into perspective, the result obtained shows that in future, Universiti
Teknologi Petronas (UTP) could carry out any of the best modelling technique stated above to
understand and predict the behaviour of a particular student from the university. By knowing
the underlying pattern among university students, staff in Universiti Teknologi Petronas (UTP)
could recognise the potential of dropout in any of its students and hence provide care and
attention into solving the matter before it solidifies.
According to the results of regression technique, the most significant variable for the
prediction of the students dropout from the university is students’ performance and followed
by family burden and age. Besides that, according to the results of the 6 decision tree models,
we know that students’ performance is the most important point to identify whether the students
dropout or do not dropout from the university and followed by sponsorship, family burden,
living area, age and family income. Thus, we can conclude that students’ performance is the
main reason that students dropout from Universiti Teknologi Petronas.
In this line, it is important to realize that identifying students at risk of dropping out by
using the best method in this study is only the first step in truly addressing the issue of school
dropout. The next step is to identify the specific needs and problems of each individual student
who is in danger of dropping out and then to implement programmes to provide effective and
appropriate dropout ‐ prevention strategies. Therefore, stakeholders should be able to attend to
students' needs to help them in time to avoid dropout. As the pace of this society is increasing,
it is important for everyone to be a knowledgeable person not only to keep up but also to make
themselves useful in the journey of making this world a better place.
78
REFERENCE
Bjerk, D. (2012). Re-examining the impact of dropping out on criminal and labor outcomes in
early adulthood. Economics of Education Review, 31, 110 –122. Retrieved from
http://dx.doi.org/10.1016/j.econedurev .2011.09.003
Rouse, C. E. (2005). The labor market consequences of an inadequate education. Symposium

on the Social Costs of Inadequate Education, Teachers College, Columbia University.
Wang Deze. (October, 2018). Application of large data mining technology in Colleges
and Universities. ICBDR 2018 Proceedings of the 2nd International Conference on Big
Data Research, p86-89.
Rubén Manrique, Bernardo Pereira Nunes, Olga Marino, Marco Antonio Casanova, and Terhi
Nurmikko-Fuller. (Mar, 2019). An Analysis of Student Representation, Representative
Features and Classification Algorithms to Predict Degree Dropout. LAK19
Proceedings of the 9th International Conference on Learning Analytics & Knowledge,
p401-410.
Kyehong Kang, Sujing Wang. (Mar, 2018). Analyze and Predict Student Dropout from Online
Programs. ICCDA 2018 Proceedings of the 2nd International Conference on Compute
and Data Analysis, p6-12.
Sattar Ameri, Mahtab J. Fard, Ratna B. Chinnam, Chandan K. Reddy. (2016). Survival Analysis
based Framework for Early Prediction of Student Dropouts. CIKM '16 Proceedings of
the 25th ACM International on Conference on Information and Knowledge
Management, p903-912.
79
APPENDIX
Appendix 1: Decision Tree Rule of 2 Branches with Chi-Square as Target Criterion
*------------------------------------------------------------*
Node = 8
*------------------------------------------------------------*
if Umur < 18.5 or MISSING
AND Tajaan IS ONE OF: SENDIRI, FELDA
AND Kelas_Grade_Pelajar <= BAIK
then
Tree Node Identifier = 8
Number of Observations = 87
Predicted: STATUS__NEW=1 = 0.17
*------------------------------------------------------------*
Node = 10
*------------------------------------------------------------*
if Tajaan IS ONE OF: SENDIRI, FELDA
AND Kelas_Grade_Pelajar >= BERHENTI AND Kelas_Grade_Pelajar <= BERHENTI
then
*------------------------------------------------------------*
Node = 14
*------------------------------------------------------------*
if Tajaan IS ONE OF: MAIPK, PTPTN, PTPTN/MAIPK, MAIS or MISSING
AND Kelas_Grade_Pelajar >= GAGAL AND Kelas_Grade_Pelajar <= GAGAL
then
*------------------------------------------------------------*
Node = 16
*------------------------------------------------------------*
if Umur >= 18.5
AND Kategori_Kawasan_Tinggal < 2.5 or MISSING
80
then
*------------------------------------------------------------*
Node = 17
*------------------------------------------------------------*
if Umur >= 18.5
AND Kategori_Kawasan_Tinggal >= 2.5
then
*------------------------------------------------------------*
Node = 20
*------------------------------------------------------------*
AND Kelas_Grade_Pelajar <= BAIK or MISSING
then
*------------------------------------------------------------*
Node = 21
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 22
81
*------------------------------------------------------------*
AND Kelas_Grade_Pelajar >= CEMERLANG AND Kelas_Grade_Pelajar <=
CEMERLANG or MISSING
AND Bil_Tanggungan equals All Values
then
*------------------------------------------------------------*
Node = 23
*------------------------------------------------------------*
AND Bil_Tanggungan equals Missing
then
*------------------------------------------------------------*
Node = 26
*------------------------------------------------------------*
AND Kelas_Grade_Pelajar >= SEDERHANA or MISSING
AND Bil_Tanggungan < 5.5
then
*------------------------------------------------------------*
Node = 28
*------------------------------------------------------------*
CEMERLANG
82
then
*------------------------------------------------------------*
Node = 30
*------------------------------------------------------------*
AND Kelas_Grade_Pelajar >= GAGAL AND Kelas_Grade_Pelajar <= GAGAL or
MISSING
then
*------------------------------------------------------------*
Node = 36
*------------------------------------------------------------*
if Tajaan IS ONE OF: PTPTN
AND Bil_Tanggungan >= 5.5 or MISSING
then
*------------------------------------------------------------*
Node = 38
*------------------------------------------------------------*
CEMERLANG
then
83
*------------------------------------------------------------*
Node = 39
*------------------------------------------------------------*
if Umur >= 18.5
CEMERLANG
then
*------------------------------------------------------------*
Node = 40
*------------------------------------------------------------*
AND Kumpulan_Pendapatan_Keluarga <= M40 or MISSING
AND Kelas_Grade_Pelajar >= SEDERHANA
then
*------------------------------------------------------------*
Node = 41
*------------------------------------------------------------*
AND Kumpulan_Pendapatan_Keluarga >= T20
then
*------------------------------------------------------------*
Node = 42
*------------------------------------------------------------*
if Tajaan IS ONE OF: MAIPK, PTPTN/MAIPK or MISSING
84
then
*------------------------------------------------------------*
Node = 43
*------------------------------------------------------------*
then
85
Appendix 2: Decision Tree Rule of 2 Branches with Gini as Target Criterion
*------------------------------------------------------------*
Node = 8
*------------------------------------------------------------*
AND Program IS ONE OF: DIPLOMA PERAKAUNAN, DIPLOMA PERBANKAN
DAN KEWANGAN I, DIPLOMA PENGAJIAN ISLAM, DIPLOMA KAUNSELING
ISLAMI, DIPLOMA PENTADBIRAN PERNIAGAAN, DIPLOMA MULTIMEDIA
DAN DAKWAH
then
*------------------------------------------------------------*
Node = 10
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 14
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 15
*------------------------------------------------------------*
86
then
*------------------------------------------------------------*
Node = 18
*------------------------------------------------------------*
AND Program IS ONE OF: DIPLOMA TEKNOLOGI MAKLUMAT or MISSING
AND Kelas_Grade_SPM <= BAIK
then
*------------------------------------------------------------*
Node = 19
*------------------------------------------------------------*
AND Kelas_Grade_SPM >= CEMERLANG or MISSING
then
*------------------------------------------------------------*
Node = 23
*------------------------------------------------------------*
then
87
*------------------------------------------------------------*
Node = 24
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 25
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 36
*------------------------------------------------------------*
MISSING
then
*------------------------------------------------------------*
Node = 38
*------------------------------------------------------------*
88
then
*------------------------------------------------------------*
Node = 52
*------------------------------------------------------------*
AND Program IS ONE OF: DIPLOMA SYARIAH ISLAMIYYAH, DIPLOMA
TEKNOLOGI MAKLUMAT, DIPLOMA PERAKAUNAN, DIPLOMA PERBANKAN
DAN KEWANGAN I, DIPLOMA PENGURUSAN MUAMALAT, DIPLOMA
PENTADBIRAN PERNIAGAAN, DIPLOMA MULTIMEDIA DAN DAKWAH or
MISSING
CEMERLANG
then
*------------------------------------------------------------*
Node = 53
*------------------------------------------------------------*
MISSING
CEMERLANG
then
89
*------------------------------------------------------------*
Node = 54
*------------------------------------------------------------*
AND Program IS ONE OF: DIPLOMA PENGAJIAN ISLAM, DIPLOMA
KAUNSELING ISLAMI
CEMERLANG
then
*------------------------------------------------------------*
Node = 55
*------------------------------------------------------------*
if Umur >= 18.5
KAUNSELING ISLAMI
CEMERLANG
then
*------------------------------------------------------------*
Node = 56
*------------------------------------------------------------*
then
*------------------------------------------------------------*
90
Node = 57
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 60
*------------------------------------------------------------*
AND Jenis_Sekolah IS ONE OF: SK or MISSING
then
*------------------------------------------------------------*
Node = 61
*------------------------------------------------------------*
AND Jenis_Sekolah IS ONE OF: SA
then
91
Appendix 3: Decision Tree Rule of 2 Branches with Entropy as Target Criterion
*------------------------------------------------------------*
Node = 8
*------------------------------------------------------------*
ISLAMI, DIPLOMA PENTADBIRAN PERNIAGAAN, DIPLOMA MULTIMEDIA
DAN DAKWAH
then
*------------------------------------------------------------*
Node = 10
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 14
*------------------------------------------------------------*
then
92
*------------------------------------------------------------*
Node = 18
*------------------------------------------------------------*
AND Kelas_Grade_SPM <= BAIK
then
*------------------------------------------------------------*
Node = 19
*------------------------------------------------------------*
AND Kelas_Grade_SPM >= CEMERLANG or MISSING
then
*------------------------------------------------------------*
Node = 22
93
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 23
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 24
*------------------------------------------------------------*
then
94
*------------------------------------------------------------*
Node = 25
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 28
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 36
*------------------------------------------------------------*
MISSING
then
95
*------------------------------------------------------------*
Node = 52
*------------------------------------------------------------*
MISSING
CEMERLANG
then
*------------------------------------------------------------*
Node = 53
*------------------------------------------------------------*
MISSING
CEMERLANG
96
then
*------------------------------------------------------------*
Node = 54
*------------------------------------------------------------*
KAUNSELING ISLAMI
CEMERLANG
then
*------------------------------------------------------------*
Node = 55
*------------------------------------------------------------*
if Umur >= 18.5
KAUNSELING ISLAMI
CEMERLANG
then
97
*------------------------------------------------------------*
Node = 56
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 57
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 70
*------------------------------------------------------------*
98
USULUDDIN, DIPLOMA PENGURUSAN MUAMALAT, DIPLOMA KAUNSELING
ISLAMI, DIPLOMA KOMUNIKASI ISLAM
then
*------------------------------------------------------------*
Node = 71
*------------------------------------------------------------*
AND Program IS ONE OF: DIPLOMA BAHASA DAN KESUSASTERAAN, DIPLOMA
DAN KEWANGAN I, DIPLOMA PENGAJIAN ISLAM, DIPLOMA PENTADBIRAN
PERNIAGAAN, DIPLOMA MULTIMEDIA DAN DAKWAH or MISSING
then
*------------------------------------------------------------*
Node = 72
*------------------------------------------------------------*
99
then
*------------------------------------------------------------*
Node = 73
*------------------------------------------------------------*
then
100
Appendix 4: Decision Tree Rule of 3 Branches with Chi-Square as Target Criterion
*------------------------------------------------------------*
Node = 3
*------------------------------------------------------------*
if Kelas_Grade_Pelajar >= BERHENTI AND Kelas_Grade_Pelajar <= BERHENTI
then
*------------------------------------------------------------*
Node = 6
*------------------------------------------------------------*
if Tajaan IS ONE OF: MAIPK
then
*------------------------------------------------------------*
Node = 7
*------------------------------------------------------------*
if Tajaan IS ONE OF: PTPTN, PTPTN/MAIPK
then
101
*------------------------------------------------------------*
Node = 9
*------------------------------------------------------------*
if Kelas_Grade_Pelajar >= GAGAL AND Kelas_Grade_Pelajar <= GAGAL
then
*------------------------------------------------------------*
Node = 11
*------------------------------------------------------------*
AND Tajaan IS ONE OF: SENDIRI or MISSING
then
*------------------------------------------------------------*
Node = 23
*------------------------------------------------------------*
if Tajaan IS ONE OF: MAIPK, PTPTN/MAIPK
then
102
*------------------------------------------------------------*
Node = 24
*------------------------------------------------------------*
if Umur >= 18.5
then
*------------------------------------------------------------*
Node = 25
*------------------------------------------------------------*
if Umur >= 18.5
then
*------------------------------------------------------------*
Node = 33
*------------------------------------------------------------*
if Tajaan IS ONE OF: SENDIRI
AND Kategori_Kawasan_Tinggal < 1.5
103
then
*------------------------------------------------------------*
Node = 34
*------------------------------------------------------------*
AND Kategori_Kawasan_Tinggal < 2.5 AND Kategori_Kawasan_Tinggal >= 1.5 or
MISSING
then
*------------------------------------------------------------*
Node = 36
*------------------------------------------------------------*
if Tajaan IS ONE OF: MAIPK, PTPTN, PTPTN/MAIPK or MISSING
then
104
*------------------------------------------------------------*
Node = 37
*------------------------------------------------------------*
if Tajaan IS ONE OF: MAIPK, PTPTN, PTPTN/MAIPK or MISSING
then
*------------------------------------------------------------*
Node = 38
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 39
*------------------------------------------------------------*
then
105
*------------------------------------------------------------*
Node = 40
*------------------------------------------------------------*
if Tajaan IS ONE OF: PTPTN or MISSING
then
*------------------------------------------------------------*
Node = 41
*------------------------------------------------------------*
AND Bil_Tanggungan >= 4.5
then
*------------------------------------------------------------*
Node = 42
*------------------------------------------------------------*
106
then
*------------------------------------------------------------*
Node = 45
*------------------------------------------------------------*
AND Tajaan IS ONE OF: SENDIRI
then
*------------------------------------------------------------*
Node = 46
*------------------------------------------------------------*
if Umur >= 18.5
then
107
108
Appendix 5: Decision Tree Rule of 3 Branches with Gini as Target Criterion
*------------------------------------------------------------*
Node = 3
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 7
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 9
*------------------------------------------------------------*
then
*------------------------------------------------------------*
109
Node = 11
*------------------------------------------------------------*
if Tajaan IS ONE OF: SENDIRI or MISSING
AND Program IS ONE OF: DIPLOMA MULTIMEDIA DAN DAKWAH
then
*------------------------------------------------------------*
Node = 12
*------------------------------------------------------------*
ISLAMI, DIPLOMA PENTADBIRAN PERNIAGAAN
then
*------------------------------------------------------------*
Node = 14
*------------------------------------------------------------*
AND Bil_Tanggungan < 2.5 or MISSING
then
110
*------------------------------------------------------------*
Node = 15
*------------------------------------------------------------*
AND Bil_Tanggungan < 4.5 AND Bil_Tanggungan >= 2.5
then
*------------------------------------------------------------*
Node = 22
*------------------------------------------------------------*
if Tajaan IS ONE OF: PTPTN/MAIPK or MISSING
then
*------------------------------------------------------------*
Node = 30
*------------------------------------------------------------*
111
AND Kelas_Grade_SPM <= BAIK or MISSING
then
*------------------------------------------------------------*
Node = 31
*------------------------------------------------------------*
AND Kelas_Grade_SPM <= MINIMA AND Kelas_Grade_SPM >= CEMERLANG
then
*------------------------------------------------------------*
Node = 32
*------------------------------------------------------------*
AND Kelas_Grade_SPM >= SEDERHANA
then
112
*------------------------------------------------------------*
Node = 33
*------------------------------------------------------------*
if Umur < 18.5
AND Tajaan IS ONE OF: MAIPK
then
*------------------------------------------------------------*
Node = 34
*------------------------------------------------------------*
if Umur < 22.5 AND Umur >= 18.5 or MISSING
then
*------------------------------------------------------------*
Node = 35
*------------------------------------------------------------*
if Umur >= 22.5
113
then
*------------------------------------------------------------*
Node = 48
*------------------------------------------------------------*
if Tajaan IS ONE OF: MAIPK, PTPTN
then
*------------------------------------------------------------*
Node = 49
*------------------------------------------------------------*
then
114
*------------------------------------------------------------*
Node = 50
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 53
*------------------------------------------------------------*
AND Program IS ONE OF: DIPLOMA SYARIAH ISLAMIYYAH or MISSING
then
*------------------------------------------------------------*
Node = 55
*------------------------------------------------------------*
AND Program IS ONE OF: DIPLOMA PERBANKAN DAN KEWANGAN I, DIPLOMA
PENTADBIRAN PERNIAGAAN
then
115
*------------------------------------------------------------*
Node = 59
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 91
*------------------------------------------------------------*
TEKNOLOGI MAKLUMAT, DIPLOMA PERAKAUNAN or MISSING
AND Kumpulan_Pendapatan_Keluarga >= M40 AND Kumpulan_Pendapatan_Keluarga
<= M40
then
116
*------------------------------------------------------------*
Node = 92
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 93
*------------------------------------------------------------*
PENGURUSAN MUAMALAT, DIPLOMA PENTADBIRAN PERNIAGAAN,
DIPLOMA MULTIMEDIA DAN DAKWAH
then
*------------------------------------------------------------*
Node = 95
117
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 96
*------------------------------------------------------------*
KAUNSELING ISLAMI
then
*------------------------------------------------------------*
Node = 97
*------------------------------------------------------------*
if Umur >= 18.5
118
KAUNSELING ISLAMI
then
*------------------------------------------------------------*
Node = 108
*------------------------------------------------------------*
AND Program IS ONE OF: DIPLOMA TEKNOLOGI MAKLUMAT, DIPLOMA
PERAKAUNAN, DIPLOMA PENGAJIAN ISLAM, DIPLOMA KAUNSELING
ISLAMI, DIPLOMA MULTIMEDIA DAN DAKWAH
then
*------------------------------------------------------------*
Node = 112
*------------------------------------------------------------*
AND Program IS ONE OF: DIPLOMA PENGURUSAN MUAMALAT, DIPLOMA
PENGAJIAN ISLAM, DIPLOMA KAUNSELING ISLAMI, DIPLOMA BAHASA
ARAB DENGAN PENDI or MISSING
119
then
*------------------------------------------------------------*
Node = 113
*------------------------------------------------------------*
DAN KEWANGAN I, DIPLOMA PENTADBIRAN PERNIAGAAN, DIPLOMA
KOMUNIKASI ISLAM, DIPLOMA SAINS KOMPUTER DAN RANGK, DIPLOMA
MULTIMEDIA DAN DAKWAH
then
*------------------------------------------------------------*
Node = 114
*------------------------------------------------------------*
ISLAMI, DIPLOMA PENTADBIRAN PERNIAGAAN, DIPLOMA KOMUNIKASI
ISLAM
120
then
*------------------------------------------------------------*
Node = 116
*------------------------------------------------------------*
AND Program IS ONE OF: DIPLOMA SAINS KOMPUTER DAN RANGK, DIPLOMA
then
*------------------------------------------------------------*
Node = 117
*------------------------------------------------------------*
AND Tajaan IS ONE OF: PTPTN or MISSING
then
121
*------------------------------------------------------------*
Node = 118
*------------------------------------------------------------*
if Umur >= 18.5
then
*------------------------------------------------------------*
Node = 123
*------------------------------------------------------------*
AND Tajaan IS ONE OF: MAIPK, PTPTN/MAIPK
AND Kategori_Kawasan_Tinggal < 3.5 AND Kategori_Kawasan_Tinggal >= 2.5
then
*------------------------------------------------------------*
Node = 124
*------------------------------------------------------------*
if Umur >= 21.5
122
then
*------------------------------------------------------------*
Node = 125
*------------------------------------------------------------*
AND Kelas_Grade_SPM <= CEMERLANG
then
*------------------------------------------------------------*
Node = 126
*------------------------------------------------------------*
AND Kelas_Grade_SPM >= MINIMA or MISSING
then
123
*------------------------------------------------------------*
Node = 157
*------------------------------------------------------------*
AND Kumpulan_Pendapatan_Keluarga <= B40 or MISSING
then
*------------------------------------------------------------*
Node = 158
*------------------------------------------------------------*
then
*------------------------------------------------------------*
124
Node = 159
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 165
*------------------------------------------------------------*
AND Kumpulan_Pendapatan_Keluarga <= B40
then
*------------------------------------------------------------*
125
Node = 166
*------------------------------------------------------------*
AND Kumpulan_Pendapatan_Keluarga >= M40 or MISSING
then
*------------------------------------------------------------*
Node = 187
*------------------------------------------------------------*
AND Kategori_Kawasan_Tinggal < 2.5
then
*------------------------------------------------------------*
126
Node = 188
*------------------------------------------------------------*
AND Kategori_Kawasan_Tinggal >= 2.5 or MISSING
then
*------------------------------------------------------------*
Node = 189
*------------------------------------------------------------*
<= M40
AND Kelas_Grade_SPM <= CEMERLANG or MISSING
then
*------------------------------------------------------------*
127
Node = 190
*------------------------------------------------------------*
<= M40
AND Kelas_Grade_SPM >= MINIMA
then
*------------------------------------------------------------*
Node = 195
*------------------------------------------------------------*
USULUDDIN, DIPLOMA TEKNOLOGI MAKLUMAT
then
*------------------------------------------------------------*
Node = 196
128
*------------------------------------------------------------*
AND Kelas_Grade_SPM >= MINIMA AND Kelas_Grade_SPM <= MINIMA
then
*------------------------------------------------------------*
Node = 197
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 207
*------------------------------------------------------------*
129
DAN KEWANGAN I, DIPLOMA PENGAJIAN ISLAM or MISSING
then
*------------------------------------------------------------*
Node = 208
*------------------------------------------------------------*
if Umur < 20.5 AND Umur >= 19.5
then
*------------------------------------------------------------*
Node = 209
*------------------------------------------------------------*
if Umur >= 20.5
130
then
*------------------------------------------------------------*
Node = 216
*------------------------------------------------------------*
if Umur < 18.5
AND Program IS ONE OF: DIPLOMA SYARIAH ISLAMIYYAH
then
*------------------------------------------------------------*
Node = 217
*------------------------------------------------------------*
if Umur < 18.5
AND Program equals Missing
131
then
132
Appendix 6: Decision Tree Rule of 3 Branches with Entropy as Target Criterion
*------------------------------------------------------------*
Node = 3
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 7
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 9
*------------------------------------------------------------*
then
*------------------------------------------------------------*
133
Node = 11
*------------------------------------------------------------*
AND Program IS ONE OF: DIPLOMA MULTIMEDIA DAN DAKWAH
then
*------------------------------------------------------------*
Node = 12
*------------------------------------------------------------*
ISLAMI, DIPLOMA PENTADBIRAN PERNIAGAAN
then
*------------------------------------------------------------*
Node = 14
*------------------------------------------------------------*
AND Bil_Tanggungan < 2.5 or MISSING
then
134
*------------------------------------------------------------*
Node = 15
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 22
*------------------------------------------------------------*
if Tajaan IS ONE OF: PTPTN/MAIPK or MISSING
then
*------------------------------------------------------------*
Node = 30
*------------------------------------------------------------*
135
AND Kelas_Grade_SPM <= BAIK or MISSING
then
*------------------------------------------------------------*
Node = 31
*------------------------------------------------------------*
AND Kelas_Grade_SPM <= MINIMA AND Kelas_Grade_SPM >= CEMERLANG
then
*------------------------------------------------------------*
Node = 32
*------------------------------------------------------------*
then
136
*------------------------------------------------------------*
Node = 33
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 34
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 35
*------------------------------------------------------------*
if Umur >= 22.5
137
then
*------------------------------------------------------------*
Node = 43
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 45
*------------------------------------------------------------*
then
138
*------------------------------------------------------------*
Node = 46
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 47
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 51
*------------------------------------------------------------*
139
AND Bil_Tanggungan < 7.5 AND Bil_Tanggungan >= 6.5 or MISSING
then
*------------------------------------------------------------*
Node = 52
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 56
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 75
140
*------------------------------------------------------------*
<= M40
then
*------------------------------------------------------------*
Node = 76
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 79
*------------------------------------------------------------*
141
KAUNSELING ISLAMI
then
*------------------------------------------------------------*
Node = 80
*------------------------------------------------------------*
if Umur >= 18.5
KAUNSELING ISLAMI
then
*------------------------------------------------------------*
Node = 88
*------------------------------------------------------------*
<= M40
142
then
*------------------------------------------------------------*
Node = 89
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 91
*------------------------------------------------------------*
AND Program IS ONE OF: DIPLOMA PENGURUSAN MUAMALAT, DIPLOMA
PENGAJIAN ISLAM, DIPLOMA KAUNSELING ISLAMI, DIPLOMA BAHASA
ARAB DENGAN PENDI or MISSING
then
143
*------------------------------------------------------------*
Node = 92
*------------------------------------------------------------*
DAN KEWANGAN I, DIPLOMA PENTADBIRAN PERNIAGAAN, DIPLOMA
KOMUNIKASI ISLAM, DIPLOMA SAINS KOMPUTER DAN RANGK, DIPLOMA
then
*------------------------------------------------------------*
Node = 93
*------------------------------------------------------------*
ISLAMI, DIPLOMA PENTADBIRAN PERNIAGAAN, DIPLOMA KOMUNIKASI
ISLAM
then
144
*------------------------------------------------------------*
Node = 95
*------------------------------------------------------------*
AND Program IS ONE OF: DIPLOMA SAINS KOMPUTER DAN RANGK, DIPLOMA
then
*------------------------------------------------------------*
Node = 96
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 97
*------------------------------------------------------------*
if Umur >= 18.5
145
then
*------------------------------------------------------------*
Node = 102
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 103
*------------------------------------------------------------*
if Umur >= 21.5
then
146
*------------------------------------------------------------*
Node = 104
*------------------------------------------------------------*
AND Kelas_Grade_SPM <= CEMERLANG
then
*------------------------------------------------------------*
Node = 105
*------------------------------------------------------------*
AND Kelas_Grade_SPM >= MINIMA or MISSING
then
*------------------------------------------------------------*
Node = 131
*------------------------------------------------------------*
147
then
*------------------------------------------------------------*
Node = 132
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 133
*------------------------------------------------------------*
148
then
*------------------------------------------------------------*
Node = 155
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 156
*------------------------------------------------------------*
then
149
*------------------------------------------------------------*
Node = 157
*------------------------------------------------------------*
AND Bil_Tanggungan < 6.5 AND Bil_Tanggungan >= 3.5 or MISSING
then
*------------------------------------------------------------*
Node = 162
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 163
*------------------------------------------------------------*
150
AND Kelas_Grade_SPM >= MINIMA AND Kelas_Grade_SPM <= MINIMA
then
*------------------------------------------------------------*
Node = 164
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 174
*------------------------------------------------------------*
151
then
*------------------------------------------------------------*
Node = 175
*------------------------------------------------------------*
then
*------------------------------------------------------------*
Node = 176
*------------------------------------------------------------*
if Umur >= 20.5
152
then
*------------------------------------------------------------*
Node = 183
*------------------------------------------------------------*
if Umur < 18.5
AND Program IS ONE OF: DIPLOMA SYARIAH ISLAMIYYAH
then
*------------------------------------------------------------*
Node = 184
*------------------------------------------------------------*
if Umur < 18.5
AND Program equals Missing
153
then
154

Data Mining Report PDF

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Data Mining Report PDF

Uploaded by

Copyright:

Available Formats

SQIT 3033

KNOWLEDGE ACQUISITION IN DECISION MAKING

A182, Semester 2, Sesi 2018/19

FINAL GROUP PROJECT

Prepared by: Matrix No.

Lim Kui Chin 240760

Lecturer: Dr. Jastini Binti Mohd. Jamil

Submission Date: 22 May 2019

No. Topic Page

1.2 Problem Statement

1.4 Significance of Work

2.2 Application of large data mining technology in Colleges and Universities

2.3 Real World Implementation of Predictive Modelling

2.3.2 Analyze and Predict Student Dropout from Online Programs

Figure 1. Process flow of data mining analysis

Figure 2. Overview of the historical data 1

Figure 3. Overview of the historical data 2

Figure 10. Kelas_Grade_SPM

Figure 12. Negeri

Figure 14. Program

Figure 16. Status

Figure 18. Umur

Figure 19. Statistics table on the variables

Figure 20. Dataset allocations

3.2.3.2 Data Cleaning

3.2.3.3 Data Reduction

Figure 24. Conceptual framework for logistic regression models

Figure 25. A conceptual framework for decision tree models

4.1.1 Logistic Regression without Imputation (Default Selection)

4.1.2 Logistic Regression with Imputation (Default Selection)

4.1.3 Logistic Regression with Imputation and Transformation (Default Selection)

4.1.4 Logistic Regression with Imputation and Transformation (Backward Selection)

4.1.5 Logistic Regression with Imputation and Transformation (Forward Selection)

4.1.6 Logistic Regression with Imputation and Transformation (Stepwise Selection)

4.2 Data Mining Technique Two: Decision Tree

4.2.1 Scoring ranking overlay

4.2.3 Decision Tree: 2 branches with Gini as target criterion

4.2.4 Decision Tree: 2 branches with Entropy as target criterion

4.2.5 Decision Tree: 3 branches with Chi-Square as target criterion

4.2.6 Decision Tree: 3 branches with Gini as target criterio

4.2.7 Decision Tree: 3 branches with Entropy as target criterion

4.2.8 Decision Tree Model Comparison

4.3 Data Mining Technique Three: Neural Network

Figure 84. Neural Network node inside Model tab

Figure 85. Number of hidden units in Property Table

Figure 87. Score rankings overlay of 2 hidden units at depth = 100%

4.3.2 Neural Network Model with 3 Hidden Units

Figure 91. Setting the number of Hidden Units to “3”

4.3.3 Neural Network Mode

Rouse, C. E. (2005). The labor market consequences of an inadequate education. Symposium

You might also like