
Using Machine Learning to Predict Student Performance

Rohit Shivaji Adnaik

Department of Computer Science,

Shivaji University, Kolhapur

adnaik.rohit@gmail.com

This thesis examines the application of machine learning algorithms to predict whether a student will be successful or not. The specific focus of the thesis is the comparison of machine learning methods and feature engineering techniques in terms of how much they improve the prediction performance. Three different machine learning methods were used in this thesis: linear regression, decision trees, and naïve Bayes classification. Feature engineering, the process of modification and selection of the features of a data set, was used to improve the predictions made by these learning algorithms. Two different data sets containing records of student information were used. The machine learning methods were applied to both the raw version and the feature engineered version of the data sets to predict the student's success. The thesis comes to the same conclusion as the earlier studies: the results show that it is possible to predict student performance successfully by using machine learning. The best algorithm was naïve Bayes classification for the first data set, with 98 percent accuracy, and decision trees for the second data set, with 78 percent accuracy. Feature engineering was found to be a more important factor in prediction performance than method selection in the data used in this study.

Keywords and terms: student performance, machine learning, regression, naïve Bayes classification, decision trees.
Acknowledgements

I would like to express my sincere gratitude to Dr. Uramila Pol for the supervision, special guidance, suggestions, and encouragement provided throughout the development of this thesis.

Introduction

With the wide usage of computers and the internet, there has recently been a huge increase in publicly available data that can be analyzed. Be it online sales information, website traffic, or user habits, data is generated every day. Such a large amount of data presents both a problem and an opportunity. The problem is that it is difficult for humans to analyze such large amounts of data. The opportunity is that this type of data is ideal for computers to process, because it is stored digitally in a well-formatted way, and computers can process data much faster than humans.

The concept of machine learning is born out of this environment. Computers can analyze digital data to find patterns and laws in ways that are too complex for a human to do. The basic idea of machine learning is that a computer can automatically learn from experience (Mitchell, 1997). Although machine learning applications vary, the general function is similar throughout: the computer analyzes a large amount of data and finds patterns and rules hidden in the data. These patterns and rules are mathematical in nature, and they can be easily defined and processed by a computer. The computer can then use those rules to meaningfully characterize new data. The creation of rules from data is an automatic process, and it is something that continuously improves with newly presented data.

Applications of machine learning cover a wide range of areas. Search engines use machine learning to better construct relations between search phrases and web pages. By analyzing the content of websites, search engines can define which words and phrases are the most important in defining a certain web page, and they can use this information to return the most relevant results for a given search phrase (Witten et al., 2016). Image recognition technologies also use machine learning to identify particular objects in an image, such as faces (Alpaydin, 2004). First, the machine learning algorithm analyzes images that contain a certain object. If given enough images to process, the algorithm is able to determine whether an image contains that object or not (Watt et al., 2016). In addition, machine learning can be used to understand the kind of products a customer might be interested in. By analyzing the products a user has bought in the past, the computer can make suggestions about new products that the customer might want to buy (Witten et al., 2016). All these examples share the same basic principle: the computer processes data, learns to identify patterns in this data, and then uses this knowledge to make decisions about future data. The increase in data has made these applications more effective, and thus more common in use.

Depending on the type of input data, machine learning algorithms can be divided into supervised and unsupervised learning. In supervised learning, the input data comes with a known class structure (Mohri et al., 2012; Mitchell, 1997). This input data is known as training data. The algorithm is usually tasked with creating a model that can predict one of the properties by using the other properties. After a model is created, it is used to process data that has the same class structure as the input data. In unsupervised learning, the input data does not have a known class structure, and the task of the algorithm is to reveal a structure in the data (Sugiyama, 2015; Mitchell, 1997).
Methods

Machine learning basics

Definition

A common definition of machine learning is (Mitchell, 1997): "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." Basically, machine learning is the ability of a computer to learn from experience (Mitchell, 1997). Experience is usually given in the form of input data. Looking at this data, the computer can find dependencies in the data that are too complex for a human to form. Machine learning can be used to reveal a hidden class structure in unstructured data, or it can be used to find dependencies in structured data to make predictions. The latter is the main focus of this thesis.

Predictive analytics

Predictive analytics is the act of predicting future events and behaviors present in previously unseen data, using a model built from similar past data (Nyce, 2007; Shmueli, 2011). It has a wide range of applications in different fields, such as finance, education, healthcare, and law (Sas, 2017). The method of application in all these fields is similar. Using previously collected data, a machine learning algorithm finds the relations between different properties of the data. The resulting model is able to predict one of the properties of future data based on its other properties (Eckerson, 2007). Table 1 shows example data about students who passed or failed an exam, along with other information about the students.

age   gender   gpa   absence   passed
14    F        3.2   5         1
13    M        2.4   7         0
15    M        3     6         1

Table 1. Example data.

The aim is to predict whether the student has passed the exam or not by looking at the other variables (the columns of the table). In this case, the column "passed" is called the dependent variable, and every other variable is called an independent variable. In the "passed" column, "1" means the student has passed the exam and "0" means the student has failed the exam. By applying a machine learning algorithm to this data, a function can be created, also known as the prediction model, that gives the value of the dependent variable as output and takes every other variable as input. The act of creating a prediction model from previously known data is called training, and such data is called the training data or training set. After the model is created, it must be applied to another data set to test its effectiveness. Data used for such a purpose is called test data or the test set. The reason for using two different sets is to ensure that the model is flexible enough to be used on data sets other than the one it was built with. Otherwise, the problem of overfitting may occur, which is when a model is accurate on its original data set but performs poorly on other data sets because it is overly complicated (Srivastava, 2014). A common method to avoid overfitting is to divide the input data set into training and test sets. To evaluate the model with test data, the model is used to predict the dependent variable in the test set. Then, the predicted values and the actual values of the dependent variable are compared. Evaluation is more complicated than looking at the number of correct predictions; there are multiple different evaluation criteria.
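To make this comparison concrete, the short R sketch below (an illustration added here, not code reproduced from the thesis) builds a confusion matrix from a handful of hypothetical predicted and actual values and derives accuracy, precision, and recall from it; these are among the commonly used evaluation criteria.

# Hypothetical predicted and actual values of the dependent variable (1 = passed, 0 = failed)
actual    <- c(1, 0, 1, 1, 0, 1, 0, 1)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 1)

# Confusion matrix: rows are actual classes, columns are predicted classes
cm <- table(Actual = actual, Predicted = predicted)
print(cm)

tp <- cm["1", "1"]   # true positives: predicted pass, actually passed
tn <- cm["0", "0"]   # true negatives: predicted fail, actually failed
fp <- cm["0", "1"]   # false positives
fn <- cm["1", "0"]   # false negatives

accuracy  <- (tp + tn) / sum(cm)   # share of all predictions that are correct
precision <- tp / (tp + fp)        # share of predicted passes that really passed
recall    <- tp / (tp + fn)        # share of actual passes that were found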
Selected methods

There are numerous algorithms for creating a prediction model. This thesis uses three different algorithms: linear regression, decision trees, and the naïve Bayes classifier. While they all essentially have the same task, which is predicting a dependent variable based on independent variables, they are based on different mathematical methods.

Linear regression

The regression method takes a finite set of relations between the dependent variable and the independent variables, and creates a continuous function generalizing these relations (Watt et al., 2016). Table 2 shows another data set containing information about students.

age   passed
14    1
13    0
15    1

Table 2. Student data.

For the sake of simplicity, the data has only one independent variable. Figure 1 depicts a two-dimensional graph that shows the relation between the student age and the dependent variable indicating whether they have passed the exam or not.
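As an illustration of how such a regression can be fitted (a minimal sketch assuming R's built-in lm() function; the thesis does not reproduce its exact code), the data of Table 2 can be modeled as follows:

# Data from Table 2: one independent variable (age) and the dependent variable (passed)
students <- data.frame(age = c(14, 13, 15), passed = c(1, 0, 1))

# Fit a linear regression of the dependent variable on the independent variable
fit <- lm(passed ~ age, data = students)
summary(fit)   # the coefficients describe the fitted line

# The continuous output can be turned into a pass/fail prediction with a cut-off,
# here 0.5, which is an assumption for illustration rather than the thesis's choice
new_student     <- data.frame(age = 14)
predicted_value <- predict(fit, newdata = new_student)
predicted_pass  <- as.integer(predicted_value >= 0.5)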
Naïve Bayes classifier

Naïve Bayes classification is a machine learning method relying on Bayes' theorem:

P(A|B) = P(B|A) P(A) / P(B)

where A and B are two different events, P(A) and P(B) are the probabilities of A and B occurring, respectively, and P(A|B) is the probability of A occurring given that B has already occurred (Islam et al., 2007). This equation is used to calculate the probability of the dependent variable having a certain value. In the following, Bayes' theorem is applied to classify a student having a GPA of 3 and an age of 12, using the data given in the following table:

age   gpa   passed
14    3.2   1
13    2.4   0
15    3     1

First, the probability of "Pass" being 1 under the specified conditions must be calculated. This is denoted by

P(Pass=1 | GPA=3, Age=12)

Using Bayes' formula, this is equal to

P(Pass=1 | GPA=3, Age=12) = P(GPA=3, Age=12 | Pass=1) P(Pass=1) / P(GPA=3, Age=12)

Using the chain rule of conditional probability, the first part of the numerator can be expanded to produce this equation:

P(Pass=1 | GPA=3, Age=12) = P(GPA=3 | Pass=1) P(Age=12 | Pass=1) P(Pass=1) / P(GPA=3, Age=12)

Now, the numerator can be calculated:

P(Pass=1 | GPA=3, Age=12) = (0.5 × 0.5 × 0.4) / P(GPA=3, Age=12) = 0.1 / P(GPA=3, Age=12)

Taking the same steps for the "Pass" value being 0, the resulting equation is:

P(Pass=0 | GPA=3, Age=12) = (0.33 × 0.33 × 0.6) / P(GPA=3, Age=12) = 0.067 / P(GPA=3, Age=12)

Last, the probability of "Pass" being 0 is compared to the probability of "Pass" being 1. Since the expression P(GPA=3, Age=12) is the same in both formulas, comparing the numerators is sufficient. The predicted "Pass" value is 1 for the "Age" value 12 and the "GPA" value 3, because the probability for passing (0.1) is greater than that of failing (0.067). This is a very basic example, where only some value combinations have a probability. In an actual implementation, a distribution needs to be used. This classification method assumes that the features, in this case "Age" and "GPA", are independent of each other, meaning that the occurrence of one does not affect the probability of the other. The independence assumption is the reason for the phrase "naïve" (Friedman, 2001).
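In practice this calculation does not have to be done by hand. The following minimal R sketch assumes the naiveBayes() function from the e1071 package (the thesis does not state which implementation it used) and illustrative toy data; for numeric features such as age and GPA this implementation models each feature with a Gaussian distribution per class, which corresponds to the remark above that an actual implementation needs a distribution.

library(e1071)   # provides naiveBayes(); assumed here, not confirmed by the thesis

# Illustrative toy data in the same format as the worked example
train <- data.frame(age    = c(14, 15, 14, 13, 12, 13),
                    gpa    = c(3.2, 3.0, 3.4, 2.4, 2.1, 2.6),
                    passed = factor(c(1, 1, 1, 0, 0, 0)))

# Train the classifier: each feature is treated as independent given the class
model <- naiveBayes(passed ~ age + gpa, data = train)

# Predict the class of a new student and inspect the posterior probabilities
new_student <- data.frame(age = 12, gpa = 3)
predict(model, new_student)                  # predicted class
predict(model, new_student, type = "raw")    # class probabilities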
Feature engineering

In machine learning, feature engineering is the process of selecting or creating features (variables) in a data set to improve machine learning results (Domingos, 2012). Feature selection can include removing unnecessary or redundant features. The process of removing unnecessary variables requires assessing the relevance of the variable. This can be done by creating a model to test the correlation of the variable with the dependent variable. Feature creation includes modifying the variables and creating new ones by combining multiple different variables (Kern, 2014). The first use of feature engineering in this thesis is the selection of the relevant variables. Input data may contain too many variables, some of which do not improve the prediction performance and thus make the predictive model overly complicated. In such a case, unnecessary variables must be removed to make the model more efficient. Deciding which variable to remove can be done manually using domain knowledge, or it can be done automatically (Domingos, 2012). In the case of this thesis, feature selection was done by observing the output of the linear regression model to find how much correlation each variable has with the dependent variable. The second use of feature engineering in the thesis is the modification of variables. This can refer to combining multiple variables to create a new variable, calculating a variable differently so that it can be used better in classification, or categorizing a variable so that it has a limited range of possible values. An example of variable modification can be made with a student data set containing the native language of the student as one of the variables. Table 5 contains the data.
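The native-language data of Table 5 is not reproduced here; the following sketch instead uses a small hypothetical student table to illustrate the three kinds of modification described above (categorizing, recalculating, and combining variables).

# Hypothetical data used only to illustrate variable modification
students <- data.frame(gpa             = c(3.2, 2.4, 3.0),
                       absence         = c(5, 7, 6),
                       native_language = c("English", "Arabic", "English"))

# Categorizing a variable: turn the numeric GPA into a limited set of bands
students$gpa_band <- cut(students$gpa,
                         breaks = c(0, 2.5, 3.0, 4.0),
                         labels = c("low", "medium", "high"))

# Recalculating a variable: reduce native language to a binary indicator
students$native_english <- as.integer(students$native_language == "English")

# Combining variables: a simple derived feature mixing achievement and attendance
students$gpa_per_absence <- students$gpa / (students$absence + 1)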
Materials

Two different data sets were used for this research. The first data set was originally used in research made at the University of Jordan (Amrieh et al., 2016). It contains information about 480 students from various countries, mostly in the Middle East. The data has a total of 17 variables. The variables are of two types: nominal types have a specific set of values, while quantitative types have values which can be ordered (Card, 1997). The variable "Class" is the dependent variable, meaning it is the variable that the model is trying to predict. It can have three different values, which are "L", "M", and "H". Value "L" means low, which represents a grade score between 0 and 69. Value "M" means medium, which represents a grade score between 70 and 89. The final value "H" means high, and it represents a grade score between 90 and 100. The second data set was originally used in a research done at the University of Minho, Portugal (Cortez and Silva, 2008). It contains information about 395 students and has 31 different variables.

Implementation and results

The first data set
The first step was to apply the machine learning methods to the raw data. In this case, the only processing done to the data was the modification of the dependent variables to make them binary. A total of 353 students out of 480 performed well or satisfactorily, and therefore the majority rule has an accuracy of 73 percent. This is the baseline accuracy for this data set, to which the accuracy of the prediction models built on this data set was compared to see if the models can make useful predictions. After calculating the baseline accuracy, the next step was to divide the data into the training and test sets. The training set (75 percent of the data) was used to build the prediction model and the test set (25 percent of the data) was used to test the model. While building the training and test sets, an important thing to consider is that both sets must contain similar ratios of students from both classes. The R language has built-in functionality ensuring that the cases of different classes are spread proportionally among the training and test sets.
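The thesis does not name the function it used for this proportional split; one commonly used option that behaves as described is createDataPartition() from the caret package, sketched below with placeholder names.

library(caret)   # createDataPartition() keeps class proportions similar in both sets

# 'students' stands for the full data frame, with a binary dependent variable 'passed'
set.seed(1)      # make the random split reproducible
train_index <- createDataPartition(students$passed, p = 0.75, list = FALSE)

training_set <- students[train_index, ]
test_set     <- students[-train_index, ]

# Both sets should now contain roughly the same ratio of passing and failing students
prop.table(table(training_set$passed))
prop.table(table(test_set$passed))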
The next step after creating the training and test sets was to build the models. The first model was created using the linear regression method. Building a model in the R language is a straightforward process, which mainly includes defining the input data, the dependent variable, and the independent variables. After the model is created, it is applied to the test data set. The output of this process that concerns the thesis is a confusion matrix. It contains data about the predicted values and the actual values. Table 9 shows the confusion matrix for the first prediction model.

               Predicted False   Predicted True
Actual False   26                6
Actual True    2                 86

Table 9. Confusion matrix for linear regression used on the first raw data set.
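The exact R calls are not reproduced in the thesis; a minimal sketch of the workflow described above, with placeholder variable names and an assumed 0.5 cut-off for turning the regression output into a binary prediction, could look like this:

# Linear regression of the binary dependent variable on all independent variables
model <- lm(performed_well ~ ., data = training_set)

# Apply the model to the test set and convert its continuous output into classes
predicted_score <- predict(model, newdata = test_set)
predicted_class <- as.integer(predicted_score >= 0.5)

# Confusion matrix of actual values against predicted values, and the accuracy
confusion <- table(Actual = test_set$performed_well, Predicted = predicted_class)
print(confusion)
sum(diag(confusion)) / sum(confusion)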
The accuracy calculated from the confusion matrix is 93 percent (112 correct predictions out of 120 test cases). This is an improvement over the baseline accuracy of 73 percent. The results are further evaluated in the next chapter. The second model was created using the decision tree method. The R function for this model is an implementation of CART (Strobl, 2009), the classification and regression tree. Apart from the function used, the procedure is the same as with the previous model. Training and test sets were created, the model was built using the training set and then applied to the test set. Table 10 shows the confusion matrix for this model.

               Predicted False   Predicted True
Actual False   28                4
Actual True    1                 87

Table. Confusion matrix for the naïve Bayes classification used on the first raw data set.
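The thesis states only that the R function it used is an implementation of CART; the rpart package is a standard implementation of classification and regression trees, so the sketch below uses it as an assumption, again with placeholder names.

library(rpart)   # CART-style classification and regression trees

# Same procedure as before, only the model-building function changes
tree_model <- rpart(factor(performed_well) ~ ., data = training_set, method = "class")

# Predict classes for the test set and build the confusion matrix
tree_pred <- predict(tree_model, newdata = test_set, type = "class")
table(Actual = test_set$performed_well, Predicted = tree_pred)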
The second data set

The second data set contains 395 students. In this data set, 265 students out of 395 have a passing grade. This means the baseline accuracy for this data set is 67 percent. As with the first data set, the training and test sets were created, and models were built using the machine learning methods. The models were then tested, and confusion matrices were produced as the relevant output. Tables 12-14 show the confusion matrices for the linear regression, CART, and naïve Bayes classifier methods, respectively.
               Predicted False   Predicted True
Actual False   14                19
Actual True    7                 59

Table. Confusion matrix for linear regression used on the second raw data set.

               Predicted False   Predicted True
Actual False   8                 24
Actual True    7                 59

Table. Confusion matrix for CART used on the second raw data set.

Accuracy values for the models are 74 percent, 68 percent, and 73 percent, respectively. Although the results are further evaluated in the next chapter, it can already be noticed that the machine learning methods do not offer much improvement over the baseline method in this data set.
Results from engineered data

To improve the prediction performance, the data sets were modified. The first modification method in the thesis was feature selection. To find the important variables in the data sets, the process of variable ranking can be used (Guyon and Elisseeff, 2003). In this thesis, it is done by using the output of the linear regression model, which shows the correlation of each independent variable with the dependent variable. The process of selecting variables was done with a trial and error approach, where the machine learning model was built multiple times with different sets of relevant variables and the best combination of variables was identified (a sketch of this ranking step is given below). The second method of modification was custom feature creation, where important variables are combined into custom variables to make the decision trees more efficient.
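As a sketch of how the linear regression output can support this ranking (the thesis does not show its code; the coefficient table produced by summary() is one reasonable way to read off how strongly each independent variable relates to the dependent variable):

# Fit a linear regression on all candidate variables of the training set
ranking_model <- lm(performed_well ~ ., data = training_set)

# The coefficient table lists an estimate and a p-value for every independent variable;
# variables with small p-values are kept, and the model is rebuilt with different subsets
coef_table <- summary(ranking_model)$coefficients
coef_table[order(coef_table[, "Pr(>|t|)"]), ]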
Discussion and conclusions

The success of machine learning in predicting student performance relies on the good use of the data and the machine learning algorithms. Selecting the right machine learning method for the right problem is necessary to achieve the best results. However, the algorithm alone cannot provide the best prediction results. Feature engineering, the process of modifying data for machine learning, is also an important factor in getting the best prediction results. The aim of this thesis was to compare method selection and feature engineering in terms of their ability to improve the prediction results. Two different data sets were analyzed with three different machine learning methods, and their results were compared using four evaluation measures. The methods used were linear regression, decision trees, and naïve Bayes classification. For the evaluation of feature engineering, the machine learning methods were applied to the raw and modified versions of the data separately. The main method of feature engineering was feature selection. In the case of classification and regression trees, additional feature engineering was done in the form of custom feature creation. Feature engineering was done both with automatic functionality and with manual interpretation of the data. In addition, fine tuning of features was done with a trial and error approach.

The results of both data sets show similarities and differences with their use in the original studies. In the first data set, the similarity is that recall values were consistently higher than precision values. The difference was in the accuracy values. The accuracy reached in this thesis was higher than in the original research (Amrieh et al., 2016). This can be attributed to the difference in dependent variables. In the original research, the dependent variable was not converted to binary, and it had three values instead of two. Generalizing the dependent variable might have made the predictions easier in this thesis. In the second data set, the original research used additional variables that indicate the past exam grades, and achieved better accuracy than in this thesis (Cortez and Silva, 2008). However, once those variables are omitted, the accuracy values were similar to those of this thesis.

The models that used the first data set gave much better results compared to the models built with the second data set. Accuracy values for the first data set ranged from 93 percent to 98 percent, while accuracy values for the second data set were between 68 percent and 78 percent. Although the second data set contained more features than the first one, the results imply that the features in the first data set were more related to student success. This shows the importance of the data when it comes to prediction performance. The methods used for both data sets were nearly identical, but the results were very different. This indicates that better methods cannot offset the limitations of the data.

The results of this study indicate that feature engineering provides more improvement to prediction results than method selection. Even though feature engineering was done in a limited capacity, it made a bigger difference in prediction performance. Furthermore, the biggest leap in improvement was made in the case of decision trees, where both feature selection and feature modification were applied to the data. When trying to improve the prediction of student performance, the modification of the input data is an important factor besides selecting the right method for the data. Although feature engineering was more effective than method selection, the combination of both approaches provided the best results. In both data sets, the best possible accuracy values were a clear improvement over the baseline accuracy values. This shows that using machine learning is an effective way of predicting student performance.

Future work

This research has certain limitations that must be noted. There was no access to a dedicated student data set, and the study relies on public data sources. In addition, both data sets were small, having fewer than a thousand records. Research that has access to more comprehensive data may offer more conclusive results. Another area that future research can improve is the variety of the machine learning methods. This research used linear regression, decision trees, and naïve Bayes classification. Other methods, such as clustering and artificial neural networks, can be used to gain a better understanding of the importance of method selection. The final area that can be improved is the process of feature creation. Since the data is limited, the amount of feature modification that can be made is also limited. Both data sources used in this research consist of a single table, and custom variables were created using variables from the same table. With a more comprehensive data set that spans multiple tables, there will be more potential to create new custom variables, while keeping in mind that the more complex a custom variable is, the more difficult it is to interpret the relation between it and the dependent variable.
References

Ethem Alpaydin. 2004. Introduction to Machine Learning. Cambridge, MA.

Elaf Abu Amrieh, Thair Hamtini, and Ibrahim Aljarah. 2016. Mining educational data to predict student's academic performance using ensemble methods. International Journal of Database Theory and Application 9(8), 119-136.

S. K. Card and J. Mackinlay. 1997. The structure of the information visualization design space. In: Proceedings of the 1997 IEEE Symposium on Information Visualization. IEEE, 92-99.

Paulo Cortez and Alice Maria Gonçalves Silva. 2008. Using data mining to predict secondary school student performance. In: Proceedings of the 5th Annual Future Business Technology Conference, Porto, 5-12.

G. Dekker, M. Pechenizkiy, and J. Vleeshouwers. 2009. Predicting students drop out: A case study. In: Educational Data Mining 2009, 41-50.

Pedro Domingos. 2012. A few useful things to know about machine learning. Communications of the ACM 55(10), 78-87.
