
Machine Learning algorithm in educational data

Sushil Shrestha and Manish Pokharel


Department of Computer Science and Engineering, Kathmandu University, Nepal
Corresponding author: sushil@ku.edu.np

ABSTRACT -- Educational Data Mining (EDM) is an application area of data mining concerned with gathering, analyzing, and
presenting information. The purpose of this paper is to analyze online learners’ activities and extract hidden information using
clustering and classification techniques. The data were collected from learners enrolled in a MOOC course on C programming
offered by Kathmandu University, Nepal. For clustering, the K-means algorithm was used to group students with similar
characteristics in order to understand the learners’ behavior. For classification, a Support Vector Machine (SVM) classifier was
implemented to develop a predictive model that labels students’ performance with a class: low, medium or high. The extracted
knowledge can be used by the academic institution to improve its teaching and learning processes and learners’ performance,
which in turn supports academic achievement. This research helps in the early identification of weak students so that timely
decisions can be made to improve learners’ performance and reduce online learners’ dropout rates.

Keywords: Online Learning (OL), Educational Data Mining (EDM), K-means clustering, SVM model

1. INTRODUCTION
A MOOC (Massive Open Online Course) is an online course with free and open registration whose content is
delivered to users over the web. MOOCs support self-paced learning: students can access the system remotely, from
anywhere and at any time, using the internet, and can enroll in well-designed online courses that match their
interests [1]. As a result, a MOOC system generates a huge amount of data about learners’ behavior and other
activities. A MOOC not only offers a platform to host video and text lectures, assignments, and quizzes but also
supports collaborative learning through online discussion forums and keeps track of student activity via its logging
system [1][2]. Various kinds of student activity data can be extracted from the MOOC platform. Examples of the data
stored in the MOOC system are logs that track the clicks made in a course, course activity reports and individual
activity reports. Using these reports, it is possible to analyze the activity of the students who use the system and
to build predictive models of it.
In this research, data related to MOOC activity and Quiz grades of students were analyzed and a predictive
model of student performance was built. First, 6 datasets containing the above-mentioned information (MOOC activity
and Quiz grades) were downloaded from the MOOC system. Then, data mining techniques were applied to the merged
dataset in order to extract knowledge and insights. Finally, a predictive model was built using a support vector machine
(SVM) in order to predict the student grades.

2. RELATED WORKS
In [3], the authors used the k-means clustering technique and the decision tree technique to analyze students’
academic performance. The study collected data on 200 students from their exam results and applied the k-means
clustering method to group the students into three classes (i.e. low, medium and high) based on their
performance in percentage. The clustering produced a low-performance group whose percentage is less than 60,
a medium-performance group whose percentage is greater than or equal to 60 and less than 85, and a high-performance
group whose percentage is greater than or equal to 85. Finally, the study applied a decision tree to classify the patterns
of students’ performance in order to obtain specific knowledge for improving both the educational system and
the learners’ performance.
Similarly, the authors of [4] surveyed applications of data mining that use clustering techniques and presented
a comparative study. They showed that k-means clustering algorithms can be applied in areas such as crime pattern
detection in social science, travel package recommender systems in business, medical image segmentation in medicine
and student performance detection in education. The study reported that applying the k-means clustering technique
“improves the student’s performance and enhances the academic planners
to monitor the performance and progression level of each student”.
In [5], the authors applied a supervised learning algorithm, the support vector machine (SVM), to a
classification task. The main goal of that research was to predict students’ placement results as a class labeled
Yes or No. The study collected data on 200 students with six independent attributes such as Attendance, GPA,
Reasoning Aptitude, Quantitative Aptitude, Communication Skills, Technical Skills and one dependent attribute such

Authorized licensed use limited to: University of Texas at Arlington. Downloaded on May 14,2020 at 21:26:04 UTC from IEEE Xplore. Restrictions apply.
as Placement. The study showed how classifying students’ placement results gives them a clearer picture of
how they should perform and which new educational trends they should target in order to be placed in the future.
Similarly, in another study, the authors applied both Support Vector Machine (SVM) and K Nearest Neighbor
(KNN) classifiers [6]. The main goal of that study was to find the best prediction model in terms of
accuracy. The models were used to predict students’ grades. For the study, data on 375 students were
collected from the University of Minho in Portugal during the school years 2005-2006. The experimental results
showed that SVM, with 96% accuracy, performed slightly better than KNN, with 95% accuracy.

3. METHODS
This section describes the research methods followed to achieve the primary objective of
applying educational data mining techniques to the OL data stored in the MOOC system.
3.1. Data Collection
This study collected data on online learners enrolled in the course “C Programming” in the MOOC at
Kathmandu University, Nepal. The datasets used for the analysis came from the learners’ online activities:
the system log, the activity completion report and four sets of quiz grades. These datasets are described below.
3.1.1. System log
The system log consists of data pertaining to each click made in the system by a user. This dataset contains
46,524 records with 9 variables, covering 375 active users in the system.

Figure 1 A sample of system log

3.1.2. Activity completion report


Activity completion report consists of data pertaining to activities completed or not completed by each user.
This dataset consists of 375 observations and 63 variables (activities). Each observation represents whether a given
user completed the activity or not.

Figure 2 A sample of activity completion report

3.1.3. Quiz grades
The four datasets related to quizzes contain the quiz scores of each student. If a student
attempts a quiz more than once, another observation with the same student’s information is added to the dataset, with
a different starting time. Students can attempt each quiz at most 3 times. Each quiz is graded out of 10. The first
dataset has 353 observations, the second has 286, the third has 192 and the fourth has 168 observations
in total.

Figure 3 A sample of quiz grades

3.2. Data Preprocessing


Data preprocessing is the process of converting the raw data into an understandable format such that it can
be used by a particular data mining algorithm. It should be applied before applying EDM as data in the real world may
be noisy, incomplete and inconsistent. This step includes data cleaning, data integration, data transformation, data
reduction and data discretization for the preprocessing tasks [7].

3.3. Data Visualization


Data visualization is the graphical representation of complex data in a form that is simple and easy to
understand [7]. Visualization gives the educational researcher an overview of the data because it clearly shows
its patterns. The researcher can then apply different EDM techniques to extract hidden information about students
from a large educational database, which helps to better understand the learners, solve problems related to them
and enhance academic achievement. The MOOC dataset was visualized with different graphs such as bar graphs,
box plots, scatter plots and polar plots. Six datasets containing information about the MOOC course, namely the
users’ course log, the activity completion report and the four quiz grade datasets, were inspected using data mining
techniques. In total, 375 users were active in the system, 215 users completed at least 1 activity and 195 users
attempted at least 1 quiz.

3.4. Applying EDM


Educational data mining (EDM) is an application area of data mining used for gathering, analyzing
and presenting useful information in order to resolve problems related to education. The EDM process converts the
raw data collected from different educational repositories into meaningful and useful information that academic
institutions can use to improve students’ performance [8]. Different EDM techniques, such as association,
clustering, classification and regression, can be applied to educational data to extract hidden information.
This research applies only two of them: clustering, which groups students with similar characteristics and helps
to better understand the learners, and classification, which is used to develop a predictive model of student
academic performance. These two EDM techniques are discussed below.

3.4.1. Clustering
Clustering is a data mining technique that groups data of the same kind into a class, so that the data within
a class share similar characteristics and differ from the data that belong to another class [9]. It is an
unsupervised learning technique commonly used for statistical data analysis, and it is applied in many
different areas such as pattern recognition, machine learning, bioinformatics and information retrieval.
Different clustering methods, such as k-means clustering, Expectation Maximization (EM) clustering, fuzzy
clustering and model-based clustering, can be used to group students and identify their similar skill profiles
[8][9]. K-means is among the most widely used of these because it is a simple unsupervised learning algorithm,
is computationally fast and is reported to produce better clusters than other clustering algorithms [3][4].
Hence, this research uses the k-means clustering algorithm to group students who show similar behavior in
their online learning (OL) activities and quiz scores. K-means clustering discovers patterns that allow
students with low, medium and high performance to be
distinguished [8].
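
The following is a minimal sketch of the k-means procedure (Lloyd's algorithm) in plain Python, included only to make the assignment and update steps concrete. It is not the implementation used in this study, whose software and settings are not given in the text. The function name `kmeans`, the seed and the toy two-feature data are hypothetical.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k data points
    assignment = None
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster: converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids

# Hypothetical toy data: (activity score, average quiz grade) per student.
students = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 8), (8, 9)]
labels, centers = kmeans(students, k=2)
print(labels, centers)
```

Because k-means relies on Euclidean distance, features measured on very different scales (for example click counts and grades out of 10) are usually standardised first. Whether that was done here is not stated.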

3.4.2. Classification
Classification is another data mining technique; it predicts the class or group of new instances
based on a training set of previously labeled instances [8]. It is a supervised learning technique that is commonly
used in the academic field to classify students’ performance into groups. Several classification
algorithms, such as Logistic Regression, Naïve Bayes, K Nearest Neighbor, Neural Networks, Random Forest and Support
Vector Machine (SVM), can be used to predict students’ performance and help prevent student dropout in
online learning. Over the last decade, SVM has attracted considerable research attention for
classification tasks and has been actively applied in many research domains such as image classification,
bioinformatics, face detection, and text and hypertext categorization [10]. SVM is one of the most popular supervised
learning algorithms for handling high-dimensional data and is typically used for classification and regression [5].
In this research it addresses the problem of predicting students’ performance, which is categorized into three
classes (Low, Medium and High) based on users’ average quiz grade. Hence, this research uses SVM for predicting
students’ performance, as it is reported to give accurate results for many classification and prediction problems.
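
To illustrate the idea behind a linear SVM, the sketch below trains a two-class classifier by stochastic sub-gradient descent on the regularised hinge loss. It is an assumption-laden teaching example, not the model built in this study: the kernel, library and hyperparameters used by the authors are not stated, and the names `train_linear_svm` and `predict`, the values of `lam`, `lr` and `epochs`, and the toy data are all hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Stochastic sub-gradient descent on lam/2 * ||w||^2 + max(0, 1 - y * (w.x + b)).

    X is a list of feature tuples; y holds the labels +1 or -1.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Inside the margin or misclassified: the hinge term is active.
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Correct with enough margin: only the regulariser shrinks w.
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def predict(w, b, x):
    """Return +1 or -1 according to the side of the separating hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Hypothetical features scaled to [0, 1]; -1 = low performer, +1 = high performer.
X = [(0.10, 0.20), (0.20, 0.10), (0.15, 0.15), (0.80, 0.90), (0.90, 0.80), (0.85, 0.85)]
y = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(X, y)
print([predict(w, b, x) for x in X])
```

A three-class problem such as low/medium/high can be handled by training one such classifier per class against the rest (one-vs-rest), which is one common extension rather than necessarily the approach taken here.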

4. RESULTS
The outcome of this research is divided into two parts: the visualization of the data and
the implementation of the algorithms (clustering and classification).

4.1. Visualization of data


This section shows the graphical representation of data. Figure 4 shows the user activity frequency vs day of
the week. Daily aggregations of the log data were created to visualize this graph. Friday is the most active day of the
week, with Sunday being the least active.

Figure 4 User activity frequency per day of the week

Figure 5 Daily activity frequency per hour of the day

Figure 5 shows users’ daily activity frequency vs hour of the day. Hourly aggregations were created and
weekends were separated in order to create this graph. 9 PM on weekends and 7 PM on weekdays are the most active
hours of the day, while 3 AM and 4 AM are the least active, with almost no activity.
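
The daily and hourly aggregations behind Figures 4 and 5 can be sketched as simple counts over the log timestamps. The timestamp format, the sample rows and the function name `aggregate` below are hypothetical, and treating Saturday and Sunday as the weekend is an assumption, since the text does not say how weekends were defined.

```python
from collections import Counter
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def aggregate(timestamps):
    """Count log events per day of the week and per hour, split by weekday/weekend."""
    per_day = Counter()
    per_hour = {"weekday": Counter(), "weekend": Counter()}
    for ts in timestamps:
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        per_day[DAYS[t.weekday()]] += 1
        # Assumption: weekend = Saturday (5) or Sunday (6).
        kind = "weekend" if t.weekday() >= 5 else "weekday"
        per_hour[kind][t.hour] += 1
    return per_day, per_hour

# Hypothetical log timestamps, one per click event.
log_times = [
    "2016-11-18 19:05:00",  # Friday
    "2016-11-18 21:40:00",  # Friday
    "2016-11-19 21:10:00",  # Saturday
    "2016-11-20 21:55:00",  # Sunday
    "2016-11-21 19:30:00",  # Monday
]
per_day, per_hour = aggregate(log_times)
print(per_day.most_common(1), per_hour["weekend"].most_common(1))
```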

Figure 6 Distinct number of active users per day in a year

Figure 6 shows the number of distinct active users vs day of the year. The distinct active users were
counted from the MOOC log dataset to create this graph. The number of distinct active users peaks in the week of
November 20, 2016, to November 27, 2016.

Figure 7 Activity distribution per user

Figure 7 shows activity distribution vs user. Each user’s total activity in the MOOC log is represented in
a reordered fashion with a dot on the plot. The distribution is exponential with some outliers.

Figure 8 Component aggregation

Figure 8 shows the aggregation of log components in a polar graph. The component aggregation depicts
the number of events occurring in each log component. Users are most active in the Quiz component,
followed by the System and Book components.

Figure 9 Attempts vs Grade (Quiz1)

Figure 9 shows the graphical representation of the number of quiz attempts vs quiz grade for Quiz 1. The size
of the point on the graph shows the frequency of the occurrence. There is no pronounced pattern in the way attempts
or grades are distributed for this quiz.

Figure 10 Attempts vs Grade (Quiz2)

Figure 10 shows the graphical representation of the number of attempts vs grade for Quiz 2. The size of the
point on the graph shows the frequency of the occurrence. Fewer users attempted this quiz twice than
attempted it once or three times.

Figure 11 Attempts vs Grade (Quiz3)

Figure 11 shows the graphical representation of the number of attempts vs grade for Quiz 3. The size of the
point on the graph shows the frequency of the occurrence. The graph suggests that fewer people take
part in quizzes as the course progresses, but that more of those who do score higher.

Figure 12 Attempts vs Grade (Quiz4)

Figure 12 shows the graphical representation of the number of attempts vs grade for Quiz 4. The size of the
point on the graph shows the frequency of the occurrence. The graph suggests that fewer people take
part in quizzes as the course progresses, but that more of those who do score higher.

4.2. Implementation of K-means algorithm for clustering

Figure 13 Number of cluster k vs Gap statistics (k)

Figure 13 plots the gap statistic against the number of clusters k. Cluster analysis was performed on the users
based on their online activity data (the total number of clicks in the system, the number of activities completed,
the total number of quiz attempts and the average of quiz grades). Five was found to be the optimal number of
clusters for the given dataset, because k = 5 is the first point on the x-axis at which the growth of the gap
statistic starts to decrease.
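
For reference, the gap statistic is commonly defined as follows (the standard formulation by Tibshirani, Walther and Hastie). The symbols are introduced here for illustration and do not appear elsewhere in this paper: $C_r$ is cluster $r$ with $n_r$ members, $d_{ii'}$ is the squared Euclidean distance between points $i$ and $i'$, and $E^{*}_{n}$ is the expectation under a reference distribution with no cluster structure.

```latex
W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} \sum_{i, i' \in C_r} d_{ii'},
\qquad
\mathrm{Gap}_n(k) = E^{*}_{n}\!\left[\log W_k\right] - \log W_k
```

The usual selection rule picks the smallest $k$ with $\mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s_{k+1}$, where $s_{k+1}$ is a standard-error term estimated from the reference samples. The criterion described above (the first point at which the growth of the gap statistic slows) is related, but it may not be identical to this rule.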

Figure 14 Cluster plot of user activity

Figure 14 shows the cluster plot for user activity-based clustering. Each point on the plot represents a single
user and is marked with the user’s assigned index. 5 clusters were formed based on the result of optimal cluster number
analysis. Activity and performance of clusters increase in the direction of the x-axis.
Table 1 shows the results of the cluster analysis. It was concluded that clusters 1 and 3 represent low
performing, cluster 2 represents medium performing and clusters 4 and 5 represent high performing students. Judging
by this grouping, quiz averages were labeled as low if equal to or smaller than 3 (i.e. Average <= 3), medium if between
3 and 7 (i.e. Average > 3 and <= 7) and high if higher than 7 (i.e. Average >7).
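
The labeling rule above can be written as a small function. The name `grade_label` is hypothetical; the thresholds and the label names low, med and high are those given in the text.

```python
def grade_label(average):
    """Map an average quiz grade (out of 10) to a performance label."""
    if average <= 3:
        return "low"
    if average <= 7:
        return "med"
    return "high"

print([grade_label(a) for a in (2.5, 3, 5, 7, 8.5)])
```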
Table 1 Result of medians of each variable for each cluster

4.3. Implementation of SVM for classification


The SVM (Support Vector Machine) method was used to develop the predictive model. It is a useful method for
prediction since it can be used on both linearly separable and non-linearly separable datasets. Further, it is
reported to be an effective data mining technique for prediction tasks, with high accuracy and low root mean
square error [5]. At first, a simple SVM model was built. From the whole dataset of 210 records, 70% of the indices
were randomly chosen for training and the remaining 30% were used for testing. The totalNumClicks (i.e. total number
of clicks in the OL system), totalQuizAttempts (i.e. total number of quiz attempts) and numActivitiesCompleted
(i.e. total number of activities such as course content, videos, forum, etc. completed) were used as independent
features of the training and testing sets, whereas gradeLabel (average quiz grade labeled as low, med and high)
was used as the dependent feature. The confusion
matrix generated by this prediction is shown in Table 2.
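
A random 70/30 split of record indices, as described above, can be sketched as follows. The function name `split_indices` and the seed are hypothetical (the seed used in the study is not given). With 210 records this yields 147 training and 63 testing indices.

```python
import random

def split_indices(n_records, train_fraction=0.7, seed=42):
    """Shuffle the record indices and cut them into training and testing parts."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    indices = list(range(n_records))
    rng.shuffle(indices)
    n_train = round(n_records * train_fraction)
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = split_indices(210)
print(len(train_idx), len(test_idx))
```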

Table 2 Confusion Matrix

The accuracy of this model on the testing dataset is 76.19%. Accuracy for predicting the high and low classes
is high for this model. However, every prediction of the medium class was wrong. Of the 28 students who were
actually high-performance, the model correctly classified 20, and it incorrectly placed 1 other student in the
high-performance class. Similarly, of the 35 students who were actually low-performance, it correctly classified 28,
and it incorrectly placed 2 other students in the low-performance class. The testing dataset contained no actual
medium-performance students, yet the model placed 12 students in the medium-performance class, all incorrectly.
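
The reported accuracy can be checked from the counts stated above. The matrix below is a reconstruction from those counts, not a copy of Table 2: it assumes the stated class totals are exact, which forces the off-diagonal cells (for example, the 12 students predicted as medium must be 6 actual high and 6 actual low). The names `confusion`, `accuracy` and `recall` are hypothetical.

```python
# Rows: actual class; columns: predicted class. Reconstructed from the text.
confusion = {
    "high": {"high": 20, "low": 2, "med": 6},
    "low": {"high": 1, "low": 28, "med": 6},
    "med": {"high": 0, "low": 0, "med": 0},
}

def accuracy(cm):
    """Fraction of all test records that lie on the diagonal."""
    correct = sum(cm[c][c] for c in cm)
    total = sum(sum(row.values()) for row in cm.values())
    return correct / total

def recall(cm, cls):
    """Fraction of the actual members of a class that were predicted correctly."""
    actual = sum(cm[cls].values())
    return cm[cls][cls] / actual if actual else None

print(round(accuracy(confusion) * 100, 2))  # 48 of 63 records correct
print(recall(confusion, "high"), recall(confusion, "low"))
```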

5. DISCUSSION AND CONCLUSION


In the first step, a cluster analysis was performed on the users’ online activity data (i.e. the total number of
clicks in the system, the total number of completed activities, the total number of quiz attempts and the average quiz
grade) so as to group students with similar performance. Then, based on the result of clustering, quiz averages
were labeled as low (i.e. Average <= 3), medium (i.e. Average > 3 and <= 7) and high (i.e. Average > 7). In the second
step, a predictive model was built with an SVM classifier for the 3-class classification of student success labeled
as low, medium and high. On the testing dataset, the model showed an accuracy of 76.19%, meaning that it assigned
48 of 63 students to the right class, which indicates how realistic the predictive model is. The obtained result
revealed that there is a strong relationship between the online activities of users in the MOOC system and their
success at the end of the course. This type of study can help both teachers and students improve their performance
in the OL system. By extracting hidden information about students, teachers can better understand the learners and
their learning behaviors. The predictive model also supports early prediction of students’ performance, so
that teachers can distinguish weak learners from strong ones. Hence, teachers can take early action at the right
time, with proper planning and decision making, to improve the quality of education. They can also provide proper
counselling to weak learners, which improves the learners’ performance
and at the same time reduces the student dropout rate in online learning.

This research supports student behaviour analysis: it helps to understand students’ performance and to predict
that performance for a given dataset. Instructors benefit greatly from this kind of research. In
addition, students are able to know the status of their own performance. This type of research provides a basis for
an early warning system, which can improve the overall teaching and learning process.

ACKNOWLEDGEMENT
This research was conducted at Digital Learning Research Lab of Kathmandu University, Nepal.

REFERENCES

[1] M. Kloft, F. Stiehler, Z. Zheng and N. Pinkwart, “Predicting MOOC dropout over weeks using machine learning methods,” In
Proceedings of the EMNLP 2014 Workshop on Analysis of Large Scale Social Interaction in MOOCs, (pp. 60-65), 2014.
[2] S. B. Aher and L. M. Lobo, “Course Recommender System in E-Learning,” International Journal of Computer Science and
Communication, 3(1), 159-164, 2012.
[3] S. Kadiyala and C. S. Potluri, “Analyzing the Student’s Academic Performance by using Clustering Methods in Data Mining,” Int. J.
Sci. Eng. Res, 5(6), 198-202, 2014.
[4] D. Neha and B. M. Vidyavathi, “A survey on applications of data mining using clustering techniques,” International Journal of
Computer Applications, 126(2), 2015.
[5] G. Pratiyush and S. Manu, “Classifying Educational Data Using Support Vector Machines: A Supervised Data Mining Technique,”
Indian Journal of Science and Technology, 9(34), 2016.
[6] H. Al-Shehri, A. Al-Qarni, L. Al-Saati, A. Batoaq, H. Badukhen, S. Alrashed and S. O. Olatunji, “Student performance prediction
using Support Vector Machine and K-Nearest Neighbor,” In Electrical and Computer Engineering (CCECE), 2017 IEEE 30th
Canadian Conference on (pp. 1-4). IEEE, 2017.
[7] E. A. Amrieh, T. Hamtini and I. Aljarah, “Mining educational data to predict Student’s academic performance using ensemble
methods,” International Journal of Database Theory and Application, 9(8), 119-136, 2016.
[8] C. Romero and S. Ventura, “Educational data mining: a review of the state of the art,” IEEE Transactions on Systems, Man, and
Cybernetics, Part C (Applications and Reviews), 40(6), 601-618, 2010.
[9] R. Saxena, “Educational Data Mining: Performance Evaluation of Decision Tree and Clustering Techniques using WEKA Platform,”
International Journal of Computer Science and Business, 2015.
[10] S. S. Nikam, “A comparative study of classification techniques in data mining algorithms,” Oriental Journal of Computer Science &
Technology, 8(1), 13-19, 2015.
