Abstract—Computer programming is one of the most important and vital skills of the current generation. To encourage and enable programmers to practice and sharpen their skills, many online judge programming platforms exist. Estimating these programmers’ strength and progress has been an important research topic in educational data mining, with the aim of providing adaptive educational content and early prediction of ‘at risk’ learners. In this paper, we trained a Kohonen self-organizing feature map (KSOFM) neural network on programmers’ performance log data from the Aizu Online Judge (AOJ) database. Propositional rules and knowledge were extracted from the U-matrix diagram of the trained network, which partitioned AOJ programmers into three distinct clusters, i.e., ‘expert’, ‘intermediate’ and ‘at risk’. The propositional rules performed classification with an accuracy of 94% on a testing set. For validation and comparison, three more predictive models were trained on the same dataset. Among them, the feedforward multilayer neural network and the decision tree scored accuracies of 97% and 96% respectively. In contrast, the precision score for the support vector machine was about 88%, but it achieved the highest recall score, 99%, in identifying ‘at risk’ students.
Index Terms—Online judge system, novice programmers, clustering, early prediction, self-organizing feature map.

I. INTRODUCTION
An online judge system is an educational site, a web service originally designed for programming contests such as ACM-ICPC (ACM International Collegiate Programming Contest). Such online judge platforms host a huge number of programming problems, which can be solved in both online and offline mode. Most users of online judge systems come from computer science and mathematics backgrounds. These users use the system to enhance their problem-solving skills and to compete against each other online. As the number of online judge platforms increases, the amount of accumulated data also increases. These accumulated data give us an opportunity to discover important knowledge about programmers’ (novice, expert, intermediate, etc.) behavior and progress. Programming is an interdisciplinary subject. Moreover, competitive programming can be relatively difficult and intimidating to a novice user due to problem difficulty, diverse categories and its competitive nature. We observed from our studies on Aizu Online Judge [1–3] submission log data that a significant number of programmers struggle to find the correct solution and appropriate problems to solve. It is important to correctly identify these at-risk programmers at an early stage in order to help them overcome these difficulties. However, early forecasting of ‘at risk’ programmers is challenging, since the strength of these programmers depends on the various features and characteristics depicted in Table I. These features are correlated with one another. Thus, applying clustering and statistical analysis to these multidimensional data could reveal insights and patterns among these programmers. Studies [4–16] used different machine learning and neural network models for early forecasting of weak students. In our research, we trained a self-organizing feature map on AOJ log data in order to obtain a lower-dimensional mapping. Further analysis of the trained network provided us with propositional rules to partition these programmers based on strength.
The rest of the paper is organized as follows. Section II summarizes the literature review. Section III presents the methodology of our research. Sections IV, V and VI present the experiment setup, parameter settings and results of the analysis respectively. Finally, Section VII concludes the paper with limitations and possible future work.

II. BACKGROUND AND RELATED WORKS
A significant amount of research has been conducted on the identification of ‘at risk’ and ‘novice’ learners in both e-learning and off-learning systems. Among these studies, the key features that have been taken into account are progress in introductory programming courses, prior programming experience, gender, dislike of or negative attitude toward programming, mathematics background, formal training in programming, the student’s understanding relative to the difficulty of the learning materials, and the ability of a student or programmer to find a way of solving problems [4–7]. Numerous studies have implemented and trained statistical learning models and neural networks for the purpose of classification. Recent work [4] proposed a back-propagation neural network that can estimate student performance according to students’ prior knowledge. That contribution also constructs a student attribute matrix (SAM), with indicators and predictors that show how much a specific factor affects a student’s performance. Research work [5] classified programmers based on submission log data such as compilation profiles, error
Fig. 6. Two-dimensional feature mapping of total submissions, solved and accepted answers. Here the Kohonen neuron layer (16,16) depicts the clustering of three distinct groups of users based on the mentioned features. The red circle denotes the clustering of ‘at risk’ users. Blue and green indicate intermediate and strong users respectively.

TABLE III
PROPOSITIONAL RULES EXTRACTED FROM TRAINED SOM
Rules
if solved in range [600, 1300]
and AC in range [2000, 3000]
and WA and CE and TLE < 800
then Expert

if solved in range [150, 600]
and AC in range [1500, 2500]
and WA and CE and TLE in range [800, 1500]
then Intermediate

if solved < 150
and WA and CE and TLE > 800
then At Risk or Novice
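The rules in Table III can be read as a small rule-based classifier. The following is a minimal sketch, not the implementation used in this work. It assumes that each of the WA, CE and TLE counts must satisfy the stated bound individually (the table does not say whether the bound applies to each count or to their sum), that the rules are tried in the order listed (the `solved` ranges share the boundary value 600), and that AC, WA, CE and TLE carry their usual online-judge meanings. The names `UserStats` and `classify` and the fallback label `unclassified` are hypothetical and do not appear in the paper.

```python
from dataclasses import dataclass


@dataclass
class UserStats:
    """Per-user counts taken from a submission log (hypothetical container)."""
    solved: int  # number of problems solved
    ac: int      # accepted submissions
    wa: int      # wrong-answer submissions
    ce: int      # compile-error submissions
    tle: int     # time-limit-exceeded submissions


def classify(u: UserStats) -> str:
    """Apply the three Table III rules in order; first match wins."""
    # Assumption: each error count is checked separately against the bound.
    errors = (u.wa, u.ce, u.tle)

    if (600 <= u.solved <= 1300
            and 2000 <= u.ac <= 3000
            and all(e < 800 for e in errors)):
        return "expert"

    if (150 <= u.solved <= 600
            and 1500 <= u.ac <= 2500
            and all(800 <= e <= 1500 for e in errors)):
        return "intermediate"

    # The third rule places no condition on AC.
    if u.solved < 150 and all(e > 800 for e in errors):
        return "at risk"

    # The listed rules are not exhaustive, so some users match none of them.
    return "unclassified"
```

For example, `classify(UserStats(solved=700, ac=2500, wa=100, ce=50, tle=20))` falls under the first rule. The explicit fallback makes it visible when a user is covered by none of the listed rules, which is expected because the table shows only some of the key rules.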
D. Rule extraction from Self Organizing Map (SOM)
Analysis of a Kohonen self-organizing map can support knowledge discovery and exploratory data analysis [22]. James Malone et al. [22] proposed an algorithm to extract propositional if-then rules from the U-matrix of a trained SOM network. Applying this method to our trained network therefore gives us the key properties of the discovered clusters. Initially, the boundaries of the important components are identified from the trained SOM’s U-matrix. A boundary is identified with the help of neighboring units: the two neighboring units with the highest relative difference are selected as candidate boundary units [22]. Table III depicts some of the key rules for classification.

IV. EXPERIMENT SETUP
The dataset contains the submission logs of 25,000 users. This dataset was divided into two sets: 60% of the samples were drawn at random for training, and the remaining 40% were kept for testing. Although explicit labels for the strength of users were not available in the training set, the range of each user’s rating was mapped to an integer value to serve as the label. The labels of the test set were determined in the same manner and validated with the help of expert programmers. All the predictive models were trained and tested on the same training and testing sets respectively.

V. PARAMETER SETTINGS
The parameter values of each model were selected using the grid search method. For the SVM, we set the regularization parameter C to 0.5. We used the Gaussian RBF (radial basis function) as the kernel. The free parameter gamma of the RBF was set to 0.7. For the multi-layer perceptron network, we set the number of hidden layers to 1 (with 50 hidden neurons). The output layer consists of 3 neurons, one for each class. We used the ReLU (rectified linear unit) function for the non-linearity in both hidden and output layers. We used the Gini score to measure split quality in the decision tree implementation.

VI. RESULTS AND EXTRACTED RULE VALIDATION
The performance of the rules extracted from the self-organizing map (SOM) was tested together with three different learning