Predicting School Failure Using Data Mining
C. MÁRQUEZ-VERA
Autonomous University of Zacatecas, Mexico
C. ROMERO AND S. VENTURA
University of Cordoba, Spain
________________________________________________________________________
This paper proposes applying data mining techniques to predict school failure. We have used real data on
670 middle-school students from Zacatecas, Mexico. Several experiments have been carried out in an attempt to
improve the accuracy of predicting final student performance and, specifically, of identifying which students might fail.
In the first experiment the best 15 attributes have been selected. Then, two different approaches have been applied
to address the problem of classifying unbalanced data: rebalancing the data and using cost-sensitive
classification. The outcomes of each of these approaches, using 10 classification algorithms and 10-fold
cross-validation, are shown and compared in order to select the approach best suited to our problem.
Key Words and Phrases: School failure, Educational Data Mining, Prediction, Classification.
_________________________________________________________________________________________
1. INTRODUCTION
Recent years have shown growing interest and concern in many countries about the
problem of school failure and the determination of its main contributing factors. This
problem is known as the “one hundred factors problem”, and a great deal of research
has been done on identifying the factors that affect the low performance of students
(school failure and dropout) at different educational levels (primary, secondary and
higher) (Araque et al., 2009). A very promising approach to this problem is the use
of Data Mining (DM), which is called Educational Data Mining (EDM) when applied in an
educational context (Romero and Ventura, 2010). There are examples of how to apply
EDM techniques to predict dropout and school failure (Kotsiantis, 2009). These
works have shown promising results with respect to the sociological, economic and
educational characteristics that may be most relevant in the prediction of low academic
performance. It is also important to note that most of the EDM research applied to
this problem has focused on the specific case of higher education
(Kotsiantis, 2009) and, more specifically, on online or distance education (Lykourentzou et
al., 2009). Very little specific research on elementary and secondary education has been
found, and what has been found uses only statistical methods, not DM techniques (Parker, 1999).
Selecting the attributes that appear with a frequency greater than 2 has reduced the
dimensionality of our dataset from the original 77 attributes to only the best 15 attributes.
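The paper does not spell out how attribute frequency is computed; assuming it counts how many of several feature-selection methods pick each attribute (so a frequency greater than 2 means "chosen by at least three methods"), the filter can be sketched as follows. The attribute names and rankings below are hypothetical, purely for illustration:

```python
from collections import Counter

def select_frequent_attributes(rankings, min_freq=3):
    """Keep attributes chosen by at least `min_freq` selection methods.

    rankings: one list of selected attribute names per feature-selection method.
    """
    counts = Counter(attr for ranked in rankings for attr in ranked)
    return sorted(attr for attr, freq in counts.items() if freq >= min_freq)

# Hypothetical output of three selection methods over the same dataset:
rankings = [
    ["grade_math", "grade_reading", "age", "motivation"],
    ["grade_math", "grade_reading", "family_income"],
    ["grade_math", "age", "motivation", "family_income"],
]
print(select_frequent_attributes(rankings))  # ['grade_math']
```

This voting scheme keeps only attributes that several independent methods agree on, which makes the reduced set less sensitive to the bias of any single selection algorithm.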
Starting from these 15 attributes, a second experiment has been carried out. Table I (B) shows
the results with the test files (the average of 10 executions) using only the best 15
attributes. When comparing these results with the previous ones obtained using all the
attributes, that is, Table I (A) versus (B), we can see that in general all the algorithms
improve on some measures (TN rate and Geometric mean). For the other
measures (TP rate and Accuracy), some algorithms obtain slightly worse and others slightly
better values, but in general they are very similar to the previous ones. In fact, the maximum
values obtained now are better than those previously obtained using all attributes in two
evaluation measures (TN rate and Geometric mean). The algorithm that obtains these
maximum values is ADTree (81.7% TN rate and 89% Geometric mean). However, although
these results are better than the previous ones, they are still much lower than those obtained
for TP rate (greater than 95.6%, with a maximum value of 99.2%) and Accuracy (greater
than 93.1%, with a maximum value of 97.3%). This is because our data are imbalanced.
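The four measures reported in Table I follow directly from a confusion matrix. A minimal sketch, assuming "pass" is the positive (majority) class, so that the TN rate captures how well the minority "fail" class is detected; the counts below are illustrative, not taken from the paper's experiments:

```python
import math

def evaluation_measures(tp, fn, tn, fp):
    """TP rate, TN rate, accuracy and geometric mean from confusion counts."""
    tp_rate = tp / (tp + fn)                   # sensitivity on the "pass" class
    tn_rate = tn / (tn + fp)                   # specificity: detected failures
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    g_mean = math.sqrt(tp_rate * tn_rate)      # balances both class rates
    return tp_rate, tn_rate, accuracy, g_mean

# Illustrative counts for an imbalanced split (610 pass, 60 fail):
tp_rate, tn_rate, acc, gm = evaluation_measures(tp=590, fn=20, tn=42, fp=18)
print(f"TP rate {tp_rate:.1%}, TN rate {tn_rate:.1%}, "
      f"Accuracy {acc:.1%}, G-mean {gm:.1%}")
# TP rate 96.7%, TN rate 70.0%, Accuracy 94.3%, G-mean 82.3%
```

The example makes the imbalance problem concrete: accuracy stays above 94% even though almost a third of the failing students are missed, which is why the geometric mean is the more informative measure here.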
TABLE I. COMPARISON OF RESULTS: (A) USING ALL ATTRIBUTES, (B) USING THE BEST ATTRIBUTES.
                  (A) All attributes                 (B) Best 15 attributes
Algorithm     TP rate TN rate Accuracy G-Mean    TP rate TN rate Accuracy G-Mean
Jrip           97.7    65.0    94.8     78.8      96.2    93.3    96.0     94.6
Nnge           98.7    78.3    96.9     87.1      98.2    71.7    95.8     83.0
OneR           88.8    88.3    88.8     88.3      96.1    70.0    93.7     80.5
Prism          99.8    37.1    94.7     59.0      99.5    39.7    94.4     54.0
Ridor          97.9    70.0    95.4     81.4      96.9    58.3    93.4     74.0
ADTree         98.2    86.7    97.2     92.1      98.1    81.7    96.6     89.0
J48            96.7    75.0    94.8     84.8      95.7    80.0    94.3     87.1
RandomTree     96.1    68.3    93.6     79.6      96.6    68.3    94.0     80.4
REPTree        96.5    75.0    94.6     84.6      95.4    65.0    92.7     78.1
SimpleCart     96.4    76.7    94.6     85.5      97.2    90.5    96.6     93.6
REFERENCES
ARAQUE F., ROLDAN C. AND SALGUERO A. 2009. Factors Influencing University Drop Out Rates. Computers
& Education, 53, 563–574.
ELKAN C. 2001. The Foundations of Cost-Sensitive Learning. International Joint Conference on Artificial
Intelligence, 1–6.
GU Q., CAI Z., ZHU L. AND HUANG B. 2008. Data Mining on Imbalanced Data Sets. Proceedings of the International
Conference on Advanced Computer Theory and Engineering, Phuket, 1020–1024.
KOTSIANTIS S. 2009. Educational Data Mining: A Case Study for Predicting Dropout-Prone Students. Int. J.
Knowledge Engineering and Soft Data Paradigms, 1(2), 101–111.
LYKOURENTZOU I., GIANNOUKOS I., NIKOPOULOS V., MPARDIS G. AND LOUMOS V. 2009. Dropout Prediction in
e-learning Courses through the Combination of Machine Learning Techniques. Computers & Education, 53,
950–965.
CHAWLA N.V., BOWYER K.W., HALL L.O. AND KEGELMEYER W.P. 2002. SMOTE: Synthetic Minority Over-sampling
Technique. Journal of Artificial Intelligence Research, 16, 321–357.
PARKER A. 1999. A Study of Variables That Predict Dropout From Distance Education. International Journal
of Educational Technology, 1(2), 1–11.
ROMERO C. AND VENTURA S. 2010. Educational Data Mining: A Review of the State of the Art. IEEE
Transactions on Systems, Man, and Cybernetics, Part C, 40(6), 601–618.
WITTEN I.H., FRANK E. AND HALL M.A. 2011. Data Mining: Practical Machine Learning Tools and Techniques.
Third Edition. Morgan Kaufmann Publishers.