
Automatic Syllabus Classification

Xiaoyan Yu, Manas Tungare, Weiguo Fan, Manuel Pérez-Quiñones, and Edward A. Fox
Virginia Tech, Blacksburg, VA, USA
{xiaoyany, manas, wfan, perez}

William Cameron, GuoFang Teng, and Lillian Cassel
Villanova University, Villanova, PA, USA
{william.cameron, guofang.teng, lillian.cassel}

ABSTRACT

Syllabi are important educational resources. However, searching for a syllabus on the Web using a generic search engine is an error-prone process and often yields too many non-relevant links. In this paper, we present a syllabus classifier to filter noise out of search results. We discuss the various steps in the classification process, including class definition, training data preparation, feature selection, and classifier building using SVM and Naïve Bayes. Empirical results indicate that the best version of our method achieves high classification accuracy, i.e., an F1 value of 83% on average.

Class        Definition                                                             Syllabus   Out-Links
Full         a syllabus without links to other syllabus components.                 T          F
Partial      a syllabus along with links to other syllabus components               T          T
             somewhere else.
Entry Page   a page that contains a link to a syllabus.                             F          T
Noise        all others.                                                            F          N/A

Table 1: Class definitions.
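The class scheme in Table 1 can be encoded directly; the sketch below uses a small Python mapping (the field names and helper are ours for illustration, not from the paper):

```python
# Encoding of Table 1: each class is defined by whether the page is itself
# a syllabus and whether it links out to other syllabus components.
# (Illustrative sketch; names are ours, not the authors'.)
CLASSES = {
    "full":       {"is_syllabus": True,  "out_links": False},
    "partial":    {"is_syllabus": True,  "out_links": True},
    "entry_page": {"is_syllabus": False, "out_links": True},
    "noise":      {"is_syllabus": False, "out_links": None},  # N/A
}

def is_true_syllabus(label: str) -> bool:
    """Only the full and partial classes count as syllabi."""
    return CLASSES[label]["is_syllabus"]
```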
Categories and Subject Descriptors: H.3 [Information Storage
and Retrieval]: Digital Libraries
General Terms: Performance, Design, Experimentation
Keywords: Syllabus, Genre, Text Classification, SVM

1. INTRODUCTION

A course syllabus is the skeleton of a course. It is an important educational resource for educators, students, and life-long learners. Free and fast access to a collection of syllabi could have a significant impact on education. Unfortunately, searching for a syllabus on the Web using a generic search engine is an error-prone process and often yields too many non-relevant links. The MIT OpenCourseWare project, which provides free access to MIT course materials, is a good start towards making a wide and open digital library of syllabi.

However, there exists a chicken-and-egg situation regarding the adoption of such a repository on a much larger scale: there is little incentive for instructors to take the additional effort to add their syllabi to this repository unless there are existing services that they can then use. On the other hand, useful services would need a large collection of syllabi to work on. Hence, to break out of this deadlock, we decided to seed our repository with syllabi acquired from the Web in order to bootstrap the process. We restrict our focus to computer science syllabi offered by universities in the USA as a starting point for our proof-of-concept project. The methodology and the system could be extended easily to other disciplines.

This paper presents our progress on automatic classification for a syllabus collection. A classification task usually can be accomplished by defining classes, selecting features, and building a classifier. In order to quickly build an initial collection of CS syllabi, we obtained more than 8000 possible syllabus pages by searching on Google (details in [4]). After randomly examining the set, we found the result set to be very noisy. To help with the task of properly identifying true syllabi, we defined four syllabus class types, shown in Table 1, and then proposed syllabus feature characteristics for each class. We used both content and structure information for syllabus classification, as they have been found useful in the detection of other genres [2]. Finally, we applied machine learning approaches to learn classifiers to produce the syllabus repository. We chose and compared Naïve Bayes and Support Vector Machine (SVM) approaches and their variants for syllabus classification, since they are good at text classification tasks in general [1, 3].

There are many other genres of data on the Web. We hope that our application of machine learning techniques to obtain a repository of genre-specific data will encourage the creation of similar systems for other genres.
2. SYLLABUS CLASSIFICATION

We define, in Table 1, the four classes of documents acquired from the Web search. We consider only the full and the partial classes as syllabi. The reason we treat a partial syllabus as a syllabus is that we can complete a partial syllabus by following outgoing links from it, which would be helpful for a variety of services. Each class has its own characteristics. An entry page contains a link that contains the word 'syllabus' or the prefix 'syl', or a link whose anchor text contains the syllabus keyword. Many keywords, such as 'prerequisite', would occur in a full syllabus; these keywords would also occur in a partial syllabus, but often along with a link. In addition, the position of a keyword in a page matters. For example, a capitalized keyword at the beginning of a page would suggest a syllabus component with a heading in the page, while a keyword within the anchor text of a link or around the link would suggest a syllabus component outside the page.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
JCDL'07, June 18–23, 2007, Vancouver, British Columbia, Canada.
Copyright 2007 ACM 978-1-59593-644-8/07/0006 ...$5.00.
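The entry-page cue described above, a link whose URL contains the prefix 'syl' or whose anchor text mentions the syllabus keyword, can be sketched as a simple predicate. This is a hypothetical illustration, not the authors' implementation; the function name and input format are ours:

```python
import re

def looks_like_entry_page(links):
    """Return True if any (href, anchor_text) pair suggests a link to a
    syllabus: the URL contains the prefix 'syl' (as in 'syllabus' or
    'syl101.html'), or the anchor text mentions 'syllabus'.
    Hypothetical sketch, not the authors' code."""
    for href, anchor_text in links:
        if re.search(r"syl", href, re.IGNORECASE):
            return True
        if "syllabus" in anchor_text.lower():
            return True
    return False
```

For example, a page whose only syllabus-related evidence is a link `("courses/cs101/syllabus.html", "course info")` would be flagged as a candidate entry page by this predicate.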


3. EVALUATION

In order to train and evaluate our syllabus classifier, we randomly sampled 1020 documents from the web-acquired collection of more than 8000 documents and manually classified them into the four classes. We observed 499 full, 208 partial, 138 entry, and 175 noise pages in the sample set. Since the data points for each class are not evenly distributed, we also used a re-sampling approach to create another data set in which each class contains evenly distributed data points.

We extracted free text from the sampled documents, which came in a variety of formats. According to the class characteristics described above, we selected 84 features, mainly concerning the occurrences of keywords, the positions of keywords, and the co-occurrences of keywords and links. We employed Naïve Bayes with and without kernel density estimation for numerical attributes, and SMO (Sequential Minimal Optimization, a member of the SVM family) with linear and polynomial kernels, all implemented in Weka2. We trained and tested the classifiers on both data sets with 10-fold cross-validation, and the average results are reported below. We use F1, a measure that balances precision and recall, to provide an overall measure of classification performance.

The training set accuracy for each setting and the F1 values for the two data sets are shown in Table 2 and Table 3. Generally, SMO produces more accurate classifiers on the training set than Naïve Bayes; only SMO with the polynomial kernel can reach near 100% accuracy, but with an F1 of only 53.8% on average. Motivated by these observations, we ran SMO with a polynomial kernel of degree 5 and lambda = 10 on the data set with the uniform distribution, obtained by 100% re-sampling of the original data set; this setting gives our best result. SMO with the polynomial kernel on the uniform-distribution data set performs 25% better than SMO with the linear kernel, and 55% better than the same setting on the original distribution, in terms of average F1. Our results also show that the classifiers learnt from the data set with the uniform distribution perform better, as can be seen in the last two columns of Table 3. Overall, SVM with a polynomial kernel performs the best. From these experiments, we come to the tentative conclusion that SVM performs better than Naïve Bayes for the syllabus classification task.

Table 2: F1 comparisons for Naïve Bayes (NB), Naïve Bayes with kernel (NB-K), SMO with linear kernel (SMO-L), and SMO with polynomial kernel (SMO-P) on the original data set, for the Full, Partial, Entry Page (Entry), and Noise classes and their average. Acctr (accuracy on the training set): 56% (NB-K), 57% (NB), 71% (SMO-L), 99% (SMO-P).

Table 3: F1 comparisons for the same four settings on the data set with the uniform distribution, after 100% re-sampling of the original data set. Acctr: 68% (NB-K), 54% (NB), 77% (SMO-L), 99% (SMO-P). INCPL: the percentage by which SMO-P improves over SMO-L. INC43: the percentage by which SMO-P improves over the best result for the corresponding setting in Table 2, shown there in bold.

We did some failure analysis on the classification results. For example, 61% of the mis-labeled full syllabi were labeled as partial, and 44% of the mis-labeled partial syllabi were labeled as full. We considered the occurrences of keywords around a link and within the anchor text of the link as the same feature, so such a keyword could mislead the classifier into classifying a full syllabus with links to a few resources as a partial one. In order to improve the classification between the full and the partial classes, we need to improve our feature selection methods. We are still testing more settings, such as different feature selection approaches and training data set sizes, to discover more insights toward building the best classifier for syllabi.

4. CONCLUSIONS

We presented a description of our approach to syllabus classification in this paper. We defined four syllabus class types, along with their features, and performed classification with Naïve Bayes and SVM methods. We are also building an extractor for structured syllabi and creating services upon these collections.

5. ACKNOWLEDGMENTS

This work was funded in part by the National Science Foundation under DUE grant #0532825.

6. REFERENCES

[1] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the European Conference on Machine Learning. Springer, 1998.
[2] A. Kennedy and M. Shepherd. Automatic identification of home pages on the web. In Proceedings of the 38th Annual Hawaii International Conference on System Sciences (HICSS'05) - Track 4, Washington, DC, USA, 2005. IEEE Computer Society.
[3] S.-B. Kim, K.-S. Han, H.-C. Rim, and S. H. Myaeng. Some effective techniques for naive bayes text classification. IEEE Transactions on Knowledge and Data Engineering, 18(11):1457–1466, November 2006.
[4] M. Tungare, X. Yu, W. Cameron, G. Teng, M. Pérez-Quiñones, W. Fan, L. Cassel, and E. A. Fox. Towards a syllabus repository for computer science courses. In Proceedings of the 38th Technical Symposium on Computer Science Education (SIGCSE 2007), 2007.

2 http://www.cs.waikato.ac.nz/ml/weka/