Miaomiao Lei, Jidong Ge(B) , Zhongjin Li, Chuanyi Li, Yemao Zhou,
Xiaoyu Zhou, and Bin Luo
1 Introduction
documents are recorded and more than 70K new judgments are indexed every day.
This huge amount of judicial documents is of great importance not only
for improving judicial justice and openness, but also for court administrators
for record keeping and future reference in decision making and judgment writ-
ing. Furthermore, data sharing and deeper analysis on these judgments is the
key approach in the process of information construction and legislation system
improvement.
In China Judgment Online System, judicial cases are indexed with five major
types depending on the cause of action, which are administrative case, criminal
case, civil case, compensation case and execution case respectively. Beneath each
type usually lie further organizational hierarchies. Grouping by keywords is
one of the most commonly used methods; for example, keywords in criminal cases
can be illegal possession, surrender, joint offence, penalty, etc. Keyword
matching does contribute to a better organized judgment documenting system,
but some limits remain. On one hand, a list of keywords must be manually
created and maintained, yet enumerating all keywords completely is difficult,
leading to extra human labor cost. On the other hand, grouping by keywords
does not meet all demands in real applications: when new classification
requirements arise, it is almost impossible to categorize all documents
manually.
Text classification, as an important task in natural language processing,
involves assigning a text document to one of the predefined classes or topics.
Text classification has been widely studied in the text mining and information
retrieval communities and commonly used in a number of diverse domains such
as automatic news categorization, spam detection and filtering, opinion
mining and document indexing. In this paper, we propose a machine learning
approach to automatically classify Chinese judgment documents into predefined
categories. Our work includes: (1) proposing an automated method to construct
a list of judicial-specific stop words, (2) proposing an effective strategy
for Chinese judgment document representation together with feature
dimensionality reduction that keeps as much important information as
possible, and (3) achieving high performance using machine learning based
algorithms to classify Chinese judgment documents.
To evaluate the performance of our approach, we also manually label 6735
pieces of Chinese judgment documents that are related to liabilities of product
quality into 13 categories based on the statutory standard of industry division.
With the experiment results on this dataset, we demonstrate the contributions
of this paper by answering the following research questions:
(1) How is the performance of the classifier improved by domain stop words
list construction and text preprocessing in Chinese judgment documents
classification?
(2) What kind of features should be selected and how can we benefit from
dimensional reduction?
(3) Which machine learning algorithm achieves better performance for Chinese
judgment documents classification?
Automatically Classify Chinese Judgment Documents 5
2 Related Work
In recent years, the problem of text classification has gained increasing
attention due to the large amounts of text data created in many
information-centric applications, accompanied by tremendous research on
methods and algorithms for text classification. In this section, we provide
an overview of key techniques for text classification.
Technically, text data is distinguished from other forms of data such as rela-
tional or quantitative data in many aspects. The most important characteristic
of text data is that it is sparse and high dimensional [1]. Text data can be
analyzed at different levels of representation. Bag-of-words (BOW) simply
represents text data as an unordered collection of words. TF-IDF takes both
word frequency and document
frequency into consideration to determine the importance of each word, which is
commonly used in the representation of documents. Strzalkowski demonstrated
that a proper term weighting is important and different types of terms and
terms that are derived from different means should be differentiated [2]. Jiang
integrated rich document representations to derive high quality information from
unstructured data to improve text classification [3]. Liu studied document repre-
sentation based on semantic smoothed topic model [4]. Besides, document rep-
resentations in many different applications have been studied. Yang proposed
a novel approach for business document representation in e-commerce [5], and
Arguello introduced two document representation models for blog recommenda-
tion [6].
The bag of words (BOW) representation can help retain a great deal of use-
ful information, but it is also troublesome because BOW vectors are very high
dimensional. To provide an ideal lower-dimensional representation, dimension
reduction in many forms is studied to find the semantic space and its
relationship to the BOW representation. Two techniques for dimensional
reduction
stand out. The first one is latent semantic indexing (LSI), which is based on
singular value decomposition to find a latent semantic space and construct a
low rank approximation of the original matrix while preserving the similarity
between the documents [7]. The other one is topic models that provide a proba-
bilistic framework for the dimension reduction task [8]. Hofmann proposed PLSI
that provides a crucial step in topic modeling by extending LSI in a probabilistic
context which is a good basis for text analysis but contains a large number of
parameters that grows linearly with the number of documents [9]. Latent Dirich-
let Allocation (LDA) includes a process that generates topics in each document,
therefore, it greatly reduces the number of parameters to be learned and is an
improvement of PLSI [10].
A wide variety of machine learning algorithms have been designed for text
classification and applied in many applications. Apte proposed an automated
learning method of decision rules for text categorization [11]. Baker introduced
a text classification method based on distributional clustering of words [12].
Drucker studied support vector machines for spam categorization [13]. Ng and
Jordan made comparisons of discriminative classifiers and generative classifiers
[14]. Sun presented supervised Latent Semantic Indexing for document catego-
rization [15].
3 Approach
In this section, we present our approach for Chinese judgment documents clas-
sification in detail as follows. Section 3.1 presents an overview of the workflow
for Chinese judgment documents classification. Section 3.2 introduces text pre-
processing. Section 3.3 introduces a method of document representation and
approaches to reduce the dimensionality of the feature vectors it generates.
Section 3.4 introduces the classifiers used in our
work.
3.1 Overview
Figure 1 presents the overview of the workflow we use for Chinese judgment
documents classification. To analyze Chinese judgment documents in depth, the
classification approach starts from setting a clear classification goal and
then, based on it, accessing China Judgment Online System to obtain a large
number of to-be-classified judgment documents with a certain cause of action.
To build a classification model and evaluate its performance, we put a huge
amount of human effort into labeling a proportion of those judgment documents
according to the classes defined in the goal setting process. After that,
labeled judgment documents are separated into two parts, called the training
dataset and the test dataset, one for model training and the other for
performance evaluation. The question of how large a portion should be used as
the training dataset has been well discussed in [16], and 70% is the value we
choose in our work.
Different from documents written in English, Chinese documents require
different text preprocessing methods because of the huge distinctions in
morphology, grammar, syntax, etc. Beyond that, judgment documents also have
their own characteristics. Generally, a certain format is used when a court
administrator writes a judgment. Therefore, not all of the content in a
judgment is useful for achieving our classification goal. For this reason,
extracting only the content we need for classification helps reduce noise
and thus improves performance. In Sect. 3.2.1 we will introduce the method
we use to extract contents. As mentioned, Chinese documents must be segmented
into words before they are available for representation. For this purpose, we
employ Chinese Word Segmentation techniques from natural language processing,
whose details will be clarified in Sect. 3.2.2. While it is fairly easy to
use a published stop words list, in many cases, using such stop words is
text is called noise in text classification, which affects experiment results
and leads to poor performance, especially in short texts such as judgment
documents. For the reasons listed above, we think extracting particular
contents is necessary for reducing noise and, therefore, improving
performance.
In our work, regular expressions are utilized to extract certain parts of
content in a judgment document. A regular expression, abbreviated as regex,
represents a pattern that matching text must conform to. For extracting
different parts from Chinese judgment documents, different regular
expressions are developed. Table 1 presents the most used regular expressions
we have studied and summarized for extracting different parts of content from
a judgment.
Table 1. A summary of the most used regular expressions for extracting different
contents from a judgment
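Since the table's patterns themselves are not reproduced here, the extraction step can be sketched as follows. The section markers used (e.g. 本院认为, "the court holds", and 判决如下, "rules as follows") and the pattern are illustrative assumptions, not the exact regexes of Table 1:

```python
import re

# Illustrative sketch only: assumes the fixed section markers commonly found
# in Chinese judgments; the real patterns are summarized in Table 1.
OPINION_PATTERN = re.compile(r"本院认为(.*?)判决如下", re.S)

def extract_opinion(judgment_text: str) -> str:
    """Return the court-reasoning section of a judgment, or '' if absent."""
    match = OPINION_PATTERN.search(judgment_text)
    return match.group(1).strip() if match else ""
```

A separate pattern would be developed for each target section in the same way.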
1. Use the terms that occur frequently across judgment documents (low-IDF
terms) as stop words. Inverse Document Frequency (IDF) refers to the inverse
fraction of documents in the whole collection that contain a specific term;
IDF is often used to represent the importance of a term. In other words, if a
term occurs in almost every document of a collection, it is not important and
can be regarded as a potential stop word. After word segmentation, we sum the
term frequency of each unique word by scanning all documents, and then sort
the terms in descending order. The top N terms are chosen as stop words,
where N is chosen manually after human inspection. To make sure the set of
stop words is judicially specific, we filter out all commonly used Chinese
words in advance. The benefit of this approach is that it is very intuitive
and easy to implement.
2. Use the least frequent terms as stop words. Terms that are extremely
infrequent may not be useful for text mining and retrieval. For example,
these terms in judgments may be location names, human names, specific feeling
expressions, etc., which may not be relevant to a judgment classification
objective. Removing these terms can significantly reduce the overall feature
space.
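As a rough sketch (not the authors' exact implementation), the two frequency-based steps above can be combined as follows; `common_words` stands for the pre-built set of everyday Chinese words mentioned in step 1:

```python
from collections import Counter

def stopword_candidates(docs, common_words, top_n=50, min_count=2):
    """Sketch of the two frequency-based steps: docs are lists of segmented
    words. Returns (frequent, rare) stop-word candidates."""
    counts = Counter(w for doc in docs for w in doc)
    # Step 1: top-N most frequent terms, after filtering common Chinese words.
    ranked = [w for w, _ in counts.most_common() if w not in common_words]
    frequent = ranked[:top_n]
    # Step 2: extremely infrequent terms (names, places, rare expressions).
    rare = {w for w, c in counts.items() if c < min_count}
    return frequent, rare
```

In practice N (here `top_n`) is fixed manually after inspecting the ranked list, as the text describes.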
(1) Numbers, Chinese numbers, English letters and special symbols. As no word
segmentation system can provide 100% accuracy, some errors can occur in this
step. As a result, words with numbers, Chinese numbers, English letters or
special symbols can be found in word segmentation results, which are not
useful for judgment classification.
(2) Judicial stop words. After a set of judicial stop words has been
constructed as presented in the last section, we remove from every document
all words that match a judicial stop word.
(3) Location names. By investigating a large number of judgment documents, we
find that locations occur in almost every judgment, such as provinces,
cities, districts, villages, streets, etc. By accessing a location name
database, we can remove most of them. For the rest, we check whether the last
word of a term represents a location and, if so, remove the term.
(4) Human names. Human name recognition is one of the Named Entity
Recognition (NER) tasks in natural language processing. Some NLP systems
provide services to accurately label human names that occur in a given text.
However, in judgments, human names are also among the least frequent terms,
so using such systems as a stop-word-removal step is a waste of resources.
Therefore, by removing the least frequent terms, we remove human names as
well.
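Steps (1)–(4) amount to a token filter. The sketch below is an assumption about implementation details, not the authors' code: it keeps only purely Chinese tokens (dropping digits, Latin letters and symbols), removes judicial stop words, and drops tokens containing Chinese numerals; names and locations are assumed to have been folded into the least-frequent stop words already:

```python
import re

# Tokens made entirely of CJK characters in the basic Unicode block.
CJK_ONLY = re.compile(r"^[\u4e00-\u9fff]+$")
# Chinese numeral characters whose presence marks a token for removal.
CN_NUMERALS = set("〇一二三四五六七八九十百千万亿")

def clean_tokens(tokens, stop_words):
    """Apply removal steps (1), (2) and the numeral part of (1) to a
    segmented document; returns the surviving tokens in order."""
    return [t for t in tokens
            if CJK_ONLY.match(t)
            and t not in stop_words
            and not (CN_NUMERALS & set(t))]
```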
By calculating the TF-IDF weight, each term is regarded as a feature of the
document, and the corresponding value is its TF-IDF weight. Thus all
documents are transformed into feature vectors.
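The TF-IDF weighting just described can be sketched in a few lines; this is a minimal illustration with raw term counts and idf = log(N/df), not necessarily the exact weighting variant the authors used:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each document (a list of words) to a {term: tf-idf} feature dict.
    idf = log(N / df); tf is the raw count within the document."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    idf = {w: math.log(n / df[w]) for w in df}
    return [{w: c * idf[w] for w, c in Counter(doc).items()} for doc in docs]
```

A term occurring in every document gets weight zero, matching the intuition behind the low-IDF stop words above.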
3.4 Classifiers
At present, the most used machine learning algorithms are Naïve Bayes,
Decision Tree, Random Forest and Support Vector Machine (SVM). To study the
performance that different classifiers can achieve in classifying Chinese
judgment documents, after document representation and dimensional reduction,
we apply off-the-shelf machine learning algorithms to train a classification
model that can be used to classify an unseen record as belonging to one of
the predefined categories. In our work, we evaluate all of these most used
machine learning algorithms by training corresponding classifiers and
comparing their results.
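The paper applies off-the-shelf implementations; purely for illustration, one of the four classifiers, multinomial Naïve Bayes, can be sketched from scratch over bag-of-words documents (a teaching sketch, not the authors' setup):

```python
import math
from collections import Counter

class MultinomialNB:
    """Minimal multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, docs, labels):
        self.vocab = {w for doc in docs for w in doc}
        self.prior, self.word_counts, self.totals = {}, {}, {}
        for label in set(labels):
            class_docs = [d for d, y in zip(docs, labels) if y == label]
            # Log prior: fraction of training documents in this class.
            self.prior[label] = math.log(len(class_docs) / len(docs))
            counts = Counter(w for d in class_docs for w in d)
            self.word_counts[label] = counts
            self.totals[label] = sum(counts.values())
        return self

    def predict(self, doc):
        def score(label):
            s = self.prior[label]
            for w in doc:
                num = self.word_counts[label][w] + 1          # add-one smoothing
                den = self.totals[label] + len(self.vocab)
                s += math.log(num / den)
            return s
        return max(self.prior, key=score)
```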
4 Evaluation
To evaluate the performance of Chinese judgment documents classification, we
experiment on further separating judgments related to liabilities for product
quality into their specific industries. In this section, we introduce the experiment
dataset and evaluation metrics. At the end of this section, experiments and
results are presented.
4.1 Dataset
In order to train a classification model and test its accuracy and
performance, a gold standard dataset of Chinese judgment documents is
required. As no such dataset exists, we put substantial effort into manually
labeling 6735 pieces of judgment documents related to liabilities for product
quality into 13 categories based on the statutory standard of industry
division.
The hold-out method is utilized for dataset separation. To avoid introducing
extra errors when splitting datasets and to maintain the consistency of the
data distribution, we use stratified sampling rather than random sampling to
separate the training dataset and the test dataset. With stratified sampling,
the ratio of each category in the training dataset and the test dataset
remains the same. However, experiment results using a single hold-out split
may not be stable and reliable. Therefore, multiple stratified samplings are
used and the average of all results is reported to evaluate performance. As
for the ratio value, there exists a trade-off: the more samples the training
dataset contains, the closer the classification model is to the one trained
on the overall dataset, but the fewer samples the test dataset will contain,
which may lead to inaccurate and unstable results; conversely, the larger the
test dataset is, the smaller the training set will be, leading to a more
distorted classification model. No perfect solution exists for this problem;
a common way to tackle it is to use 2/3–4/5 of the samples as the training
dataset and the rest as the test dataset. In our work, we set the ratio to
70%. Figure 2 illustrates the distribution of our datasets for Chinese
judgment documents classification. Among all labeled judgment documents, 70%
is used for building the classification model and 30% for performance
evaluation.
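The stratified hold-out split described above can be sketched as follows (a minimal illustration, not the authors' code): each category is shuffled and cut independently, so the per-category ratio is identical in the training and test sets:

```python
import random
from collections import defaultdict

def stratified_split(samples, labels, train_ratio=0.7, seed=0):
    """Stratified hold-out: each category keeps the same train/test ratio
    (70%/30% here, as in the paper)."""
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append((sample, label))
    rng = random.Random(seed)
    train, test = [], []
    for items in by_label.values():
        rng.shuffle(items)
        cut = round(len(items) * train_ratio)
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test
```

Repeating this with different seeds and averaging the results corresponds to the multiple stratified samplings used in the evaluation.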
With stop words removal in text preprocessing, the number of feature
dimensions is reduced from 98750 to 68155. Figure 4 presents the overall
accuracies of classifiers with and without text preprocessing. Text
preprocessing removes a large amount of noise and thus provides a lower
dimensional feature vector. As Fig. 4 illustrates, the overall accuracy is
improved no matter which classifier is used: the overall accuracy of NB
improves by 0.89%, DT by 6.22%, RFC by 7.2%, and SVM by 1.14%.
Fig. 4. Overall accuracies of classifiers with text preprocessing and without text pre-
processing
Table 2. Time cost of classifiers with text preprocessing and without text preprocessing
RQ (2): What kind of features should be selected and how can we benefit
from dimensional reduction?
We reduce feature dimensions by utilizing three kinds of dimensional
reduction methods. Figure 3 illustrates that words with low document
frequency, which may not be meaningful for classification, occupy most of the
feature space.
Meanwhile, regarding the time cost of training each classifier, the time cost
of SVM decreases accordingly with the reduction of feature dimensions. For
the other classifiers, performance improves to a certain degree with PCA and
SVD. Taking both overall accuracy and time cost into consideration, SVM
achieves the best overall accuracy compared to the others but costs more time
in model training; with minimum document frequency plus PCA or SVD for
further dimensional reduction, better time performance can be achieved at a
small cost in accuracy.
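The minimum-document-frequency stage of this pipeline can be sketched as follows; PCA or truncated SVD would then be applied to the surviving features. This is an illustration of the idea, not the authors' implementation:

```python
from collections import Counter

def min_df_filter(docs, min_df=2):
    """First-stage dimensional reduction: drop every term whose document
    frequency is below min_df before applying PCA/TruncatedSVD."""
    df = Counter(w for doc in docs for w in set(doc))
    keep = {w for w, c in df.items() if c >= min_df}
    return [[w for w in doc if w in keep] for doc in docs]
```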
RQ (3): Which machine learning algorithm achieves better performance for
Chinese judgment documents classification?
Figure 6(a), (b), (c) illustrates the precision, recall and F1-score of each
category, respectively, when using different classifiers. In each figure, the
last cluster represents the average result. When sorting the classifiers by
average precision, SVM takes first place, RFC is better than DT, and DT is
better than NB. Sorting by average recall gives the same order.
Since the F1-score measures the balance of precision and recall, the higher
the F1-score, the better the performance. Therefore, as Fig. 6(c) presents,
for Chinese judgment documents classification the overall performance order
is: SVM > RFC > DT > NB. Drilling down to each category, the differences in
classifier performance can be traced back to the training dataset size of
each category, also known as the support value. For categories with larger
datasets, i.e. Pharmaceutical Industry and Chemical Industry, SVM achieves
better results than the others, whereas for smaller datasets RFC shows more
stable performance. Based on the experiment results, SVM achieves the best
performance with an F1-score of 87%.
Fig. 6. (a) Precisions, (b) recalls, and (c) F1-scores of each category using
different classifiers.
5 Conclusion
Approaches to automatically classify Chinese judgment documents utilizing a
wide variety of machine learning algorithms are explored in this paper.
Different from other documents, Chinese judgments follow a certain format; as
a result, for different classification goals, extracting related content from
the original judgments is necessary to remove unrelated information. To
improve the
performance of classification, first, we construct a list of judicial stop words by
statistically analyzing words that occur frequently across all documents as well as
words with least frequencies. Second, we utilize three different dimensional reduc-
tion methods to reduce feature dimensions while keeping as much information
as possible, which include minimum document frequency, PCA and Truncated
SVD. Results of experiments demonstrate the effectiveness of those dimensional
reduction approaches for improving the performance in Chinese judgments clas-
sification. Third, four machine learning algorithms are applied for document
classification, SVM achieves better performance compared to others, with aver-
age F1-score at 87%.
These methods can be easily applied to the classification of other judgments.
To realize a more open, just and functional judicial system, more
classification tasks on other kinds of judgments and deeper analysis of
judgments should be carried out. As a judgment could belong to multiple
classes, for
Acknowledgement. This work was supported by the Key Program of Research and
Development of China (2016YFC0800803), the National Natural Science Foundation,
China (No. 61572162, 61572251), the Fundamental Research Funds for the Central
Universities.
References
1. Aggarwal, C.C., Zhai, C.X.: An introduction to text mining. In: Mining Text Data,
pp. 1–10 (2012)
2. Strzalkowski, T.: Document representation in natural language text retrieval. In:
Proceedings of the Workshop on Human Language Technology, pp. 364–369 (1994)
3. Jiang, S., Lewris, J., Voltmer, M.: Integrating rich document representations for
text classification. In: Systems and Information Engineering Design Symposium
(SIEDS) (2016)
4. Liu, Y., Song, W., Liu, L.: Document representation based on semantic smoothed
topic model. In: International Conference on Software Engineering, Artificial Intel-
ligence, Networking and Parallel/Distributed Computing (SNPD) (2016)
5. Yang, S., Guo, J.: A novel approach for business document representation and
processing without semantic ambiguity in e-commerce. In: 6th IEEE Conference
on Software Engineering and Service Science (ICSESS) (2015)
6. Arguello, J., Elsas, J.L., Callan, J., Carbonell, J.G.: Document representation and
query expansion models for blog recommendation. In: Proceedings of the 2nd Inter-
national Conference on Weblogs and Social Media (ICWSM) (2008)
7. Berry, M.: Large-scale sparse singular value computations. Int. J. Supercomput.
Appl. 6(1), 13–49 (1992)
8. Blei, D., Lafferty, J.: Dynamic topic models. In: ICML, pp. 113–120 (2006)
9. Hofmann, T.: Probabilistic latent semantic analysis. In: UAI, p. 21 (1999)
10. Blei, D., Ng, A., Jordan, M.: Latent Dirichlet allocation. J. Mach. Learn. 3, 993–
1022 (2003)
11. Apte, C., Damerau, F., Weiss, S.: Automated learning of decision rules for text
categorization. ACM Trans. Inf. Syst. 12(3), 233–251 (1994)
12. Baker, L., McCallum, A.: Distributional clustering of words for text classification.
In: ACM SIGIR Conference (1998)
13. Drucker, H., Wu, D., Vapnik, V.: Support vector machines for spam categorization.
IEEE Trans. Neural Netw. 10(5), 1048–1054 (1999)
14. Ng, A.Y., Jordan, M.I.: On discriminative vs. generative classifiers: a
comparison of logistic regression and Naive Bayes. In: NIPS, pp. 841–848
(2001)
15. Sun, J.-T., Chen, Z., Zeng, H.-J., Lu, Y., Shi, C.-Y., Ma, W.-Y.: Supervised latent
semantic indexing for document categorization. In: ICDM Conference (2004)
16. Zhou, Z.: Machine Learning (2015)
17. Che, W., Li, Z., Liu, T.: LTP: a Chinese language technology platform. In: Pro-
ceedings of the COLING: Demonstrations, Beijing, China, pp. 13–16, August 2010