BERT-PLI: Modeling Paragraph-Level Interactions for Legal Case Retrieval
Yunqiu Shao¹, Jiaxin Mao¹, Yiqun Liu¹*, Weizhi Ma¹, Ken Satoh², Min Zhang¹ and Shaoping Ma¹
¹BNRist, DCST, Tsinghua University, Beijing, China
²National Institute of Informatics, Tokyo, Japan
yiqunliu@tsinghua.edu.cn
We fine-tune BERT on a sentence pair classification task with a small-scale legal case entailment dataset (Challenge 3). In practice, we employ the cascade framework to avoid high computational cost and arrange the model in a multi-stage pipeline for legal case retrieval. Specifically, we select top-K candidates according to the BM25 rankings and fine-tune BERT with an extra legal dataset before applying BERT-PLI to relevance prediction. Experiments are conducted on the COLIEE 2019 [Rabelo et al., 2019] legal case retrieval task, and the results demonstrate the effectiveness of our proposed method.

2 Related Work

A large number of retrieval models, especially for ad-hoc text retrieval, have been proposed in the past decades. Traditional bag-of-words IR models, including VSM [Salton and Buckley, 1988], BM25 [Robertson and Walker, 1994], and LMIR [Song and Croft, 1999], which are widely applied in search systems, are mostly based on term-level matching. Since the mid-2000s, LTR (Learning to Rank) methods [Liu and others, 2009], which are driven heavily by manual feature engineering, have been well studied and utilized by commercial web search engines as well.

In recent years, the development of deep learning has also inspired applications of neural models in IR. Generally, these methods can be categorized into two types [Xu et al., 2018]: methods of representation learning and methods of matching function learning. Based on the idea of representation learning, queries and documents are represented in a latent space by deep learning models, and the query-document relevance score is calculated from their latent representations with a vector space scoring function, e.g., cosine similarity. Various neural network models have been applied to this task. For instance, DSSM [Huang et al., 2013] utilized DNNs in the early stage. Further, some studies exploit CNNs to capture local interactions [Hu et al., 2014; Shen et al., 2014], while others apply RNNs to model text sequences [Palangi et al., 2016]. However, most of these model structures are not designed for representing long documents. In particular, it is difficult for CNN models to represent complex global semantic information, while RNN models tend to forget important signals when dealing with a long sequence. On the other hand, matching function learning methods first construct a matching matrix or capture local interactions based on word-level matching, and then use neural networks to discover high-level matching patterns and aggregate the final relevance score. Well-known models include ARC-II [Hu et al., 2014], MatchPyramid [Pang et al., 2016], and Match-SRNN [Wan et al., 2016]. Although these models work well for ad-hoc text retrieval, where the query is quite short, their performance is restricted in the scenario of legal case retrieval due to the quadratic time and memory complexity of constructing the whole interaction matrix. Since BERT [Devlin et al., 2018] reached state-of-the-art performance on 11 NLP tasks, pre-trained language models have drawn great attention in many fields related to NLP. Recently, some works have shed light on the application of BERT to ad-hoc retrieval [Dai and Callan, 2019; Yilmaz et al., 2019] by modeling evidence in query-sentence or query-passage pairs. As for legal case retrieval, it is still worth investigating how to utilize BERT to model the relationship between long case documents.

Finding relevant materials is fundamental in the legal field. In countries following the common law system, prior cases are one of the primary sources of law; thus, legal case retrieval is an important research topic. Meanwhile, legal case retrieval is challenging because it differs from ad-hoc retrieval in various aspects, including the definition of relevance in law [Van Opijnen and Santos, 2017] and the characteristics of legal documents, such as document length, professional legal expressions, and the logical structures behind natural language [Turtle, 1995]. A variety of approaches as well as expert knowledge are involved in this task [Bench-Capon et al., 2012], e.g., logical analysis, lexical matching, and distributed representations. For instance, decomposition of legal issues [Zeng et al., 2005], ontological frameworks [Saravanan et al., 2009], and link analysis [Monroy et al., 2013] have been explored. Generally speaking, methods can be grouped into two broad categories: those based on manual knowledge engineering (KE) and those based on natural language processing (NLP) [Maxwell and Schafer, 2008]. Recent competitions, such as COLIEE, are mostly aimed at exploring the application of NLP-based methods and providing benchmarks for the legal case retrieval task. In COLIEE 2019, both traditional retrieval models and neural models were explored in the legal case retrieval task. Specifically, [Tran et al., 2019] combined distributed representations with lexical matching features via LTR algorithms, while [Rossi and Kanoulas, 2019] utilized BERT with the help of automatic summarization algorithms.
3 Method

3.1 Task Description
The legal case retrieval task involves finding prior cases that should be "noticed" with respect to a given query case in a set of candidate cases [Rabelo et al., 2019]. "Noticed case" is a legal technical term denoting that a precedent is relevant to a query case; in other words, it supports the decision of the query case. Formally, given a query case $q$ and a set of candidate cases $D = \{d_1, d_2, \ldots, d_n\}$, the task is to identify the supporting cases $D^* = \{d_i^* \mid d_i^* \in D \wedge \mathrm{noticed}(d_i^*, q)\}$, where $\mathrm{noticed}(d_i^*, q)$ denotes that $d_i^*$ should be noticed given the query case $q$. Both the query and the candidates are legal documents containing long texts, which consist of the facts in a case.

3.2 Architecture Overview
In general, we deal with the legal case retrieval task within a multi-stage pipeline inspired by the cascade framework. As illustrated in Figure 1, it consists of three stages. In Stage 1, we select top-K candidates from the initial candidate corpus with respect to the query case $q$ according to BM25 scores. In Stage 2, we fine-tune the BERT model on a sentence pair classification task with a legal case entailment dataset in order to adapt it to modeling semantic relationships between legal paragraphs. In the final stage, BERT-PLI conducts relevance prediction with the fine-tuned BERT (Stage 2) among the selected candidates (Stage 1).
[Figure 1: Overview of the proposed framework. Stage 1: BM25 selection of top-K candidates from the corpus for a query case. Stage 2: BERT fine-tuning on sentence pair classification of decision/candidate paragraphs with entailment labels. Stage 3: paragraph-level interaction, where BERT encodes each query/candidate paragraph pair $(p_{qi}, p_{kj})$ into an interaction map $C_{ij}$, followed by max-pooling, an RNN, attention, and aggregation & prediction of the label $\hat{y}$.]
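To make the control flow concrete, the following minimal Python sketch shows how the three stages compose. It is an illustration only: the helper callables `bm25_rank` and `pli_score` are hypothetical placeholders rather than names from our implementation, and Stage 2 (fine-tuning BERT on the entailment data) is assumed to happen once, offline, baked into `pli_score`.

```python
from typing import Callable, List, Tuple

def retrieve(query_case: str,
             corpus: List[str],
             bm25_rank: Callable[[str, List[str]], List[str]],
             pli_score: Callable[[str, str], float],
             k: int = 50) -> List[Tuple[str, float]]:
    """Cascade retrieval: cheap BM25 pruning, then BERT-PLI scoring."""
    # Stage 1: keep only the top-k BM25 candidates so that the
    # expensive BERT forward passes run on k documents, not the corpus.
    candidates = bm25_rank(query_case, corpus)[:k]
    # Stage 3: score each surviving candidate with BERT-PLI.
    scored = [(doc, pli_score(query_case, doc)) for doc in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The point of the cascade is that the expensive Stage 3 scoring touches only K documents per query instead of the whole candidate set.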
3.3 Stage 1: BM25 Selection
Deep learning models are usually time-consuming and resource-consuming. Considering the computational cost, we employ the cascade framework, which utilizes BM25 to prune the set of candidates. The BM25 model is implemented according to the standard scoring function [Robertson and Walker, 1994]. This stage inevitably hurts both recall and precision; we mainly pay attention to optimizing recall at this stage, since the downstream models can further discard irrelevant documents.
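For reference, below is a self-contained sketch of the standard Okapi BM25 scoring function [Robertson and Walker, 1994]. The parameter values k1 = 1.2 and b = 0.75 are common defaults and an assumption here, since the paper does not report the exact settings.

```python
import math
from collections import Counter
from typing import List

def bm25_score(query: List[str], doc: List[str],
               corpus: List[List[str]],
               k1: float = 1.2, b: float = 0.75) -> float:
    """Okapi BM25 score of one tokenized document for a tokenized query.

    Recomputing corpus statistics per call is wasteful; a real system
    would precompute document frequencies and average length once.
    """
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    tf = Counter(doc)
    score = 0.0
    for term in set(query):
        df = sum(1 for d in corpus if term in d)
        if df == 0 or tf[term] == 0:
            continue
        # Robertson-Sparck Jones idf, with +1 smoothing to keep it
        # non-negative for very common terms.
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)
        denom = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[term] * (k1 + 1) / denom
    return score
```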
3.4 Stage 2: BERT Fine-tuning
Fine-tuning is relatively inexpensive compared with the pre-training procedure, which allows BERT to model specific tasks with small datasets [Devlin et al., 2018]. Therefore, before applying BERT to infer the relationship between case paragraphs, we fine-tune it on a small-scale legal case entailment dataset provided by COLIEE 2019 Task 2 (Challenge 3). This task involves identifying the paragraphs that entail the decision paragraph of a query case from a given relevant case. Fine-tuning on this task enables BERT to infer the supportive relationships between paragraphs, which is useful for the legal case retrieval task.

We fine-tune all parameters of BERT on a sentence pair classification task in an end-to-end fashion. The input is composed of the decision paragraph of a query case and a candidate paragraph in the relevant case. The text pair is separated by the [SEP] token, and a [CLS] token is prepended to the text pair. As for the output, we feed the final hidden state vector corresponding to the first input token ([CLS]) into a classification layer. In this task, we use a fully-connected layer to perform binary classification, optimizing the cross-entropy loss $-(y \log \hat{y} + (1 - y) \log(1 - \hat{y}))$.
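A minimal sketch of one fine-tuning step is shown below. The paper does not name a library; the Hugging Face `transformers` API here is our assumption, and the paragraph texts are toy examples.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

# Toy paragraph pair; the real inputs are a decision paragraph and a
# candidate paragraph from a relevant case (COLIEE 2019 Task 2).
decision_par = "The appeal is dismissed because ..."
candidate_par = "The applicant argued that ..."

# The tokenizer inserts [CLS] and [SEP] and truncates to 512 tokens.
inputs = tokenizer(decision_par, candidate_par,
                   truncation=True, max_length=512, return_tensors="pt")
labels = torch.tensor([1])  # 1 = the candidate entails the decision

optimizer.zero_grad()
outputs = model(**inputs, labels=labels)  # cross-entropy computed inside
outputs.loss.backward()
optimizer.step()
```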
3.5 Stage 3: BERT-PLI
To tackle the challenge brought by long and complex documents, we first break a document into paragraphs (Challenge 1) and model the interactions between paragraphs at the semantic level (Challenge 2). In Figure 1 (Stage 3), the query $q$ and one of the candidate documents $d_k$ are represented by paragraphs, denoted as $q = (p_{q1}, p_{q2}, \ldots, p_{qN})$ and $d_k = (p_{k1}, p_{k2}, \ldots, p_{kM})$, where $N$ and $M$ denote the total numbers of paragraphs in $q$ and $d_k$, respectively. For each paragraph in $q$ and $d_k$, we construct a paragraph pair $(p_{qi}, p_{kj})$, where $1 \le i \le N$ and $1 \le j \le M$, as the input of BERT, along with the reserved tokens (i.e., [CLS] and [SEP]). The final hidden state vector of the token [CLS] is viewed as the aggregate representation of the input paragraph pair. In that way, we obtain an interaction map of paragraphs, in which each component $C_{ij} \in \mathbb{R}^{H_B}$ models the semantic relationship between the query paragraph $p_{qi}$ and the candidate paragraph $p_{kj}$.

For each paragraph of the query, we capture the strongest matching signals with the candidate document using the max-pooling strategy, and hence get a sequence of vectors, denoted as $p'_{qk} = [p'_{qk1}, p'_{qk2}, \ldots, p'_{qkN}]$. Each component $p'_{qki}$ aggregates the interactions of all the paragraphs in the candidate document with the $i$-th query paragraph as follows:

$$p'_{qki} = \mathrm{MaxPool}(C_{i1}, C_{i2}, \ldots, C_{iM}), \quad p'_{qki} \in \mathbb{R}^{H_B}. \tag{1}$$

We use an RNN structure to further encode the representation sequence. Assuming that a legal document always follows a certain reasoning order, we consider the forward RNN in this model.
In the forward pass, the RNN generates a sequence of hidden states

$$h_{qk} = [h_{qk1}, h_{qk2}, \ldots, h_{qkN}], \quad h_{qki} \in \mathbb{R}^{H_R}, \tag{2}$$

where $h_{qki} = \mathrm{RNN}(h_{qk(i-1)}, p'_{qki})$ is generated by an LSTM or a GRU in practice.

An attention mechanism is also employed to infer the importance of each position. The attention weight of each position is measured by

$$\alpha_{qki} = \frac{\exp(h_{qki} \cdot u_{qk})}{\sum_{i'} \exp(h_{qki'} \cdot u_{qk})}, \tag{3}$$

where $h_{qki}$ is the $i$-th hidden state given by the forward RNN and $u_{qk}$ is generated as follows:

$$u_{qk} = W_u \cdot \mathrm{MaxPool}(h_{qk}) + b_u, \tag{4}$$

where $W_u \in \mathbb{R}^{H_R \times H_R}$ and $b_u \in \mathbb{R}^{H_R}$. We can then get the document-level representation via attentive aggregation:

$$d_{qk} = \sum_{i=1}^{N} \alpha_{qki} \, h_{qki}, \quad d_{qk} \in \mathbb{R}^{H_R}, \tag{5}$$

which is fed into a fully-connected layer to predict the relevance label $\hat{y}$.
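Equations (1)-(5) map onto a compact PyTorch module. The sketch below is our own reading (batch size 1, a pre-computed interaction map of [CLS] vectors); the class name and layer wiring are assumptions, not the released code.

```python
import torch
import torch.nn as nn

class BertPliAggregator(nn.Module):
    """Aggregates a BERT interaction map into relevance logits.

    Expects `interaction_map` of shape (N, M, H_B): the [CLS] vectors
    of all N x M query/candidate paragraph pairs.
    """
    def __init__(self, h_b: int = 768, h_r: int = 256):
        super().__init__()
        self.rnn = nn.GRU(input_size=h_b, hidden_size=h_r, batch_first=True)
        self.w_u = nn.Linear(h_r, h_r)       # Eq. (4): W_u and b_u
        self.classifier = nn.Linear(h_r, 2)  # binary relevance prediction

    def forward(self, interaction_map: torch.Tensor) -> torch.Tensor:
        # Eq. (1): max-pool over candidate paragraphs -> (N, H_B).
        pooled, _ = interaction_map.max(dim=1)
        # Eq. (2): forward RNN over the query-paragraph sequence.
        h, _ = self.rnn(pooled.unsqueeze(0))          # (1, N, H_R)
        # Eq. (4): u_qk from the max-pooled hidden states.
        u = self.w_u(h.max(dim=1).values)             # (1, H_R)
        # Eq. (3): dot-product attention weights over positions.
        alpha = torch.softmax((h * u.unsqueeze(1)).sum(-1), dim=1)  # (1, N)
        # Eq. (5): attentive aggregation to a document representation.
        d = (alpha.unsqueeze(-1) * h).sum(dim=1)      # (1, H_R)
        return self.classifier(d)                     # logits for {0, 1}
```

With the settings reported in Section 4 (N = 54, M = 40, H_B = 768, H_R = 256), the input map would have shape (54, 40, 768).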
4 Experiments

4.1 Dataset
Our experiments are conducted on the dataset of COLIEE 2019, which consists of predominantly Federal Court of Canada case laws. Table 1 gives a statistical summary of the raw datasets in the two tasks. Task 1 is the main focus of our work, while the data of Task 2 are used to fine-tune BERT in Stage 2.

    Task 1                            Train    Test
    # query cases                     285      61
    # candidate cases / query         200      200
    # noticed cases / query           5.21     5.41

    Task 2                            Train    Test
    # query cases                     181      44
    # candidate paragraphs / query    32.12    32.91
    # entailing paragraphs / query    1.12     1.02

Table 1: Summary of datasets in COLIEE 2019 Task 1 and Task 2.

We follow the evaluation metrics used in the competition: micro-averages of precision, recall, and F1.
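As a concrete reading of these metrics, the sketch below pools counts over all queries before computing precision and recall; this interpretation of micro-averaging in the set-retrieval setting is ours, not quoted from the competition rules.

```python
from typing import Dict, Set, Tuple

def micro_prf(predicted: Dict[str, Set[str]],
              relevant: Dict[str, Set[str]]) -> Tuple[float, float, float]:
    """Micro-averaged P/R/F1: pool true positives over all queries."""
    tp = sum(len(predicted[q] & relevant[q]) for q in relevant)
    n_pred = sum(len(predicted[q]) for q in relevant)
    n_rel = sum(len(relevant[q]) for q in relevant)
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_rel if n_rel else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```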
4.2 Implementation Details
The case documents are provided in English and French versions of description, and only the English one is considered in our experiments. As illustrated in Figure 1 (Stage 1), we select candidates according to BM25 scores. We set K = 50 (top 50 candidates for each query), considering both the recall and the effectiveness of the following stages. Note that, according to the problem definition, only cases that happened before the current case can be noticed; candidates that are invalid in terms of time are therefore dismissed, as they would otherwise introduce noise. The recall of Stage 1 on the training and testing sets is 0.9159 and 0.9273, respectively. The relatively high recall suggests that setting K = 50 in Stage 1 does little harm to the overall retrieval performance.
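A minimal sketch of this Stage 1 selection, with the temporal filter applied before the top-K cut; the `decision_date` field is a hypothetical stand-in for however case dates are stored in the collection.

```python
from datetime import date
from typing import List, Tuple

def stage1_candidates(query_date: date,
                      bm25_ranked: List[Tuple[str, date]],
                      k: int = 50) -> List[str]:
    """Top-k selection with the temporal validity filter.

    `bm25_ranked` holds (case_id, decision_date) pairs already sorted
    by descending BM25 score.
    """
    # Only cases decided before the query case can be "noticed".
    valid = [case_id for case_id, d in bm25_ranked if d < query_date]
    return valid[:k]
```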
In Stage 2, paragraph pairs are constructed using the decision and candidate paragraphs in Task 2. The paragraphs have no more than 100 words on average, and we truncate the words symmetrically if a pair exceeds the maximum input length of BERT. We use the uncased base version of BERT.¹ At first, it is fine-tuned on the training data and tested on the remaining testing data. We use the Adam optimizer and set the learning rate to $10^{-5}$. The fine-tuning procedure converges after three epochs, and F1 on the test set reaches 0.6526, which is better than the results of BM25 on this task. After that, we keep the same hyperparameter settings but merge the training and testing data to fine-tune BERT for 3 epochs from scratch.
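One common reading of symmetric truncation, following the pair-truncation loop in the reference BERT implementation (trim whichever side is currently longer until the pair fits), is sketched below; whether our implementation used exactly this loop is an assumption.

```python
from typing import List, Tuple

def truncate_pair(tokens_a: List[str], tokens_b: List[str],
                  max_tokens: int = 509) -> Tuple[List[str], List[str]]:
    """Trim the longer side of a pair until it fits BERT's input.

    509 = 512 minus the [CLS] and two [SEP] special tokens.
    """
    while len(tokens_a) + len(tokens_b) > max_tokens:
        longer = tokens_a if len(tokens_a) >= len(tokens_b) else tokens_b
        longer.pop()  # drop from the tail of the longer paragraph
    return tokens_a, tokens_b
```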
In Stage 3, the paragraph segmentation given by the original documents is adopted. Similarly, if a paragraph pair exceeds the maximum input length, we simply truncate the texts. We set N = 54 and M = 40, which covers 3/4 of the paragraphs in most query and candidate cases. In BERT-PLI, $H_B = 768$, which is determined by the size of the BERT hidden vector. As for the RNN, $H_R$ is set to 256, and only one hidden layer is used for both LSTM and GRU. The training set is split into two parts: 20% of the queries from the training set, together with all of their candidates, are treated as the validation set. We train the model on the remaining training data for no more than 60 epochs and select the best model in the training process according to the F1 measure on the validation set. During training, we use the Adam optimizer and set the initial learning rate to $10^{-4}$ with a weight decay of $10^{-6}$.
As for the baseline methods, we use a bigram language model with linear smoothing in LMIR, while VSM and BM25 scores are calculated with the standard scoring functions. ARC-II and MatchPyramid are implemented with MatchZoo [Guo et al., 2019]. We train ARC-II and MatchPyramid in a pairwise way, since pairwise training usually outperforms pointwise training. The input length is also restricted due to memory and time limits: we truncate both the query and the candidate document to 256 words, which is consistent with the restriction of BERT. Meanwhile, we use TextRank² to generate a summary with a 256-word length limit for each case. The JNLP group also provides the summary encoding and the lexical features, and we further conduct experiments based on the provided features using RankSVM.³

4.4 Results and Analysis
All of the models employ the cascade framework except BM25; in other words, these models conduct further ranking or classification based on the dataset $D_1$ selected in Stage 1. In particular, all of the deep learning models are trained on the same training data and selected according to F1 on the validation data. For the ranking models, we consider the top 5 results as the relevant ones⁴, while for the classification models, we simply use the label given by the model.

    Model                  Precision    Recall    F1
    BM25                   0.5100       0.4636    0.4867
    VSM                    0.4833       0.4394    0.4603
    LMIR                   0.5467       0.4970    0.5306
    ARC-II                 0.1967       0.1788    0.1873
    MP                     0.1933       0.1758    0.1841
    ARC-II-sum             0.1900       0.1727    0.1810
    MP-sum                 0.2233       0.2030    0.2127
    BERT-PLI (LSTM)        0.5931       0.5697    0.5812
    BERT-PLI (GRU)         0.6026       0.5697    0.5857
    BERT-PLI-org (LSTM)    0.5278       0.4606    0.4919
    BERT-PLI-org (GRU)     0.4958       0.5364    0.5153

Table 2: Performance of traditional retrieval models, deep retrieval models, and BERT-PLI on the test set, measured by micro-average of precision, recall, and F1. MP abbreviates MatchPyramid; "-sum" denotes a model that uses the generated summary as input; "-org" denotes a model without Stage 2, i.e., with the original BERT parameters.

Table 2 shows the performance on the whole test set. In the BERT-PLI model, we use LSTM and GRU as the RNN layer, respectively. The LSTM and GRU versions achieve similar performance here, and both outperform the baseline methods, including the traditional retrieval models and the deep learning ones, by a large margin. The structure of BERT-PLI is able to take the whole case document into consideration. At the same time, it has a better semantic understanding ability than the bag-of-words IR models, with the help of BERT and its sequence modeling components.

Among the baseline retrieval methods, deep learning retrieval models perform much worse than the traditional retrieval models on this task. Since these deep learning models are mostly designed for the ad-hoc scenario, it is hard for them to handle long-text retrieval: the length of the input is restricted in these models, and the first 256 words are not enough to represent a document well, so it is not surprising that they perform poorly. In addition to truncation, we also attempt to utilize automatic summarization techniques to shorten the input. However, this does not result in stable improvements: even taking the generated summary as the input, the deep models still underperform. Given that legal documents contain plenty of information and complex logic, it is hard to express them well in a much shorter summary.
¹ https://github.com/google-research/bert
² https://radimrehurek.com/gensim/
³ https://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html
⁴ The threshold of 5 is widely used by teams in the competition and is reasonable considering that there are about 5 relevant cases per query on average.