
Knowledge-Based Systems 195 (2020) 105738


A novel random forest approach for imbalance problem in crime linkage✩

Yu-Sheng Li, Hong Chi, Xue-Yan Shao∗, Ming-Liang Qi, Bao-Guang Xu

Institutes of Science and Development, Chinese Academy of Sciences, Beijing 100190, China
School of Public Policy and Management, University of Chinese Academy of Sciences, Beijing 100049, China

✩ No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.knosys.2020.105738.
∗ Corresponding author at: Institutes of Science and Development, Chinese Academy of Sciences, Beijing 100190, China. E-mail address: xyshao@casisd.cn (X.-Y. Shao).

Article history: Received 30 September 2019; Received in revised form 1 March 2020; Accepted 4 March 2020; Available online 9 March 2020.

Keywords: Crime linkage; Classification; Class imbalance; Information granule; Random forest

Abstract: Crime linkage is a challenging task in crime analysis whose aim is to find serial crimes committed by the same offenders. It can be regarded as a binary classification task that detects serial case pairs. However, most case pairs in the real world are nonserial, so there is a serious class imbalance in crime linkage. In this paper, we propose a novel random forest based on the information granule. The approach does not resample the minority class or the majority class but concentrates on indistinguishable case pairs at the classification boundary. The information granule is used to identify case pairs that are difficult to distinguish in the dataset and constructs a nearly balanced dataset in the uncertainty region to deal with the imbalanced problem. In the proposed approach, random trees come from the original dataset and the above-mentioned nearly balanced dataset. A real-world robbery dataset and some public imbalanced datasets are employed to measure the performance of the approach. The results show that the proposed approach is effective in dealing with class imbalance, and it can be extended to combine with other methods solving class imbalances.

© 2020 Elsevier B.V. All rights reserved.

1. Introduction

Serial crimes refer to two or more crimes committed by the same criminal or the same group of criminals [1]. Serial crimes account for a large proportion of the total number of crimes [2]. Therefore, in some countries, governments require the police to focus on investigating serial offenders, who are the minority of offenders committing the majority of crimes [3]. However, the traditional method of crime analysis relies on manual processing, which is time-consuming and labor-intensive and depends on the memory of the crime analyst [4]. It is also easy to make mistakes by relying on human memory to analyze. Therefore, an intelligent detection algorithm for serial crimes is very important for concentrating police forces, improving the efficiency of identifying serial crimes and maintaining social security.

Serial crime identification relies on three assumptions. The first assumption is that criminal behaviors are consistent, the second assumption is that criminals exhibit some distinctiveness in their behaviors, and the third assumption is that criminal behaviors can be observed, measured and accurately recorded [5]. Based on these assumptions, behavioral similarity has been successfully used to detect serial crimes, and it can be applied not only to serious crimes but also to high-volume crimes such as theft and robbery [6]. Crime linkage (CL) refers to the identification of crimes committed by the same suspect based on behavioral similarity [3]; it is also called linkage analysis or case linkage [7]. It has been applied to many types of crime, such as serial sexual offenses [8,9], serial robbery [10,11] and serial burglary [12,13]. In crime linkage, the similarity of criminal behaviors is generally characterized by the modus operandi (M.O.). The M.O. refers to behaviors committed during an offense that serve to ensure its completion while also protecting the perpetrator's identity and facilitating escape following the offense [14]. The criminal's M.O. is represented by a series of attributes, whose values are filled in according to the crime investigation.

Many researchers regard CL as a classification problem, and machine learning algorithms are favored. Every two crimes form a crime pair, and the task of CL is to judge whether a crime pair is a serial crime pair or not. In the real world, serial crime pairs are always fewer than nonserial crime pairs, which is called the class imbalance problem [15]. When a dataset is imbalanced, the class imbalance impedes algorithms from learning the minority class patterns [16], and the minority is submerged in the majority class samples. In this case, minority samples are easily misclassified. This problem also exists in other areas such as payment card fraud [17], medical diagnosis [18], etc. The samples on the borderline and the ones

Table 1
Notation description.

Notation | Description
$n$ | Number of crimes
$N$ | Number of crime pairs, $N = n(n-1)/2$
$p_{ij}$ | The crime pair consisting of crimes $i$ and $j$, where $i, j \in \{1, 2, \ldots, n\}$, $i < j$
$m$ | The feature attribute of a crime, $m \in [1, M]$
$a_i^m$ | The value of attribute $m$ of crime $i$
$a_i$ | The attribute vector of crime $i$, $a_i = (a_i^1, \ldots, a_i^m, \ldots, a_i^M)$
$a^m$ | The attribute $m$
$sim^m$ | The similarity measure of attribute $m$
$S_{ij}^m$ | The similarity value of the crime pair $p_{ij}$ on attribute $m$
$S_{ij}$ | The similarity vector of the crime pair $p_{ij}$, where $S_{ij} = (S_{ij}^1, \ldots, S_{ij}^m, \ldots, S_{ij}^M)$
$y_{ij}$ | The label of whether crimes $i$ and $j$ belong to a series, $y_{ij} \in \{0, 1\}$

nearby are more likely to be misclassified, so many algorithms attempt to distinguish them for better prediction.

Serial crimes differ from nonserial crimes in terms of behavioral similarity [19], but some crime pairs are still indistinguishable. There have been some studies on the separability of similarity between serial and nonserial crimes [20]. In this paper, we concentrate on indistinguishable crime pairs near the borderline, which we call the uncertainty area. We increase the algorithm's learning of uncertain areas to reduce classification errors. Information granules can play a role in finding uncertain regions, so we use them to construct the shape of the uncertainty area of crime pairs and generate a nearly balanced dataset. Ensemble learning methods are composed of multiple classifiers, which makes it possible to increase the learning of specific patterns by changing the area they learn. Random forest [21] is an ensemble learning method for classification; it constructs multiple decision trees during training, which corrects the over-fitting habit of a single decision tree. The random forest has been considered one of the most accurate classifiers [22] and has the ability to handle class-imbalanced data when the imbalance ratio is not too high [23]. In this paper, we increase the learning of indistinguishable crime pairs based on the random forest to mitigate the effects of class imbalance in crime linkage. Some random trees are learned from the mentioned nearly balanced dataset and combined with the standard random forest. In summary, the contributions of our study are described as follows: (1) We apply an information granule to find indistinguishable crime pairs and determine the uncertainty area of the crime dataset. (2) A novel random forest based on the information granule (IGRF) is proposed to solve the class imbalance in crime linkage. (3) We collect a serial crime dataset from the real world and evaluate the uncertainty area found by IGRF; the classification performance of IGRF is also compared with other excellent algorithms.

The structure of the paper is organized as follows: Section 2 presents a brief summary and review of relevant research. In Section 3, we describe crime linkage and its class imbalance problem. Section 4 summarizes the methodology applied to detect serial robbery crimes and introduces our proposed IGRF approach, including the construction of the information granule and the process of the modified random forest. In Section 5, we employ a real-world robbery crime dataset of a city of China and some public imbalanced datasets; some evaluation criteria and the analysis of the experimental results are also presented in this section. Conclusions and future research are discussed in Section 6.

2. Literature review

2.1. Crime linkage process and algorithms

Criminal behaviors in crime linkage are characterized by different attributes, so behavioral similarity measurement can be transformed into the calculation of the similarity of various attributes. Previous research has summed up many attributes that reflect criminal behaviors, including numerical attributes, categorical attributes, and keyword attributes. For numerical attributes, absolute distance is frequently used [24]. Pérez-Hernández et al. proposed a two-level methodology based on deep learning to analyze small objects' similarity in shape and in the way they are handled [25]. Categorical attributes are usually binarized, and then binary similarity measures are applied to calculate their similarity [26]. Bennell et al. used Jaccard's coefficient to measure the similarity of categorical attributes of criminal behaviors [27]. Some attribute values of criminal behavior come from the keywords in the narrative of the crime. Only when the true meaning of words is understood are the comparison results of words credible. In this paper, we use word2vec to calculate the similarity of keyword attributes [28].

After completing the measurement of the similarity of criminal behaviors, an algorithm can be used to associate crimes. In recent years, machine learning algorithms have been favored by crime linkage researchers. Previous crime linkage studies have involved both supervised and unsupervised algorithms. In general, a supervised approach treats crime linkage as a binary classification task. This classification task is not to classify crimes but crime pairs, and the attributes' similarity of crime pairs is used as the classification basis for crime linkage. Several supervised learning algorithms have been applied to crime linkage, including neural networks [11], logistic regression [29,30], decision trees [31], Bayesian classification [32], etc. Researchers use unsupervised methods to identify all serial crimes rather than serial crime pairs, including various clustering algorithms [33], outlier detection [34], the Restricted Boltzmann Machine (RBM) [35], etc. In addition, some scholars have applied semi-supervised algorithms [13] and fuzzy multi-criteria decision making [36,37] to associate crimes.

Supervised algorithms can learn domain knowledge from resolved crimes, and they usually perform better than unsupervised and semi-supervised algorithms. Therefore, we use a supervised algorithm for crime linkage in this paper. Since the number of serial crime pairs is always smaller than that of nonserial crime pairs, there is a class imbalance problem in classification. In recent years, a few studies have mentioned the imbalanced problem in crime linkage, but they have not given effective solutions. Borg et al. noted that the imbalance problem in serial crimes always exists [2]. Tonkin and Woodhams et al. mentioned that the imbalance problem could cause many decision-making errors, but no solution was provided [9,38,39].

2.2. Dealing with class imbalance

Standard machine learning algorithms are suitable for balanced training sets, but imbalanced datasets have issues such as class overlapping and missing rare patterns in the minority class, which make algorithms often provide suboptimal classification results [40].
Fig. 1. The framework of our methodology.

There is already some software to deal with class imbalance [41]. Class imbalance is a common problem in classification, and a large amount of literature has been devoted to solving this problem [42]. In general, methods to address this problem can be grouped into two levels: data-level methods and algorithm-level methods. Data-level methods solve the imbalanced problem by adding minority samples (e.g., SMOTE), reducing majority samples (e.g., ENN) or combining the above two sampling methods (e.g., SMOTEENN). Adding minority samples may cause other unexpected mistakes, and reducing majority samples has higher efficiency but may lose some patterns. Algorithm-level methods improve classification by modifying algorithms, for example through cost-sensitive learning [43]. However, it may be difficult to determine the accurate misclassification cost when applying the cost-sensitive learning method to imbalanced problems.

Some studies have improved the random forest to adapt to the class imbalance problem. Chen and Breiman proposed the balanced random forest (BRF) by changing the sampling scheme of the standard random forest, and proposed the weighted random forest (WRF) by following the idea of cost-sensitive learning [44]. BRF is computationally more efficient with large imbalanced data since each tree only uses a small portion of the training set to grow. WRF assigns a weight to the minority class, possibly making it more vulnerable to noise (mislabeled classes). Bader-El-Den et al. applied the nearest neighbor algorithm to identify critical areas around minority samples; some random trees were generated from the critical areas and then fed into the standard random forest, and the method was called BRAF [45]. BRAF is an efficient algorithm for the imbalanced problem, but its parameters are difficult to set. O'Brien and Ishwaran introduced RFQ, based on the ratio of data densities, for learning imbalanced data [46]. It is better at selecting variables across imbalanced data, but it is time-consuming. There are also some studies that combine ensemble learning with resampling methods. Sun et al. proposed an effective decision tree ensemble model named DTE-SBD, which combines SMOTE and Bagging with differentiated sampling rates to solve the class imbalance; since different sampling rates are used at different iteration times, the diversity of the basic classifiers is increased [47]. Sun et al. also proposed an embedding integration model of SMOTE with ADASVM-TW, which embeds SMOTE into the iteration of ADASVM-TW and creatively designs a new sample weighting mechanism; depending on whether the samples are new and difficult, different resampling strategies are applied to the samples to make the training dataset class-balanced [48]. Samples around the borderline are more likely to be misclassified than samples far from the borderline. To achieve better prediction, many classification algorithms are devoted to finding the class borderline during

training. Some researchers have considered the importance of samples in the boundary area when oversampling. Han, Wang, and Mao proposed two new oversampling methods, borderline-SMOTE1 and borderline-SMOTE2 [49]. Unlike traditional SMOTE, the two methods only oversample the minority samples near the borderline. Some studies use the support vector machine (SVM) to find samples near the borderline and then oversample them [50,51]. These methods perform resampling in areas that are easily misclassified, which can avoid the unexpected errors caused by resampling all samples. Based on this idea, we find samples near the borderline and construct a nearly balanced dataset without synthesizing new samples. Some random trees are generated from the new dataset and combined with the standard random forest to form the final classifier to address the class imbalance in crime linkage.

3. Problem formulation

As stated in the introduction, CL is to determine crimes committed by the same offender. In general, the similarity of the offender's modus operandi between two crimes is a crucial way to detect serial crimes; the offender's modus operandi is characterized by different attributes. Given a set of crimes $C$, CL can be regarded as a process of identifying whether a crime pair $(c_i, c_j) \in C$ was committed by the same offender. The attributes' similarity can be calculated by different similarity measures. Table 1 lists some notations that are used in this section.
Crime attributes include three types of contents: numeric
Definition 1 (Crime Pair). Given two crimes $c_i$ and $c_j$ represented by $M$ attributes $a_i^1, \ldots, a_i^m, \ldots, a_i^M$, they comprise a crime pair represented by $p_{ij}$. If there are $n$ crimes, then the number of crime pairs will be $n(n-1)/2$.

Definition 2 (Similarity Measure). Each attribute similarity $s_{ij}^m$ of a crime pair $p_{ij}$ is quantified by a similarity measure $sim^m$. This computing process can be represented as $sim^m(a_i^m, a_j^m)$, where $a_i^m$ and $a_j^m$ represent the values of the attribute $a^m$ in crimes $c_i$ and $c_j$ respectively.

Every attribute's similarity value ranges between 0 and 1. A value of 0 indicates that there is no similarity on this attribute in the crime pair according to the similarity measure; a value of 1 indicates that this crime pair has exactly the same values on this attribute.

Definition 3 (Similarity Vector). Given two crimes $c_i$ and $c_j$ represented by $M$ attributes, the similarity of these attributes is calculated by different similarity measures $sim^m$. The similarity of all attributes in crime pair $p_{ij}$ is $s_{ij}^1, \ldots, s_{ij}^m, \ldots, s_{ij}^M$; a similarity vector for the crime pair $p_{ij}$ is defined as:

$$S_{ij} = \langle s_{ij}^1, \ldots, s_{ij}^m, \ldots, s_{ij}^M \rangle \tag{1}$$

The goal of CL is to classify every similarity vector $S_{ij}$ into two classes which have different labels: serial crimes ($y_{ij} = 1$) and nonserial crimes ($y_{ij} = 0$).

Definition 4 (Crime Linkage). Given labeled crime pairs $T = \{S_{ij}, y_{ij}\}_{u \times (M+1)}$, including serial crimes and nonserial crimes, as the training dataset, and unlabeled crime pairs $V = \{S_{ij}\}_{v \times M}$ (usually $v \ll u$), crime linkage aims to train classifiers on $T$ and then apply them to classify the unlabeled crime pairs in $V$.

Definition 5 (The Class Imbalance Problem in Crime Linkage). The imbalance in CL is reflected in the fact that the number of nonserial crime pairs is far more than that of serial crime pairs. Given the training dataset $T$, the label for most crime pairs is $y_{ij} = 0$.
Due to the class imbalance in crime linkage, the number of serial crime pairs is small and some rare patterns are easily neglected. To reduce the misclassification of the minority samples, a novel algorithm, IGRF, is proposed. The goal of this paper is to find the uncertainty area of indistinguishable crime pairs and strengthen the ability to identify them. We aim to improve the detection performance of serial crimes by the IGRF.

4. Methodology

In this paper, a novel random forest based on the information granule is proposed. The process of using it for detecting serial crimes is provided in Fig. 1.

The first step is to select similarity measures to calculate the similarity vector of every crime pair. There are several attribute types, and suitable similarity measures are employed for different attributes.

The second step is to construct an information granule and find the uncertainty area containing indistinguishable case pairs. To address the class imbalance problem and increase the learning of the uncertainty area, a nearly balanced dataset is built.

The third step is to obtain the final classifier. Random trees are generated from the original case-pair dataset and the nearly balanced dataset respectively, and these trees are used to identify serial case pairs.

4.1. Similarity measures

Crime attributes include three types of content: numeric attributes, categorical attributes, and keyword attributes. For numeric attributes, we use the absolute distance to measure the similarity between two numeric values. For categorical attributes, Jaccard's coefficient is applied, and we give different importance to each value on an attribute. Different importance scores are calculated according to the features' frequencies; the lower the occurrence probability of a feature, the higher its importance. For keyword attributes, word2vec is used. The detailed calculation processes of the three similarity measure methods have been discussed in our previous work [52].
4.2. Proposed approach

In this paper, we propose the novel random forest IGRF. To learn the features of indistinguishable crime pairs, we combine the serial crime pairs in the information granule and their $\kappa$ nearest neighborhoods to form a nearly balanced dataset. The nearly balanced dataset is constructed based on an information granule. IGRF does not directly change the structure of the original data by SMOTE or other oversampling methods but generates some random trees from the nearly balanced dataset, and they are fed into a standard random forest to solve the class imbalance in crime linkage.

4.2.1. Constructing the information granule

An information granule may be a class, subset, object, or cluster of a universe generated by distinguishability, similarity, and functionality [53]. It is the key issue of granular computing (GrC). GrC is effective for solving the class imbalance problem; it has been combined with the support vector machine [54], autoencoders [55], etc.

We construct an information granule to identify indistinguishable crime pairs. Constructing an information granule needs to consider two factors: the prototype and the granule size [56]. The prototype is used to identify the center location of an information granule, and the size determines the description range of an information granule [57]. We construct an information granule based on a granulating algorithm.

4.2.1.1. Prototype selection. We prefer that our information granule contain more indistinguishable samples. The degree of indistinguishability is measured by the samples' memberships: the smaller the difference between the two membership degrees, the higher the uncertainty degree of the sample. The memberships of the prototype should have the smallest difference. Firstly, we calculate the class centers as follows:

$$\phi_c = \sum_{l=1}^{N_c} x_l / N_c \tag{2}$$

where $N_c$ indicates the number of samples in the $c$th class and $\phi_c$ indicates the center of the $c$th class. Each sample's membership to the different classes is calculated from the distances between the sample and the class centers. The distance between a sample and a class center is the Euclidean distance, as shown in Eq. (3). If a sample is close to a class center, then the sample's membership to this class is large, so the membership is measured by Eq. (4).

$$dis(x_l, \phi_c) = \| x_l - \phi_c \|_2 \tag{3}$$

$$\mu_l^c = dis(x_l, \phi_{1-c}) \Big/ \sum_{c=0}^{1} dis(x_l, \phi_c) \tag{4}$$

where $\mu_l^c$ indicates the membership of the $l$th sample $x_l$ to the class $c$. To determine the center location of the information granule, we obtain its prototype by Eq. (5):

$$\delta = (\phi_0 + \phi_1) / 2 \tag{5}$$

It means that the memberships of $\delta$ to the two classes are both 0.5 and have the smallest difference.
4.2.1.2. Granule size. The granule size is determined by the coverage and the specificity. The coverage function expresses an information granule's ability to cover samples. The specificity function ensures that the samples falling into the information granule are more difficult to distinguish.

• Coverage function. After the prototype of the information granule is determined, it is necessary to find the indistinguishable samples centering on the prototype. Whether a sample falls into the information granule is determined by its distance from the prototype and its membership. The coverage function is defined as follows:

$$coverage(\rho) = \frac{1}{N} \sum_{x_l:\, \| x_l - \delta \|_2 / \max(\| x_l - \delta \|_2) \le \rho} H(\mu_l^0, \mu_l^1) \tag{6}$$

where $N$ is the number of samples and $\rho$ is the granule size parameter, whose range is 0 to 1; it can be regarded as the radius of a hypersphere. $\| x_l - \delta \|_2 / \max(\| x_l - \delta \|_2)$ is used to project the distance between case pairs and the prototype onto $[0, 1]$. A sample falls into the information granule if the distance between the sample and the prototype is not bigger than $\rho$. $H(\cdot)$ is an uncertainty function related to the samples' memberships, and we use entropy to measure the uncertainty of samples [57], as shown below:

$$H(\mu_l^0, \mu_l^1) = -\mu_l^0 \cdot \log(\mu_l^0) - \mu_l^1 \cdot \log(\mu_l^1) \tag{7}$$

The closer the memberships of the sample $x_l$ are to 0.5, the larger its entropy value is, and thus the larger the value of the coverage function.

• Specificity function. To make the information granule contain uncertain samples, it is also necessary to define the specificity function to control the indistinguishability of the samples contained in the information granule. The specificity function can be expressed as:

$$specificity(\rho) = (1 - \rho)^{\beta} \tag{8}$$

where $\beta$ is an adjustable parameter to control the granule size; a big $\beta$ results in a small specificity. $\beta = 0$ indicates no limit on the radius $\rho$ of the hypersphere, in which case the information granule would contain all samples.

• Optimizing the size parameter. The coverage function and the specificity function are used together to select the optimal parameter $\rho$, which is defined as follows:

$$\underset{\rho}{\arg\max}:\; Q = coverage(\rho) \cdot specificity(\rho) \tag{9}$$

In Eq. (9), with the increase of the parameter $\rho$, the coverage becomes bigger and bigger while the specificity becomes smaller and smaller. The optimal $\rho$ can then be obtained when $Q$ reaches its maximum. After the prototype $\delta$ and the size parameter $\rho$ are calculated, the construction of the information granule is completed. The detailed process of constructing the information granule is shown in Algorithm 1.
4.2.2. Modified random forest

The IGRF proposed in this paper is to solve the problem of class imbalance between serial crime pairs and nonserial crime pairs in crime linkage. Instead of resampling to change the structure of the original data, we use the information granule to find the uncertainty area which contains indistinguishable crime pairs and then construct a nearly balanced dataset from which random trees are generated and combined with the standard random forest.

The detailed process of our proposed IGRF is shown in Algorithm 2. Firstly, a crime set $T$ is given, where the serial crime pair set is $T_1$ and the nonserial crime pair set is $T_0$. Secondly, an information granule is constructed with the given parameter $\beta$ to determine the uncertainty area. In the uncertainty area, the serial crime pair set is $T_1'$ and the nonserial crime pair set is $T_0'$. A nearly balanced crime set $T_g$ is defined in this uncertainty area. This is done by finding the $\kappa$-nearest neighbors in $T_0'$ of each serial crime pair in $T_1'$, which are denoted as $T_{kn}$, where $T_{kn} \subset T_0'$. Duplicated samples are removed from $T_{kn}$, and the union of $T_1'$ and $T_{kn}$ constitutes $T_g$. Finally, random trees are generated from the original dataset $T$ and the nearly balanced dataset $T_g$, and they combine to form the final classifier. There are two parameters: $F$ is the number of

trees, and $\gamma$ is the ratio of trees generated from $T$, so $1 - \gamma$ is the ratio of trees generated from $T_g$.
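The Algorithm 2 box is likewise not reproduced above, so the following sketch is one reading of the prose, with scikit-learn's DecisionTreeClassifier as the CART base learner: $T_g$ is built from the granule, $\gamma F$ trees are bootstrapped from $T$ and $(1-\gamma)F$ trees from $T_g$, and prediction is a majority vote. The bootstrap scheme and the `nearly_balanced_set` helper are assumptions consistent with, but not identical to, the original algorithms.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.tree import DecisionTreeClassifier

def nearly_balanced_set(X, y, mask, kappa=2):
    """T_g: serial pairs inside the granule plus the kappa nearest nonserial
    pairs (within the granule) of each, with duplicates removed."""
    X1 = X[mask & (y == 1)]                        # T1'
    X0 = X[mask & (y == 0)]                        # T0'
    nn = NearestNeighbors(n_neighbors=min(kappa, len(X0))).fit(X0)
    idx = np.unique(nn.kneighbors(X1, return_distance=False))  # T_kn indices
    return np.vstack([X1, X0[idx]]), np.concatenate(
        [np.ones(len(X1)), np.zeros(len(idx))])

def fit_igrf(X, y, Xg, yg, F=200, gamma=0.6, seed=0):
    """gamma * F bootstrapped trees from T, the remaining trees from T_g."""
    rng = np.random.default_rng(seed)
    trees = []
    for i in range(F):
        Xs, ys = (X, y) if i < int(gamma * F) else (Xg, yg)
        boot = rng.integers(0, len(Xs), len(Xs))   # bootstrap resampling
        trees.append(DecisionTreeClassifier(max_features="sqrt", random_state=i)
                     .fit(Xs[boot], ys[boot]))
    return trees

def predict_igrf(trees, X):
    """Majority vote over all F trees."""
    return (np.mean([t.predict(X) for t in trees], axis=0) >= 0.5).astype(int)
```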
5. Experimental evaluation

5.1. Crime data description

The crime dataset was derived from the judicial documents published on the OpenLaw website. These crimes were real solved robbery crimes in Zhengzhou City, Henan Province, China. The incidents occurred from January 2013 to October 2018.

The dataset includes a total of 364 cases, which were committed by 292 criminals; 253 cases were committed by a single criminal and are not serial crimes. The remaining 111 cases were serial crimes, which were committed by 39 criminals, each of whom committed a maximum of 9 crimes and a minimum of 2 crimes. Among the serial criminals, 24 committed 2 cases, 10 committed 3 cases, 2 committed 4 cases, 2 committed 8 cases, and 1 committed 9 cases. Therefore, the entire dataset can construct 66,066 case pairs, of which 158 pairwise cases were committed by the same offender and 65,908 pairwise cases were committed by different offenders. It can be seen from the numbers of serial and nonserial pairs that crime linkage is an imbalanced classification problem.

According to previous studies on robbery crime linkage, we use 11 attributes to characterize the M.O. of robbery criminals. The attribute values of each crime are manually obtained. The detailed information of the attributes and their similarity measures is listed in Table 2. We apply the CBOW model to train word embeddings on the Wikipedia Chinese corpus dumps; the vector dimensionality of the word embeddings is 400. The similarity between the crimes in a pair on each attribute is calculated using the similarity functions proposed in Section 4. We generate two case sets by year, 2013–2017 and 2013–2018. The case set of 2013–2018 contains all the cases, while the case set of 2013–2017 only contains cases from 2013 to 2017; the information on the case pairs and the imbalance ratio (IR) of the two case sets is given in Table 3.
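As an illustration of the embedding step, a gensim-style sketch of CBOW training is shown below; the toy corpus stands in for the Wikipedia Chinese dump, `sg=0` selects CBOW, and the gensim 4.x parameter names are an assumption about tooling that the paper does not specify.

```python
from gensim.models import Word2Vec

# Toy tokenized corpus standing in for the Wikipedia Chinese dump.
sentences = [["knife", "threaten", "rob"], ["rope", "bind", "rob"]]

# sg=0 selects CBOW; vector_size=400 matches the dimensionality in the text.
model = Word2Vec(sentences, vector_size=400, sg=0, window=5, min_count=1)

# Keyword-attribute similarity can then be read off as cosine similarity.
print(model.wv.similarity("knife", "rope"))
```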
5.2. Evaluation criteria

After calculating the similarity of the crime pair attributes, we consider the confusion matrix and some other evaluation indexes, namely Accuracy, True Positive (minority) Rate (TPR), False Positive (minority) Rate (FPR), F-measure (FM) and G-mean (GM), to measure the performance of linking crimes. The F-measure provides more insight into the functionality of a classifier than the accuracy metric, and the G-mean evaluates the degree of inductive bias in terms of a ratio of positive accuracy and negative accuracy [42], so we mainly compare the performance of methods by FM and GM. The confusion matrix is shown in Table 4. True Positive (TP) refers to the crime pairs that were correctly labeled as serial crimes. True Negative (TN) indicates the crime pairs that were correctly labeled as nonserial crimes. False Negative (FN) represents the crime pairs that were not correctly labeled as serial crimes. False Positive (FP) indicates the crime pairs that were not correctly labeled as nonserial crimes.

According to the confusion matrix provided in Table 4, the indexes can be calculated as follows. A large TPR means that the algorithm finds more serial case pairs. A small FPR means that the serial case pairs found have a low error rate. The higher the values of Accuracy, FM, and GM, the better the algorithm performs. In this paper, we use F1 ($b = 1$) to measure classification performance.

$$Accuracy = (TP + TN) / (P + N) \tag{10}$$

$$TPR = \frac{TP}{TP + FN} \tag{11}$$

$$FPR = \frac{FP}{TN + FP} \tag{12}$$

$$FM = \frac{(1 + b^2) \times \frac{TP}{TP + FP} \times \frac{TP}{TP + FN}}{b^2 \times \frac{TP}{TP + FP} + \frac{TP}{TP + FN}} \tag{13}$$

$$GM = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}} \tag{14}$$
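Eqs. (10)–(14) translate directly into code; the sketch below computes all five indexes from raw confusion-matrix counts with $b = 1$ (the counts themselves are invented for illustration).

```python
import math

def linkage_metrics(tp: int, fn: int, fp: int, tn: int, b: float = 1.0):
    accuracy = (tp + tn) / (tp + fn + fp + tn)            # Eq. (10)
    tpr = tp / (tp + fn)                                  # Eq. (11)
    fpr = fp / (tn + fp)                                  # Eq. (12)
    precision = tp / (tp + fp)
    fm = (1 + b**2) * precision * tpr / (b**2 * precision + tpr)  # Eq. (13)
    gm = math.sqrt(tpr * (tn / (tn + fp)))                # Eq. (14)
    return accuracy, tpr, fpr, fm, gm

# Invented counts at the scale of the 2013-2018 case set (158 serial pairs).
print(linkage_metrics(tp=112, fn=46, fp=13, tn=65895))
```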
5.3. Experiment results and discussion

In this section, the purpose of our experiments is mainly threefold. Firstly, we analyze the construction of the information granule and the performance of IGRF with different parameters. Secondly, we evaluate the classification performance of IGRF on different crime sets and combine it with other methods (SMOTE and ADASYN) to address the imbalanced problem. Thirdly, we employ IGRF on some two-class imbalanced datasets from the KEEL dataset repository at http://www.keel.es/datasets.php to prove the universality of IGRF in imbalanced binary classification.

In our experiments, CART is selected to construct the random decision trees. We repeat 10-fold cross-validation 20 times and report the average as the classification performance. The literature using 29 datasets for experiments showed that there is no significant difference between forests using 128 trees and forests using more trees [58], and a later study further confirmed this conclusion [59]. Therefore, we set the number of random decision trees $F$ to 200. What is more, on the original training set, the parameters of the random trees in IGRF are the same as those of the standard random forest (RF).

Fig. 2. The PCA result of information granule construction. ‘‘+’’ represents nonserial case pairs. ‘‘×’’ represents serial case pairs. ‘‘⋆’’ represents serial case pairs in the information granule. (a) is the PCA result of the original case set. (b) is the PCA result of the information granule when β = 2. (c) is the PCA result of the information granule when β = 3. (d) is the PCA result of the information granule when β = 4.

Fig. 3. Values of coverage, specificity, Q with the change of ρ (β = 3).

Table 2
Attributes information and similarity measures.
Attribute Description Measure
C_Num Number of criminals Absolute
C_Tools Tools used by the criminal, e.g., knife, handcuffs, etc. Jaccard
W_Disguise The way criminals disguise Jaccard
W_Harm The way criminals harm victims Jaccard
W_Property The way criminals rob property Jaccard
W_Threat The way criminals threaten victims Jaccard
P_Harm Part of the victim being harmed Jaccard
R_Item Robbed Item Jaccard
V_Control Actions to control the victim word2vec
W_Breaking The way criminals break through obstacles word2vec
C_Action Actions taken by the criminals word2vec

5.3.1. Information granule and parameter settings

We construct the information granule on the dataset containing cases from 2013 to 2018, and Fig. 2 shows the construction results after PCA. In Fig. 2(a), we can see that there are many serial case pairs that overlap with nonserial case pairs. In Fig. 2(b, c, d), most of the samples falling into the information granule are at the junction of the serial case pairs and nonserial case pairs (as shown in the oval circle), proving that the information granule can

Fig. 4. Performance of IGRF with different γ (κ = 2, the horizontal dotted line is the performance of standard random forest).

Fig. 5. Performance of the proposed approach with different κ (γ = 0.6, horizontal dotted line is the performance of standard random forest).

Fig. 6. The learning curve of 2 case sets on standard random forest. (a) is the learning curve of the 2013–2017 dataset. (b) is the learning curve of the 2013–2018
dataset.

Fig. 7. Comparison of 3 pairs of algorithms on FM and GM in the 2013–2017 crimes.

find the boundary area. This means that the constructed information granule can detect the indistinguishable case pairs, which is our concern. What is more, as the value of the parameter β increases, the number of serial case pairs contained in the information granule

Fig. 8. Comparison of 3 pairs of algorithms on FM and GM in the 2013–2018 crimes.

Table 3
Information of the two crime sets, where Nc means the number of crimes, Nserial and Nnonserial mean the number of serial crime pairs and nonserial crime pairs correspondingly, and IR represents the imbalance ratio.

| Nc | Nserial | Nnonserial | IR (Nnonserial/Nserial) | Total
2013–2017 | 334 | 116 | 55495 | 478.41 | 55611
2013–2018 | 364 | 158 | 65908 | 417.14 | 66066

Table 4
Confusion matrix.

| | Predicted: serial crimes | Predicted: nonserial crimes | Total
Actual: serial crimes | TP | FN | P
Actual: nonserial crimes | FP | TN | N
Total | P′ | N′ | P + N

increases, but information granules containing too few samples (refer to Fig. 2(c)) or too many (refer to Fig. 2(d)) are not what we want. Therefore, all experiments select β = 3 (refer to Fig. 2(b)) to construct the information granule. Fig. 3 illustrates the values of coverage, specificity and Q with the change of the parameter ρ. The change trends of coverage and specificity are reversed. As the parameter ρ increases, the value of Q increases at the beginning and then decreases after reaching the highest point, indicating that the specificity guarantees the indistinguishability of the samples falling in the information granule.

To analyze the influence of each parameter on the performance of IGRF, we repeat 10-fold cross-validation 20 times on the dataset containing all cases from 2013 to 2018. The experimental results are averaged, and the standard random forest algorithm is used as a reference. Fig. 4 shows the performance of IGRF with the change of the parameter γ, where κ = 2, which means that the serial case pairs in the information granule and their nearest 2 neighbors constitute the nearly balanced dataset. As shown in Fig. 4(a), the accuracy of IGRF at the beginning increases as γ increases, gradually becoming higher than the standard random forest, and then decreases after reaching its peak and approaches the standard random forest. The change of FM and GM shows the same trend as accuracy (refer to Fig. 4(b)). This shows that the performance of classifiers trained only on the samples contained in the information granule is not outstanding, but when they work together with classifiers trained from the original samples, the effect gradually appears. As the proportion of classifiers trained on the original samples approaches 1, the performance of IGRF gradually approaches the standard random forest. It can be seen that classifiers trained in the uncertain region identified by the information granule can assist the standard random forest in decision making.

Fig. 5 shows the performance of IGRF with the change of κ, where γ = 0.6, indicating that 60% of the trees in the IGRF are generated from the original training set and the rest are generated from the nearly balanced dataset. In Fig. 5(a), the accuracy of IGRF is higher than that of the standard random forest at the beginning. As the parameter κ increases, IGRF's accuracy decreases. The performance on FM and GM decreases with the increase of κ and gradually becomes lower than the standard random forest (refer to Fig. 5(b)), which means that the classifiers generated from the nearly balanced dataset have a positive effect on improving the classification performance, but as κ increases, the imbalance ratio of the nearly balanced dataset becomes bigger and bigger, resulting in poorer classification performance.

5.3.2. Performance on crime datasets

In this section, we apply IGRF to the two case sets mentioned in Table 3 in order to show its role in resolving the class imbalance problem in crime linkage. Firstly, we draw the learning curves of the standard random forest on the two case sets. The results are shown in Fig. 6. It can be seen that the difference between the training error and the cross-validation error is small in these learning curves, and they gradually approach each other, which means that these case sets can avoid overfitting.

Table 5
The comparison of the different algorithms in two crime sets (the best result on each dataset is emphasized in bold)
RF IGRF
TPR FPR FM GM TPR FPR FM GM
2013–2017 0.6486 8.9e−05 0.7566 0.7991 0.6677 0.0001 0.7613 0.8119
2013–2018 0.6530 0.0002 0.7551 0.8041 0.7096 0.0002 0.7794 0.8389
RF+SMOTE IGRF+SMOTE
TPR FPR FM GM TPR FPR FM GM
2013–2017 0.6644 0.0002 0.7496 0.8090 0.6835 0.0002 0.7621 0.8218
2013–2018 0.6915 0.0003 0.7631 0.8281 0.7268 0.0004 0.7695 0.8491
RF+ADASYN IGRF+ADASYN
TPR FPR FM GM TPR FPR FM GM
2013–2017 0.6695 0.0001 0.7612 0.8121 0.6883 0.0002 0.7701 0.8248
2013–2018 0.6882 0.0002 0.7694 0.8259 0.7226 0.0003 0.7712 0.8467

We not only compare IGRF with the standard random forest but also combine it with other oversampling methods that address the imbalanced problem, including SMOTE and ADASYN. Taking IGRF+SMOTE as an example, we first construct the information granule on the original training set and generate a nearly balanced dataset; some random trees in the final classifier then come from the training set after oversampling, and the remaining random trees come from the nearly balanced dataset. Three groups of such comparisons are carried out: (RF, IGRF), (RF+SMOTE, IGRF+SMOTE), and (RF+ADASYN, IGRF+ADASYN). Several configurations of SMOTE and ADASYN are tested, and the best settings are used for RF.

The classification performance of the three groups of methods on the two case sets is shown in Table 5. Regarding the TPR, the improvements of IGRF, IGRF+SMOTE, and IGRF+ADASYN are 2%–5%, which means that these methods can identify more serial case pairs than RF. The FPR results of IGRF are generally low, although they are not better than those of the traditional methods. While keeping the error rate low, IGRF finds more serial case pairs to provide references for police detection. It can also be seen that IGRF performs better than RF on both FM and GM: compared with RF, the improvement of IGRF's FM on the case sets is approximately 1%–2%, and that of GM is approximately 2%–3%. On the other hand, the performance of IGRF used alone is better than that of RF+SMOTE and RF+ADASYN. Compared with using IGRF alone, IGRF+SMOTE and IGRF+ADASYN do not improve much on FM but can achieve greater improvement on GM. The results of 20 times 10-fold cross-validation (20 × 10) on the two case sets are shown in Figs. 7–8. The boxplots of IGRF, IGRF+SMOTE, and IGRF+ADASYN are concentrated, and their medians and lower quartiles are higher, indicating that the performance of IGRF is more stable and better. Overall, IGRF can achieve better performance when used alone, and it can also be combined with other methods solving class imbalance problems.

5.3.3. Performance on public two-class imbalanced datasets

To further discuss the classification with consideration of the uncertain area, 24 publicly available datasets from the KEEL dataset website are taken for experiments. In the experiments, we assume that the minority is the positive class and the majority is the negative class. We still set IGRF's parameters F = 200, β = 3, κ = 2, and γ = 0.6 as before. The information on these datasets is presented in Table 6.

Table 6
Information of 24 imbalanced two-class datasets from KEEL, where Np and Nn mean the number of minority samples and majority samples correspondingly, and IR represents the imbalance ratio.

Id | Dataset | Np | Nn | Features | IR (Nn/Np)
bupa | bupa | 145 | 200 | 6 | 1.38
con0 | contraceptive0 | 629 | 844 | 9 | 1.34
con1 | contraceptive1 | 333 | 1140 | 9 | 3.42
con2 | contraceptive2 | 511 | 962 | 9 | 1.88
eco0 | ecoli0 | 143 | 193 | 7 | 1.35
eco1 | ecoli1 | 77 | 259 | 7 | 3.36
eco7 | ecoli7 | 51 | 285 | 7 | 5.59
gla0 | glass0 | 70 | 144 | 9 | 2.06
gla1 | glass1 | 76 | 138 | 9 | 1.82
habe | haberman | 81 | 225 | 3 | 2.78
led7 | led7digit3 | 57 | 433 | 7 | 7.60
pag1 | page-blocks1 | 329 | 5143 | 10 | 15.63
pag3 | page-blocks3 | 87 | 5385 | 10 | 61.90
pag4 | page-blocks4 | 115 | 5357 | 10 | 46.58
pima | pima | 268 | 500 | 8 | 1.87
veh2 | vehicle2 | 218 | 628 | 18 | 2.88
vow0 | vowel0 | 90 | 900 | 13 | 10.00
wdbc | wdbc | 212 | 357 | 30 | 1.68
wir7 | winequality-red7 | 199 | 1400 | 11 | 7.04
wiw4 | winequality-white4 | 163 | 4735 | 11 | 29.05
wiw8 | winequality-white8 | 175 | 4723 | 11 | 26.99
yea0 | yeast0 | 244 | 1240 | 8 | 5.08
yea4 | yeast4 | 51 | 1433 | 8 | 28.10
yea5 | yeast5 | 163 | 1321 | 8 | 8.10

The experimental results on the public datasets are reported in terms of TPR, FPR, FM, and GM in Tables 7–9. The results show that IGRF can improve the TPR while ensuring a lower FPR, with similar behavior when it is used alone or with SMOTE and ADASYN. In addition, GM and FM perform better than with traditional random forests. Compared with RF, IGRF improves FM and GM by about 2%, and it can increase FM by up to 6% and GM by up to 9% at most. When IGRF is used in conjunction with other imbalance methods, the lifting effect is not as obvious as when it is used alone, but it still performs better than RF+SMOTE and RF+ADASYN. Therefore, IGRF is effective for imbalanced two-class classification problems.

Table 7
The comparison of RF and IGRF in imbalanced two-class datasets (the best result on each dataset is emphasized in bold)
RF IGRF
TPR FPR FM GM TPR FPR FM GM
bupa 0.5912 0.1568 0.6444 0.7002 0.6243 0.1923 0.6514 0.7051
con0 0.2822 0.0946 0.3472 0.5010 0.2976 0.1004 0.3585 0.5130
con1 0.3797 0.2251 0.4177 0.5399 0.4287 0.2506 0.4481 0.5645
con2 0.5689 0.2345 0.6017 0.6585 0.5975 0.2595 0.6119 0.6638
eco0 0.9364 0.0460 0.9363 0.9443 0.9461 0.0507 0.9385 0.9469
eco1 0.6713 0.0106 0.7564 0.8021 0.7680 0.0277 0.7867 0.8574
eco7 0.7473 0.0638 0.7467 0.8310 0.7958 0.0746 0.7647 0.8534
gla0 0.7822 0.0861 0.7837 0.8371 0.8415 0.0996 0.8095 0.8639
gla1 0.6783 0.0751 0.7353 0.7855 0.7286 0.0939 0.7549 0.8063
habe 0.2845 0.1594 0.3114 0.4494 0.3309 0.1743 0.3475 0.4937
led7 0.6092 0.0212 0.6641 0.7552 0.6584 0.0371 0.6529 0.7849
pag1 0.6472 0.0031 0.7120 0.7984 0.6933 0.0049 0.7114 0.8267
pag3 0.8905 0.0043 0.9082 0.9412 0.8929 0.0051 0.9038 0.9421
pag4 0.8678 0.0028 0.8411 0.9280 0.8895 0.0030 0.8477 0.9398
pima 0.6010 0.1442 0.6364 0.7139 0.6305 0.1685 0.6432 0.7213
veh2 0.9687 0.0056 0.9755 0.9813 0.9769 0.0078 0.9766 0.9844
vow0 0.9515 0.0012 0.9665 0.9740 0.9407 0.0012 0.9598 0.9681
wdbc 0.9404 0.0223 0.9499 0.9585 0.9485 0.0241 0.9524 0.9617
wir7 0.4708 0.0214 0.5737 0.6738 0.5249 0.0291 0.6009 0.7098
wiw4 0.2053 0.0020 0.3133 0.4332 0.2632 0.0052 0.3609 0.4978
wiw8 0.4143 0.0005 0.5692 0.6358 0.4188 0.0013 0.5655 0.6394
yea0 0.7021 0.0215 0.7417 0.8255 0.7466 0.0264 0.7555 0.8500
yea4 0.4525 0.0287 0.5598 0.6585 0.4820 0.0364 0.5727 0.6776
yea5 0.1531 0.0029 0.2079 0.2749 0.2155 0.0061 0.2722 0.3691
Average 0.6165 0.0597 0.6625 0.7334 0.6517 0.0700 0.6770 0.7559

Table 8
The comparison of RF+SMOTE and IGRF+SMOTE in imbalanced two-class datasets (the best result on each dataset is
emphasized in bold)
RF+SMOTE IGRF+SMOTE
TPR FPR FM GM TPR FPR FM GM
bupa 0.6315 0.2041 0.6525 0.7042 0.6549 0.2254 0.6573 0.7075
con0 0.3747 0.1355 0.4038 0.5655 0.3673 0.1300 0.4016 0.5618
con1 0.4422 0.2714 0.4499 0.5655 0.4707 0.2844 0.4665 0.5784
con2 0.5947 0.2539 0.6125 0.6647 0.6140 0.2749 0.6170 0.6658
eco0 0.9403 0.0482 0.9371 0.9452 0.9475 0.0530 0.9378 0.9464
eco1 0.8236 0.0271 0.8219 0.8873 0.8450 0.0408 0.8060 0.8953
eco7 0.8232 0.0887 0.7625 0.8618 0.8423 0.0960 0.7671 0.8687
gla0 0.8443 0.1128 0.8014 0.8610 0.8658 0.1163 0.8122 0.8708
gla1 0.7406 0.1063 0.7552 0.8072 0.7596 0.1110 0.7639 0.8163
habe 0.4124 0.2410 0.3800 0.5363 0.4261 0.2334 0.3962 0.5508
led7 0.7864 0.1033 0.5885 0.8297 0.8093 0.1143 0.5807 0.8412
pag1 0.7565 0.0125 0.6389 0.8610 0.7791 0.0126 0.6492 0.8742
pag3 0.9112 0.0076 0.8963 0.9506 0.9167 0.0076 0.8995 0.9534
pag4 0.9112 0.0041 0.8318 0.9511 0.9157 0.0041 0.8348 0.9532
pima 0.6934 0.2088 0.6611 0.7384 0.6966 0.2175 0.6582 0.7361
veh2 0.9803 0.0081 0.9780 0.9860 0.9812 0.0097 0.9761 0.9856
vow0 0.9796 0.0026 0.9751 0.9881 0.9781 0.0023 0.9758 0.9874
wdbc 0.9539 0.0277 0.9525 0.9627 0.9598 0.0268 0.9560 0.9661
wir7 0.6689 0.0689 0.6164 0.7864 0.6648 0.0668 0.6186 0.7851
wiw4 0.4062 0.0206 0.3958 0.6219 0.4540 0.0198 0.4371 0.6592
wiw8 0.5226 0.0213 0.4910 0.7095 0.5020 0.0156 0.5150 0.6974
yea0 0.8305 0.0349 0.7809 0.8936 0.8380 0.0354 0.7839 0.8974
yea4 0.5713 0.0658 0.5943 0.7274 0.5841 0.0679 0.6002 0.7350
yea5 0.4498 0.0289 0.3770 0.6281 0.4511 0.0269 0.3874 0.6270
Average 0.7104 0.0877 0.6814 0.7930 0.7218 0.0913 0.6874 0.7983

Table 9
The comparison of RF+ADASYN and IGRF+ADASYN in imbalanced two-class datasets (the best result on each dataset is
emphasized in bold)
RF+ADASYN IGRF+ADASYN
TPR FPR FM GM TPR FPR FM GM
bupa 0.6413 0.2065 0.6586 0.7085 0.6581 0.2243 0.6611 0.7092
con0 0.3733 0.1382 0.4009 0.5640 0.3680 0.1325 0.3998 0.5612
con1 0.4450 0.2788 0.4490 0.5644 0.4717 0.2863 0.4665 0.5782
con2 0.5859 0.2516 0.6073 0.6607 0.6128 0.2730 0.6172 0.6660
eco0 0.9420 0.0593 0.9308 0.9404 0.9486 0.0598 0.9338 0.9434
eco1 0.8264 0.0306 0.8178 0.8891 0.8461 0.0441 0.7992 0.8945
eco7 0.8377 0.1016 0.7567 0.8635 0.8523 0.1069 0.7593 0.8689
gla0 0.8463 0.1418 0.7790 0.8472 0.8784 0.1380 0.8024 0.8666
gla1 0.7772 0.1369 0.7574 0.8131 0.7804 0.1260 0.7672 0.8208
habe 0.4277 0.2442 0.3917 0.5466 0.4218 0.2373 0.3912 0.5447
led7 0.8001 0.1093 0.5872 0.8366 0.8181 0.1180 0.5795 0.8445
pag1 0.7568 0.0135 0.6282 0.8607 0.7696 0.0132 0.6376 0.8683
pag3 0.9121 0.0090 0.8871 0.9504 0.9161 0.0087 0.8911 0.9526
pag4 0.9138 0.0044 0.8243 0.9523 0.9098 0.0041 0.8304 0.9503
pima 0.7092 0.2262 0.6612 0.7388 0.7089 0.2303 0.6587 0.7365
veh2 0.9778 0.0109 0.9725 0.9833 0.9798 0.0121 0.9719 0.9837
vow0 0.9818 0.0024 0.9770 0.9893 0.9631 0.0020 0.9685 0.9796
wdbc 0.9669 0.0327 0.9543 0.9668 0.9668 0.0306 0.9564 0.9678
wir7 0.6909 0.0740 0.6203 0.7975 0.6843 0.0716 0.6210 0.7947
wiw4 0.4030 0.0211 0.3905 0.6195 0.4526 0.0198 0.4364 0.6582
wiw8 0.5220 0.0223 0.4843 0.7087 0.5057 0.0165 0.5110 0.6995
yea0 0.8644 0.0412 0.7824 0.9091 0.8596 0.0393 0.7850 0.9073
yea4 0.6007 0.0833 0.5888 0.7392 0.6087 0.0809 0.5978 0.7452
yea5 0.4516 0.0307 0.3716 0.6333 0.4443 0.0279 0.3789 0.6224
Average 0.7189 0.0946 0.6783 0.7951 0.7261 0.0960 0.6842 0.7985

Table 10
The Wilcoxon signed-rank test results based on the FM and GM, where R+ is the sum of ranks for the datasets in which
the first method outperforms the second and R− is the sum of ranks for the opposite.
Metric | Comparison | R+ | R− | p-value | Hypothesis (0.05)
FM | IGRF vs. RF | 268 | 32 | 7.48e−04 | Rejected
FM | IGRF+SMOTE vs. RF+SMOTE | 251 | 49 | 3.91e−03 | Rejected
FM | IGRF+ADASYN vs. RF+ADASYN | 236 | 64 | 0.0140 | Rejected
GM | IGRF vs. RF | 292 | 8 | 4.97e−05 | Rejected
GM | IGRF+SMOTE vs. RF+SMOTE | 248 | 52 | 0.0051 | Rejected
GM | IGRF+ADASYN vs. RF+ADASYN | 198 | 102 | 0.1702 | Not rejected
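The test behind Table 10 can be reproduced with SciPy's implementation; in the sketch below the paired per-dataset FM values are invented placeholders, and scipy.stats.wilcoxon reports a rank-sum statistic and p-value rather than the separate R+/R− sums tabulated above.

```python
from scipy.stats import wilcoxon

# Paired per-dataset scores (invented placeholders), e.g., FM of IGRF vs. RF
# on the same KEEL datasets.
fm_igrf = [0.65, 0.36, 0.45, 0.61, 0.94, 0.79, 0.76, 0.81]
fm_rf   = [0.64, 0.35, 0.42, 0.60, 0.93, 0.76, 0.75, 0.78]

stat, p_value = wilcoxon(fm_igrf, fm_rf)
print(p_value < 0.05)  # True would reject the null hypothesis at the 0.05 level
```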

To gain additional insight into the performance of the examined methods, we use the Wilcoxon signed-rank test [60] to analyze the FM and GM of the experimental results. The confidence level in this test is set to 0.05; the lower the p-value, the more statistically significant the difference. If the p-value is lower than 0.05, then there is a substantial difference between the two methods. The results of the test are shown in Table 10. The obtained p-values show that, except for IGRF+ADASYN, which is not significantly better than RF+ADASYN on the GM indicator, all other comparisons are significantly better than the traditional methods.

6. Conclusion

The problem of class imbalance in crime linkage will always exist. In this paper, we consider crime linkage as a binary classification task. Instead of resampling the dataset, we concentrate on the case pairs that are difficult to distinguish at the class boundary. We use the information granule to find the uncertainty area where the serial case pairs are difficult to distinguish, generate a nearly balanced dataset from this area, and increase the learning of the uncertainty area to improve the classification effect in the case of class imbalance. The random forest has great classification performance, and as an ensemble learning method it consists of multiple classifiers. We use this characteristic: part of the random trees are learned from the original dataset, and the remaining random trees are generated from the above-mentioned nearly balanced dataset. Our proposed method is employed on two real-world crime sets and some public two-class imbalanced datasets. The experimental results show that it has improved classification performance compared to the standard random forest, and it can also be used in conjunction with other resampling methods, i.e., SMOTE and ADASYN.

In our future research, we will investigate the characteristics of cases falling into information granules, such as the separability [61] of the samples falling into the information granule, and also seek other optimization methods to obtain optimal parameter settings of the information granule.

CRediT authorship contribution statement

Yu-Sheng Li: Methodology, Software, Writing - original draft, Writing - review & editing. Hong Chi: Formal analysis, Supervision, Project administration. Xue-Yan Shao: Conceptualization, Investigation, Resources. Ming-Liang Qi: Formal analysis, Validation, Data curation. Bao-Guang Xu: Visualization, Validation.

References

[1] M. Tonkin, J. Woodhams, R. Bull, J.W. Bond, E.J. Palmer, Linking different types of crime using geographical and temporal proximity, Crim. Justice Behav. 38 (11) (2011) 1069–1088, http://dx.doi.org/10.1177/0093854811418599.
[2] A. Borg, M. Boldt, N. Lavesson, U. Melander, V. Boeva, Detecting serial residential burglaries using clustering, Expert Syst. Appl. 41 (11) (2014) 5252–5266, http://dx.doi.org/10.1016/j.eswa.2014.02.035.
[3] J. Woodhams, K. Toye, An empirical test of the assumptions of case linkage and offender profiling with serial commercial robberies, Psychol. Publ. Policy Law 13 (1) (2007) 59–85, http://dx.doi.org/10.1037/1076-8971.13.1.59.
[4] A. Chohlas-Wood, E.S. Levine, A recommendation engine to aid in identifying crime patterns, INFORMS J. Appl. Anal. (2019) http://dx.doi.org/10.1287/inte.2019.0985.
[5] M.D. Porter, A statistical approach to crime linkage, Amer. Statist. 70 (2) (2016) 152–165, http://dx.doi.org/10.1080/00031305.2015.1123185.
[6] J. Woodhams, C.R. Hollin, R. Bull, The psychology of linking crimes: A review of the evidence, Legal Criminol. Psychol. 12 (2) (2007) 233–249, http://dx.doi.org/10.1348/135532506x118631.
[7] B.E. Turvey, J. Freeman, Chapter 14 - Case linkage: Offender modus operandi and signature, in: B.E. Turvey (Ed.), Criminal Profiling (Fourth Edition), Academic Press, San Diego, 2012, pp. 331–360.
[8] R.R. Hazelwood, J.I. Warren, Linkage analysis: modus operandi, ritual, and signature in serial sexual crime, Aggress. Violent Behav. 9 (3) (2004) 307–318, http://dx.doi.org/10.1016/j.avb.2004.02.002.
[9] J. Woodhams, M. Tonkin, A. Burrell, H. Imre, J.M. Winter, E.K.M. Lam, P. Santtila, Linking serial sexual offences: Moving towards an ecologically valid test of the principles of crime linkage, Legal Criminol. Psychol. 24 (1) (2019) 123–140, http://dx.doi.org/10.1111/lcrp.12144.
[10] A. Burrell, R. Bull, J. Bond, Linking personal robbery offences using offender behaviour, J. Invest. Psychol. Offender Profil. 9 (3) (2012) 201–222, http://dx.doi.org/10.1002/jip.1365.
[11] H. Chi, Z. Lin, H. Jin, B. Xu, M. Qi, A decision support system for detecting serial crimes, Knowl.-Based Syst. 123 (2017) 88–101, http://dx.doi.org/10.1016/j.knosys.2017.02.017.
[12] L. Markson, J. Woodhams, J.W. Bond, Linking serial residential burglary: comparing the utility of modus operandi behaviours, geographical proximity, and temporal proximity, J. Invest. Psychol. Offender Profil. (2010) http://dx.doi.org/10.1002/jip.120.
[13] T. Wang, C. Rudin, D. Wagner, R. Sevieri, Finding patterns with a rotten core: Data mining for crime series with cores, Big Data 3 (1) (2015) 3–21, http://dx.doi.org/10.1089/big.2014.0021.
[14] D. Gee, A. Belofastov, Profiling sexual fantasy, in: R.N. Kocsis (Ed.), Criminal Profiling: International Theory, Research, and Practice, Totowa, NJ, 2007, pp. 49–71.
[15] N. Japkowicz, S. Stephen, The class imbalance problem: A systematic study, Intell. Data Anal. 6 (5) (2002) 429–449.
[16] C. Su, S. Ju, Y. Liu, Z. Yu, Improving random forest and rotation forest for highly imbalanced datasets, Intell. Data Anal. 19 (6) (2015) 1409–1432, http://dx.doi.org/10.3233/ida-150789.
[17] S. Nami, M. Shajari, Cost-sensitive payment card fraud detection based on dynamic random forest and k-nearest neighbors, Expert Syst. Appl. 110 (2018) 381–392, http://dx.doi.org/10.1016/j.eswa.2018.06.011.
[18] S. Shilaskar, A. Ghatol, P. Chatur, Medical decision support system for extremely imbalanced datasets, Inform. Sci. 384 (2017) 205–219, http://dx.doi.org/10.1016/j.ins.2016.08.077.
[19] J. Woodhams, G. Labuschagne, A test of case linkage principles with solved and unsolved serial rapes, J. Police Crim. Psychol. 27 (1) (2011) 85–98, http://dx.doi.org/10.1007/s11896-011-9091-1.
[20] Z. Lin, The application of separability analysis in feature selection of the serial crime linkage problem, in: The 45th International Conference on Computers & Industrial Engineering, Metz, France, 2015.
[21] L. Breiman, Random forests, Mach. Learn. 45 (1) (2001) 5–32, http://dx.doi.org/10.1023/A:1010933404324.
[22] C. Zhang, C. Liu, X. Zhang, G. Almpanidis, An up-to-date comparison of state-of-the-art classification algorithms, Expert Syst. Appl. 82 (2017) 128–150, http://dx.doi.org/10.1016/j.eswa.2017.04.003.
[23] M. Zhu, J. Xia, X. Jin, M. Yan, G. Cai, J. Yan, G. Ning, Class weights random forest algorithm for processing class imbalanced medical data, IEEE Access 6 (2018) 4641–4652, http://dx.doi.org/10.1109/access.2018.2789428.
[24] D.E. Brown, S. Hagen, Data association methods with applications to law enforcement, Decis. Support Syst. 34 (3) (2003) 369–378.
[25] F. Pérez-Hernández, S. Tabik, A. Lamas, R. Olmos, H. Fujita, F. Herrera, Object detection binary classifiers methodology based on deep learning to identify small objects handled similarly: Application in video surveillance, Knowl.-Based Syst. (2020) 105590, http://dx.doi.org/10.1016/j.knosys.2020.105590.
[26] S. Boriah, V. Chandola, V. Kumar, Similarity measures for categorical data: A comparative evaluation, in: Proceedings of the 2008 SIAM International Conference on Data Mining, 2008, pp. 243–254.
[27] C. Bennell, N.J. Jones, T. Melnyk, Addressing problems with traditional crime linking methods using receiver operating characteristic analysis, Legal Criminol. Psychol. 14 (2) (2009) 293–310, http://dx.doi.org/10.1348/135532508x349336.
[28] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, 2013, arXiv e-prints, https://ui.adsabs.harvard.edu/#abs/2013arXiv1301.3781M.
[29] C. Bennell, D.V. Canter, Linking commercial burglaries by modus operandi: tests using regression and ROC analysis, Sci. Justice 42 (3) (2002) 153–164, http://dx.doi.org/10.1016/s1355-0306(02)71820-0.
[30] M. Tonkin, T. Grant, J.W. Bond, To link or not to link: a test of the case linkage principles using serial car theft data, J. Invest. Psychol. Offender Profil. 5 (1–2) (2008) 59–77, http://dx.doi.org/10.1002/jip.74.
[31] M. Tonkin, J. Woodhams, R. Bull, J.W. Bond, P. Santtila, A comparison of logistic regression and classification tree analysis for behavioural case linkage, J. Invest. Psychol. Offender Profil. 9 (3) (2012) 235–258, http://dx.doi.org/10.1002/jip.1367.
[32] C.-H. Ku, G. Leroy, A decision support system: Automated crime report analysis and classification for e-government, Gov. Inf. Q. 31 (4) (2014) 534–544, http://dx.doi.org/10.1016/j.giq.2014.08.003.
[33] A. Borg, M. Boldt, Clustering residential burglaries using modus operandi and spatiotemporal information, Int. J. Inf. Technol. Decis. Mak. 15 (01) (2016) 23–42, http://dx.doi.org/10.1142/s0219622015500339.
[34] S. Lin, D.E. Brown, An outlier-based data association method for linking criminal incidents, Decis. Support Syst. 41 (3) (2006) 604–615, http://dx.doi.org/10.1016/j.dss.2004.06.005.
[35] S. Zhu, Y. Xie, Crime event embedding with unsupervised feature selection, in: Paper Presented at the ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2019.
[36] F. Albertetti, P. Cotofrei, L. Grossrieder, O. Ribaux, K. Stoffel, The CriLiM methodology: Crime linkage with a fuzzy MCDM approach, in: Paper Presented at the 2013 European Intelligence and Security Informatics Conference, 2013.
[37] S. Goala, P. Dutta, A fuzzy multicriteria decision-making approach to crime linkage, Int. J. Inf. Technol. Syst. Approach 11 (2) (2018) 31–50, http://dx.doi.org/10.4018/ijitsa.2018070103.
[38] M. Tonkin, J. Lemeire, P. Santtila, J.M. Winter, Linking property crime using offender crime scene behaviour: A comparison of methods, J. Invest. Psychol. Offender Profil. (2019) http://dx.doi.org/10.1002/jip.1525.
[39] M. Tonkin, T. Pakkanen, J. Sirén, C. Bennell, J. Woodhams, A. Burrell, P. Santtila, Using offender crime scene behavior to link stranger sexual assaults: A comparison of three statistical approaches, J. Crim. Justice 50 (2017) 19–28, http://dx.doi.org/10.1016/j.jcrimjus.2017.04.002.
[40] G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, G. Bing, Learning from class-imbalanced data: Review of methods and applications, Expert Syst. Appl. 73 (2017) 220–239, http://dx.doi.org/10.1016/j.eswa.2016.12.035.
[41] C. Zhang, J. Bi, S. Xu, E. Ramentol, G. Fan, B. Qiao, H. Fujita, Multi-imbalance: An open-source software for multi-class imbalance learning, Knowl.-Based Syst. 174 (2019) 137–143, http://dx.doi.org/10.1016/j.knosys.2019.03.001.
[42] H. He, E.A. Garcia, Learning from imbalanced data, IEEE Trans. Knowl. Data Eng. 21 (9) (2009) 1263–1284, http://dx.doi.org/10.1109/tkde.2008.239.
[43] A. Ali, S.M. Shamsuddin, A.L. Ralescu, Classification with class imbalance problem: a review, Int. J. Adv. Soft Comput. Appl. 7 (3) (2015) 176–204.
[44] C. Chen, A. Liaw, L. Breiman, Using random forest to learn imbalanced data, Univ. California 110 (1–12) (2004) 24.
[45] M. Bader-El-Den, E. Teitei, T. Perry, Biased random forest for dealing with the class imbalance problem, IEEE Trans. Neural Netw. Learn. Syst. (2018) http://dx.doi.org/10.1109/TNNLS.2018.2878400.
[46] R. O’Brien, H. Ishwaran, A random forests quantile classifier for class imbalanced data, Pattern Recognit. 90 (2019) 232–249, http://dx.doi.org/10.1016/j.patcog.2019.01.036.
[47] J. Sun, J. Lang, H. Fujita, H. Li, Imbalanced enterprise credit evaluation with DTE-SBD: Decision tree ensemble based on SMOTE and bagging with differentiated sampling rates, Inform. Sci. 425 (2018) 76–91, http://dx.doi.org/10.1016/j.ins.2017.10.017.
[48] J. Sun, H. Li, H. Fujita, B. Fu, W. Ai, Class-imbalanced dynamic financial distress prediction based on Adaboost-SVM ensemble combined with SMOTE and time weighting, Inf. Fusion 54 (2020) 128–144, http://dx.doi.org/10.1016/j.inffus.2019.07.006.
[49] H. Han, W.-Y. Wang, B.-H. Mao, Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning, in: Paper Presented at the International Conference on Intelligent Computing, ICIC 2005, August 23, 2005 - August 26, 2005, Hefei, China, 2005.
[50] H.M. Nguyen, E.W. Cooper, K. Kamei, Borderline over-sampling for imbalanced data classification, in: Paper Presented at the Proceedings: Fifth International Workshop on Computational Intelligence & Applications, 2009.
[51] K. Savetratanakaree, K. Sookhanaphibarn, S. Intakosum, R. Thawonmas, Borderline over-sampling in feature space for learning algorithms in imbalanced data environments, vol. 43, 2016.
[52] Y.-S. Li, M.-L. Qi, An approach for understanding offender modus operandi to detect serial robbery crimes, J. Comput. Sci. 36 (2019) 101024, http://dx.doi.org/10.1016/j.jocs.2019.101024.
[53] J. Hu, T. Li, H. Wang, H. Fujita, Hierarchical cluster ensemble model based on knowledge granulation, Knowl.-Based Syst. 91 (2016) 179–188, http://dx.doi.org/10.1016/j.knosys.2015.10.006.
[54] Y. Tang, Y. Zhang, N.V. Chawla, S. Krasser, SVMs modeling for highly imbalanced classification, IEEE Trans. Syst. Man Cybern. B 39 (1) (2009) 281–288, http://dx.doi.org/10.1109/TSMCB.2008.2002909.
[55] J. Leng, Q. Chen, N. Mao, P. Jiang, Combining granular computing technique with deep learning for service planning under social manufacturing contexts, Knowl.-Based Syst. 143 (2018) 295–306, http://dx.doi.org/10.1016/j.knosys.2017.07.023.
[56] X. Zhu, W. Pedrycz, Z. Li, Granular representation of data: A design of families of ϵ-information granules, IEEE Trans. Fuzzy Syst. 26 (4) (2018) 2107–2119, http://dx.doi.org/10.1109/tfuzz.2017.2763122.
[57] T. Ouyang, W. Pedrycz, N.J. Pizzi, Record linkage based on a three-way decision with the use of granular descriptors, Expert Syst. Appl. 122 (2019) 16–26, http://dx.doi.org/10.1016/j.eswa.2018.12.038.
[58] T.M. Oshiro, P.S. Perez, J.A. Baranauskas, How many trees in a random forest?, in: Machine Learning and Data Mining in Pattern Recognition, 2012, pp. 154–168.
[59] P. Probst, A.-L. Boulesteix, To tune or not to tune the number of trees in random forest, J. Mach. Learn. Res. 18 (2017) 181:1–181:18.
[60] J. Demšar, D. Schuurmans, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (1) (2006) 1–30.
[61] J.-R. Cano, Analysis of data complexity measures for classification, Expert Syst. Appl. 40 (12) (2013) 4820–4831, http://dx.doi.org/10.1016/j.eswa.2013.02.025.