
WACV 2023 Submission #4444. CONFIDENTIAL REVIEW COPY. DO NOT DISTRIBUTE.

Applied Deep Learning for Fuzzy String Matching

Anonymous WACV submission

Paper ID 4444
Abstract

This paper formulates entity matching as a classification problem, where the basic goal is to classify entity pairs as matching or non-matching.

1. Introduction

String matching, specifically personal name matching, plays an important role in many natural language processing applications. Personal name matching is the problem of determining whether two name strings refer to the same person or not. As more and more personal data is generated, research on personal name matching is becoming increasingly interesting. Names appear everywhere: in emails, customer profiles, patient profiles, citizen data, lists of company employees, politics, social media posts, and even scientific reports, which carry the names of their authors. People's names are therefore often used to search for documents in large databases, for example to find authors or patients, and famous people's names also make up the majority of top Google search keywords. In addition, a person's name is a valuable data point in the data linkage problem between two different datasets when there is no other connection between them. Furthermore, many applications use names to deduplicate records within the same dataset. Personal name matching is considered one of the best ways to solve these problems.

Personal name matching faces many challenges because people's names are recorded with different variations. For example, a single person's name can have many different spellings, such as "Gail", "Gale", or "Gayle" [3]. People often have nicknames, like "Bob" for "Robert" or "Bill" for "William". Furthermore, names can vary through the presence or absence of a family name, misspellings, initials, titles, honorifics, multiple languages, or changes over time (e.g., marriage or divorce), all of which add to the difficulty of personal name matching tasks.

Because of these problems, comparing names by exact string matching often gives bad results. Therefore, approaches based on approximate string matching were developed and applied to this problem. These methods are mainly based on phonetic encoding and pattern matching. As technology continued to develop and machine learning emerged, popular machine learning models such as SVM, Logistic Regression, and Random Forest were applied to the name matching problem and achieved better results than previous methods [2]. One of the most recent approaches is based on deep learning.

In this paper, we use a deep neural network adapted from the toponym matching paper [6].

2. Related works

Upon further analysis, it was observed that edit-distance features could not correctly label cases such as abbreviations (e.g., IBM and International Business Machines), popular aliases (e.g., Big Blue and IBM), and partial popular organisation names (e.g., Disney and Walt Disney).

Traditional methods for string matching are based on computing a similarity value between two strings and then deciding whether they match according to a predefined threshold. The literature on approximate string matching and string similarity is very broad; following Cohen et al. (2003), it can be divided into three main approaches: character-based, vector-based, and hybrid. Character-based methods mainly rely on character operations, including insertion, deletion, and substitution. In vector-based methods, words are represented as vectors and the distance between those vectors is computed; the smaller the distance, the higher the similarity of the two words, and vice versa. Finally, hybrid (also called token-based) methods combine the vector-based and character-based approaches to improve effectiveness. Section 2.1 gives an overview of traditional string comparison methods and the neural network models applied to this problem, while Section 2.2 introduces related work on personal name matching.

Figure 1: Left: the original architecture. Middle: when we start the fine-tuning process, we freeze all embedding and Transformer-encoder layers and only allow the gradient to backpropagate through the FC layers. Right: after the FC layers have had a chance to warm up, we may choose to unfreeze all layers in the network and allow each of them to be fine-tuned as well.
2.1. Existing methods for fuzzy string matching

Edit distances map a pair of strings s and t to a real number r, where the distance is the cost of the best sequence of edit operations that converts s into t. Typical edit operations are character insertion, deletion, and substitution, and each operation must be assigned a cost. Two edit-distance functions are considered in this section. The Levenshtein distance (1966) assigns a unit cost to all edit operations. For instance, the edit distance between the names Jean and Jennie is three, corresponding to one substitution and two insertions, namely a -> n, empty -> i, and empty -> e. As an example of a more complex, well-tuned distance function, many improvements have been proposed, including the approach by Damerau (1964), which also takes the transposition of two adjacent characters into account.

A broadly similar metric that is not based on an edit distance is the Jaro metric (Jaro 1989; 1995; Winkler 1999). This metric is based on the number and order of common characters, which makes it applicable to matching short strings such as organisation and personal names. In this approach, two characters are considered common only if the difference between their positions is no more than half the length of the longer string. It is worth noting that Jaro-based approaches can be problematic if the strings to be matched contain multiple words that are ordered differently (e.g., when matching the names David and David Copperfield).

Regarding vector-space approaches,

2.2. Personal name matching

3. Transfer learning

Recently, the transfer learning approach has proven especially helpful where only limited training examples are available. In this way, a model already trained on a large dataset can be tuned to a new domain. For the personal name matching task, tuned on ULAN and Wikidata (refer to Section *), the Transformer can be tuned with two approaches:

- Fine-tuning for the classification task: recent evidence [*] suggests that only a few of the final layers need to be fine-tuned to achieve remarkable accuracy. This means that we use our new dataset to tune the weights of the current set of fully-connected layers, while any learnable parameters highlighted in blue in Figure 1 can be frozen during training. By training only a few FC layers, the network may be able to learn new patterns without destroying its previously learned powerful features.

- Fine-tuning for feature extraction: this method uses the Transformer as a feature extraction model to generate vector representations for all known variations of entity names. After the FC head has started to learn patterns in our dataset (refer to the previous paragraph), we pause training, unfreeze the body, and then continue training with a very small learning rate until sufficient accuracy is obtained. The vector representations extracted from the fine-tuned Transformer are then fed into the candidate ranker, where they are compared to candidate vectors using a metric. The metric can be DeezyMatch's prediction score, the L2-norm distance, or the cosine similarity between the query and candidate vectors.
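The two-stage freeze/unfreeze schedule above can be sketched in PyTorch. This is a minimal illustration under assumptions, not the paper's code: the model here is a small stand-in with an encoder "body" and a fully-connected "head", and all layer sizes and names are illustrative:

```python
import torch.nn as nn

# Hypothetical stand-in for the pair classifier: an encoder body + FC head.
model = nn.Sequential()
model.add_module("body", nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=60, nhead=4, batch_first=True),
    num_layers=2))
model.add_module("head", nn.Sequential(
    nn.Linear(60, 60), nn.ReLU(), nn.Linear(60, 2)))

def set_requires_grad(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

# Stage 1: freeze the body so gradients flow only through the FC head.
set_requires_grad(model.body, False)
frozen_body = all(not p.requires_grad for p in model.body.parameters())

# ... warm up the head for a few epochs here ...

# Stage 2: unfreeze everything and continue with a very small learning rate.
set_requires_grad(model, True)
```

Freezing via `requires_grad = False` keeps the frozen parameters out of the gradient computation, which is the standard PyTorch way to realise the "freeze, warm up, then unfreeze" schedule of Figure 1.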
Figure 2: Two main components of the framework: the pair classifier (left) and the candidate ranker (right).
4. Framework

We built our model based on the model in the DeezyMatch paper [4]. The model consists of two main components: the pair classifier and the candidate ranker. The pair classifier discriminates the input pairs into two classes, true and false, while the candidate ranker uses the vector distances between the given query and the candidates to rank the results from most similar to least similar.

4.1. Pair classifier

The pair classifier component is a deep neural network binary classifier; see the left component in Figure 2. It takes a query-candidate pair of name strings as input, and its output is the label true or false (matching or non-matching, respectively) for this pair. In the DeezyMatch model, the deep neural network is built on an RNN core (RNN, LSTM, or GRU). In this paper, to improve performance, we propose using a Transformer as the core of the model. Note that we only use Transformer encoders [7] in our work.

Overall, there are four stages in the training process: preprocessing the query-candidate input pair, embedding, combining the vector representations of the query and the candidate, and predicting the label for the input pair. In the first stage, the query-candidate pair is converted to lowercase, stripped of white space, and tokenized (character and n-gram) separately for the query and the candidate. Next, at the embedding stage, these tokens are first passed through the embedding layer and then summed with the positional encoding to create the shallow embeddings. The two shallow embeddings, one for the query and one for the candidate, then go through six sequential Transformer encoder layers to generate deeper embeddings, also known as their vector representations. After embedding, we combine these two vector representations by concatenating them with their multiplication and difference. Finally, the concatenated vector is used as input for a feed-forward network with one hidden layer and a Rectified Linear Unit (ReLU) activation function. We then apply a Softmax activation function to the output layer, which has two units, to make the final prediction.

To give a better understanding of the proposed model, we present the parameter settings in detail. The input string is tokenized by characters and by n-grams with n = 2, 3. The prefix and suffix markers are "<" and ">". The maximum sequence length is 120, and the embedding size is 60. The Transformer consists of 6 sequential encoder layers, each with 8 heads. The feed-forward network in each encoder includes one linear layer projecting up to 120 and one linear layer projecting down to 60, with ReLU activation. The last feed-forward layers in the model are similar to those in the Transformer encoders, but the second linear layer projects down to 2.

4.2. Candidate ranker

The candidate ranker is the component that ranks the candidates according to their similarity scores to the user's query. There are many ways to compare the similarity of a query-candidate pair. In this paper, we discuss the following two ways.
In the first way, we use the whole pretrained pair classifier model. In the output layer of the model, after softmax has been applied, the probability of the label true is chosen as the confidence score used to order the candidates. In the second way, we calculate the similarity based on the vector representations of the query and the candidates. To get a vector representation, we again use the pair classifier model, but instead of using the whole model we take the output of the last encoder layer. This output is the vector representation of the query or candidate just passed in. Since the query and candidate are passed independently at this step, we do not need to input a query-candidate pair; we input just one of the two to get its representation vector. Once we have these representation vectors, we can compare their distances to compute the desired similarity. In this paper we use cosine similarity and the L2-norm distance [5]. Finally, the similarity score is computed with the selected method and used to rearrange the order of the candidates.

At this point, the candidate ranking step is complete. However, this model has an easily recognized weakness: every time we perform candidate ranking for a new query, the model has to generate vector representations for that query and for all the candidates in the database, and then perform the similarity calculation. This wastes time and resources because the database is usually very large. To overcome this drawback, the candidate ranker needs a small adjustment. First, we use a pretrained model to generate vector representations of all the data in the database; note that this step only needs to be done once. Then, when a new query arrives, we create a vector representation for it and compare it with all the vector representations created initially. Thus, with just a small adjustment, we can greatly reduce the number of times vector representations are generated, leading to a large reduction in computation time.

Although this adjustment greatly reduces the computation time, comparing the query with every candidate in a huge database still takes a lot of time. The authors of DeezyMatch proposed a more efficient and faster search method that does not need to scan the entire database. This method first requires the user to choose a search-size variable and a threshold. The method then selects the search-size candidates from the database that have the smallest L2-norm distance to the query, computes their similarity with the selected method, and discards candidates whose similarity score does not satisfy the preset threshold (for example, requiring a confidence score greater than 0.8). If the number of remaining candidates is not sufficient, the search range is expanded, and the L2-norm distance is again used to find the next search-size candidates. This process stops when the number of candidates found meets the requirement or when all the candidates in the database have been considered. This method greatly reduces computation time when searching a large database. However, it has a weakness: because candidates are selected for consideration based on the L2-norm distance, the required number of candidates can be reached while the candidate we most want to find has not yet been selected.

5. Experiments and Results

We built our model in PyTorch, based on the source code of DeezyMatch [4]. By running several experiments on different datasets, we recorded sufficient results to analyze the performance of the proposed model. We also compared it to the DeezyMatch model and other established approaches. The results show that our model outperforms the others in the transfer learning task.

5.1. Data sources

In this section, we introduce the data sources for the two main tasks, training and fine-tuning. To begin with, the paper [6] provides a five-million (5M) toponym-pair dataset collected from the public GeoNames gazetteer, which we use to train the source model. Second, for the fine-tuning task, we use The Union List of Artist Names (ULAN) dataset from The Getty Vocabularies [1] and Wikidata. The Toponym, ULAN, and Wikidata datasets are explained in detail in the next sections.

5.1.1 Toponym dataset

The Toponym dataset includes five million (5M) pairs of toponyms collected from the public GeoNames gazetteer, created by Santos et al. [6]. GeoNames covers places worldwide and their variant names, in different alphabets and languages. To create and label the pairs, the authors [6] devised a data generator function. This function creates matching pairs from the variant names of a place; each of the two variant names must have more than 2 characters, and they must not match character for character after being converted to lowercase. Non-matching pairs are chosen from the names of different places and must also have more than 2 characters. Moreover, to make the non-matching pairs more challenging for automated classification, pairs with a Jaccard similarity coefficient equal to zero are discarded with a probability of 0.75. Finally, the Toponym dataset comprises 50% matching and 50% non-matching pairs, a good distribution for the binary classification task.
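The vector comparison used by the candidate ranker (Section 4.2) can be sketched in plain Python. The helper names are illustrative, and the vectors below are toy stand-ins for encoder outputs:

```python
import math

def l2_distance(q, c):
    """Euclidean (L2-norm) distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, c)))

def cosine_similarity(q, c):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(q, c))
    nq = math.sqrt(sum(a * a for a in q))
    nc = math.sqrt(sum(b * b for b in c))
    return dot / (nq * nc)

def rank(query_vec, candidates):
    """Order (name, vector) candidates from most to least similar
    to the query, i.e. by increasing L2-norm distance."""
    return sorted(candidates, key=lambda item: l2_distance(query_vec, item[1]))

# Toy vectors standing in for precomputed database representations.
db = [("Jean", [1.0, 0.0]), ("Jennie", [0.9, 0.1]), ("David", [0.0, 1.0])]
ordered = [name for name, _ in rank([1.0, 0.0], db)]
# ordered == ["Jean", "Jennie", "David"]
```

Precomputing the database vectors once, as described above, means each new query only costs one encoder pass plus these cheap distance computations.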
5.1.2 ULAN dataset

The Union List of Artist Names (ULAN) data is part of The Getty Vocabularies [1]; it was collected from 1984 to 1994 by the user community and the editors of the Getty Vocabulary Program. The ULAN dataset is a structured vocabulary that contains data about artists from all over the world, such as their id, name, variant names, nationality, role, gender, birthday, and death date. ULAN variant names cover multiple types of names, such as given names, pseudonyms, different spellings, misspellings, multilingual names, birth names, abbreviations, common names, signatures, full names, and names that have changed over time (e.g., married names). For the purpose of matching artist names, we only use the names and variant names of artists in ULAN. We downloaded the ULAN full.zip file from [1], kept only the names and variant names, and removed all other information from the dataset. After that, we converted the names into a suitable format and passed them through the data generator function discussed above to generate matching and non-matching name pairs. In the end, the ULAN dataset contains around 550K name pairs with a 50-50 ratio of matching and non-matching pairs.

5.1.3 Wikidata-90K dataset

Wikidata-90K is a collection of approximately 93,000 singer name records, a subset of Wikidata covering entities with multi-language labels. Each name entity in this subset includes various types of names relating to the corresponding singer, namely full names, nicknames, or short names. To construct the dataset, we started with a medium-sized collection of singer names retrieved through the Wikidata API. The data was then downloaded, filtered, and combined to prepare the input format for the matching-pair generation stage. After collection, these entries were fed into the data generator function described in the paper [6] to generate matching and non-matching pairs.

5.2. Metrics

The name matching problem is treated as a binary classification task, where the inputs are two name strings and the output is a label indicating whether they match or not. To assess the model's performance, we choose the standard evaluation metrics for binary classification: Accuracy, Precision, Recall, and F1 score. Before showing the formulas, we first define the four outcomes of classification: True Positive (TP), when the label is true and the model predicts true; True Negative (TN), when the label is false and the model predicts false; False Positive (FP), when the label is false but the model predicts true; and False Negative (FN), when the label is true but the model predicts false. The formulas are defined as follows:

    Accuracy = (TN + TP) / (TN + TP + FN + FP)    (1)

The most intuitive performance metric is Accuracy, which is simply the ratio of correctly predicted samples to total samples. It is very effective when the dataset is balanced, but in the real world not all datasets are balanced. In that case, we need other measures, such as Precision and Recall.

    Precision = TP / (TP + FP)    (2)

Precision is the ratio of correctly predicted positive samples to the total predicted positive samples.

    Recall = TP / (TP + FN)    (3)

Recall is the ratio of correctly predicted positive samples to all the positive samples in the dataset. We want our model to have high precision and high recall at the same time, but there is always a trade-off between these two measures. Therefore, we use the F1 score:

    F1 score = 2 * (Precision * Recall) / (Precision + Recall)    (4)

The F1 score is the harmonic mean of Precision and Recall. Because it takes both into account, the higher the F1 score, the better the model performs. Note that all four metrics have a best value of 1 and a worst value of 0.

5.3. Training

We use the Toponym dataset to train the baseline model because it is large and diverse [6]. We trained a total of four models. They share the same architecture, described in Section 4; the only differences are the cores, which are RNN, LSTM, GRU, and Transformer. The hyperparameters are the same for all four models: the learning rate is 0.001, the batch size is 256, the number of epochs is 10, the dropout is 0.01, and the optimizer is Adam. The loss function is Binary Cross Entropy loss. We ran the experiments on an NVIDIA RTX A6000 with 48 GB of GDDR6 memory on PCIe 4.0 x16. The train, test, and validation sets are split with ratios of 0.7, 0.15, and 0.15, respectively.
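The four metrics of Section 5.2 can be computed directly from the confusion counts. A small self-contained sketch (not the paper's code):

```python
def accuracy(tp, tn, fp, fn):
    # Eq. (1): correctly predicted samples over all samples.
    return (tn + tp) / (tn + tp + fn + fp)

def precision(tp, fp):
    # Eq. (2): correct positives over all predicted positives.
    return tp / (tp + fp)

def recall(tp, fn):
    # Eq. (3): correct positives over all actual positives.
    return tp / (tp + fn)

def f1_score(p, r):
    # Eq. (4): harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

# Example counts: 8 TP, 9 TN, 1 FP, 2 FN.
p, r = precision(8, 1), recall(8, 2)
```

With these toy counts, precision is 8/9 and recall is 0.8; the F1 score sits between them, closer to the smaller of the two.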
Figure 3: The impact of fine-tuning and the choice of architecture on model performance.

Figure 3 demonstrates how fine-tuning affects model performance as a function of the number of fine-tuning entries. The ULAN and Wikidata datasets are used to assess the performance of the GRU and the Transformer, with training and testing on the corresponding dataset. In the beginning, the Transformer baseline is lower than the GRU baseline, at 0.83 compared to 0.87. Nevertheless, only 8K data points were needed to boost the performance of the Transformer from 0.87 to 0.89, after which it overtakes the GRU. While both models show the same upward trend during fine-tuning, the Transformer shows its advantage in adapting to a new domain, reaching a peak of nearly 0.95 in the end. It is worth noting that the performance tested on Wikidata shows a remarkable score, allowing the models to continue learning, compared to the performance tested on ULAN during fine-tuning.

5.4. Comparison

We compared the performance of the proposed model in transfer learning tasks with different methods and tested them on the ULAN dataset.

5.4.1 With traditional string matching methods

We selected a total of 12 traditional string similarity metrics, such as Levenshtein distance, Jaro-Winkler distance, and Jaccard similarity, for evaluation. To choose the threshold value for each method, we ran experiments with thresholds starting at 0.1 and ending at 0.9 with a step of 0.05, giving 17 runs per method. We then chose the threshold that gave the highest accuracy. The chosen methods, thresholds, and their results are presented in Table 1. As can be seen from the table, the advantage of these methods is that they require no training, and their inference times are very fast because they do not need to tokenize and embed the data. The results are impressive: the Davis and De Salles method reached the highest recall at 0.993, and except for Levenshtein, Damerau-Levenshtein, Jaro, and Jaro-Winkler, all the other methods have an accuracy and F1 score above 0.91. However, our proposed method still has better results. Compared to the best string similarity method, Monge-Elkan, which has an F1 score of 0.953, our model reached 0.957. The Transformer model also has greater accuracy and precision.

5.4.2 With supervised machine learning methods

For machine learning methods, we chose Support Vector Machine, Random Forest, Gradient Boosted Trees, and Naive Bayes to compare with our models (see Table 1). The implementations of these models are provided by the scikit-learn Python package. We first embed the two strings with Word2Vec using an embedding size of 120, then concatenate them into one vector and use it as the feature vector for the above models. Random Forest and Support Vector Machine were the best and second best of the four models, with F1 scores of 0.859 and 0.849, respectively. Moreover, two of them, Support Vector Machine and Gradient Boosted Trees, have a longer training time than the proposed method. Overall, our model has better results than the machine learning models on the personal name matching task.

5.4.3 With the DeezyMatch GRU method

The DeezyMatch GRU model and our Transformer model share the same architecture, with the only difference being the cores. To assess the performance of these deep neural network models on personal name matching tasks, we fine-tuned their last two layers using the ULAN dataset. The results are shown in Table 1. Deep learning models take time to train, but it is worth noting that they make use of parallel processing on the GPU. Therefore, in this case, the training time is not so large in comparison to the other supervised machine learning methods: 1183 seconds for the Transformer and 569 seconds for the GRU.

To provide a better understanding of the results of the two deep learning models, we investigated some of their predicted samples. In Table 2, the rows indicate the predictions of the Transformer and the GRU, and the columns show the true labels of the name pairs. As we can see, the Transformer model performed well on cases such as Benard Ch. J. - Bénard Charles Joachim and Roger G.G. - Roger Guillaume, while the GRU model did not.
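The threshold sweep described in Section 5.4.1 (0.1 to 0.9 in steps of 0.05, 17 runs per method, keeping the most accurate threshold) can be sketched as follows; `sim` and the toy data are illustrative stand-ins, not the paper's metrics or data:

```python
def sweep_thresholds(similarity, pairs, labels):
    """Try thresholds 0.1, 0.15, ..., 0.9 (17 values) and return the one
    with the highest accuracy on the labeled pairs."""
    thresholds = [round(0.1 + 0.05 * k, 2) for k in range(17)]
    def acc(t):
        preds = [similarity(a, b) >= t for a, b in pairs]
        return sum(p == y for p, y in zip(preds, labels)) / len(labels)
    return max(thresholds, key=acc)   # ties resolve to the first maximum

# Toy similarity: Jaccard over character sets (illustrative only).
def sim(a, b):
    sa, sb = set(a.lower()), set(b.lower())
    return len(sa & sb) / len(sa | sb)

best = sweep_thresholds(sim, [("Gail", "Gale"), ("Gail", "Bob")], [True, False])
```

In a real run, `pairs` and `labels` would be the ULAN evaluation split and `similarity` one of the 12 string metrics.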
Method                    Threshold  Accuracy  Precision  Recall  F1 score  Training time   Inference time
Levenshtein               0.30       0.674     0.395      0.895   0.548                     0.62 sec.
Damerau-Levenshtein       0.30       0.674     0.396      0.892   0.548                     0.62 sec.
Jaro                      0.60       0.792     0.678      0.878   0.765                     0.50 sec.
Jaro-Winkler              0.60       0.792     0.678      0.878   0.765                     0.57 sec.
Sorted Jaro-Winkler       0.65       0.930     0.884      0.974   0.927                     0.68 sec.
Cosine similarity         0.65       0.925     0.888      0.958   0.922                     1.45 sec.
Jaccard similarity        0.50       0.918     0.867      0.967   0.914                     1.79 sec.
Overlap coefficient       0.85       0.927     0.887      0.965   0.924                     1.39 sec.
Dice coefficient          0.65       0.919     0.870      0.964   0.915                     1.36 sec.
Soft-Jaccard              0.50       0.952     0.922      0.980   0.950                     4.84 sec.
Monge-Elkan               0.70       0.955     0.919      0.990   0.953                     5.09 sec.
Davis and De Salles       0.60       0.947     0.900      0.993   0.944                     6.82 sec.
Support Vector Machine               0.849     0.847      0.852   0.849     13954.24 sec.   2614.90 sec.
Random Forest                        0.864     0.888      0.833   0.859     866.47 sec.     3.00 sec.
Gradient Boosted Trees               0.766     0.805      0.702   0.750     1978.34 sec.    0.32 sec.
Naive Bayes                          0.585     0.614      0.459   0.525     1.11 sec.       0.40 sec.
Fine-tuned Transformer               0.957     0.971      0.941   0.957     1183 sec.       20.92 sec.
Fine-tuned GRU                       0.944     0.945      0.941   0.944     569 sec.        10.06 sec.

Table 1: A comparison of the proposed method with the DeezyMatch GRU method, the traditional string similarity matching methods, and the supervised machine learning methods, performed on the ULAN dataset.
One explanation is that the Transformer can learn information about the whole string at the same time, rather than sequence by sequence like the GRU. Therefore, it can learn more about the semantics of a name. Overall, the results show that our model outperforms the GRU model: we achieved better accuracy, precision, and F1 score than the GRU, and only recall is equal. This means that the proposed model performs well on personal name matching tasks.

5.5. Conclusion

References

[1] The Getty Vocabularies. http://vocab.getty.edu/. Accessed: 2022-03-15.
[2] Mohamad Alifikri and Moch. Arif Bijaksana. Indonesian name matching using machine learning supervised approach. Journal of Physics: Conference Series, 971:012038, Mar. 2018.
[3] Peter Christen. A comparison of personal name matching: Techniques and practical issues. In Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06), pages 290-294, 2006.
[4] Kasra Hosseini, Federico Nanni, and Mariona Coll Ardanuy. DeezyMatch: A flexible deep learning approach to fuzzy string matching. pages 62-69, Oct. 2020.
[5] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535-547, 2021.
[6] Rui Santos, Patricia Murrieta-Flores, Pável Calado, and Bruno Martins. Toponym matching through deep neural networks. International Journal of Geographical Information Science, 32(2):324-348, Oct. 2017.
[7] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017.
                              Label True                                 Label False

Transformer predicts True     Greacen Nan; Nan Faure Greacen             GJohnpaul Jones; June John
GRU predicts False            Jutz Carl; Iutz                            Marsh Miss; Eisner Mark
                              '; Allora Jennifer                         Saranyena Juan; Jan Soreau
                              Troili Gustaf Uno; Uno Troili              ; Antonio Croci
                              Benard Ch. J.; Bénard Charles Joachim      Dyk Philip van; P. Van Dyke
                              Kamis Ram; Ram Karmi                       Ezra Waite; Ekert Henriëtte van
                              Roger G.G.; Roger Guillaume                Franz A.; Francisca Labbé
                              André-François; Farkas André               Anthony Edgeworth; Antonini Andrea
                              Milward Mary; Knox Mrs. Mary               Böll Aloys; Ajolfi Elia
                              Pepion Daniel; Pepion Dan Jr.              Paxton Joseph; Belcamp John
                              Wall Terry; Terence Wall                   Hanke Stefan; Han Si-gak

Transformer predicts False    Bichkoff Erik; Byčkov Israel               Smallman Roy; Stéphane Mallarmé
GRU predicts True             Moore Mrs.; Lou Wall Moore                 Frans Gons; Giovan Francesco Biceso
                              Bone Henry R.; H P Bone                    GCV;
                              Wu Shih-mang; Huangyongyu                  Bernice Cross; Barsness Jim
                              Johannes Foersom;                          ; Ronaldo Racy

Table 2: Illustrative examples of predictions from the proposed Transformer and DeezyMatch GRU models, both fine-tuned on the ULAN dataset.
