
bergvca.github.io

Super Fast String Matching in Python


Traditional approaches to string matching, such as the Jaro-Winkler or Levenshtein distance measures,
are too slow for large datasets. Using TF-IDF with N-Grams as terms to find similar strings
transforms the problem into a matrix multiplication problem, which is computationally much
cheaper. Using this approach made it possible to search for near duplicates in a set of 663,000
company names in 42 minutes using only a dual-core laptop.
Update: run all the code in the post below with one line using string_grouper:
match_strings(companies['Company Name'])
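In context, that one-liner looks roughly like this (a minimal sketch, assuming string_grouper is
installed, e.g. via pip install string_grouper, and using the same CSV as the rest of this post):

import pandas as pd
from string_grouper import match_strings

companies = pd.read_csv('data/sec_edgar_company_info.csv')
# Returns a DataFrame with the similar pairs and their similarity scores
matches = match_strings(companies['Company Name'])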
Name Matching
A problem that I have witnessed when working with databases, and I think many other people have
as well, is name matching. Databases often have multiple entries that relate to the same entity, for
example a person or company, where one entry has a slightly different spelling than the other. This is
a problem, and you want to de-duplicate these. A similar problem occurs when you want to merge or
join databases using names as the identifier.
The following table gives an example:
Company Name
Burger King
Mc Donalds
KFC
Mac Donald’s
For the human reader it is obvious that both Mc Donalds and Mac Donald's are the same company.
For a computer, however, these are completely different strings, which makes spotting such nearly
identical entries difficult.
One way to solve this would be to use a string similarity measure like the Jaro-Winkler or
Levenshtein distance. The obvious problem here is that the number of calculations necessary grows
quadratically: every entry has to be compared with every other entry in the dataset, which in our case
means calculating one of these measures 663,000^2 times. In this post I will explain how this can be
done faster using TF-IDF, N-Grams, and sparse matrix multiplication.
The Dataset
I just grabbed a random dataset with lots of company names from Kaggle. It contains all company
names in the SEC EDGAR database. I don't know anything about the data or the number of
duplicates in this dataset (it should be 0), but most likely there will be some very similar names.
import pandas as pd

pd.set_option('display.max_colwidth', -1)
names = pd.read_csv('data/sec_edgar_company_info.csv')
print('The shape: %d x %d' % names.shape)
names.head()
   Line Number  Company Name                         Company CIK Key
0  1            !J INC                               1438823
1  2            #1 A LIFESAFER HOLDINGS, INC.        1509607
2  3            #1 ARIZONA DISCOUNT PROPERTIES LLC   1457512
3  4            #1 PAINTBALL CORP                    1433777
4  5            $ LLC                                1427189
TF-IDF
TF-IDF is a method to generate features from text by multiplying the frequency of a term (usually a
word) in a document (the Term Frequency, or TF) by the importance (the Inverse Document
Frequency, or IDF) of the same term in an entire corpus. This last factor weights less important words
(e.g. "the", "it", "and", etc.) down, and words that don't occur frequently up. IDF is calculated as:
IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
An example (from www.tfidf.com/):
Consider a document containing 100 words in which the word cat appears 3 times. The term
frequency (i.e., TF) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and
the word cat appears in one thousand of these. Then, the inverse document frequency (i.e., IDF) is
calculated as log(10,000,000 / 1,000) = 4. Thus, the TF-IDF weight is the product of these quantities:
0.03 * 4 = 0.12.
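The numbers from this example can be reproduced in a few lines (note that the example uses a
base-10 logarithm):

import math

tf = 3 / 100                          # term frequency of "cat"
idf = math.log10(10_000_000 / 1_000)  # inverse document frequency = 4.0
print(tf * idf)                       # TF-IDF weight: 0.12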
TF-IDF is very useful in text classification and text clustering. It is used to transform documents
into numeric vectors that can easily be compared.
N-Grams
While the terms in TF-IDF are usually words, this is not a necessity. In our case using words as
terms wouldn’t help us much, as most company names only contain one or two words. This is why
we will use n-grams: sequences of N contiguous items, in this case characters. The following
function cleans a string and generates all n-grams in this string:
import re

def ngrams(string, n=3):
    string = re.sub(r'[,-./]|\sBD', r'', string)
    ngrams = zip(*[string[i:] for i in range(n)])
    return [''.join(ngram) for ngram in ngrams]

print('All 3-grams in "McDonalds":')
ngrams('McDonalds')

All 3-grams in "McDonalds":
['McD', 'cDo', 'Don', 'ona', 'nal', 'ald', 'lds']


As you can see, the code above does some cleaning as well. Next to removing some punctuation
(dots, commas, etc.) it removes the string " BD". This is a nice example of one of the pitfalls of this
approach: terms that appear very infrequently receive a high weight, which biases the similarity
towards those terms. In this case there were some company names ending with " BD" that were
being identified as similar, even though the rest of the string was not similar.
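To illustrate, here is the same function applied to a made-up name containing punctuation and a
trailing " BD":

print(ngrams('Mc. Donalds- BD'))
# ['Mc ', 'c D', ' Do', 'Don', 'ona', 'nal', 'ald', 'lds']
# The dot, the dash and the trailing " BD" are stripped before the 3-grams are generated.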
The code to generate the matrix of TF-IDF values for each company name is shown below.
from sklearn.feature_extraction.text import TfidfVectorizer

company_names = names['Company Name']
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
tf_idf_matrix = vectorizer.fit_transform(company_names)
The resulting matrix is very sparse as most terms in the corpus will not appear in most company
names. Scikit-learn deals with this nicely by returning a sparse CSR matrix.
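A quick way to inspect the size and sparsity of that matrix:

print(tf_idf_matrix.shape)  # (number of company names, number of distinct 3-grams)
density = tf_idf_matrix.nnz / (tf_idf_matrix.shape[0] * tf_idf_matrix.shape[1])
print('Fraction of non-zero entries: %f' % density)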
You can see the first row (“!J INC”) contains three terms for the columns 11, 16196, and 15541.
print(tf_idf_matrix[0])

# Check if this makes sense:
ngrams('!J INC')

  (0, 11)      0.844099068282
  (0, 16196)   0.51177784466
  (0, 15541)   0.159938115034

['!JI', 'JIN', 'INC']


The last term (‘INC’) has a relatively low value, which makes sense as this term will appear often in
the corpus, thus receiving a lower IDF weight.
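To see which n-grams those column indices correspond to, the vectorizer's vocabulary can be
inspected (a small sketch; recent scikit-learn versions expose get_feature_names_out, older versions
use get_feature_names):

# Map the column indices of the first row back to their 3-grams
feature_names = vectorizer.get_feature_names_out()
first_row = tf_idf_matrix[0]
for col, value in zip(first_row.indices, first_row.data):
    print(feature_names[col], value)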
Cosine Similarity
To calculate the similarity between two vectors of TF-IDF values the Cosine Similarity is usually
used. The cosine similarity can be seen as a normalized dot product. For a good explanation see:
this site. We can theoretically calculate the cosine similarity of all items in our dataset with all other
items in scikit-learn by using the cosine_similarity function. However, the Data Scientists at ING
found that this has some disadvantages:
- The sklearn version does a lot of type checking and error handling.
- The sklearn version calculates and stores all similarities in one go, while we are only
  interested in the most similar ones. Therefore it uses a lot more memory than necessary.
To optimize for these disadvantages they created their own library which stores only the top N
highest matches in each row, and only the similarities above an (optional) threshold.
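Why does matrix multiplication give cosine similarities? TfidfVectorizer L2-normalises each row by
default, so the cosine similarity between two names is simply the dot product of their TF-IDF rows,
and the full similarity matrix is the (sparse) product of the TF-IDF matrix with its own transpose. A
tiny illustration with three made-up names:

small = TfidfVectorizer(min_df=1, analyzer=ngrams).fit_transform(
    ['Mc Donalds', "Mac Donald's", 'Burger King'])
# Rows are L2-normalised, so this sparse product holds the pairwise cosine similarities
print((small * small.T).toarray())

The awesome_cossim_top function below wraps ING's optimized sparse_dot_topn routine to do
exactly this multiplication while keeping only the best matches: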
import numpy as np
from scipy.sparse import csr_matrix
import sparse_dot_topn.sparse_dot_topn as ct

def awesome_cossim_top(A, B, ntop, lower_bound=0):
    # force A and B as a CSR matrix.
    # If they have already been CSR, there is no overhead
    A = A.tocsr()
    B = B.tocsr()
    M, _ = A.shape
    _, N = B.shape

    idx_dtype = np.int32

    nnz_max = M * ntop

    indptr = np.zeros(M + 1, dtype=idx_dtype)
    indices = np.zeros(nnz_max, dtype=idx_dtype)
    data = np.zeros(nnz_max, dtype=A.dtype)

    ct.sparse_dot_topn(
        M, N, np.asarray(A.indptr, dtype=idx_dtype),
        np.asarray(A.indices, dtype=idx_dtype),
        A.data,
        np.asarray(B.indptr, dtype=idx_dtype),
        np.asarray(B.indices, dtype=idx_dtype),
        B.data,
        ntop,
        lower_bound,
        indptr, indices, data)

    return csr_matrix((data, indices, indptr), shape=(M, N))
The following code runs the optimized cosine similarity function. It only stores the top 10 most
similar items, and only items with a similarity above 0.8:
import time
t1 = time.time()
matches = awesome_cossim_top(tf_idf_matrix, tf_idf_matrix.transpose(), 10, 0.8)
t = time.time()-t1
print("SELFTIMED:", t)
SELFTIMED: 2718.7523670196533
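The result is again a sparse matrix; a quick check of its size and of how many candidate pairs were
kept (at most ntop per row, and only those above the lower bound):

print(matches.shape)  # one row and one column per company name
print(matches.nnz)    # number of stored similarities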
The following code unpacks the resulting sparse matrix. As it is a bit slow, an option to look at only
the first n values is added.
def get_matches_df(sparse_matrix, name_vector, top=100):
    non_zeros = sparse_matrix.nonzero()

    sparserows = non_zeros[0]
    sparsecols = non_zeros[1]

    if top:
        nr_matches = top
    else:
        nr_matches = sparsecols.size

    left_side = np.empty([nr_matches], dtype=object)
    right_side = np.empty([nr_matches], dtype=object)
    similarity = np.zeros(nr_matches)

    for index in range(0, nr_matches):
        left_side[index] = name_vector[sparserows[index]]
        right_side[index] = name_vector[sparsecols[index]]
        similarity[index] = sparse_matrix.data[index]

    return pd.DataFrame({'left_side': left_side,
                         'right_side': right_side,
                         'similarity': similarity})

Let's look at our matches:

matches_df = get_matches_df(matches, company_names, top=100000)
matches_df = matches_df[matches_df['similarity'] < 0.99999]  # Remove all exact matches
matches_df.sample(20)
       left_side                                                   right_side                                                         similarity
41024  ADVISORY U S EQUITY MARKET NEUTRAL OVERSEAS FUND LTD       ADVISORY US EQUITY MARKET NEUTRAL FUND LP                          0.818439
48061  AIM VARIABLE INSURANCE FUNDS                                AIM VARIABLE INSURANCE FUNDS (INVESCO VARIABLE INSURANCE FUNDS)   0.856922
14978  ACP ACQUISITION CORP                                        CP ACQUISITION CORP                                                0.913479
54837  ALLFIRST TRUST CO NA                                        ALLFIRST TRUST CO NA /TA/                                          0.938206
89788  ARMSTRONG MICHAEL L                                         ARMSTRONG MICHAEL                                                  0.981860
54124  ALLEN MICHAEL D                                             ALLEN MICHAEL J                                                    0.928606
66765  AMERICAN SCRAP PROCESSING INC                               SCRAP PROCESSING INC                                               0.858714
44886  AGL LIFE ASSURANCE CO SEPARATE ACCOUNT VA 27                AGL LIFE ASSURANCE CO SEPARATE ACCOUNT VA 24                       0.880202
49119  AJW PARTNERS II LLC                                         AJW PARTNERS LLC                                                   0.876761
16712  ADAMS MICHAEL C.                                            ADAMS MICHAEL A                                                    0.891636
96207  ASTRONOVA, INC.                                             PETRONOVA, INC.                                                    0.841667
26079  ADVISORS DISCIPLINED TRUST 1329                             ADVISORS DISCIPLINED TRUST 1327                                    0.862806
16200  ADAMANT TECHNOLOGIES                                        NT TECHNOLOGIES, INC.                                              0.814618
77473  ANGELLIST-SORY-FUND, A SERIES OF ANGELLIST-SDA-FUNDS, LLC   ANGELLIST-NABS-FUND, A SERIES OF ANGELLIST-SDA-FUNDS, LLC          0.828394
70624  AN STD ACQUISITION CORP                                     OT ACQUISITION CORP                                                0.855598
16669  ADAMS MARK B                                                ADAMS MARY C                                                       0.812897
48371  AIR SEMICONDUCTOR INC                                       LION SEMICONDUCTOR INC.                                            0.814091
53755  ALLEN DANIEL M.                                             ALLEN DANIEL J                                                     0.829631
16005  ADA EMERGING MARKETS FUND, LP                               ORANDA EMERGING MARKETS FUND LP                                    0.839016
97135  ATHENE ASSET MANAGEMENT LLC                                 CRANE ASSET MANAGEMENT LLC                                         0.807580
The matches look pretty similar! The cosine similarity gives a good indication of the similarity
between the two company names. ATHENE ASSET MANAGEMENT LLC and CRANE ASSET
MANAGEMENT LLC are probably not the same company, and the similarity measure of 0.81
reflects this. When we look at the company names with the highest similarity, we see that these are
pretty long strings that differ by only one character:
matches_df.sort_values(['similarity'], ascending=False).head(10)
       left_side                                                           right_side                                                          similarity
77993  ANGLE LIGHT CAPITAL, LP - ANGLE LIGHT CAPITAL - QUASAR SERIES I     ANGLE LIGHT CAPITAL, LP - ANGLE LIGHT CAPITAL - QUASAR SERIES II    0.994860
77996  ANGLE LIGHT CAPITAL, LP - ANGLE LIGHT CAPITAL - QUASAR SERIES II    ANGLE LIGHT CAPITAL, LP - ANGLE LIGHT CAPITAL - QUASAR SERIES I     0.994860
81120  APOLLO OVERSEAS PARTNERS (DELAWARE 892) VIII, L.P.                  APOLLO OVERSEAS PARTNERS (DELAWARE 892) VII LP                      0.993736
81116  APOLLO OVERSEAS PARTNERS (DELAWARE 892) VII LP                      APOLLO OVERSEAS PARTNERS (DELAWARE 892) VIII, L.P.                  0.993736
66974  AMERICAN SKANDIA LIFE ASSURANCE CORP VARIABLE ACCOUNT E             AMERICAN SKANDIA LIFE ASSURANCE CORP VARIABLE ACCOUNT B             0.993527
66968  AMERICAN SKANDIA LIFE ASSURANCE CORP VARIABLE ACCOUNT B             AMERICAN SKANDIA LIFE ASSURANCE CORP VARIABLE ACCOUNT E             0.993527
80929  APOLLO EUROPEAN PRINCIPAL FINANCE FUND III (EURO B), L.P.           APOLLO EUROPEAN PRINCIPAL FINANCE FUND II (EURO B), L.P.            0.993375
80918  APOLLO EUROPEAN PRINCIPAL FINANCE FUND II (EURO B), L.P.            APOLLO EUROPEAN PRINCIPAL FINANCE FUND III (EURO B), L.P.           0.993375
80921  APOLLO EUROPEAN PRINCIPAL FINANCE FUND III (DOLLAR A), L.P.         APOLLO EUROPEAN PRINCIPAL FINANCE FUND II (DOLLAR A), L.P.          0.993116
80907  APOLLO EUROPEAN PRINCIPAL FINANCE FUND II (DOLLAR A), L.P.          APOLLO EUROPEAN PRINCIPAL FINANCE FUND III (DOLLAR A), L.P.         0.993116

Conclusion
As we saw by visual inspection, the matches created with this method are quite good, as the strings
are very similar. The biggest advantage, however, is the speed. The method described above can be
scaled to much larger datasets by using a distributed computing environment such as Apache Spark.
This could be done by broadcasting one of the TF-IDF matrices to all workers and parallelizing the
second (in our case a copy of the TF-IDF matrix) into multiple sub-matrices. The multiplication can
then be done (using Numpy or the sparse_dot_topn library) by each worker on part of the second
matrix and the entire first matrix. An example of this is described here.
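A rough sketch of that idea with PySpark (purely illustrative; it assumes an existing SparkContext sc,
that sparse_dot_topn is installed on the workers, and it reuses the tf_idf_matrix and
awesome_cossim_top defined above):

import numpy as np

def chunk_rows(matrix, n_chunks):
    # Split the CSR matrix into roughly equal blocks of rows
    bounds = np.linspace(0, matrix.shape[0], n_chunks + 1, dtype=int)
    return [matrix[bounds[i]:bounds[i + 1]] for i in range(n_chunks)]

# Broadcast the full (transposed) matrix once, parallelize the row blocks, and let
# each worker compute the top matches of its block against the whole matrix.
# Note: the row indices in each block's result are local to that block and would
# need to be offset when the results are combined.
b_matrix = sc.broadcast(tf_idf_matrix.transpose())
blocks = sc.parallelize(chunk_rows(tf_idf_matrix, 100))
results = blocks.map(lambda block: awesome_cossim_top(block, b_matrix.value, 10, 0.8)).collect()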
