
Good Python modules for fuzzy string comparison?

I'm looking for a Python module that can do simple fuzzy string comparisons. Specifically, I'd like a percentage of how similar the strings are. I know this is potentially subjective, so I was hoping to find a library that can do positional comparisons as well as longest-similar-string matches, among other things. Basically, I'm hoping to find something simple enough to yield a single percentage while still configurable enough that I can specify which type of comparison(s) to do.
python string string-comparison fuzzy-comparison

asked Mar 25 '09 at 16:25 Soviut 15.3k 7 42 101 92% accept rate

possible duplicate of Text difference algorithm – tzot Sep 20 '10 at 14:08


11 Answers
Levenshtein Python extension and C library: http://code.google.com/p/pylevenshtein/

The Levenshtein Python C extension module contains functions for fast computation of:

- Levenshtein (edit) distance, and edit operations
- string similarity
- approximate median strings, and generally string averaging
- string sequence and set similarity

It supports both normal and Unicode strings.


>>>"import"Levenshtein >>>"help(Levenshtein.ratio) ratio(...) """"Compute"similarity"of"two"strings. """"ratio(string1,"string2) """"The"similarity"is"a"number"between"0"and"1,"it's"usually"equal"or """"somewhat"higher"than"difflib.SequenceMatcher.ratio(),"becuase"it's """"based"on"real"minimal"edit"distance. """"Examples: """">>>"ratio('Hello"world!',"'Holly"grail!') """"0.58333333333333337 """">>>"ratio('Brian',"'Jesus') """"0.0 >>>"help(Levenshtein.distance) distance(...) 2 Just wanted to note, for future readers of this thread that happen to be using NLTK in their project, that """"Compute"absolute"Levenshtein"distance"of"two"strings.
nltk.metrics.edit_distance('string1',"'string2') will calculate the Levenshtein distance between string1 and string2 . So if you're using NLTK like me you might not need to download a """"distance(string1,"string2) Levenshtein library besides this. Cheers seafangs Nov 20 '11 at 22:24

""""Examples"(it's"hard"to"spell"Levenshtein"correctly): """">>>"distance('Levenshtein',"'Lenvinsten') Was this post useful to you? Yes No """"4 """">>>"distance('Levenshtein',"'Levensthein') """"2 """">>>"distance('Levenshtein',"'Levenshten') """"1 """">>>"distance('Levenshtein',"'Levenshtein') """"0 answered Mar 26 '09 at 7:18 difflib can do it.
Pete Skomoroch 651 5 3
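As an illustrative sketch of the NLTK route from the comment above (assuming NLTK is installed); the expected result matches the docstring example:

>>> import nltk
>>> nltk.metrics.edit_distance('Levenshtein', 'Lenvinsten')
4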

difflib can do it.

Example from the docs:

>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']

Check it out. It has other functions that can help you build something custom.
answered Mar 25 '09 at 16:34 nosklo 39.4k 6 70 115


+1 Neat, I don't recall ever seeing this before – Van Gale Mar 25 '09 at 17:07

+1: Quote the documents. – S.Lott Mar 25 '09 at 17:13

+1: Great to be introduced to a module I've not used before. – Jarret Hardie Mar 25 '09 at 17:51

I've actually used difflib before, but found that I couldn't just ask for a percentage match amount. It's been a while, though. – Soviut Mar 25 '09 at 19:33


@Soviut: e.g. difflib.SequenceMatcher(None, 'foo', 'bar').ratio() returns a value between 0 and 1, which can be interpreted as a match percentage. Right? – utku_karatas Apr 28 '10 at 10:38
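On the percentage-threshold point: difflib.get_close_matches() also accepts n and cutoff parameters, so a minimum similarity can be requested directly. A minimal sketch:

>>> import difflib
>>> difflib.get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'],
...                           n=1, cutoff=0.8)
['apple']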


I like nosklo's answer; another method is the Damerau-Levenshtein distance:

"In information theory and computer science, Damerau-Levenshtein distance is a 'distance' (string metric) between two strings, i.e., finite sequence of symbols, given by counting the minimum number of operations needed to transform one string into the other, where an operation is defined as an insertion, deletion, or substitution of a single character, or a transposition of two characters."

An implementation in Python from Wikibooks (note that this particular function computes the plain Levenshtein distance: it handles insertions, deletions, and substitutions, but not transpositions):

def lev(a, b):
    # Simple recursive formulation; exponential time, for illustration only.
    if not a: return len(b)
    if not b: return len(a)
    return min(lev(a[1:], b[1:]) + (a[0] != b[0]),
               lev(a[1:], b) + 1,
               lev(a, b[1:]) + 1)

More from Wikibooks, this gives you the length of the longest common substring (LCS):

def LCSubstr_len(S, T):
    m = len(S); n = len(T)
    L = [[0] * (n + 1) for i in xrange(m + 1)]
    lcs = 0
    for i in xrange(m):
        for j in xrange(n):
            if S[i] == T[j]:
                L[i + 1][j + 1] = L[i][j] + 1
                lcs = max(lcs, L[i + 1][j + 1])
    return lcs
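As an illustrative addition (not part of the Wikibooks code), an edit distance can be turned into the percentage the question asks for by normalising against the longer string:

def similarity(a, b):
    # 1.0 for identical strings, 0.0 when every character differs.
    if not a and not b:
        return 1.0
    return 1.0 - float(lev(a, b)) / max(len(a), len(b))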
edited Mar 25 '09 at 17:03 answered Mar 25 '09 at 16:46 Adam Bernier 15.9k 3 21 35

Thanks, I found some information about Levenshtein while doing my initial searching, but the examples were far too vague. Your answer is excellent. – Soviut Mar 25 '09 at 19:34

I chose this one because it gives me a nice scalar number I can work with and use for thresholds. – Soviut Mar 25 '09 at 19:37


As nosklo said, use the difflib module from the Python standard library. The difflib module can return a measure of the sequences' similarity using the ratio() method of a SequenceMatcher() object. The similarity is returned as a float in the range 0.0 to 1.0.


>>>"import"difflib >>>"difflib.SequenceMatcher(None,"'abcde',"'abcde').ratio() 1.0 >>>"difflib.SequenceMatcher(None,"'abcde',"'zbcde').ratio() 0.80000000000000004 >>>"difflib.SequenceMatcher(None,"'abcde',"'zyzzy').ratio() 0.0


answered Mar 10 '10 at 17:03 Edi H 163 1 7

Not terribly impressed by SequenceMatcher. It gives the same score to David/Daved that it gives to David/david. – Leeks and Leaks May 28 '10 at 18:00

You'll get the same problem with Levenshtein distance. If you don't care about case, you should just call lower() on each argument before comparing them. – Barthelemy Apr 28 '11 at 18:00
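A minimal sketch of that case-folding fix with difflib:

>>> import difflib
>>> difflib.SequenceMatcher(None, 'David', 'david').ratio()
0.8
>>> difflib.SequenceMatcher(None, 'David'.lower(), 'david'.lower()).ratio()
1.0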


There is also Google's own google-diff-match-patch ("Currently available in Java, JavaScript, C++ and Python"). (I can't comment on it, since I have only used Python's difflib myself.)
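A minimal sketch of how the Python port might be used, assuming the diff_match_patch module and its diff_main / diff_levenshtein calls:

from diff_match_patch import diff_match_patch

dmp = diff_match_patch()
diffs = dmp.diff_main('Hello world!', 'Holly grail!')
# diff_levenshtein derives an edit distance from the computed diff.
print dmp.diff_levenshtein(diffs)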
answered Mar 25 '09 at 17:47 Steven 4,700 9 12

While not specific to Python, here is a question about similar string algorithms: http://stackoverflow.com/questions/451884/similar-string-algorithm/451910#451910
answered Mar 25 '09 at 16:36 Dana 6,549 2 31 53

Jellyfish is a Python module which supports many string comparison metrics, including phonetic matching. It's really fast! Pure-Python implementations of Levenshtein edit distance are quite slow compared to jellyfish's implementation.
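A minimal sketch, with function names as in the jellyfish releases current at the time of writing (later versions may rename some of these):

>>> import jellyfish
>>> jellyfish.levenshtein_distance(u'jellyfish', u'smellyfish')
2
>>> jellyfish.jaro_distance(u'jellyfish', u'smellyfish')
0.89629629629629637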
answered Dec 3 '11 at 19:20 timv 21 1

This looks like a great library, as it has several string comparison algorithms and not just one: Levenshtein distance, Damerau-Levenshtein distance, Jaro distance, Jaro-Winkler distance, Match Rating Approach comparison, Hamming distance. – Vladtn Jan 27 at 11:35



Here's a Python script for computing the longest common substring of two words; it may need tweaking to work for multi-word phrases:

def lcs(word1, word2):
    # Sets of all substrings of each word (including the empty string,
    # so common_subs is never empty).
    w1 = set(word1[i:j] for i in range(0, len(word1))
             for j in range(1, len(word1) + 1))
    w2 = set(word2[i:j] for i in range(0, len(word2))
             for j in range(1, len(word2) + 1))
    common_subs = w1.intersection(w2)
    sorted_cmn_subs = sorted((len(s), s) for s in common_subs)
    # The last element of the sorted list is the longest common substring.
    return sorted_cmn_subs.pop()[1]
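For example:

>>> lcs('fuzzy', 'wuzzy')
'uzzy'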
answered Apr 20 '09 at 16:32 twneale


Another alternative would be to use the recently released FuzzyWuzzy package. The various functions supported by the package are also described in this blog post.
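A minimal sketch, using the examples from the package's README:

>>> from fuzzywuzzy import fuzz
>>> fuzz.ratio('this is a test', 'this is a test!')
97
>>> fuzz.partial_ratio('this is a test', 'this is a test!')
100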
answered Aug 29 '11 at 18:41 live2bshiv 42 4

Here's how it can be done using Charikar's simhash; this is also suitable for long documents, and it will detect 100% similarity even when you change the order of words in the documents: http://blog.simpliplant.eu/calculating-similarity-between-text-strings-in-python/
answered Nov 19 '11 at 11:21 Simpliplant


I am using double metaphone, which works like a charm. An example:

>>> dm(u'aubrey')
('APR', '')
>>> dm(u'richard')
('RXRT', 'RKRT')
>>> dm(u'katherine') == dm(u'catherine')
True

Update: Jellyfish also has it, under phonetic encoding.


answered Dec 16 '11 at 6:30 zengr 7,642 4 18 52
