How to find similar strings with python?
Suppose that we have a string str
together with a list of strings arr = [str1, str2, ... ]
. How to find similar string of str
in arr
? To do this, we can calculate the similarity of str
and elements in arr
one by one. But how to calculate the degree of similarity between str
and an element in arr
? First, we need a measurement of the similarity of two strings. In fact, there are many methods that measure the similarity of strings. In this tutorial, we will use python to find similar strings of str
.
In python, there are several packages that concern similarity of strings.
1. Difflib
The builtin module difflib is for comparing sequences. You can use the SequenceMatcher.ratio
method to calculate the similarity. It returns a float in range [0, 1].
>>> from difflib import SequenceMatcher
>>> a='How to do syntax highlighting for code blocks in Nuxt?'
>>> b='How to setup PrismJS and Autoloader plugin with Nuxt 3?'
>>> SequenceMatcher(None, a, b).ratio()
0.3853211009174312
>>> SequenceMatcher(None, "abcd", "abcd").ratio()
1.0
The more similar two strings are, the closer to 1 the ratio is.
You can also use the get_close_matches
method.
>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> get_close_matches('manitori', ['manifold', 'torus', 'tori', 'differential'])
['tori', 'manifold']
However, this method can't be used for long strings comparsion. To compare long strings, you may need to split the strings.
>>> a='How to do syntax highlighting for code blocks in Nuxt?'
>>> b='How to setup PrismJS and Autoloader plugin with Nuxt 3?'
>>> c='Mathjax loads slow. How to load local JS file with Nuxt?'
>>> d='How to render math in Vue or Nuxt?'
>>> e='Nginx is installed but command not found'
>>> f='Django ImageField max_length error when uploading image'
>>> get_close_matches(a, [b,c,d,e,f])
[]
The difflib module is pure python, not a python c library. Thus, it is not the fastest way.
2. Python-levenshtein
Python-levenshtein package is a Python C extension module. It is used for fast computation of Levenshtein distance. Levenshtein distance is a metric measuring the similarity of two strings. The bigger Levenshtein distance is, the more different two strings are.
To install this package, use pip:
pip install python-levenshtein
The following are the official examples of python-levenshtein:
>>> import Levenshtein
>>> help(Levenshtein.ratio)
ratio(...)
Compute similarity of two strings.
ratio(string1, string2)
The similarity is a number between 0 and 1, it's usually equal or
somewhat higher than difflib.SequenceMatcher.ratio(), becuase it's
based on real minimal edit distance.
Examples:
>>> ratio('Hello world!', 'Holly grail!')
0.58333333333333337
>>> ratio('Brian', 'Jesus')
0.0
>>> help(Levenshtein.distance)
distance(...)
Compute absolute Levenshtein distance of two strings.
distance(string1, string2)
Examples (it's hard to spell Levenshtein correctly):
>>> distance('Levenshtein', 'Lenvinsten')
4
>>> distance('Levenshtein', 'Levensthein')
2
>>> distance('Levenshtein', 'Levenshten')
1
>>> distance('Levenshtein', 'Levenshtein')
0
3. Jellyfish
Jellyfish is a Python library that supports multiple methods for strings comparsion, including Levenshtein distance. Moreover, it supports phonetic encoding. To install jellyfish, use pip:
pip install jellyfish
Using jellyfish to compute Levenshtein distance is much faster than pure Python implementations. Here is an example:
>>> a='How to do syntax highlighting for code blocks in Nuxt?'
>>> b='How to setup PrismJS and Autoloader plugin with Nuxt 3?'
>>> import jellyfish
>>> jellyfish.levenshtein_distance(a,b)
35
Note that the Levenshtein distance is sensitive to case, so remember to use lower()
if you care about the case:
>>> jellyfish.levenshtein_distance('Nuxt','nuxt')
1
>>> jellyfish.levenshtein_distance('nuxt','nuxt')
0
>>> a='How can I define a metric for uniqueness of strings on Django model?'
>>> b='How to specify uniqueness for a tuple of field in a Django model'
>>> jellyfish.levenshtein_distance(a.lower(),b.lower())
36
Finally, it is difficult to see the difference between two strings by these numbers. Therefore, it is good to normalize the Levenshtein distance between 0 to 1:
def normalizedLevenshteinDistance(double levenshtein, String s1, String s2) {
return levenshtein / max(s1.length(), s2.length());
}
Here are some examples:
>>> a='How can I define a metric for uniqueness of strings on Django model?'
>>> b='How to specify uniqueness for a tuple of field in a Django model'
>>> jellyfish.levenshtein_distance(a.lower(),b.lower())
36
>>> 36/max(len(a),len(b))
0.5294117647058824
>>> b='Django Unique Together (with foreign keys)'
>>> jellyfish.levenshtein_distance(a.lower(),b.lower())
52
>>> 52/max(len(a),len(b))
0.7647058823529411
>>> b='Django unique_together with a specifc filter (like: somefield=somevalue)'
>>> jellyfish.levenshtein_distance(a.lower(),b.lower())
59
>>> 59/max(len(a),len(b))
0.8194444444444444
We can see that the closer to 1 the result is, the more different two strings are.
There is no comment, let's add the first one.