How to find similar strings with python?

发布时间：2024-06-28 00:09:18阅读量：273

专业文章

转载请注明来源

Suppose that we have a string str together with a list of strings arr = [str1, str2, ... ]. How to find similar string of str in arr? To do this, we can calculate the similarity of str and elements in arr one by one. But how to calculate the degree of similarity between str and an element in arr? First, we need a measurement of the similarity of two strings. In fact, there are many methods that measure the similarity of strings. In this tutorial, we will use python to find similar strings of str.

In python, there are several packages that concern similarity of strings.

1. Difflib

The builtin module difflib is for comparing sequences. You can use the SequenceMatcher.ratio method to calculate the similarity. It returns a float in range [0, 1].

>>> from difflib import SequenceMatcher
>>> a='How to do syntax highlighting for code blocks in Nuxt?'
>>> b='How to setup PrismJS and Autoloader plugin with Nuxt 3?'
>>> SequenceMatcher(None, a, b).ratio()
0.3853211009174312
>>> SequenceMatcher(None, "abcd", "abcd").ratio()
1.0

The more similar two strings are, the closer to 1 the ratio is.

You can also use the get_close_matches method.

>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']

>>> get_close_matches('manitori', ['manifold', 'torus', 'tori', 'differential'])
['tori', 'manifold']

However, this method can't be used for long strings comparsion. To compare long strings, you may need to split the strings.

>>> a='How to do syntax highlighting for code blocks in Nuxt?'
>>> b='How to setup PrismJS and Autoloader plugin with Nuxt 3?'
>>> c='Mathjax loads slow. How to load local JS file with Nuxt?'
>>> d='How to render math in Vue or Nuxt?'
>>> e='Nginx is installed but command not found'
>>> f='Django ImageField max_length error when uploading image'
>>> get_close_matches(a, [b,c,d,e,f])
[]

The difflib module is pure python, not a python c library. Thus, it is not the fastest way.

2. Python-levenshtein

Python-levenshtein package is a Python C extension module. It is used for fast computation of Levenshtein distance. Levenshtein distance is a metric measuring the similarity of two strings. The bigger Levenshtein distance is, the more different two strings are.

To install this package, use pip:

pip install python-levenshtein

The following are the official examples of python-levenshtein:

>>> import Levenshtein
>>> help(Levenshtein.ratio)
ratio(...)
    Compute similarity of two strings.

    ratio(string1, string2)

    The similarity is a number between 0 and 1, it's usually equal or
    somewhat higher than difflib.SequenceMatcher.ratio(), becuase it's
    based on real minimal edit distance.

    Examples:
    >>> ratio('Hello world!', 'Holly grail!')
    0.58333333333333337
    >>> ratio('Brian', 'Jesus')
    0.0

>>> help(Levenshtein.distance)
distance(...)
    Compute absolute Levenshtein distance of two strings.

    distance(string1, string2)

    Examples (it's hard to spell Levenshtein correctly):
    >>> distance('Levenshtein', 'Lenvinsten')
    4
    >>> distance('Levenshtein', 'Levensthein')
    2
    >>> distance('Levenshtein', 'Levenshten')
    1
    >>> distance('Levenshtein', 'Levenshtein')
    0

3. Jellyfish

Jellyfish is a Python library that supports multiple methods for strings comparsion, including Levenshtein distance. Moreover, it supports phonetic encoding. To install jellyfish, use pip:

pip install jellyfish

Using jellyfish to compute Levenshtein distance is much faster than pure Python implementations. Here is an example:

>>> a='How to do syntax highlighting for code blocks in Nuxt?'
>>> b='How to setup PrismJS and Autoloader plugin with Nuxt 3?'
>>> import jellyfish
>>> jellyfish.levenshtein_distance(a,b)
35

Note that the Levenshtein distance is sensitive to case, so remember to use lower() if you care about the case:

>>> jellyfish.levenshtein_distance('Nuxt','nuxt')
1
>>> jellyfish.levenshtein_distance('nuxt','nuxt')
0

>>> a='How can I define a metric for uniqueness of strings on Django model?'
>>> b='How to specify uniqueness for a tuple of field in a Django model'
>>> jellyfish.levenshtein_distance(a.lower(),b.lower())
36

Finally, it is difficult to see the difference between two strings by these numbers. Therefore, it is good to normalize the Levenshtein distance between 0 to 1:

def normalizedLevenshteinDistance(double levenshtein, String s1, String s2) {
    return levenshtein / max(s1.length(), s2.length());
}

Here are some examples:

>>> a='How can I define a metric for uniqueness of strings on Django model?'
>>> b='How to specify uniqueness for a tuple of field in a Django model'
>>> jellyfish.levenshtein_distance(a.lower(),b.lower())
36
>>> 36/max(len(a),len(b))
0.5294117647058824

>>> b='Django Unique Together (with foreign keys)'
>>> jellyfish.levenshtein_distance(a.lower(),b.lower())
52
>>> 52/max(len(a),len(b))
0.7647058823529411

>>> b='Django unique_together with a specifc filter (like: somefield=somevalue)'
>>> jellyfish.levenshtein_distance(a.lower(),b.lower())
59
>>> 59/max(len(a),len(b))
0.8194444444444444

We can see that the closer to 1 the result is, the more different two strings are.

0 0

评论区

暂无评论，来发布第一条评论吧！

Python

弦圈

How to find similar strings with python?

1. Difflib

2. Python-levenshtein

3. Jellyfish

弦圈热门内容

弦圈如何发布新帖子？

宛如来自空无的召唤——数学大师格罗腾迪克的生平（上）

如何构建一个比复数域$\mathbb{C}$还要大的域？

弦圈如何在百科中创建新词条？

如果两个对象的余极限同构，那么这两个对象同构？

趣味数学题05 | 三角形求角度数问题

骨头汤和豆腐哪个更补钙？

Latex如何写二重积分、三重积分、四重积分、n重积分、闭合积分

Latex如何输入花体字母？

10.25弦圈内容更新计划

LaTeX如何打出长竖线？

基变换映射$U\times_{X}X\rightarrow U\times_{Y}Y$