·

How to find similar strings with python?

发布时间:2024-06-28 00:09:18阅读量:385
专业文章
转载请注明来源

Suppose that we have a string str together with a list of strings arr = [str1, str2, ... ]. How to find similar string of str in arr? To do this, we can calculate the similarity of str and elements in arr one by one. But how to calculate the degree of similarity between str and an element in arr? First, we need a measurement of the similarity of two strings. In fact, there are many methods that measure the similarity of strings. In this tutorial, we will use python to find similar strings of str.

In python, there are several packages that concern similarity of strings.

1. Difflib

The builtin module difflib is for comparing sequences. You can use the SequenceMatcher.ratio method to calculate the similarity. It returns a float in range [0, 1].

>>> from difflib import SequenceMatcher
>>> a='How to do syntax highlighting for code blocks in Nuxt?'
>>> b='How to setup PrismJS and Autoloader plugin with Nuxt 3?'
>>> SequenceMatcher(None, a, b).ratio()
0.3853211009174312
>>> SequenceMatcher(None, "abcd", "abcd").ratio()
1.0

The more similar two strings are, the closer to 1 the ratio is.

You can also use the get_close_matches method.

>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']

>>> get_close_matches('manitori', ['manifold', 'torus', 'tori', 'differential'])
['tori', 'manifold']

However, this method can't be used for long strings comparsion. To compare long strings, you may need to split the strings.

>>> a='How to do syntax highlighting for code blocks in Nuxt?'
>>> b='How to setup PrismJS and Autoloader plugin with Nuxt 3?'
>>> c='Mathjax loads slow. How to load local JS file with Nuxt?'
>>> d='How to render math in Vue or Nuxt?'
>>> e='Nginx is installed but command not found'
>>> f='Django ImageField max_length error when uploading image'
>>> get_close_matches(a, [b,c,d,e,f])
[]

The difflib module is pure python, not a python c library. Thus, it is not the fastest way.

2. Python-levenshtein

Python-levenshtein package is a Python C extension module. It is used for fast computation of Levenshtein distance. Levenshtein distance is a metric measuring the similarity of two strings. The bigger Levenshtein distance is, the more different two strings are.

To install this package, use pip:

pip install python-levenshtein

The following are the official examples of python-levenshtein:

>>> import Levenshtein
>>> help(Levenshtein.ratio)
ratio(...)
    Compute similarity of two strings.

    ratio(string1, string2)

    The similarity is a number between 0 and 1, it's usually equal or
    somewhat higher than difflib.SequenceMatcher.ratio(), becuase it's
    based on real minimal edit distance.

    Examples:
    >>> ratio('Hello world!', 'Holly grail!')
    0.58333333333333337
    >>> ratio('Brian', 'Jesus')
    0.0

>>> help(Levenshtein.distance)
distance(...)
    Compute absolute Levenshtein distance of two strings.

    distance(string1, string2)

    Examples (it's hard to spell Levenshtein correctly):
    >>> distance('Levenshtein', 'Lenvinsten')
    4
    >>> distance('Levenshtein', 'Levensthein')
    2
    >>> distance('Levenshtein', 'Levenshten')
    1
    >>> distance('Levenshtein', 'Levenshtein')
    0

3. Jellyfish

Jellyfish is a Python library that supports multiple methods for strings comparsion, including Levenshtein distance. Moreover, it supports phonetic encoding. To install jellyfish, use pip:

pip install jellyfish

Using jellyfish to compute Levenshtein distance is much faster than pure Python implementations. Here is an example:

>>> a='How to do syntax highlighting for code blocks in Nuxt?'
>>> b='How to setup PrismJS and Autoloader plugin with Nuxt 3?'
>>> import jellyfish
>>> jellyfish.levenshtein_distance(a,b)
35

Note that the Levenshtein distance is sensitive to case, so remember to use lower() if you care about the case:

>>> jellyfish.levenshtein_distance('Nuxt','nuxt')
1
>>> jellyfish.levenshtein_distance('nuxt','nuxt')
0

>>> a='How can I define a metric for uniqueness of strings on Django model?'
>>> b='How to specify uniqueness for a tuple of field in a Django model'
>>> jellyfish.levenshtein_distance(a.lower(),b.lower())
36

Finally, it is difficult to see the difference between two strings by these numbers. Therefore, it is good to normalize the Levenshtein distance between 0 to 1:

def normalizedLevenshteinDistance(double levenshtein, String s1, String s2) {
    return levenshtein / max(s1.length(), s2.length());
}

Here are some examples:

>>> a='How can I define a metric for uniqueness of strings on Django model?'
>>> b='How to specify uniqueness for a tuple of field in a Django model'
>>> jellyfish.levenshtein_distance(a.lower(),b.lower())
36
>>> 36/max(len(a),len(b))
0.5294117647058824

>>> b='Django Unique Together (with foreign keys)'
>>> jellyfish.levenshtein_distance(a.lower(),b.lower())
52
>>> 52/max(len(a),len(b))
0.7647058823529411

>>> b='Django unique_together with a specifc filter (like: somefield=somevalue)'
>>> jellyfish.levenshtein_distance(a.lower(),b.lower())
59
>>> 59/max(len(a),len(b))
0.8194444444444444

We can see that the closer to 1 the result is, the more different two strings are.

0 人喜欢

评论区

暂无评论,来发布第一条评论吧!

弦圈热门内容

我翻译并整理了一些MathStackExchange的问题和回答

对于数学老手而言,阅读全英文数学甚至是全法语数学,都是可以做到的。但是对于数学萌新而言,阅读全英文的数学内容,可能会比较吃力,也需要花费更多的时间来进行阅读和理解。然而对于做数学的人而言,不懂英文就意味着会有大量优质的英文数学资源无法享用。国外比较有名的数学论坛包括MathStackExchange 与 MathOverflow,都拥有大量优秀的问题以及十分优质的回答,这往往能帮助你解决学习过程中遇到的难题。所以,我觉得可以翻译一些MathStackexchange与MathOverflow的优质内容,让更多的国内的数学爱好者能够接触到优秀的英文数学资源。目前我已翻译,并重新整理以下内容,中英对照(切换语言可见):如何构建一个比复数域$\mathbb C$还要大的域?ℝ的有限域扩张是ℝ或者同构于ℂ幂零理想层的局部截面是什么样的?我在哪可以找到一个数学笔友?范畴中的态射一定得保持结构吗?我在教材中找到了一些不一样的阿基米德性质的乘法形式如果我看数学看得很慢,这没问题吗?仿射概形上的概形什么时候仿射?如果两个对象的余极限同构,那么这两个对象同构?正弦函数的幂级数展开是否是柯西序列?任意一个 ...

10.27 弦圈问题分析以及改进计划

最近有不少对弦圈感兴趣的爱好者,在弦圈注册了账号,也有人参与了互动。对此,我在这感谢各位的支持和认可!😃不过经过这段时间,用户注册后的表现,也透露出目前弦圈存在的很多问题。首当其冲的就是首页,默认显示最新内容,用时间顺序排序,意味着大家在首页往往无法看到有趣的内容,也可能找不到他想看的内容。这也导致弦圈中优秀的内容被埋没。因此,针对这个问题,我自己设计了一个简单的热度算法来计算“热度”,然后通过“热度”来排序首页的热门内容。旧的热门内容就是单纯的通过阅读量排序,没有热度随着时间衰减的现象,这也意味着新内容往往容易被旧内容排挤掉。有了更好的热度算法,我就可以将打开首页默认显示最新内容,改为默认显示热门内容了😇。接着就是中英文混合的问题,这个首页已经解决了,首页看到的内容都会把其他语言的给过滤掉。但是圈子内的话,我没有强行设置只有一种语言,因为不太想一些优秀的英文内容被埋没。我有点想参考推特的做法:热门内容推荐的大多数都是一种语言(如中文),只有一两个是其他语言(如英文)。或者说还有一种方案:热门内容全是同一种语言,再增加一个选项”全部“,即查看圈子全部内容。至于数学圈首页,那些数学分支的 ...

为什么无限求和需要被有意义的?

我的提问:例如单位分解(partition of unity)中的求和以及抽象代数中的多项式表达式。回答:拥有无限多项的求和(或者说更加正式的“级数”)需要一些额外的条件来保证他们“表现良好”("well behaved")。否则你可能得到像以下这样的悖论:$$\begin{align} &S = 1 + 1 + 1 + \dots \\ &\Rightarrow 2S = 2 + 2 + 2 + \dots \\ &\Rightarrow 2S = (1+1) + (1+1) + (1+1) + \dots \\ &\Rightarrow 2S = 1 + 1 + 1 + \dots \\ &\Rightarrow 2S=S \\ &\Rightarrow S = 0 \end{align}$$一般地,额外的条件包含,要求除了有限数量的项都为$0$(数学简称中的“几乎所有”)或者收敛条件来确保求和有一个极限值。本问题问于2020年1月22号,当时我在读高三,提问的水平非常差😅,跟Peter Scholze这种高中就懂谱序列的没得比🙃。

抽象代数中如何执行归纳法?

我的提问:我无法理解在这个证明中,归纳法这个步骤是如何进行的。有人能帮帮我吗?感谢!回答:令$n = deg B$。他们通过对$m = deg A$做归纳法来证明那个陈述。基本情况是$m < n$。如果$m \geq n$,然后他们找到另一个多项式$A'$,在这种情况下,$A' = A - B a_m X^{m - n}$,并且它有比$m$更小的阶数。所以我们可以通过归纳假设来处理它。$A′$的商和余数表达式是用于找到$A$的。我想有两件事你可能会觉得困扰,以及为什么你没有认出归纳法。首先,基本情况不仅仅是一种情况,而是一堆情况。这里请注意,这是基本的:证明中的归纳步骤仅适用于$m\geq n$。同时注意,在这种情况下,证明$m=1$的工作量并不比证明$m<n$小:对于所有这些情况,这都是一行证明。你可能会觉得困扰的第二件事是,我们不仅对$m-1$使用归纳假设,对任何阶数严格小于$m$的多项式也使用归纳假设。这被称为完全归纳法或强归纳法:在归纳步骤中,你假设的是,命题不多于$m-1$时都是真的,而不仅仅是$m-1$。这在维基百科的“归纳法”页面上得到了很好的解释。

如何理解$\mathbb{Q}_{p}(p^{1/p^{\infty}})$?

我的提问:众所周知$\mathbb{Q}_{p}(p^{1/p^{\infty}})$被定义为$\bigcup_{n>0} \mathbb{Q}_{p}(p^{1/p^{n}})$,意思是邻接所有$p$的$p$幂根($p$-power roots of $p$)到混合特征域$\mathbb{Q}_{p}$。然而,我不太懂这个符号的意思$\mathbb{Q}_{p}(p^{1/p^{n}})$。这是如何联系到$p$的$p$幂根的?为何在这个记号中,$p$的幂是$1/p^{n}$?我认为$\mathbb{Q}_{p}(p^{1/p^{n}})$是$\mathbb Q_p$的一个割圆扩张,其中$p^{1/p^{n}}$是$n$次单位本原根(primitive $n$th root of unity)。但是似乎这说不通。并且我在另一个回答中看到$\mathbb{Q}_{p}(p^{1/p^{n}})$是一个分歧扩张(ramified extension)。谁能告诉我在哪里可以了解$\mathbb{Q}_{p}(p^{1/p^{n}})$?回答1:根据定义,$\Bbb Q_p(p^{1/p ...