·

How to find similar strings with python?

发布时间:2024-06-28 00:09:18阅读量:273
专业文章
转载请注明来源

Suppose that we have a string str together with a list of strings arr = [str1, str2, ... ]. How to find similar string of str in arr? To do this, we can calculate the similarity of str and elements in arr one by one. But how to calculate the degree of similarity between str and an element in arr? First, we need a measurement of the similarity of two strings. In fact, there are many methods that measure the similarity of strings. In this tutorial, we will use python to find similar strings of str.

In python, there are several packages that concern similarity of strings.

1. Difflib

The builtin module difflib is for comparing sequences. You can use the SequenceMatcher.ratio method to calculate the similarity. It returns a float in range [0, 1].

>>> from difflib import SequenceMatcher
>>> a='How to do syntax highlighting for code blocks in Nuxt?'
>>> b='How to setup PrismJS and Autoloader plugin with Nuxt 3?'
>>> SequenceMatcher(None, a, b).ratio()
0.3853211009174312
>>> SequenceMatcher(None, "abcd", "abcd").ratio()
1.0

The more similar two strings are, the closer to 1 the ratio is.

You can also use the get_close_matches method.

>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']

>>> get_close_matches('manitori', ['manifold', 'torus', 'tori', 'differential'])
['tori', 'manifold']

However, this method can't be used for long strings comparsion. To compare long strings, you may need to split the strings.

>>> a='How to do syntax highlighting for code blocks in Nuxt?'
>>> b='How to setup PrismJS and Autoloader plugin with Nuxt 3?'
>>> c='Mathjax loads slow. How to load local JS file with Nuxt?'
>>> d='How to render math in Vue or Nuxt?'
>>> e='Nginx is installed but command not found'
>>> f='Django ImageField max_length error when uploading image'
>>> get_close_matches(a, [b,c,d,e,f])
[]

The difflib module is pure python, not a python c library. Thus, it is not the fastest way.

2. Python-levenshtein

Python-levenshtein package is a Python C extension module. It is used for fast computation of Levenshtein distance. Levenshtein distance is a metric measuring the similarity of two strings. The bigger Levenshtein distance is, the more different two strings are.

To install this package, use pip:

pip install python-levenshtein

The following are the official examples of python-levenshtein:

>>> import Levenshtein
>>> help(Levenshtein.ratio)
ratio(...)
    Compute similarity of two strings.

    ratio(string1, string2)

    The similarity is a number between 0 and 1, it's usually equal or
    somewhat higher than difflib.SequenceMatcher.ratio(), becuase it's
    based on real minimal edit distance.

    Examples:
    >>> ratio('Hello world!', 'Holly grail!')
    0.58333333333333337
    >>> ratio('Brian', 'Jesus')
    0.0

>>> help(Levenshtein.distance)
distance(...)
    Compute absolute Levenshtein distance of two strings.

    distance(string1, string2)

    Examples (it's hard to spell Levenshtein correctly):
    >>> distance('Levenshtein', 'Lenvinsten')
    4
    >>> distance('Levenshtein', 'Levensthein')
    2
    >>> distance('Levenshtein', 'Levenshten')
    1
    >>> distance('Levenshtein', 'Levenshtein')
    0

3. Jellyfish

Jellyfish is a Python library that supports multiple methods for strings comparsion, including Levenshtein distance. Moreover, it supports phonetic encoding. To install jellyfish, use pip:

pip install jellyfish

Using jellyfish to compute Levenshtein distance is much faster than pure Python implementations. Here is an example:

>>> a='How to do syntax highlighting for code blocks in Nuxt?'
>>> b='How to setup PrismJS and Autoloader plugin with Nuxt 3?'
>>> import jellyfish
>>> jellyfish.levenshtein_distance(a,b)
35

Note that the Levenshtein distance is sensitive to case, so remember to use lower() if you care about the case:

>>> jellyfish.levenshtein_distance('Nuxt','nuxt')
1
>>> jellyfish.levenshtein_distance('nuxt','nuxt')
0

>>> a='How can I define a metric for uniqueness of strings on Django model?'
>>> b='How to specify uniqueness for a tuple of field in a Django model'
>>> jellyfish.levenshtein_distance(a.lower(),b.lower())
36

Finally, it is difficult to see the difference between two strings by these numbers. Therefore, it is good to normalize the Levenshtein distance between 0 to 1:

def normalizedLevenshteinDistance(double levenshtein, String s1, String s2) {
    return levenshtein / max(s1.length(), s2.length());
}

Here are some examples:

>>> a='How can I define a metric for uniqueness of strings on Django model?'
>>> b='How to specify uniqueness for a tuple of field in a Django model'
>>> jellyfish.levenshtein_distance(a.lower(),b.lower())
36
>>> 36/max(len(a),len(b))
0.5294117647058824

>>> b='Django Unique Together (with foreign keys)'
>>> jellyfish.levenshtein_distance(a.lower(),b.lower())
52
>>> 52/max(len(a),len(b))
0.7647058823529411

>>> b='Django unique_together with a specifc filter (like: somefield=somevalue)'
>>> jellyfish.levenshtein_distance(a.lower(),b.lower())
59
>>> 59/max(len(a),len(b))
0.8194444444444444

We can see that the closer to 1 the result is, the more different two strings are.

评论区

暂无评论,来发布第一条评论吧!

弦圈热门内容

宛如来自空无的召唤——数学大师格罗腾迪克的生平(上)

作者简介:艾林‧杰克逊(Allyn Jackson)曾任美国数学学会会讯(Notices of the AMS)的副主编与总主笔,加州大学柏克莱分校数学硕士。她觉得能结合数学和写作两个非常不同的领域,面对各种数学课题和数学人物,收获很大。译者简介:翁秉仁为台湾大学数学系副教授。本文原文发表在 2004 年的 Notices of the AMS 51卷第 9 期,以下译文刊登在《数理人文》创刊号(2013 年 12 月)。媒体或机构如需转载,请联系《数理人文》杂志(微信号:math_hmat)。重点摘要格罗腾迪克是二十世纪的数学大师,为代数几何开启全新的面貌,数学影响仍方兴未艾。格罗腾迪克早年多舛,与父母颠沛流离。他的数学背景贫乏,一切出于自学,但天资奇高,在苦学深思与师友攻错下,终于成为一代宗师。格罗腾迪克以韦伊猜想为目标,从范畴论观点所铸造的新工具,连结了离散的数论世界与连续的拓扑世界,启迪了多位菲尔兹奖得主的工作。如果不把科学看成权力和宰制的工具,而是我们物种在时间长河进行的知识探险。每门科学好比和声一样,依时更迭,或广袤,或丰盈。就像顺着世世代代于焉展露的乐曲,所有主题的精致对 ...

如何构建一个比复数域$\mathbb{C}$还要大的域?

本文我们探讨这个问题:是否存在一种扩张复数域$\mathbb{C}$的方法,使得$\mathbb{C} \subset\mathbb{C}[a]$?或者$\mathbb{C}$是所有域扩张的终点?下面围绕这个问题,我们将提供两种扩张复数域$\mathbb{C}$的方法。方法1:$\mathbb{C}$的笛卡儿积$$P = {\Bbb C}\times{\Bbb C}\times\cdots$$并不是一个域,因为它有零因子:$$(0,1,0,1,\cdots)(1,0,1,0\cdots)=(0,0,0,0,\cdots)。$$但是将零因子商掉,就能得到一个域。令$\mathcal U$为$\Bbb N$上的一个nonprincipal ultrafilter。我们定义$$(a_1,a_2,\cdots)\sim(b_1,b_2,\cdots)$$当$$\{n\in\Bbb N\,\vert\, a_n=b_n\}\in\mathcal U。$$然后商$F = P/\sim$就是一个严格比$\mathbb{C}$大的域,我们称这个域为超积(英语:ultraproduct)。并且嵌入映射$ ...

如果两个对象的余极限同构,那么这两个对象同构?

令$A,B$为特征$p$的交换环。令$\phi_{A}:A\rightarrow A,\phi_{B}:B\rightarrow B$为Frobenius态射,即$p$次方映射。如果我们有 ${\rm{colim}}_{n\in\mathbb{N}}A\cong {\rm{colim}}_{n\in\mathbb{N}}B$,其中transition映射为Frobenius态射,那么我们可以得出$A\cong B$吗?答案:不能。回顾一下,一个$\mathbb{F}_p$-代数$R$是完美的,如果它的Frobenius映射$\varphi : R \ni r \mapsto r^p \in R$是一个同构。Frobenius态射的次方的余极限${\rm{colim}}_{n\in\mathbb{N}}R$是$\mathbb{F}_p$-代数$R$的完美化,并且它这样命名是因为它是完美$\mathbb{F}_p$-代数到$\mathbb{F}_p$-代数的包含映射的左伴随。这使得完美$\mathbb{F}_p$-代数构成了一个$\mathbb{F}_p$-代数的反射子范畴,这意味着在完美 ...

基变换映射$U\times_{X}X\rightarrow U\times_{Y}Y$

我的提问:令$X,Y$是概形。令$X\rightarrow Y,X\rightarrow X, Y\rightarrow Y$为概形态射。为什么态射$U\times_{X}X\rightarrow U\times_{Y}Y$是$X\rightarrow X\times_{Y}Y$通过$U\times_{Y}Y\rightarrow Y$的基变换。这是我尝试的图,其中三角形是交换的。但是我发现$(U\times_{Y}Y)\times_{Y}(X\times_{Y}Y)=U\times_{Y}X\times_{Y}Y=U\times_{Y}X$,即我无法得到想要的$U\times_{X}X$。我这是犯了什么错误?这是问题的上下文,来自朱歆文的论文Affine Grassmannians and the geometric Satake in mixed characteristic (arXiv link):引理 A.2. 对任何代数空间的平展态射$X\to Y$,由$\sigma_X$导出的相对Frobenius态射$X\to X\times_{Y,\sigma_Y}Y$是一个同构。证 ...