python - Getting a more granular diff from difflib (or a way to post-process a diff to achieve the same thing)

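I downloaded this page and made a small edit to it, changing the first 65 in this paragraph to 68:

I then parse both sources with BeautifulSoup and compare them using difflib.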

import difflib
import urllib.request  # Python 3 replacement for urllib2

import bs4

url = 'https://secure.ssa.gov/apps10/reference.nsf/links/02092016062645AM'
response = urllib.request.urlopen(url)
content = response.read()  # raw HTML as bytes (not a list of lines)

url2 = 'file:///Users/Pyderman/projects/temp/02092016062645AM-modified.html'
response2 = urllib.request.urlopen(url2)
content2 = response2.read()  # raw HTML of the locally edited copy

d = difflib.Differ()

soup = bs4.BeautifulSoup(content, "lxml")
soup2 = bs4.BeautifulSoup(content2, "lxml")
# Compare the visible text strings of the two documents, one string per element
diff = d.compare(list(soup.stripped_strings), list(soup2.stripped_strings))
changes = [change for change in diff if change.startswith('-') or change.startswith('+')]
for change in changes:
    print(change)

Printing the changes gives:

- The Achieving a Better Life Experience (ABLE) Act, H.R. 5771, legislation passed on December 19, 2014. It contains a Title II provision that changes the age at which workers compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA).  This provision will apply to any individual who attains age 65 on or after December 19, 2015 (the one year anniversary of enactment of this bill).  Two new Universal Text Identifiers (UTIs), UTI WCP060 and WCP061 were created to comply with this change.
+ The Achieving a Better Life Experience (ABLE) Act, H.R. 5771, legislation passed on December 19, 2014. It contains a Title II provision that changes the age at which workers compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA).  This provision will apply to any individual who attains age 65 on or after December 19, 2015 (the one year anniversary of enactment of this bill).  Two new Universal Text Identifiers (UTIs), UTI WCP060 and WCP061 were created to comply with this change.

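So although only a small change was made, the whole paragraph is printed. I suppose it is a good thing that it shows the diff for the whole paragraph rather than for a single sentence, but can we somehow make the output more granular? As it stands, it looks as if highlighting only the text that changed would require some additional delta comparison on these two almost identical strings.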
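As a point of reference, `difflib.Differ` can itself emit `? ` guide lines that mark where two similar lines differ, and the filter above, which keeps only lines starting with `-` or `+`, drops them. The following is a minimal sketch of that behaviour; the shortened strings `old` and `new` are illustrative stand-ins for the full paragraph, not the actual page content.

```python
import difflib

# Shortened, illustrative stand-ins for the two paragraph versions
old = ["offset ends for disability beneficiaries from age 65 to full retirement age (FRA)."]
new = ["offset ends for disability beneficiaries from age 68 to full retirement age (FRA)."]

lines = list(difflib.Differ().compare(old, new))

# '? ' lines are Differ's intraline hints; '^' marks a replaced character
hints = [line for line in lines if line.startswith('? ')]

for line in lines:
    print(line.rstrip('\n'))
```

Keeping the `? ` lines in the filtered output is one way to see the changed position without any extra comparison, although the hints are aimed at human readers rather than at further processing.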

Solution:

You can use nltk.sent_tokenize() to split the soup strings into sentences:

from nltk import sent_tokenize

sentences = [sentence for string in soup.stripped_strings for sentence in sent_tokenize(string)]
sentences2 = [sentence for string in soup2.stripped_strings for sentence in sent_tokenize(string)]

diff = d.compare(sentences, sentences2)
changes = [change for change in diff if change.startswith('-') or  change.startswith('+')]
for change in changes:
    print(change)

This prints only the sentences in which a change was detected:

- It contains a Title II provision that changes the age at which workers compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA).
+ It contains a Title II provision that changes the age at which workers compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA).
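If sentence-level output is still too coarse, the changed sentence pair can be post-processed once more at word level. The sketch below is one possible approach, not part of the original answer: it splits both sentences on whitespace and uses `difflib.SequenceMatcher.get_opcodes()` to keep only the spans that differ. The helper name `changed_words` is hypothetical.

```python
import difflib


def changed_words(old, new):
    """Return (tag, old_words, new_words) for every span that differs."""
    a, b = old.split(), new.split()
    # autojunk=False disables the popularity heuristic for long sequences
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != 'equal'
    ]


old = ("It contains a Title II provision that changes the age at which workers "
       "compensation/public disability offset ends for disability beneficiaries "
       "from age 65 to full retirement age (FRA).")
new = old.replace("age 65", "age 68")

print(changed_words(old, new))  # [('replace', ['65'], ['68'])]
```

Splitting on whitespace is a deliberate simplification; punctuation stays attached to words, so a change next to a comma or period is reported together with that punctuation.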
