UnicodeDecodeError, ascii processing Snowball stemming algorithm in Python

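I am having trouble reading regular files into a program I have written. The current problem is that the PDFs are based on some kind of mutated UTF-8 that includes a BOM, which throws a wrench into my whole operation. My application uses the Snowball stemming algorithm, which requires ASCII input. There are a number of topics about resolving UTF-8 errors, but none of them involve sending the text into the Snowball algorithm or take into account that ASCII is what I want as the end result. The file I am currently using is a Notepad file saved with the standard ANSI encoding. The specific error message I get is this: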

File "C:\Users\svictoroff\Desktop\Alleyoop\Python_Scripts\Keywords.py", line 38, in Map_Sentence_To_Keywords
    Word = Word.encode('ascii', 'ignore')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x96 in position 0: ordinal not in range(128)

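My understanding was that in Python, passing the ignore argument would simply skip over any non-ASCII characters encountered, so that I would bypass any BOM or special characters, but clearly that is not the case. The actual code being called is here: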

def Map_Sentence_To_Keywords(Sentence, Keywords):
    '''Takes in a sentence and a list of Keywords, returns a tuple where the
    first element is the sentence, and the second element is a set of
    all keywords appearing in the sentence. Uses Snowball algorithm'''
    Equivalence = stem.SnowballStemmer('english')
    Found = []
    Sentence = re.sub(r'^(\W*?)(.*)(\n?)$', r'\2', Sentence)
    Words = Sentence.split()
    for Word in Words:
        Word = Word.lower().strip()
        Word = Word.encode('ascii', 'ignore')
        Word = Equivalence.stem(Word)
        Found.append(Word)
    return (Sentence, Found)

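By including the general non-greedy non-word-character regex removal at the front of the string, I had also hoped to strip the troublesome characters, but again that is not the case. I have tried a number of encodings besides ascii; a strict base64 encoding works, but it is far from ideal for my application. Any ideas on how to fix this in an automated way?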

The initial decoding of Element fails, but the unicode error is only raised when the element is actually passed to the encoder.

for Element in Curriculum_Elements:
    try:
        Element = Element.decode('utf-8-sig')
    except:
        print Element
    Curriculum_Tuples.append(Map_Sentence_To_Keywords(Element, Keywords))

def scraping(File):
    '''Takes in txt file of curriculum, removes all newlines and returns that occur
    after a lowercase character, then splits at all remaining newlines'''
    Curriculum_Elements = []
    Document = open(File, 'rb').read()
    Document = re.sub(r'(?<=[a-zA-Z,])\r?\n', ' ', Document)
    Curriculum_Elements = Document.split('\r\n')
    return Curriculum_Elements

The code shown above generates the curriculum elements that are passed in.

for Element in Curriculum_Elements:
    try:
        Element = unicode(Element, 'utf-8-sig', 'ignore')
    except:
        print Element

This typecasting hackaround actually works, but the conversion back to ASCII is a bit touch and go. It returns this warning:

Warning (from warnings module):
  File "C:\Python27\lib\encodings\utf_8_sig.py", line 19
    if input[:3] == codecs.BOM_UTF8:
UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal

Solution:

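Try decoding the UTF-8 input into a unicode string first, and then encode that string as ASCII (ignoring non-ASCII characters). It really makes no sense to encode a string that is already encoded.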

input = file.read()   # Replace with your file input code...
input = input.decode('utf-8-sig')   # '-sig' handles BOM

# Now isinstance(input, unicode) is True

# ...
Sentence = Sentence.encode('ascii', 'ignore')
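
To make the traceback less surprising: in Python 2, calling .encode() on a byte string first decodes it implicitly with the default ASCII codec in strict mode, and the 'ignore' argument only applies to the encoding step that follows. The sketch below spells out those two steps so it behaves the same on Python 2 and 3; the helper name implicit_encode is hypothetical and exists only for illustration.

```python
raw = b'\x96'  # the byte reported in the traceback


def implicit_encode(byte_string):
    """Spell out what Python 2 does for str.encode('ascii', 'ignore')."""
    # Step 1: implicit, strict decode with the default ASCII codec.
    # This is the step that raises; 'ignore' is never consulted here.
    text = byte_string.decode('ascii')
    # Step 2: the encode that was actually requested.
    return text.encode('ascii', 'ignore')


try:
    implicit_encode(raw)
    outcome = 'no error'
except UnicodeDecodeError:
    outcome = 'UnicodeDecodeError'

print(outcome)
```

Pure ASCII bytes pass through both steps unchanged, which is why the problem only shows up once the input contains a byte above 0x7F.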

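After your edit, I see that you were already trying to decode the strings before encoding them as ASCII. However, the decoding seems to happen too late, after the file's contents have already been manipulated. That can cause problems, because not every UTF-8 byte is a character (some characters take several bytes to encode). Imagine an encoding that turns any string into a sequence of a's and b's. You would not want to manipulate the data before decoding it, because you would see a's and b's everywhere even if the unencoded string contained none. The same problem arises with UTF-8, although much more subtly, because most bytes really are characters.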
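
A small illustration of why byte-level manipulation before decoding is risky. The sample string and both helper names are made up for this sketch; the only point is that a slice taken on the encoded bytes can cut a multi-byte character in half, while the same slice on decoded text cannot.

```python
text = u'caf\u00e9 au lait'        # 12 characters
encoded = text.encode('utf-8')     # 13 bytes: the e-acute becomes 0xC3 0xA9


def prefix_after_decoding(data, n):
    """Decode first, then slice: n counts characters."""
    return data.decode('utf-8')[:n]


def prefix_before_decoding(data, n):
    """Slice first, then decode: n counts bytes and may split a character."""
    return data[:n].decode('utf-8')


print(repr(prefix_after_decoding(encoded, 4)))

try:
    prefix_before_decoding(encoded, 4)
    sliced_ok = True
except UnicodeDecodeError:
    # The slice ended on 0xC3, the first half of a two-byte sequence.
    sliced_ok = False

print(sliced_ok)
```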

So, decode once, before you do anything else:

def scraping(File):
    '''Takes in txt file of curriculum, removes all newlines and returns that occur
    after a lowercase character, then splits at all remaining newlines'''
    Curriculum_Elements = []
    Document = open(File, 'rb').read().decode('utf-8-sig')
    Document = re.sub(r'(?<=[a-zA-Z,])\r?\n', ' ', Document)
    Curriculum_Elements = Document.split('\r\n')
    return Curriculum_Elements

# ...

for Element in Curriculum_Elements:
    Curriculum_Tuples.append(Map_Sentence_To_Keywords(Element, Keywords))
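
One caveat, offered as an assumption rather than a diagnosis: the question mentions a Notepad file saved as ANSI, and the byte 0x96 from the traceback is the en dash in Windows-1252. A file like that is not valid UTF-8, so decoding it as 'utf-8-sig' would fail as well. A minimal sketch of a fallback follows; the decode_document helper and the choice of 'cp1252' as the fallback are hypothetical and should be adjusted to whatever encoding the files really use.

```python
import codecs


def decode_document(raw_bytes, fallback='cp1252'):
    """Decode as UTF-8 (dropping a BOM if present), else use the fallback."""
    try:
        return raw_bytes.decode('utf-8-sig')
    except UnicodeDecodeError:
        # Assumed legacy encoding; bytes it does not define become U+FFFD.
        return raw_bytes.decode(fallback, 'replace')


with_bom = decode_document(codecs.BOM_UTF8 + b'abc')
ansi_text = decode_document(b'1990 \x96 2000')

print(repr(with_bom))
print(repr(ansi_text))
print(repr(ansi_text.encode('ascii', 'ignore')))
```

In scraping, such a helper would take the place of the .decode('utf-8-sig') call on the bytes returned by read().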

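Your original Map_Sentence_To_Keywords function should work without modification, although I would suggest encoding to ASCII before splitting, for better efficiency and readability.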
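
A rough sketch of what encoding before splitting could look like. It assumes the sentence has already been decoded to text, leaves out the regex cleanup from the original function for brevity, and takes the stemmer as a parameter so the example runs without NLTK; the lowercase function name and the stem parameter are illustrative, not part of the original code.

```python
def map_sentence_to_keywords(sentence, stem=lambda word: word):
    """Return (sentence, stems), dropping non-ASCII characters once up front."""
    # One encode for the whole sentence instead of one per word. Decoding the
    # ASCII bytes again keeps the result a text string on Python 2 and 3.
    ascii_sentence = sentence.encode('ascii', 'ignore').decode('ascii')
    found = [stem(word.lower()) for word in ascii_sentence.split()]
    return (sentence, found)


# Stand-in for a real stemmer: strips a trailing "ing" (assumption for the demo).
def toy_stem(word):
    return word[:-3] if word.endswith('ing') else word


original, stems = map_sentence_to_keywords(u'Caf\u00e9s \u2013 Running fast', toy_stem)
print(stems)
```

With the real stemmer, the call would pass Equivalence.stem from the original function as the stem argument.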
