Fastest way to sort a corpus dictionary into an OrderedDict – python
Given a corpus/text such as:
Resumption of the session
I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999 , and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period .
Although , as you will have seen , the dreaded ' millennium bug ' failed to materialise , still the people in a number of countries suffered a series of natural disasters that truly were dreadful .
You have requested a debate on this subject in the course of the next few days , during this part @-@ session .
In the meantime , I should like to observe a minute ' s silence , as a number of Members have requested , on behalf of all the victims concerned , particularly those of the terrible storms , in the various countries of the European Union .
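I can get a dictionary of word frequencies simply by doing this (with the corpus held in the string text):
>>> from collections import Counter
>>> word_freq = Counter()
>>> for line in text.split('\n'):
...     for word in line.split():
...         word_freq[word] += 1
...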
But if the goal is an ordered dictionary running from the highest frequency to the lowest, I have to do this:
>>> from collections import OrderedDict
>>> sorted_word_freq = OrderedDict()
>>> for word, freq in word_freq.most_common():
...     sorted_word_freq[word] = freq
...
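The loop above can be collapsed into a single constructor call, because OrderedDict accepts an iterable of (key, value) pairs. The following is a minimal sketch on a hypothetical toy corpus (the string in text is made up for illustration); it does not change the sorting cost, only the amount of code.

```python
from collections import Counter, OrderedDict

# Hypothetical toy corpus, used only for illustration.
text = "a b a c b a"
word_freq = Counter(text.split())

# most_common() returns (word, count) pairs sorted from highest to lowest count,
# and OrderedDict can be built directly from such pairs.
sorted_word_freq = OrderedDict(word_freq.most_common())

# On Python 3.7+, a plain dict also preserves insertion order,
# so it keeps the same highest-to-lowest ordering.
sorted_plain = dict(word_freq.most_common())

print(sorted_word_freq)
print(sorted_plain)
```

Whether a plain dict is an acceptable substitute depends on whether order-sensitive equality or methods such as move_to_end() are needed; those exist only on OrderedDict.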
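Imagine that I have 1 billion keys in the Counter object. Building it already requires one pass over the corpus (non-unique instances), and iterating through most_common() then requires sorting and traversing the whole vocabulary (unique keys).
Note: Counter.most_common() calls an ad-hoc sorted(), see https://hg.python.org/cpython/file/e38470b49d3c/Lib/collections.py#l472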
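To make the note concrete, the sketch below checks on a small made-up sentence that most_common() with no argument gives the same result as sorting the items by count in descending order. The sample sentence and variable names are assumptions for illustration. When an argument n is passed, CPython uses heapq.nlargest instead of a full sort, which can help if only the top few words are needed.

```python
from collections import Counter
from operator import itemgetter

# Made-up sample sentence, for illustration only.
word_freq = Counter("the cat and the dog and the bird".split())

# A full sort of the (word, count) pairs by count, highest first.
by_sorted = sorted(word_freq.items(), key=itemgetter(1), reverse=True)

# most_common() with no argument produces the same list.
by_most_common = word_freq.most_common()

# most_common(n) returns only the n most frequent entries.
top2 = word_freq.most_common(2)

print(by_sorted == by_most_common)
print(top2)
```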
Given this, I have seen the following code that uses numpy.argsort() instead:
>>> import numpy as np
>>> words = list(word_freq.keys())    # list() so the keys can be indexed on Python 3
>>> freqs = list(word_freq.values())  # same order as words
>>> sorted_word_index = np.argsort(freqs) # lowest to highest
>>> sorted_word_freq_with_numpy = OrderedDict()
>>> for idx in reversed(sorted_word_index):
...     sorted_word_freq_with_numpy[words[idx]] = freqs[idx]
...
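One way to compare the two approaches is to wrap each in a function and time them on the same Counter. The sketch below does that on a synthetic vocabulary; the function names, the vocabulary size and the random counts are all assumptions chosen for illustration, and the timings it prints will depend on the machine and on the library versions. A stable argsort on the negated counts is used so that ties come out in the same order as with most_common().

```python
import timeit
from collections import Counter, OrderedDict

import numpy as np


def sort_with_most_common(word_freq):
    """Build the ordered dict from Counter.most_common()."""
    return OrderedDict(word_freq.most_common())


def sort_with_argsort(word_freq):
    """Build the ordered dict from a NumPy argsort over the counts."""
    words = list(word_freq.keys())
    freqs = np.fromiter(word_freq.values(), dtype=np.int64, count=len(word_freq))
    # Stable sort on negated counts: highest count first, ties in insertion order.
    order = np.argsort(-freqs, kind="stable")
    return OrderedDict((words[i], int(freqs[i])) for i in order)


# Hypothetical synthetic vocabulary: 100,000 made-up words with random counts.
rng = np.random.default_rng(0)
counts = rng.integers(1, 1000, size=100_000)
word_freq = Counter({"w%d" % i: int(c) for i, c in enumerate(counts)})

# Both approaches should produce the same ordering.
assert list(sort_with_most_common(word_freq).items()) == list(
    sort_with_argsort(word_freq).items()
)

for fn in (sort_with_most_common, sort_with_argsort):
    seconds = timeit.timeit(lambda: fn(word_freq), number=3) / 3
    print(fn.__name__, round(seconds, 4), "s per call")
```

Note that in both functions a large share of the time may go into building the OrderedDict rather than into the sort itself, so the sort routine is not the only thing being measured.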
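Which is faster?
Is there any other, faster way to get such an OrderedDict from a Counter?
Other than OrderedDict, are there other Python objects that achieve the same sorted key-value pairs?
Assume that memory is not an issue. Given 120 GB of RAM, keeping 1 billion key-value pairs should not be much of a problem, right? Assume an average of 20 characters per key for the 1 billion keys and a single integer for each value.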
Answer:
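The Series object in Pandas is an array of key-value pairs (which can have non-unique keys) that may be of interest. It has a sort method that sorts by value and is implemented in Cython. The original answer called it as Series.sort(); later pandas releases replaced that with Series.sort_values(). Here is an example of sorting an array of length one million: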
In [39]:
import pandas as pd
import numpy as np
arr = np.arange(1e6)
np.random.shuffle(arr)
s = pd.Series(arr, index=np.arange(1e6))
%timeit s.sort_values()  # the original answer timed s.sort(), which has since been removed from pandas
%timeit sorted(arr)
1 loops, best of 3: 85.8 ms per loop
1 loops, best of 3: 1.15 s per loop
Given a plain Python dict, you can construct a Series by calling:
my_series = pd.Series(my_dict)
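and then sort by value:
my_series = my_series.sort_values()  # written as my_series.sort() in the original answer; sort_values() returns a new Series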
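Putting the answer together with the question, a round trip from a Counter through a Series and back to an OrderedDict might look like the sketch below. It assumes a pandas version that provides Series.sort_values(), and the sample sentence and variable names are made up for illustration. The order of words with equal counts is not checked here.

```python
from collections import Counter, OrderedDict

import pandas as pd

# Made-up sample sentence, for illustration only.
word_freq = Counter("the cat and the dog and the bird".split())

# Build a Series whose index holds the words and whose values hold the counts.
series = pd.Series(dict(word_freq))

# Sort by value, highest count first; sort_values() returns a new Series.
sorted_series = series.sort_values(ascending=False)

# Convert back to an OrderedDict, turning NumPy integers into plain ints.
sorted_word_freq = OrderedDict((word, int(count)) for word, count in sorted_series.items())

print(sorted_word_freq)
```

If the sorted Series itself is enough for the downstream code, the last conversion step can be skipped, which avoids creating a second copy of every key-value pair.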