Efficient queries over large multi-list data in Python
I am trying to put together an example of how to manipulate a huge database made of CSV tables with Python. I would like to find a way to simulate an efficient indexed query on tables spread across several list()s.

The following example (Python 2) takes 24 seconds on a 3.2 GHz Core i5:
```
#!/usr/bin/env python
import csv

MAINDIR = "../"
pf = open(MAINDIR + "atp_players.csv")
players = [p for p in csv.reader(pf)]
rf = open(MAINDIR + "atp_rankings_current.csv")
rankings = [r for r in csv.reader(rf)]

for i in rankings[:10]:
    player = filter(lambda x: x[0] == i[2], players)[0]
    print "%s(%s),(%s) Points: %s" % (player[2], player[5], player[3], i[3])
```
This is run against this dataset.

A more efficient or more Pythonic approach would be much appreciated.
Answer:

Instead of reading all the ranking rows, you can use itertools.islice, and use itertools.ifilter rather than building a filtered list:
```
import csv
from itertools import islice, ifilter

MAINDIR = "../"
with open(MAINDIR + "atp_players.csv") as pf, open(MAINDIR + "atp_rankings_current.csv") as rf:
    players = list(csv.reader(pf))
    rankings = csv.reader(rf)
    # only get the first ten rows using islice
    for i in islice(rankings, None, 10):
        # ifilter won't create a list; it yields values on the fly
        player = next(ifilter(lambda x: x[0] == i[2], players), "")
```
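For reference, itertools.ifilter no longer exists in Python 3; the built-in filter is already lazy there. A minimal Python 3 equivalent of the loop above, using made-up in-memory rows instead of the real CSV files:

```python
import csv
from io import StringIO
from itertools import islice

# Hypothetical stand-ins for the two CSV files (column layout assumed:
# players keyed by column 0, rankings referencing that key in column 2).
players = list(csv.reader(StringIO("100,x,Djokovic\n101,x,Federer\n")))
rankings = csv.reader(StringIO("2015,1,100,11360\n2015,2,101,9625\n"))

results = []
for i in islice(rankings, None, 10):
    # In Python 3 the built-in filter() is lazy, just like ifilter was
    player = next(filter(lambda x: x[0] == i[2], players), "")
    results.append(player)
```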
I am not quite sure what filter(lambda x: x[0] == i[2], players)[0] is supposed to be doing; you seem to search the whole players list for every ranking row and keep only the first match. It might pay to sort the list once on the first element and use a binary search, or to build a dict with the first element as the key and the row as the value and then just do lookups:
```
import csv
from itertools import islice
from collections import OrderedDict

MAINDIR = "../"
with open(MAINDIR + "atp_players.csv") as pf, open(MAINDIR + "atp_rankings_current.csv") as rf:
    players = OrderedDict((row[0], row) for row in csv.reader(pf))
    rankings = csv.reader(rf)
    for i in islice(rankings, None, 10):
        # now getting a row is constant-time work as opposed to O(n)
        player = players.get(i[2])
```
You will have to decide what default value to use with .get, or whether you need one at all.
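To illustrate the default-value choice, a tiny sketch with one hypothetical row: .get returns None for a missing key unless you pass an explicit fallback.

```python
# The lookup dict from above, shrunk to one hypothetical row.
players = {"100": ["100", "x", "Djokovic"]}

# .get() returns None for a missing key by default...
missing_none = players.get("999")

# ...or you can supply an explicit sentinel row instead.
placeholder = ["999", "", "Unknown player"]
missing_row = players.get("999", placeholder)
```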
If there can be duplicate keys at the start of the rows and you only want to keep the first occurrence:
```
import csv
from itertools import islice

MAINDIR = "../"
with open(MAINDIR + "atp_players.csv") as pf, open(MAINDIR + "atp_rankings_current.csv") as rf:
    players = {}
    for row in csv.reader(pf):
        key = row[0]
        if key in players:
            continue
        players[key] = row
    rankings = csv.reader(rf)
    for i in islice(rankings, None, 10):
        player = players.get(i[2])
```
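The explicit membership test above can also be written with dict.setdefault, which only stores a value when the key is absent. A small sketch with hypothetical rows, two of which share a key:

```python
# Hypothetical rows: two share the key "100"; the first one should win.
rows = [["100", "Djokovic"], ["100", "duplicate"], ["101", "Federer"]]

players = {}
for row in rows:
    # setdefault stores row only if the key is not yet present,
    # so the first occurrence wins without an explicit `in` check
    players.setdefault(row[0], row)
```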
Output:
```
Djokovic(SRB),(R) Points: 11360
Federer(SUI),(R) Points: 9625
Nadal(ESP),(L) Points: 6585
Wawrinka(SUI),(R) Points: 5120
Nishikori(JPN),(R) Points: 5025
Murray(GBR),(R) Points: 4675
Berdych(CZE),(R) Points: 4600
Raonic(CAN),(R) Points: 4440
Cilic(CRO),(R) Points: 4150
Ferrer(ESP),(R) Points: 4045
```
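The other option mentioned earlier, sorting once and binary-searching, was not shown above. A minimal Python 3 sketch of it with made-up rows (column 0 assumed to be the lookup key, as in the question):

```python
import bisect

# Hypothetical players rows, column 0 being the id used for lookups.
players = [["101", "x", "Federer"], ["100", "x", "Djokovic"], ["102", "x", "Nadal"]]

# Sort once on the key column, then keep a parallel list of keys for bisect.
players.sort(key=lambda row: row[0])
keys = [row[0] for row in players]

def lookup(player_id):
    """Return the row for player_id in O(log n), or None if absent."""
    i = bisect.bisect_left(keys, player_id)
    if i < len(keys) and keys[i] == player_id:
        return players[i]
    return None
```

This keeps memory flat (no second dict) at the cost of O(log n) rather than O(1) lookups.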
Timing the code for ten ranking rows shows ifilter is the fastest, but as we take more rows the dict wins, and you can see how badly your own code scales:
```
In [33]: %%timeit
MAINDIR = "tennis_atp-master/"
pf = open("/tennis_atp-master/atp_players.csv")
players = [p for p in csv.reader(pf)]
rf = open("/tennis_atp-master/atp_rankings_current.csv")
rankings = [r for r in csv.reader(rf)]
for i in rankings[:10]:
    player = filter(lambda x: x[0] == i[2], players)[0]
   ....:
10 loops, best of 3: 123 ms per loop
```
```
In [34]: %%timeit
with open("/tennis_atp-master/atp_players.csv") as pf, open("/tennis_atp-master/atp_rankings_current.csv") as rf:
    players = list(csv.reader(pf))
    rankings = csv.reader(rf)
    # only get the first ten rows using islice
    for i in islice(rankings, None, 10):
        # ifilter won't create a list; it yields values on the fly
        player = next(ifilter(lambda x: x[0] == i[2], players), "")
   ....:
10 loops, best of 3: 43.6 ms per loop
```
```
In [35]: %%timeit
with open("/tennis_atp-master/atp_players.csv") as pf, open("/tennis_atp-master/atp_rankings_current.csv") as rf:
    players = {}
    for row in csv.reader(pf):
        key = row[0]
        if key in players:
            continue
        players[key] = row
    rankings = csv.reader(rf)
    for i in islice(rankings, None, 10):
        player = players.get(i[2])
        pass
   ....:
10 loops, best of 3: 50.7 ms per loop
```
Now with 100 ranking rows you see the dict is as fast as it was for 10; the one-off cost of building the dict has been offset by the constant-time lookups:
```
In [38]: %%timeit
with open("/tennis_atp-master/atp_players.csv") as pf, open("/tennis_atp-master/atp_rankings_current.csv") as rf:
    players = list(csv.reader(pf))
    rankings = csv.reader(rf)
    # only get the first hundred rows using islice
    for i in islice(rankings, None, 100):
        # ifilter won't create a list; it yields values on the fly
        player = next(ifilter(lambda x: x[0] == i[2], players), "")
   ....:
10 loops, best of 3: 120 ms per loop
```
```
In [39]: %%timeit
with open("/tennis_atp-master/atp_players.csv") as pf, open("/tennis_atp-master/atp_rankings_current.csv") as rf:
    players = {}
    for row in csv.reader(pf):
        key = row[0]
        if key in players:
            continue
        players[key] = row
    rankings = csv.reader(rf)
    for i in islice(rankings, None, 100):
        player = players.get(i[2])
        pass
   ....:
10 loops, best of 3: 50.7 ms per loop
```
```
In [40]: %%timeit
MAINDIR = "tennis_atp-master/"
pf = open("/tennis_atp-master/atp_players.csv")
players = [p for p in csv.reader(pf)]
rf = open("/tennis_atp-master/atp_rankings_current.csv")
rankings = [r for r in csv.reader(rf)]
for i in rankings[:100]:
    player = filter(lambda x: x[0] == i[2], players)[0]
   ....:
1 loops, best of 3: 806 ms per loop
```
For 250 ranking rows:

```
# your code
1 loops, best of 3: 1.86 s per loop
# dict
10 loops, best of 3: 50.7 ms per loop
# ifilter
10 loops, best of 3: 483 ms per loop
```
A final test looping over the whole rankings file:

```
# your code
1 loops, best of 3: 2min 40s per loop
# dict
10 loops, best of 3: 67 ms per loop
# ifilter
1 loops, best of 3: 1min 3s per loop
```
So you can see that as we loop over more of the rankings, the dict option is by far the most efficient at runtime and will scale very well.
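The same scaling effect can be reproduced without the dataset. A minimal Python 3 sketch with synthetic rows (sizes and names made up) that compares one linear scan per lookup against a one-off dict build plus constant-time access:

```python
import timeit

# Synthetic stand-ins for the CSV rows: 10,000 rows keyed by column 0.
rows = [[str(n), "player%d" % n] for n in range(10000)]
index = {row[0]: row for row in rows}   # one-off O(n) build
target = "9999"                          # worst case for a linear scan

def scan():
    # O(n) per lookup: effectively what filter(...)[0] does
    return next(r for r in rows if r[0] == target)

def lookup():
    # O(1) per lookup: a plain dict access
    return index[target]

assert scan() == lookup()
scan_t = timeit.timeit(scan, number=200)
look_t = timeit.timeit(lookup, number=200)
```

Repeated lookups amortize the dict's build cost very quickly, which is what the timings above show.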