Python – how to get image files with Scrapy
I have just started using Scrapy and I am trying to crawl image files. Here is my code.
items.py
```python
from scrapy.item import Item, Field

class TutorialItem(Item):
    image_urls = Field()
    images = Field()
```
settings.py
```python
BOT_NAME = 'tutorial'

SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'

ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline']
# Note: the setting name is IMAGES_STORE (plural); IMAGE_STORE is silently ignored
IMAGES_STORE = '/Users/rnd/Desktop/Scrapy-0.16.5/tutorial/images'
```
pipelines.py
```python
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request

class TutorialPipeline(object):
    def process_item(self, item, spider):
        return item

    # Note: Scrapy only calls get_media_requests on a media pipeline such as
    # an ImagesPipeline subclass; on this plain object it is never invoked.
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url)
```
tutorial_spider.py
```python
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from tutorial.items import TutorialItem

class TutorialSpider(BaseSpider):
    name = "tutorial"
    allowed_domains = ["roxie.com"]
    start_urls = ["http://www.roxie.com/events/details.cfm?eventid=581D228B%2DB338%2DF449%2DBD69027D7D878A7F"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = TutorialItem()
        # extract() returns a list, so build one absolute URL per src value
        links = hxs.select('//div[@id="eventdescription"]//img/@src').extract()
        item['image_urls'] = ["http://www.roxie.com" + link for link in links]
        return item
```
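Note that `extract()` returns a list of strings rather than a single string, so relative `src` values are best joined to the base URL one at a time. A minimal, Scrapy-independent sketch using only the standard library (the sample paths below are hypothetical):

```python
# Sketch: build absolute image URLs from the relative src values that
# extract() yields as a list. The sample links below are hypothetical.
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2, as in the Scrapy 0.16 era

base = "http://www.roxie.com/events/details.cfm"
links = ["/images/poster.jpg", "http://www.roxie.com/images/banner.jpg"]

# urljoin resolves relative paths against the base and leaves
# already-absolute URLs untouched
image_urls = [urljoin(base, link) for link in links]
print(image_urls)
```

This avoids the `str + list` concatenation pitfall entirely, and also copes with pages that mix relative and absolute `src` attributes.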
Console output for the command `scrapy crawl tutorial -o roxie.json -t json`:
```
2013-06-19 17:29:06-0700 [scrapy] INFO: Scrapy 0.16.5 started (bot: tutorial)
/System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python/twisted/web/microdom.py:181: SyntaxWarning: assertion is always true, perhaps remove parentheses?
  assert (oldChild.parentNode is self,
2013-06-19 17:29:06-0700 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-06-19 17:29:06-0700 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-06-19 17:29:06-0700 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 5, in <module>
    pkg_resources.run_script('Scrapy==0.16.5', 'scrapy')
  File "/Library/Python/2.6/site-packages/setuptools-0.6c12dev_r88846-py2.6.egg/pkg_resources.py", line 489, in run_script
  File "/Library/Python/2.6/site-packages/setuptools-0.6c12dev_r88846-py2.6.egg/pkg_resources.py", line 1207, in run_script
    # we assume here that our metadata may be nested inside a "basket"
  File "/Library/Python/2.6/site-packages/Scrapy-0.16.5-py2.6.egg/EGG-INFO/scripts/scrapy", line 4, in <module>
    execute()
  File "/Library/Python/2.6/site-packages/Scrapy-0.16.5-py2.6.egg/scrapy/cmdline.py", line 131, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/Library/Python/2.6/site-packages/Scrapy-0.16.5-py2.6.egg/scrapy/cmdline.py", line 76, in _run_print_help
    func(*a, **kw)
  File "/Library/Python/2.6/site-packages/Scrapy-0.16.5-py2.6.egg/scrapy/cmdline.py", line 138, in _run_command
    cmd.run(args, opts)
  File "/Library/Python/2.6/site-packages/Scrapy-0.16.5-py2.6.egg/scrapy/commands/crawl.py", line 43, in run
    spider = self.crawler.spiders.create(spname, **opts.spargs)
  File "/Library/Python/2.6/site-packages/Scrapy-0.16.5-py2.6.egg/scrapy/command.py", line 33, in crawler
    self._crawler.configure()
  File "/Library/Python/2.6/site-packages/Scrapy-0.16.5-py2.6.egg/scrapy/crawler.py", line 41, in configure
    self.engine = ExecutionEngine(self, self._spider_closed)
  File "/Library/Python/2.6/site-packages/Scrapy-0.16.5-py2.6.egg/scrapy/core/engine.py", line 63, in __init__
    self.scraper = Scraper(crawler)
  File "/Library/Python/2.6/site-packages/Scrapy-0.16.5-py2.6.egg/scrapy/core/scraper.py", line 66, in __init__
    self.itemproc = itemproc_cls.from_crawler(crawler)
  File "/Library/Python/2.6/site-packages/Scrapy-0.16.5-py2.6.egg/scrapy/middleware.py", line 50, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/Library/Python/2.6/site-packages/Scrapy-0.16.5-py2.6.egg/scrapy/middleware.py", line 29, in from_settings
    mwcls = load_object(clspath)
  File "/Library/Python/2.6/site-packages/Scrapy-0.16.5-py2.6.egg/scrapy/utils/misc.py", line 39, in load_object
    raise ImportError, "Error loading object '%s': %s" % (path, e)
ImportError: Error loading object 'scrapy.contrib.pipeline.images.ImagesPipeline': No module named PIL
```
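The traceback bottoms out in a plain `ImportError`: Scrapy's ImagesPipeline imports PIL when the pipeline class is loaded. A quick way to check which module names are importable in the interpreter Scrapy runs under, as a standard-library-only sketch that works whether or not PIL is actually installed:

```python
# Sketch: report whether a module can be imported, without importing it.
try:
    from importlib.util import find_spec  # Python 3
    def has_module(name):
        return find_spec(name) is not None
except ImportError:
    import imp                            # Python 2 fallback
    def has_module(name):
        try:
            imp.find_module(name)
            return True
        except ImportError:
            return False

# Classic PIL 1.1.7 exposes a top-level Image module; the Pillow fork
# exposes the PIL package instead, so checking both names is safest.
for name in ("PIL", "Image"):
    print("%s importable: %s" % (name, has_module(name)))
```

Running this with the same `python` binary that launches Scrapy shows whether the install landed on the interpreter's path at all.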
It looks like PIL is required, so I installed it:
```
PIL 1.1.7 is already the active version in easy-install.pth
Installing pilconvert.py script to /usr/local/bin
Installing pildriver.py script to /usr/local/bin
Installing pilfile.py script to /usr/local/bin
Installing pilfont.py script to /usr/local/bin
Installing pilprint.py script to /usr/local/bin
Using /Library/Python/2.6/site-packages/PIL-1.1.7-py2.6-macosx-10.6-universal.egg
Processing dependencies for pil
Finished processing dependencies for pil
```
However, it still does not work. Could you let me know what I am missing? Thanks in advance!
Solution:
Yes, I ran into the same problem when I started crawling images from some sites. I was working on CentOS 6.5 with Python 2.7.6, and solved it as follows:

```shell
yum install easy_install
easy_install pip
```

Then log in as root and run `pip install image`. After that everything worked fine.

If you are working on Ubuntu, I believe the equivalent first step is `sudo apt-get install easy_install`, and the rest should be the same.

I hope it helps.
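To confirm the fix took, it can help to exercise the library the same way ImagesPipeline does: open raw image bytes in memory and read back the size. A hedged sketch that degrades gracefully when no imaging library is present (the 1x1 GIF bytes below are only self-contained test data, not anything from the crawled site):

```python
# Sketch: sanity-check the imaging library by opening image bytes in memory,
# which is roughly what Scrapy's ImagesPipeline does with downloaded data.
from io import BytesIO

try:
    from PIL import Image   # Pillow / modern PIL package layout
except ImportError:
    try:
        import Image        # classic PIL 1.1.7 top-level layout
    except ImportError:
        Image = None        # no imaging library available

def image_size(data):
    """Return (width, height) of raw image bytes, or None without PIL."""
    if Image is None:
        return None
    return Image.open(BytesIO(data)).size

# A minimal 1x1 GIF, used here only as self-contained test data
ONE_PX_GIF = (b'GIF89a\x01\x00\x01\x00\x80\x00\x00\x00\x00\x00\xff\xff\xff'
              b'!\xf9\x04\x01\x00\x00\x00\x00'
              b',\x00\x00\x00\x00\x01\x00\x01\x00\x00\x02\x02D\x01\x00;')

print(image_size(ONE_PX_GIF))
```

If this prints `(1, 1)`, the pipeline's image handling should load cleanly; if `Image` is still `None`, the install went to a different interpreter than the one Scrapy uses.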