python – Scraping multiple pages with Scrapy
I am trying to use Scrapy to scrape a website that spreads its information across multiple pages.
My code is:
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from tcgplayer1.items import Tcgplayer1Item

class MySpider(BaseSpider):
    name = "tcg"
    allowed_domains = ["http://www.tcgplayer.com/"]
    start_urls = ["http://store.tcgplayer.com/magic/journey-into-nyx?PageNumber=1"]

    def parse(self, response):
        hxs = Selector(response)
        titles = hxs.xpath("//div[@class='magicCard']")
        for title in titles:
            item = Tcgplayer1Item()
            item["cardname"] = title.xpath(".//li[@class='cardName']/a/text()").extract()[0]
            vendor = title.xpath(".//tr[@class='vendor ']")
            item["price"] = vendor.xpath("normalize-space(.//td[@class='price']/text())").extract()
            item["quantity"] = vendor.xpath("normalize-space(.//td[@class='quantity']/text())").extract()
            item["shipping"] = vendor.xpath("normalize-space(.//span[@class='shippingAmount']/text())").extract()
            item["condition"] = vendor.xpath("normalize-space(.//td[@class='condition']/a/text())").extract()
            item["vendors"] = vendor.xpath("normalize-space(.//td[@class='seller']/a/text())").extract()
            yield item
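I want to scrape every page until the spider reaches the last one. Some listings have more pages than others, so it is hard to say in advance exactly where the page numbers end.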
Solution:
The idea is to keep incrementing pageNumber until no titles are found. If a page contains no titles, raise a CloseSpider exception to stop the spider:
from scrapy.exceptions import CloseSpider
from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.spiders import Spider

from tcgplayer1.items import Tcgplayer1Item

URL = "http://store.tcgplayer.com/magic/journey-into-nyx?pageNumber=%d"

class MySpider(Spider):
    name = "tcg"
    allowed_domains = ["tcgplayer.com"]
    start_urls = [URL % 1]

    def __init__(self, *args, **kwargs):
        # Let the base Spider finish its own setup before adding state.
        super().__init__(*args, **kwargs)
        self.page_number = 1

    def parse(self, response):
        print(self.page_number)
        print("----------")

        sel = Selector(response)
        titles = sel.xpath("//div[@class='magicCard']")
        # A page without any card blocks means we are past the last page.
        if not titles:
            raise CloseSpider('No more pages')

        for title in titles:
            item = Tcgplayer1Item()
            item["cardname"] = title.xpath(".//li[@class='cardName']/a/text()").extract()[0]
            vendor = title.xpath(".//tr[@class='vendor ']")
            item["price"] = vendor.xpath("normalize-space(.//td[@class='price']/text())").extract()
            item["quantity"] = vendor.xpath("normalize-space(.//td[@class='quantity']/text())").extract()
            item["shipping"] = vendor.xpath("normalize-space(.//span[@class='shippingAmount']/text())").extract()
            item["condition"] = vendor.xpath("normalize-space(.//td[@class='condition']/a/text())").extract()
            item["vendors"] = vendor.xpath("normalize-space(.//td[@class='seller']/a/text())").extract()
            yield item

        # Request the next page; parse() is the default callback.
        self.page_number += 1
        yield Request(URL % self.page_number)
This particular spider crawls all 8 pages of data and then stops.
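The stopping rule itself (increment the page number until a page comes back empty) can be tried out without Scrapy or network access. The following is only a minimal sketch of that rule: `crawl_until_empty` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real site by pretending there are 3 pages with 2 cards each.

```python
from typing import Callable, Iterator, List

URL = "http://store.tcgplayer.com/magic/journey-into-nyx?pageNumber=%d"


def crawl_until_empty(fetch_titles: Callable[[str], List[str]],
                      url_template: str = URL,
                      first_page: int = 1) -> Iterator[str]:
    """Yield titles page by page and stop at the first page that has none."""
    page_number = first_page
    while True:
        titles = fetch_titles(url_template % page_number)
        if not titles:
            # Plays the same role as raising CloseSpider in the spider.
            return
        for title in titles:
            yield title
        page_number += 1


def fake_fetch(url: str) -> List[str]:
    """Hypothetical stand-in for the site: pages 1-3 have cards, later ones do not."""
    page = int(url.rsplit("=", 1)[1])
    if page > 3:
        return []
    return ["card-%d-%d" % (page, i) for i in (1, 2)]


all_titles = list(crawl_until_empty(fake_fetch))
print(len(all_titles))  # 6
```

Note that this rule relies on the site returning a page with no card blocks once the page number goes past the end; if a site instead repeats its last page, a different stop condition would be needed.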
Hope that helps.
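One detail differs between the two spiders above: the question sets allowed_domains to the full URL "http://www.tcgplayer.com/", while the answer uses the bare domain "tcgplayer.com". Scrapy expects domain names in allowed_domains, and a bare domain also covers subdomains such as store.tcgplayer.com; how a full URL is handled depends on the Scrapy version, but it is not treated as a valid domain. As a rough illustration, a hypothetical helper (`to_allowed_domain` is not part of Scrapy) could reduce a URL to a domain using only the standard library. Stripping a leading "www." is a simplifying assumption, not a general rule for finding a registrable domain.

```python
from urllib.parse import urlparse


def to_allowed_domain(value: str) -> str:
    """Return a bare host name from either a full URL or a plain domain string."""
    # urlparse only fills in .hostname when the input has a scheme or starts with "//".
    parsed = urlparse(value if "//" in value else "//" + value)
    host = parsed.hostname or ""
    # Simplifying assumption: treat "www." as a removable prefix.
    return host[4:] if host.startswith("www.") else host


print(to_allowed_domain("http://www.tcgplayer.com/"))  # tcgplayer.com
```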