How to tell a Python Scrapy spider to move on to the next start URL
I wrote a Scrapy spider that has many start_urls and extracts email addresses from those pages. The script takes a very long time to run, so I would like to tell Scrapy to stop crawling a particular site as soon as it finds an email address there, and move on to the next site.

Edit: code added below.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
import csv
from urlparse import urlparse

from entreprise.items import MailItem

class MailSpider(CrawlSpider):
    name = "mail"

    start_urls = []
    allowed_domains = []

    with open('scraped_data.csv', 'rb') as csvfile:
        reader = csv.reader(csvfile, delimiter=',', quotechar='"')
        next(reader)
        for row in reader:
            url = row[5].strip()
            if url.strip() != "":
                start_urls.append(url)
                fragments = urlparse(url).hostname.split(".")
                hostname = ".".join(len(fragments[-2]) < 4 and fragments[-3:] or fragments[-2:])
                allowed_domains.append(hostname)

    rules = [
        Rule(SgmlLinkExtractor(allow=('.+')), follow=True, callback='parse_item'),
        Rule(SgmlLinkExtractor(allow=('.+')), callback='parse_item')
    ]

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        for mail in hxs.select('//body//text()').re(r'[\w.-]+@[\w.-]+'):
            item = MailItem()
            item['url'] = response.url
            item['mail'] = mail
            items.append(item)
        return items
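As an aside, the `len(fragments[-2]) < 4 and fragments[-3:] or fragments[-2:]` expression above is a heuristic for recovering the registrable domain: when the second-to-last label is short (e.g. `co` in `example.co.uk`) it assumes a two-part public suffix and keeps three labels, otherwise two. A standalone sketch (Python 3, using `urllib.parse` in place of the Python 2 `urlparse` module; the URLs are illustrative):

```python
from urllib.parse import urlparse

def registrable_domain(url):
    # Split the hostname into its dot-separated labels.
    fragments = urlparse(url).hostname.split(".")
    # Short second-to-last label (e.g. "co" in "example.co.uk") suggests a
    # two-part suffix, so keep three labels; otherwise keep two.
    return ".".join(fragments[-3:] if len(fragments[-2]) < 4 else fragments[-2:])

print(registrable_domain("http://www.example.com"))    # example.com
print(registrable_domain("http://www.example.co.uk"))  # example.co.uk
```

Note this heuristic misfires on short legitimate second-level domains (e.g. `bt.com`); a public-suffix list would be more robust, but the original code does not use one.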
Solution:

The idea is to use the start_requests method to decide which URLs to crawl next, and to track, in the class-level parsed_hostnames set, whether an email has already been found for a given hostname.

I have also changed how the hostname is obtained from the URL; it now uses urlparse.
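The filtering idea can be sketched independently of Scrapy: keep a set of hostnames that have already yielded an email, and skip any start URL whose hostname is in that set (Python 3 sketch; the URLs are illustrative):

```python
from urllib.parse import urlparse

parsed_hostnames = set()  # hostnames for which an email was already found

def urls_to_crawl(start_urls):
    # Yield only URLs whose hostname has not produced an email yet.
    for url in start_urls:
        hostname = urlparse(url).hostname
        if hostname not in parsed_hostnames:
            yield url

# Simulate: an email was already found on example.org, so it is skipped.
parsed_hostnames.add("example.org")
urls = ["http://example.org/contact", "http://example.com/about"]
print(list(urls_to_crawl(urls)))  # ['http://example.com/about']
```

In the spider below the same check happens inside start_requests, and parse_item populates the set as emails are found.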
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
import csv
from urlparse import urlparse

class MailItem(Item):
    url = Field()
    mail = Field()

class MailSpider(CrawlSpider):
    name = "mail"

    parsed_hostnames = set()
    allowed_domains = []

    rules = [
        Rule(SgmlLinkExtractor(allow=('.+')), follow=True, callback='parse_item'),
        Rule(SgmlLinkExtractor(allow=('.+')), callback='parse_item')
    ]

    def start_requests(self):
        with open('scraped_data.csv', 'rb') as csvfile:
            reader = csv.reader(csvfile, delimiter=',', quotechar='"')
            next(reader)
            for row in reader:
                url = row[5].strip()
                if url:
                    hostname = urlparse(url).hostname
                    if hostname not in self.parsed_hostnames:
                        if hostname not in self.allowed_domains:
                            self.allowed_domains.append(hostname)
                            self.rules[0].link_extractor.allow_domains.add(hostname)
                            self.rules[1].link_extractor.allow_domains.add(hostname)
                        yield self.make_requests_from_url(url)
                    else:
                        self.allowed_domains.remove(hostname)
                        self.rules[0].link_extractor.allow_domains.remove(hostname)
                        self.rules[1].link_extractor.allow_domains.remove(hostname)

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        for mail in hxs.select('//body//text()').re(r'[\w.-]+@[\w.-]+'):
            item = MailItem()
            item['url'] = response.url
            item['mail'] = mail
            items.append(item)
        hostname = urlparse(response.url).hostname
        self.parsed_hostnames.add(hostname)
        return items
This should work, at least in theory. Hope it helps.
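The email pattern used in parse_item can be exercised on its own. Note that `[\w.-]+@[\w.-]+` is deliberately loose: because the character class includes the dot, it also swallows trailing punctuation, as the second match below shows (the sample text is illustrative):

```python
import re

# Same pattern the spider applies to the page's body text.
EMAIL_RE = re.compile(r'[\w.-]+@[\w.-]+')

text = "Contact us at info@example.com or sales@example.co.uk."
print(EMAIL_RE.findall(text))
# ['info@example.com', 'sales@example.co.uk.']
```

Stripping trailing dots from each match, or using a stricter pattern, would clean this up, but the spider above stores the matches as-is.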