首页 / PYTHON / python-网站抓取和屏幕截图
python-网站抓取和屏幕截图
内容导读
互联网集市收集整理的这篇技术教程文章主要介绍了python-网站抓取和屏幕截图,小编现在分享给大家,供广大互联网技能从业者学习和参考。文章包含2703字,纯文字阅读大概需要4分钟。
内容图文
![python-网站抓取和屏幕截图](/upload/InfoBanner/zyjiaocheng/679/9f96086d45a84cefb2f62a282d532151.jpg)
我正在使用scrapy抓取一个网站,并将内部/外部链接存储在我的i??tems类中.
有什么办法可以取消链接时捕获的屏幕快照?
注意:该网站具有登录授权表.
我的代码(spider.py)
from scrapy.spider import BaseSpider
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.selector import HtmlXPathSelector
from tutorial.items import DmozItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
import urlparse
from scrapy import log
class MySpider(CrawlSpider):
items = []
failed_urls = []
duplicate_responses = []
name = 'myspiders'
allowed_domains = ['someurl.com']
login_page = 'someurl.com/login_form'
start_urls = 'someurl.com/'
rules = [Rule(SgmlLinkExtractor(deny=('logged_out', 'logout',)), follow=True, callback='parse_start_url')]
def start_requests(self):
yield Request(
url=self.login_page,
callback=self.login,
dont_filter=False
)
def login(self, response):
"""Generate a login request."""
return FormRequest.from_response(response,
formnumber=1,
formdata={'username': 'username', 'password': 'password' },
callback=self.check_login_response)
def check_login_response(self, response):
"""Check the response returned by a login request to see if we are
successfully logged in.
"""
if "Logout" in response.body:
self.log("Successfully logged in. Let's start crawling! :%s" % response, level=log.INFO)
self.log("Response Url : %s" % response.url, level=log.INFO)
yield Request(url=self.start_urls)
else:
self.log("Bad times :(", loglevel=log.INFO)
def parse_start_url(self, response):
# Scrape data from page
hxs = HtmlXPathSelector(response)
self.log('response came in from : %s' % (response), level=log.INFO)
# check for some important page to crawl
if response.url == 'someurl.com/medical/patient-info' :
self.log('yes I am here', level=log.INFO)
urls = hxs.select('//a/@href').extract()
urls = list(set(urls))
for url in urls :
self.log('URL extracted : %s' % url, level=log.INFO)
item = DmozItem()
if response.status == 404 or response.status == 500:
self.failed_urls.append(response.url)
self.log('failed_url : %s' % self.failed_urls, level=log.INFO)
item['failed_urls'] = self.failed_urls
else :
if url.startswith('http') :
if url.startswith('someurl.com'):
item['internal_link'] = url
# Need to capture screenshot of the extracted url here
self.log('internal_link :%s' % url, level=log.INFO)
else :
item['external_link'] = url
# Need to capture screenshot of the extracted url here
self.log('external_link :%s' % url, level=log.INFO)
self.items.append(item)
self.items = list(set(self.items))
return self.items
else :
self.log('did not recieved expected response', level=log.INFO)
更新:我正在使用虚拟机(通过腻子登录)
解决方法:
您可以看一下splash这样的渲染服务器
内容总结
以上是互联网集市为您收集整理的python-网站抓取和屏幕截图全部内容,希望文章能够帮你解决python-网站抓取和屏幕截图所遇到的程序开发问题。 如果觉得互联网集市技术教程内容还不错,欢迎将互联网集市网站推荐给程序员好友。
内容备注
版权声明:本文内容由互联网用户自发贡献,该文观点与技术仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 gblab@vip.qq.com 举报,一经查实,本站将立刻删除。
内容手机端
扫描二维码推送至手机访问。