Python - scraping pages from a second set of links
I've been working through the Scrapy documentation today, trying to get a working version of the first-spider example - https://docs.scrapy.org/en/latest/intro/tutorial.html#our-first-spider - on a real-world site. My example is slightly different in that it has two levels of "next" pages:

start_url > city page > unit page

It is the unit pages I want to get data from.

My code:
```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://www.unitestudents.com/',
    ]

    def parse(self, response):
        for quote in response.css('div.property-body'):
            yield {
                'name': quote.xpath('//span/a/text()').extract(),
                'type': quote.xpath('//div/h4/text()').extract(),
                'price_amens': quote.xpath('//div/p/text()').extract(),
                'distance_beds': quote.xpath('//li/p/text()').extract()
            }

        # Purpose is to crawl links of cities
        next_page = response.css('a.listing-item__link::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

        # Purpose is to crawl links of units
        next_unit_page = response.css(response.css('a.text-highlight__inner::attr(href)').extract_first())
        if next_unit_page is not None:
            next_unit_page = response.urljoin(next_unit_page)
            yield scrapy.Request(next_unit_page, callback=self.parse)
```
But when I run it, I get:

```
INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
```

So I assume my code is not set up to retrieve the links in the flow above, but I'm not sure how best to do that?
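One thing worth noting about the loop above: in Scrapy selectors, an XPath that starts with `//` searches the whole document even when called on a sub-selection like `quote`, so every iteration extracts the same document-wide lists; a relative expression starts with `.//`. A minimal sketch of that scoping difference, using the standard library's ElementTree on made-up markup:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for a listing page
html = """
<root>
  <div class="property-body"><span><a>Unit A</a></span></div>
  <div class="property-body"><span><a>Unit B</a></span></div>
</root>
"""

tree = ET.fromstring(html)
blocks = tree.findall('.//div[@class="property-body"]')
# './/' is relative to each matched element, so each block
# only sees its own link text, not the whole document's.
names = [[a.text for a in block.findall('.//span/a')] for block in blocks]
print(names)
```

With document-rooted paths, both inner lists would instead contain every name on the page.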
Updated flow:

Main page > City page > Building page > Unit page

It is still the unit pages I want to get data from.

Updated code:
```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://www.unitestudents.com/',
    ]

    def parse(self, response):
        for quote in response.css('div.site-wrapper'):
            yield {
                'area_name': quote.xpath('//div/ul/li/a/span/text()').extract(),
                'type': quote.xpath('//div/div/div/h1/span/text()').extract(),
                'period': quote.xpath('/html/body/div/div/section/div/form/h4/span/text()').extract(),
                'duration_weekly': quote.xpath('//html/body/div/div/section/div/form/div/div/em/text()').extract(),
                'guide_total': quote.xpath('//html/body/div/div/section/div/form/div/div/p/text()').extract(),
                'amenities': quote.xpath('//div/div/div/ul/li/p/text()').extract(),
            }

        # Purpose is to crawl links of cities
        next_page = response.xpath('//html/body/div/footer/div/div/div/ul/li/a[@class="listing-item__link"]/@href').extract()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

        # Purpose is to crawl links of units
        next_unit_page = response.xpath('//li/div/h3/span/a/@href').extract()
        if next_unit_page is not None:
            next_unit_page = response.urljoin(next_unit_page)
            yield scrapy.Request(next_unit_page, callback=self.parse)

        # Purpose is to crawl pages of full unit info
        last_unit_page = response.xpath('//div/div/div[@class="content__btn"]/a/@href').extract()
        if last_unit_page is not None:
            last_unit_page = response.urljoin(last_unit_page)
            yield scrapy.Request(last_unit_page, callback=self.parse)
```
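A note on the link-following blocks in the updated code: `.extract()` returns a list (an empty list when nothing matches, never None, so the `is not None` guards never skip), while `response.urljoin` expects a single string; Scrapy's `Response.urljoin` essentially delegates to the standard library's `urllib.parse.urljoin`. A small sketch with hypothetical hrefs:

```python
from urllib.parse import urljoin

base = 'http://www.unitestudents.com/'
hrefs = ['/manchester/unit-1', '/manchester/unit-2']  # what .extract() returns (hypothetical values)

# Take the first match, as extract_first() would, and guard against no matches
first = hrefs[0] if hrefs else None
absolute = urljoin(base, first) if first else None
print(absolute)
```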
Solution:

Let's start with the logic:

1. Scrape the home page - get all the cities
2. Scrape the city pages - get all the unit URLs
3. Scrape the unit pages - get all the desired data

I've given an example of how you could implement this in the Scrapy spider below. I could not find all the information you mention in your example code, but I hope the code is clear enough for you to understand what it does and how to add the information you need.
```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://www.unitestudents.com/',
    ]

    # Step 1
    def parse(self, response):
        # Select all cities listed in the dropdown (excluding the "Select your city" option)
        for city in response.xpath('//select[@id="frm_homeSelect_city"]/option[not(contains(text(),"Select your city"))]/text()').extract():
            yield scrapy.Request(response.urljoin("/" + city), callback=self.parse_citypage)

    # Step 2
    def parse_citypage(self, response):
        # Select the url of each property
        for url in response.xpath('//div[@class="property-header"]/h3/span/a/@href').extract():
            yield scrapy.Request(response.urljoin(url), callback=self.parse_unitpage)
        # I could not find any pagination. Otherwise it would go here.

    # Step 3
    def parse_unitpage(self, response):
        unitTypes = response.xpath('//div[@class="room-type-block"]/h5/text()').extract() + \
            response.xpath('//h4[@class="content__header"]/text()').extract()
        # There can be multiple unit types, so we yield an item for each unit type we can find
        for unitType in unitTypes:
            yield {
                'name': response.xpath('//h1/span/text()').extract_first(),
                'type': unitType,
                # 'price': response.xpath('XPATH GOES HERE'),  # Could not find a price on the page
                # 'distance_beds': response.xpath('XPATH GOES HERE')  # Could not find such info
            }
```
I think the code is pretty clean and simple. The comments should clarify why I chose to use the for loops. If anything is unclear, let me know and I'll try to explain it.
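The one-item-per-unit-type idea in Step 3 can be sketched without Scrapy: given one page-level name and several matched unit types, the loop yields one dict per type (all values here are hypothetical stand-ins for the extracted text):

```python
name = 'Unite Students Manchester'   # hypothetical page-level value (extract_first result)
unit_types = ['En-suite', 'Studio']  # hypothetical per-page matches

# One item per unit type, each carrying the shared page-level fields
items = [{'name': name, 'type': unit_type} for unit_type in unit_types]
for item in items:
    print(item)
```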