A Look at Multiprocess Development Through a Web Crawler
Introduction
I need news material as reference for English applied-writing practice, but the only well-organized source of such information I could think of was the newspaper itself, so I decided to download the PDF edition of the People's Daily (人民日报).
References
https://www.liaoxuefeng.com/wiki/1016959663602400/1017628290184064
https://blog.csdn.net/qq_38161040/article/details/88366427
https://blog.csdn.net/baidu_28479651/article/details/76158051?utm_source=blogxgwz7
Code: version 1
About 70% manual and 30% automatic: you still have to create the folders yourself and keep editing the download count by hand.
# coding = UTF-8
# Crawl the PDF documents linked from a hand-written HTML page: file:///E:/ZjuTH/Documents/pythonCode/pythontest.html
import urllib.request
import re
import os

# open the url and read it
def getHtml(url):
    page = urllib.request.urlopen(url)
    html = page.read()
    page.close()
    return html

# compile the regular expression and find all the stuff we need
def getUrl(html):
    reg = r'([A-Z]\d+)'  # matches IDs such as G176200001
    url_re = re.compile(reg)
    url_lst = url_re.findall(html.decode('UTF-8'))  # return the list of matches
    return url_lst

def getFile(url):
    file_name = url.split('/')[-1]
    u = urllib.request.urlopen(url)
    f = open(file_name, 'wb')
    block_sz = 8192
    while True:
        buffer = u.read(block_sz)
        if not buffer:
            break
        f.write(buffer)
    f.close()
    print("Successfully downloaded " + file_name)

if __name__ == '__main__':
    # Page URLs follow this pattern:
    # http://paper.people.com.cn/rmrb/page/2020-03/26/02/rmrb2020032602.pdf
    # http://paper.people.com.cn/rmrb/page/2020-03/26/03/rmrb2020032603.pdf
    for i in range(20):
        if i + 1 < 10:
            getFile("http://paper.people.com.cn/rmrb/page/2020-03/07/0" + str(i + 1) + "/rmrb202003070" + str(i + 1) + ".pdf")
        else:
            getFile("http://paper.people.com.cn/rmrb/page/2020-03/07/" + str(i + 1) + "/rmrb20200307" + str(i + 1) + ".pdf")
Code: version 2, with automatic folder creation
Downloads still run one after another, so it is slow and you have to wait.
# coding = UTF-8
# Crawl the PDF documents linked from a hand-written HTML page: file:///E:/ZjuTH/Documents/pythonCode/pythontest.html
import urllib.request
import re
import os
import shutil

# open the url and read it
def getHtml(url):
    page = urllib.request.urlopen(url)
    html = page.read()
    page.close()
    return html

# compile the regular expression and find all the stuff we need
def getUrl(html):
    reg = r'([A-Z]\d+)'  # matches IDs such as G176200001
    url_re = re.compile(reg)
    url_lst = url_re.findall(html.decode('UTF-8'))  # return the list of matches
    return url_lst

def getFile(url):
    file_name = url.split('/')[-1]
    u = urllib.request.urlopen(url)
    f = open(file_name, 'wb')
    block_sz = 8192
    while True:
        buffer = u.read(block_sz)
        if not buffer:
            break
        f.write(buffer)
    f.close()
    print("Successfully downloaded " + file_name)
    return file_name

if __name__ == '__main__':
    for i in range(29):  # the 29 days of February 2020
        data = str(i + 1)
        if i + 1 < 10:
            data = "0" + data
        folderName = "02" + data  # one folder per day, e.g. "0201" ... "0229"
        os.mkdir(folderName)
        for j in range(20):  # try up to 20 pages per issue
            try:
                if j + 1 < 10:
                    fileName = "http://paper.people.com.cn/rmrb/page/2020-02/" + data + "/0" + str(j + 1) + "/rmrb202002" + data + "0" + str(j + 1) + ".pdf"
                else:
                    fileName = "http://paper.people.com.cn/rmrb/page/2020-02/" + data + "/" + str(j + 1) + "/rmrb202002" + data + str(j + 1) + ".pdf"
                tmp = getFile(fileName)
                shutil.move(tmp, folderName)
            except OSError:
                continue  # missing pages raise an OSError; skip them
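Why the bare except OSError works: most issues have fewer than 20 pages, and for the missing page numbers urlopen raises urllib.error.HTTPError, which is a subclass of OSError, so the handler simply skips them and moves on to the next page. A small sketch of that mechanism, using a hypothetical helper that is not in the original code:

import urllib.error
import urllib.request

def page_exists(url):
    # urlopen raises HTTPError (an OSError subclass) for a missing page,
    # which is exactly what the bare "except OSError" above relies on.
    try:
        urllib.request.urlopen(url).close()
        return True
    except urllib.error.HTTPError:
        return False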
Code: multiprocess download
Much more satisfying: a multiprocessing.Pool gives each day of the month its own worker process, so the issues download in parallel.
# coding = UTF-8
# Crawl the PDF documents linked from a hand-written HTML page: file:///E:/ZjuTH/Documents/pythonCode/pythontest.html
import urllib.request
import re
import os
import shutil
from multiprocessing import Pool
import time

# open the url and read it
def getHtml(url):
    page = urllib.request.urlopen(url)
    html = page.read()
    page.close()
    return html

# compile the regular expression and find all the stuff we need
def getUrl(html):
    reg = r'([A-Z]\d+)'  # matches IDs such as G176200001
    url_re = re.compile(reg)
    url_lst = url_re.findall(html.decode('UTF-8'))  # return the list of matches
    return url_lst

def getFile(url):
    file_name = url.split('/')[-1]
    u = urllib.request.urlopen(url)
    f = open(file_name, 'wb')
    block_sz = 8192
    while True:
        buffer = u.read(block_sz)
        if not buffer:
            break
        f.write(buffer)
    f.close()
    print("Successfully downloaded " + file_name)
    return file_name

# download every page of one day's issue; i = 0..30 maps to January 1..31
def download(i):
    data = str(i + 1)
    if i + 1 < 10:
        data = "0" + data
    folderName = "01" + data  # one folder per day, e.g. "0101" ... "0131"
    os.mkdir(folderName)
    for j in range(20):  # try up to 20 pages per issue
        try:
            if j + 1 < 10:
                fileName = "http://paper.people.com.cn/rmrb/page/2020-01/" + data + "/0" + str(j + 1) + "/rmrb202001" + data + "0" + str(j + 1) + ".pdf"
            else:
                fileName = "http://paper.people.com.cn/rmrb/page/2020-01/" + data + "/" + str(j + 1) + "/rmrb202001" + data + str(j + 1) + ".pdf"
            tmp = getFile(fileName)
            shutil.move(tmp, folderName)
        except OSError:
            continue  # missing pages raise an OSError; skip them

if __name__ == '__main__':
    p = Pool(31)  # one worker process per day of January
    for i in range(31):
        p.apply_async(download, args=(i,))
    p.close()  # no more tasks will be submitted
    p.join()   # wait for all workers to finish
    print('All subprocesses done.')
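Since each download(i) is independent and the work is network I/O rather than CPU, the same parallelism can also come from threads, which are cheaper to start than 31 separate processes. Below is a sketch only: it reuses the download function defined above, and the worker count of 8 is an arbitrary assumption.

from concurrent.futures import ThreadPoolExecutor

if __name__ == '__main__':
    # Threads are enough here: the bottleneck is the network, not the CPU.
    with ThreadPoolExecutor(max_workers=8) as pool:
        pool.map(download, range(31))  # download(i) as defined above
    print('All downloads done.')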