Examples of Using Processes, Threads, and Coroutines in Web Crawlers
Overview

This tutorial, collected by 互联网集市, walks through examples of using multiprocessing, multithreading, and coroutines in a web crawler. It runs about 3,575 characters, roughly a 6-minute read.
1. Multiprocessing
```python
import os
import time
import queue

import requests
from lxml import etree
from multiprocessing import Pool, Manager

# Example list-page URLs:
# http://www.mzitu.com/
# http://www.mzitu.com/page/2/
# http://www.mzitu.com/page/3/


def get_all_pages(base_url, q):
    # Enqueue every list page; the break limits this to one page while
    # testing - remove it to crawl all 196 pages.
    for i in range(196):
        url = base_url.format(i + 1)
        q.put((url, get_one_page))
        break


def get_one_page(url, q):
    # Parse one list page and enqueue every album link found on it.
    response = requests.get(url, timeout=3)
    html_ele = etree.HTML(response.text)
    href_list = html_ele.xpath('//ul[@id="pins"]/li/a/@href')
    for href in href_list:
        q.put((href, get_detailed_page))


def get_detailed_page(url, q):
    # Read the pager to find how many images the album has, then enqueue
    # one task per image page.
    response = requests.get(url, timeout=3)
    html_ele = etree.HTML(response.text)
    max_num = html_ele.xpath('//div[@class="pagenavi"]/a[last()-1]/span/text()')[0]
    for i in range(int(max_num)):
        detail_url = url + '/' + str(i + 1) + '/'
        q.put((detail_url, get_one_img_url))


def get_one_img_url(url, q):
    # Extract the image src from an image page. The download task carries
    # the page URL along so it can be sent as the Referer header.
    response = requests.get(url, timeout=3)
    html_ele = etree.HTML(response.text)
    print('current url =', url)
    src = html_ele.xpath('//div[@class="main-image"]//img/@src')[0]
    q.put((src, download_image, url))


def download_image(url, referer):
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
        'referer': referer,
    }
    file_dir = 'download'
    if not os.path.exists(file_dir):
        os.mkdir(file_dir)
    filename = file_dir + '/' + url.split('/')[-1]
    print('downloading ......', url)
    response = requests.get(url, headers=headers, timeout=3)
    with open(filename, 'wb') as f:
        f.write(response.content)


if __name__ == '__main__':
    pool = Pool(processes=16)
    # A Manager queue can be shared between the pool workers and the driver.
    q = Manager().Queue()
    start_time = time.time()
    base_url = 'http://www.mzitu.com/page/{}/'
    q.put((base_url, get_all_pages))
    while True:
        try:
            # If nothing arrives for 5 seconds, assume the crawl is done.
            item = q.get(timeout=5)
        except queue.Empty:
            break
        url, func = item[0], item[1]
        if len(item) == 3:
            # Three-element tasks are downloads carrying a referer.
            referer = item[2]
            pool.apply_async(func=func, args=(url, referer))
        else:
            pool.apply_async(func=func, args=(url, q))
    print('loops over')
    pool.close()
    pool.join()
    print('elapsed:', time.time() - start_time)
```
2. Multithreading
```python
import threading
import time


def run_women(name):
    print(name + ' is beautiful')
    time.sleep(10)
    print(name + " and Lao Wang's story")


def run_men(name):
    print(name + ' is handsome')
    time.sleep(7)
    print(name + " and Xiao San's story")


def run_conclusion():
    print("Why is a man's mistress called 小三 but a woman's lover called 老王?")
    print('Because 王 is 三 with one extra stroke')


# Pattern: create the thread (or process) objects, start them,
# then wait for them all to finish.
if __name__ == '__main__':
    # Threaded version: the three functions run concurrently,
    # so the total time is about 10 seconds.
    # t1 = threading.Thread(target=run_women, args=('萌萌',))
    # t2 = threading.Thread(target=run_men, args=('王琨',))
    # t3 = threading.Thread(target=run_conclusion)
    #
    # t1.start()
    # t2.start()
    # t3.start()
    #
    # t1.join()
    # t2.join()
    # t3.join()

    # Sequential version: one after another, about 17 seconds in total.
    run_women('萌萌')
    run_men('王琨')
    run_conclusion()
```
3. Coroutines (gevent)
```python
import time

import gevent
from gevent import monkey

# Patch the blocking standard-library calls before importing requests,
# so its socket operations yield cooperatively instead of blocking.
monkey.patch_all()

import requests


def get_content_from_url(url):
    print(url)
    response = requests.get(url, verify=False)
    print('received data size:', len(response.text), url)


# Sequential version for comparison: fetches the URLs one by one.
# if __name__ == '__main__':
#     start_time = time.time()
#     urlList = ['https://www.python.org', 'https://github.com/', 'https://www.taobao.com/']
#     for url in urlList:
#         get_content_from_url(url)
#
#     end_time = time.time()
#     print('elapsed:', end_time - start_time)

if __name__ == '__main__':
    start_time = time.time()
    gList = []
    urlList = ['https://www.python.org', 'https://github.com/', 'https://www.taobao.com/']
    for url in urlList:
        # spawn creates a greenlet; the three fetches overlap on network waits.
        g = gevent.spawn(get_content_from_url, url)
        gList.append(g)
    gevent.joinall(gList)
    end_time = time.time()
    print('elapsed:', end_time - start_time)
```