A Simple Python Crawler for Site Resource Links
This script crawls a site's download links and writes the results to a txt file.

If the target site has no hotlink protection, the download function in the script can also be used to fetch the files directly.

The script was written for a short-term, specific target; to crawl resource links with other patterns, you will need to adjust the configuration statements (the regexes and the base URL) yourself.

I'm a Python beginner, so corrections are welcome.
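The download function in the script below streams the response in chunks so large files never sit entirely in memory. The chunk-writing part can be isolated into a small helper, sketched here in Python 3 (the original script is Python 2); the `save_stream` name and the fake-chunk demo are illustrative, not part of the original:

```python
import os
import tempfile

def save_stream(chunk_iter, local_filename):
    """Write an iterable of byte chunks to disk, skipping empty chunks."""
    with open(local_filename, 'wb') as f:
        for chunk in chunk_iter:
            if chunk:  # requests yields empty chunks between keep-alive packets
                f.write(chunk)
    return local_filename

# With requests, this would be driven by (not run here):
#   r = requests.get(url, stream=True)
#   save_stream(r.iter_content(chunk_size=1024), local_filename)

# Self-contained demo with fake chunks instead of a network response:
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.bin')
save_stream([b'ab', b'', b'cd'], demo_path)
```

Separating the chunk loop from the HTTP call also makes the write logic testable without hitting the network.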
```python
# -*- coding: utf-8 -*-
# Python 2 script
import re
import urllib
import os
import urllib2
import requests
import time

# download one file via a streamed request
def download(page, url):
    local_filename = url.split('/')[-1] + page + '.jpg'
    r = requests.get(url, stream=True)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)
                f.flush()
    return local_filename

# download('123', 'http://cdn.woi3d.com/openfiles/86509/88619.jpg-preview')
# download('11', 'http://www.cnblogs.com/images/logo_small.gif')

# turn the matched href fragments into a list of absolute urls
def print_urls(urls):
    output_urls = []
    for link in urls:
        start_link = link.find('"')
        end_link = link.find('"', start_link + 1)
        output_link = link[start_link + 1:end_link]
        if output_link.find('http') == -1:
            # make relative links absolute
            output_link = 'http://www.woi3d.com' + output_link
        if link.count('"') > 2:
            continue
        else:
            output_urls.append(output_link)
    return output_urls

# extract title, preview image and download links from one item page
def output_download_link_page(page):
    url = page
    s = urllib.urlopen(url).read()
    urls = []
    img_urls = 'no image on' + page
    new_stl_urls = []
    title = re.findall(r'<h1>.+</h1>', s, re.I)
    if len(title) != 0:
        title = title[0]
    else:
        title = 'no title'
    img_urls = print_urls(re.findall(r'href=".*?\.jpg.*?"', s, re.I))
    if len(img_urls) != 0:
        img_urls = img_urls[0]
    else:
        img_urls = 'no image' + page
    stl_urls = print_urls(set(re.findall(r'href="/download/.*?"', s, re.I)))
    for url in stl_urls:
        # follow redirects to get the final download url
        url = requests.get(url).url
        new_stl_urls.append(url)
    urls.append(title)
    urls.append(img_urls)
    urls = urls + new_stl_urls
    return urls

# print output_download_link_page('http://www.woi3d.com/thing/46876')

# collect all item-page links from one listing page
def output_all_pages(site):
    s = urllib.urlopen(site).read()
    page = re.findall(r'href="/thing/.*?"', s, re.I)
    page = set(page)
    return print_urls(page)

# generate the listing-page urls to crawl
def generate_sites(start, end):
    sites = []
    for num in range(start, end):
        sites.append('http://www.woi3d.com/popular?query=&pg=' + str(num))
    return sites

# write all the results to a txt file
# ('w' instead of the original 'r+', which fails if 1.txt does not exist)
file_new = open('1.txt', 'w')
sites = generate_sites(40, 46)
count = 0
for site in sites:
    print site
    file_new.write('\n' + site)
    pages = output_all_pages(site)
    for page in pages:
        urls = output_download_link_page(page)
        # if len(urls) >= 10: continue
        count = count + 1
        for url in urls:
            file_new.write(url + '\n')
        print 'done'
        time.sleep(10)  # be polite to the server
file_new.close()
print 'all done. all..' + str(count) + '..models'
```
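The script's `print_urls` helper resolves relative links by plain string concatenation, which breaks for links like `../foo` or protocol-relative `//cdn...` URLs. A more robust Python 3 sketch uses `urllib.parse.urljoin`; the `extract_links` name, the sample HTML, and the default pattern are illustrative assumptions, not the original code:

```python
import re
from urllib.parse import urljoin

def extract_links(html, base_url, pattern=r'href="(/download/[^"]*)"'):
    """Find href values matching pattern and resolve them against base_url."""
    links = []
    for path in re.findall(pattern, html, re.I):
        # urljoin leaves absolute URLs untouched and resolves relative ones
        links.append(urljoin(base_url, path))
    return links

sample_html = '<a href="/download/123">a</a> <a href="/download/456">b</a>'
print(extract_links(sample_html, 'http://www.woi3d.com'))
# ['http://www.woi3d.com/download/123', 'http://www.woi3d.com/download/456']
```

Using a capture group in the regex also avoids the original's manual quote-searching with `find('"')`.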
Original: http://www.cnblogs.com/mingtan/p/6933755.html