python – Scraping internal links with Beautiful Soup
I wrote a Python script that fetches the web page at a given URL and parses every link on that page into a link repository. It then fetches the content of any URL from the repository it just built, parses the links in that new content into the repository, and repeats this process for every link in the repository until it stops or a given number of links has been fetched.

Here is the code:
import BeautifulSoup
import urllib2
import itertools
import random

class Crawler(object):
    """docstring for Crawler"""

    def __init__(self):
        self.soup = None                              # Beautiful Soup object
        self.current_page = "http://www.python.org/"  # Current page's address
        self.links = set()                            # Queue with every links fetched
        self.visited_links = set()
        self.counter = 0                              # Simple counter for debug purpose

    def open(self):
        # Open url
        print self.counter, ":", self.current_page
        res = urllib2.urlopen(self.current_page)
        html_code = res.read()
        self.visited_links.add(self.current_page)

        # Fetch every links
        self.soup = BeautifulSoup.BeautifulSoup(html_code)
        page_links = []
        try:
            page_links = itertools.ifilter(  # Only deal with absolute links
                lambda href: 'http://' in href,
                (a.get('href') for a in self.soup.findAll('a')))
        except Exception:  # Magnificent exception handling
            pass

        # Update links
        self.links = self.links.union(set(page_links))

        # Choose a random url from non-visited set
        self.current_page = random.sample(self.links.difference(self.visited_links), 1)[0]
        self.counter += 1

    def run(self):
        # Crawl 3 webpages (or stop if all url has been fetched)
        while len(self.visited_links) < 3 or (self.visited_links == self.links):
            self.open()

        for link in self.links:
            print link

if __name__ == '__main__':
    C = Crawler()
    C.run()
This code does not extract internal links (only absolute, fully-formed hyperlinks).

How can I also fetch internal links that start with "/", "#", or "."?

Solution:
Well, your code already tells you what is happening. In your lambda you only grab absolute links that start with http:// (and you are not grabbing https, FWIW). You should grab all of the links and check whether they start with http. If they don't, they are relative links, and since you know what current_page is, you can use it to build an absolute link.
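As an aside, the standard library already implements this base-plus-relative resolution: urlparse.urljoin in Python 2, urllib.parse.urljoin in Python 3. A minimal Python 3 sketch (the base URL here is just an example value) shows how it treats each kind of href the question asks about:

```python
from urllib.parse import urljoin

base = "http://www.python.org/about/"

# Absolute links pass through unchanged; relative ones are
# resolved against the current page's URL.
print(urljoin(base, "http://example.com/x"))  # http://example.com/x
print(urljoin(base, "/downloads/"))           # http://www.python.org/downloads/
print(urljoin(base, "#content"))              # http://www.python.org/about/#content
print(urljoin(base, "help.html"))             # http://www.python.org/about/help.html
```

Note that urljoin also handles cases the manual string concatenation below gets wrong, such as "../" paths and page-relative fragments.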
Here is a modification of your code. Excuse my Python, it is a little rusty, but I ran it and it works for me on Python 2.7. You will want to clean it up and add some edge/error detection, but you get the gist:
#!/usr/bin/python
from bs4 import BeautifulSoup
import urllib2
import itertools
import random
import urlparse

class Crawler(object):
    """docstring for Crawler"""

    def __init__(self):
        self.soup = None                              # Beautiful Soup object
        self.current_page = "http://www.python.org/"  # Current page's address
        self.links = set()                            # Queue with every links fetched
        self.visited_links = set()
        self.counter = 0                              # Simple counter for debug purpose

    def open(self):
        # Open url
        print self.counter, ":", self.current_page
        res = urllib2.urlopen(self.current_page)
        html_code = res.read()
        self.visited_links.add(self.current_page)

        # Fetch every links
        self.soup = BeautifulSoup(html_code)
        page_links = []
        try:
            for link in [h.get('href') for h in self.soup.find_all('a')]:
                if not link:  # Skip <a> tags without an href attribute
                    continue
                print "Found link: '" + link + "'"
                if link.startswith('http'):
                    page_links.append(link)
                    print "Adding link " + link + "\n"
                elif link.startswith('/'):
                    # Host-relative link: prepend the current page's scheme and host
                    parts = urlparse.urlparse(self.current_page)
                    page_links.append(parts.scheme + '://' + parts.netloc + link)
                    print "Adding link " + parts.scheme + '://' + parts.netloc + link + "\n"
                else:
                    # Page-relative link (e.g. '#anchor' or 'page.html')
                    page_links.append(self.current_page + link)
                    print "Adding link " + self.current_page + link + "\n"
        except Exception, ex:  # Magnificent exception handling
            print ex

        # Update links
        self.links = self.links.union(set(page_links))

        # Choose a random url from non-visited set
        not_visited = self.links.difference(self.visited_links)
        if not not_visited:  # Every fetched url has been visited already
            return
        self.current_page = random.sample(not_visited, 1)[0]
        self.counter += 1

    def run(self):
        # Crawl 3 webpages (or stop once every fetched url has been visited)
        while len(self.visited_links) < 3:
            self.open()
            if self.visited_links == self.links:
                break

        for link in self.links:
            print link

if __name__ == '__main__':
    C = Crawler()
    C.run()
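For comparison, under Python 3 the whole startswith branching above collapses into a single urljoin call. A minimal standard-library sketch (the LinkCollector class name, page snippet, and base URL are made up for illustration, and html.parser stands in for Beautiful Soup so the example needs no third-party install or network access):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect every <a href> in a page and resolve it against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # urljoin handles absolute, host-relative, page-relative,
                # and fragment-only hrefs uniformly
                self.links.add(urljoin(self.base_url, href))

snippet = '<a href="/doc/">Docs</a> <a href="#top">Top</a> <a href="http://example.com/">Ext</a>'
collector = LinkCollector("http://www.python.org/about/")
collector.feed(snippet)
print(sorted(collector.links))
```

The same resolved-link set would then feed the crawler's `links` queue in place of the three hand-written cases.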