Web Scraping: The pyquery Library
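Official documentation: https://pyquery.readthedocs.io/en/latest/
PyQuery is a powerful and flexible HTML parsing library. If you find regular expressions too tedious to write and BeautifulSoup's syntax hard to remember, and you already know jQuery syntax, PyQuery is an excellent choice.
1. Getting Started
Initializing from a string:

    from pyquery import PyQuery as pq

    d = pq("<html>hahaha</html>")  # d now behaves like jQuery's $
    print(d("html"))

Initializing from a URL:

    from pyquery import PyQuery as pq

    d = pq(url="https://www.baidu.com")
    print(d("head"))

Initializing from a file:

    from pyquery import PyQuery as pq

    d = pq(filename="demo.html")  # filename is the path to the file
    print(d("head"))
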
2. Basic CSS Selectors

    from pyquery import PyQuery as pq

    html = """
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    """

    d = pq(html)
    print(d("#container .list li"))

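3. Finding Elements
Child elements
d("css selector").find("li")
Note that find() searches all descendants of the selection, not only its direct children; children() is limited to direct children.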

    from pyquery import PyQuery as pq

    html = """
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    """

    d = pq(html)
    items = d(".list")
    print(type(items))  # <class 'pyquery.pyquery.PyQuery'>
    li = items.find("li")
    print(type(li))  # <class 'pyquery.pyquery.PyQuery'>
    print(li)
    """
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    """

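Parent elements
d("css selector").parent(<optional css selector>) returns the direct parent; d("css selector").parents(<optional css selector>) returns every ancestor. The examples below use parents().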

    from pyquery import PyQuery as pq

    html = """
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    """

    d = pq(html)
    items = d(".list")
    parents = items.parents()
    print(parents)
    """
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    """

Passing a selector to parents() keeps only the ancestors that match it:

    from pyquery import PyQuery as pq

    html = """
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    """

    d = pq(html)
    items = d(".list")
    parents = items.parents(".wrap")
    print(parents)
    """
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    """

Sibling elements
d("css selector").siblings(<optional css selector>)

    from pyquery import PyQuery as pq

    html = """
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    """

    d = pq(html)
    li = d(".list .item-0.active")
    print(li.siblings())
    """
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0">first item</li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    """

siblings() also accepts a selector to filter the result:

    from pyquery import PyQuery as pq

    html = """
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    """

    d = pq(html)
    li = d(".list .item-0.active")
    print(li.siblings(".active"))
    """
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    """

4. Iteration

    from pyquery import PyQuery as pq

    html = """
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    """

    d = pq(html)
    li = d("li").items()
    print(type(li))  # <class 'generator'>
    for i in li:
        print(i)
    """
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    """

5. Extracting Information
Getting attributes

    from pyquery import PyQuery as pq

    html = """
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    """

    d = pq(html)
    a = d(".item-0.active a")
    print(a.attr("href"))  # method-call form: link3.html
    print(a.attr.href)     # attribute-style form, same result: link3.html

Getting text

    from pyquery import PyQuery as pq

    html = """
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    """

    d = pq(html)
    a = d(".item-0.active a")
    print(a.text())
    """
    third item
    """

Getting HTML

    from pyquery import PyQuery as pq

    html = """
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    """

    d = pq(html)
    li = d(".item-0.active")
    print(li)
    print(li.html())
    """
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <a href="link3.html"><span class="bold">third item</span></a>
    """

6. DOM Manipulation
addClass(), removeClass()

    from pyquery import PyQuery as pq

    html = """
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    """

    d = pq(html)
    li = d(".item-0.active")
    print(li)
    li.removeClass("active")
    print(li)
    li.addClass("active")
    print(li)
    """
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    """

attr(), css()

    from pyquery import PyQuery as pq

    html = """
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    """

    d = pq(html)
    li = d(".item-0.active")
    print(li)
    li.attr("name", "link")
    print(li)
    li.css("font-size", "14px")
    print(li)
    """
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-0 active" name="link"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-0 active" name="link" style="font-size: 14px"><a href="link3.html"><span class="bold">third item</span></a></li>
    """

remove()

    from pyquery import PyQuery as pq

    html = """
    <div class="wrap">
        Hello, World.
        <p>This is a paragraph.</p>
    </div>
    """

    d = pq(html)
    wrap = d(".wrap")
    print(wrap.text())
    """
    Hello, World.
    This is a paragraph.
    """
    wrap.find("p").remove()
    print(wrap.text())  # Hello, World.

Other DOM methods
https://pyquery.readthedocs.io/en/latest/api.html
7. Pseudo-class Selectors

    from pyquery import PyQuery as pq

    html = """
    <div class="wrap">
        <div id="container">
            <ul class="list">
                <li class="item-0">first item</li>
                <li class="item-1"><a href="link2.html">second item</a></li>
                <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
                <li class="item-1 active"><a href="link4.html">fourth item</a></li>
                <li class="item-0"><a href="link5.html">fifth item</a></li>
            </ul>
        </div>
    </div>
    """

    d = pq(html)

    li = d("li:first-child")
    print(li)  # <li class="item-0">first item</li>

    li = d("li:last-child")
    print(li)  # <li class="item-0"><a href="link5.html">fifth item</a></li>

    li = d("li:nth-child(2)")
    print(li)  # <li class="item-1"><a href="link2.html">second item</a></li>

    li = d("li:gt(2)")  # zero-based index greater than 2
    print(li)
    """
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
    """

    li = d("li:nth-child(2n)")  # elements at even positions (counting from 1)
    print(li)
    """
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    """

    li = d("li:contains(second)")  # match by text: elements whose text contains "second"
    print(li)  # <li class="item-1"><a href="link2.html">second item</a></li>

More selectors: http://www.w3school.com.cn/cssref/css_selectors.asp
Original article: https://www.cnblogs.com/believepd/p/10657877.html