python - How to extract the text after a tag with BeautifulSoup
I can't figure out how to reach the next paragraph with BeautifulSoup and extract the specific text I need. I'm new to Python and BS4.
My HTML is as follows:
<div class="inner-content">
<div class="bred"></div>
<div class="clrbth"></div>
<h1></h1>
<h4></h4>
...
...
...
<p></p>
<p></p>
<p>
<!--This text I don't want -->
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.
<br></br>
<!-- The text I want to extract using BeautifulSoup-->
It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Various versions have evolved over the years, sometimes by accident, sometimes on purpose (injected humour and the like).
</p>
<p></p>
<p></p>
...
...
...
<div class="bred"></div>
<div class="clrbth"></div>
<h1></h1>
</div>
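Please tell me how to extract the text marked above from this HTML. Thanks.
Solution:
You can use the find_all() method with the limit argument to get the third p tag in the HTML. Next, use .find() to return the first br tag inside that third paragraph. From there, the .next_siblings attribute gives a generator over the nodes that follow the br, and str.join() concatenates them into a single string.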
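>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'html.parser')  # html: the markup from the question
>>> third_p = soup.find_all('p', limit=3)[-1]
>>> ''.join(third_p.find('br').next_siblings)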
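One caveat with joining the siblings directly: BeautifulSoup's Comment objects are string subclasses, so the text of the HTML comment that follows the br ends up in the joined result, and the join raises a TypeError if any sibling is a tag rather than a string. The following is a minimal sketch of one way to handle both cases, not a definitive implementation. The helper name text_after_first_br and the shortened HTML sample are assumptions made for illustration, and the built-in 'html.parser' backend is assumed.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Shortened stand-in for the markup in the question (assumed for illustration).
HTML = """
<div class="inner-content">
<p></p>
<p></p>
<p>
<!--This text I don't want -->
Lorem Ipsum is simply dummy text.
<br></br>
<!-- The text I want to extract using BeautifulSoup-->
It is a long established fact.
</p>
<p></p>
</div>
"""


def text_after_first_br(paragraph):
    """Hypothetical helper: plain text following the first <br> in `paragraph`."""
    br = paragraph.find("br")
    if br is None:
        return ""
    parts = []
    for node in br.next_siblings:
        if isinstance(node, Comment):
            continue  # drop HTML comments, which are also string objects
        if isinstance(node, NavigableString):
            parts.append(str(node))
        else:
            parts.append(node.get_text())  # nested tags: keep only their text
    return "".join(parts).strip()


soup = BeautifulSoup(HTML, "html.parser")
third_p = soup.find_all("p", limit=3)[-1]  # third <p> in document order
result = text_after_first_br(third_p)
print(result)
```

Checking for Comment before NavigableString matters here, because Comment is itself a NavigableString subclass and would otherwise be appended as ordinary text.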