Web Scraping
Write a program to scrape https://github.com using requests and Beautiful Soup. The goal is to get, for a given GitHub username (e.g., https://github.com/google), a list of repositories with their GitHub-assigned programming language, the number of forks, and the number of stars each repository has.
Note that the repositories may be spread across several pages; we only focus on the second page and return the result in a DataFrame named result2. The expected output format is shown below (note: the repository list may change dynamically over time, so the following result is only for reference).
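The reference table itself is not reproduced in the text above; as a minimal sketch, result2 has the following columns (names taken from the solution code below), with one row per repository:

import pandas as pd

# Column layout of the expected result; no real repository data is shown here.
result2 = pd.DataFrame(columns=["Repository", "Language", "Forks", "Stars"])
print(result2)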
Hint:
- The GitHub URL is built from the username and a page-number query string. For example, to find all the repositories of google on the third page, we can use the following URL: https://github.com/google?page=3
- Use get_text(strip=True) to remove the white spaces and new lines from the text.
- You may encounter the exception ConnectionError: HTTPConnectionPool(host='xxx.xx.xxx.xxx', port=xxxx): Max retries exceeded with url: xx if using requests to establish multiple connections without closing them. You can use the following example code to avoid this exception.
requests.adapters.DEFAULT_RETRIES = 5
s = requests.session()
s.keep_alive = False
url = 'xx'
r = s.get(url,params=xx)
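Putting the two hints together, here is a minimal, self-contained sketch that requests a specific page for a user (the username google and page 3 are just the example values from the first hint):

import requests

# Reuse one session with retries, as suggested in the hint above.
requests.adapters.DEFAULT_RETRIES = 5
s = requests.session()
s.keep_alive = False
# The username goes in the path; the page number goes in the query string.
r = s.get("https://github.com/google", params={"page": 3})
print(r.url)          # https://github.com/google?page=3
print(r.status_code)  # 200 when the page is fetched successfully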
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

def getRepo(username):
    # YOUR CODE HERE
    # We only focus on the **second** page.
    page = 2
    params = "page=" + str(page)
    url = "https://github.com/" + username
    # Configure requests, send the GET request, and read the response body.
    requests.adapters.DEFAULT_RETRIES = 5
    s = requests.session()
    s.keep_alive = False
    r = s.get(url, params=params)
    text = r.text
    # Parse the response body as HTML.
    beautiful_soup_repositories = BeautifulSoup(text, "html.parser")
    # Get the <li> tags whose class is "public source d-block py-4 border-bottom".
    all_li = beautiful_soup_repositories.find_all(name='li',
        attrs={"class": "public source d-block py-4 border-bottom"})
    names = []
    programming_languages = []
    stars_numbers = []
    forks_numbers = []
    # Iterate over the <li> tags, one per repository.
    for li in all_li:
        a = li.find(name="a", attrs={"itemprop": "name codeRepository"})
        # Get the tag text, stripping surrounding white space.
        name = a.get_text(strip=True)
        print("name -> " + name)
        names.append(name)
        span = li.find(name="span", attrs={"itemprop": "programmingLanguage"})
        # Some repositories have no language assigned; guard against a missing tag.
        programming_language = span.get_text(strip=True) if span else ""
        print("programming_language -> " + programming_language)
        programming_languages.append(programming_language)
        # Fetch the repository page to read the star and fork counts.
        r = s.get(url + "/" + name)
        text = r.text
        beautiful_soup_repository = BeautifulSoup(text, "html.parser")
        href = "/" + username + "/" + name + "/" + "stargazers"
        a = beautiful_soup_repository.find(name="a", attrs={"href": href})
        stars_number = a.get_text(strip=True) if a else "0"
        print("stars_number -> " + stars_number)
        stars_numbers.append(stars_number)
        href = "/" + username + "/" + name + "/" + "network" + "/" + "members"
        a = beautiful_soup_repository.find(name="a", attrs={"href": href})
        forks_number = a.get_text(strip=True) if a else "0"
        print("forks_number -> " + forks_number)
        forks_numbers.append(forks_number)
        print()
    # Build the DataFrame and tell pandas to display all rows and columns
    # without truncating column widths.
    data = {"Repository": names, "Language": programming_languages, "Forks": forks_numbers, "Stars": stars_numbers}
    pd.set_option('display.max_columns', None)
    pd.set_option('display.max_rows', None)
    pd.set_option('display.max_colwidth', None)
    result2 = pd.DataFrame(data=data)
    return result2

result2 = getRepo("google")
print(result2)
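The fork and star counts are scraped as strings, and GitHub may render them with commas or a "k" suffix (e.g. "1.2k"). As an optional post-processing sketch that is not part of the original exercise, the re module imported above can normalise them into integers; the exact text format is an assumption here, not something guaranteed by the page:

# Hypothetical helper: extract counts such as "2,533" or "1.2k" from the scraped text.
def parse_count(text):
    match = re.search(r"([\d,.]+)\s*([kKmM]?)", str(text))
    if not match:
        return 0
    number = float(match.group(1).replace(",", ""))
    suffix = match.group(2).lower()
    if suffix == "k":
        number *= 1000
    elif suffix == "m":
        number *= 1000000
    return int(number)

result2["Stars"] = result2["Stars"].apply(parse_count)
result2["Forks"] = result2["Forks"].apply(parse_count)
print(result2.dtypes)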