Python for Data Science - Web scraping
Chapter 6 - Data Sourcing via Web
Segment 4 - Web scraping
from bs4 import BeautifulSoup
import urllib.request
from IPython.display import HTML
import re
r = urllib.request.urlopen('https://analytics.usa.gov/').read()
soup = BeautifulSoup(r, "lxml")
type(soup)
bs4.BeautifulSoup
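The one-line download above has no timeout and no error handling, so a slow or unreachable site would hang or crash the notebook. A minimal sketch of a more defensive fetch follows, using only the standard library. The helper name `fetch_html`, the 10-second default, and the `User-Agent` string are illustrative assumptions, not part of the original exercise.

```python
import urllib.request


def fetch_html(url, timeout=10.0):
    """Return the decoded body of `url`, or None if the request fails.

    Hypothetical helper for illustration; the tutorial calls urlopen() directly.
    """
    request = urllib.request.Request(
        url, headers={"User-Agent": "tutorial-example/0.1"}  # assumed value
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # Fall back to UTF-8 when the response does not declare a charset.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except OSError as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        print(f"Could not fetch {url}: {exc}")
        return None
```

The returned string can be passed to `BeautifulSoup(html, "lxml")` exactly as `r` is above, after checking that it is not `None`.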
print(soup.prettify()[:100])
<!DOCTYPE html>
<html lang="en">
<!-- Initalize title and data source variables -->
<head>
<!--
for link in soup.find_all('a'):
    print(link.get('href'))
/
#explanation
https://analytics.usa.gov/data/
https://open.gsa.gov/api/dap/
data/
#top-pages-realtime
#top-pages-7-days
#top-pages-30-days
https://analytics.usa.gov/data/live/all-pages-realtime.csv
https://analytics.usa.gov/data/live/all-domains-30-days.csv
https://www.digitalgov.gov/services/dap/
https://www.digitalgov.gov/services/dap/common-questions-about-dap-faq/#part-4
https://support.google.com/analytics/answer/2763052?hl=en
https://analytics.usa.gov/data/live/second-level-domains.csv
https://analytics.usa.gov/data/live/sites.csv
mailto:DAP@support.digitalgov.gov
https://analytics.usa.gov/data/
https://open.gsa.gov/api/dap/
mailto:DAP@support.digitalgov.gov
https://github.com/GSA/analytics.usa.gov/issues
https://github.com/GSA/analytics.usa.gov
https://github.com/18F/analytics-reporter
http://www.gsa.gov/
https://www.digitalgov.gov/services/dap/
https://cloud.gov/
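Several of the `href` values printed above are relative (`/`, `#explanation`, `data/`), so they cannot be requested on their own. As a sketch, `urllib.parse.urljoin` from the standard library can resolve them against the page URL. The `hrefs` list below is a small sample copied from the output, and `BASE_URL` is just a name chosen for this example.

```python
from urllib.parse import urljoin

# The page that was scraped; relative links are resolved against it.
BASE_URL = "https://analytics.usa.gov/"

# A few href values copied from the output above.
hrefs = [
    "/",
    "#explanation",
    "data/",
    "https://open.gsa.gov/api/dap/",
    "mailto:DAP@support.digitalgov.gov",
]

# urljoin leaves absolute URLs and other schemes (such as mailto:) unchanged.
absolute = [urljoin(BASE_URL, href) for href in hrefs]

for url in absolute:
    print(url)
```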
print(soup.get_text())
analytics.usa.gov | The US government's web traffic.
analytics.usa.gov
About this site
Data | API
Select an agency
All Participating Websites
Agency for International Development
Department of Agriculture
Department of Commerce
Department of Defense
Department of Education
Department of Energy
Department of Health and Human Services
Department of Homeland Security
Department of Housing and Urban Development
Department of Justice
Department of Labor
Department of State
Department of Transportation
Department of Veterans Affairs
Department of the Interior
Department of the Treasury
Environmental Protection Agency
Executive Office of the President
General Services Administration
National Aeronautics and Space Administration
National Archives and Records Administration
National Science Foundation
Nuclear Regulatory Commission
Office of Personnel Management
Postal Service
Small Business Administration
Social Security Administration
...
people on government websites now
Visits Today
Eastern Time
Visits in the Past 90 Days
There were ... visits over the past 90 days.
Devices
Based on rough network segmentation data, we estimate that less than 5% of all traffic across all agencies comes from US federal government networks.
Much more detailed data is available in downloadable CSV and JSON. This includes data on combined browser and OS usage.
Browsers
Internet Explorer
Operating Systems
Windows
Visitor Locations Right Now
Cities
Countries
United States & Territories
International
Top Pages
Now
7 Days
30 Days
People on a single, specific page now. We only count pages with at least 10 people on the page.
Download the full dataset.
Visits over the last week to domains, including traffic to all pages within that domain.
Visits over the last month to domains, including traffic to all pages within that domain. We only count pages with at least 1,000 visits in the last month.
Download the full dataset.
Top Downloads
Total file downloads yesterday on government domains.
About this Site
These data provide a window into how people are interacting with the government online.
The data come from a unified Google Analytics account for U.S. federal government agencies known as the Digital Analytics Program.
This program helps government agencies understand how people find, access, and use government services online. The program does not track individuals,
and anonymizes the IP addresses of visitors.
Not every government website is represented in these data.
Currently, the Digital Analytics Program collects web traffic from around 400 executive branch government domains,
across about 5,700 total websites,
including every cabinet department.
We continue to pursue and add more sites frequently; to add your site, email the Digital Analytics Program.
Download the data
You can download the data here. Available in JSON and CSV format.
Additionally, you can access data via our API project (currently in Beta).
A note on sampling
Due to varying Google Analytics API sampling thresholds and the sheer volume of data in this project, some non-realtime reports may be subject to sampling.
The data are intended to represent trends and numbers may not be precise.
Have a question or problem?
Get in touch.
Suggest a feature or report an issue
View our code on GitHub
View our code for the data on GitHub
Analytics.usa.gov is a project of GSA’s Digital Analytics Program.
This website is hosted on cloud.gov.
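The raw `get_text()` output is mostly whitespace, because every line break and indent between tags in the HTML is kept. One way to tidy it, sketched here with plain string methods, is to strip each line and drop the empty ones. The function name `non_empty_lines` and the `sample` string are made up for this example; the sample imitates a short stretch of the output above.

```python
def non_empty_lines(text):
    """Strip each line of `text` and drop the lines that are empty afterwards."""
    stripped = (line.strip() for line in text.splitlines())
    return [line for line in stripped if line]


# Hypothetical sample: blank lines, padding, and a non-breaking space (\xa0).
sample = "\n\n  analytics.usa.gov  \n\xa0\n\nAbout this site\nData | API\n\n"

for line in non_empty_lines(sample):
    print(line)
```

In the notebook this would be called as `non_empty_lines(soup.get_text())`. Beautiful Soup can do something similar itself with `soup.get_text(separator="\n", strip=True)`.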
print(soup.prettify()[0:1000])
<!DOCTYPE html>
<html lang="en">
<!-- Initalize title and data source variables -->
<head>
<!--
Hi! Welcome to our source code.
This dashboard uses data from the Digital Analytics Program, a US
government team inside the General Services Administration.
For a detailed tech breakdown of how 18F and friends built this site:
https://18f.gsa.gov/2015/03/19/how-we-built-analytics-usa-gov/
This is a fully open source project, and your contributions are welcome.
Frontend static site: https://github.com/18F/analytics.usa.gov
Backend data reporting: https://github.com/18F/analytics-reporter
-->
<meta charset="utf-8"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="NjbZn6hQe7OwV-nTsa6nLmtrOUcSGPRyFjxm5zkmCcg" name="google-site-verification"/>
<link href="/css/vendor/css/uswds.v0.9.6.css" rel="stylesheet"/>
<link href="/css/public_analytics.css" rel="stylesheet"/>
<link href="/images/analytics-favicon.ico" rel="ic
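# Keep only <a> tags whose href starts with "http".
# find_all is the current name; findAll is a legacy alias for it.
for link in soup.find_all('a', attrs={'href': re.compile("^http")}):
    print(link)
type(link)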
<a href="https://analytics.usa.gov/data/">Data</a>
<a href="https://open.gsa.gov/api/dap/" rel="noopener" target="_blank">API</a>
<a href="https://analytics.usa.gov/data/live/all-pages-realtime.csv">Download the full dataset.</a>
<a href="https://analytics.usa.gov/data/live/all-domains-30-days.csv">Download the full dataset.</a>
<a class="external-link" href="https://www.digitalgov.gov/services/dap/">Digital Analytics Program</a>
<a class="external-link" href="https://www.digitalgov.gov/services/dap/common-questions-about-dap-faq/#part-4">does not track individuals</a>
<a class="external-link" href="https://support.google.com/analytics/answer/2763052?hl=en">anonymizes the IP addresses</a>
<a class="external-link" href="https://analytics.usa.gov/data/live/second-level-domains.csv">400 executive branch government domains</a>
<a class="external-link" href="https://analytics.usa.gov/data/live/sites.csv">about 5,700 total websites</a>
<a href="https://analytics.usa.gov/data/">download the data here.</a>
<a href="https://open.gsa.gov/api/dap/" rel="noopener" target="_blank"> API project</a>
<a class="usa-button usa-button-secondary-inverse" href="https://github.com/GSA/analytics.usa.gov/issues">
<img alt="Github Icon" class="github-icon" src="/images/github-logo-white.svg"/>
Suggest a feature or report an issue
</a>
<a href="https://github.com/GSA/analytics.usa.gov">
<img alt="Github Icon" class="github-icon" src="/images/github-logo.svg"/>
View our code on GitHub</a>
<a href="https://github.com/18F/analytics-reporter">
<img alt="Github Icon" class="github-icon" src="/images/github-logo.svg"/>
View our code for the data on GitHub</a>
<a href="http://www.gsa.gov/">
<img alt="GSA" src="/images/gsa-logo.svg"/>
</a>
<a href="https://www.digitalgov.gov/services/dap/">Digital Analytics Program</a>
<a href="https://cloud.gov/">cloud.gov</a>
bs4.element.Tag
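The filter `re.compile("^http")` is what separates the absolute links from the relative ones: per the Beautiful Soup documentation, a compiled pattern is tested against the attribute value with its `search()` method. The same test can be reproduced on plain strings, as in this small sketch. The `hrefs` list is a sample of values from the earlier output, and the variable names are chosen for the example.

```python
import re

# Same pattern as in the loop above: the value must start with "http".
http_pattern = re.compile("^http")

# A sample of href values from the earlier output.
hrefs = [
    "/",
    "#explanation",
    "data/",
    "https://analytics.usa.gov/data/",
    "mailto:DAP@support.digitalgov.gov",
    "http://www.gsa.gov/",
]

# search() with a ^ anchor only succeeds at the start of the string.
kept = [href for href in hrefs if http_pattern.search(href)]
dropped = [href for href in hrefs if not http_pattern.search(href)]

print("kept:", kept)
print("dropped:", dropped)
```

Note that `^http` matches both `http://` and `https://` links, while in-page anchors, relative paths, and `mailto:` addresses are dropped.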
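# Write each matching tag to parsed_data.txt; the with block closes the file.
with open("parsed_data.txt", "w") as file:
    for link in soup.find_all('a', attrs={'href': re.compile("^http")}):
        soup_link = str(link)
        print(soup_link)
        # Add a newline so consecutive tags do not run together in the file.
        file.write(soup_link + "\n")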
<a href="https://analytics.usa.gov/data/">Data</a>
<a href="https://open.gsa.gov/api/dap/" rel="noopener" target="_blank">API</a>
<a href="https://analytics.usa.gov/data/live/all-pages-realtime.csv">Download the full dataset.</a>
<a href="https://analytics.usa.gov/data/live/all-domains-30-days.csv">Download the full dataset.</a>
<a class="external-link" href="https://www.digitalgov.gov/services/dap/">Digital Analytics Program</a>
<a class="external-link" href="https://www.digitalgov.gov/services/dap/common-questions-about-dap-faq/#part-4">does not track individuals</a>
<a class="external-link" href="https://support.google.com/analytics/answer/2763052?hl=en">anonymizes the IP addresses</a>
<a class="external-link" href="https://analytics.usa.gov/data/live/second-level-domains.csv">400 executive branch government domains</a>
<a class="external-link" href="https://analytics.usa.gov/data/live/sites.csv">about 5,700 total websites</a>
<a href="https://analytics.usa.gov/data/">download the data here.</a>
<a href="https://open.gsa.gov/api/dap/" rel="noopener" target="_blank"> API project</a>
<a class="usa-button usa-button-secondary-inverse" href="https://github.com/GSA/analytics.usa.gov/issues">
<img alt="Github Icon" class="github-icon" src="/images/github-logo-white.svg"/>
Suggest a feature or report an issue
</a>
<a href="https://github.com/GSA/analytics.usa.gov">
<img alt="Github Icon" class="github-icon" src="/images/github-logo.svg"/>
View our code on GitHub</a>
<a href="https://github.com/18F/analytics-reporter">
<img alt="Github Icon" class="github-icon" src="/images/github-logo.svg"/>
View our code for the data on GitHub</a>
<a href="http://www.gsa.gov/">
<img alt="GSA" src="/images/gsa-logo.svg"/>
</a>
<a href="https://www.digitalgov.gov/services/dap/">Digital Analytics Program</a>
<a href="https://cloud.gov/">cloud.gov</a>
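Saving whole tags is fine for inspection, but for later analysis it is often easier to store just the URLs, one per line, and read them back as a list. The sketch below shows that round trip with the standard library only. The function names `save_links` and `load_links` and the two sample URLs (taken from the output above) are assumptions made for this example.

```python
def save_links(links, path):
    """Write one URL per line to `path` (UTF-8)."""
    with open(path, "w", encoding="utf-8") as handle:
        for link in links:
            handle.write(link + "\n")


def load_links(path):
    """Read the URLs back, skipping any blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


# Two href values taken from the output above.
sample_links = [
    "https://analytics.usa.gov/data/",
    "https://cloud.gov/",
]
```

In the notebook, the list could be built with `[link.get('href') for link in soup.find_all('a', attrs={'href': re.compile("^http")})]` and passed to `save_links` together with a file name.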
%pwd
'/home/ericwei/Ex_Files_Python_Data_Science_EssT_Pt_1/Exercise Files/06_04_begin'