[Translation] Extracting text from a web page with BeautifulSoup and Python


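If you spend much time scraping the web, one task you will likely run into is stripping the visible text content out of HTML.
If you work in Python, you can use BeautifulSoup for the job.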

Setting up the extraction

First, we need some HTML. I'll use Troy Hunt's recent blog post about the "Collection #1" data breach.
Here is how to download the HTML:

import requests

url = 'https://www.troyhunt.com/the-773-million-record-collection-1-data-reach/'
res = requests.get(url)
html_page = res.content  # raw bytes of the response body
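
The download step can fail in ways the snippet above ignores, for instance a request that hangs or a server that answers with an error page. Below is a minimal sketch of a more defensive version. The names `fetch_html`, `timeout` and `get` are hypothetical and chosen for this illustration; the `get` parameter exists only so the HTTP call can be replaced when testing.

```python
import requests


def fetch_html(url, timeout=10.0, get=requests.get):
    """Download a page and return its raw bytes.

    `fetch_html` and its parameters are illustrative names, not part of
    the original script. `get` defaults to requests.get.
    """
    res = get(url, timeout=timeout)
    # Raise requests.HTTPError on a 4xx/5xx status instead of
    # silently handing an error page to the parser.
    res.raise_for_status()
    return res.content
```

Calling `fetch_html(url)` would then take the place of the `requests.get(url)` and `res.content` lines.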

Now we have the HTML, but there is a lot of clutter in it. How do we extract the information we want?

Creating the beautiful soup

We'll use Beautiful Soup to parse the HTML, like this:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_page, 'html.parser')
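
To get a feel for what the parsed object offers, here is a small sketch that runs on a made-up HTML snippet instead of the downloaded page. The `sample_html` document and the variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# A made-up document standing in for the downloaded page.
sample_html = b"""
<html>
  <head><title>Sample page</title></head>
  <body><h1>Hello</h1><p>First <strong>paragraph</strong>.</p></body>
</html>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# Text of the <title> tag.
page_title = soup.title.string
# Text of the first <p>, including text inside nested tags.
first_paragraph = soup.find('p').get_text()

print(page_title)
print(first_paragraph)
```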

Finding the text

BeautifulSoup provides a simple way to find the text content (i.e. the non-HTML parts) of the HTML:

text = soup.find_all(text=True)

However, this also gives us some information we don't want.
Look at the output of the following statement:

set([t.parent.name for t in text])
# {'label', 'h4', 'ol', '[document]', 'a', 'h1', 'noscript', 'span', 'header', 'ul', 'html', 'section', 'article', 'em', 'meta', 'title', 'body', 'aside', 'footer', 'div', 'form', 'nav', 'p', 'head', 'link', 'strong', 'h6', 'br', 'li', 'h3', 'h5', 'input', 'blockquote', 'main', 'script', 'figure'}
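
A variation on the statement above is to count how many text nodes sit under each parent tag, which makes it easier to judge which tags matter. This is a sketch on a made-up snippet (`sample_html` is an assumption). It uses `string=True`, the name Beautiful Soup 4.4.0 and later give to the argument spelled `text=True` in this post.

```python
from collections import Counter

from bs4 import BeautifulSoup

# A made-up page: one title, one script, two paragraphs.
sample_html = """<html><head><title>Demo</title><script>var x = 1;</script></head>
<body><p>Hello</p><p>World</p></body></html>"""

soup = BeautifulSoup(sample_html, 'html.parser')
text = soup.find_all(string=True)

# Map each parent tag name to the number of text nodes directly inside it.
parent_counts = Counter(t.parent.name for t in text)

for name, count in parent_counts.most_common():
    print(name, count)
```

Note that the script body shows up as a text node too, which is why `script` ends up among the unwanted parents.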

Some of these are items we probably don't want:

  • [document]
  • noscript
  • header
  • html
  • meta
  • head
  • input
  • script

For the rest, check them yourself to decide which ones you want.

Extracting the valuable text

Now that we can see which elements are valuable to us, we can build our output:

output = ''
blacklist = [
	'[document]',
	'noscript',
	'header',
	'html',
	'meta',
	'head', 
	'input',
	'script',
	# there may be more elements you don't want, such as "style", etc.
]

for t in text:
	if t.parent.name not in blacklist:
		output += '{} '.format(t)
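
The same filtering can be wrapped in a reusable function. The sketch below is one possible shape, not the author's code: `visible_text` and `SKIPPED_PARENTS` are hypothetical names, `'style'` is added to the list as a guess, and, unlike the loop above, whitespace-only text nodes are dropped.

```python
from bs4 import BeautifulSoup

# Same tag names as the blacklist above, plus 'style' as an extra guess.
SKIPPED_PARENTS = {
    '[document]', 'noscript', 'header', 'html',
    'meta', 'head', 'input', 'script', 'style',
}


def visible_text(html, skipped=SKIPPED_PARENTS):
    """Return the text nodes whose direct parent is not in `skipped`."""
    soup = BeautifulSoup(html, 'html.parser')
    pieces = []
    for t in soup.find_all(string=True):
        if t.parent.name in skipped:
            continue
        piece = t.strip()
        if piece:  # ignore nodes that are only whitespace
            pieces.append(piece)
    return ' '.join(pieces)
```

Using a set for the skipped names keeps the membership test cheap, and joining the pieces at the end avoids repeated string concatenation.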

The full script

Finally, here is the full Python script for getting the text from a web page:

import requests
from bs4 import BeautifulSoup

url = 'https://www.troyhunt.com/the-773-million-record-collection-1-data-reach/'
res = requests.get(url)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
text = soup.find_all(text=True)

output = ''
blacklist = [
	'[document]',
	'noscript',
	'header',
	'html',
	'meta',
	'head', 
	'input',
	'script',
	# there may be more elements you don't want, such as "style", etc.
]

for t in text:
	if t.parent.name not in blacklist:
		output += '{} '.format(t)

print(output)

Improvements

If you look at output now, you'll see that it still contains some things we don't want.

There is some text from the header:

Home \n \n \n Workshops \n \n \n Speaking \n \n \n Media \n \n \n About \n \n \n Contact \n \n \n Sponsor \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n   \n \n \n \n Sponsored by:

And there is some text from the footer:

\n \n \n \n \n \n Weekly Update 122 \n \n \n \n \n Weekly Update 121 \n \n \n \n \n \n \n \n Subscribe  \n \n \n \n \n \n \n \n \n \n Subscribe Now! \n \n \n \n \r\n            Send new blog posts: \n   daily \n   weekly \n \n \n \n Hey, just quickly confirm you\'re not a robot: \n  Submitting... \n Got it! Check your email, click the confirmation link I just sent you and we\'re done. \n \n \n \n \n \n \n \n Copyright 2019, Troy Hunt \n This work is licensed under a  Creative Commons Attribution 4.0 International License . In other words, share generously but provide attribution. \n \n \n Disclaimer \n Opinions expressed here are my own and may not reflect those of people I work with, my mates, my wife, the kids etc. Unless I\'m quoting someone, they\'re just my own views. \n \n \n Published with Ghost \n This site runs entirely on  Ghost  and is made possible thanks to their kind support. Read more about  why I chose to use Ghost . \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n   \n \n \n \n \n '

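If you are only extracting text from a single site, you can probably look at its HTML and find a way to parse out just the valuable content of the page.
Unfortunately, the internet is a messy place, and you'll have a hard time finding consensus on HTML semantics.
Good luck!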
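
One reason header and footer text leaks through is that the blacklist only looks at the direct parent of each text node, so a link nested inside a header has `a` as its parent and is kept. A possible refinement for a single known site, sketched below, is to start from one container element and skip any text that has an unwanted ancestor. Everything here is an assumption for illustration: the `main_text` name, the `article` default, and the contents of `SKIPPED_ANCESTORS` would need adjusting to the site at hand.

```python
import re

from bs4 import BeautifulSoup, Comment

# Hypothetical list of tags whose whole subtree is treated as noise.
SKIPPED_ANCESTORS = {
    'script', 'style', 'noscript', 'head',
    'header', 'footer', 'nav', 'form',
}


def main_text(html, container='article'):
    """Return whitespace-normalized text found inside `container`."""
    soup = BeautifulSoup(html, 'html.parser')
    root = soup.find(container)
    if root is None:
        # Fall back to the whole document if the container is missing.
        root = soup
    pieces = []
    for t in root.find_all(string=True):
        if isinstance(t, Comment):
            continue  # HTML comments are not visible text
        # Walk up the tree: skip text with any unwanted ancestor.
        if any(p.name in SKIPPED_ANCESTORS for p in t.parents):
            continue
        pieces.append(str(t))
    # Collapse the runs of newlines and spaces seen in the output above.
    return re.sub(r'\s+', ' ', ' '.join(pieces)).strip()
```

Whether `article` is the right container depends entirely on the page, so inspect the HTML before relying on it.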

Original source: https://matix.io/extract-text-from-webpage-using-beautifulsoup-and-python/

