BeautifulSoup: stripping script tags from HTML and deleting tags with a given class
Removing script tags
Method 1:
[s.extract() for s in soup("script")]
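To make Method 1 concrete, here is a minimal sketch on a made-up HTML string. The sample markup and the variable names are assumptions for illustration, and the standard-library html.parser is used so that lxml is not required. extract() detaches each tag and returns it, so the removed scripts are still available afterwards.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, not taken from a real site
html = (
    "<html><head><script>var a = 1;</script></head>"
    "<body><p>Hello</p><script>alert('x');</script></body></html>"
)

soup = BeautifulSoup(html, "html.parser")

# soup("script") is shorthand for soup.find_all("script").
# extract() removes each tag from the tree and returns the removed tag.
removed = [s.extract() for s in soup("script")]

remaining_scripts = soup.find_all("script")
body_text = soup.body.get_text()
```

If the removed tags are not needed, decompose() does the same removal and also destroys the tag.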
Method 2:
from bs4 import BeautifulSoup

def H5_filter(self):
    '''
    Filter the crawled H5 page.
    :return: the body tag with scripts removed, the body text, and the title
    '''
    page = self.crawl_succ_page()
    soup = BeautifulSoup(page, 'lxml')
    # Get the title text
    title = soup.select('.rich_media_title')[0].get_text()
    # Remove every script tag
    for tag in soup.find_all('script'):
        tag.decompose()
    filter_script_body = soup.find('body')  # keep only the body
    article = filter_script_body.text
    return filter_script_body, article, title
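The method above depends on self.crawl_succ_page(), which is not shown here. As a rough standalone sketch, the same filtering can be written as a plain function that takes the HTML string directly. The function name filter_h5, the sample markup, and the None checks are assumptions added for illustration, not part of the original class.

```python
from bs4 import BeautifulSoup


def filter_h5(page):
    """Return (body tag, body text, title) with all script tags removed."""
    soup = BeautifulSoup(page, "html.parser")

    # select_one returns None instead of raising when nothing matches
    title_tag = soup.select_one(".rich_media_title")
    title = title_tag.get_text(strip=True) if title_tag else None

    # Remove every script tag from the tree
    for script in soup.find_all("script"):
        script.decompose()

    body = soup.find("body")
    article = body.get_text() if body else ""
    return body, article, title


# Hypothetical sample page
sample = (
    '<html><body><h1 class="rich_media_title"> Demo </h1>'
    "<p>Text</p><script>x()</script></body></html>"
)
body, article, title = filter_h5(sample)
```

Guarding against a missing title or body avoids an IndexError or AttributeError on pages that do not follow the expected layout.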
Deleting tags with a given class
# Remove every span whose class attribute matches this string
for span in soup.find_all('span', {'class': 'weapp_display_element js_weapp_display_element'}):
    span.decompose()
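According to the Beautiful Soup documentation, searching with a multi-class string compares it against the exact value of the class attribute, so the same classes written in a different order may not match. A hedged alternative, assuming a bs4 version with select() support, is a CSS selector that requires both classes in any order. The sample markup below is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same two classes in both orders, plus an unrelated span
html = (
    "<p>"
    '<span class="weapp_display_element js_weapp_display_element">A</span>'
    '<span class="js_weapp_display_element weapp_display_element">B</span>'
    '<span class="other">C</span>'
    "</p>"
)
soup = BeautifulSoup(html, "html.parser")

# The selector matches spans that carry both classes, whatever their order
for span in soup.select("span.weapp_display_element.js_weapp_display_element"):
    span.decompose()

remaining = soup.get_text()
```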
To delete a div with a specific id, for example main-content, you can use decompose():
soup.find('div', id="main-content").decompose()
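One caveat: find() returns None when no element matches, so calling decompose() on the result directly raises AttributeError on pages without that div. A small sketch of a guarded version follows. The helper name remove_div_by_id and the sample markup are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup
html = (
    '<body><div id="main-content">remove me</div>'
    '<div id="sidebar">keep me</div></body>'
)
soup = BeautifulSoup(html, "html.parser")


def remove_div_by_id(soup, div_id):
    """Remove the div with the given id; return True if one was found."""
    div = soup.find("div", id=div_id)
    if div is None:
        return False
    div.decompose()
    return True


first = remove_div_by_id(soup, "main-content")   # div exists and is removed
second = remove_div_by_id(soup, "main-content")  # nothing left to remove
```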