Python Web Scraping: Parsing and Extracting Web Page Content with the lxml Module

This article is a repost. 2018-12-28 09:05

Using CSS selectors:

# -*- coding: utf-8 -*-
# Python 3 syntax; cssselect() needs the separate "cssselect" package (python -m pip install cssselect)
from lxml import html
page_html = ''' <html><body> <input id="input_id" value="input value" name="input_a"> </body></html> '''
page_tree = html.fromstring(page_html)  # a str needs no .decode() in Python 3
ele = page_tree.cssselect('#input_id')  # select by id with a CSS selector; returns a list of elements
print(html.tostring(ele[0], encoding='unicode', with_tail=False))  # <input id="input_id" value="input value" name="input_a">
print(ele)         # [<InputElement 30133f0 name='input_a' type='text'>] (the hex address varies per run)
print(ele[0])      # <InputElement 30133f0 name='input_a' type='text'>
print(ele[0].get('value'))   # input value
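
The same element can also be reached with XPath, which lxml supports without the extra cssselect package. The following is a minimal sketch and not part of the original article; the variable names by_id, by_name and values are illustrative assumptions.

```python
from lxml import html

page_html = '<html><body> <input id="input_id" value="input value" name="input_a"> </body></html>'
page_tree = html.fromstring(page_html)

# XPath counterpart of the CSS selector '#input_id'
by_id = page_tree.xpath('//input[@id="input_id"]')
# Select by a different attribute, like the CSS selector 'input[name="input_a"]'
by_name = page_tree.xpath('//input[@name="input_a"]')
# An XPath ending in /@value returns the attribute values themselves as strings
values = page_tree.xpath('//input[@id="input_id"]/@value')

print(by_id[0].get('value'))  # input value
print(by_name[0].get('id'))   # input_id
print(values)                 # ['input value']
```

Like cssselect(), xpath() returns a list here, so an empty list rather than an exception signals that nothing matched.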

Getting the content inside a tag:

# -*- coding: utf-8 -*-
# Python 3 syntax; cssselect() needs the separate "cssselect" package
from lxml import html
page_html = ''' <html><body> <div class="cl">DIV1</div> <div class="cl">DIV2</div> </body></html> '''
page_tree = html.fromstring(page_html)  # a str needs no .decode() in Python 3
ele = page_tree.cssselect('body')[0].findall("div")  # findall returns all direct child <div> tags
print(ele[0].text_content().strip())  # DIV1
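
To read every matching tag rather than only the first, the list returned by findall can be looped over. The sketch below is an illustration under assumptions: the nested <span> markup and the names divs and texts are not from the original. It also shows how .text differs from text_content() when a tag has child elements.

```python
from lxml import html

page_html = '''<html><body>
<div class="cl">DIV1</div>
<div class="cl">DIV2 <span>inner</span> tail</div>
</body></html>'''
page_tree = html.fromstring(page_html)

# find() returns the first matching direct child; findall() returns all of them
divs = page_tree.find('body').findall('div')

# text_content() joins the text of the tag and all of its descendants
texts = [d.text_content().strip() for d in divs]
print(texts)                  # ['DIV1', 'DIV2 inner tail']

# .text holds only the text that precedes the first child element
print(divs[1].text.strip())   # DIV2
```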

If the following error appears:
from lxml import html
ImportError: DLL load failed: %1 is not a valid Win32 application.
try reinstalling the lxml module:

python -m pip uninstall lxml
python -m pip install lxml==3.6.0
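
This error often means that a 32-bit build of a compiled package was loaded into a 64-bit Python interpreter, or the reverse, so it may help to check the interpreter's bitness before and after reinstalling. The following is a minimal sketch, not from the original article; the name python_bits is illustrative.

```python
import struct
import sys

# Pointer size in bytes times 8 gives the interpreter's bitness (32 or 64)
python_bits = struct.calcsize('P') * 8
print('Python %s, %d-bit' % (sys.version.split()[0], python_bits))

try:
    from lxml import etree
    # LXML_VERSION is a tuple of integers such as (3, 6, 0, 0)
    print('lxml version:', '.'.join(str(n) for n in etree.LXML_VERSION))
except ImportError as exc:
    # The DLL load failure shown above is raised as an ImportError
    print('lxml could not be imported:', exc)
```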
