lmth1: a handy web information extraction tool written in Python

Reposted article, dated 2012-02-15 22:03.


0, Why lmth1?

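Nine out of ten people who play with Python have used urllib, and nine out of ten who scrape data have used BeautifulSoup. I am no exception: almost all of my scraping is done with BeautifulSoup.
BeautifulSoup does its job well, but its API is a bit clumsy and never feels smooth to use. Compared with a conventional API, I prefer jQuery's fluent API. So I spent two evenings building two libraries on top of BeautifulSoup, lmth and lmth1: lmth provides the basic functionality and handles Hpath parsing; lmth1 provides the fluent API used for scraping.

lmth1's interface is very simple, and its implementation is simpler still: fewer than 300 lines of code. It is nonetheless powerful. You will soon see how lmth1 does in one line what takes ten lines with BeautifulSoup, and more readably too.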

1, Introduction

As the title says.

Before use, place lmth.py, lmth1.py, and beautifulsoup.py in a directory on Python's module search path.
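If copying the files into an existing search-path directory is inconvenient, one alternative is to add their folder to sys.path at runtime. The sketch below (written for Python 3) only illustrates the mechanism: the folder is a temporary directory and demo_mod is a hypothetical stand-in module, created so that the example runs on its own.

```python
import importlib
import os
import sys
import tempfile

# Hypothetical stand-in for the folder that would hold lmth.py and lmth1.py.
lib_dir = tempfile.mkdtemp()
with open(os.path.join(lib_dir, "demo_mod.py"), "w") as f:
    f.write("GREETING = 'hello'\n")

# Putting the folder on sys.path makes the modules inside it importable.
sys.path.insert(0, lib_dir)
importlib.invalidate_caches()
demo_mod = importlib.import_module("demo_mod")
print(demo_mod.GREETING)
```

With the real files, lib_dir would simply be the folder they were unpacked into.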

2, Hpath

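Hpath is an XPath-like HTML path query expression that I defined. Its syntax is very simple, and a few examples are enough to explain it. If you need a strict definition, see the BNF in section 2.2.

2.1 Examples

Note: whenever the examples below speak of getting elements, they mean elements found beneath the target node.

Sample HTML used: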

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
     <title>Untitled Page</title>
</head>
<body>
<h1 id="title">Page list</h1>
<div id="content" class="sites">
     <a href="http://www.google.com/" class="good">Google</a>
     <a href="http://www.yahoo.com/" class="good">Yahoo</a>
     <a href="http://www.baidu.com/" class="asshole">Baidu</a>
     <a href="http://www.bing.com/" class="excellent">Bing</a>
</div>
<div id="tbl">
     <ul>
     <li class="odd">1</li>
     <li class="even">2</li>
     <li class="odd">3</li>
     <li class="even">4</li>
     <li class="odd">5</li>
     <li class="even">6</li>
     </ul>
</div>
</body>
</html>

2.1.1 Basic expressions

li

Effect: gets all li elements
Result:

[
     <li class="odd">1</li>,
     <li class="even">2</li>,
     <li class="odd">3</li>,
     <li class="even">4</li>,
     <li class="odd">5</li>,
     <li class="even">6</li>
]

div[id=tbl]

Effect: gets all div elements whose id attribute is tbl
Tip: filter by attribute for a more precise lookup
Result:

div[id=content, class=sites]

Effect: gets all div elements whose id attribute is content and whose class attribute is sites
Tip: several attribute values can be set at once; separate the attribute pairs with commas
Result:

<div id="content" class="sites">
<a href="http://www.google.com/" class="good">Google</a>
<a href="http://www.yahoo.com/" class="good">Yahoo</a>
<a href="http://www.baidu.com/" class="asshole">Baidu</a>
<a href="http://www.bing.com/" class="excellent">Bing</a>
</div>

div[@id]

Effect: gets the id attribute value of every div element
Tip: prefix the attribute you want to retrieve with an @ sign
Result:

[
'content',
'tbl'
]

div[id=content]/a[@href]

Effect: gets the href attribute value of every a element under the div elements whose id attribute is content
Result:

[
      'http://www.google.com/',
      'http://www.yahoo.com/',
      'http://www.baidu.com/',
      'http://www.bing.com/'
]
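To make the basic expressions concrete, here is a minimal sketch of how such a path could be evaluated. It is not lmth's implementation: it uses the standard library's xml.etree.ElementTree instead of BeautifulSoup, runs on Python 3, handles only exact-match filters and @attribute getters (not @#), and assumes that every step searches all descendants of the current nodes. The names parse_step and query are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the sample page (no DOCTYPE or xmlns, to keep parsing simple).
SAMPLE = """<body>
<div id="content" class="sites">
  <a href="http://www.google.com/" class="good">Google</a>
  <a href="http://www.bing.com/" class="excellent">Bing</a>
</div>
<div id="tbl"><ul><li class="odd">1</li><li class="even">2</li></ul></div>
</body>"""


def parse_step(step):
    """Split one step such as 'div[id=content]' into (name, filters, wanted)."""
    name, _, rest = step.partition("[")
    filters, wanted = {}, []
    for item in rest.rstrip("]").split(","):
        item = item.strip()
        if item.startswith("@"):
            wanted.append(item[1:])               # attribute to return
        elif item:
            key, _, value = item.partition("=")
            filters[key.strip()] = value.strip()  # exact-match filter
    return name.strip(), filters, wanted


def query(root, hpath):
    """Evaluate a basic path; each step searches all descendants (an assumption)."""
    nodes, wanted = [root], []
    for step in hpath.split("/"):
        name, filters, wanted = parse_step(step)
        found = []
        for node in nodes:
            for el in node.iter(name):
                if el is node or el in found:
                    continue                      # skip the node itself and duplicates
                if all(el.get(k) == v for k, v in filters.items()):
                    found.append(el)
        nodes = found
    if not wanted:
        return nodes                              # plain element selection
    if len(wanted) == 1:
        return [n.get(wanted[0]) for n in nodes]
    return [{a: n.get(a) for a in wanted} for n in nodes]


root = ET.fromstring(SAMPLE)
print(query(root, "div[@id]"))                    # ['content', 'tbl']
print(query(root, "div[id=content]/a[@href]"))
```

The last call prints the two href values of the trimmed sample, mirroring the result shown above for the full page.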

2.1.2 Advanced expressions

a[class={excellent|good}, @class, @href]

Effect: gets the class and href attributes of every a element whose class attribute is excellent or good
Tip: the text inside the braces is a regular expression, which makes an OR possible
Result:

[
     {
          'href': 'http://www.google.com/',
          'class': 'good'
     },
     {
          'href': 'http://www.yahoo.com/',
          'class': 'good'
     },
     {
          'href': 'http://www.bing.com/',
          'class': 'excellent'
     }
]

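div[id={con.+}]/a[class={ass.+}, @class, @#]

Effect: gets the class attribute and the content of every a element whose class attribute starts with ass, under the div elements whose id attribute starts with con
Tip: @# stands for the element's content (inner text)
Result:

{
'#': 'Baidu',
'class': 'asshole'
}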

ul/li[class={e.+}, @#]

Effect: gets the content of every li element whose class attribute starts with e, under the ul elements
Tip: regular expressions can also be used for fuzzy matching
Result:

[
      '2',
      '4',
      '6'
]
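The brace syntax can be pictured as a small matching rule: a value wrapped in { } is treated as a regular expression, anything else as a literal. The sketch below (Python 3) is an illustration only. The helper name value_matches is hypothetical, and matching the regex from the start of the value with re.match is an assumption; the library may anchor its patterns differently.

```python
import re


def value_matches(pattern, actual):
    """Return True if an attribute value satisfies an Hpath-style value."""
    if actual is None:
        return False                      # the element lacks the attribute
    if pattern.startswith("{") and pattern.endswith("}"):
        # Assumption: the regex is matched from the start of the value.
        return re.match(pattern[1:-1], actual) is not None
    return pattern == actual              # plain values must match exactly


classes = ["good", "good", "asshole", "excellent"]
print([c for c in classes if value_matches("{excellent|good}", c)])
print([c for c in classes if value_matches("{ass.+}", c)])
```

Under this assumption a pattern such as {good} would also accept goodbye, since only the start of the value is anchored.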

2.2 BNF definition of Hpath

If you have never dabbled in compilers, you can skip this section.
If you have, it will be self-explanatory.

hpath ::= hpart { "/" hpart }
hpart ::= ele_name [ "[" attrs "]" ]
attrs ::= pred_attrs [ "," get_attrs ]
pred_attrs ::= pred_attr { "," pred_attr }
get_attrs ::= get_attr { "," get_attr }
get_attr ::= "@" string [ "(" attr_alias ")" ]
attr_alias ::= string
pred_attr ::= string "=" value
value ::= string | regex_value
regex_value ::= "{" string "}"
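As a rough illustration of the grammar, the following sketch (Python 3) splits an Hpath into its parts. It is not lmth's parser, and parse_hpart and parse_hpath are hypothetical names. Unlike the BNF as written, it also accepts brackets that contain only getters and parts without an element name, because the examples in this article use both forms. Splitting on "/" and "," is a simplification that would break on regular expressions containing those characters.

```python
import re

# "@name" optionally followed by "(alias)".
GET_ATTR = re.compile(r"@\s*([^\s()]+)\s*(?:\(\s*([^()]*?)\s*\))?$")


def parse_hpart(hpart):
    """Parse one hpart into (element name, predicates, getters)."""
    m = re.match(r"\s*([^\[\]]*?)\s*(?:\[(.*)\])?\s*$", hpart)
    if m is None:
        raise ValueError("malformed hpart: %r" % hpart)
    name, body = m.group(1), m.group(2) or ""
    preds, getters = [], []
    for item in (part.strip() for part in body.split(",")):
        if not item:
            continue
        if item.startswith("@"):
            g = GET_ATTR.match(item)
            if g is None:
                raise ValueError("malformed getter: %r" % item)
            # Without an alias, the attribute name doubles as the result key.
            getters.append((g.group(1), g.group(2) or g.group(1)))
        else:
            if getters:
                raise ValueError("predicates must come before getters")
            key, sep, value = item.partition("=")
            if not sep:
                raise ValueError("malformed predicate: %r" % item)
            preds.append((key.strip(), value.strip()))
    return name, preds, getters


def parse_hpath(hpath):
    """Parse a whole path into a list of parsed hparts."""
    return [parse_hpart(part) for part in hpath.split("/")]


for part in parse_hpath("div[id={con.+}]/a[class={ass.+}, @class, @#(text)]"):
    print(part)
```

Each printed tuple holds the element name, the list of (attribute, value) predicates, and the list of (attribute, alias) getters.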

3, Selecting elements

lmth1 provides a very simple API for retrieving HTML elements.
For convenience, first enter the following code:

from lmth1 import Url

This saves typing the slightly odd-looking lmth1 prefix :)

The examples use the file at http://files.cnblogs.com/figure9/test.xml (its content is the same as the sample HTML above):

3.1 Selecting a single element

Url('http://files.cnblogs.com/figure9/test.xml').elem('div')

Effect: gets the first div element from http://files.cnblogs.com/figure9/test.xml.
Result:

3.2 Selecting multiple elements

Url('http://files.cnblogs.com/figure9/test.xml').elems('li')

Effect: gets all li elements from http://files.cnblogs.com/figure9/test.xml.
Result:

[
     <li class="odd">1</li>,
     <li class="even">2</li>,
     <li class="odd">3</li>,
     <li class="even">4</li>,
     <li class="odd">5</li>,
     <li class="even">6</li>
]

3.3 Chained selection

Url('http://files.cnblogs.com/figure9/test.xml').elem('div').elem('a')

Effect: gets the a element under the first div element from http://files.cnblogs.com/figure9/test.xml.
Tip: this only demonstrates chained selection; a better choice is Url('http://files.cnblogs.com/figure9/test.xml').elem('div/a'), which has the same effect.
Result:

<a href="http://www.google.com/" class="good">Google</a>

Url('http://files.cnblogs.com/figure9/test.xml').elems('div')[-1].elems('li[class=odd]')[-1]

Effect: gets the last li element whose class attribute is odd, inside the last div element, from http://files.cnblogs.com/figure9/test.xml.
Tip: combining elems with Python's list operations gives a lot of expressive power.
Result:

Url('http://files.cnblogs.com/figure9/test.xml').elems('div')[-1].elems('li')[::2]

Effect: gets the odd-numbered li elements (the 1st, 3rd, and 5th) of the last div element from http://files.cnblogs.com/figure9/test.xml.
Tip: Don't forget the slices!
Result:

[
     <li class="odd">1</li>,
     <li class="odd">3</li>,
     <li class="odd">5</li>
]
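The indexing and slicing used above are ordinary Python list operations. As a small worked example (Python 3), plain strings stand in for the six li elements of the sample page; the variable names are only illustrative.

```python
# Stand-ins for the six li elements and their class attributes.
items = ["1", "2", "3", "4", "5", "6"]
classes = ["odd", "even"] * 3

print(items[-1])        # last element: 6
print(items[::2])       # every second element from the first: ['1', '3', '5']

# Keep only the "odd" items, then take the last one, as in li[class=odd] with [-1].
odd_only = [item for item, cls in zip(items, classes) if cls == "odd"]
print(odd_only[-1])     # 5
```

A negative index counts from the end, and a step of 2 in a slice keeps the 1st, 3rd, and 5th entries.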

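4, Getting attributes

Sometimes we need to work with local HTML files, so I added a Path class to lmth1 for handling local files.

Save the file at http://files.cnblogs.com/figure9/test.xml to your machine; it is assumed here to be saved as d:\test.xml.

Again, for convenience, enter the following code:

from lmth1 import Path

This saves typing the slightly odd-looking lmth1 prefix :)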

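4.1 Getting a single attribute

attr = Path(r'd:\test.xml').attr('li[class=even, @#(content)]')

Effect: gets the content of the first li element whose class attribute is even from the file d:\test.xml, and names it content.
Tip: a leading @ marks an attribute to retrieve, the name in () is the attribute's alias, and # stands for the element's content; the value can then be read directly by name.
Result:

attr =>
{
content: '2'
}

attr.content =>
'2'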

4.2 Getting multiple attributes

attrs = Path(r'd:\test.xml').attrs('a[@href(link), @class(category), @#(title)]')

Effect: gets the href attribute (aliased as link), the class attribute (aliased as category), and the content (aliased as title) of every a element in the file d:\test.xml.
Tip: a leading @ marks an attribute to retrieve, the name in () is the attribute's alias, and # stands for the element's content; the value can then be read directly by name.
Result:

attrs =>
[
     {
         category: u'good',
         link: u'http://www.google.com/',
         title: u'Google'
     },
     {
         category: u'good',
         link: u'http://www.yahoo.com/',
         title: u'Yahoo'
     },
     {
         category: u'asshole',
         link: u'http://www.baidu.com/',
         title: u'Baidu'
     },
     {
         category: u'excellent',
         link: u'http://www.bing.com/',
         title: u'Bing'
     }
]

attrs[-1] =>

{
     category: u'excellent',
     link: u'http://www.bing.com/',
     title: u'Bing'
}

print attrs[2].title, 'is an', attrs[2].category =>

Baidu is an asshole
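Reading a value as attrs[2].title rather than attrs[2]['title'] suggests a dictionary-like object with attribute access. How lmth1 implements this is not shown in the article; one common way to get that behaviour, sketched here for Python 3 with a hypothetical class name Record, is a dict subclass that falls back to its keys in __getattr__.

```python
class Record(dict):
    """A dict whose keys can also be read as attributes (illustrative only)."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


row = Record(category="excellent", link="http://www.bing.com/", title="Bing")
print(row.title, "is", row.category)   # Bing is excellent
print(row["link"] == row.link)         # True
```

Because __getattr__ is consulted only after regular lookup, dict methods such as keys() keep working, at the cost that an alias named like a dict method could not be read this way.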

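4.3 Chained selection

The attr and attrs selectors can be applied after the elem and elems selectors. Note that no other selector can be applied after attr or attrs.

Path(r'd:\test.xml').elems('div[id=content]/a')[::2].attrs('[@href(link), @class(category)]')

Effect: gets the href attribute (aliased as link) and the class attribute (aliased as category) of the odd-numbered a elements (the 1st and 3rd) under the div element whose id attribute is content, from the file d:\test.xml.
Tip: when an Hpath has no element name and consists only of the attributes to retrieve, the attributes are taken from the current element.

Result:

[
     {
         category: u'good',
         link: u'http://www.google.com/'
     },
     {
         category: u'asshole',
         link: u'http://www.baidu.com/'
     }
]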

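5, Other features

5.1 URL generation

Everyday HTML fetching often requires generating large numbers of URLs. Python can do this in a line or two, but to avoid needless repetition lmth1 provides a Urls class with two basic methods for creating Urls objects.
A Urls object holds a number of Url instances, each of which can be used directly for selection.

First run the following code:

from lmth1 import Urls

Generating URLs in bulk from numbers: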

Urls.from_indice('http://www.bing.com/page/', 1, 5)

Effect: using http://www.bing.com/page/ as the prefix, generates URLs with suffixes from 1 to 5
Result:

http://www.bing.com/page/1
http://www.bing.com/page/2
http://www.bing.com/page/3
http://www.bing.com/page/4
http://www.bing.com/page/5

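Urls.from_indice('http://www.bing.com/page/', 1, 5, 3)

Effect: using http://www.bing.com/page/ as the prefix, generates URLs with suffixes from 1 to 5, each number padded with zeros to a width of 3
Tip: the fourth argument sets the number width, which some sites require
Result:

http://www.bing.com/page/001
http://www.bing.com/page/002
http://www.bing.com/page/003
http://www.bing.com/page/004
http://www.bing.com/page/005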

Generating URLs in bulk from suffixes:

Urls.from_postfixes('http://www.baidu.com/', ['isfool', 'isasshole', 'ismoron'])

Effect: using http://www.baidu.com/ as the prefix, iterates over the list of suffixes to generate URLs in bulk
Result:

http://www.baidu.com/isfool
http://www.baidu.com/isasshole
http://www.baidu.com/ismoron
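For readers who want to see what the two generators amount to, here is a plain-Python sketch (Python 3) that returns lists of strings rather than Url objects. The function names urls_from_indices and urls_from_postfixes are hypothetical, and treating the end index as inclusive is an assumption based on the description above.

```python
def urls_from_indices(prefix, start, end, width=0):
    """Build prefix + number for every number from start to end, inclusive.

    width pads each number with leading zeros; 0 means no padding.
    """
    return [prefix + str(i).zfill(width) for i in range(start, end + 1)]


def urls_from_postfixes(prefix, postfixes):
    """Build prefix + postfix for every entry in postfixes."""
    return [prefix + postfix for postfix in postfixes]


for url in urls_from_indices("http://www.bing.com/page/", 1, 5, 3):
    print(url)
print(urls_from_postfixes("http://www.baidu.com/", ["isfool", "ismoron"]))
```

str.zfill leaves a number unchanged when it is already at least width characters long, so page 1000 with a width of 3 would stay 1000.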

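5.2 Character encoding

lmth1's default encoding is UTF-8, which suits the vast majority of sites. Some Chinese sites still come out garbled, however, so lmth1 lets you set the encoding manually. For such sites, switching the encoding to gb18030 can solve the problem.

Url('http://www.bing.com/', 'gb18030')

Effect: creates a Url object whose encoding is gb18030
Tip: the default encoding is UTF-8

Urls.from_indice('http://www.bing.com/page/', 1, 5, code_str='gb18030')

Effect: creates Url objects in bulk with gb18030 as their encoding
Tip: the default encoding is UTF-8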
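The garbling comes from decoding bytes with the wrong codec. The short sketch below (Python 3, standard library only, independent of lmth1) encodes a Chinese string as GB18030, as such a site might serve it, and then decodes it both ways.

```python
text = "\u4e2d\u6587\u7db2\u9801"     # a short Chinese string, written with escapes
raw = text.encode("gb18030")          # the bytes a GB18030 page would send

# Decoding with the wrong codec yields replacement characters, not the text.
garbled = raw.decode("utf-8", errors="replace")
# Decoding with the matching codec restores the original string.
restored = raw.decode("gb18030")

print(garbled == text)    # False
print(restored == text)   # True
```

GB18030 is a superset of GBK and GB2312, which is presumably why it works as a catch-all for older Chinese pages.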

6, References

1, BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
2, Martin Fowler: Domain-Specific Languages.
3, Internal-DSL: http://en.wikipedia.org/wiki/Domain-specific_language
4, Fluent Interface: http://en.wikipedia.org/wiki/Fluent_interface

Source code download:

http://files.cnblogs.com/figure9/lmth1withBS.7z
