Huang Cong: HtmlParser.Net, an HTML DOM Parsing Library for C# (Download)

Reposted article. 2016-01-19 17:48, C# Learning / Huang Cong

Download: HtmlParser.Net.rar

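Background:

HTMLParser began as an open-source Java project hosted on SourceForge. The library can parse HTML text either linearly or as a nested tree, and its power and open-source nature have attracted many people working on web information extraction. Many .NET developers, however, have been looking for an HTMLParser library they can use from .NET. This article introduces the Winista.HTMLParser library. Among the already small number of .NET HTMLParser ports, the class structure of the Winista version is arguably the closest to the original Java version.
The library currently comes in four editions: Ultimate, Pro, Lite, and Community. The first three are paid; only the Community edition can be downloaded free of charge with all of its source code.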

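(1) Filter classes
As the name suggests, a Filter filters the results so that you get only the content you need. HTMLParser defines 16 different filters in the org.htmlparser.filters package, and they fall into a few groups.
Predicate filters: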
TagNameFilter
HasAttributeFilter
HasChildFilter
HasParentFilter
HasSiblingFilter
IsEqualFilter
Logical filters:
AndFilter
NotFilter
OrFilter
XorFilter
Other filters:
NodeClassFilter
StringFilter
LinkStringFilter
LinkRegexFilter
RegexFilter
CssSelectorNodeFilter

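All filter classes implement the org.htmlparser.NodeFilter interface. The interface has a single main method:
boolean accept (Node node);
Each subclass implements this method to decide whether the given Node meets that filter's condition: it returns true if it does and false otherwise.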
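
To make the accept contract concrete, here is a minimal self-contained sketch. SimpleNode, SimpleFilter, and select are hypothetical stand-ins written for this illustration; they are not the org.htmlparser types, and the node list is made-up data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT the org.htmlparser types.
class AcceptSketch {
    /** A toy node: a tag name (e.g. "DIV") plus the raw text of the tag. */
    static final class SimpleNode {
        final String name;
        final String text;

        SimpleNode(String name, String text) {
            this.name = name;
            this.text = text;
        }
    }

    /** Same shape as the accept contract: answer yes or no for one node. */
    interface SimpleFilter {
        boolean accept(SimpleNode node);
    }

    /** Returns the nodes the filter accepts, in their original order. */
    static List<SimpleNode> select(List<SimpleNode> nodes, SimpleFilter filter) {
        List<SimpleNode> kept = new ArrayList<>();
        for (SimpleNode node : nodes) {
            if (filter.accept(node)) {
                kept.add(node);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<SimpleNode> nodes = Arrays.asList(
            new SimpleNode("BODY", "body"),
            new SimpleNode("DIV", "div id=\"top_main\""),
            new SimpleNode("A", "a href=\"http://www.xinlg.com\""));

        // A filter is just a predicate; this one accepts DIV nodes only.
        SimpleFilter divOnly = node -> node.name.equals("DIV");

        for (SimpleNode node : select(nodes, divOnly)) {
            System.out.println("getText:" + node.text);
        }
    }
}
```

Because a filter is only a yes/no test on one node, the traversal code never needs to know which concrete filter it was given.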

(2) Predicate filters
2.1 TagNameFilter
TagNameFilter is the easiest filter to understand: it filters by tag name.

The HTML file used for testing:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"><title>新靈感網站自動更新系統-www.xinlg.com</title></head>
<html xmlns="http://www.w3.org/1999/xhtml">
<body >
<div id="top_main">
<div id="logoindex">
<!--這是注釋-->
新靈感網站自動更新系統-www.xinlg.com
<a href="http://www.xinlg.com">新靈感網站自動更新系統-www.xinlg.com</a>
</div>
新靈感網站自動更新系統-www.xinlg.com
</div>
</body>
</html>
Test source code (only the main method is listed here; for the complete source see "HTMLParser Getting Started (2): Node Contents" and add the import section yourself):
public static void main(String[] args) {
    try {
        // Parse the test page served from the local web server.
        Parser parser = new Parser( (HttpURLConnection) (new URL("http://127.0.0.1:8080/HTMLParserTester.html")).openConnection() );

        // This is the part that drives the test; the later examples change only these lines.
        NodeFilter filter = new TagNameFilter ("DIV");
        NodeList nodes = parser.extractAllNodesThatMatch(filter);

        if (nodes != null) {
            for (int i = 0; i < nodes.size(); i++) {
                Node textnode = (Node) nodes.elementAt(i);

                // message(...) is the print helper from the complete listing.
                message("getText:" + textnode.getText());
                message("=================================================");
            }
        }
    }
    catch (Exception e) {
        e.printStackTrace();
    }
}
Output:
getText:div id="top_main"
=================================================
getText:div id="logoindex"
=================================================
As you can see, both DIV nodes in the file were extracted. You can now work with these two DIV nodes.

2.2 HasChildFilter
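Next, let's look at HasChildFilter. When I first saw this filter, I took it for granted that it returns tags that have children, so I instantiated one directly:
NodeFilter filter = new HasChildFilter();
But the call NodeList nodes = parser.extractAllNodesThatMatch(filter); threw a NullPointerException inside HasChildFilter. After reading the HasChildFilter source, I realised that it actually returns nodes that have a child matching some condition, so it needs another filter as a parameter to test the children. The default constructor does create an instance, but because the child filter is null, using it throws the exception. Judging by this, the HTMLParser source still has room for improvement.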
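
One common way to avoid this kind of late failure is to reject a null argument in the constructor. The sketch below shows the idea with a hypothetical ChildMatcher class invented for this illustration; it is not how HasChildFilter is written, only an example of failing fast.

```java
import java.util.Objects;
import java.util.function.Predicate;

class NullGuardSketch {
    /** Hypothetical wrapper that refuses a null child test at construction time. */
    static final class ChildMatcher {
        private final Predicate<String> childTest;

        ChildMatcher(Predicate<String> childTest) {
            // Fail here, with a clear message, instead of later during filtering.
            this.childTest = Objects.requireNonNull(childTest, "childTest must not be null");
        }

        boolean matchesChild(String childName) {
            return childTest.test(childName);
        }
    }

    /** Tries to build a ChildMatcher and reports what happened. */
    static String tryCreate(Predicate<String> childTest) {
        try {
            new ChildMatcher(childTest);
            return "ok";
        } catch (NullPointerException e) {
            return "rejected: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryCreate(name -> name.equals("DIV"))); // prints "ok"
        System.out.println(tryCreate(null)); // prints "rejected: childTest must not be null"
    }
}
```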

Modified source code:
NodeFilter innerFilter = new TagNameFilter ("DIV");
NodeFilter filter = new HasChildFilter(innerFilter);
NodeList nodes = parser.extractAllNodesThatMatch(filter);
Output:
getText:body
=================================================
getText:div id="top_main"
=================================================
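The output is the two tags that have a DIV child tag (body has the child DIV "top_main", and "top_main" has the child "logoindex").

Note that HasChildFilter has another constructor:
public HasChildFilter (NodeFilter filter, boolean recursive)
If recursive is false, only first-level children are tested. In the previous example, body and top_main each have a DIV node among their first-level children, so they matched. If we call it like this instead:
NodeFilter filter = new HasChildFilter( innerFilter, true );
Output: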
getText:html xmlns="http://www.w3.org/1999/xhtml"
=================================================
getText:body
=================================================
getText:div id="top_main"
=================================================
The output now has one more entry, html xmlns="http://www.w3.org/1999/xhtml", which is the root node of the whole HTML page. It has no DIV node directly beneath it, but its child body does, so it matches as well.
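
The difference between the two modes can be sketched on a toy tree. TreeNode, hasChild, and matches are hypothetical names for this illustration, and the tree is a hand-built simplification of the test page (text nodes and the head are left out); it is not the library's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class HasChildSketch {
    /** Toy tree node (hypothetical): a label and its child nodes. */
    static final class TreeNode {
        final String name;
        final List<TreeNode> children;

        TreeNode(String name, TreeNode... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    /** True if a child passes the test; with recursive=true, any descendant counts. */
    static boolean hasChild(TreeNode node, Predicate<TreeNode> childTest, boolean recursive) {
        for (TreeNode child : node.children) {
            if (childTest.test(child)) {
                return true;
            }
            if (recursive && hasChild(child, childTest, true)) {
                return true;
            }
        }
        return false;
    }

    /** Walks the tree in document order and records every node that matches. */
    static void collect(TreeNode node, Predicate<TreeNode> childTest, boolean recursive, List<String> out) {
        if (hasChild(node, childTest, recursive)) {
            out.add(node.name);
        }
        for (TreeNode child : node.children) {
            collect(child, childTest, recursive, out);
        }
    }

    /** Names of the nodes that have a DIV child, on a simplified copy of the test page. */
    static List<String> matches(boolean recursive) {
        TreeNode tree = new TreeNode("html",
            new TreeNode("body",
                new TreeNode("div#top_main",
                    new TreeNode("div#logoindex", new TreeNode("a")))));
        List<String> out = new ArrayList<>();
        collect(tree, n -> n.name.startsWith("div"), recursive, out);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(matches(false)); // prints [body, div#top_main]
        System.out.println(matches(true));  // prints [html, body, div#top_main]
    }
}
```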

2.3 HasAttributeFilter
HasAttributeFilter has three constructors:
public HasAttributeFilter ();
public HasAttributeFilter (String attribute);
public HasAttributeFilter (String attribute, String value);
This filter matches nodes that have an attribute with the given name, or nodes whose given attribute has the given value. An example makes this easier to see.

Call 1:
NodeFilter filter = new HasAttributeFilter();
NodeList nodes = parser.extractAllNodesThatMatch(filter);
Output:

Nothing is output.

Call 2:
NodeFilter filter = new HasAttributeFilter( "id" );
NodeList nodes = parser.extractAllNodesThatMatch(filter);
Output:
getText:div id="top_main"
=================================================
getText:div id="logoindex"
=================================================

Call 3:
NodeFilter filter = new HasAttributeFilter( "id", "logoindex" );
NodeList nodes = parser.extractAllNodesThatMatch(filter);
Output:
getText:div id="logoindex"
=================================================

Simple enough.
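
The three calls can be summarised as one matching rule. The sketch below models it on a plain attribute map; the matches helper is hypothetical, it compares attribute names exactly as given, and its "no name means no match" branch simply mirrors the empty output of call 1 rather than the library's internals.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class AttributeMatchSketch {
    /**
     * Hypothetical matching rule modelled on the three calls:
     * no name -> nothing matches; name only -> attribute must exist;
     * name and value -> attribute must exist and equal the value.
     */
    static boolean matches(Map<String, String> attrs, String name, String value) {
        if (name == null) {
            return false;
        }
        if (!attrs.containsKey(name)) {
            return false;
        }
        return value == null || value.equals(attrs.get(name));
    }

    public static void main(String[] args) {
        // Attributes of <div id="logoindex"> from the test page.
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("id", "logoindex");

        System.out.println(matches(attrs, null, null));        // prints false
        System.out.println(matches(attrs, "id", null));        // prints true
        System.out.println(matches(attrs, "id", "logoindex")); // prints true
        System.out.println(matches(attrs, "id", "top_main"));  // prints false
    }
}
```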

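2.4 Other predicate filters
HasParentFilter and HasSiblingFilter work much like HasChildFilter; try them yourself and they should become clear.

The IsEqualFilter constructor takes a Node:
public IsEqualFilter (Node node) {
    mNode = node;
}
The accept method is just as simple:
public boolean accept (Node node) {
    return (mNode == node);
}
The comparison is by reference, so the filter accepts only that exact Node object. No further explanation is needed.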
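
Since accept uses ==, two nodes with identical content are still different as far as this filter is concerned. The following sketch shows the difference between reference comparison and equals with a hypothetical Item class; it is plain Java and does not use any HTMLParser type.

```java
class IdentitySketch {
    /** Hypothetical value class whose equals compares content. */
    static final class Item {
        final String text;

        Item(String text) {
            this.text = text;
        }

        @Override
        public boolean equals(Object other) {
            return other instanceof Item && ((Item) other).text.equals(text);
        }

        @Override
        public int hashCode() {
            return text.hashCode();
        }
    }

    /** The same test as accept above: true only for the very same object. */
    static boolean sameInstance(Object wanted, Object candidate) {
        return wanted == candidate;
    }

    public static void main(String[] args) {
        Item first = new Item("div id=\"logoindex\"");
        Item copy = new Item("div id=\"logoindex\"");

        System.out.println(first.equals(copy));         // prints true: same content
        System.out.println(sameInstance(first, copy));  // prints false: different objects
        System.out.println(sameInstance(first, first)); // prints true: same object
    }
}
```

In practice this means the Node passed to the constructor has to come from the same parse as the nodes being filtered.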

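(3) Logical filters
The filters introduced so far are simple ones that test a single kind of condition. HTMLParser lets you combine simple filters to express complex conditions, on the same principle as the logical operators of ordinary programming languages.
3.1 AndFilter
AndFilter combines two filters; only nodes that satisfy both conditions are matched.
Test source code:
// filterA is not defined in the original listing; a TagNameFilter on "A" is assumed here because it is consistent with the output shown.
NodeFilter filterA = new TagNameFilter( "A" );
NodeFilter filterID = new HasAttributeFilter( "id" );
NodeFilter filterChild = new HasChildFilter(filterA);
NodeFilter filter = new AndFilter(filterID, filterChild);
Output: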
getText:div id="logoindex"
=================================================

3.2 OrFilter
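Replace the AndFilter above with an OrFilter.
Test source code:
// filterA is not defined in the original listing; a TagNameFilter on "A" is assumed here because it is consistent with the output shown.
NodeFilter filterA = new TagNameFilter( "A" );
NodeFilter filterID = new HasAttributeFilter( "id" );
NodeFilter filterChild = new HasChildFilter(filterA);
NodeFilter filter = new OrFilter(filterID, filterChild);
Output: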
getText:div id="top_main"
=================================================
getText:div id="logoindex"
=================================================

3.3 NotFilter
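Wrap the OrFilter from 3.2 in a NotFilter.
Test source code:
// filterA is not defined in the original listing; a TagNameFilter on "A" is assumed here because it is consistent with the output shown.
NodeFilter filterA = new TagNameFilter( "A" );
NodeFilter filterID = new HasAttributeFilter( "id" );
NodeFilter filterChild = new HasChildFilter(filterA);
NodeFilter filter = new NotFilter(new OrFilter(filterID, filterChild));
Output: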
getText:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
=================================================
getText:

=================================================
getText:head
=================================================
getText:meta http-equiv="Content-Type" content="text/html; charset=gb2312"
=================================================
getText:title
=================================================
getText:新靈感網站自動更新系統-www.xinlg.com
=================================================
getText:/title
=================================================
getText:/head
=================================================
getText:

=================================================
getText:html xmlns="http://www.w3.org/1999/xhtml"
=================================================
getText:

=================================================
getText:body
=================================================
getText:

=================================================
getText:

=================================================
getText:這是注釋
=================================================
getText:
新靈感網站自動更新系統-www.xinlg.com

=================================================
getText:a href="http://www.xinlg.com"
=================================================
getText:新靈感網站自動更新系統-www.xinlg.com
=================================================
getText:/a
=================================================
getText:

=================================================
getText:/div
=================================================
getText:
新靈感網站自動更新系統-www.xinlg.com

=================================================
getText:/div
=================================================
getText:

=================================================
getText:/body
=================================================
getText:

=================================================
getText:/html
=================================================
getText:

=================================================

Every node except the tags output in 3.2 appears here.

3.4 XorFilter
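Replace the AndFilter above with an XorFilter.
Test source code:
// filterA is not defined in the original listing; a TagNameFilter on "A" is assumed here because it is consistent with the output shown.
NodeFilter filterA = new TagNameFilter( "A" );
NodeFilter filterID = new HasAttributeFilter( "id" );
NodeFilter filterChild = new HasChildFilter(filterA);
NodeFilter filter = new XorFilter(filterID, filterChild);
Output: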
getText:div id="top_main"
=================================================
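
The four combinations above follow ordinary boolean logic, which can be mimicked with java.util.function.Predicate. In this sketch the Tag class and its two flags are hypothetical: each toy tag records whether it has an id attribute and whether it has a direct A child, which is an assumed reading of the test page, and text nodes are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class LogicSketch {
    /** Toy tag (hypothetical): a label and the two facts the example filters test. */
    static final class Tag {
        final String label;
        final boolean hasId;
        final boolean hasAnchorChild;

        Tag(String label, boolean hasId, boolean hasAnchorChild) {
            this.label = label;
            this.hasId = hasId;
            this.hasAnchorChild = hasAnchorChild;
        }
    }

    // Assumed summary of the tags in the test page.
    static final List<Tag> TAGS = Arrays.asList(
        new Tag("body", false, false),
        new Tag("top_main", true, false),
        new Tag("logoindex", true, true),
        new Tag("a", false, false));

    static final Predicate<Tag> HAS_ID = tag -> tag.hasId;
    static final Predicate<Tag> HAS_A_CHILD = tag -> tag.hasAnchorChild;

    /** Exclusive or: exactly one of the two tests is true. */
    static Predicate<Tag> xor(Predicate<Tag> left, Predicate<Tag> right) {
        return tag -> left.test(tag) ^ right.test(tag);
    }

    /** Labels of the tags accepted by the given predicate, in document order. */
    static List<String> select(Predicate<Tag> filter) {
        List<String> labels = new ArrayList<>();
        for (Tag tag : TAGS) {
            if (filter.test(tag)) {
                labels.add(tag.label);
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        System.out.println(select(HAS_ID.and(HAS_A_CHILD)));         // prints [logoindex]
        System.out.println(select(HAS_ID.or(HAS_A_CHILD)));          // prints [top_main, logoindex]
        System.out.println(select(HAS_ID.or(HAS_A_CHILD).negate())); // prints [body, a]
        System.out.println(select(xor(HAS_ID, HAS_A_CHILD)));        // prints [top_main]
    }
}
```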

(4) Other filters
4.1 NodeClassFilter
This filter tests whether a node is of a particular Node type. The different Node types were covered in "HTMLParser Getting Started (2): Node Contents"; this filter lets you filter by type.
Test source code:
NodeFilter filter = new NodeClassFilter(RemarkNode.class);
NodeList nodes = parser.extractAllNodesThatMatch(filter);
Output:
getText:這是注釋
=================================================
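Only the RemarkNode (the comment) was output.

4.2 StringFilter
This filter matches nodes whose displayed string contains the given text. Note that it looks only at displayable strings; text in non-displayed content (comments, link targets, and so on) is not matched.
Modify the sample HTML slightly: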
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"><title>新靈感網站自動更新系統-title-www.xinlg.com</title></head>
<html xmlns="http://www.w3.org/1999/xhtml">
<body >
<div id="top_main">
<div id="logoindex">
<!--這是注釋-->
新靈感網站自動更新系統-字符串1-www.xinlg.com
<a href="http://www.xinlg.com">新靈感網站自動更新系統-鏈接文本-www.xinlg.com</a>
</div>
新靈感網站自動更新系統-字符串2-www.xinlg.com
</div>
</body>
</html>

Test source code:
NodeFilter filter = new StringFilter("www.xinlg.com");
NodeList nodes = parser.extractAllNodesThatMatch(filter);
Output:
getText:新靈感網站自動更新系統-title-www.xinlg.com

=================================================
getText:
新靈感網站自動更新系統-字符串1-www.xinlg.com

=================================================
getText:新靈感網站自動更新系統-鏈接文本-www.xinlg.com

=================================================
getText:
新靈感網站自動更新系統-字符串2-www.xinlg.com
=================================================
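The nodes for the title, the two content strings, and the link text were all output, but the comment and the link tag itself were not.

4.3 LinkStringFilter
This filter tests whether a link contains a given string; it can be used to pick out links that point to a particular site.
Test source code:
NodeFilter filter = new LinkStringFilter("www.xinlg.com");
NodeList nodes = parser.extractAllNodesThatMatch(filter);
Output: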
getText:a href="http://www.xinlg.com"
=================================================

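4.4 The remaining filters
The remaining filters also test different fields against a string; the main difference from the ones above is that they support regular expressions. That is beyond the scope of this article, so try them out for yourself.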
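
For a rough idea of what regular-expression support adds over plain substring matching, here is a small sketch using only java.util.regex. The containing and matching helpers and the sample URLs are hypothetical; the sketch does not call LinkStringFilter, LinkRegexFilter, or any other HTMLParser class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class LinkMatchSketch {
    /** Substring test: keeps links that contain the given text. */
    static List<String> containing(List<String> links, String needle) {
        List<String> kept = new ArrayList<>();
        for (String link : links) {
            if (link.contains(needle)) {
                kept.add(link);
            }
        }
        return kept;
    }

    /** Regex test: keeps links in which the pattern is found anywhere. */
    static List<String> matching(List<String> links, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> kept = new ArrayList<>();
        for (String link : links) {
            if (pattern.matcher(link).find()) {
                kept.add(link);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up sample links for the illustration.
        List<String> links = Arrays.asList(
            "http://www.xinlg.com",
            "http://example.org/xinlg",
            "https://www.xinlg.com/news");

        // prints [http://www.xinlg.com, https://www.xinlg.com/news]
        System.out.println(containing(links, "www.xinlg.com"));
        // prints [https://www.xinlg.com/news]
        System.out.println(matching(links, "^https://"));
    }
}
```

A pattern can express conditions such as "starts with https" that a substring test cannot.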
