Overview
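When programming, or when writing a web crawler, you often need to parse HTML and extract the useful data from it. A good tool helps a great deal, and many are available online, for example htmlcleaner and htmlparser.
Having used and compared both, I find htmlcleaner easier to use than htmlparser; its XPath support is especially handy.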
htmlcleaner download: htmlcleaner2_1.jar. Source download: htmlcleaner2_1-all.zip
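The example below uses htmlcleaner to meet the following requirements: extract the title, the links with name="my_href", and the content of every li under the div with class="d_1".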
html-clean-demo.html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=GBK" />
    <meta http-equiv="Content-Language" content="zh-CN" />
    <title>html clean demo</title>
</head>
<body>
<div class="d_1">
    <ul>
        <li>bar</li>
        <li>foo</li>
        <li>gzz</li>
    </ul>
</div>
<div>
    <ul>
        <li><a name="my_href" href="1.html">text-1</a></li>
        <li><a name="my_href" href="2.html">text-2</a></li>
        <li><a name="my_href" href="3.html">text-3</a></li>
        <li><a name="my_href" href="4.html">text-4</a></li>
    </ul>
</div>
</body>
</html>
HtmlCleanerDemo.java
package com.chenlb;

import java.io.File;

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

/**
 * htmlcleaner usage example.
 */
public class HtmlCleanerDemo {

    public static void main(String[] args) throws Exception {
        HtmlCleaner cleaner = new HtmlCleaner();

        // Parse the demo file, reading it as GBK to match its declared charset.
        TagNode node = cleaner.clean(new File("html/html-clean-demo.html"), "GBK");

        // Select by tag name (recursive search): the page title.
        Object[] ns = node.getElementsByName("title", true);
        if (ns.length > 0) {
            System.out.println("title=" + ((TagNode) ns[0]).getText());
        }

        // Select by XPath: every li under the div with class="d_1".
        System.out.println("ul/li:");
        ns = node.evaluateXPath("//div[@class='d_1']//li");
        for (Object on : ns) {
            TagNode n = (TagNode) on;
            System.out.println("\ttext=" + n.getText());
        }

        // Select by attribute value: every element with name="my_href".
        System.out.println("a:");
        ns = node.getElementsByAttValue("name", "my_href", true, true);
        for (Object on : ns) {
            TagNode n = (TagNode) on;
            System.out.println("\thref=" + n.getAttributeByName("href") + ", text=" + n.getText());
        }
    }
}
The argument to cleaner.clean() can be a file, a URL, or a string of HTML content.
The most commonly used methods are probably evaluateXPath, getElementsByAttValue, and getElementsByName. It is also worth noting that htmlcleaner copes fairly well with non-standard (malformed) HTML.
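To try out the XPath expressions without the htmlcleaner jar, here is a minimal sketch that runs the same kind of query with the JDK's built-in javax.xml.xpath API instead of htmlcleaner. The class name XPathSketch, its select method, and the DEMO constant are hypothetical names made up for this illustration, and DEMO is a shortened copy of the demo page. Unlike htmlcleaner, the JDK parser only accepts well-formed XML, so this is a way to test expressions, not a replacement for the demo above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: evaluates XPath expressions using only the JDK. */
final class XPathSketch {

    // Shortened, well-formed copy of html-clean-demo.html. The DOCTYPE is left
    // out so the parser never tries to fetch the external DTD.
    static final String DEMO =
        "<html><head><title>html clean demo</title></head><body>"
      + "<div class=\"d_1\"><ul><li>bar</li><li>foo</li><li>gzz</li></ul></div>"
      + "<div><ul>"
      + "<li><a name=\"my_href\" href=\"1.html\">text-1</a></li>"
      + "<li><a name=\"my_href\" href=\"2.html\">text-2</a></li>"
      + "</ul></div></body></html>";

    /** Returns the trimmed text of every node matched by the expression. */
    static List<String> select(String xml, String expression) {
        try {
            // Parse the markup into a DOM tree.
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            // Evaluate the expression and ask for a node set.
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // For an attribute node, getTextContent() is the attribute value.
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }
}
```

For instance, XPathSketch.select(XPathSketch.DEMO, "//div[@class='d_1']//li") returns the three items bar, foo, gzz, and the expression "//a[@name='my_href']/@href" returns the link targets 1.html and 2.html.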