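Overview
When programming, and especially when writing web crawlers, you often need to parse HTML and extract the useful data from it. A good tool helps a great deal here, and several are available online, for example htmlcleaner and htmlparser.
Having used and compared both, I find htmlcleaner easier to use than htmlparser; its XPath support in particular is very handy.
htmlcleaner download: htmlcleaner2_1.jar. Source download: htmlcleaner2_1-all.zip
The example below uses htmlcleaner to meet these requirements: extract the title, the links with name="my_href", and the content of every li under the div with class="d_1".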
html-clean-demo.html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=GBK" />
<meta http-equiv="Content-Language" content="zh-CN" />
<title>html clean demo</title>
</head>
<body>
<div class="d_1">
    <ul>
        <li>bar</li>
        <li>foo</li>
        <li>gzz</li>
    </ul>
</div>
<div>
    <ul>
        <li><a name="my_href" href="1.html">text-1</a></li>
        <li><a name="my_href" href="2.html">text-2</a></li>
        <li><a name="my_href" href="3.html">text-3</a></li>
        <li><a name="my_href" href="4.html">text-4</a></li>
    </ul>
</div>
</body>
</html>
HtmlCleanerDemo.java
package com.chenlb;

import java.io.File;

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

/**
 * htmlcleaner usage example.
 */
public class HtmlCleanerDemo {

    public static void main(String[] args) throws Exception {
        HtmlCleaner cleaner = new HtmlCleaner();

        // Parse the sample file, reading it as GBK to match its declared charset.
        TagNode node = cleaner.clean(new File("html/html-clean-demo.html"), "GBK");

        // Select by tag name: the title.
        Object[] ns = node.getElementsByName("title", true);
        if (ns.length > 0) {
            System.out.println("title=" + ((TagNode) ns[0]).getText());
        }

        System.out.println("ul/li:");
        // Select by XPath: every li under the div with class="d_1".
        ns = node.evaluateXPath("//div[@class='d_1']//li");
        for (Object on : ns) {
            TagNode n = (TagNode) on;
            System.out.println("\ttext=" + n.getText());
        }

        System.out.println("a:");
        // Select by attribute value: every element with name="my_href".
        ns = node.getElementsByAttValue("name", "my_href", true, true);
        for (Object on : ns) {
            TagNode n = (TagNode) on;
            System.out.println("\thref=" + n.getAttributeByName("href") + ", text=" + n.getText());
        }
    }
}
The argument to cleaner.clean() can be a file, a URL, or a string of HTML content.
The most commonly used methods are probably evaluateXPath, getElementsByAttValue, and getElementsByName. One more note: htmlcleaner copes well with malformed HTML.
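To make it easier to see what the XPath expression in the demo selects, here is a minimal sketch that evaluates the same expression with the JDK's built-in javax.xml.xpath API instead of htmlcleaner, so it runs without any extra jar. The class name XPathSketch, the helper select, and the shortened inline document are assumptions made for illustration and are not part of the original example. Unlike htmlcleaner, the JDK parser only accepts well-formed XML, which is why the input here is a tidy fragment.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathSketch {

    // Hypothetical well-formed fragment modelled on html-clean-demo.html
    // (DOCTYPE left out so the parser does not try to fetch a DTD).
    static final String XML =
        "<html><body>"
        + "<div class=\"d_1\"><ul><li>bar</li><li>foo</li><li>gzz</li></ul></div>"
        + "<div><ul><li><a name=\"my_href\" href=\"1.html\">text-1</a></li></ul></div>"
        + "</body></html>";

    // Evaluate an XPath expression against XML and return the text of each matched node.
    static List<String> select(String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            texts.add(nodes.item(i).getTextContent());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        // Same expression as in HtmlCleanerDemo: only li elements inside div.d_1 match.
        System.out.println(select("//div[@class='d_1']//li"));    // prints [bar, foo, gzz]
        // Standard XPath can also select the href attribute of the named links directly.
        System.out.println(select("//a[@name='my_href']/@href")); // prints [1.html]
    }
}
```

The li that wraps the link is not matched by the first expression because its parent div has no class="d_1". Whether htmlcleaner's evaluateXPath accepts every expression the JDK does is not something this sketch checks.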