Jsoup: An HTML Parsing Tool for Java Web Crawlers

Reposted article. 2019-06-21 17:34, Java

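Introduction to Jsoup

Tools for parsing HTML documents in a Java crawler include htmlparser and Jsoup. This article walks through how to use Jsoup, so that you can get HTML parsing for a Java crawler working in about 10 minutes.

Jsoup can parse a URL or a string of HTML directly, and it provides a rich API for working with the DOM tree. If you have used jQuery, it will feel very familiar.

Jsoup's strongest feature is arguably its CSS selector support. For example: document.select("div.content > div#image > ul > li:eq(2)"), which picks the li at index 2 (the index in :eq is zero-based).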

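Adding the dependency

Maven

Add the dependency declaration below. The latest release at the time of writing was 1.12.1; the snippets here use 1.11.3.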

<dependency>
  <!-- jsoup HTML parser library @ https://jsoup.org/ -->
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.11.3</version>
</dependency>

Gradle

// jsoup HTML parser library @ https://jsoup.org/
// Gradle versions older than 3.4 use the 'compile' configuration instead
implementation 'org.jsoup:jsoup:1.11.3'

Installing from source

You can also download the jar directly from https://jsoup.org/download, or build and install it from source:

# Get the source with git
git clone https://github.com/jhy/jsoup.git
cd jsoup
mvn install

# Or download the source archive
curl -Lo jsoup.zip https://github.com/jhy/jsoup/archive/master.zip
unzip jsoup.zip
cd jsoup-master
mvn install

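Parsing with Jsoup

Jsoup can build a Document from four kinds of input:

An HTML string
A body fragment
A URL
A file

Parsing a string

The string should be a complete HTML document. If the head or body element is missing, Jsoup creates it in the resulting Document.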

String html = "<html><head><title>First parse</title></head>"
  + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
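
As a minimal sketch of how the parser fills in missing structure, the snippet below parses a string that has no html, head, or body wrapper. The class name ParseStringSketch and the sample markup are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseStringSketch {
    // Parses markup that lacks the <html>, <head>, and <body> wrappers.
    static Document parseBare() {
        return Jsoup.parse("<title>First parse</title><p>Parsed HTML into a doc.</p>");
    }

    public static void main(String[] args) {
        Document doc = parseBare();
        // The <title> ends up in a head element created by the parser.
        System.out.println(doc.title());                    // First parse
        // The <p> ends up in a body element created by the parser.
        System.out.println(doc.body().select("p").text());  // Parsed HTML into a doc.
    }
}
```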

Parsing an HTML fragment

String html = "<div><p>Lorem ipsum.</p>";
Document doc = Jsoup.parseBodyFragment(html);
Element body = doc.body();
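
The fragment above leaves its div unclosed. A small sketch of what comes back from parseBodyFragment in that case, using a hypothetical helper class FragmentSketch:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FragmentSketch {
    // Parses a body fragment and returns the body element that wraps it.
    static Element parseFragment(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        return doc.body();
    }

    public static void main(String[] args) {
        Element body = parseFragment("<div><p>Lorem ipsum.</p>");
        // The fragment becomes the content of body: one div holding one p.
        System.out.println(body.children().size());         // 1
        System.out.println(body.select("div > p").text());  // Lorem ipsum.
        // Printing body.html() shows the div with its closing tag restored.
        System.out.println(body.html());
    }
}
```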

Parsing from a URL

Document doc = Jsoup.connect("http://example.com/").get();
String title = doc.title();

You can also send cookies and other request parameters:

Document doc = Jsoup.connect("http://example.com")
  .data("query", "Java")
  .userAgent("Mozilla")
  .cookie("auth", "token")
  .timeout(3000)
  .post();
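
The chained calls only configure the request; nothing is sent until get() or post() runs. The sketch below builds the same request and inspects it without touching the network. The class name RequestConfigSketch is an assumption for illustration.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class RequestConfigSketch {
    // Builds the request from the example above without sending it.
    static Connection buildRequest() {
        return Jsoup.connect("http://example.com")
                .data("query", "Java")    // request parameter
                .userAgent("Mozilla")     // User-Agent header
                .cookie("auth", "token")  // cookie sent with the request
                .timeout(3000);           // timeout in milliseconds
    }

    public static void main(String[] args) {
        Connection.Request request = buildRequest().request();
        System.out.println(request.timeout());       // 3000
        System.out.println(request.cookie("auth"));  // token
        // Calling get() or post() on the Connection performs the HTTP call;
        // both can throw IOException, so real code needs to handle it.
    }
}
```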

Parsing from a file

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
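
The third argument is the base URI, which Jsoup uses to resolve relative links in the file. A self-contained sketch, assuming a temporary file and a made-up link path:

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseFileSketch {
    // Writes a tiny page to a temp file, parses it, and resolves its first link.
    static String firstAbsoluteLink(String baseUri) throws IOException {
        File input = File.createTempFile("jsoup-sketch", ".html");
        input.deleteOnExit();
        String page = "<a href=\"/docs/start.html\">Start</a>";
        Files.write(input.toPath(), page.getBytes(StandardCharsets.UTF_8));

        Document doc = Jsoup.parse(input, "UTF-8", baseUri);
        Element link = doc.selectFirst("a[href]");
        // attr("href") would return the relative path as written in the file;
        // absUrl("href") resolves it against the base URI.
        return link.absUrl("href");
    }

    public static void main(String[] args) throws IOException {
        System.out.println(firstAbsoluteLink("http://example.com/"));
        // http://example.com/docs/start.html
    }
}
```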

Traversing the DOM tree with Jsoup

Using the standard DOM methods

Jsoup wraps and implements the element traversal methods commonly used in the DOM:

Find an element by id: getElementById(String id)
Find elements by tag: getElementsByTag(String tag)
Find elements by class: getElementsByClass(String className)
Find elements by attribute: getElementsByAttribute(String key)
Sibling traversal: siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling()
Moving between levels: parent(), children(), child(int index)
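
The traversal methods listed above can be sketched on a small document. The markup and the class name TraversalSketch are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TraversalSketch {
    // A div with three element children: two links and a span.
    static Document sample() {
        return Jsoup.parse("<div id=\"menu\">"
                + "<a class=\"item\" href=\"/a\">A</a>"
                + "<a class=\"item\" href=\"/b\">B</a>"
                + "<span title=\"t\">C</span>"
                + "</div>");
    }

    public static void main(String[] args) {
        Document doc = sample();
        Element menu = doc.getElementById("menu");
        Element firstLink = doc.getElementsByTag("a").first();

        System.out.println(doc.getElementsByClass("item").size());       // 2
        System.out.println(doc.getElementsByAttribute("title").size());  // 1
        System.out.println(firstLink.nextElementSibling().text());       // B
        System.out.println(firstLink.lastElementSibling().tagName());    // span
        System.out.println(firstLink.siblingElements().size());          // 2
        System.out.println(menu.children().size());                      // 3
        System.out.println(menu.child(2).parent().id());                 // menu
    }
}
```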

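These methods return Element or Elements objects, which provide the following methods for reading their properties:

attr(String key): get the value of an attribute
attributes(): get all attributes of the node
id(): get the node's id
className(): get the node's class attribute as a single string
classNames(): get all of the node's class names as a set
text(): get the combined text of the node and its descendants
html(): get the node's inner HTML
outerHtml(): get the node's outer HTML
data(): get the node's data content, for tags such as script or style
tag(): get the node's Tag object
tagName(): get the node's tag name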
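
A short sketch of these accessors on a link and a script element. The markup and the class name AccessorSketch are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AccessorSketch {
    static Document sample() {
        return Jsoup.parse("<a id=\"home\" class=\"nav active\" href=\"/index.html\">"
                + "Home <b>page</b></a>"
                + "<script>var x = 1;</script>");
    }

    public static void main(String[] args) {
        Document doc = sample();
        Element link = doc.getElementById("home");
        Element script = doc.selectFirst("script");

        System.out.println(link.attr("href"));          // /index.html
        System.out.println(link.attributes().size());   // 3
        System.out.println(link.className());           // nav active
        System.out.println(link.classNames().size());   // 2
        System.out.println(link.text());                // Home page
        System.out.println(link.html());                // inner HTML, including the <b> tag
        System.out.println(link.outerHtml());           // the <a> element itself plus its content
        System.out.println(link.tagName());             // a
        // Script content is exposed through data(), not text().
        System.out.println(script.data());              // var x = 1;
    }
}
```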

With these APIs, working with the DOM is as convenient as it is with jQuery.

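Powerful CSS selector support

You might point out that htmlparser supports XPath, which makes it easy to locate an element without walking the DOM tree level by level. Jsoup offers the same convenience through CSS selectors, called as follows:

document.select(String selector): select the elements matching the selector; returns an Elements object
document.selectFirst(String selector): select the first element matching the selector; returns a single Element, or null if nothing matches
element.select(String selector): selection can also be run directly on an Element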
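
The three entry points can be sketched as follows. The markup and the class name SelectSketch are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectSketch {
    static Document sample() {
        return Jsoup.parse(
                "<div class=\"post\"><h2>One</h2><a href=\"/1\">read</a></div>"
              + "<div class=\"post\"><h2>Two</h2><a href=\"/2\">read</a></div>");
    }

    public static void main(String[] args) {
        Document doc = sample();

        // select: every match, as an Elements collection
        Elements posts = doc.select("div.post");
        System.out.println(posts.size());                 // 2

        // selectFirst: only the first match
        Element firstTitle = doc.selectFirst("div.post h2");
        System.out.println(firstTitle.text());            // One

        // select on an Element: the search is limited to that element's subtree
        Element second = posts.get(1);
        System.out.println(second.select("a").attr("href"));  // /2

        // selectFirst returns null when nothing matches
        System.out.println(doc.selectFirst("table") == null);  // true
    }
}
```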

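Jsoup supports CSS selector syntax thoroughly, which is good news for developers with front-end experience: there is no need to learn XPath syntax just for this.

For example, the XPath //*[@id="docs"]/div[1]/h4/a can be written as the CSS selector call document.select("#docs > div:nth-of-type(1) > h4 > a").attr("href");. Note that XPath positions are 1-based, whereas Jsoup's :eq(n) is a zero-based sibling index, so div[1] does not correspond to div:eq(1).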
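
XPath positions start at 1 while Jsoup's :eq(n) counts sibling indexes from 0, so it is worth checking which element a positional selector actually picks. A small sketch, with made-up markup and the hypothetical class name PositionSketch:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PositionSketch {
    static Document sample() {
        return Jsoup.parse("<div id=\"docs\">"
                + "<div><h4><a href=\"/first\">first</a></h4></div>"
                + "<div><h4><a href=\"/second\">second</a></h4></div>"
                + "</div>");
    }

    // Returns the href of the first link matched by the selector.
    static String href(String selector) {
        return sample().select(selector).attr("href");
    }

    public static void main(String[] args) {
        // Same element as XPath //*[@id="docs"]/div[1]/h4/a
        System.out.println(href("#docs > div:nth-of-type(1) > h4 > a"));  // /first
        // :eq(0) is the first child, :eq(1) is the second
        System.out.println(href("#docs > div:eq(0) > h4 > a"));           // /first
        System.out.println(href("#docs > div:eq(1) > h4 > a"));           // /second
    }
}
```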

See the example below:

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://baidu.com/");

Elements links = doc.select("a[href]"); // a with href
Elements pngs = doc.select("img[src$=.png]"); // img with src ending .png

Element masthead = doc.select("div.masthead").first(); // div with class=masthead

Elements resultLinks = doc.select("h3.r > a"); // direct a after h3

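Some common selectors:

Tag (e.g. div): tag
id (e.g. #logo): #id
Class (e.g. .head): .class
Attribute present (e.g. [href]): [attribute]
Attribute value: [attr=value]
Attribute name prefix: [^attrPrefix]
Attribute value matching (starts with, ends with, contains, regex): [attr^=value], [attr$=value], [attr*=value], [attr~=regex]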
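
The attribute selectors can be tried out with a sketch like the one below. The markup, the class name AttributeSelectorSketch, and the count helper are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class AttributeSelectorSketch {
    static final Document DOC = Jsoup.parse(
            "<a href=\"https://example.com/a.pdf\" data-id=\"7\">pdf</a>"
          + "<a href=\"/local.html\">local</a>"
          + "<img src=\"logo.png\">"
          + "<img src=\"photo.jpg\" alt=\"Photo\">");

    // Number of elements matched by the selector.
    static int count(String selector) {
        return DOC.select(selector).size();
    }

    public static void main(String[] args) {
        System.out.println(count("[href]"));                // 2: both links
        System.out.println(count("[alt=Photo]"));           // 1: exact attribute value
        System.out.println(count("[^data-]"));              // 1: attribute name starts with data-
        System.out.println(count("a[href^=https]"));        // 1: value starts with https
        System.out.println(count("img[src$=.png]"));        // 1: value ends with .png
        System.out.println(count("a[href*=local]"));        // 1: value contains local
        System.out.println(count("img[src~=(png|jpg)$]"));  // 2: value matches the regex
    }
}
```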

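Selector combinations are also supported:

element#id: (div#logo: the div element whose id is logo)
element.class: (div.content: div elements whose class includes content)
element[attr]: (a[href]: a elements that have an href attribute)
ancestor child: (div p: all p elements that are descendants of a div)
parent > child: (p > span: span elements that are direct children of a p)
siblingA + siblingB: (div.head + div: the div immediately following a div.head sibling)
siblingA ~ siblingX: (h1 ~ p: all p siblings that come after an h1)
el, el, el: (div.content, div.footer: both div.content and div.footer elements)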
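
A sketch of the combinators above on a small made-up document. The class name CombinatorSketch and the count helper are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class CombinatorSketch {
    static final Document DOC = Jsoup.parse(
            "<div id=\"logo\" class=\"content\">"
          +   "<p>intro <span>s1</span></p>"
          +   "<div class=\"inner\"><p>nested</p></div>"
          + "</div>"
          + "<div class=\"head\">head</div><div class=\"next\">next</div>"
          + "<h1>Title</h1><p>after1</p><p>after2</p>"
          + "<div class=\"footer\">footer</div>");

    // Number of elements matched by the selector.
    static int count(String selector) {
        return DOC.select(selector).size();
    }

    public static void main(String[] args) {
        System.out.println(count("div#logo"));                 // 1
        System.out.println(count("div p"));                    // 2: any depth below a div
        System.out.println(count("div#logo > p"));             // 1: direct children only
        System.out.println(count("p > span"));                 // 1
        System.out.println(count("div.head + div"));           // 1: the div right after div.head
        System.out.println(count("h1 ~ p"));                   // 2: p siblings after the h1
        System.out.println(count("div.content, div.footer"));  // 2: union of both selectors
    }
}
```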

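Pseudo-selectors are supported as well (indexes are zero-based):

:lt(n): elements whose sibling index is less than n (div#logo > li:lt(2): the li children at index 0 and 1 of the div with id logo, that is, the first 2)
:gt(n): elements whose sibling index is greater than n
:eq(n): elements whose sibling index equals n
:has(selector): elements that contain at least one element matching the selector
:not(selector): elements that do not match the selector
:contains(text): elements that contain the given text (case-insensitive)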
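
The index-based selectors are easy to get off by one, so here is a small sketch on a four-item list. The markup, the class name PseudoSelectorSketch, and the pick helper are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class PseudoSelectorSketch {
    static final Document DOC = Jsoup.parse(
            "<div id=\"logo\"><ul>"
          + "<li>zero</li>"
          + "<li>one <a href=\"#\">link</a></li>"
          + "<li>two</li>"
          + "<li class=\"off\">three</li>"
          + "</ul></div>");

    static Elements pick(String selector) {
        return DOC.select(selector);
    }

    public static void main(String[] args) {
        System.out.println(pick("#logo li:lt(2)").size());     // 2: indexes 0 and 1
        System.out.println(pick("#logo li:gt(1)").size());     // 2: indexes 2 and 3
        System.out.println(pick("#logo li:eq(0)").text());     // zero
        System.out.println(pick("li:has(a)").size());          // 1: the item containing a link
        System.out.println(pick("li:not(.off)").size());       // 3: everything except li.off
        System.out.println(pick("li:contains(two)").size());   // 1
    }
}
```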

For details, see the official selector syntax documentation: https://jsoup.org/cookbook/extracting-data/selector-syntax

Modifying the DOM tree with Jsoup

Jsoup also supports modifying the DOM tree, again much like jQuery.

// Set an attribute on every matching element
doc.select("div.comments a").attr("rel", "nofollow");

// Set an attribute and add a class
doc.select("div.masthead").attr("title", "jsoup").addClass("round-box");
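
To see the effect of those two statements, the sketch below applies them to a small made-up fragment and reads the result back. The class name ModifyAttrSketch is an assumption for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ModifyAttrSketch {
    // Applies the two modifications shown above to a sample fragment.
    static Document modify() {
        Document doc = Jsoup.parseBodyFragment(
                "<div class=\"masthead\">m</div>"
              + "<div class=\"comments\"><a href=\"/u1\">u1</a><a href=\"/u2\">u2</a></div>");

        // attr(key, value) on Elements updates every matched element
        doc.select("div.comments a").attr("rel", "nofollow");
        // the setters return the same Elements, so calls can be chained
        doc.select("div.masthead").attr("title", "jsoup").addClass("round-box");
        return doc;
    }

    public static void main(String[] args) {
        Document doc = modify();
        Element masthead = doc.selectFirst("div.masthead");
        System.out.println(doc.select("a[rel=nofollow]").size());  // 2
        System.out.println(masthead.attr("title"));                // jsoup
        System.out.println(masthead.className());                  // masthead round-box
    }
}
```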

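The following APIs manipulate the DOM tree directly:

text(String value): set the element's text content
html(String value): replace the element's inner HTML
append(String html): add the HTML inside the element, after its existing children
prepend(String html): add the HTML inside the element, before its existing children
appendText(String text), prependText(String text): add a text node as the last or first child
appendElement(String tagName), prependElement(String tagName): create a new element and add it as the last or first child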
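
A short sketch of how these calls change an element's children. The markup and the class name ModifyTreeSketch are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ModifyTreeSketch {
    // Rebuilds the content of a div step by step and returns it.
    static Element build() {
        Document doc = Jsoup.parseBodyFragment("<div id=\"box\"><p>old</p></div>");
        Element box = doc.getElementById("box");

        box.html("<p>middle</p>");               // replaces the inner HTML
        box.prepend("<p>first</p>");             // becomes the first child
        box.append("<p>last</p>");               // becomes the last child
        box.appendElement("span").text("tail");  // new element, then set its text
        return box;
    }

    public static void main(String[] args) {
        Element box = build();
        System.out.println(box.children().size());  // 4
        System.out.println(box.child(0).text());    // first
        System.out.println(box.child(1).text());    // middle
        System.out.println(box.child(2).text());    // last
        System.out.println(box.child(3).tagName()); // span
    }
}
```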

References

Jsoup website: https://jsoup.org/
Jsoup cookbook: https://jsoup.org/cookbook/
Jsoup jar download: https://jsoup.org/download
Jsoup CSS selector reference: https://jsoup.org/cookbook/extracting-data/selector-syntax
