XML Parsing with Jsoup

This article is a repost. 2019-08-03 22:37. Tags: xml, Jsoup, HTML

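Working with XML files

Parsing (reading): load the data in a document into memory.
Writing: save in-memory data to an XML document, i.e. persistent storage.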
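
To make the "writing" direction concrete, here is a minimal sketch that builds a tiny tree in memory and serializes it back to XML text. It uses the JDK's built-in JAXP classes rather than Jsoup, and the class name `XmlWriteDemo`, the method `buildStudentXml`, and the sample values are illustrative assumptions, not something from the original article.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class XmlWriteDemo {
    // Builds a small DOM tree in memory and serializes it to an XML string.
    static String buildStudentXml(String number, String name) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element students = doc.createElement("students");
        doc.appendChild(students);

        Element student = doc.createElement("student");
        student.setAttribute("number", number);
        students.appendChild(student);

        Element nameEl = doc.createElement("name");
        nameEl.setTextContent(name);
        student.appendChild(nameEl);

        // "Writing": turn the in-memory tree back into XML text.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildStudentXml("heima_0003", "lucy"));
    }
}
```

To persist to disk instead of a string, the same `transform` call can be given a `StreamResult` that wraps a `File`.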

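Ways to parse XML

DOM: loads the whole markup document into memory at once and builds a DOM tree there
- Advantage:
  
  Convenient to work with; supports every CRUD (create, read, update, delete) operation on the document
- Disadvantage:
  
  Uses a lot of memory, because the whole tree is kept in memory
SAX: reads the document sequentially and is event-driven
- Advantage:
  
  Uses little memory, because no tree is kept
- Disadvantage:
  
  Read-only; the document cannot be created, updated, or deleted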
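
The difference between the two approaches is easier to see side by side. The sketch below reads the same `<name>` values once with DOM and once with SAX, using only the JDK's JAXP classes. The class name `DomVsSaxDemo`, its method names, and the inline sample XML are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class DomVsSaxDemo {
    static final String XML =
        "<students>"
      + "<student number=\"heima_0001\"><name>tom</name></student>"
      + "<student number=\"heima_0002\"><name>jack</name></student>"
      + "</students>";

    // DOM: the whole tree is built first, then queried (and could be modified).
    static List<String> namesViaDom(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = doc.getElementsByTagName("name");
        List<String> names = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            names.add(nodes.item(i).getTextContent());
        }
        return names;
    }

    // SAX: callbacks fire as the parser streams through the input; no tree is kept.
    static List<String> namesViaSax(String xml) throws Exception {
        List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside <name>

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if ("name".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("name".equals(qName)) {
                    names.add(text.toString());
                    text = null;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(namesViaDom(XML));
        System.out.println(namesViaSax(XML));
    }
}
```

Both methods return the same names. The DOM version could go on to edit the tree, while the SAX version only ever sees one event at a time.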

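Common parsers:

JAXP: the parser API provided by Sun; supports both the DOM and SAX approaches
DOM4J: an excellent, widely used parser
Jsoup: a Java HTML parser that can parse a URL or HTML text directly. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.
PULL: the parser built into the Android system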

Jsoup

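Quick start

Scrape and parse HTML from a URL, file, or string

Find and extract data, using DOM traversal or CSS selectors

Manipulate HTML elements, attributes, and text

Clean user-submitted content against a safe whitelist, to prevent XSS attacks

Output tidy HTML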
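
As a rough illustration of the whitelist-cleaning feature, the sketch below strips a script tag and an inline event handler from untrusted markup. It assumes jsoup 1.14.1 or later, where the class is named `Safelist`; older releases such as 1.11.2 call the same concept `Whitelist`. The class name `JsoupCleanDemo` and the sample input are made up for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class JsoupCleanDemo {
    // Keeps only the tags and attributes that Safelist.basic() allows.
    // Assumption: jsoup 1.14.1+ (use Whitelist.basic() on older versions).
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        String dirty = "<p>Hello <script>alert(1)</script>"
                     + "<a href=\"http://example.com/\" onclick=\"steal()\">link</a></p>";
        System.out.println(sanitize(dirty));
    }
}
```

The cleaned output keeps the paragraph and the link, but the `<script>` element and the `onclick` attribute are dropped.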

Reference

Steps:

Add the jar to the project
Obtain the Document object
Obtain the Element object for the tag you want
Read the data
Code:

XML file (student.xml):

<?xml version="1.0" encoding="UTF-8" ?>
<students>
    <student number="heima_0001">
        <name id="cat">tom</name>
        <age>18</age>
        <sex>male</sex>
    </student>
    <student number="heima_0002">
        <name>jack</name>
        <age>12</age>
        <sex>male</sex>
    </student>
</students>

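Test code:

import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class JsoupTest {
    public static void main(String[] args) throws IOException {
        // Resolve the path of student.xml on the classpath
        String path = JsoupTest.class.getClassLoader().getResource("student.xml").getPath();
        // Parse the file into a Document
        Document document = Jsoup.parse(new File(path), "utf-8");
        // Get all <name> elements
        Elements elements = document.getElementsByTag("name");
        System.out.println(elements.size());
        // Print the text of each element
        for (int i = 0; i < elements.size(); i++) {
            System.out.println(elements.get(i).text());
        }
    }
}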

Using the objects

Jsoup: utility class that parses HTML or XML documents and returns a Document
1. The parse methods
  1. Parse an XML or HTML file
```
public static Document parse(File in, String charsetName) throws IOException
```
    Parse the contents of a file as HTML. The location of the file is used as the base URI to qualify relative URLs.
  2. Parse an XML or HTML string
```
public static Document parse(String html)
```
    Parse HTML into a Document. As no base URI is specified, absolute URL detection relies on the HTML including a <base href> tag.
  3. Fetch an HTML or XML document from a network URL
```
public static Document parse(URL url, int timeoutMillis) throws IOException
```
    Fetch a URL, and parse it as HTML. Provided for compatibility; in most cases use connect(String).
    
    The encoding character set is determined by the content-type header or http-equiv meta tag, or falls back to UTF-8.
    Further overloads exist as well.
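
The parse overloads above all treat the input as HTML. When the input is really XML, jsoup also accepts an explicit parser. The sketch below parses a string with `Parser.xmlParser()`, so no `<html>` or `<body>` wrapper is added around the root element. The class name `JsoupParseStringDemo` and the inline XML are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class JsoupParseStringDemo {
    static final String XML =
        "<students><student number=\"heima_0001\"><name id=\"cat\">tom</name></student></students>";

    // Parses the string with jsoup's XML parser instead of the default HTML parser.
    static Document parseXml(String xml) {
        return Jsoup.parse(xml, "", Parser.xmlParser());
    }

    public static void main(String[] args) {
        Document doc = parseXml(XML);
        // The root element is <students>, not <html>.
        System.out.println(doc.child(0).tagName());
        System.out.println(doc.getElementsByTag("name").text());
    }
}
```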
Document: the document object; represents the DOM tree in memory
1. Getting Element objects
  1. Get a collection of elements by tag name
```
public Elements getElementsByTag(String tagName)
```
  Finds elements, including and recursively under this element, with the specified tag name.
  2. Get a collection of elements by attribute name
```
public Elements getElementsByAttribute(String key)
```
  Find elements that have a named attribute set. Case insensitive.
  3. Get a collection of elements by attribute name and value
```
public Elements getElementsByAttributeValue(String key, String value)
```
  Find elements that have an attribute with the specific value. Case insensitive.
  4. Get the unique element by its ID attribute
```
public Element getElementById(String id)
```
  Find an element by ID, including or under this element.
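
The lookup methods listed above can be tried out on a small inline copy of the student data. This is only a sketch: the class name `DocumentLookupDemo` and the helper `parse()` are hypothetical, and the XML is parsed from a string with `Parser.xmlParser()` rather than from the `student.xml` file.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
import org.jsoup.select.Elements;

public class DocumentLookupDemo {
    static final String XML =
        "<students>"
      + "<student number=\"heima_0001\"><name id=\"cat\">tom</name><age>18</age></student>"
      + "<student number=\"heima_0002\"><name>jack</name><age>12</age></student>"
      + "</students>";

    static Document parse() {
        return Jsoup.parse(XML, "", Parser.xmlParser());
    }

    public static void main(String[] args) {
        Document doc = parse();

        // By tag name: both <name> elements
        Elements names = doc.getElementsByTag("name");
        // By attribute name: only the element that carries an id attribute
        Elements withId = doc.getElementsByAttribute("id");
        // By attribute name and value: the first <student>
        Elements first = doc.getElementsByAttributeValue("number", "heima_0001");
        // By id: a single Element, or null when no element has that id
        Element cat = doc.getElementById("cat");

        System.out.println(names.size());
        System.out.println(withId.size());
        System.out.println(first.size());
        System.out.println(cat.text());
    }
}
```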
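Elements: a collection of Element objects; can be used like an ArrayList<Element>
Element: the element object
1. Get child elements
2. Get attribute values
  1. String attr(String key): get an attribute value by attribute name
3. Get text content
  1. String text(): get the text content only
  2. String html(): get everything inside the tag, including child tags and their text
Node: the node object
- The superclass of Document and Element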
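
A short sketch of the Element methods listed above, run against one `<student>` element. The class name `ElementReadDemo` and the helper `firstStudent()` are illustrative assumptions. The exact whitespace in the `text()` and `html()` results can vary between jsoup versions, so no expected output is shown.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class ElementReadDemo {
    static final String XML =
        "<students><student number=\"heima_0001\">"
      + "<name id=\"cat\">tom</name><age>18</age>"
      + "</student></students>";

    static Element firstStudent() {
        Document doc = Jsoup.parse(XML, "", Parser.xmlParser());
        return doc.getElementsByTag("student").first();
    }

    public static void main(String[] args) {
        Element student = firstStudent();
        // Attribute value by attribute name
        System.out.println(student.attr("number"));
        // Direct child elements: <name> and <age>
        System.out.println(student.children().size());
        // Text of all descendants, with the tags removed
        System.out.println(student.text());
        // Inner markup, with the child tags kept
        System.out.println(student.html());
    }
}
```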

Quick queries

selector: CSS-style selectors

Method used: Elements select(String cssQuery)

Example:

import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class JsoupTest {
    public static void main(String[] args) throws IOException {
        // Resolve the path of student.xml on the classpath
        String path = JsoupTest.class.getClassLoader().getResource("student.xml").getPath();
        // Parse the file into a Document
        Document document = Jsoup.parse(new File(path), "utf-8");
        // Select by tag name
        Elements elements = document.select("name");
        System.out.println(elements.get(0).text());
        // Select by id
        Elements id = document.select("#cat");
        System.out.println(id.text());
        System.out.println("******************");
        // Select the student whose number attribute equals heima_0001
        Elements select = document.select("student[number=\"heima_0001\"]");
        System.out.println(select);
        System.out.println("******************");
        // Select the age child of the student whose number equals heima_0001
        Elements select1 = document.select("student[number=\"heima_0001\"]>age");
        System.out.println(select1);
    }
}

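XPath:

Explanation:

XPath is a language for finding information in an XML document.

XPath is a major element in XSLT.

XQuery and XPointer are both built on XPath expressions.

Using XPath with Jsoup requires importing an additional jar.

Look up the w3cschool reference manual and write the queries in XPath syntax.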

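import java.io.File;
import java.io.IOException;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Assumption: these packages match the JsoupXpath 0.3.x jar; adjust them to the jar you imported
import cn.wanghaomiao.xpath.exception.XpathSyntaxErrorException;
import cn.wanghaomiao.xpath.model.JXDocument;
import cn.wanghaomiao.xpath.model.JXNode;

public class JsoupXpath {
    public static void main(String[] args) throws IOException, XpathSyntaxErrorException {
        // Resolve the path of student.xml on the classpath
        String path = JsoupXpath.class.getClassLoader().getResource("student.xml").getPath();
        // Parse the file into a Document
        Document document = Jsoup.parse(new File(path), "utf-8");
        // Create a JXDocument from the Document
        JXDocument jxDocument = new JXDocument(document);
        // Query with XPath syntax: all student elements
        List<JXNode> jxNodes = jxDocument.selN("//student");
        System.out.println(jxNodes);

        System.out.println("__________________________");
        // The student whose number attribute equals heima_0001
        List<JXNode> jxNode = jxDocument.selN("//student[@number='heima_0001']");
        System.out.println(jxNode);
    }
}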
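
For comparison, the same two XPath expressions can be evaluated with the `javax.xml.xpath` package that ships with the JDK, with no extra jar. This swaps in the JDK's own XPath engine and is not the Jsoup-based approach shown above. The class name `JdkXPathDemo`, its helper methods, and the inline XML are assumptions for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class JdkXPathDemo {
    static final String XML =
        "<students>"
      + "<student number=\"heima_0001\"><name id=\"cat\">tom</name><age>18</age></student>"
      + "<student number=\"heima_0002\"><name>jack</name><age>12</age></student>"
      + "</students>";

    // Returns how many nodes the XPath expression matches.
    static int count(String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
        return nodes.getLength();
    }

    // Returns the string value of the first node the expression matches.
    static String text(String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, new InputSource(new StringReader(XML)));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(count("//student"));
        System.out.println(count("//student[@number='heima_0001']"));
        System.out.println(text("//student[@number='heima_0001']/name"));
    }
}
```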
