Parsing HTML pages with jsoup on Android

Reposted article. Original dated 2018-12-27 15:19, from 232.Android之Demo

Using jsoup on Android:

Add the dependency

    implementation 'org.jsoup:jsoup:1.10.1'

Here is the code:

package com.loaderman.jsoupdemo;

import android.os.Bundle;
import android.support.v7.app.AppCompatActivity;
import android.view.View;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class Main2Activity extends AppCompatActivity {

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main2);
        findViewById(R.id.btn).setOnClickListener(new View.OnClickListener() {
            @Override
            public void onClick(View view) {
                new Thread(new Runnable() {
                    @Override
                    public void run() {
                        try {
                            // Fetch and parse the page.
                            Document doc = Jsoup.connect("http://192.168.0.195:8088/news.html").get();
                            // Select ul elements whose class is w_newslistpage_list.
                            Elements links = doc.select("ul[class=w_newslistpage_list]");
                            for (Element link : links) {
                                Elements li = link.select("li"); // find the li elements
                                for (Element element : li) { // iterate over each li
                                    // Find a elements that carry a title attribute.
                                    Elements select = element.select("a[title]");
                                    if (!select.isEmpty()) {
                                        String linkHref = select.get(0).attr("href"); // href value
                                        String linkText = select.get(0).text(); // link text
                                        System.out.println("Crawl result 1 --> " + linkHref + linkText);
                                    }
                                    // Find the span with class date. Query the li itself, not the
                                    // enclosing ul, so that each date belongs to the link above.
                                    Elements select1 = element.select("span[class=date]");
                                    if (!select1.isEmpty()) {
                                        String date = select1.get(0).text();
                                        System.out.println("Crawl result 2 --> " + date);
                                    }
                                }
                            }
                        } catch (IOException e) {
                            e.printStackTrace();
                        }

                    }
                }).start();
            }
        });
    }
}
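
The extraction logic in the activity can also be tried without a device or a server. The following is a minimal sketch in plain Java, assuming the news page has the shape the selectors expect; the markup in SAMPLE_HTML, the class name NewsListSketch, and the helper extract are all hypothetical and only serve the illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class NewsListSketch {

    // Hypothetical markup, shaped like what the selectors in the demo expect.
    static final String SAMPLE_HTML =
            "<ul class=\"w_newslistpage_list\">"
          + "<li><a href=\"/news/1.html\" title=\"First\">First story</a>"
          + "<span class=\"date\">2018-12-26</span></li>"
          + "<li><a href=\"/news/2.html\" title=\"Second\">Second story</a>"
          + "<span class=\"date\">2018-12-27</span></li>"
          + "</ul>";

    // Returns one "href | text | date" line per list item.
    static List<String> extract(String html) {
        Document doc = Jsoup.parse(html);
        List<String> rows = new ArrayList<>();
        for (Element li : doc.select("ul.w_newslistpage_list > li")) {
            Element a = li.select("a[title]").first();     // null when nothing matches
            Element date = li.select("span.date").first(); // null when nothing matches
            if (a == null || date == null) {
                continue; // skip items that lack either part
            }
            rows.add(a.attr("href") + " | " + a.text() + " | " + date.text());
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String row : extract(SAMPLE_HTML)) {
            System.out.println(row);
        }
    }
}
```

Reading both the link and the date from the same li keeps each date paired with its own link.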

Summary:

Parsing and traversing an HTML document

How to parse an HTML document:

String html = "<html><head><title>First parse</title></head>"
  + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);

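The parser does its best to create a clean parse result from the HTML you provide, whether or not the HTML is well formed. For example, it handles:

Unclosed tags (e.g. <p>Lorem <p>Ipsum parses to <p>Lorem</p> <p>Ipsum</p>)
Implicit tags (e.g. a bare <td>Table data</td> is wrapped into <table><tr><td>...)
Reliable document structure (the html tag contains a head and a body, and only appropriate elements appear in the head)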
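
The behaviours in the list above can be checked with a few small inputs. This is a rough sketch rather than a specification of the parser: the class name LenientParseSketch and the inputs are made up for illustration, and the implicit-tag case places the td directly inside a table, which is an assumption chosen for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LenientParseSketch {

    // Two unclosed <p> tags end up as two separate paragraph elements.
    static int paragraphCount() {
        Document doc = Jsoup.parse("<p>Lorem <p>Ipsum");
        return doc.select("p").size();
    }

    // A <td> placed directly inside <table> gains the implied <tbody> and <tr>.
    static int wrappedCellCount() {
        Document doc = Jsoup.parse("<table><td>Table data</td></table>");
        return doc.select("table > tbody > tr > td").size();
    }

    // Bare text still yields a document that has both a head and a body.
    static String bodyTextOfBareInput() {
        Document doc = Jsoup.parse("Hello");
        return doc.head() != null ? doc.body().text() : null;
    }

    public static void main(String[] args) {
        System.out.println("paragraphs: " + paragraphCount());
        System.out.println("wrapped cells: " + wrappedCellCount());
        System.out.println("body text: " + bodyTextOfBareInput());
    }
}
```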

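The object model of a document

A document consists of Elements and TextNodes.
The inheritance chain is: Document extends Element extends Node. TextNode extends Node.
An Element contains a list of child nodes and has one parent Element. It also provides a filtered list that contains only its child Elements.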
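
To make the difference between child nodes and child Elements concrete, here is a small sketch. The sample markup and the class name NodeModelSketch are assumptions for illustration only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class NodeModelSketch {

    // Builds a div holding a text node, a <b> element, and another text node.
    static Element sampleDiv() {
        Document doc = Jsoup.parse("<div>Hello <b>world</b>!</div>");
        return doc.select("div").first();
    }

    public static void main(String[] args) {
        Element div = sampleDiv();

        // childNodes() lists every child node, text included.
        for (Node child : div.childNodes()) {
            String kind = (child instanceof TextNode) ? "TextNode" : "Element";
            System.out.println(kind + ": " + child.outerHtml());
        }

        // children() is the filtered list: child Elements only.
        System.out.println("child elements: " + div.children().size());

        // Every Element below the root has a parent Element.
        System.out.println("parent: " + div.parent().tagName());
    }
}
```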

Loading a document from a file

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

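Description

parse(File in, String charsetName, String baseUri) loads and parses an HTML file. If an error occurs while loading the file, it throws an IOException, which you should handle appropriately.

The baseUri parameter is used to resolve relative URLs in the file. If you do not need it, pass an empty string.

There is also parse(File in, String charsetName), which uses the file's location as the baseUri. It is useful when the file being parsed belongs to a site on the local filesystem and its relative links also point into that filesystem.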
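
The effect of baseUri is easiest to see when resolving a relative link. The sketch below uses the Jsoup.parse(String html, String baseUri) overload instead of the File overload so that it needs no file on disk; the markup and the class name BaseUriSketch are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BaseUriSketch {

    // Hypothetical markup with one relative link.
    static final String HTML = "<a href=\"/news/1.html\">News</a>";

    // Parses HTML against the given base URI and resolves the link's href.
    static String resolve(String baseUri) {
        Document doc = Jsoup.parse(HTML, baseUri);
        Element link = doc.select("a[href]").first();
        return link.absUrl("href"); // empty string when no absolute URL can be built
    }

    public static void main(String[] args) {
        System.out.println("with base:    " + resolve("http://example.com/"));
        System.out.println("without base: '" + resolve("") + "'");
    }
}
```

With an empty baseUri, attr("href") still returns the relative value as written; only the absolute form is unavailable.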

Using selector syntax to find elements

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

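Elements links = doc.select("a[href]"); // a elements that have an href attribute
Elements pngs = doc.select("img[src$=.png]");
  // img elements whose src ends with .png

Element masthead = doc.select("div.masthead").first();
  // first div with class masthead (null if none matches)

Elements resultLinks = doc.select("h3.r > a"); // a elements that are direct children of an h3 with class r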

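Description

jsoup elements support a CSS-like (or jQuery-like) selector syntax for finding matching elements, which makes very powerful and flexible queries possible.

The select method is available on Document, Element, and Elements objects. It is contextual, so you can filter by selecting from a specific element, or by chaining select calls.

select returns a collection of elements (an Elements object), which provides a range of methods to extract and process the results.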
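
As an illustration of how the result of select depends on the object it is called on, consider the sketch below. The markup and the class name SelectContextSketch are assumptions made for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectContextSketch {

    // Hypothetical markup: two containers, three anchors, one without href.
    static final String HTML =
            "<div class=\"masthead\"><a href=\"/home\">Home</a><a>No link</a></div>"
          + "<div class=\"footer\"><a href=\"/about\">About</a></div>";

    // select on the Document searches the whole page.
    static int linksInDocument() {
        Document doc = Jsoup.parse(HTML);
        return doc.select("a[href]").size();
    }

    // select on an Element searches only beneath that element.
    static int linksInMasthead() {
        Document doc = Jsoup.parse(HTML);
        Element masthead = doc.select("div.masthead").first();
        return masthead.select("a[href]").size();
    }

    // select on an Elements result chains: it searches under every match.
    static String firstFooterHref() {
        Document doc = Jsoup.parse(HTML);
        Elements footerLinks = doc.select("div.footer").select("a[href]");
        return footerLinks.first().attr("href");
    }

    public static void main(String[] args) {
        System.out.println("document: " + linksInDocument());
        System.out.println("masthead: " + linksInMasthead());
        System.out.println("footer href: " + firstFooterHref());
    }
}
```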

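Selector overview

tagname: find elements by tag name, e.g. a
ns|tag: find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
#id: find elements by ID, e.g. #logo
.class: find elements by class name, e.g. .masthead
[attribute]: find elements that have the attribute, e.g. [href]
[^attr]: find elements by attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
[attr=value]: find elements by attribute value, e.g. [width=500]
[attr^=value], [attr$=value], [attr*=value]: find elements whose attribute value starts with, ends with, or contains the value, e.g. [href*=/path/]
[attr~=regex]: find elements whose attribute value matches the regular expression, e.g. img[src~=(?i)\.(png|jpe?g)]
*: matches all elements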
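
The basic selectors listed above can be exercised against a small fragment. This is only a sketch: the markup, the class name BasicSelectorSketch, and the helper count are invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BasicSelectorSketch {

    // Hypothetical fragment: one div holding a link and three images.
    static final String HTML =
            "<div id=\"logo\" class=\"masthead\" data-role=\"banner\">"
          + "<a href=\"/path/index.html\">Home</a>"
          + "<img src=\"a.png\" width=\"500\">"
          + "<img src=\"b.JPG\">"
          + "<img src=\"c.gif\">"
          + "</div>";

    // Number of elements in the fragment that match the selector.
    static int count(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).size();
    }

    public static void main(String[] args) {
        String[] selectors = {
            "a", "#logo", ".masthead", "[href]", "[^data-]",
            "[width=500]", "[href*=/path/]", "img[src$=.png]",
            // In Java source the backslash is doubled; the selector sees \.
            "img[src~=(?i)\\.(png|jpe?g)]"
        };
        for (String selector : selectors) {
            System.out.println(selector + " -> " + count(selector));
        }
    }
}
```

The regular expression selector is the only one here that matches two images, because (?i) lets it accept both a.png and b.JPG.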

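Selector combinations

el#id: element with ID, e.g. div#logo
el.class: element with class, e.g. div.masthead
el[attr]: element with attribute, e.g. a[href]
Any combination, e.g. a[href].highlight
ancestor child: child elements that descend from ancestor, e.g. .body p finds p elements anywhere under an element with class "body"
parent > child: direct children of parent, e.g. div.content > p finds p elements, and body > * finds the direct children of the body tag
siblingA + siblingB: sibling B element immediately preceded by sibling A, e.g. div.head + div
siblingA ~ siblingX: sibling X elements preceded by sibling A, e.g. h1 ~ p
el, el, el: group multiple selectors and find the unique elements that match any of them, e.g. div.masthead, div.logo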
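
The combinators differ mainly in how far they reach, which a short sketch can show. The markup, the class name CombinatorSketch, and the helper texts are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class CombinatorSketch {

    // Hypothetical body content: a nested paragraph, adjacent divs, and siblings of an h1.
    static final String HTML =
            "<div class=\"content\"><p>one</p><section><p>nested</p></section></div>"
          + "<div class=\"head\">head</div><div>after head</div>"
          + "<h1>Title</h1><p>first</p><span>x</span><p>second</p>";

    // Text of every element that matches the selector, in document order.
    static List<String> texts(String selector) {
        Document doc = Jsoup.parse(HTML);
        List<String> result = new ArrayList<>();
        for (Element e : doc.select(selector)) {
            result.add(e.text());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(texts("div.content p"));   // any depth below div.content
        System.out.println(texts("div.content > p")); // direct children only
        System.out.println(texts("div.head + div"));  // the div right after div.head
        System.out.println(texts("h1 ~ p"));          // p siblings that follow the h1
        System.out.println(texts("div.head, h1"));    // either selector
    }
}
```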

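Pseudo selectors

:lt(n): elements whose sibling index (their position in the DOM tree relative to their parent) is less than n, e.g. td:lt(3) finds the first three cells of each row
:gt(n): elements whose sibling index is greater than n, e.g. div p:gt(2) finds p elements inside a div that come after their parent's first three child elements
:eq(n): elements whose sibling index equals n, e.g. form input:eq(1) finds input elements inside a form that are the second child element of their parent
:has(selector): elements that contain an element matching the selector, e.g. div:has(p) finds divs that contain p elements
:not(selector): elements that do not match the selector, e.g. div:not(.logo) finds all divs that do not have class="logo"
:contains(text): elements that contain the given text; the search is case insensitive, e.g. p:contains(jsoup)
:containsOwn(text): elements that directly contain the given text
:matches(regex): elements whose text matches the specified regular expression, e.g. div:matches((?i)login)
:matchesOwn(regex): elements whose own text matches the specified regular expression
Note: the pseudo selector indexes above are 0-based, so the first element has index 0, the second has index 1, and so on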
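
Because the index-based pseudo selectors select the matching elements themselves rather than their containers, a quick check is helpful. The following sketch assumes a hypothetical fragment; the class name PseudoSelectorSketch and the helpers texts and count are illustrative only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class PseudoSelectorSketch {

    // Hypothetical fragment: one table row with four cells, plus two divs.
    static final String HTML =
            "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>"
          + "<div class=\"logo\"><p>jsoup rocks</p></div>"
          + "<div><span>Login here</span></div>";

    // Text of every element that matches the selector, in document order.
    static List<String> texts(String selector) {
        Document doc = Jsoup.parse(HTML);
        List<String> result = new ArrayList<>();
        for (Element e : doc.select(selector)) {
            result.add(e.text());
        }
        return result;
    }

    // Number of elements that match the selector.
    static int count(String selector) {
        return texts(selector).size();
    }

    public static void main(String[] args) {
        System.out.println(texts("td:lt(3)")); // cells with sibling index 0, 1, 2
        System.out.println(texts("td:gt(1)")); // cells with sibling index 2, 3
        System.out.println(texts("td:eq(0)")); // the cell with sibling index 0
        System.out.println(count("div:has(p)"));
        System.out.println(count("div:not(.logo)"));
        System.out.println(count("div:contains(JSOUP)"));       // text in a descendant counts
        System.out.println(count("div:containsOwn(jsoup)"));    // own text only
        System.out.println(count("div:matches((?i)login)"));
        System.out.println(count("div:matchesOwn((?i)login)")); // own text only
    }
}
```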

The detailed selector API reference follows:

CSS-like element selector, that finds elements matching a query.

Selector syntax

A selector is a chain of simple selectors, separated by combinators. Selectors are case insensitive (including against elements, attributes, and attribute values).

The universal selector (*) is implicit when no element selector is supplied (i.e. *.header and .header are equivalent).

Pattern	Matches	Example
`*`	any element	`*`
`tag`	elements with the given tag name	`div`
`*\|E`	elements of type E in any namespace ns	`*\|name` finds `<fb:name>` elements
`ns\|E`	elements of type E in the namespace ns	`fb\|name` finds `<fb:name>` elements
`#id`	elements with attribute ID of "id"	`div#wrap`, `#logo`
`.class`	elements with a class name of "class"	`div.left`, `.result`
`[attr]`	elements with an attribute named "attr" (with any value)	`a[href]`, `[title]`
`[^attrPrefix]`	elements with an attribute name starting with "attrPrefix". Use to find elements with HTML5 datasets	`[^data-]`, `div[^data-]`
`[attr=val]`	elements with an attribute named "attr", and value equal to "val"	`img[width=500]`, `a[rel=nofollow]`
`[attr="val"]`	elements with an attribute named "attr", and value equal to "val"	`span[hello="Cleveland"][goodbye="Columbus"]`, `a[rel="nofollow"]`
`[attr^=valPrefix]`	elements with an attribute named "attr", and value starting with "valPrefix"	`a[href^=http:]`
`[attr$=valSuffix]`	elements with an attribute named "attr", and value ending with "valSuffix"	`img[src$=.png]`
`[attr*=valContaining]`	elements with an attribute named "attr", and value containing "valContaining"	`a[href*=/search/]`
`[attr~=regex]`	elements with an attribute named "attr", and value matching the regular expression	`img[src~=(?i)\\.(png\|jpe?g)]`
	The above may be combined in any order	`div.header[title]`
	Combinators
`E F`	an F element descended from an E element	`div a`, `.logo h1`
`E > F`	an F direct child of E	`ol > li`
`E + F`	an F element immediately preceded by sibling E	`li + li`, `div.head + div`
`E ~ F`	an F element preceded by sibling E	`h1 ~ p`
`E, F, G`	all matching elements E, F, or G	`a[href], div, h3`
	Pseudo selectors
`:lt(n)`	elements whose sibling index is less than n	`td:lt(3)` finds the first 3 cells of each row
`:gt(n)`	elements whose sibling index is greater than n	`td:gt(1)` finds cells after skipping the first two
`:eq(n)`	elements whose sibling index is equal to n	`td:eq(0)` finds the first cell of each row
`:has(selector)`	elements that contains at least one element matching the selector	`div:has(p)` finds divs that contain p elements
`:not(selector)`	elements that do not match the selector. See also `Elements.not(String)`	`div:not(.logo)` finds all divs that do not have the "logo" class. `div:not(:has(div))` finds divs that do not contain divs.
`:contains(text)`	elements that contains the specified text. The search is case insensitive. The text may appear in the found element, or any of its descendants.	`p:contains(jsoup)` finds p elements containing the text "jsoup".
`:matches(regex)`	elements whose text matches the specified regular expression. The text may appear in the found element, or any of its descendants.	`td:matches(\\d+)` finds table cells containing digits. `div:matches((?i)login)` finds divs containing the text, case insensitively.
`:containsOwn(text)`	elements that directly contain the specified text. The search is case insensitive. The text must appear in the found element, not any of its descendants.	`p:containsOwn(jsoup)` finds p elements with own text "jsoup".
`:matchesOwn(regex)`	elements whose own text matches the specified regular expression. The text must appear in the found element, not any of its descendants.	`td:matchesOwn(\\d+)` finds table cells directly containing digits. `div:matchesOwn((?i)login)` finds divs containing the text, case insensitively.
`:containsData(data)`	elements that contain the specified data. The contents of `script` and `style` elements, and `comment` nodes (etc) are considered data nodes, not text nodes. The search is case insensitive. The data may appear in the found element, or any of its descendants.	`script:containsData(jsoup)` finds script elements containing the data "jsoup".
	The above may be combined in any order and with other selectors	`.light:contains(name):eq(0)`
`:matchText`	treats text nodes as elements, and so allows you to match against and select text nodes. Note that using this selector will modify the DOM, so you may want to `clone` your document before using.	`p:matchText:firstChild` with input `<p>One<br />Two</p>` will return one `PseudoTextElement` with text "`One`".
Structural pseudo selectors
`:root`	The element that is the root of the document. In HTML, this is the `html` element	`:root`
`:nth-child(an+b)`	elements that have `an+b-1` siblings before it in the document tree, for any positive integer or zero value of `n`, and has a parent element. For values of `a` and `b` greater than zero, this effectively divides the element's children into groups of a elements (the last group taking the remainder), and selecting the bth element of each group. For example, this allows the selectors to address every other row in a table, and could be used to alternate the color of paragraph text in a cycle of four. The `a` and `b` values must be integers (positive, negative, or zero). The index of the first child of an element is 1. In addition to this, `:nth-child()` can take `odd` and `even` as arguments instead. `odd` has the same meaning as `2n+1`, and `even` has the same meaning as `2n`.	`tr:nth-child(2n+1)` finds every odd row of a table. `:nth-child(10n-1)` the 9th, 19th, 29th, etc, element. `li:nth-child(5)` the 5th li
`:nth-last-child(an+b)`	elements that have `an+b-1` siblings after it in the document tree. Otherwise like `:nth-child()`	`tr:nth-last-child(-n+2)` the last two rows of a table
`:nth-of-type(an+b)`	pseudo-class notation represents an element that has `an+b-1` siblings with the same expanded element name before it in the document tree, for any zero or positive integer value of n, and has a parent element	`img:nth-of-type(2n+1)`
`:nth-last-of-type(an+b)`	pseudo-class notation represents an element that has `an+b-1` siblings with the same expanded element name after it in the document tree, for any zero or positive integer value of n, and has a parent element	`img:nth-last-of-type(2n+1)`
`:first-child`	elements that are the first child of some other element.	`div > p:first-child`
`:last-child`	elements that are the last child of some other element.	`ol > li:last-child`
`:first-of-type`	elements that are the first sibling of its type in the list of children of its parent element	`dl dt:first-of-type`
`:last-of-type`	elements that are the last sibling of its type in the list of children of its parent element	`tr > td:last-of-type`
`:only-child`	elements that have a parent element and whose parent element has no other element children
`:only-of-type`	an element that has a parent element and whose parent element has no other element children with the same expanded element name
`:empty`	elements that have no children at all
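
A few of the structural pseudo selectors from the table can be tried on a simple list. This sketch assumes a hypothetical five-item list followed by an empty paragraph; the class name StructuralSelectorSketch and the helper texts are made up for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class StructuralSelectorSketch {

    // Hypothetical fragment: five list items and one empty paragraph.
    static final String HTML =
            "<ul><li>1</li><li>2</li><li>3</li><li>4</li><li>5</li></ul><p></p>";

    // Text of every element that matches the selector, in document order.
    static List<String> texts(String selector) {
        Document doc = Jsoup.parse(HTML);
        List<String> result = new ArrayList<>();
        for (Element e : doc.select(selector)) {
            result.add(e.text());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(texts("li:nth-child(2n+1)"));  // odd positions, counted from 1
        System.out.println(texts("li:nth-child(even)"));  // even positions
        System.out.println(texts("li:nth-last-child(2)")); // second item from the end
        System.out.println(texts("li:first-child"));
        System.out.println(texts("li:last-child"));
        System.out.println(texts("p:empty").size());       // paragraphs with no children
    }
}
```

Unlike :lt, :gt, and :eq, whose indexes start at 0, the nth-child family counts from 1.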
