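Parsing and extracting page data in Python crawlers: xpath / bs4 / jsonpath / regular expressions (1)

Reposted article, originally published 2018-05-21 21:06. Category: Python basics, crawlers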

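I. Data types and how to parse them

Generally speaking, what we want to scrape is the content of some website or application, so that we can extract what is valuable from it. That content usually falls into two categories: unstructured data and structured data.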

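Unstructured data: the data comes first, the structure afterwards.
Structured data: the structure comes first, the data afterwards.
Different types of data call for different ways of handling them.

  1. Handling unstructured data

    Plain text, phone numbers, email addresses

    Use: regular expressions

    HTML files

    Use: regular expressions / XPath / CSS selectors / bs4

  2. Handling structured data

    JSON files

    Use: JSONPath / convert to Python types and work with those (the json module)

    XML files

    Use: convert to Python types (xmltodict) / XPath / CSS selectors / regular expressions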

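What follows is a study summary of the commonly used parsing and extraction approaches: regular expressions, bs4, jsonpath, and xpath. For JSON data, prefer jsonpath. For HTML pages I personally prefer xpath; when something is hard to extract with xpath, bs4 can help; only when neither works do I consider regular expressions. This is just a personal suggestion, of course, and regex experts are free to use regular expressions throughout.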
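For the "convert to Python types" route for JSON mentioned above, a minimal sketch with the standard json module might look like this. The response body and its field names are invented for illustration.

```python
import json

# Hypothetical response body; the keys and values are made up for this sketch.
raw = ('{"city": "Beijing", "jobs": ['
       '{"title": "crawler engineer", "salary": 15000}, '
       '{"title": "data analyst", "salary": 12000}]}')

data = json.loads(raw)  # JSON object -> dict, JSON array -> list

# Once converted, ordinary dict/list operations do the extraction.
titles = [job["title"] for job in data["jobs"]]
print(titles)
```

After json.loads, the extraction is plain Python indexing, which is often enough when the JSON structure is shallow.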

======================= divider ==========================

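II. Regular expressions

A regular expression (regex) is typically used to search for, or replace, text that fits a given pattern (rule).

A regular expression is a kind of logical formula for operating on strings: a "rule string" built from predefined special characters and combinations of them, which expresses a filtering logic to apply to strings.

Given a regular expression and another string, we can achieve the following:

Check whether the given string fits the filtering logic of the regular expression ("matching");
Use the regular expression to pull the specific parts we want out of the text ("filtering").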
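The two purposes above can be sketched as follows. The sample text, phone number, and email address are invented, and the patterns are deliberately simple rather than complete validators.

```python
import re

# Hypothetical input; the number and address are made up.
text = "Contact: 13812345678, email: demo@example.com"

# "Matching": does the whole string look like an 11-digit number starting with 1?
is_phone = re.fullmatch(r"1\d{10}", "13812345678") is not None

# "Filtering": pull the parts we care about out of a longer text.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)

print(is_phone)
print(emails)
```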

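Regular expression matching rules

Using Python's re module

In Python, regular expressions are available through the built-in re module.

One point needs particular attention: regular expressions use the backslash (\) to escape special characters, and ordinary Python string literals also treat the backslash as an escape character. To pass the backslashes through to the regex engine unchanged, write the pattern as a raw string by adding an r prefix, for example: r'chuanzhiboke\t\.\tpython'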
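A small sketch of why the r prefix matters, using \b as the example: in an ordinary string literal Python turns it into a backspace character, while in a raw string it reaches the regex engine as a word boundary.

```python
import re

plain = "\bcat\b"   # Python converts each \b into a backspace character (\x08)
raw = r"\bcat\b"    # the backslashes survive; the regex engine reads \b as a word boundary

print(len(plain), len(raw))            # 5 7
print(re.findall(plain, "a cat sat"))  # []
print(re.findall(raw, "a cat sat"))    # ['cat']
```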

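The usual steps for using the re module are:

1. Use the compile() function to compile the regular expression, given as a string, into a Pattern object.
2. Use the methods of the Pattern object to search the text and get a match result, which is a Match object.
3. Finally, use the attributes and methods of the Match object to get the information, and carry out whatever further processing is needed.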

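The compile function

The compile function compiles a regular expression and produces a Pattern object. It is generally used like this:

import re

# Compile the regular expression into a Pattern object
pattern = re.compile(r'\d+')

Now that the regular expression has been compiled into a Pattern object, we can use the methods of pattern to search text for matches.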

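The most commonly used methods of a Pattern object are:

match: searches from the start position, returns a single match
search: searches from any position, returns a single match
findall: finds all matches, returns a list
finditer: finds all matches, returns an iterator
split: splits the string, returns a list
sub: substitutes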
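As a quick side-by-side of the six methods listed above, here is a minimal sketch that runs each of them with the same pattern on the same string (the variable names are just for illustration):

```python
import re

pattern = re.compile(r"\d+")
text = "one12twothree34four"

head = pattern.match(text)                           # None: the string does not start with a digit
first = pattern.search(text).group()                 # '12': the first match anywhere
all_matches = pattern.findall(text)                  # ['12', '34']
spans = [m.span() for m in pattern.finditer(text)]   # [(3, 5), (13, 15)]
pieces = pattern.split(text)                         # ['one', 'twothree', 'four']
masked = pattern.sub("#", text)                      # 'one#twothree#four'

print(head, first, all_matches, spans, pieces, masked, sep="\n")
```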

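The match method

The match method checks the beginning of the string (a start position can also be specified). It performs a single match: it returns as soon as one match is found (mind the difference between greedy and non-greedy quantifiers) rather than looking for every match. It is generally used like this:

>>> import re
>>> pattern = re.compile(r'\d+')  # matches one or more digits (+ means one or more)

>>> m = pattern.match('one12twothree34four')  # checks the head of the string: no match
>>> print(m)
None

>>> m = pattern.match('one12twothree34four', 2, 10)  # start at 'e' (index 2, counting from 0): no match
>>> print(m)
None

>>> m = pattern.match('one12twothree34four', 3, 10)  # start at '1': this matches
>>> print(m)                                          # a Match object is returned
<re.Match object; span=(3, 5), match='12'>

>>> m.group(0)   # the 0 can be omitted
'12'
>>> m.start(0)   # the 0 can be omitted; start position
3
>>> m.end(0)     # the 0 can be omitted; end position
5
>>> m.span(0)    # the 0 can be omitted; (start, end) range
(3, 5)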

Here is another example:

>>> import re
>>> pattern = re.compile(r'([a-z]+) ([a-z]+)', re.I)  # re.I means ignore case
>>> m = pattern.match('Hello World Wide Web')

>>> print(m)    # the match succeeds and a Match object is returned
<re.Match object; span=(0, 11), match='Hello World'>

>>> m.group(0)  # the whole matched substring
'Hello World'

>>> m.span(0)   # the indices of the whole matched substring
(0, 11)

>>> m.group(1)  # the substring matched by the first group
'Hello'

>>> m.span(1)   # the indices of the substring matched by the first group
(0, 5)

>>> m.group(2)  # the substring matched by the second group
'World'

>>> m.span(2)   # the indices of the substring matched by the second group
(6, 11)

>>> m.groups()  # equivalent to (m.group(1), m.group(2), ...)
('Hello', 'World')

>>> m.group(3)   # there is no third group
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group
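The description of match above mentions greedy versus non-greedy matching without showing it. A minimal sketch of the difference, on a made-up HTML fragment:

```python
import re

html = "<b>bold</b><i>italic</i>"

greedy = re.match(r"<.+>", html).group()   # .+ takes as much as it can
lazy = re.match(r"<.+?>", html).group()    # .+? stops at the first '>'

print(greedy)  # <b>bold</b><i>italic</i>
print(lazy)    # <b>
```

Adding ? after a quantifier (+?, *?, ??) makes it non-greedy, which is usually what is wanted when picking tags or quoted values out of a page.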

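The search method

The search method looks for a match at any position in the string. It is also a single match: it returns as soon as one match is found rather than looking for every match. When the match succeeds it returns a Match object; otherwise it returns None. Let's look at an example:

>>> import re
>>> pattern = re.compile(r'\d+')
>>> m = pattern.search('one12twothree34four')  # match() would not find anything here
>>> m
<re.Match object; span=(3, 5), match='12'>
>>> m.group()
'12'
>>> m = pattern.search('one12twothree34four', 10, 30)  # restrict the search to a range of the string
>>> m
<re.Match object; span=(13, 15), match='34'>
>>> m.group()
'34'
>>> m.span()
(13, 15)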

Here is another example:

import re

# Compile the regular expression into a Pattern object
pattern = re.compile(r'\d+')
# Use search() to find a matching substring; it returns None when there is none
# match() would fail here because the string does not start with a digit
m = pattern.search('hello 123456 789')
if m:
    # Use the Match object to get the matched text
    print('matching string:', m.group())
    # Start and end positions
    print('position:', m.span())

Output:

matching string: 123456
position: (6, 12)

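The findall method

The match and search methods above both perform a single match and return as soon as one result is found. Most of the time, however, we need to search the whole string and get every match.

findall returns all matching substrings as a list; if there is no match, it returns an empty list.

An example:

import re
pattern = re.compile(r'\d+')   # find digits

result1 = pattern.findall('hello 123456 789')
result2 = pattern.findall('one1two2three3four4', 0, 10)

print(result1)
print(result2)

Output:

['123456', '789']
['1', '2']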

Here is another example:

# re_test.py

import re

# The re module provides a function called compile: we give it a matching rule
# and it returns a Pattern instance, which we then use to match strings
pattern = re.compile(r'\d+\.\d*')

# pattern.findall() returns every substring that matches
result = pattern.findall("123.141593, 'bigcat', 232312, 3.15")

# findall returns all matching substrings to result as a list
for item in result:
    print(item)

Output:

123.141593
3.15

The finditer method

finditer behaves much like findall: it also searches the whole string and finds every match. The difference is that it returns an iterator that yields each match result (a Match object) in order.

import re
pattern = re.compile(r'\d+')

result_iter1 = pattern.finditer('hello 123456 789')
result_iter2 = pattern.finditer('one1two2three3four4', 0, 10)

print(type(result_iter1))
print(type(result_iter2))

print('result1...')
for m1 in result_iter1:   # m1 is a Match object
    print('matching string: {}, position: {}'.format(m1.group(), m1.span()))

print('result2...')
for m2 in result_iter2:
    print('matching string: {}, position: {}'.format(m2.group(), m2.span()))

Output:

<class 'callable_iterator'>
<class 'callable_iterator'>
result1...
matching string: 123456, position: (6, 12)
matching string: 789, position: (13, 16)
result2...
matching string: 1, position: (3, 4)
matching string: 2, position: (7, 8)

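The split method

The split method splits the string at every substring the pattern matches and returns the pieces as a list. It can be seen as a variant of the string method split(), except that the regex version can split on several rules at once.

split(string[, maxsplit])

maxsplit sets the maximum number of splits; if it is not given, the string is split at every match.

import re
p = re.compile(r'[\s,;]+')
print(p.split('a,b;; c   d'))

Output:

['a', 'b', 'c', 'd']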
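The example above does not use maxsplit. A minimal sketch of its effect, reusing the same pattern and string:

```python
import re

p = re.compile(r"[\s,;]+")

unlimited = p.split("a,b;; c   d")
limited = p.split("a,b;; c   d", maxsplit=2)

print(unlimited)  # ['a', 'b', 'c', 'd']
print(limited)    # ['a', 'b', 'c   d']
```

With maxsplit=2 only the first two separators are consumed, and the remainder of the string is returned untouched as the last element.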

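The sub method

The sub method performs substitution. Its general form is:

sub(repl, string[, count])

repl can be either a string or a function:

If repl is a string, every matched substring is replaced with repl and the resulting string is returned. repl can also reference groups using the form \id, but number 0 cannot be used this way;
If repl is a function, it should accept a single argument (a Match object) and return the string to use as the replacement (group references inside the returned string are not expanded).
count sets the maximum number of replacements; if it is not given, every match is replaced.

import re
p = re.compile(r'(\w+) (\w+)')  # for ASCII text, \w is [A-Za-z0-9_]
s = 'hello 123, hello 456'

print(p.sub(r'hello world', s))  # replace 'hello 123' and 'hello 456' with 'hello world'
print(p.sub(r'\2 \1', s))        # reference groups

def func(m):
    return 'hi' + ' ' + m.group(2)

print(p.sub(func, s))
print(p.sub(func, s, 1))         # replace at most once

Output:

hello world, hello world
123 hello, 456 hello
hi 123, hi 456
hi 123, hello 456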

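======================== divider ==========================

III. bs4: BeautifulSoup4 and CSS selectors

If you only need bs4 for simple data extraction, you can skip to the CSS selectors part at the end of this section (highlighted in red in the original post).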

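i. Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0

ii. Beautiful Soup is also an HTML / XML parser; its main job, likewise, is parsing HTML / XML and extracting data from it.

lxml only traverses locally, whereas Beautiful Soup is based on the HTML DOM: it loads the whole document and parses the entire DOM tree, so its time and memory costs are much higher and its performance is lower than lxml's.

Beautiful Soup makes parsing HTML fairly simple. Its API is very user-friendly, it supports CSS selectors and the HTML parser in the Python standard library, and it also supports lxml's XML parser.

Beautiful Soup 3 is no longer being developed; Beautiful Soup 4 is recommended for current projects. Install it with pip: pip install beautifulsoup4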

Tool	Speed	Difficulty of use	Difficulty of installation
Regular expressions	Fastest	Hard	None (built in)
BeautifulSoup	Slow	Easiest	Easy
lxml	Fast	Easy	Moderate

iii. A simple usage example

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the Beautiful Soup object
soup = BeautifulSoup(html)

# Alternatively, create the object from a local HTML file
# soup = BeautifulSoup(open('index.html'))

# Pretty-print the contents of the soup object
print(soup.prettify())

Output:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

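Running this code (the original author used IPython 2) produces a warning.
It means that because we did not specify a parser explicitly, bs4 falls back to the best HTML parser available on this system ("lxml"). If the code is run on another system, or in a different virtual environment, a different parser may be used and the behaviour may differ.
We can specify the lxml parser explicitly with soup = BeautifulSoup(html, "lxml").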
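A minimal sketch of naming the parser explicitly. It uses "html.parser", which ships with Python, on the assumption that lxml may not be installed; replacing it with "lxml" follows the same pattern. The HTML fragment is a shortened version of the sample above.

```python
from bs4 import BeautifulSoup

html = "<p class='title'><b>The Dormouse's story</b></p>"

# "html.parser" is in the standard library; "lxml" needs the third-party lxml package.
soup = BeautifulSoup(html, "html.parser")

print(soup.b.string)  # The Dormouse's story
```

Naming the parser removes the warning and keeps the behaviour the same across machines.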

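iv. The four kinds of bs4 objects

Beautiful Soup converts a complex HTML document into a complex tree structure in which every node is a Python object. All objects fall into four kinds:

Tag
NavigableString
BeautifulSoup
Comment

1. Tag

Put simply, a Tag is one of the tags in the HTML, for example:

<head><title>The Dormouse's story</title></head>
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

The title, head, a, p and similar HTML tags above, together with the content they enclose, are Tags. Let's try using Beautiful Soup to get them: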

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the Beautiful Soup object, naming the parser explicitly
soup = BeautifulSoup(html, "lxml")

print(soup.title)
# <title>The Dormouse's story</title>

print(soup.head)
# <head><title>The Dormouse's story</title></head>

print(soup.a)
# <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

print(soup.p)
# <p class="title" name="dromouse"><b>The Dormouse's story</b></p>

print(type(soup.p))
# <class 'bs4.element.Tag'>

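We can easily get these tags by writing soup followed by the tag name; the resulting objects have the type bs4.element.Tag. Note, however, that this only finds the first matching tag in the whole document. Querying all matching tags is covered later.

A Tag has two important attributes: name and attrs.

print(soup.name)
# [document]
# The soup object itself is special: its name is [document]

print(soup.head.name)
# head
# For other tags, the value is the name of the tag itself

print(soup.p.attrs)
# {'class': ['title'], 'name': 'dromouse'}
# This prints all the attributes of the p tag; the result is a dict

print(soup.p['class'])  # same result as soup.p.get('class')
# ['title']
# The get method, given the attribute name, is equivalent when the attribute exists

soup.p['class'] = "newClass"
print(soup.p)  # attributes and content can be modified
# <p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>

del soup.p['class']  # an attribute can also be deleted
print(soup.p)
# <p name="dromouse"><b>The Dormouse's story</b></p>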

2. NavigableString

Now that we have the tag itself, how do we get the text inside it? Simply use .string, for example:

print(soup.p.string)
# The Dormouse's story

print(type(soup.p.string))
# <class 'bs4.element.NavigableString'>

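3. BeautifulSoup

The BeautifulSoup object represents the content of a document as a whole. Most of the time it can be treated as a Tag object; it is a special Tag. We can look at its type, name, and attributes to get a feel for it:

print(type(soup.name))
# <class 'str'>

print(soup.name)
# [document]

print(soup.attrs)  # the document itself has no attributes
# {}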

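4. Comment

A Comment object is a special type of NavigableString object; its output does not include the comment markers.

print(soup.a)
# <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

print(soup.a.string)
#  Elsie 

print(type(soup.a.string))
# <class 'bs4.element.Comment'>

The content of the a tag is actually a comment, but when we output it via .string the comment markers have already been removed.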

v. Traversing the document tree

1. Direct children: the `.contents` and `.children` attributes

.contents

A tag's .contents attribute returns the tag's direct children as a list.

print(soup.head.contents)
# [<title>The Dormouse's story</title>]

Because the result is a list, we can use a list index to get one of its elements.

print(soup.head.contents[0])
# <title>The Dormouse's story</title>

.children

This does not return a list, but we can iterate over it to get all the direct children.

Printing .children shows that it is a list iterator object (the exact iterator type may differ between bs4 versions).

print(soup.head.children)
# <list_iterator object at 0x7f71457f5710>

for child in soup.body.children:
    print(child)

Result:

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>

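2. All descendants: the `.descendants` attribute

.contents and .children contain only a tag's direct children. The .descendants attribute iterates recursively over all of a tag's descendants. As with .children, we need to loop over it to get the content.

for child in soup.descendants:
    print(child)

Output: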

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
The Dormouse's story


<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>


<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
The Dormouse's story


<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
Once upon a time there were three little sisters; and their names were

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
 Elsie 
,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
 and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
Tillie
;
and they lived at the bottom of a well.


<p class="story">...</p>

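3. Node content: the `.string` attribute

If a tag contains no other tags, .string returns the text inside it. If a tag contains exactly one child tag, .string also returns the innermost text.

print(soup.head.string)
# The Dormouse's story

print(soup.title.string)
# The Dormouse's story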
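The two cases above leave out what happens when a tag has several children: .string is then None, and get_text() can be used to collect all the text instead. A minimal sketch on a shortened, made-up fragment, using the standard-library "html.parser":

```python
from bs4 import BeautifulSoup

html = '<p class="story">Once upon a time <a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, "html.parser")

single = soup.a.string       # 'Lacie': the tag has a single text child
several = soup.p.string      # None: the tag has several children
all_text = soup.p.get_text() # every piece of text under the tag, concatenated

print(single)
print(several)
print(all_text)  # Once upon a time Lacie and Tillie
```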

vi. Searching the document tree

1. `find_all(name, attrs, recursive, text, **kwargs)`

1) The name parameter

The name parameter finds all tags whose name is name; string objects are automatically ignored.

A. Passing a string

The simplest filter is a string. Pass a string to a search method and Beautiful Soup looks for content that matches the string exactly. The example below finds all <b> tags in the document:

soup.find_all('b')
# [<b>The Dormouse's story</b>]

print(soup.find_all('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

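B. Passing a regular expression

If a regular expression is passed in, Beautiful Soup filters tag names with the regular expression's search() method, so the ^ anchor is what restricts the match to the start of the name. The example below finds all tags whose names start with b, which means both <body> and <b> should be found:

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b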

C. Passing a list

If a list is passed in, Beautiful Soup returns content that matches any element of the list. The code below finds all <a> tags and <b> tags in the document:

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

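2) Keyword arguments

soup.find_all(class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

3) The text parameter

The text parameter searches the string content of the document. Like the name parameter, text accepts a string, a regular expression, or a list. (Newer bs4 versions call this argument string; text is the older name.)

# Note: the outputs below assume the first link contains the plain text Elsie.
# In the sample HTML above it is a comment (<!-- Elsie -->) whose string is ' Elsie ',
# so an exact search for "Elsie" would not be expected to match it there.

soup.find_all(text="Elsie")
# ['Elsie']

soup.find_all(text=["Tillie", "Elsie", "Lacie"])
# ['Elsie', 'Lacie', 'Tillie']

soup.find_all(text=re.compile("Dormouse"))
# ["The Dormouse's story", "The Dormouse's story"]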

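2. The find() method

find() is used in the same way as find_all(). The difference is that find() returns the first result that matches the query, while find_all() returns a list of all results that match.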
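A minimal sketch of the difference between the two, on a shortened, made-up fragment and with the standard-library "html.parser":

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a class="sister" id="link2">Lacie</a>
<a class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a", class_="sister")      # a single Tag, or None when nothing matches
every = soup.find_all("a", class_="sister")  # a list-like result holding every matching Tag
missing = soup.find("table")                 # no <table> in the fragment

print(first["id"])               # link2
print([a["id"] for a in every])  # ['link2', 'link3']
print(missing)                   # None
```

Because find() returns None when nothing matches, it is worth checking the result before indexing into it.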

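3. CSS selectors (the most commonly used approach in crawlers)

When writing CSS, a tag name takes no prefix, a class name is prefixed with a period (.), and an id is prefixed with #.
We can filter elements in a similar way here, using soup.select(), which returns a list.

(1) Finding by tag name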

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the Beautiful Soup object, naming the parser explicitly
soup = BeautifulSoup(html, "lxml")

print(soup.select('title'))
# [<title>The Dormouse's story</title>]

print(soup.select('a'))  # gets all the a tags
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('b'))
# [<b>The Dormouse's story</b>]

(2) Finding by class name

print(soup.select('.sister'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

(3) Finding by id

print(soup.select('#link1'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

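(4) Combined lookups

Combined lookups work on the same principle as combining tag names with class names and ids when writing CSS. For example, to find the element with id link1 inside a p tag, separate the two parts with a space:

print(soup.select('p #link1'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

To find direct child tags, separate the parts with >:

print(soup.select("head > title"))
# [<title>The Dormouse's story</title>]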

(5) Finding by attribute

A lookup can also include attributes, written in square brackets. Note that the attribute and the tag belong to the same node, so there must be no space between them; otherwise nothing will match.

print(soup.select('a[class="sister"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('a[href="http://example.com/elsie"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

Likewise, attributes can be combined with the lookups above: parts on different nodes are separated by a space, parts on the same node are not.

print(soup.select('p a[href="http://example.com/elsie"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

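(6) Getting the content

soup = BeautifulSoup(html, 'lxml')
print(type(soup.select('title')))
print(soup.select('title')[0].get_text())

for title in soup.select('title'):
    print(title.get_text())

These are the basic, practical parts of bs4. For more detailed and complete usage, see the official documentation.

The rest of these notes on parsing crawler data continues in "Parsing and extracting page data in Python crawlers (2)".

Link: https://www.cnblogs.com/lowmanisbusy/p/9226217.html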
