Python: the BeautifulSoup4 parser, JSON and JsonPath, multithreaded crawlers, and dynamic HTML handling

Reposted article. 2017-12-02 19:03, Python

The Self-Cultivation of a Crawler, Part 3

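I. CSS selectors: BeautifulSoup4

Like lxml, Beautiful Soup is an HTML/XML parser; its main job is likewise to parse HTML/XML and extract data from it.

lxml only traverses the parts of the document it needs, whereas Beautiful Soup works on the HTML DOM: it loads the whole document and parses the entire DOM tree. Its time and memory costs are therefore much higher, and it performs worse than lxml.

Beautiful Soup makes parsing HTML fairly simple and has a very friendly API. It supports CSS selectors, and it can use either the HTML parser in the Python standard library or lxml's HTML and XML parsers as its backend.

Beautiful Soup 3 is no longer developed; new projects should use Beautiful Soup 4. Install it with pip: pip install beautifulsoup4

Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0

The four kinds of objects

Beautiful Soup converts a complex HTML document into a tree in which every node is a Python object. All of these objects fall into four kinds: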

Tag
NavigableString
BeautifulSoup
Comment

1. Tag

Put simply, a Tag is one of the tags in the HTML, for example:

<head><title>The Dormouse's story</title></head>
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

The HTML tags above (title, head, a, p and so on), together with the content they enclose, are Tags. Let's use Beautiful Soup to get some Tags:

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the Beautiful Soup object; naming the parser explicitly avoids a warning
soup = BeautifulSoup(html, 'lxml')

print(soup.title)
# <title>The Dormouse's story</title>

print(soup.head)
# <head><title>The Dormouse's story</title></head>

print(soup.a)    # only the first <a> tag is returned
# <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

print(soup.p)
# <p class="title" name="dromouse"><b>The Dormouse's story</b></p>

print(type(soup.p))    # its type
# <class 'bs4.element.Tag'>

# Get every <a> tag and read its attributes; note the i["..."] syntax
for i in soup.find_all('a'):
    print(i)
    print(i['id'])
    print(i['class'])
    print(i['href'])
# Output for the first <a> tag:
# <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
# link1
# ['sister']
# http://example.com/elsie

A Tag has two important attributes: name and attrs.

print(soup.name)
# [document]    # the soup object itself is special: its name is [document]

print(soup.head.name)
# head    # for any other tag, the value is the name of the tag itself
# An HTML attribute called name on a tag is stored in attrs, not in .name

print(soup.p.attrs)
# {'class': ['title'], 'name': 'dromouse'}
# This prints all attributes of the p tag; the result is a dict.

print(soup.p['class'])    # for an existing attribute, same as soup.p.get('class')
# ['title']

soup.p['class'] = "newClass"
print(soup.p)    # attributes and content can be modified
# <p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>

del soup.p['class']    # an attribute can also be deleted
print(soup.p)
# <p name="dromouse"><b>The Dormouse's story</b></p>

2. NavigableString

Use the .string attribute to get the text inside a tag:

print(soup.p.string)
# The Dormouse's story

print(type(soup.p.string))
# <class 'bs4.element.NavigableString'>

3. BeautifulSoup

A BeautifulSoup object represents the whole document. Most of the time you can treat it as a special Tag; we can inspect its type, name and attributes.

soup = BeautifulSoup(html,'lxml')    # the BeautifulSoup object

print(type(soup.name))
# <class 'str'>

print(soup.name)
# [document]

print(soup.attrs)    # the document itself has no attributes
# {}

4. Comment

A Comment object is a special kind of NavigableString; its output does not include the comment markers.

print(soup.a)
# <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

print(soup.a.string)
#  Elsie     # printed without the <!-- --> comment markers

print(type(soup.a.string))
# <class 'bs4.element.Comment'>

Traversing the document tree

1. Direct children: the `.contents` and `.children` attributes

.contents

A tag's .contents attribute returns the tag's direct children (tags and text nodes) as a list.

print(soup.body.contents)    # this works at any level of the document
# ['\n', <p class="title" name="dromouse"><b>The Dormouse's story</b></p>, '\n', <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>, '\n', <p class="story">...</p>, '\n']
# (which whitespace-only text nodes appear can vary with the parser)

Because the result is a list, we can use an index to get a single element:

print(soup.head.contents[0])
# <title>The Dormouse's story</title>

.children

This does not return a list, but we can iterate over it to get all the children.

Printing .children shows that it is an iterator over the children, for example:

print(soup.head.children)
# <list_iterator object at 0x7f71457f5710>

for child in soup.body.children:    # this works at any level of the document
    print(child)

Result

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>

2. All descendants: the `.descendants` attribute

.contents and .children contain only a tag's direct children. The .descendants attribute walks recursively through all of a tag's descendants. Like .children it is an iterator, so we loop over it to get the content.

html = """
<html>
    <head>
        <title>The Dormouse's story<a>sadcx</a></title>
    </head>
"""

bs = BeautifulSoup(html,'lxml')

for child in bs.head.descendants:    # this works at any level of the document
    print(child)

Result

<title>The Dormouse's story<a>sadcx</a></title>
The Dormouse's story
<a>sadcx</a>
sadcx

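Searching the document tree

1. `find_all(name, attrs, recursive, text, **kwargs)`

1) The name parameter

The name parameter finds every tag whose name matches; string (text) nodes are ignored automatically.

A. Passing a string

The simplest filter is a string. Pass a string to a search method and Beautiful Soup looks for tags whose name matches that string exactly. The example below finds all <b> tags in the document:

print(soup.find_all('b'))
# [<b>The Dormouse's story</b>]

print(soup.find_all('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

B. Passing a regular expression

If you pass a compiled regular expression, Beautiful Soup matches tag names against it using the pattern's search() method. The example below finds all tags whose names start with b, so both <body> and <b> are found:

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

C. Passing a list

If you pass a list, Beautiful Soup returns everything that matches any item in the list. The code below finds all <a> and <b> tags:

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

2) Keyword arguments (attributes)

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

soup.find_all("input", attrs={"name": "_xsrf", "class": "c1", "id": "link2"})
# attributes can also be passed as a dict

3) The text parameter

The text parameter searches the strings in the document. Like name, it accepts a string, a regular expression, or a list. (Since Beautiful Soup 4.4.0 the same argument is also available under the name string.)

soup.find_all(text="Lacie")
# ['Lacie']

soup.find_all(text=["Tillie", "Elsie", "Lacie"])
# ['Lacie', 'Tillie']    # "Elsie" is not matched: in this document it occurs only inside a comment, as " Elsie "

soup.find_all(text=re.compile("Dormouse"))
# ["The Dormouse's story", "The Dormouse's story"]

2. `find(name, attrs, recursive, text, **kwargs)`

find is used the same way as find_all, except that find returns only the first match (or None when nothing matches), while find_all returns a list of all matches.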
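The difference between the two return types is easy to see on a small document. The following is a minimal sketch, assuming a made-up HTML snippet (the variable `doc` and its contents are illustrative, not part of the original example) and the standard-library `html.parser` backend so that lxml is not required:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical document, for illustration only.
doc = """
<html><body>
<p class="story">
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body></html>
"""

soup = BeautifulSoup(doc, "html.parser")

first_link = soup.find("a")        # a single Tag (the first match)
all_links = soup.find_all("a")     # a list-like ResultSet of every match
missing = soup.find("table")       # no <table> here, so find() gives None
empty = soup.find_all("table")     # find_all() gives an empty list instead

# A compiled regular expression is matched against tag names.
b_tags = [t.name for t in soup.find_all(re.compile("^b"))]

print(first_link["id"])
print([a["id"] for a in all_links])
print(missing, len(empty))
print(b_tags)
```

Because find can return None, code that chains onto its result (for example `.get("href")`) is usually guarded with an `is None` check first.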

CSS selectors

This is another way of searching that achieves much the same as find_all.

When writing CSS, a tag name is written bare, a class name is prefixed with a dot (.), and an id is prefixed with #.
We can filter elements the same way here with soup.select(), which returns a list.

(1) Selecting by tag name

print(soup.select('title'))
#[<title>The Dormouse's story</title>]

print(soup.select('a'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('b'))
#[<b>The Dormouse's story</b>]

(2) Selecting by class name

print(soup.select('.sister'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

(3) Selecting by id

print(soup.select('#link1'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

(4) Combined selectors

Combined selectors follow the same rules as in a CSS file, where tag names, class names and ids are combined. For example, to find the element with id link1 inside a p tag, separate the two parts with a space:

print(soup.select('p #link1'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

To select direct children only, separate the parts with >:

print(soup.select("head > title"))
#[<title>The Dormouse's story</title>]

(5) Selecting by attribute

A selector can also include attributes, written in square brackets. The attribute and the tag belong to the same node, so there must be no space between them; otherwise nothing will match.

print(soup.select('a[class="sister"]'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('a[href="http://example.com/elsie"]'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

Attribute selectors can likewise be combined with the forms above: parts on different nodes are separated by a space, parts on the same node are not.

print(soup.select('p a[href="http://example.com/elsie"]'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

(6) Getting the content

Every select call above returns a list. Iterate over it and call get_text() on each element to get its content.

soup = BeautifulSoup(html, 'lxml')
print(type(soup.select('title')))
print(soup.select('title')[0].get_text())

for title in soup.select('title'):
    print(title.get_text())
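The selector forms above can be tried side by side on a small document. This is a minimal sketch, assuming an invented HTML fragment (`doc`, the `menu` id and the `item`/`special` classes are all hypothetical) and the `html.parser` backend:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, for illustration only.
doc = """
<div id="menu">
  <p class="item">Tea</p>
  <p class="item special">Coffee</p>
  <span class="item">Water</span>
</div>
"""
soup = BeautifulSoup(doc, "html.parser")

by_tag = [t.get_text() for t in soup.select("p")]               # tag name
by_class = [t.get_text() for t in soup.select(".item")]         # class name
by_child = [t.get_text() for t in soup.select("#menu > span")]  # id + direct child
by_attr = [t.get_text() for t in soup.select('p[class~="special"]')]  # attribute

# select_one() returns the first match (or None) instead of a list.
first = soup.select_one("p.item")

print(by_tag, by_class, by_child, by_attr, first.get_text())
```

Note that `p.item` (no space) means "a p element that has class item", while `p .item` (with a space) would mean "an element with class item somewhere inside a p".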

Example: using bs to get _xsrf and log in to Zhihu

from bs4 import BeautifulSoup
import requests, time

# Request headers
headers = {"User-Agent" : "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"}

def captcha(captcha_data):
    """
    Handle the captcha
    :return: the text typed by the user
    """
    with open("captcha.jpg", 'wb') as f:
        f.write(captcha_data)
    text = input("Enter the captcha: ")
    return text     # return the captcha typed by the user

def zhihuLogin():
    """
    Get the page's _xsrf
    Captcha URL: found with a packet-capture tool
    POST form request: found with a packet-capture tool
    :return:
    """

    # Comparable to urllib2, where an HTTPCookieProcessor() handler is passed to
    # urllib2.build_opener to build a custom opener that keeps cookies.
    # Here we build a Session object, which keeps the page cookies across requests.
    sess = requests.session()

    # First fetch the login page to find the data we must POST (_xsrf); the cookies of the page are recorded too.
    # requests.get("...") would also send the request, but then the cookies would not be kept.
    # The HTML can be read from .text or .content
    html = sess.get('https://www.zhihu.com/#signin', headers=headers).text   # send a GET request

    # Parse with the lxml parser
    bs = BeautifulSoup(html, 'lxml')

    # _xsrf protects against CSRF (cross-site request forgery), an attack that abuses the trust a site places in a user.
    # The attacker forges requests that appear to come from a trusted user (by riding on the cookie) to steal data or deceive the server.
    # So the site stores an MD5 string in a hidden field and uses it to check the user's cookie against the server-side session.

    # Find the input tag whose name attribute is _xsrf, then read its value
    _xsrf = bs.find("input", attrs={"name": "_xsrf"}).get("value")
    # Build the captcha URL from the current UNIX timestamp,
    # request the image and read its bytes
    captcha_data = sess.get('https://www.zhihu.com/captcha.gif?r=%d&type=login' % (time.time() * 1000), headers=headers).content
    # Call the function above (manual input needed) to get the captcha text
    captcha_text = captcha(captcha_data)
    data = {
        "_xsrf": _xsrf,
        "phone_num": "xxx",
        "password": "xxx",
        "captcha": captcha_text
    }
    # Send the POST data needed to log in; the post-login cookies are kept in sess
    sess.post('https://www.zhihu.com/login/phone_num', data=data, headers=headers)

    # Use the logged-in cookies to request the target page and get its source
    response = sess.get("https://www.zhihu.com/people/peng-peng-34-48-53/activities", headers=headers)

    with open("jiaxin.html", 'wb') as f:
        f.write(response.content)

if __name__ == '__main__':
    zhihuLogin()
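The token lookup in the login example fails with an AttributeError when the hidden field is absent, because find returns None. Below is a small offline sketch of a guarded version; the helper name `extract_hidden_field` and the sample `login_page` markup are assumptions made for illustration, not part of the original script:

```python
from bs4 import BeautifulSoup

def extract_hidden_field(html, field_name):
    """Return the value of <input name=field_name>, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.find("input", attrs={"name": field_name})
    if node is None:          # find() returns None when nothing matches
        return None
    return node.get("value")  # .get() returns None if there is no value attribute

# Hypothetical login form, for illustration only.
login_page = '<form><input type="hidden" name="_xsrf" value="abc123"/></form>'

token = extract_hidden_field(login_page, "_xsrf")
absent = extract_hidden_field("<form></form>", "_xsrf")
print(token, absent)
```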

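II. JSON and JsonPath

JSON

Put simply, JSON is JavaScript's objects and arrays. These are its two structures, and together they can represent all kinds of complex data.

Object: in JavaScript an object is whatever is enclosed in { }. Its structure is key/value pairs, { key: value, key: value, ... }. In object-oriented terms the key is a property of the object and the value is that property's value, which makes it easy to understand. A value is read as object.key, and it can be a number, a string, an array or an object.

Array: in JavaScript an array is whatever is enclosed in [ ]. Its structure is ["Python", "javascript", "C++", ...]. As in every language, values are read by index, and each element can be a number, a string, an array or an object.

The json module provides four functions, `dumps`, `dump`, `loads` and `load`, for converting between strings and Python data types.

1. json.loads()    string ==> Python type

Decodes a JSON-formatted string into a Python object. The type mapping from JSON to Python is: object to dict, array to list, string to str, number to int or float, true/false to True/False, and null to None.

import json

strList = '[1, 2, 3, 4]'

strDict = '{"city": "北京", "name": "大貓"}'

print(json.loads(strList))
# [1, 2, 3, 4]

print(json.loads(strDict))    # in Python 3 the decoded strings are ordinary Unicode str objects
# {'city': '北京', 'name': '大貓'}

2. json.dumps()    Python type ==> string

Encodes a Python object as a JSON string and returns a str. The type mapping from Python to JSON is: dict to object, list and tuple to array, str to string, int and float to number, True/False to true/false, and None to null.

import json

dictStr = {"city": "北京", "name": "大劉"}
# Note: json.dumps() escapes non-ASCII characters by default.
# Pass ensure_ascii=False to turn the escaping off and keep the characters as they are.

print(json.dumps(dictStr))
# {"city": "\u5317\u4eac", "name": "\u5927\u5289"}

print(json.dumps(dictStr, ensure_ascii=False))
# {"city": "北京", "name": "大劉"}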
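To see the type mapping and the effect of ensure_ascii together, here is a short round-trip sketch using only the standard library. The `record` dictionary is invented for illustration:

```python
import json

# Hypothetical record covering several JSON types.
record = {"city": "北京", "tags": ["a", "b"], "active": True, "score": None}

escaped = json.dumps(record)                       # non-ASCII written as \uXXXX escapes
readable = json.dumps(record, ensure_ascii=False)  # non-ASCII characters kept as they are
restored = json.loads(escaped)                     # both forms decode to the same object

print(escaped)
print(readable)
print(restored == record)
```

Both strings are valid JSON for the same data; ensure_ascii only changes how non-ASCII characters are written, which matters mainly when a person has to read the output.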

3. json.dump()    # rarely needed

Serializes a Python built-in type to JSON and writes it to a file.

import json

listStr = [{"city": "北京"}, {"name": "大劉"}]
# json.dump() writes str, so open the file in text mode with an explicit encoding
with open("listStr.json", "w", encoding="utf-8") as f:
    json.dump(listStr, f, ensure_ascii=False)

dictStr = {"city": "北京", "name": "大劉"}
with open("dictStr.json", "w", encoding="utf-8") as f:
    json.dump(dictStr, f, ensure_ascii=False)

4. json.load()    # rarely needed

Reads JSON text from a file and converts it to Python types.

# json_load.py

import json

with open("listStr.json", encoding="utf-8") as f:
    strList = json.load(f)
print(strList)
# [{'city': '北京'}, {'name': '大劉'}]

with open("dictStr.json", encoding="utf-8") as f:
    strDict = json.load(f)
print(strDict)
# {'city': '北京', 'name': '大劉'}
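The file-based pair can be checked end to end without leaving files behind. This sketch writes to a temporary directory; the path handling via `tempfile` is an assumption of the sketch, not something the original example does:

```python
import json
import os
import tempfile

data = [{"city": "北京"}, {"name": "大劉"}]

# Write into a throwaway directory so the sketch leaves no stray files around.
path = os.path.join(tempfile.mkdtemp(), "listStr.json")

# Text mode plus an explicit encoding keeps the non-ASCII characters intact.
with open(path, "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False)

with open(path, "r", encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded == data)
```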

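JsonPath

Author's note: Python 3 has no jsonpath package; it is replaced by jsonpath_rw, whose usage I have not worked out.

JsonPath is an information-extraction library: a tool for pulling specified information out of a JSON document. It has implementations in several languages, including JavaScript, Python, PHP and Java.

JsonPath is to JSON what XPath is to XML.

Download: https://pypi.python.org/pypi/jsonpath

Installation: click the Download URL link to download jsonpath, unpack it, then run python setup.py install

Official documentation: http://goessner.net/articles/JsonPath

JsonPath syntax compared with XPath:

JSON has a clear structure, reads well, has low complexity and is very easy to match. The table below lists each JsonPath construct next to its XPath counterpart.

XPath    JSONPath            Description
/        $                   the root object/element
.        @                   the current object/element
/        . or []             child operator
..       n/a                 parent operator
//       ..                  recursive descent
*        *                   wildcard: all objects/elements regardless of name
@        n/a                 attribute access (JSON has no attributes)
[]       []                  subscript operator
|        [,]                 union of alternative names or array indices
n/a      [start:end:step]    array slice operator
[]       ?()                 applies a filter expression
n/a      ()                  script expression
()       n/a                 grouping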

Example: the Lagou JSON API (Python 2)

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import urllib2
# JSON parsing library, the counterpart of lxml
import json
# JSON query syntax, the counterpart of XPath
import jsonpath

url = "http://www.lagou.com/lbs/getAllCitySearchLabels.json"
headers = {"User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}

request = urllib2.Request(url, headers = headers)

response = urllib2.urlopen(request)
# Read the content of the JSON response; the result is a string
html = response.read()

# Convert the JSON string into Python objects (with Unicode strings)
unicodestr = json.loads(html)

# A Python list of every "name" value, at any depth
city_list = jsonpath.jsonpath(unicodestr, "$..name")

#for item in city_list:
#    print item

# dumps() escapes Chinese characters as ASCII by default (ensure_ascii defaults to True).
# Turning that off returns a Unicode string, which is easier to work with.
array = json.dumps(city_list, ensure_ascii=False)
#json.dumps(city_list)
#array = json.dumps(city_list)

with open("lagoucity.json", "w") as f:
    f.write(array.encode("utf-8"))
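To make the meaning of the `$..name` expression concrete, here is a hand-written sketch of recursive descent in plain Python. It is not the jsonpath library's implementation, and the nested `data` structure is invented for illustration (the real API response may be shaped differently):

```python
def find_all_values(node, key):
    """Collect every value stored under `key`, at any depth (like "$..key")."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            found.extend(find_all_values(v, key))   # keep descending into the value
    elif isinstance(node, list):
        for item in node:
            found.extend(find_all_values(item, key))
    return found

# Hypothetical response structure, for illustration only.
data = {
    "content": {
        "data": {
            "allCitySearchLabels": {
                "A": [{"id": 1, "name": "CityA1"}, {"id": 2, "name": "CityA2"}],
                "B": [{"id": 3, "name": "CityB1"}],
            }
        }
    }
}

names = find_all_values(data, "name")
print(names)
```

The point of the `..` operator is exactly this: the caller does not need to know how deeply the `name` keys are nested.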

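III. Multithreaded crawler example

A brief introduction to multithreading in Python

A CPU core can run only one process at a time; the other processes are not running at that moment.

The unit of execution inside a process is the thread; one process can contain several threads.

The memory space of a process is shared, and every thread in the process can use it.

While one thread is using a piece of shared data, other threads that need it must wait until it is done (this is implemented with locks).

What a lock does: it stops several threads from using the same shared memory at once. The first thread to arrive takes the lock, and any other thread that wants in has to wait until the lock is released.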
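The locking idea described above can be sketched with the standard threading module. The shared `state` dictionary and the `add_many` helper are names made up for this illustration:

```python
import threading

state = {"count": 0}       # shared data that every thread updates
lock = threading.Lock()

def add_many(n):
    for _ in range(n):
        with lock:         # take the lock; it is released when the block exits
            state["count"] += 1

threads = [threading.Thread(target=add_many, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()               # wait for every worker to finish

print(state["count"])
```

The read-modify-write in `state["count"] += 1` is not atomic, so without the lock two threads could read the same old value and one update would be lost.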

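The lock in Python (the GIL)

Because of the GIL (global interpreter lock), only one thread executes Python bytecode at a time, so Python threads bring little benefit for CPU-bound work. That is one reason Scrapy relies on asynchronous, event-driven I/O (Twisted) instead of threads.

Multiprocessing in Python suits: heavy parallel computation (CPU-bound work)
Multithreading in Python suits: heavy I/O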

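Queue (the queue object)

The queue module is part of the Python standard library and can be imported directly with import queue (in Python 2 the module was called Queue). A queue is the most common way of exchanging data between threads.

Thoughts on multithreading in Python

Locking is an important step whenever a resource is shared, because Python's built-in list, dict and so on are not thread safe. Queue is thread safe, so use a queue whenever it fits the job.

Initialization: class queue.Queue(maxsize), FIFO (first in, first out)
Commonly used methods:
- Queue.qsize() returns the size of the queue
- Queue.empty() returns True if the queue is empty, otherwise False
- Queue.full() returns True if the queue is full, otherwise False
- Queue.full() is judged against maxsize
- Queue.get([block[, timeout]]) takes an item from the queue; timeout is how long to wait
Creating a queue object
- import queue
- myqueue = queue.Queue(maxsize = 10)
Putting a value into the queue
- myqueue.put(10)
Taking a value out of the queue
- myqueue.get()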
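The methods listed above can be exercised in a few lines. This is a minimal single-threaded sketch; the item names such as "page-1" are placeholders:

```python
import queue

q = queue.Queue(maxsize=2)     # FIFO queue holding at most two items
q.put("page-1")
q.put("page-2")

size = q.qsize()               # number of items currently queued
is_full = q.full()             # True, because maxsize has been reached

try:
    q.put("page-3", block=False)   # a non-blocking put on a full queue...
    overflowed = False
except queue.Full:                 # ...raises queue.Full
    overflowed = True

first = q.get()                # items come back in insertion order
second = q.get()

try:
    q.get(timeout=0.1)         # wait up to 0.1 s on an empty queue...
    drained = False
except queue.Empty:            # ...then raise queue.Empty
    drained = True

print(size, is_full, overflowed, first, second, drained)
```

With the default block=True and no timeout, put and get simply wait instead of raising, which is what makes the queue convenient for handing work between threads.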


Example: crawling jokes from Qiushibaike with multiple threads (study this one carefully)

import threading,json,time,requests
from lxml import etree
from queue import Queue

class ThreadCrawl(threading.Thread):
    def __init__(self,thread_name,pageQueue,dataQueue):
        super(ThreadCrawl,self).__init__()      # call the parent class initializer
        self.thread_name = thread_name      # thread name
        self.pageQueue = pageQueue      # queue of page numbers
        self.dataQueue = dataQueue      # queue of downloaded data
        self.headers = {"User-Agent" : "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}

    def run(self):
        print("Starting " + self.thread_name)
        while not self.pageQueue.empty():   # when pageQueue is empty, the crawl thread leaves the loop
            try:
                # Take one page number, first in first out.
                # get() has an optional block parameter that defaults to True:
                # 1. queue empty and block=True: get() does not return; it blocks until new data arrives
                # 2. queue empty and block=False: get() raises queue.Empty
                page = self.pageQueue.get(False)
                url = "http://www.qiushibaike.com/8hr/page/" + str(page) +"/"
                html = requests.get(url,headers = self.headers).text
                time.sleep(1)   # pause 1 s (the page has already been downloaded; this only throttles the crawl)
                self.dataQueue.put(html)

            except Exception as e:
                pass
        print("Finished " + self.thread_name)

class ThreadParse(threading.Thread):
    def __init__(self,threadName,dataQueue,lock):
        super(ThreadParse,self).__init__()
        self.threadName = threadName
        self.dataQueue = dataQueue
        self.lock = lock    # lock for file reads and writes
        self.headers = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}

    def run(self):
        print("Starting " + self.threadName)
        while not self.dataQueue.empty():   # when dataQueue is empty, the parse thread leaves the loop
            try:
                # Non-blocking get: if another thread took the last item first,
                # this raises queue.Empty instead of blocking forever
                html = self.dataQueue.get(False)
                text = etree.HTML(html)     # parse into an HTML DOM
                node_list = text.xpath('//div[contains(@id, "qiushi_tag_")]')   # xpath returns a list, one node per post

                for i in node_list:
                    username = i.xpath('.//h2')[0].text     # user name
                    user_img = i.xpath('.//img/@src')[0]    # link to the user's avatar
                    word_content = i.xpath('.//div[@class="content"]/span')[0].text     # text content
                    img_content = i.xpath('.//img/@src')        # image content
                    zan = i.xpath('./div[@class="stats"]/span[@class="stats-vote"]/i')[0].text      # likes
                    comments = i.xpath('./div[@class="stats"]/span[@class="stats-comments"]//i')[0].text    # comments
                    items = {
                        "username": username,
                        "image": user_img,
                        "word_content": word_content,
                        "img_content": img_content,
                        "zan": zan,
                        "comments": comments
                    }
                    # A with block always runs two operations: __enter__ and __exit__.
                    # Whatever happens inside, both the open and the close are executed:
                    # acquire the lock, process the content, release the lock
                    with self.lock:
                        with open('qiushi-threading.json123','ab') as f:
                            # Pass ensure_ascii=False to json.dumps(), otherwise the text is escaped as ASCII and the file will not contain Chinese characters
                            f.write((self.threadName+json.dumps(items, ensure_ascii = False)).encode("utf-8") + b'\n')

            except Exception as e:
                print(e)

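def main():
    pageQueue = Queue(10)   # queue of page numbers, holding 10 pages; with no argument the size is unlimited
    for i in range(1,11):   # put in the numbers 1 to 10, first in first out
        pageQueue.put(i)

    dataQueue = Queue()     # queue of crawl results (the HTML source of each page); no argument means unlimited size
    crawlList = ["crawl thread 1", "crawl thread 2", "crawl thread 3"]      # names of the three crawl threads

    threadcrawl = []    # keeps the crawl threads so they can be joined later (wait for all of them before moving on)
    for thread_name in crawlList:
        thread = ThreadCrawl(thread_name,pageQueue,dataQueue)
        thread.start()
        threadcrawl.append(thread)

    for i in threadcrawl:
        i.join()
        print('1')

    lock = threading.Lock()     # create the lock


    # *** The parse threads must be started after the crawl threads have been joined (finished). Otherwise dataQueue.empty() is True because the crawl threads have not stored anything yet ***
    parseList = ["parse thread 1","parse thread 2","parse thread 3"]    # names of the three parse threads
    threadparse = []    # keeps the parse threads so they can be joined later (wait for all of them before exiting)
    for threadName in parseList:
        thread = ThreadParse(threadName,dataQueue,lock)
        thread.start()
        threadparse.append(thread)

    for j in threadparse:
        j.join()
        print('2')
    print("Thanks for using!")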

if __name__ == "__main__":
    main()
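The example above decides when to stop by polling Queue.empty(). A common alternative is to push one sentinel value per worker through the queue, which removes the gap between the empty() check and the get() call. Below is an offline sketch of that variation, not the author's method: `fake_fetch` stands in for requests.get(url).text, and `SENTINEL`, `crawl` and `parse` are names invented here.

```python
import queue
import threading

SENTINEL = None   # marks "no more work"; an assumption of this sketch

def fake_fetch(page):
    # Stand-in for requests.get(url).text so the sketch runs without a network.
    return "<html>page %d</html>" % page

def crawl(page_queue, data_queue):
    while True:
        page = page_queue.get()        # blocks until a page number or a sentinel arrives
        if page is SENTINEL:
            break
        data_queue.put(fake_fetch(page))

def parse(data_queue, results, lock):
    while True:
        html = data_queue.get()
        if html is SENTINEL:
            break
        with lock:                     # protect the shared results list
            results.append(len(html))

page_queue, data_queue = queue.Queue(), queue.Queue()
results, lock = [], threading.Lock()

crawlers = [threading.Thread(target=crawl, args=(page_queue, data_queue)) for _ in range(3)]
parsers = [threading.Thread(target=parse, args=(data_queue, results, lock)) for _ in range(3)]
for t in crawlers + parsers:
    t.start()                          # both stages run at the same time

for page in range(1, 11):
    page_queue.put(page)
for _ in crawlers:
    page_queue.put(SENTINEL)           # one sentinel per crawl thread
for t in crawlers:
    t.join()

for _ in parsers:
    data_queue.put(SENTINEL)           # sent only after every page has been queued
for t in parsers:
    t.join()

print(len(results))
```

With this layout the parse threads can start before crawling has finished, because a blocking get() simply waits for data instead of seeing an empty queue and exiting.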

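IV. Handling dynamic HTML

Getting page data that is loaded by JavaScript, jQuery, Ajax and so on

Selenium

Selenium is a web automation testing tool, originally developed for the automated testing of websites. It is similar to the key-macro tools people use in games: it carries out operations automatically according to the commands it is given. The difference is that Selenium drives a real browser, and it supports all mainstream browsers (including headless ones such as PhantomJS).

Following our instructions, Selenium can make the browser load a page automatically, fetch the data we need, take screenshots of the page, or check whether certain actions happened on the site.

Selenium does not ship with a browser and has no browser functionality of its own; it has to be combined with a third-party browser. Since we sometimes need it to run embedded in code, we can use a tool called PhantomJS in place of a real browser.

The Selenium library can be downloaded from PyPI at https://pypi.python.org/simple/selenium , or installed with the package manager pip: pip install selenium

Selenium official reference documentation: http://selenium-python.readthedocs.io/index.html

PhantomJS

PhantomJS is a WebKit-based headless browser. It loads a site into memory and runs the JavaScript on the page; because it renders no graphical interface, it runs more efficiently than a full browser.

Combining Selenium with PhantomJS gives us a very powerful crawler that can handle JavaScript, cookies, headers and anything else a real user would need to do.

Note: PhantomJS can only be downloaded from its official site (http://phantomjs.org/download.html). Because PhantomJS is a fully functional (though headless) browser and not a Python library, it is not installed the way other Python libraries are, but we can call it directly through Selenium.

PhantomJS official reference documentation: http://phantomjs.org/documentation

Quick start

The Selenium library contains an API called WebDriver. WebDriver is a bit like a browser that can load websites, but it can also be used like BeautifulSoup or other Selector objects to find page elements, interact with them (send text, click and so on), and perform other actions that drive the crawler.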

# Import webdriver
from selenium import webdriver

# The Keys class is needed to send keyboard keys
from selenium.webdriver.common.keys import Keys

# Create the browser object from the PhantomJS found on the PATH
driver = webdriver.PhantomJS()

# If PhantomJS is not on the PATH, give its location explicitly
# driver = webdriver.PhantomJS(executable_path="./phantomjs")

# get() waits until the page has fully loaded before the program continues; in tests people often add time.sleep(2) here
driver.get("http://www.baidu.com/")

# Get the text of the element whose id is wrapper
data = driver.find_element_by_id("wrapper").text

# Print the data
print(data)

# Print the page title "百度一下，你就知道"
print(driver.title)

# Take a snapshot of the current page and save it
driver.save_screenshot("baidu.png")

# id="kw" is Baidu's search box; type the string "長城" (Great Wall)
driver.find_element_by_id("kw").send_keys(u"長城")

# id="su" is Baidu's search button; click() simulates a click
driver.find_element_by_id("su").click()

# Take a new page snapshot
driver.save_screenshot("長城.png")

# Print the page source after rendering
print(driver.page_source)

# Get the cookies of the current page
print(driver.get_cookies())

# ctrl+a: select everything in the input box
driver.find_element_by_id("kw").send_keys(Keys.CONTROL,'a')

# ctrl+x: cut the content of the input box
driver.find_element_by_id("kw").send_keys(Keys.CONTROL,'x')

# Type new content into the input box
driver.find_element_by_id("kw").send_keys("itcast")

# Simulate the Enter key
driver.find_element_by_id("su").send_keys(Keys.RETURN)

# Clear the input box
driver.find_element_by_id("kw").clear()

# Take a new page snapshot
driver.save_screenshot("itcast.png")

# Get the current URL
print(driver.current_url)

# Close the current page; if it is the only page, the browser closes too
# driver.close()

# Quit the browser
driver.quit()

Page operations

Selenium's WebDriver provides various methods for finding elements. Suppose the page contains this form input:

<input type="text" name="user-name" id="passwd-id" />

Then:

# Find by id
element = driver.find_element_by_id("passwd-id")
# Find by name
element = driver.find_element_by_name("user-name")
# Find by tag name (find_elements_* returns a list of all matches)
elements = driver.find_elements_by_tag_name("input")
# XPath can be used as well
element = driver.find_element_by_xpath("//input[@id='passwd-id']")

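Locating UI elements (WebElements)

The following APIs select elements. Each find_element_by_* method returns the first matching element; the corresponding find_elements_by_* method returns a list of all matches. (These helpers belong to the Selenium 2/3 API; Selenium 4 keeps only the find_element(By.X, value) form, which appears as the alternative in each example that follows.)

find_element_by_id
find_element_by_name
find_element_by_xpath
find_element_by_link_text
find_element_by_partial_link_text
find_element_by_tag_name
find_element_by_class_name
find_element_by_css_selector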

By ID

<div id="coolestWidgetEvah">...</div>

Implementation

element = driver.find_element_by_id("coolestWidgetEvah")
# ------------------------ or -------------------------
from selenium.webdriver.common.by import By
element = driver.find_element(by=By.ID, value="coolestWidgetEvah")

By Class Name

<div class="cheese"><span>Cheddar</span></div><div class="cheese"><span>Gouda</span></div>

Implementation

cheeses = driver.find_elements_by_class_name("cheese")
# ------------------------ or -------------------------
from selenium.webdriver.common.by import By
cheeses = driver.find_elements(By.CLASS_NAME, "cheese")

By Tag Name

<iframe src="..."></iframe>

Implementation

frame = driver.find_element_by_tag_name("iframe")
# ------------------------ or -------------------------
from selenium.webdriver.common.by import By
frame = driver.find_element(By.TAG_NAME, "iframe")

By Name

<input name="cheese" type="text"/>

Implementation

cheese = driver.find_element_by_name("cheese")
# ------------------------ or -------------------------
from selenium.webdriver.common.by import By
cheese = driver.find_element(By.NAME, "cheese")

By Link Text

<a href="http://www.google.com/search?q=cheese">cheese</a>

Implementation

cheese = driver.find_element_by_link_text("cheese")
# ------------------------ or -------------------------
from selenium.webdriver.common.by import By
cheese = driver.find_element(By.LINK_TEXT, "cheese")

By Partial Link Text

<a href="http://www.google.com/search?q=cheese">search for cheese</a>

Implementation

cheese = driver.find_element_by_partial_link_text("cheese")
# ------------------------ or -------------------------
from selenium.webdriver.common.by import By
cheese = driver.find_element(By.PARTIAL_LINK_TEXT, "cheese")

By CSS

<div id="food"><span class="dairy">milk</span><span class="dairy aged">cheese</span></div>

Implementation

cheese = driver.find_element_by_css_selector("#food span.dairy.aged")
# ------------------------ or -------------------------
from selenium.webdriver.common.by import By
cheese = driver.find_element(By.CSS_SELECTOR, "#food span.dairy.aged")

By XPath

<input type="text" name="example" /> <INPUT type="text" name="other" />

Implementation

inputs = driver.find_elements_by_xpath("//input")
# ------------------------ or -------------------------
from selenium.webdriver.common.by import By
inputs = driver.find_elements(By.XPATH, "//input")

Mouse action chains

Sometimes we need to simulate mouse operations on a page, such as double-clicking, right-clicking, dragging, or even pressing and holding. We can do this by importing the ActionChains class:

Example:

# Import the ActionChains class
from selenium.webdriver import ActionChains

# Move the mouse to ac
ac = driver.find_element_by_xpath('element')
ActionChains(driver).move_to_element(ac).perform()

# Click at ac
ac = driver.find_element_by_xpath("elementA")
ActionChains(driver).move_to_element(ac).click(ac).perform()

# Double-click at ac
ac = driver.find_element_by_xpath("elementB")
ActionChains(driver).move_to_element(ac).double_click(ac).perform()

# Right-click at ac
ac = driver.find_element_by_xpath("elementC")
ActionChains(driver).move_to_element(ac).context_click(ac).perform()

# Press and hold the left button at ac
ac = driver.find_element_by_xpath('elementF')
ActionChains(driver).move_to_element(ac).click_and_hold(ac).perform()

# Drag ac1 onto ac2
ac1 = driver.find_element_by_xpath('elementD')
ac2 = driver.find_element_by_xpath('elementE')
ActionChains(driver).drag_and_drop(ac1, ac2).perform()

Filling in forms

We already know how to type text into a text box, but sometimes we meet a drop-down built with the <select> </select> tag. Clicking an option in the drop-down directly does not always work.

<select id="status" class="form-control valid" onchange="" name="status">
    <option value=""></option>
    <option value="0">未審核</option>
    <option value="1">初審通過</option>
    <option value="2">復審通過</option>
    <option value="3">審核不通過</option>
</select>

Selenium provides a dedicated Select class for handling drop-downs:

# Import the Select class
from selenium.webdriver.support.ui import Select

# Find the drop-down by its name
select = Select(driver.find_element_by_name('status'))

select.select_by_index(1)
select.select_by_value("0")
select.select_by_visible_text(u"未審核")

These are the three ways of choosing an option: by index, by value, and by visible text. Note:

index starts at 0

value is the value attribute of the option tag, not the text shown in the drop-down

visible_text is the text inside the option tag, which is what the drop-down displays

How do you deselect everything? Simple:

select.deselect_all()

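Handling pop-ups

When an event you triggered makes the page show an alert, you can handle the alert or read its message like this:

alert = driver.switch_to.alert

Switching between windows

A browser session usually has several windows, so we need a way to switch between them. To switch windows:

driver.switch_to.window("this is window name")

You can also use the window_handles attribute to get a handle for each window. For example:

for handle in driver.window_handles:
    driver.switch_to.window(handle)

Going forward and back

To use the page's forward and back functions:

driver.forward()     # forward
driver.back()        # back

Cookies

To read every cookie of the page:

for cookie in driver.get_cookies():
    print("%s -> %s" % (cookie['name'], cookie['value']))

To delete cookies:

# By name
driver.delete_cookie("CookieName")

# all
driver.delete_all_cookies()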

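Waiting for the page

Note: this part is very important!

More and more pages use Ajax, so the program cannot know when an element has finished loading. If the page takes too long and a DOM element is not there yet when your code tries to use it, the lookup raises NoSuchElementException.

Elements that load late are hard to locate and make exceptions such as ElementNotVisibleException more likely, so Selenium provides two kinds of waits: implicit waits and explicit waits.

An implicit wait sets a maximum time that every element lookup may keep retrying. An explicit wait blocks until a specific condition holds and then lets execution continue.

Explicit waits

An explicit wait names a condition and sets a maximum waiting time. If the element has not been found within that time, an exception (TimeoutException) is raised.

from selenium import webdriver
from selenium.webdriver.common.by import By
# WebDriverWait does the polling
from selenium.webdriver.support.ui import WebDriverWait
# expected_conditions provides the conditions to wait for
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://www.xxxxx.com/loading")
try:
    # Keep polling the page until id="myDynamicElement" appears
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
    )
finally:
    driver.quit()

If you do not pass the poll_frequency argument, WebDriverWait checks the condition every 0.5 s by default. If the element is already present, it returns immediately.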
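The polling behaviour just described can be illustrated without a browser. The sketch below is a hand-written loop in the spirit of WebDriverWait.until, not Selenium's implementation; `wait_until`, `WaitTimeout` and `element_ready` are hypothetical names:

```python
import time

class WaitTimeout(Exception):
    """Raised when the condition never became truthy (hypothetical name)."""

def wait_until(condition, timeout=10.0, poll_frequency=0.5):
    """Call condition() repeatedly until it returns a truthy value."""
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:                      # condition met: return right away
            return value
        if time.monotonic() >= deadline:
            raise WaitTimeout("condition not met within %.1fs" % timeout)
        time.sleep(poll_frequency)     # wait before the next check

# Stand-in for "the element shows up on the third poll".
calls = {"n": 0}

def element_ready():
    calls["n"] += 1
    return "element" if calls["n"] >= 3 else None

found = wait_until(element_ready, timeout=2.0, poll_frequency=0.01)
print(found, calls["n"])
```

The first check happens before any sleeping, which is why a condition that is already true returns immediately.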

Below are some built-in wait conditions. You can call them directly instead of writing your own.

title_is
title_contains
presence_of_element_located
visibility_of_element_located
visibility_of
presence_of_all_elements_located
text_to_be_present_in_element
text_to_be_present_in_element_value
frame_to_be_available_and_switch_to_it
invisibility_of_element_located
element_to_be_clickable – it is Displayed and Enabled.
staleness_of
element_to_be_selected
element_located_to_be_selected
element_selection_state_to_be
element_located_selection_state_to_be
alert_is_present

Implicit waits

An implicit wait is simpler: it just sets a waiting time, in seconds.

from selenium import webdriver

driver = webdriver.Chrome()
driver.implicitly_wait(10)    # seconds
driver.get("http://www.xxxxx.com/loading")
myDynamicElement = driver.find_element_by_id("myDynamicElement")

If it is not set, the default waiting time is 0.

Example 1: simulating a Douban login with Selenium + PhantomJS

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import time
from selenium import webdriver

# Create a browser object
driver = webdriver.PhantomJS("F:/Various plug-ins/phantomjs-2.1.1-windows/bin/phantomjs.exe")
driver.get("http://www.douban.com")

# Type the account and password
driver.find_element_by_name("form_email").send_keys("xx@qq.com")    # find the element with this name and type the value
driver.find_element_by_name("form_password").send_keys("xxx")

# Simulate clicking the login button
driver.find_element_by_xpath("//input[@class='bn-submit']").click()    # find the login button by XPath and click it

# Wait 3 seconds
time.sleep(3)

# Take a snapshot after logging in
driver.save_screenshot("douban.png")

with open("douban.html", "wb") as file:
    file.write(driver.page_source.encode("utf-8"))      # driver.page_source is the HTML of the current page

driver.quit()   # quit the browser

Example 2: simulating clicks on a dynamic page (Douyu)

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import unittest,time
from selenium import webdriver
from bs4 import BeautifulSoup as bs

class douyu(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.PhantomJS("F:/Various plug-ins/phantomjs-2.1.1-windows/bin/phantomjs.exe")
        self.num = 0

    def testDouyu(self):
        self.driver.get("https://www.douyu.com/directory/all")

        while True:
            soup = bs(self.driver.page_source,'lxml')
            names = soup.find_all('h3',attrs={"class" : "ellipsis"})
            numbers = soup.find_all("span", attrs={"class" :"dy-num fr"})
            for name, number in zip(names, numbers):
                print("Viewers: -" + number.get_text().strip() + "-\tRoom name: " + name.get_text().strip())
                self.num += 1
            # The "next page" button is disabled on the last page
            if self.driver.page_source.find("shark-pager-disable-next") != -1:
                break
            time.sleep(0.5)     # sleep briefly so the page can finish loading, otherwise an error is raised
            self.driver.find_element_by_class_name("shark-pager-next").click()

    # Runs when the test finishes (unittest teardown)
    def tearDown(self):
        print("Number of live rooms on the site: " + str(self.num))
        # print("Number of viewers on the site: " + str(self.count))
        # Quit the PhantomJS browser
        self.driver.quit()

if __name__ == "__main__":
    # Start the test module; it has to be written this way
    unittest.main()

Example 3: running a JavaScript statement to scroll to the bottom of the page

#!/usr/bin/env python
# -*- coding:utf-8 -*-

from selenium import webdriver
import time

driver = webdriver.PhantomJS()
driver.get("https://movie.douban.com/typerank?type_name=劇情&type=11&interval_id=100:90&action=")

# Scroll down by 10000 pixels
js = "document.body.scrollTop=10000"
#js="var q=document.documentElement.scrollTop=10000"
time.sleep(3)

# Take a page snapshot before scrolling
driver.save_screenshot("douban.png")

# Run the JS statement
driver.execute_script(js)
time.sleep(10)

# Take a page snapshot after scrolling
driver.save_screenshot("newdouban.png")

driver.quit()

Example 4: simulating a login to the kmust academic administration system

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys     # needed for keyboard input

driver = webdriver.PhantomJS("F:/Various plug-ins/phantomjs-2.1.1-windows/bin/phantomjs.exe")   # create the browser object

driver.get("http://kmustjwcxk1.kmust.edu.cn/jwweb/")    # send a GET request to the KMUST campus site

driver.find_element_by_id('m14').click()    # click the user login button; the campus site loads its content in iframes

driver.switch_to.frame("frmHomeShow")   # so we use driver.switch_to.frame("iframe name") to enter the iframe

with open("kmust.html", "wb") as file:  # save the HTML inside the iframe
    file.write(driver.page_source.encode("utf-8"))

driver.find_element_by_id('txt_asmcdefsddsd').send_keys('xxx')     # find the student-number field and fill it in
driver.find_element_by_id('txt_pewerwedsdfsdff').send_keys('xxx')     # find the password field and fill it in
driver.find_element_by_id('txt_sdertfgsadscxcadsads').click()       # the captcha only appears after its input box has been clicked
driver.save_screenshot("kmust.png")     # save a screenshot of the current page
captcha = input('Enter the captcha: ')      # open the screenshot and type the captcha by hand
driver.find_element_by_id('txt_sdertfgsadscxcadsads').send_keys(captcha)    # find the captcha field and fill it in
driver.find_element_by_id("txt_sdertfgsadscxcadsads").send_keys(Keys.RETURN)    # simulate the Enter key
time.sleep(1)       # the network is slow, so give it time to load
driver.save_screenshot("kmust_ok.png")      # save a screenshot of the successful login
driver.switch_to.frame("banner")        # the site is made up of the four iframes listed here
# driver.switch_to.frame("menu")
# driver.switch_to.frame("frmMain")
# driver.switch_to.frame("frmFoot")
with open("kmust-ok.html", "ab") as file:   # so we enter each iframe and do the corresponding work (enough said: a course-grabbing script...)
    file.write(driver.page_source.encode("utf-8"))      # save the HTML of the current iframe
