The Python crawler is currently built on the requests package. Its documentation is handy for looking things up:
http://docs.python-requests.org/en/master/
POST request body formats
To scrape product reviews from a travel site, analysis showed that fetching the JSON file requires a POST request. In short:
- GET appends the data to be sent directly to the URL
- POST sends the data as a separate body to the server
A POST body can take roughly three forms: form, json, and multipart. The first two are covered here.
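The GET/POST difference is easy to see with the standard library's urlencode; the URL and parameter names below are made up for illustration:

```python
from urllib.parse import urlencode

params = {'key1': 'value1', 'key2': 'value2'}
base_url = 'https://example.com/search'  # hypothetical URL

# GET: the encoded parameters become part of the URL itself
get_url = base_url + '?' + urlencode(params)

# POST: the URL stays as-is and the same encoded string
# travels in the request body instead
post_body = urlencode(params)

print(get_url)    # https://example.com/search?key1=value1&key2=value2
print(post_body)  # key1=value1&key2=value2
```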
1.content in form
Content-Type: application/x-www-form-urlencoded
Put the content in a dict and pass it to the data parameter.
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post(url, data=payload)
2. content in json
Content-Type: application/json
Convert the dict to JSON and pass it to the data parameter.
payload = {'some': 'data'}
r = requests.post(url, data=json.dumps(payload))
Or pass the dict to the json parameter directly.
payload = {'some': 'data'}
r = requests.post(url, json=payload)
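Both variants send the same serialized body; the json= form additionally sets the Content-Type: application/json header for you. A minimal sketch of the serialization step:

```python
import json

payload = {'some': 'data'}

# this is the body requests sends when you write data=json.dumps(payload);
# json=payload produces the same body and also sets
# Content-Type: application/json automatically
body = json.dumps(payload)

print(body)  # {"some": "data"}
```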
HTTP Header Overview
A new request needs a type (e.g. POST), a URL, request headers, and a request body. Now let's look at the request headers of a POST request.
Reference: https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
Accept
It can be used to specify certain media types which are acceptable for the response.
The asterisk "*" matches all types: "*/*" indicates all media types, and "type/*" indicates all subtypes of that type.
A media range may be followed by ";q=<qvalue>", a relative quality factor between 0 and 1; the default q is 1.
Accept: audio/*; q=0.2, audio/basic
If more than one media range applies to a given type, the most specific reference has precedence.
Accept: text/*, text/html, text/html;level=1, */*
In this example, "text/html;level=1" has the highest precedence.
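The q-value ordering can be sketched with a tiny parser. This is a simplified illustration: it only handles the q parameter and ignores the RFC's specificity tie-breaking rules (which is what puts "text/html;level=1" first in the example above):

```python
def parse_accept(header_value):
    """Parse an Accept header into (media_range, q) pairs, highest q first."""
    ranges = []
    for part in header_value.split(','):
        pieces = [p.strip() for p in part.split(';')]
        media = pieces[0]
        q = 1.0  # per the spec, q defaults to 1
        for param in pieces[1:]:
            if param.startswith('q='):
                q = float(param[2:])
        ranges.append((media, q))
    # stable sort: equal-q entries keep their original order
    return sorted(ranges, key=lambda r: r[1], reverse=True)

print(parse_accept('audio/*; q=0.2, audio/basic'))
# [('audio/basic', 1.0), ('audio/*', 0.2)]
```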
Content-Length
The size of the entity-body, in bytes (per RFC 2616, a decimal number of octets).
For example, suppose the form data is:
type: all
currentPage: 3
productId:
Then the request body you send is:
type=all&currentPage=3&productId=
So the Content-Length is 33.
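You can check this count with the standard library's urlencode:

```python
from urllib.parse import urlencode

form = {'type': 'all', 'currentPage': '3', 'productId': ''}
body = urlencode(form)

print(body)       # type=all&currentPage=3&productId=
print(len(body))  # 33 -- the Content-Length in bytes
```

In practice requests computes Content-Length from the body for you, so there is rarely a need to set it by hand.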
User-Agent
Search the Internet for different User-Agents.
Here is some simple code for reference.
import requests

def getCommentStr():
    url = r"https://package.com/user/comment/product/queryComments.json"
    header = {
        'User-Agent': r'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0',
        'Accept': r'application/json, text/javascript, */*; q=0.01',
        'Accept-Language': r'en-US,en;q=0.5',
        'Accept-Encoding': r'gzip, deflate, br',
        'Content-Type': r'application/x-www-form-urlencoded; charset=UTF-8',
        'X-Requested-With': r'XMLHttpRequest',
        # requests computes Content-Length from the body, so this is optional
        'Content-Length': '65',
        'DNT': '1',
        'Connection': r'keep-alive',
        'TE': r'Trailers'
    }
    params = {
        'pageNo': '2',
        'pageSize': '10',
        'productId': '2590732030',
        'rateStatus': 'ALL',
        'type': 'all'
    }
    # form-encoded POST: the dict passed to `data` becomes the request body
    r = requests.post(url, headers=header, data=params)
    print(r.text)

getCommentStr()
Tips
- For cookies, you can use the browser's editing tools to delete the cookies sent with each request one at a time and work out which ones are unnecessary.
- While testing code, I prefer to save the scraped data as a str; it also lightens the load on the server.
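A minimal sketch of the second tip, assuming you cache r.text to a local file once and parse the saved copy during test runs (save_page/load_page are hypothetical helper names):

```python
import os
import tempfile

def save_page(text, path):
    # cache the raw response text (r.text) locally so later test runs
    # parse the saved copy instead of re-hitting the server
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)

def load_page(path):
    with open(path, encoding='utf-8') as f:
        return f.read()

# demo round-trip with a throwaway file
path = os.path.join(tempfile.gettempdir(), 'demo_page.html')
save_page('<html>demo</html>', path)
page = load_page(path)
print(page)  # <html>demo</html>
```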
Processing the scraped data
This part mainly covers the BeautifulSoup library and regular expressions.
1. BeautifulSoup
Official bs4 documentation:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
First install it in the terminal with pip install bs4. You can also install the lxml parser with pip install lxml, or the html5lib parser.
soup = bs4.BeautifulSoup(t, 'lxml')
tagList = soup.find_all('div', attrs={'class': 'content'})
tagList = soup.find_all('div', attrs={'class': re.compile("(content)|()")})
Here t is the text to parse and lxml is the parser.
tagList receives the div tags with class="content"; a compiled regular expression object can also be used as the attribute value.
2. Regular expressions
Import re before using regular expressions; see the notes for the basic syntax.
Extracting matches
Match against the target text t:
useful = re.findall(r'有用<em>\d+</em>', t)
Or build a regular expression object and use it:
usefulRE = re.compile(r'有用<em>\d+</em>')
useful = usefulRE.findall(t)
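A worked example of both styles on a made-up snippet of review HTML (the text t below is invented for illustration). A capture group extracts just the numbers, without a second replace step:

```python
import re

# hypothetical snippet of scraped review HTML
t = '<span>有用<em>12</em></span><span>有用<em>7</em></span>'

# whole-pattern matches, as in the notes above
useful = re.findall(r'有用<em>\d+</em>', t)
print(useful)  # ['有用<em>12</em>', '有用<em>7</em>']

# a capture group returns only the digits
counts = re.findall(r'有用<em>(\d+)</em>', t)
print(counts)  # ['12', '7']
```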
Replacing matches
Replace text with the replace() function:
newUseful.append(useful[i].replace('有用<em>','').replace('</em>',''))
Or replace with a regular expression. Note that inside a character class + is a literal character, so the original [^\d+] would also keep out plus signs; \D ("any non-digit") expresses the intent directly:
newScoreA.append(re.sub(r'\D', '', scoreA[i]))
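A quick worked example of the digit-stripping substitution, on invented input strings:

```python
import re

# hypothetical scraped score fragments
scoreA = ['得分<em>4.5</em>', '得分<em>38</em>']

# drop every non-digit character (tags, labels, decimal points)
newScoreA = [re.sub(r'\D', '', s) for s in scoreA]
print(newScoreA)  # ['45', '38']
```

Note that \D also removes the decimal point, so '4.5' collapses to '45'; keep the dot with a class like [^\d.] if fractional scores matter.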