The Python crawler is currently built on the requests package. Its documentation is handy for looking things up:
http://docs.python-requests.org/en/master/
POST request body formats
To scrape product reviews from a travel site, analysis showed that fetching the JSON file requires a POST request. In short:
- GET appends the data to be sent directly to the URL
- POST sends the data as a separate body to the server
A POST body can take roughly three forms: form, json, and multipart. The first two are covered here.
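The GET/POST difference is easy to see with the standard library's urlencode; the URL and parameter names below are made up for illustration:

```python
from urllib.parse import urlencode

params = {'key1': 'value1', 'key2': 'value2'}
base_url = 'https://example.com/search'  # hypothetical URL

# GET: the encoded parameters become part of the URL itself
get_url = base_url + '?' + urlencode(params)

# POST: the URL stays as-is and the same encoded string
# travels in the request body instead
post_body = urlencode(params)

print(get_url)    # https://example.com/search?key1=value1&key2=value2
print(post_body)  # key1=value1&key2=value2
```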
1.content in form
Content-Type: application/x-www-form-urlencoded
Put the content in a dict and pass it to the data parameter.
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post(url, data=payload)
2. content in json
Content-Type: application/json
Convert the dict to JSON and pass it to the data parameter.
payload = {'some': 'data'}
r = requests.post(url, data=json.dumps(payload))
Or pass the dict to the json parameter directly.
payload = {'some': 'data'}
r = requests.post(url, json=payload)
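Both variants send the same serialized body; the json= form additionally sets the Content-Type: application/json header for you. A minimal sketch of the serialization step:

```python
import json

payload = {'some': 'data'}

# this is the body requests sends when you write data=json.dumps(payload);
# json=payload produces the same body and also sets
# Content-Type: application/json automatically
body = json.dumps(payload)

print(body)  # {"some": "data"}
```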
HTTP Header Overview
A new request needs a type (e.g. POST), a URL, request headers, and a request body. Now let's look at the request headers of a POST request.
Reference: https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
Accept
It can be used to specify certain media types which are acceptable for the response.
The asterisk "*" matches all types: "*/*" indicates all media types, and "type/*" indicates all subtypes of that type.
A media range may be followed by ";q=<qvalue>", a relative quality factor between 0 and 1; the default q is 1.
Accept: audio/*; q=0.2, audio/basic
If more than one media range applies to a given type, the most specific reference has precedence.
Accept: text/*, text/html, text/html;level=1, */*
In this example, "text/html;level=1" has the highest precedence.
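The q-value ordering can be sketched with a tiny parser. This is a simplified illustration: it only handles the q parameter and ignores the RFC's specificity tie-breaking rules (which is what puts "text/html;level=1" first in the example above):

```python
def parse_accept(header_value):
    """Parse an Accept header into (media_range, q) pairs, highest q first."""
    ranges = []
    for part in header_value.split(','):
        pieces = [p.strip() for p in part.split(';')]
        media = pieces[0]
        q = 1.0  # per the spec, q defaults to 1
        for param in pieces[1:]:
            if param.startswith('q='):
                q = float(param[2:])
        ranges.append((media, q))
    # stable sort: equal-q entries keep their original order
    return sorted(ranges, key=lambda r: r[1], reverse=True)

print(parse_accept('audio/*; q=0.2, audio/basic'))
# [('audio/basic', 1.0), ('audio/*', 0.2)]
```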
Content-Length
The size of the entity-body, in bytes (per RFC 2616, a decimal number of octets).
For example, suppose the form data is:
type: all
currentPage: 3
productId:
Then the request body you send is:
type=all&currentPage=3&productId=
So the Content-Length is 33.
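You can check this count with the standard library's urlencode:

```python
from urllib.parse import urlencode

form = {'type': 'all', 'currentPage': '3', 'productId': ''}
body = urlencode(form)

print(body)       # type=all&currentPage=3&productId=
print(len(body))  # 33 -- the Content-Length in bytes
```

In practice requests computes Content-Length from the body for you, so there is rarely a need to set it by hand.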
User-Agent
Search the Internet for different User-Agents.
Here is some simple code for reference.
import requests

def getCommentStr():
    url = r"https://package.com/user/comment/product/queryComments.json"
    header = {
        'User-Agent': r'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0',
        'Accept': r'application/json, text/javascript, */*; q=0.01',
        'Accept-Language': r'en-US,en;q=0.5',
        'Accept-Encoding': r'gzip, deflate, br',
        'Content-Type': r'application/x-www-form-urlencoded; charset=UTF-8',
        'X-Requested-With': r'XMLHttpRequest',
        # requests computes Content-Length from the body, so this is optional
        'Content-Length': '65',
        'DNT': '1',
        'Connection': r'keep-alive',
        'TE': r'Trailers'
    }
    params = {
        'pageNo': '2',
        'pageSize': '10',
        'productId': '2590732030',
        'rateStatus': 'ALL',
        'type': 'all'
    }
    # form-encoded POST: the dict passed to `data` becomes the request body
    r = requests.post(url, headers=header, data=params)
    print(r.text)

getCommentStr()
Tips
- For cookies, you can use the browser's editing tools to delete the cookies sent with each request one at a time and work out which ones are unnecessary.
- While testing code, I prefer to save the scraped data as a str; it also lightens the load on the server.
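A minimal sketch of the second tip, assuming you cache r.text to a local file once and parse the saved copy during test runs (save_page/load_page are hypothetical helper names):

```python
import os
import tempfile

def save_page(text, path):
    # cache the raw response text (r.text) locally so later test runs
    # parse the saved copy instead of re-hitting the server
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)

def load_page(path):
    with open(path, encoding='utf-8') as f:
        return f.read()

# demo round-trip with a throwaway file
path = os.path.join(tempfile.gettempdir(), 'demo_page.html')
save_page('<html>demo</html>', path)
page = load_page(path)
print(page)  # <html>demo</html>
```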
Processing the scraped data
This part mainly covers the BeautifulSoup library and regular expressions.
1. BeautifulSoup
Official bs4 documentation:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
First install it in the terminal with pip install bs4. You can also install the lxml parser with pip install lxml, or the html5lib parser.
soup = bs4.BeautifulSoup(t, 'lxml')
tagList = soup.find_all('div', attrs={'class': 'content'})
tagList = soup.find_all('div', attrs={'class': re.compile("(content)|()")})
Here t is the text to parse and lxml is the parser.
tagList receives the div tags with class="content"; a compiled regular expression object can also be used as the attribute value.
2. Regular expressions
Import re before using regular expressions; see the notes for the basic syntax.
Extracting matches
Match against the target text t:
useful = re.findall(r'有用<em>\d+</em>', t)
Or build a regular expression object and use it:
usefulRE = re.compile(r'有用<em>\d+</em>')
useful = usefulRE.findall(t)
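A worked example of both styles on a made-up snippet of review HTML (the text t below is invented for illustration). A capture group extracts just the numbers, without a second replace step:

```python
import re

# hypothetical snippet of scraped review HTML
t = '<span>有用<em>12</em></span><span>有用<em>7</em></span>'

# whole-pattern matches, as in the notes above
useful = re.findall(r'有用<em>\d+</em>', t)
print(useful)  # ['有用<em>12</em>', '有用<em>7</em>']

# a capture group returns only the digits
counts = re.findall(r'有用<em>(\d+)</em>', t)
print(counts)  # ['12', '7']
```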
Replacing matches
Replace text with the replace() function:
newUseful.append(useful[i].replace('有用<em>','').replace('</em>',''))
Or replace with a regular expression. Note that inside a character class + is a literal character, so the original [^\d+] would also keep out plus signs; \D ("any non-digit") expresses the intent directly:
newScoreA.append(re.sub(r'\D', '', scoreA[i]))
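A quick worked example of the digit-stripping substitution, on invented input strings:

```python
import re

# hypothetical scraped score fragments
scoreA = ['得分<em>4.5</em>', '得分<em>38</em>']

# drop every non-digit character (tags, labels, decimal points)
newScoreA = [re.sub(r'\D', '', s) for s in scoreA]
print(newScoreA)  # ['45', '38']
```

Note that \D also removes the decimal point, so '4.5' collapses to '45'; keep the dot with a class like [^\d.] if fractional scores matter.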