Python Web Scraping in Practice (1): Scraping the Maoyan TOP100 with requests and Regular Expressions

Reposted from the original article. 2017-06-30 17:09. Tags: Python, web scraping

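I. Approach: Python 2 ships with two networking libraries, urllib and urllib2, but neither is very convenient to use, so this article uses the widely praised third-party library requests. The basic approach is to fetch the page with requests and parse it with regular expressions; to scrape the data faster, multiprocessing is used to fetch pages in multiple processes. The next article will parse the page with BeautifulSoup instead. This article mainly records a few problems encountered while writing the code; readers should familiarise themselves with the individual modules beforehand.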

Environment: I am using Anaconda2 (Python 2.7).

II. Requests:

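1. The official requests documentation: http://docs.python-requests.org/en/master/
Use a requests GET request to fetch the HTML of the Maoyan Top100 page. The status code tells us whether the request succeeded; if it did not, return None. Use RequestException from requests.exceptions to catch request errors: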

def get_one_page(url):
    try:
        response=requests.get(url)
        if response.status_code==200:
            response=response.text
            return response
        else:
            return None
    except RequestException:
        return None

2. Parse the page with regular expressions. This mainly uses two functions, compile() and findall().

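Use compile(pattern, flags) to precompile the pattern, with one capture group for each piece of information we want to extract (index, title, actor and so on). findall() returns a list of tuples: when the regular expression has several groups, each element of a tuple is the text matched by one group. The re.S flag matters because the HTML contains newline characters (\n). Without re.S, "." does not match a newline, so a match cannot span more than one line. With re.S, "." matches newlines as well, so the pattern is matched against the string as a whole.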
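The effect of re.S and the shape of the findall() result can be seen on a tiny input. This is only a sketch: the HTML fragments below are made up for illustration and are not taken from the Maoyan page.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical snippet in which the text starts on the line after the tag.
html = '<p class="name">\nFarewell My Concubine</p><p class="name">\nLeon</p>'

# Without re.S, "." stops at "\n", so no match can cross the line break.
without_flag = re.findall(r'name">(.*?)</p>', html)

# With re.S, "." also matches "\n", so the string is searched as one unit.
with_flag = re.findall(r'name">(.*?)</p>', html, re.S)

# With two groups, findall() returns a list of tuples, one element per group.
pairs = re.findall(r'<i>(\d+)</i><b>(.*?)</b>',
                   '<i>1</i><b>A</b><i>2</i><b>B</b>')

print(without_flag)
print(with_flag)
print(pairs)
```

The captured text keeps the leading newline when re.S is used, which is why the parsing code calls strip() on some fields.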

def parse_one_page(html):
    # one capture group per field: index, image, title, actor, time, score integer, score fraction
    pattern=re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name">'
                      +r'<a.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>.*?integer">'
                      +r'(.*?)</i>.*?fraction">(\d+)</i>.*?</dd>',re.S)
    items=re.findall(pattern,html)  # items is a list of tuples
    for item in items:              # this function is a generator: it yields one dict per film
        yield {
            "index":item[0],
            "title":item[2],
            "actor":item[3].strip()[3:],
            "time":item[4].strip()[5:],
            "score":item[5]+item[6],
            "image":item[1]
        }
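The slices [3:] and [5:] and the score concatenation are easier to follow on a concrete tuple. The sketch below assumes the captured text begins with the labels "主演：" (3 characters) and "上映时间：" (5 characters); the tuple values, the example.com URL and the helper name to_record are hypothetical and only stand in for what findall() might return.

```python
# -*- coding: utf-8 -*-
# Hypothetical tuple in the same order as the capture groups:
# index, image, title, actor, time, score integer part, score fraction part.
item = (u'1', u'http://example.com/poster.jpg', u'霸王别姬',
        u'\n                主演：张国荣,张丰毅,巩俐\n        ',
        u'上映时间：1993-01-01', u'9.', u'6')


def to_record(item):
    """Build the same dict that the generator yields for one film."""
    return {
        "index": item[0],
        "title": item[2],
        "actor": item[3].strip()[3:],  # strip whitespace, drop the 3-character label
        "time": item[4].strip()[5:],   # drop the 5-character label
        "score": item[5] + item[6],    # join integer and fraction parts
        "image": item[1],
    }


record = to_record(item)
print(record["actor"])
print(record["time"])
print(record["score"])
```

Note that the slices count characters only when the text is unicode; on Python 2 byte strings they would count bytes instead.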

3. Save the scraped results.

This uses json.dumps(). Note that json.dumps converts a dict to a str, while json.loads converts a str back to a dict.
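A short round trip shows both directions, and why ensure_ascii=False is passed when the data contains Chinese text. The record below is a made-up example.

```python
# -*- coding: utf-8 -*-
import json

# Hypothetical record, similar in shape to one scraped film.
record = {u"title": u"霸王别姬", u"score": u"9.6"}

text = json.dumps(record, ensure_ascii=False)  # dict -> str, Chinese kept readable
escaped = json.dumps(record)                   # default: non-ASCII becomes \uXXXX escapes
restored = json.loads(text)                    # str -> dict

print(text)
print(escaped)
print(restored == record)
```

Both strings decode to the same dict; ensure_ascii=False only changes how readable the saved file is.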

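I write the file inside a with statement. In Python 3 you can pass the encoding directly to open (encoding='utf-8'), but the built-in open in Python 2.7 does not accept an encoding argument and raises an error if one is passed, so the codecs module is needed there: use codecs.open.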

def write_to_file(content):
    # codecs.open accepts an encoding in Python 2.7; the built-in open does not
    with codecs.open('result.txt','a','utf-8') as f:
        f.write(json.dumps(content,ensure_ascii=False)+'\n')
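As an alternative sketch, io.open accepts an encoding argument on both Python 2.7 and Python 3, so the same code runs on either. The helper names append_record and read_records, the temporary path and the sample records are assumptions made for this example, not part of the original script.

```python
# -*- coding: utf-8 -*-
import io
import json
import os
import tempfile


def append_record(path, record):
    """Append one record as a single JSON line, encoded as UTF-8."""
    line = json.dumps(record, ensure_ascii=False) + u"\n"
    with io.open(path, "a", encoding="utf-8") as f:
        f.write(line)


def read_records(path):
    """Read the file back, one dict per non-empty line."""
    with io.open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Write to a temporary directory so the example leaves result.txt alone.
path = os.path.join(tempfile.mkdtemp(), "result.txt")
append_record(path, {u"index": u"1", u"title": u"霸王别姬"})
append_record(path, {u"index": u"2", u"title": u"肖申克的救赎"})
records = read_records(path)
print(len(records))
```

Writing one JSON object per line keeps the file easy to read back line by line.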

4. On the Maoyan Top100 board each page shows only 10 films; the page uses an offset parameter in the URL to decide which ranked films are shown.

def main(offset):
    url='http://maoyan.com/board/4?offset='+str(offset)
    html=get_one_page(url)
    if html is None:    # request failed, nothing to parse
        return
    for item in parse_one_page(html):
        write_to_file(item)
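With 10 films per page, the 100 films are covered by offsets 0, 10, ..., 90. A minimal sketch of building those URLs follows; the helper name page_urls and its parameters are hypothetical.

```python
# Base URL as used in the article; the offset is appended to it.
BASE_URL = 'http://maoyan.com/board/4?offset='


def page_urls(pages=10, page_size=10):
    """Return one URL per page: offsets 0, 10, ..., (pages - 1) * page_size."""
    return [BASE_URL + str(i * page_size) for i in range(pages)]


urls = page_urls()
print(urls[0])
print(urls[-1])
```

No request is made here; the list only shows which URLs the scraper would visit.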

5. Use the multiprocessing module to scrape the pages in multiple processes.
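The way Pool.map distributes offsets over worker processes can be tried without touching the network. In this sketch fake_main is a hypothetical stand-in for the real page handler, and the pool size of 4 is an arbitrary choice.

```python
import multiprocessing
import os


def fake_main(offset):
    """Stand-in for the real handler: report which process got which offset."""
    return (offset, os.getpid())


def run_pool():
    """Map the ten offsets over a pool of worker processes."""
    pool = multiprocessing.Pool(processes=4)
    try:
        # map() blocks until all offsets are handled and keeps the input order.
        return pool.map(fake_main, [i * 10 for i in range(10)])
    finally:
        pool.close()
        pool.join()


if __name__ == '__main__':
    results = run_pool()
    print([offset for offset, _ in results])
    print(len(set(pid for _, pid in results)))
```

The __main__ guard is needed so that worker processes do not start pools of their own on platforms that re-import the script. When the real handler writes to a file from several processes, the lines are not guaranteed to appear in rank order.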

6. The complete code is as follows:

# -*- coding: utf-8 -*-
"""
Created on Thu Jun 29 10:23:46 2017

@author: Tana
"""
import requests
import codecs
import json
from requests.exceptions import RequestException 
import re
import  multiprocessing 



def get_one_page(url):
    try:
        response=requests.get(url)
        if response.status_code==200:
            response=response.text
            return response
        else:
            return None
    except RequestException:
        return None
        
def parse_one_page(html):
    # one capture group per field: index, image, title, actor, time, score integer, score fraction
    pattern=re.compile(r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name">'
                      +r'<a.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>.*?integer">'
                      +r'(.*?)</i>.*?fraction">(\d+)</i>.*?</dd>',re.S)
    items=re.findall(pattern,html)
    for item in items:
        yield {
            "index":item[0],
            "title":item[2],
            "actor":item[3].strip()[3:],
            "time":item[4].strip()[5:],
            "score":item[5]+item[6],
            "image":item[1]
        }
    
  
def write_to_file(content):
    # the with block closes the file automatically, so no explicit close() is needed
    with codecs.open('result.txt','a','utf-8') as f:
        f.write(json.dumps(content,ensure_ascii=False)+'\n')
  

def main(offset):
    url='http://maoyan.com/board/4?offset='+str(offset)
    html=get_one_page(url)
    if html is None:    # request failed, nothing to parse
        return
    for item in parse_one_page(html):
        write_to_file(item)
    
if __name__=='__main__':
    pool=multiprocessing.Pool()
    pool.map(main,[i*10 for i in range(10)])
