Python 3.5: Scraping Movie Data from a Website


First, import a few Python 3 libraries:

import urllib.request
from html.parser import HTMLParser

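An important difference between Python 2 and Python 3 is that Python 2 has two libraries, urllib and urllib2, whereas Python 3 merges them into the single urllib package, and some of the function names and calling conventions changed along the way. We start by defining a function that bundles the header, the URL, and the request so that it is easy to call: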

def print_movies(url):
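    # Pretend to be a browser. Anti-scraping middleware can detect this easily,
    # but the request may be rejected without it, so send the header anyway.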
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'}
    # Python 3's urllib.request
    req = urllib.request.Request(url, headers=header)
    s = urllib.request.urlopen(req)
    parser = MovieParser()
    parser.feed((s.read()).decode('utf-8'))
    s.close()
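
How the header ends up on the request can be checked without touching the network. The sketch below only builds a Request object and inspects it; the URL is a placeholder chosen for illustration, not one from the original post.

```python
import urllib.request

# Hypothetical URL used only for illustration; no request is actually sent.
url = 'https://example.com/movies'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

req = urllib.request.Request(url, headers=headers)

# Request stores header names capitalized ('User-agent'), so look them up that way.
user_agent = req.get_header('User-agent')
method = req.get_method()   # 'GET' because no data was supplied
host = req.host

print(user_agent, method, host)
```

Only urllib.request.urlopen(req) would actually send the request, which is the step print_movies performs.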

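Next, override HTMLParser's handle_starttag(self, tag, attrs) method. The parser then calls the overridden version automatically for every start tag it encounters. The official documentation for HTMLParser describes the callbacks in detail.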

class MovieParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.movies = []
    # Override HTMLParser's built-in start-tag handler
    def handle_starttag(self, tag, attrs):
        def _attr(attrlist, attrname):
            for attr in attrlist:
                if attr[0] == attrname:
                    return attr[1]
            return None
        # Inspect the page source to find the attributes that identify each
        # li tag (listed after class, e.g. category) and test for them below
        if tag == 'li' and _attr(attrs, 'data-title'):
            movie = {}
            movie['title'] = _attr(attrs, 'data-title')
            movie['rate'] = _attr(attrs, 'data-rate')
            movie['director'] = _attr(attrs, 'data-director')
            movie['actors'] = _attr(attrs, 'data-actors')
            self.movies.append(movie)
            print('%(title)s|%(rate)s|%(director)s|%(actors)s' % movie)
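
To see what handle_starttag actually receives, here is a minimal offline sketch. The class name AttrLogger and the sample HTML are made up for illustration; the point is that attrs arrives as a list of (name, value) tuples, which is why the _attr helper above loops over it.

```python
from html.parser import HTMLParser


class AttrLogger(HTMLParser):
    """Hypothetical helper that records every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples; dict() makes lookups easier
        self.seen.append((tag, dict(attrs)))


# Made-up markup imitating li tags that carry data-* attributes
sample = '<ul><li data-title="Movie A" data-rate="8.5">A</li><li class="ad">x</li></ul>'

logger = AttrLogger()
logger.feed(sample)

# Keep only the li tags that have a data-title attribute
titles = [a['data-title'] for t, a in logger.seen if t == 'li' and 'data-title' in a]
print(titles)
```

Converting attrs with dict() is an alternative to the nested _attr helper; both return None or skip the tag when the attribute is absent.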

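When execution reaches parser.feed((s.read()).decode('utf-8')), it helps to understand why the line is written this way. parser is an instance of MovieParser, a subclass of HTMLParser, so it inherits feed(). s.read() returns bytes, but feed() only accepts str, so we append decode('utf-8') to turn the bytes into str (in UTF-8, a Chinese character typically occupies three bytes). That is all it takes to fetch the data. To scrape a different site, change the URL and the condition in handle_starttag. The complete code follows: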
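
Before the full listing, the bytes-to-str step can be checked offline. This small sketch uses a two-character sample string of my own choosing in place of a real response body.

```python
text = '電影'                     # sample string, two Chinese characters
raw = text.encode('utf-8')        # bytes, like the value returned by s.read()

print(type(raw).__name__, len(raw))          # each character here takes three bytes

decoded = raw.decode('utf-8')     # back to str, which is what feed() expects
print(type(decoded).__name__, len(decoded))
```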

import urllib.request
from html.parser import HTMLParser

class MovieParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.movies = []
    # Override HTMLParser's built-in start-tag handler
    def handle_starttag(self, tag, attrs):
        def _attr(attrlist, attrname):
            for attr in attrlist:
                if attr[0] == attrname:
                    return attr[1]
            return None
        # Inspect the page source to find the attributes that identify each
        # li tag (listed after class, e.g. category) and test for them below
        if tag == 'li' and _attr(attrs, 'data-title'):
            movie = {}
            movie['title'] = _attr(attrs, 'data-title')
            movie['rate'] = _attr(attrs, 'data-rate')
            movie['director'] = _attr(attrs, 'data-director')
            movie['actors'] = _attr(attrs, 'data-actors')
            self.movies.append(movie)
            print('%(title)s|%(rate)s|%(director)s|%(actors)s' % movie)

def print_movies(url):
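    # Pretend to be a browser. Anti-scraping middleware can detect this easily,
    # but the request may be rejected without it, so send the header anyway.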
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'}
    # Python 3's urllib.request
    req = urllib.request.Request(url, headers=header)
    s = urllib.request.urlopen(req)
    parser = MovieParser()
    parser.feed((s.read()).decode('utf-8'))
    s.close()


if __name__ == '__main__':
    url = 'https://movie.douban.com/'
    # Print the movies found on the page (the function returns nothing)
    print_movies(url)
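
Adapting the scraper to another site mostly means changing the tag and attribute test. One way to sketch that is to make them constructor parameters. The class name AttrParser, its parameters, and the sample markup are all hypothetical, and this is only one possible arrangement, not the original author's code.

```python
from html.parser import HTMLParser


class AttrParser(HTMLParser):
    """Collect chosen attributes from every `tag` element that carries `key_attr`."""

    def __init__(self, tag, key_attr, fields):
        super().__init__()
        self.tag = tag            # element name to match, e.g. 'li'
        self.key_attr = key_attr  # attribute that marks an element of interest
        self.fields = fields      # attributes to copy into each result
        self.items = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == self.tag and attrs.get(self.key_attr):
            # Attributes that are absent come back as None
            self.items.append({f: attrs.get(f) for f in self.fields})


# Made-up markup standing in for a different site
html = '<div><a data-name="Book A" data-price="12">A</a><a href="/about">About</a></div>'

parser = AttrParser('a', 'data-name', ['data-name', 'data-price'])
parser.feed(html)
print(parser.items)
```

With this shape, the movie page above would correspond to AttrParser('li', 'data-title', ['data-title', 'data-rate', 'data-director', 'data-actors']).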

Running the script prints one line per matching movie, in the form title|rate|director|actors.

 

