Python (11): A Native Crawler

Reposted from the original article, 2018-08-20 16:33 (922 reads), tagged python.

1. Decide what to scrape and which page to fetch

# Scrape the streamer popularity ranking

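2. The usual crawler workflow

Before writing the crawler

Define the goal
Find the page that holds the data
Analyze the page structure to locate the tags containing the data

Simulate an HTTP request, send it to the server, and receive the HTML the server returns
Use regular expressions to extract the data we want (name, viewer count)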
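
The last two steps can be tried without touching the network. This is a minimal sketch, assuming a hypothetical HTML snippet and a stand-in variable `fake_response_bytes` in place of a real `request.urlopen(url).read()` result; only the `video-number` pattern comes from the article's code.

```python
import re

# Stand-in for the bytes that request.urlopen(url).read() would return.
# The markup is a hypothetical example, not the real page.
fake_response_bytes = '<span class="video-number">26.8萬</span>'.encode('utf-8')

# Step 1: the HTTP response body arrives as bytes; decode it into text.
html = fake_response_bytes.decode('utf-8')

# Step 2: pull the wanted value out with a regular expression.
# [\s\S]*? matches any characters (including newlines), as few as possible.
number_pattern = r'<span class="video-number">([\s\S]*?)</span>'
numbers = re.findall(number_pattern, html)

print(numbers)
```

`re.findall` returns a list of the captured groups, which is why the later code indexes the result with `[0]`.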

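3. Debugging code in VS Code

Press F5 to start; debugging works the same way as in Visual Studio.

Related tools: BeautifulSoup, Scrapy

Crawlers, anti-crawler measures, and counter-measures against anti-crawling

IP bans

Proxy IP pools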
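
One possible response to an IP ban is to route requests through a proxy taken from such a pool. The following is a minimal sketch with the standard library only, assuming a hypothetical proxy address `127.0.0.1:8080` and a made-up User-Agent string; nothing is sent until `opener.open(...)` is called.

```python
from urllib import request

# Hypothetical proxy address; replace it with one taken from your own proxy pool.
proxy_handler = request.ProxyHandler({
    'http': 'http://127.0.0.1:8080',
    'https': 'http://127.0.0.1:8080',
})

# An opener that routes its requests through the proxy.
opener = request.build_opener(proxy_handler)

# Some sites reject the default Python User-Agent, so send a browser-like one.
# The string below is a made-up example.
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (example crawler)')]

# Building a Request does not touch the network; opener.open(req) would.
req = request.Request('https://www.panda.tv/cate/lol')
print(req.full_url)
```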

5. Data extraction: level analysis and principles; using regular expressions to parse the HTML and get the name and viewer count
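
The idea behind the level analysis is to first cut out the smallest enclosing block that contains both values, then search inside each block so that a name and a count always belong to the same entry. A minimal sketch, assuming hypothetical markup (the real page may differ) and a `name_pattern` written only for that sample:

```python
import re

# Hypothetical markup for two list entries; the real page may look different.
html = '''
<div class="video-info">
  <span class="video-nickname"><i class="icon"></i>
    StreamerA
  </span>
  <span class="video-number">26.8萬</span>
</div>
<div class="video-info">
  <span class="video-nickname"><i class="icon"></i>
    StreamerB
  </span>
  <span class="video-number">7361</span>
</div>
'''

root_pattern = r'<div class="video-info">([\s\S]*?)</div>'
name_pattern = r'</i>([\s\S]*?)</span>'  # assumption: fits the sample markup only
number_pattern = r'<span class="video-number">([\s\S]*?)</span>'

anchors = []
# Level 1: one match per entry. The non-greedy *? stops at the first </div>,
# so each match covers exactly one block.
for block in re.findall(root_pattern, html):
    # Level 2: search inside the block only.
    name = re.findall(name_pattern, block)[0].strip()
    number = re.findall(number_pattern, block)[0]
    anchors.append({'name': name, 'number': number})

print(anchors)
```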

from urllib import request
import re
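# Note: breakpoint debugging has pitfalls here
class Spider():
    url = 'https://www.panda.tv/cate/lol'
    # Raw strings keep the backslashes in \s and \S intact for the regex engine
    root_pattern = r'<div class="video-info">([\s\S]*?)</div>'
    name_pattern = r'</li>([\s\S]*?)</span>'
    number_pattern = r'<span class="video-number">([\s\S]*?)</span>'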

    def __fetch_content(self):
        r = request.urlopen(Spider.url)
        # read() returns bytes
        htmls = r.read()
        htmls = str(htmls,encoding='utf-8')
        return htmls
    
    def __analysis(self,htmls):
        root_html = re.findall(Spider.root_pattern, htmls)
        anchors = []
        for html in root_html:
            name = re.findall(Spider.name_pattern, html)
            number = re.findall(Spider.number_pattern, html)
            anchor = {'name':name,'number':number}
            anchors.append(anchor)
        # print(anchors[0])
        return anchors

    def __refine(self, anchors):
        l = lambda anchor:{
            'name':anchor['name'][0].strip(),
            'number':anchor['number'][0]
            }
        return map(l,anchors)

    def go(self):
        htmls = self.__fetch_content()
        anchors = self.__analysis(htmls)
        anchors =list(self.__refine(anchors))
        print(anchors[0])

s = Spider()
s.go()

Output:
{'name': 'LOL丶搖擺哥', 'number': '26.8萬'}
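
The `__refine` step above relies on `map`, which in Python 3 returns a lazy iterator rather than a list; that is why `go` wraps the result in `list(...)` before indexing it. A small sketch with made-up sample data and a named function in place of the lambda:

```python
# Raw matches as re.findall would return them: lists of strings with stray whitespace.
# The values are made-up sample data.
raw_anchors = [
    {'name': ['\n    StreamerA\n  '], 'number': ['26.8萬']},
    {'name': ['\n    StreamerB\n  '], 'number': ['7361']},
]

def refine(anchor):
    # Take the first match and strip surrounding spaces and newlines.
    return {
        'name': anchor['name'][0].strip(),
        'number': anchor['number'][0],
    }

lazy = map(refine, raw_anchors)  # a map object; nothing has been computed yet
cleaned = list(lazy)             # consuming the iterator runs refine on each item

print(cleaned[0])
```

A map object can be consumed only once, so converting it to a list is also what makes it safe to sort and slice the data later.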

8. Refining the data and sorting with sorted

from urllib import request
import re
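# Note: breakpoint debugging has pitfalls here
class Spider():
    url = 'https://www.panda.tv/cate/lol'
    # Raw strings keep the backslashes in \s and \S intact for the regex engine
    root_pattern = r'<div class="video-info">([\s\S]*?)</div>'
    name_pattern = r'</li>([\s\S]*?)</span>'
    number_pattern = r'<span class="video-number">([\s\S]*?)</span>'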

    # Fetch the page that contains the data
    def __fetch_content(self):
        r = request.urlopen(Spider.url)
        # read() returns bytes
        htmls = r.read()
        htmls = str(htmls,encoding='utf-8')
        return htmls
    
    # Extract the data from the page
    def __analysis(self,htmls):
        root_html = re.findall(Spider.root_pattern, htmls)
        anchors = []
        for html in root_html:
            name = re.findall(Spider.name_pattern, html)
            number = re.findall(Spider.number_pattern, html)
            anchor = {'name':name,'number':number}
            anchors.append(anchor)
        # print(anchors[0])
        return anchors

    # Clean the data: strip() removes leading and trailing spaces and newlines
    def __refine(self, anchors):
        l = lambda anchor:{
            'name':anchor['name'][0].strip(),
            'number':anchor['number'][0]
            }
        return map(l,anchors) # map applies l to every dict in the anchors list

    # Sort the scraped data; reverse=True gives descending order
    def __sort(self, anchors):
        anchors = sorted(anchors, key=self.__sort_seed, reverse=True)
        return anchors

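    # Key function for sorted(): converts the viewer-count string into a number
    def __sort_seed(self, anchor):
        # Match the whole decimal number. The pattern '\d*' would capture only
        # the integer part ('1.9萬' -> '1'), which is why 1.1萬 is listed ahead
        # of 1.9萬 in the sample output below.
        r = re.findall(r'\d+\.?\d*', anchor['number'])
        number = float(r[0])
        if '萬' in anchor['number']:
            number *= 10000
        return number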

    # Print the ranking
    def __show(self, anchors):
        for rank in range(0,len(anchors)):
            print('rank '+ str(rank +1)+'   : '+anchors[rank]['name']+'    '+anchors[rank]['number']+'人')

    # Entry point
    def go(self):
        htmls = self.__fetch_content()
        anchors = self.__analysis(htmls)
        anchors =list(self.__refine(anchors))
        print(anchors[0])
        anchors= self.__sort(anchors)
        self.__show(anchors[:20])

s = Spider()
s.go()

Output:
{'name': 'LOL丶搖擺哥', 'number': '20.2萬'}
rank 1   : 賈克虎丶虎神    96.9萬人
rank 2   : LOL丶搖擺哥    20.2萬人
rank 3   : LPL熊貓官方直播    12.1萬人
rank 4   : WUCG官方直播平台    8.4萬人
rank 5   : 溫州丶黃小明    5.1萬人
rank 6   : 暴君aa    4.6萬人
rank 7   : 順順套路王    3.1萬人
rank 8   : 火苗OB解說    2.5萬人
rank 9   : 蘭晨丶    1.1萬人
rank 10   : 海洋OvO    1.9萬人
rank 11   : 小馬哥玩蓋倫    1.6萬人
rank 12   : 牛老師丶    1.5萬人
rank 13   : Riot國際賽事直播間    1.5萬人
rank 14   : 小白Mini    7361人
rank 15   : 一個很C的稻草人    7223人
rank 16   : 抗寒使者    4976人
rank 17   : 小麥子鮮奶油    4902人
rank 18   : 祝允兒    4574人
rank 19   : 請叫我大腿岩丶    4201人
rank 20   : 李小青ZJ    3838人
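
How the sort key turns strings such as `1.9萬` into comparable numbers can be checked in isolation. This is a minimal sketch, assuming a standalone helper named `parse_count` (a hypothetical name, not part of the article's class) and three entries taken from the output above:

```python
import re

def parse_count(text):
    """Convert a viewer-count string such as '1.9萬' or '7361' into a float."""
    # \d+\.?\d* matches digits, an optional decimal point, and any further digits.
    match = re.search(r'\d+\.?\d*', text)
    number = float(match.group())
    if '萬' in text:  # 萬 means ten thousand
        number *= 10000
    return number

samples = [
    {'name': '蘭晨丶', 'number': '1.1萬'},
    {'name': '海洋OvO', 'number': '1.9萬'},
    {'name': '小白Mini', 'number': '7361'},
]

# sorted() calls the key function once per item and orders by its return value.
ranked = sorted(samples, key=lambda a: parse_count(a['number']), reverse=True)

for rank, anchor in enumerate(ranked, start=1):
    print(rank, anchor['name'], anchor['number'])
```

A pattern that captures only the integer part would map both 1.1萬 and 1.9萬 to 10000, and because `sorted` is stable the two entries would simply keep their original order.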
