pyspider Web Crawling Examples

Reposted article. 2020-01-20 16:28. Category: Big Data

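1. Liqu (liqucn.com)

The site to crawl is http://www.liqucn.com/rj/new/. A quick look shows about 13,021 pages with 12 items per page, so roughly 150,000 records in total. That is a manageable amount to scrape, and the data can be used later for analysis or for practicing database optimization.

The site has essentially no anti-crawling measures, so you can crawl it directly. Still, keep concurrency modest so as not to put too much load on someone else's server.

Analyzing the page shows that pagination is URL-based, which keeps things simple: fetch the total page count from the first page, then generate all the page URLs in bulk.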

https://www.liqucn.com/rj/new/
https://www.liqucn.com/rj/new/?page=12953
https://www.liqucn.com/rj/new/?page=12952
https://www.liqucn.com/rj/new/?page=12951

Note that the page number in the URL decreases as you move through the listing.
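As a minimal sketch of this URL scheme, the helper below generates the listing URLs from the highest page number. The name `build_page_urls` is made up for illustration, and the sketch assumes that the first listing page carries no `page` parameter and that the remaining pages count down to 1, as the URLs above suggest.

```python
BASE_URL = "https://www.liqucn.com/rj/new/"


def build_page_urls(highest_page):
    """Return the listing URLs in site order.

    Assumption: the first page has no query string and the remaining
    pages count down from `highest_page` to 1.
    """
    urls = [BASE_URL]
    for page in range(highest_page, 0, -1):
        urls.append("{}?page={}".format(BASE_URL, page))
    return urls


# Small demonstration with a tiny page count
for url in build_page_urls(3):
    print(url)
```

In a pyspider handler, each of these URLs would be passed to self.crawl instead of being printed.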

The code in index_page that gets the total page count:

class Handler(BaseHandler):
    crawl_config = {
        'itag': 'v223'
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('https://www.liqucn.com/rj/new/', callback=self.index_page, validate_cert=False)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # Get the total page count from the pagination link
        total_href = response.doc(".page :nth-child(2)").attr('href')
        print(total_href)
        num = re.search(r"(\d+)", total_href).group()

        # for page in range(1, int(num) + 2):
        #     self.crawl('http://www.liqucn.com/rj/new/?page={}'.format(page), callback=self.detail_page, validate_cert=False)
        # Crawl just one page for now
        self.crawl('http://www.liqucn.com/rj/new/?page={}'.format(13021), callback=self.detail_page, validate_cert=False)
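The page-number extraction in index_page can be tried on its own. The sketch below is a slightly more defensive variant, not the author's code: it matches the digits after `page=` rather than the first run of digits, and it returns None when the link is missing (`.attr('href')` gives None if the selector matches nothing, which would make `re.search` raise). The helper name `extract_page_number` is an assumption for illustration.

```python
import re


def extract_page_number(href):
    """Return the page number from a link such as '?page=12953', or None."""
    match = re.search(r"page=(\d+)", href or "")
    if match is None:
        return None
    return int(match.group(1))


print(extract_page_number("https://www.liqucn.com/rj/new/?page=12953"))
print(extract_page_number(None))
```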

The paginated URLs are now in the crawl queue. Next, parse the fetched pages; this is done in the detail_page function:

    @config(priority=2)
    def detail_page(self, response):
        docs = response.doc(".tip_blist li").items()
        dicts = []
        for item in docs:
            title = item(".tip_list>span>a").text()
            pubdate = item(".tip_list :nth-child(3)").text()
            info = item(".tip_list :nth-child(4)").text()
            # Split on "/" into app type and size; default to "0MB" if the size is missing
            size = info.split("/")
            app_type = size[0]
            if len(size) == 2:
                size = size[1].strip()
            else:
                size = "0MB"
            mobile_type = item("h3>a").text()

            # Get the download URL of the app's logo image
            img_url = item(".tip_list>a>img").attr("src")
            # Get the file name
            filename = img_url[img_url.rindex("/") + 1:]
            # Download the logo image and save it locally
            self.crawl(img_url, callback=self.save_img, save={"filename": filename}, validate_cert=False)
            # Collect the app data
            data = {
                "title": title,
                "pubdate": pubdate,
                "size": size,
                "app_type": app_type,
                "mobile_type": mobile_type
            }
            dicts.append(data)
            print(dicts)
        return dicts
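The type/size handling inside detail_page is easy to get subtly wrong, so here is a small standalone sketch of it. The function name `split_info` and the sample strings are invented for illustration; unlike the code above, this variant also strips whitespace from the app type.

```python
def split_info(info):
    """Split a 'type / size' string into (app_type, size).

    Falls back to '0MB' when the string does not contain exactly one '/'.
    """
    parts = [part.strip() for part in info.split("/")]
    app_type = parts[0]
    size = parts[1] if len(parts) == 2 else "0MB"
    return app_type, size


print(split_info("Tools / 3.5MB"))
print(split_info("Tools"))
```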

With the app data returned in one batch, override on_result to save it.

from pyspider.libs.base_handler import *
import os
import re
import pandas as pd

# Directory where the downloaded logo images are saved
DIR_PATH = "d:\\123\\"

Storing the data

    def on_result(self, result):
        if result:
            print("Saving data....")
            data = pd.DataFrame(result)
            data.to_csv(r"d:\數據.csv", mode='a', header=False, encoding='utf_8_sig')
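Because on_result appends with header=False, the resulting CSV never gets a header row. One possible alternative, sketched here with the standard csv module instead of pandas, writes the header only when the file is new. The helper name `append_rows` and the `FIELDS` list are assumptions based on the dictionary built in detail_page.

```python
import csv
import os

# Column order; assumed to match the keys built in detail_page
FIELDS = ["title", "pubdate", "size", "app_type", "mobile_type"]


def append_rows(path, rows):
    """Append dict rows to a CSV file, writing the header only once."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf_8_sig") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)
```

Calling append_rows(path, result) from on_result would then produce a file that spreadsheet tools can open with named columns.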

Finally, finish the image download and we are done. Downloading an image just means saving the response bytes to a local path.

    def save_img(self, response):
        content = response.content
        file_name = response.save["filename"]
        # Create the directory if it does not exist
        if not os.path.exists(DIR_PATH):
            os.makedirs(DIR_PATH)

        file_path = os.path.join(DIR_PATH, file_name)

        with open(file_path, "wb") as f:
            f.write(content)
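The file-name and save steps can also be written without pyspider, which makes them easy to test. This is a sketch under assumptions: `filename_from_url` and `save_bytes` are hypothetical helper names, and the URL in the example is a placeholder. Unlike slicing after the last "/", parsing the URL first ignores any query string.

```python
import os
import posixpath
from urllib.parse import urlparse


def filename_from_url(url):
    """Return the last path segment of a URL, ignoring any query string."""
    return posixpath.basename(urlparse(url).path)


def save_bytes(directory, filename, content):
    """Write `content` to directory/filename, creating the directory if needed."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    with open(path, "wb") as f:
        f.write(content)
    return path


print(filename_from_url("https://example.com/img/logo.png?v=2"))
```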

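That completes the task. Save the script, adjust the crawl rate, click run, and the data starts flowing.

2. Huxiu (huxiu.com)

The site is https://www.huxiu.com/, and the target is its news channel.

As usual, start by analyzing the page to crawl.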

Scroll to the bottom of the page and you will find a "Load more" button. Click it and capture the request it sends.

Then inspect that request's method, address, and parameters.

This gives the following information:

Request URL: https://www.huxiu.com/v2_action/article_list
Request method: POST
The most important request parameter is one named page
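To make the shape of that request concrete, here is a minimal sketch that builds it with the standard library without sending anything. It assumes the endpoint accepts a form-encoded body with a single `page` field; the source only says that `page` is the important parameter, so other fields may be required in practice. The name `build_page_request` is made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

ARTICLE_LIST_URL = "https://www.huxiu.com/v2_action/article_list"


def build_page_request(page):
    """Build (but do not send) a form-encoded POST request for one page."""
    body = urlencode({"page": page}).encode("utf-8")
    return Request(ARTICLE_LIST_URL, data=body, method="POST")


req = build_page_request(2)
print(req.get_method(), req.full_url, req.data)
```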

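With that information, all that remains is to write the pyspider code.
The on_start function loops over the pages. The number 2025 is the total page count I saw in that request; by the time you read this, it has probably grown.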

    @every(minutes=24 * 60)
    def on_start(self):
        # range() excludes its end value, so add 1 to include page 2025
        for page in range(1, 2025 + 1):
            print("Crawling page {}".format(page))
            self.crawl('https://www.huxiu.com/v2_action/article_list', method="POST", data={"page": page}, callback=self.parse_page, validate_cert=False)

Once the page requests are queued, parse_page is called to parse the Response returned after crawl() successfully fetches each URL.

    # pq is PyQuery and needs: from pyquery import PyQuery as pq
    @config(age=10 * 24 * 60 * 60)
    def parse_page(self, response):
        content = response.json["data"]
        doc = pq(content)
        lis = doc('.mod-art').items()
        data = [{
            'title': item('.msubstr-row2').text(),
            'url': 'https://www.huxiu.com' + str(item('.msubstr-row2').attr('href')),
            'name': item('.author-name').text(),
            'write_time': item('.time').text(),
            'comment': item('.icon-cmt+ em').text(),
            'favorites': item('.icon-fvr+ em').text(),
            'abstract': item('.mob-sub').text()
        } for item in lis]
        return data

Finally, define an on_result() method, which receives whatever the callbacks return. Here it receives the data returned by parse_page() and saves it to MongoDB.

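    # Assumes module-level imports of json, time and pandas (as pd), plus a
    # pymongo collection object named `collection`; the original does not show them.

    # Called with the data returned for each page
    def on_result(self, result):
        if result:
            self.save_to_mongo(result)

    # Store the data in MongoDB
    def save_to_mongo(self, result):
        df = pd.DataFrame(result)
        content = list(json.loads(df.T.to_json()).values())
        if collection.insert_many(content):
            print('Data saved successfully')
            # Pause for 1s
            time.sleep(1)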

Save the code, then change the requests-per-second and concurrency settings.

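Click run to start the crawl. You will find that the program stops after fetching a single page. pyspider uses the MD5 hash of the URL as the unique task ID; tasks with the same ID are treated as the same task and are not crawled again.

For GET requests, paginated URLs usually differ, so the IDs differ and multiple pages get crawled.
For POST requests, the URL is the same for every page, so after the first page the remaining pages are never crawled.

The fix is to override how the task ID is generated. Add the following code before the on_start() method: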

    # Requires: import json and from pyspider.libs.utils import md5string
    def get_taskid(self, task):
        return md5string(task['url'] + json.dumps(task['fetch'].get('data', '')))
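The effect of this override can be illustrated without pyspider. In the sketch below, `md5_hex` is a stand-in for pyspider's md5string, and the task dictionaries are simplified stand-ins that keep only the `url` and `fetch` keys used above; real pyspider tasks carry more fields.

```python
import hashlib
import json

URL = "https://www.huxiu.com/v2_action/article_list"


def md5_hex(text):
    """Stand-in for pyspider's md5string: hex MD5 digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def url_only_id(task):
    # Default behaviour described above: the ID depends on the URL alone
    return md5_hex(task["url"])


def url_and_data_id(task):
    # Overridden behaviour: the POST data is mixed into the ID
    return md5_hex(task["url"] + json.dumps(task["fetch"].get("data", "")))


page1 = {"url": URL, "fetch": {"data": {"page": 1}}}
page2 = {"url": URL, "fetch": {"data": {"page": 2}}}

print(url_only_id(page1) == url_only_id(page2))          # True: treated as one task
print(url_and_data_id(page1) == url_and_data_id(page2))  # False: two distinct tasks
```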

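3. WeDoctor Guahao (guahao.com)

Today I am trying out a crawler library that is new to me, called pyspider. It was written by a Chinese developer, so naturally I want to support it.

GitHub: https://github.com/binux/pyspider
Official documentation: http://docs.pyspider.org/en/latest/

Installation is very simple: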

pip install pyspider

After installing, start it by entering this command in the CMD console:

pyspider

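When the startup log appears without errors, pyspider is running. pyspider uses PhantomJS for JavaScript rendering; if PhantomJS is not installed, pyspider generally logs a warning and carries on without it rather than installing it for you, so install PhantomJS separately if you need it.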

Next, open a browser and go to 127.0.0.1:5000. The pyspider dashboard should appear, and you can start coding.

Creating a project takes three steps.

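Guahao expert team data: getting started with the library

For detailed usage of the tool, the blog post below covers it very thoroughly, so I will not repeat it here. Let's go straight to the code.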

https://blog.csdn.net/weixin_37947156/article/details/76495144

Guahao expert team data: crawler source code

The target is the expert team data on Guahao, at https://www.guahao.com/eteam/index

Analyze the AJAX request URLs to find the pattern for crawling.

The analysis yields this URL: https://www.guahao.com/json/white/search/eteams?q=&dept=&page=2&cid=&pid=&_=1542794523454
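The query string of that URL can be rebuilt with urlencode, which makes the role of each parameter easier to see. This is a sketch under assumptions: the helper name `build_search_url` is invented, and treating `_` as a millisecond timestamp used for cache-busting is a guess based on its value, not something the site documents.

```python
import time
from urllib.parse import urlencode

SEARCH_URL = "https://www.guahao.com/json/white/search/eteams"


def build_search_url(page, timestamp_ms=None):
    """Build the AJAX search URL for one page of expert teams."""
    if timestamp_ms is None:
        # Assumption: "_" is a millisecond timestamp (cache-buster)
        timestamp_ms = int(time.time() * 1000)
    params = [
        ("q", ""),
        ("dept", ""),
        ("page", page),
        ("cid", ""),
        ("pid", ""),
        ("_", timestamp_ms),
    ]
    return SEARCH_URL + "?" + urlencode(params)


print(build_search_url(2, timestamp_ms=1542794523454))
```

The crawler code in this section sends only the page parameter, which suggests the empty filters and the timestamp are optional.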

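The page parameter is the important one; it is the page number. In testing, the data started repeating once paging reached page 84. This is presumably a problem in the site's own system, and there is nothing we can do about it.

Crawler workflow

Get the total number of pages
Loop over and crawl each page

Getting the total page count

The entry function on_start crawls the first page; on success it calls index_page.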

from pyspider.libs.base_handler import *
import pandas as pd


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('https://www.guahao.com/json/white/search/eteams?page=1', callback=self.index_page, validate_cert=False)

index_page reads the total page count and queues every page URL with self.crawl. Because of the repeated data, the page count ended up hard-coded to 84.

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        doctors = response.json
        if doctors:
            if doctors["data"]:
                page_count = doctors["data"]["pageCount"]
                # for page in range(1, page_count + 1):
                for page in range(1, 85):
                    self.crawl('https://www.guahao.com/json/white/search/eteams?page={}'.format(page), callback=self.detail_page, validate_cert=False)

The last step is to parse the data and, once it has been crawled, store it in a CSV file.

    @config(priority=2)
    def detail_page(self, response):
        doctors = response.json
        data = doctors["data"]["list"]
        return data

    def on_result(self, result):
        if result:
            print("Saving data....")
            data = pd.DataFrame(result)
            data.to_csv("專家數據.csv", mode='a', header=False, encoding='utf_8_sig')
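Instead of hard-coding the page limit, the repeated records could in principle be filtered out before they are stored. The sketch below shows the idea with an in-memory set. The key name `id` is hypothetical, since the fields of each record are not shown here, and an in-memory set only works while everything runs in a single process.

```python
def filter_new_records(records, seen_ids, key="id"):
    """Return the records whose key has not been seen yet, and remember them."""
    fresh = []
    for record in records:
        record_id = record.get(key)
        if record_id in seen_ids:
            continue
        seen_ids.add(record_id)
        fresh.append(record)
    return fresh


seen = set()
first_page = [{"id": 1}, {"id": 2}]
later_page = [{"id": 2}, {"id": 3}]  # overlaps with the first page
print(filter_new_records(first_page, seen))  # [{'id': 1}, {'id': 2}]
print(filter_new_records(later_page, seen))  # [{'id': 3}]
```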