Scraping Mobile App Data with Python: A Hands-On Walkthrough (Part 1)


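Preface

Mobile phones now account for most internet usage, and mobile apps generate large volumes of data every day. Anyone learning web scraping should therefore be able to collect data from apps as well as from web pages. This article uses the Douguo Meishi (豆果美食) recipe app as an example to walk through the process of scraping a mobile app.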

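Environment setup

  • Python 3
  • Fiddler
  • An Android emulator that supports bridged networking (this article uses the Nox emulator)

Background knowledge you will need:

  • Using requests
  • Using MongoDB
  • Basic operation of the Fiddler packet-capture tool
  • Basic use of the ThreadPoolExecutor thread pool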

 

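Getting started

The goal of this project is to scrape every recipe in the Douguo Meishi app into a local database.

This article does not cover installing Fiddler, the Android emulator, or the third-party Python libraries. These steps are straightforward, and instructions are easy to find with a web search.

The overall workflow is roughly as follows:

  1. Connect the Android emulator to Fiddler through a proxy
  2. Operate the app in the Android emulator
  3. Analyze the traffic captured by Fiddler
  4. Use Python to replay the requests to the server and obtain the response data
  5. Scrape with multiple threads and save the results to a local database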

 

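1) Connect the Android emulator to Fiddler through a proxy

Open Fiddler and go to Tools > Options in the top menu bar, then configure the options as shown in the screenshots (mainly the first three settings).

The 8889 in the third screenshot is the port Fiddler listens on. Once the emulator is configured to use this proxy, Fiddler can capture its packets.

Enable bridged mode for the emulator's network connection (Nox includes this feature; the change requires a restart). If you use Nox, Android 4 is recommended, because when I first tried Android 5, bridged mode could not get online.

Next, check the computer's IP address by opening a command prompt and running the ipconfig command.

For example, my address is currently 172.20.10.2. Open the wireless settings in the emulator and long-press the network name until the Modify network option appears, as in the screenshot. Then choose the advanced options, add a manual proxy, and enter the computer's IP address and the Fiddler port number.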
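As an alternative to reading the ipconfig output by eye, the address can also be looked up from Python. The following is a minimal sketch, not part of the original workflow: the function name local_ip and the probe address 8.8.8.8 are assumptions chosen for illustration, and on a machine with several network adapters the result may differ from the adapter the emulator is bridged to.

```python
import socket


def local_ip(probe_host="8.8.8.8", probe_port=80):
    """Return the LAN address this machine would use for outbound traffic."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket sends no packet; it only selects a route,
        # which lets us read back the local address chosen for that route.
        sock.connect((probe_host, probe_port))
        return sock.getsockname()[0]
    except OSError:
        # No usable route (for example, the machine is offline).
        return "127.0.0.1"
    finally:
        sock.close()


print(local_ip())
```

The printed address is the value to enter as the proxy host in the emulator, together with the Fiddler port.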

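Now test whether the emulator can reach Fiddler: with Fiddler running, open the browser in the emulator, load any web page, and check whether Fiddler captures the packets.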
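If nothing shows up in Fiddler, it can help to first confirm that the proxy port accepts connections at all. The sketch below is one way to do that from the host machine; the helper name proxy_reachable is hypothetical, and it only checks that a TCP connection can be opened, not that Fiddler is the program listening.

```python
import socket


def proxy_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

With the values used in this article, the call would be proxy_reachable("172.20.10.2", 8889). A result of False suggests checking that Fiddler is running, that it allows remote computers to connect, and that a firewall is not blocking the port.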

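Packets are captured at this point, but the browser keeps warning about a security certificate problem. That is because the emulator does not yet have Fiddler's certificate installed. To install it, visit the computer's IP address and port (here, 172.20.10.2:8889) in the emulator's browser.

Click the link at the bottom of the page shown in the screenshot to download the certificate, then install it. This step may require setting a lock-screen password on the emulator; any password will do.

That completes step one: the emulator is connected to Fiddler, and we can move on to capturing traffic.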


 

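2) Operate the app in the Android emulator

Install the Douguo Meishi app in the emulator and start using it: open the app (Figure 1), tap the recipe categories (the index page, Figure 2), tap a specific ingredient (Figure 3), and then choose a recipe (Figure 4).

After these steps Fiddler will have captured many packets. Now open Fiddler and work out which of them are useful.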


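3) Analyze the traffic captured by Fiddler

First look at what has appeared in Fiddler. There are a lot of packets, and it is not obvious where to start.

Anyone who has done some programming knows that apps generally request data through API endpoints. Some of the captured packets have the host api.douguo.net, so let's check whether those are the ones we want.

Clicking a few of the packets sent to the api host shows that the returned data is exactly what we are after (it contains the categories, eggplant, and so on). We can therefore delete the irrelevant packets and keep only the useful ones.

The URL paths of some packets are also informative. Names in code are rarely arbitrary, so a path often reveals what the packet contains.

Finally, a quick look at the responses in Fiddler shows that we need the following three packets. Their responses correspond to Figures 2, 3, and 4 from the emulator session, and together they hold the recipe content.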


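4) Use Python to replay the requests to the server and obtain the response data

Now we analyze the requests of the three captured packets and replay them, sending them to the server to see whether we get the responses we want.

Open PyCharm and create a file. We start with the first packet, the recipe category page, shown in the screenshot below.

The request uses the POST method, and both the headers and the data it sends are visible in Fiddler. Now we write the .py file.

Copy the headers into Notepad or another text editor and use find-and-replace to turn the text into the dictionary format we need, as in the screenshot (remember to remove the stray spaces).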
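The find-and-replace step can also be done in Python. The helper below is a minimal sketch (the name headers_to_dict is hypothetical): it splits each copied line at the first colon, so values that themselves contain colons, such as a MAC address, stay intact, and it skips the request line because a header name never contains a space.

```python
def headers_to_dict(raw):
    """Convert header text copied from a capture tool into a dict."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        # Split at the first colon only; the value may contain more colons.
        name, _, value = line.partition(":")
        name = name.strip()
        if " " in name:
            # Not a header, e.g. "POST http://... HTTP/1.1".
            continue
        headers[name] = value.strip()
    return headers


raw_headers = """
POST http://api.douguo.net/recipe/flatcatalogs HTTP/1.1
client: 4
version: 6945.4
device: OPPO R11
Host: api.douguo.net
"""

print(headers_to_dict(raw_headers))
```

The resulting dictionary can be passed to requests as the headers argument after removing any entries you prefer not to send.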

douguo_test.py is written as follows:

import requests
import json

def douguo_request(url,data):
    # Headers we do not need are commented out, since they might let the server detect that we are sending repeated requests
    headers = {
        "client": "4",
        "version": "6945.4",
        "device": "OPPO R11",
        "sdk": "19,4.4.2",
        "imei": "866174010601603",
        "channel": "baidu",
        #"mac": "3C:A0:67:68:D1:F5",
        "resolution": "1280*720",
        "dpi": "1.5",
        #"android-id": "3ca06768d1f58615",
        #"pseudo-id": "768d1f586153ca06",
        "brand": "OPPO",
        "scale": "1.5",
        "timezone": "28800",
        "language": "zh",
        "cns": "3",
        "carrier": "CMCC",
        #"imsi": " 460076016067682",
        "User-Agent": "Mozilla/5.0 (Linux; Android 4.4.2; OPPO R11 Build/NMF26X) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/30.0.0.0 Mobile Safari/537.36",
        "act-code": "9d77be448da77d86aa48ae9d822d70d0",
        "act-timestamp": "1569203760",
        "uuid": "b2db10e9-cb21-4c36-ba4c-be62d8b3c67c",
        "newbie": "1",
        "reach": "10000",
        #"lon": "112.573081",
        #"lat": "37.735522",
        #"cid": "140100",
        "Content-Type": "application/x-www-form-urlencoded; charset=utf-8",
        "Accept-Encoding": "gzip, deflate",
        "Connection": "Keep-Alive",
        #"Cookie": "duid=61248941",
        "Host": "api.douguo.net",
        #"Content-Length": "68",
    }

    response = requests.post(url=url,data=data,headers=headers)
    return response

url = 'http://api.douguo.net/recipe/flatcatalogs'
data = {
        "client":"4",
        #"_session" : "1568947372977863254011601605",
        #"v" : "1568891837",
        "_vs" : "2305"
}

response = douguo_request(url,data)
print(response.text)

The script prints a block of JSON. Paste the output into the json.cn website to inspect it (this makes the JSON much easier to read).
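The same inspection can be done locally with the standard json module. This is a small sketch; the sample string is made up for illustration and only mimics the result, cs, and name keys used later in this article.

```python
import json


def pretty(text):
    """Re-serialize a JSON string with indentation and readable non-ASCII text."""
    return json.dumps(json.loads(text), ensure_ascii=False, indent=2)


# Made-up sample; the real input would be response.text from the request above.
sample = '{"result": {"cs": [{"name": "\\u8304\\u5b50"}]}}'
print(pretty(sample))
```

Passing ensure_ascii=False keeps Chinese text readable instead of printing it as \uXXXX escapes.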

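It is indeed the category data we wanted, so we move on to the next step.

Opening the three packets shows that their headers and POST method are identical; only the request url and data differ. The request function we just wrote can therefore be reused.

The first request gives us the category index. Now let's see how to get the list view for a specific ingredient, such as eggplant. In the data sent by the second packet, some characters look garbled. That is just URL encoding (percent-encoding), and decoding it reveals the original text.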
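To illustrate the decoding, the sketch below parses a form-encoded body with the standard library. The body string is an assumed example assembled from the fields used in this article (client, keyword, _vs), not a verbatim capture.

```python
from urllib.parse import parse_qsl, urlencode

# Assumed example of a captured request body; the keyword is percent-encoded UTF-8.
body = "client=4&keyword=%E8%8C%84%E5%AD%90&_vs=400"

# parse_qsl splits the pairs and decodes the percent-encoding.
fields = dict(parse_qsl(body))
print(fields)

# urlencode performs the reverse step and reproduces the original body.
print(urlencode(fields))
```

When a dict is passed as data to requests.post, requests applies this encoding automatically, so the script can work with the readable keyword directly.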

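It turns out that data simply carries the name of the ingredient. Where does that name come from? From the JSON we just received. So we need a function that walks the JSON structure, extracts the names, and builds the data.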
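Before running this against the live response, the traversal can be tried offline. The sketch below assumes the three-level result > cs > cs > cs layout that the code in this article relies on; the sample catalog and the helper names are invented for illustration, and .get is used so that a branch without children is skipped instead of raising KeyError.

```python
def ingredient_names(catalog):
    """Yield the names found at the third level of the category tree."""
    for top in catalog.get("result", {}).get("cs", []):
        for middle in top.get("cs", []):
            for leaf in middle.get("cs", []):
                yield leaf["name"]


def search_payload(name):
    """Build the form data for one ingredient search."""
    return {"client": "4", "keyword": name, "_vs": "400"}


# Invented sample with the assumed layout; one branch has no children.
sample_catalog = {
    "result": {
        "cs": [
            {"name": "group A", "cs": [
                {"name": "vegetables", "cs": [{"name": "eggplant"}, {"name": "tomato"}]},
                {"name": "empty branch"},
            ]},
        ]
    }
}

payloads = [search_payload(n) for n in ingredient_names(sample_catalog)]
print(payloads)
```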

Continue as follows:

import requests
import json

def douguo_request(url,data):
    # Headers we do not need are commented out, since they might let the server detect that we are sending repeated requests
    headers = {
        "client": "4",
        "version": "6945.4",
        "device": "OPPO R11",
        "sdk": "19,4.4.2",
        "imei": "866174010601603",
        "channel": "baidu",
        #"mac": "3C:A0:67:68:D1:F5",
        "resolution": "1280*720",
        "dpi": "1.5",
        #"android-id": "3ca06768d1f58615",
        #"pseudo-id": "768d1f586153ca06",
        "brand": "OPPO",
        "scale": "1.5",
        "timezone": "28800",
        "language": "zh",
        "cns": "3",
        "carrier": "CMCC",
        #"imsi": " 460076016067682",
        "User-Agent": "Mozilla/5.0 (Linux; Android 4.4.2; OPPO R11 Build/NMF26X) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/30.0.0.0 Mobile Safari/537.36",
        "act-code": "9d77be448da77d86aa48ae9d822d70d0",
        "act-timestamp": "1569203760",
        "uuid": "b2db10e9-cb21-4c36-ba4c-be62d8b3c67c",
        "newbie": "1",
        "reach": "10000",
        #"lon": "112.573081",
        #"lat": "37.735522",
        #"cid": "140100",
        "Content-Type": "application/x-www-form-urlencoded; charset=utf-8",
        "Accept-Encoding": "gzip, deflate",
        "Connection": "Keep-Alive",
        #"Cookie": "duid=61248941",
        "Host": "api.douguo.net",
        #"Content-Length": "68",
    }

    response = requests.post(url=url,data=data,headers=headers)
    return response

def douguo_index():
    url = 'http://api.douguo.net/recipe/flatcatalogs'
    data = {
            "client":"4",
            #"_session" : "1568947372977863254011601605",
            #"v" : "1568891837",
            "_vs" : "2305"
    }
    response = douguo_request(url, data)
    # Convert the JSON text into a dict
    response_dict = json.loads(response.text)
    for item1 in response_dict['result']['cs']:
        for item2 in item1['cs']:
            for item3 in item2['cs']:
                data={
                    "client": "4",
                    # "_session": "1568947372977863254011601605",
                    "keyword": item3['name'],
                    "_vs": "400"
                }
                print(data)

douguo_index()

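Running it produces the output shown in the screenshot below.

We have now built the data to send.

Next we write another function to fetch the recipe list for a specific ingredient (we take only the 10 most popular recipes per ingredient). For now we test with the first ingredient returned, eggplant; later we will use a queue to scrape with multiple threads.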

import requests
import json

def douguo_request(url,data):
    # Headers we do not need are commented out, since they might let the server detect that we are sending repeated requests
    headers = {
        "client": "4",
        "version": "6945.4",
        "device": "OPPO R11",
        "sdk": "19,4.4.2",
        "imei": "866174010601603",
        "channel": "baidu",
        #"mac": "3C:A0:67:68:D1:F5",
        "resolution": "1280*720",
        "dpi": "1.5",
        #"android-id": "3ca06768d1f58615",
        #"pseudo-id": "768d1f586153ca06",
        "brand": "OPPO",
        "scale": "1.5",
        "timezone": "28800",
        "language": "zh",
        "cns": "3",
        "carrier": "CMCC",
        #"imsi": " 460076016067682",
        "User-Agent": "Mozilla/5.0 (Linux; Android 4.4.2; OPPO R11 Build/NMF26X) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/30.0.0.0 Mobile Safari/537.36",
        "act-code": "9d77be448da77d86aa48ae9d822d70d0",
        "act-timestamp": "1569203760",
        "uuid": "b2db10e9-cb21-4c36-ba4c-be62d8b3c67c",
        "newbie": "1",
        "reach": "10000",
        #"lon": "112.573081",
        #"lat": "37.735522",
        #"cid": "140100",
        "Content-Type": "application/x-www-form-urlencoded; charset=utf-8",
        "Accept-Encoding": "gzip, deflate",
        "Connection": "Keep-Alive",
        #"Cookie": "duid=61248941",
        "Host": "api.douguo.net",
        #"Content-Length": "68",
    }

    response = requests.post(url=url,data=data,headers=headers)
    return response

def douguo_index():
    url = 'http://api.douguo.net/recipe/flatcatalogs'
    data = {
            "client":"4",
            #"_session" : "1568947372977863254011601605",
            #"v" : "1568891837",
            "_vs" : "2305"
    }
    response = douguo_request(url, data)
    # Convert the JSON text into a dict
    response_dict = json.loads(response.text)
    for item1 in response_dict['result']['cs']:
        for item2 in item1['cs']:
            for item3 in item2['cs']:
                data={
                    "client": "4",
                    # "_session": "1568947372977863254011601605",
                    "keyword": item3['name'],
                    "_vs": "400"
                }
                return data

def douguo_item(data):
    print('Processing ingredient:', data['keyword'])
    url = 'http://api.douguo.net/search/universalnew/0/10'
    response = douguo_request(url=url,data=data)
    print(response.text)

douguo_item(douguo_index())

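This prints another JSON document, which we again inspect in the web viewer.

We now have the recipe's author, description, and similar fields, but not the full recipe details. The fields we do have can be copied into a dictionary with simple assignments.

What we still lack are the detailed ingredients and the cooking steps, so next we analyze the last packet, which provides them.

Again, only part of the data sent needs to change. We extend the program in the same way as before and assign the results to the dictionary.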

import requests
import json

def douguo_request(url,data):
    # Headers we do not need are commented out, since they might let the server detect that we are sending repeated requests
    headers = {
        "client": "4",
        "version": "6945.4",
        "device": "OPPO R11",
        "sdk": "19,4.4.2",
        "imei": "866174010601603",
        "channel": "baidu",
        #"mac": "3C:A0:67:68:D1:F5",
        "resolution": "1280*720",
        "dpi": "1.5",
        #"android-id": "3ca06768d1f58615",
        #"pseudo-id": "768d1f586153ca06",
        "brand": "OPPO",
        "scale": "1.5",
        "timezone": "28800",
        "language": "zh",
        "cns": "3",
        "carrier": "CMCC",
        #"imsi": " 460076016067682",
        "User-Agent": "Mozilla/5.0 (Linux; Android 4.4.2; OPPO R11 Build/NMF26X) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/30.0.0.0 Mobile Safari/537.36",
        "act-code": "9d77be448da77d86aa48ae9d822d70d0",
        "act-timestamp": "1569203760",
        "uuid": "b2db10e9-cb21-4c36-ba4c-be62d8b3c67c",
        "newbie": "1",
        "reach": "10000",
        #"lon": "112.573081",
        #"lat": "37.735522",
        #"cid": "140100",
        "Content-Type": "application/x-www-form-urlencoded; charset=utf-8",
        "Accept-Encoding": "gzip, deflate",
        "Connection": "Keep-Alive",
        #"Cookie": "duid=61248941",
        "Host": "api.douguo.net",
        #"Content-Length": "68",
    }

    response = requests.post(url=url,data=data,headers=headers)
    return response

def douguo_index():
    url = 'http://api.douguo.net/recipe/flatcatalogs'
    data = {
            "client":"4",
            #"_session" : "1568947372977863254011601605",
            #"v" : "1568891837",
            "_vs" : "2305"
    }
    response = douguo_request(url, data)
    # Convert the JSON text into a dict
    response_dict = json.loads(response.text)
    for item1 in response_dict['result']['cs']:
        for item2 in item1['cs']:
            for item3 in item2['cs']:
                data={
                    "client": "4",
                    # "_session": "1568947372977863254011601605",
                    "keyword": item3['name'],
                    "_vs": "400"
                }
                return data

def douguo_item(data):
    item_index_number = 0
    print('Processing ingredient:', data['keyword'])
    url = 'http://api.douguo.net/search/universalnew/0/10'
    list_response = douguo_request(url=url,data=data)
    list_response_dict = json.loads(list_response.text)
    for item in list_response_dict['result']['recipe']['recipes']:
        item_index_number = item_index_number + 1
        # Create a dict to hold the data for this recipe
        caipu_info = {}
        caipu_info['shicai'] = data['keyword']
        caipu_info['author'] = item['an']
        caipu_info['shicai_id'] = item['id']
        caipu_info['shicai_name'] = item['n']
        caipu_info['describe'] = item['cookstory']
        caipu_info['cailiao_list'] = item['major']
        # Further details require another request
        detail_url='http://api.douguo.net/recipe/detail/'+str(caipu_info['shicai_id'])
        detail_data = {
            "client": "4",
            "_session": "1569204243934866174010601603",
            "author_id": "0",
            "_vs": "11101",
            # The last field must be filled in per recipe; the keyword is a JSON string, so it needs its own double quotes
            "_ext":'{"query":{"id":'+str(caipu_info['shicai_id'])+',"kw":"'+str(caipu_info['shicai'])+'","idx":'+str(item_index_number)+',"src":"11101","type":"13"}}'
        }
        detail_response = douguo_request(url=detail_url,data=detail_data)
        detail_response_dict = json.loads(detail_response.text)
        caipu_info['tips'] = detail_response_dict['result']['recipe']['tips']
        caipu_info['cook_step'] = detail_response_dict['result']['recipe']['cookstep']
        print(caipu_info)

douguo_item(douguo_index())
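The _ext field in the listing above is assembled by string concatenation, which makes it easy to misplace a quote. As a hedged alternative, the JSON text can be generated with json.dumps. The helper name build_ext is hypothetical, and the sketch assumes the server accepts compact JSON with id and idx sent as numbers, as in the concatenated version.

```python
import json


def build_ext(recipe_id, keyword, index):
    """Return the _ext value as compact JSON text."""
    query = {
        "id": recipe_id,      # numeric recipe id
        "kw": keyword,        # search keyword, quoted and escaped by json.dumps
        "idx": index,         # position of the recipe in the result list
        "src": "11101",
        "type": "13",
    }
    # separators removes the spaces json.dumps inserts by default;
    # ensure_ascii=False keeps the keyword as readable text.
    return json.dumps({"query": query}, ensure_ascii=False, separators=(",", ":"))


print(build_ext(123456, "eggplant", 1))
```

Because json.dumps handles quoting and escaping, a keyword that contains a double quote or a backslash still produces valid JSON.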

Most of the work is now done: we can retrieve all the information for an ingredient's recipes. Only the last step remains.


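5) Scrape with multiple threads and save the results to a local database

You will have noticed that although the scrape succeeded, it only covered eggplant. There are many more ingredients and recipes, so how do we handle them? By using a thread pool to scrape with multiple threads, which improves throughput.

Now for the code: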

import requests
import json
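# Import the queue (queue.Queue is the thread-safe queue intended for use with threads)
from queue import Queue
# Import the thread pool
from concurrent.futures import ThreadPoolExecutor

# Create the queue
queue_list = Queue()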

def douguo_request(url,data):
    # Headers we do not need are commented out, since they might let the server detect that we are sending repeated requests
    headers = {
        "client": "4",
        "version": "6945.4",
        "device": "OPPO R11",
        "sdk": "19,4.4.2",
        "imei": "866174010601603",
        "channel": "baidu",
        #"mac": "3C:A0:67:68:D1:F5",
        "resolution": "1280*720",
        "dpi": "1.5",
        #"android-id": "3ca06768d1f58615",
        #"pseudo-id": "768d1f586153ca06",
        "brand": "OPPO",
        "scale": "1.5",
        "timezone": "28800",
        "language": "zh",
        "cns": "3",
        "carrier": "CMCC",
        #"imsi": " 460076016067682",
        "User-Agent": "Mozilla/5.0 (Linux; Android 4.4.2; OPPO R11 Build/NMF26X) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/30.0.0.0 Mobile Safari/537.36",
        "act-code": "9d77be448da77d86aa48ae9d822d70d0",
        "act-timestamp": "1569203760",
        "uuid": "b2db10e9-cb21-4c36-ba4c-be62d8b3c67c",
        "newbie": "1",
        "reach": "10000",
        #"lon": "112.573081",
        #"lat": "37.735522",
        #"cid": "140100",
        "Content-Type": "application/x-www-form-urlencoded; charset=utf-8",
        "Accept-Encoding": "gzip, deflate",
        "Connection": "Keep-Alive",
        #"Cookie": "duid=61248941",
        "Host": "api.douguo.net",
        #"Content-Length": "68",
    }

    response = requests.post(url=url,data=data,headers=headers)
    return response

def douguo_index():
    url = 'http://api.douguo.net/recipe/flatcatalogs'
    data = {
            "client":"4",
            #"_session" : "1568947372977863254011601605",
            #"v" : "1568891837",
            "_vs" : "2305"
    }
    response = douguo_request(url, data)
    # Convert the JSON text into a dict
    response_dict = json.loads(response.text)
    for item1 in response_dict['result']['cs']:
        for item2 in item1['cs']:
            for item3 in item2['cs']:
                data={
                    "client": "4",
                    # "_session": "1568947372977863254011601605",
                    "keyword": item3['name'],
                    "_vs": "400"
                }
                # Add the data to the queue with put
                queue_list.put(data)

def douguo_item(data):
    item_index_number = 0
    print('Processing ingredient:', data['keyword'])
    url = 'http://api.douguo.net/search/universalnew/0/10'
    list_response = douguo_request(url=url,data=data)
    list_response_dict = json.loads(list_response.text)
    for item in list_response_dict['result']['recipe']['recipes']:
        item_index_number = item_index_number + 1
        # Create a dict to hold the data for this recipe
        caipu_info = {}
        caipu_info['shicai'] = data['keyword']
        caipu_info['author'] = item['an']
        caipu_info['shicai_id'] = item['id']
        caipu_info['shicai_name'] = item['n']
        caipu_info['describe'] = item['cookstory']
        caipu_info['cailiao_list'] = item['major']
        # Further details require another request
        detail_url='http://api.douguo.net/recipe/detail/'+str(caipu_info['shicai_id'])
        detail_data = {
            "client": "4",
            "_session": "1569204243934866174010601603",
            "author_id": "0",
            "_vs": "11101",
            # The last field must be filled in per recipe; the keyword is a JSON string, so it needs its own double quotes
            "_ext":'{"query":{"id":'+str(caipu_info['shicai_id'])+',"kw":"'+str(caipu_info['shicai'])+'","idx":'+str(item_index_number)+',"src":"11101","type":"13"}}'
        }
        detail_response = douguo_request(url=detail_url,data=detail_data)
        detail_response_dict = json.loads(detail_response.text)
        caipu_info['tips'] = detail_response_dict['result']['recipe']['tips']
        caipu_info['cook_step'] = detail_response_dict['result']['recipe']['cookstep']
        print('Processing recipe:',caipu_info['shicai_name'])
        print(caipu_info)

douguo_index()
# Number of tasks processed concurrently
pool = ThreadPoolExecutor(max_workers=25)
while not queue_list.empty():
    # Pass douguo_item itself (no parentheses) so that the pool calls it with the queued data
    pool.submit(douguo_item,queue_list.get())
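One caveat with pool.submit is that an exception raised inside a worker is stored on the returned future and never printed unless the future is checked. The sketch below shows one way to collect results and failures; run_all and fake_worker are hypothetical names, and fake_worker merely stands in for douguo_item so that the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_all(worker, jobs, max_workers=25):
    """Run worker(job) for every job and separate successes from failures."""
    results, failures = [], []
    # The with block waits for all submitted tasks before the pool shuts down.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, job): job for job in jobs}
        for future in as_completed(futures):
            job = futures[future]
            try:
                # result() re-raises any exception from the worker thread.
                results.append((job, future.result()))
            except Exception as exc:
                failures.append((job, exc))
    return results, failures


def fake_worker(job):
    """Stand-in for douguo_item: fails for one keyword, succeeds otherwise."""
    if job["keyword"] == "bad":
        raise ValueError("simulated failure")
    return len(job["keyword"])


jobs = [{"keyword": k} for k in ("eggplant", "tomato", "bad")]
results, failures = run_all(fake_worker, jobs, max_workers=3)
print(len(results), "succeeded;", len(failures), "failed")
```

In the real script, the failed jobs could be logged or retried, for example when a response is missing the expected keys.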

Run it again and it now scrapes all the recipes. The only thing left is to save the data to the local database.

First we write a storage module, douguo_mongo.py:

import pymongo
from pymongo.collection import Collection

class Connect_mongo(object):
    def __init__(self):
        # The local database host is 127.0.0.1 and the port is 27017; check your MongoDB instance to confirm
        self.client = pymongo.MongoClient(host='127.0.0.1',port=27017)
        # Database name
        self.dbdata = self.client['douguomeishi']

    def insert_item(self,item):
        # The collection name is douguo_item
        db_collection = Collection(self.dbdata,'douguo_item')
        # insert_one replaces Collection.insert, which was removed in PyMongo 4
        db_collection.insert_one(item)

mongo_info = Connect_mongo()
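Running the crawler more than once with a plain insert stores every recipe again. If duplicates are unwanted, one option is an upsert keyed on the recipe id. The sketch below is an assumption about how that could look, not the author's method: save_recipe and open_collection are hypothetical names, and replace_one with upsert=True is the PyMongo call that inserts the document when no match exists and overwrites it otherwise.

```python
def save_recipe(collection, recipe):
    """Insert the recipe, or overwrite the stored copy with the same id."""
    return collection.replace_one(
        {"shicai_id": recipe["shicai_id"]},  # match on the recipe id field used in this article
        recipe,
        upsert=True,
    )


def open_collection():
    """Open the collection used in this article (requires a running MongoDB)."""
    import pymongo  # imported here so save_recipe can be used without pymongo installed

    client = pymongo.MongoClient(host="127.0.0.1", port=27017)
    return client["douguomeishi"]["douguo_item"]
```

In the crawler, save_recipe(open_collection(), caipu_info) would take the place of the plain insert; in practice the collection would be opened once and reused across calls.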

Then, in the crawler file, we import the storage object we just wrote and call its insert method.

import requests
import json
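# Import the queue (queue.Queue is the thread-safe queue intended for use with threads)
from queue import Queue
# Import the thread pool
from concurrent.futures import ThreadPoolExecutor
# Import the storage object from douguo_mongo.py
from douguo_mongo import mongo_info

# Create the queue
queue_list = Queue()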

def douguo_request(url,data):
    # Headers we do not need are commented out, since they might let the server detect that we are sending repeated requests
    headers = {
        "client": "4",
        "version": "6945.4",
        "device": "OPPO R11",
        "sdk": "19,4.4.2",
        "imei": "866174010601603",
        "channel": "baidu",
        #"mac": "3C:A0:67:68:D1:F5",
        "resolution": "1280*720",
        "dpi": "1.5",
        #"android-id": "3ca06768d1f58615",
        #"pseudo-id": "768d1f586153ca06",
        "brand": "OPPO",
        "scale": "1.5",
        "timezone": "28800",
        "language": "zh",
        "cns": "3",
        "carrier": "CMCC",
        #"imsi": " 460076016067682",
        "User-Agent": "Mozilla/5.0 (Linux; Android 4.4.2; OPPO R11 Build/NMF26X) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/30.0.0.0 Mobile Safari/537.36",
        "act-code": "9d77be448da77d86aa48ae9d822d70d0",
        "act-timestamp": "1569203760",
        "uuid": "b2db10e9-cb21-4c36-ba4c-be62d8b3c67c",
        "newbie": "1",
        "reach": "10000",
        #"lon": "112.573081",
        #"lat": "37.735522",
        #"cid": "140100",
        "Content-Type": "application/x-www-form-urlencoded; charset=utf-8",
        "Accept-Encoding": "gzip, deflate",
        "Connection": "Keep-Alive",
        #"Cookie": "duid=61248941",
        "Host": "api.douguo.net",
        #"Content-Length": "68",
    }

    response = requests.post(url=url,data=data,headers=headers)
    return response

def douguo_index():
    url = 'http://api.douguo.net/recipe/flatcatalogs'
    data = {
            "client":"4",
            #"_session" : "1568947372977863254011601605",
            #"v" : "1568891837",
            "_vs" : "2305"
    }
    response = douguo_request(url, data)
    # Convert the JSON text into a dict
    response_dict = json.loads(response.text)
    for item1 in response_dict['result']['cs']:
        for item2 in item1['cs']:
            for item3 in item2['cs']:
                data={
                    "client": "4",
                    # "_session": "1568947372977863254011601605",
                    "keyword": item3['name'],
                    "_vs": "400"
                }
                # Add the data to the queue with put
                queue_list.put(data)

def douguo_item(data):
    item_index_number = 0
    print('Processing ingredient:', data['keyword'])
    url = 'http://api.douguo.net/search/universalnew/0/10'
    list_response = douguo_request(url=url,data=data)
    list_response_dict = json.loads(list_response.text)
    for item in list_response_dict['result']['recipe']['recipes']:
        item_index_number = item_index_number + 1
        # Create a dict to hold the data for this recipe
        caipu_info = {}
        caipu_info['shicai'] = data['keyword']
        caipu_info['author'] = item['an']
        caipu_info['shicai_id'] = item['id']
        caipu_info['shicai_name'] = item['n']
        caipu_info['describe'] = item['cookstory']
        caipu_info['cailiao_list'] = item['major']
        # Further details require another request
        detail_url='http://api.douguo.net/recipe/detail/'+str(caipu_info['shicai_id'])
        detail_data = {
            "client": "4",
            "_session": "1569204243934866174010601603",
            "author_id": "0",
            "_vs": "11101",
            # The last field must be filled in per recipe; the keyword is a JSON string, so it needs its own double quotes
            "_ext":'{"query":{"id":'+str(caipu_info['shicai_id'])+',"kw":"'+str(caipu_info['shicai'])+'","idx":'+str(item_index_number)+',"src":"11101","type":"13"}}'
        }
        detail_response = douguo_request(url=detail_url,data=detail_data)
        detail_response_dict = json.loads(detail_response.text)
        caipu_info['tips'] = detail_response_dict['result']['recipe']['tips']
        caipu_info['cook_step'] = detail_response_dict['result']['recipe']['cookstep']
        print('Processing recipe:',caipu_info['shicai_name'])
        # Save to the database
        mongo_info.insert_item(caipu_info)

douguo_index()
# Number of tasks processed concurrently
pool = ThreadPoolExecutor(max_workers=25)
while not queue_list.empty():
    # Pass douguo_item itself (no parentheses) so that the pool calls it with the queued data
    pool.submit(douguo_item,queue_list.get())

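Run it, and the data is saved to the database.

That is the whole process of scraping an app. It is a little tedious but also fun, and working through the data step by step is a nice challenge.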

 

