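More and more apps live on the phone, and most people now browse the web from their phones, so scraping mobile data has become a routine job for crawler developers.
The basic approach is to use a packet capture tool to record the traffic the phone generates while visiting web pages or using an app, and then parse that traffic.
Most mobile data is structured, mainly as JSON, so parsing is relatively easy. The hard part is capturing the traffic and finding the data you need in a large pile of unrelated packets.
There are many packet capture tools; a commonly used one is Fiddler.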
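Packet capture tool: Fiddler
Fiddler is an HTTP debugging proxy. It records all HTTP traffic between your computer and the internet and lets you inspect it, set breakpoints, and view all data going in and out of Fiddler.
Fiddler is simpler to use than many other network debuggers because it not only exposes the HTTP traffic but also presents it in a user-friendly format.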
1. Installation
Official Fiddler download link: https://www.telerik.com/download/fiddler
After the download finishes, follow the installer step by step.
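2. Configuring Fiddler
2.1 Allow capturing HTTPS traffic
Open Fiddler and go to Tools -> Options. On the HTTPS tab, check "Decrypt HTTPS traffic",
then check "Ignore server certificate errors" among the options that appear. Fiddler can now decode HTTPS packets; without this setting, those sessions only ever show up as "tunnel" entries.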

2.2 Allow external devices to send HTTP/HTTPS traffic to Fiddler
Likewise, on the Connections tab, check "Allow remote computers to connect" and note the port number shown there, 8888. It is needed later.
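Once remote connections are allowed, it can be handy to confirm from another machine on the network that Fiddler's port is actually reachable (a firewall on the PC is a common reason it is not). The following is a minimal sketch, not part of Fiddler itself: the function name `is_proxy_reachable` is made up for illustration, and the default port 8888 is simply the value noted above.

```python
import socket


def is_proxy_reachable(host: str, port: int = 8888, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError on refusal, timeout, or no route
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Call it with the IP address of the PC running Fiddler, for example `is_proxy_reachable("192.168.1.96")`. A `False` result usually means Fiddler is not running, remote connections are not enabled, or a firewall is blocking the port.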

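3. Configuring the phone
Before configuring the phone, keep one thing in mind: the computer and the phone must be on the same network, for example the same Wi-Fi network or a phone hotspot.
Once they are on the same network, you need the computer's IP address on that network. On Windows you can get it by running ipconfig in a command prompt, as shown in the figure.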
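If you would rather look up the address from a script than read the ipconfig output, the sketch below is one common way to do it in Python. It is a best-effort guess, not a guaranteed answer on machines with several network adapters; the function name `get_lan_ip` is made up here, and 8.8.8.8 is only used as an arbitrary routable address.

```python
import socket


def get_lan_ip() -> str:
    """Best-effort guess of this machine's LAN IPv4 address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket sends no packets; it only makes the OS
        # choose the outgoing interface, whose address we then read back.
        sock.connect(("8.8.8.8", 80))
        return sock.getsockname()[0]
    except OSError:
        # No usable route (for example, offline): fall back to loopback.
        return "127.0.0.1"
    finally:
        sock.close()
```

The returned address is the one to enter as the proxy host on the phone, together with Fiddler's port.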
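The steps below use an Android phone as an example.
Confirm that the phone and the PC are connected to the same LAN.
Open the phone's Settings, go to the WLAN settings, and long-press the connected wireless network to bring up the options dialog, as shown in the figure: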

Set the proxy to Manual, enter the IP address and port number obtained above, and tap Save. The phone is now configured.
3.1 Download the Fiddler security certificate
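In the Android phone's browser, open http://192.168.1.96:8888 (substitute your own computer's IP address and Fiddler's port), tap "FiddlerRoot certificate", and install the certificate, as shown in the figure: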

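Note: this certificate is installed on the phone. Without it, HTTPS traffic cannot be captured correctly.
If all goes well, open Fiddler, browse a web page or use an app on the phone, and Fiddler will start capturing the traffic.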

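Here you can see which packets the app sends and receives.
To narrow the list down to Zhihu (showing only the target's packets), add a filter condition.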

The session list then contains only requests to hosts that match the filter condition.
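The same host-based filtering idea can be applied outside Fiddler, for example to a list of URLs copied out of a capture. The sketch below only illustrates the matching logic; it does not reproduce Fiddler's filter feature. The function name `filter_by_host` and the example.com entry are invented for the example.

```python
from urllib.parse import urlsplit


def filter_by_host(urls, host_suffix):
    """Keep only URLs whose hostname is host_suffix or a subdomain of it."""
    kept = []
    for url in urls:
        hostname = urlsplit(url).hostname or ""
        if hostname == host_suffix or hostname.endswith("." + host_suffix):
            kept.append(url)
    return kept


captured = [
    "https://api.zhihu.com/topstory/hot-list?limit=10&reverse_order=0",
    "https://example.com/ads/track",  # hypothetical unrelated request
    "https://push2his.eastmoney.com/api/qt/stock/trends2/get",
]
print(filter_by_host(captured, "zhihu.com"))
```

Matching on the hostname rather than on the whole URL string avoids false hits when the target name merely appears somewhere in a path or query string.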

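4. Finding the packet
For example, tap the Hot List in the app.

The corresponding encrypted HTTPS packet is as follows:

The data in the packet is as follows:

Extract the URL: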

https://api.zhihu.com/topstory/hot-list?limit=10&reverse_order=0
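Note: when looking for the packet you need, check two things: the size shown in the Body column, and the content type shown at the start of the row. Most of the data we want is in sessions marked {json}; images are in sessions marked img.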
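The two rules of thumb above (content type and body size) can be written down as a small filter. This is only a sketch of the selection logic: the session records, their keys (`url`, `content_type`, `body_size`), the 1024-byte threshold, and the example.com entries are all assumptions made for illustration, not a format that Fiddler exports.

```python
def likely_data_sessions(sessions, min_body_bytes=1024):
    """Return JSON sessions with a non-trivial body, largest body first."""
    candidates = [
        s for s in sessions
        if "json" in s["content_type"].lower() and s["body_size"] >= min_body_bytes
    ]
    return sorted(candidates, key=lambda s: s["body_size"], reverse=True)


# Hypothetical session summaries, as one might note them down from the list
sessions = [
    {"url": "https://api.zhihu.com/topstory/hot-list",
     "content_type": "application/json", "body_size": 20480},
    {"url": "https://example.com/logo.png",
     "content_type": "image/png", "body_size": 51200},
    {"url": "https://example.com/ping",
     "content_type": "application/json", "body_size": 15},
    {"url": "https://example.com/config",
     "content_type": "application/json; charset=utf-8", "body_size": 2048},
]

for session in likely_data_sessions(sessions):
    print(session["body_size"], session["url"])
```

Large JSON responses float to the top, while images and tiny JSON replies such as heartbeats drop out.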
With the URL in hand, the next step is to write a program that fetches and saves the data.
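Before writing the crawler, it can help to split the captured URL into its base address and its query parameters, so that the parameters are visible and easy to vary. The sketch below uses only the standard library. The helper name `split_url` is made up, and whether the server accepts other values for `limit` or `reverse_order` is something you would have to test yourself.

```python
from urllib.parse import parse_qsl, urlsplit

captured_url = "https://api.zhihu.com/topstory/hot-list?limit=10&reverse_order=0"


def split_url(url):
    """Return (base_url, params) where params is a dict of query parameters."""
    parts = urlsplit(url)
    base = f"{parts.scheme}://{parts.netloc}{parts.path}"
    params = dict(parse_qsl(parts.query))
    return base, params


base, params = split_url(captured_url)
print(base)
print(params)
```

The two values can then be passed to requests as `requests.get(base, params=params)`, which rebuilds the same query string.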
5. Writing the crawler
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0',
}
url = "https://api.zhihu.com/topstory/hot-list?limit=10&reverse_order=0"

# Request the URL found in Fiddler and decode the JSON body
res = requests.get(url, headers=headers)
res.encoding = 'utf-8'
data = res.json()

# Each hot-list entry stores its title under target.title
for item in data['data']:
    print(item['target']['title'])

The result is as follows:

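The script above only prints the titles. To save them as well, one option is to separate extraction from output, as in the sketch below. It assumes the response layout used above (a `data` list whose items carry `target.title`); the function names `extract_titles` and `save_titles` and the choice of a JSON file are illustrative, not part of the original program.

```python
import json


def extract_titles(payload):
    """Collect target.title from each item under 'data', skipping items without one."""
    titles = []
    for item in payload.get("data", []):
        title = item.get("target", {}).get("title")
        if title:
            titles.append(title)
    return titles


def save_titles(titles, path):
    """Write the titles to a UTF-8 JSON file, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(titles, f, ensure_ascii=False, indent=2)
```

With the response from the crawler, this would be used as `save_titles(extract_titles(res.json()), "hot_list.json")`, where the file name is arbitrary.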
As another example, fetch the current day's trading records for a single stock from Eastmoney (東方財富網).
import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
}
url = 'https://push2his.eastmoney.com/api/qt/stock/trends2/get?secid=0.000651&fields1=f1%2Cf2%2Cf3%2Cf4%2Cf5%2Cf6%2Cf7%2Cf8%2Cf9%2Cf10%2Cf11%2Cf12%2Cf13%2Cf14&fields2=f51%2Cf53%2Cf54%2Cf55%2Cf56%2Cf57%2Cf58&iscr=0&iscca=0&ut=f057cbcbce2a86e2866ab8877db1d059&ndays=1'

# verify=False disables TLS certificate verification (requests will emit a warning)
res = requests.get(url=url, headers=headers, verify=False)
data = res.json()

# Print up to the first 30 records; slicing avoids an IndexError if fewer are returned
for trend in data['data']['trends'][:30]:
    print(trend)
# This produces the following data
2021-08-17 09:36,47.82,47.87,47.82,2412,11541700.00,47.923
2021-08-17 09:37,47.75,47.82,47.75,4438,21215542.00,47.906
2021-08-17 09:38,47.67,47.75,47.66,3942,18802944.00,47.882
2021-08-17 09:39,47.89,47.89,47.61,5406,25805845.00,47.861
2021-08-17 09:40,47.92,47.98,47.89,1341,6428628.00,47.864
2021-08-17 09:41,47.96,47.96,47.89,2773,13292278.00,47.868
2021-08-17 09:42,47.99,47.99,47.96,1525,7316794.00,47.872
2021-08-17 09:43,48.06,48.06,47.99,1842,8846337.00,47.878
2021-08-17 09:44,48.03,48.06,48.00,3042,14611700.00,47.888
2021-08-17 09:45,48.01,48.03,48.00,2128,10215539.00,47.892
2021-08-17 09:46,48.18,48.20,48.03,7288,35060015.00,47.919
2021-08-17 09:47,48.13,48.18,48.08,2552,12286559.00,47.928
2021-08-17 09:48,48.13,48.14,48.10,2558,12310755.00,47.936
2021-08-17 09:49,48.07,48.12,48.07,1568,7542189.00,47.940
2021-08-17 09:50,47.98,48.07,47.98,2322,11148561.00,47.942
2021-08-17 09:51,48.02,48.03,47.97,1370,6574020.00,47.943
2021-08-17 09:52,48.01,48.03,48.00,1093,5248116.00,47.944
2021-08-17 09:53,47.95,48.01,47.95,1830,8780509.00,47.945
2021-08-17 09:54,47.88,47.95,47.88,1346,6448633.00,47.945
2021-08-17 09:55,47.90,47.90,47.85,1453,6955471.00,47.943
2021-08-17 09:56,47.93,47.93,47.91,999,4787261.00,47.943
2021-08-17 09:57,47.95,47.95,47.93,1215,5824444.00,47.943
2021-08-17 09:58,47.99,47.99,47.95,1172,5622096.00,47.943
2021-08-17 09:59,47.98,48.00,47.97,881,4227343.00,47.944
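Each record is a comma-separated string, so it needs one more parsing step before it can be used for calculations. The sketch below splits a record into a timestamp and a list of numbers. Since the meaning of the individual columns is not stated here, the values are deliberately left unnamed, and the function name `parse_trend` is made up for the example.

```python
from datetime import datetime


def parse_trend(line):
    """Split one 'YYYY-MM-DD HH:MM,v1,v2,...' record into (datetime, [floats])."""
    timestamp, *values = line.split(",")
    when = datetime.strptime(timestamp, "%Y-%m-%d %H:%M")
    return when, [float(v) for v in values]


when, values = parse_trend("2021-08-17 09:36,47.82,47.87,47.82,2412,11541700.00,47.923")
print(when, values)
```

Applied to the whole response, this becomes `[parse_trend(t) for t in data['data']['trends']]`.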
6. Common problems
6.1 The phone cannot access the internet after the proxy is set
See the following links for solutions:
https://www.jianshu.com/p/b122eab059c4
https://blog.csdn.net/jss19940414/article/details/89875043
https://blog.csdn.net/jianglianye21/article/details/81743129
6.2 The phone is online and Fiddler captures traffic, but some apps cannot connect
See the following link for a solution:
https://www.cnblogs.com/lulianqi/p/11380794.html
