爬取一些網站的信息時,偶爾會碰到這樣一種情況:網頁瀏覽顯示是正常的,用python爬取下來是亂碼,F12用開發者模式查看網頁源代碼也是亂碼。這種一般是網站設置了字體反爬
一、58同城
用谷歌瀏覽器打開58同城:https://sz.58.com/chuzu/,按F12用開發者模式查看網頁源代碼,可以看到有些房屋出租標題和月租是亂碼,但是在網頁上瀏覽卻顯示是正常的。

用python爬取下來也是亂碼:

回到網頁上,右鍵查看網頁源代碼,搜索font-face關鍵字,可以看到一大串用base64加密的字符,把這些加密字符復制下來

在python中用base64對復制下來的加密字符進行解碼並保存為58.ttf
import base64 font_face='AAEAAAALAIAAAwAwR1NVQiCLJXoAAAE4AAAAVE9TLzL4XQjtAAABjAAAAFZjbWFwq8B/ZwAAAhAAAAIuZ2x5ZuWIN0cAAARYAAADdGhlYWQTmDvfAAAA4AAAADZoaGVhCtADIwAAALwAAAAkaG10eC7qAAAAAAHkAAAALGxvY2ED7gSyAAAEQAAAABhtYXhwARgANgAAARgAAAAgbmFtZTd6VP8AAAfMAAACanBvc3QFRAYqAAAKOAAAAEUAAQAABmb+ZgAABLEAAAAABGgAAQAAAAAAAAAAAAAAAAAAAAsAAQAAAAEAAOu1IchfDzz1AAsIAAAAAADYCHhnAAAAANgIeGcAAP/mBGgGLgAAAAgAAgAAAAAAAAABAAAACwAqAAMAAAAAAAIAAAAKAAoAAAD/AAAAAAAAAAEAAAAKADAAPgACREZMVAAObGF0bgAaAAQAAAAAAAAAAQAAAAQAAAAAAAAAAQAAAAFsaWdhAAgAAAABAAAAAQAEAAQAAAABAAgAAQAGAAAAAQAAAAEERAGQAAUAAAUTBZkAAAEeBRMFmQAAA9cAZAIQAAACAAUDAAAAAAAAAAAAAAAAAAAAAAAAAAAAAFBmRWQAQJR2n6UGZv5mALgGZgGaAAAAAQAAAAAAAAAAAAAEsQAABLEAAASxAAAEsQAABLEAAASxAAAEsQAABLEAAASxAAAEsQAAAAAABQAAAAMAAAAsAAAABAAAAaYAAQAAAAAAoAADAAEAAAAsAAMACgAAAaYABAB0AAAAFAAQAAMABJR2lY+ZPJpLnjqeo59kn5Kfpf//AACUdpWPmTyaS546nqOfZJ+Sn6T//wAAAAAAAAAAAAAAAAAAAAAAAAABABQAFAAUABQAFAAUABQAFAAUAAAABgAIAAEABQAKAAIABwADAAQACQAAAQYAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADAAAAAAAiAAAAAAAAAAKAACUdgAAlHYAAAAGAACVjwAAlY8AAAAIAACZPAAAmTwAAAABAACaSwAAmksAAAAFAACeOgAAnjoAAAAKAACeowAAnqMAAAACAACfZAAAn2QAAAAHAACfkgAAn5IAAAADAACfpAAAn6QAAAAEAACfpQAAn6UAAAAJAAAAAAAAACgAPgBmAJoAvgDoASQBOAF+AboAAgAA/+YEWQYnAAoAEgAAExAAISAREAAjIgATECEgERAhIFsBEAECAez+6/rs/v3IATkBNP7S/sEC6AGaAaX85v54/mEBigGB/ZcCcwKJAAABAAAAAAQ1Bi4ACQAAKQE1IREFNSURIQQ1/IgBW/6cAicBWqkEmGe0oPp7AAEAAAAABCYGJwAXAAApATUBPgE1NCYjIgc1NjMyFhUUAgcBFSEEGPxSAcK6fpSMz7y389Hym9j+nwLGqgHButl0hI2wx43iv5D+69b+pwQAAQAA/+YEGQYnACEAABMWMzI2NRAhIzUzIBE0ISIHNTYzMhYVEAUVHgEVFAAjIiePn8igu/5bgXsBdf7jo5CYy8bw/sqow/7T+tyHAQN7nYQBJqIBFP9uuVjPpf7QVwQSyZbR/wBSAAACAAAAAARoBg0ACgASAAABIxEjESE1ATMRMyERNDcjBgcBBGjGvv0uAq3jxv58BAQOLf4zAZL+bgGSfwP8/CACiUVaJlH9TwABAAD/5gQhBg0AGAAANxYzMjYQJiMiBxEhFSERNjMyBBUUACEiJ7GcqaDEx71bmgL6/bxXLPUBEv7a/v3Zbu5mswEppA4DE63+SgX42uH+6kAAAAACAAD/5gRbBicAFgAiAAABJiMiAgMzNjMyEhUUACMiABEQACEyFwEUFjMyNjU0JiMiBgP6eYTJ9AIFbvHJ8P7r1+z+8wFhASClXv1Qo4eAoJeLhKQFRj7+ov7R1f762eP+3AFxAVMBmgHjLfwBmdq8lKCytAAAAAABAAAAAARNBg0ABgAACQEjASE1IQRN/aLLAkD8+gPvBcn6NwVgrQAAAwAA/+YESgYnABUAHwApAAABJDU0JDMyFhUQBRUEERQEIyIkNRAlATQmIyIGFRQXNgEEFRQWMzI2NTQBtv7rAQTKufD+3wFT/un6zf7+AUwBnIJvaJLz+P78/uGoh4OkAy+B9avXyqD+/osEev7aweXitAEohwF7aHh9YcJlZ/7qdNhwkI9r4QAAAAACAAD/5gRGBicAFwAjAAA3FjMyEhEGJwYjIgA1NAAzMgAREAAhIicTFBYzMjY1NCYjIga5gJTQ5QICZvHD/wABGN/nAQT+sP7Xo3FxoI16pqWHfaTSSgFIAS4CAsIBDNbkASX+lf6l/lP+MjUEHJy3p3en274AAAAAABAAxgABAAAAAAABAA8AAAABAAAAAAACAAcADwABAAAAAAADAA8AFgABAAAAAAAEAA8AJQABAAAAAAAFAAsANAABAAAAAAAGAA8APwABAAAAAAAKACsATgABAAAAAAALABMAeQADAAEECQABAB4AjAADAAEECQACAA4AqgADAAEECQADAB4AuAADAAEECQAEAB4A1gADAAEECQAFABYA9AADAAEECQAGAB4BCgADAAEECQAKAFYBKAADAAEECQALACYBfmZhbmdjaGFuLXNlY3JldFJlZ3VsYXJmYW5nY2hhbi1zZWNyZXRmYW5nY2hhbi1zZWNyZXRWZXJzaW9uIDEuMGZhbmdjaGFuLXNlY3JldEdlbmVyYXRlZCBieSBzdmcydHRmIGZyb20gRm9udGVsbG8gcHJvamVjdC5odHRwOi8vZm9udGVsbG8uY29tAGYAYQBuAGcAYwBoAGEAbgAtAHMAZQBjAHIAZQB0AFIAZQBnAHUAbABhAHIAZgBhAG4AZwBjAGgAYQBuAC0AcwBlAGMAcgBlAHQAZgBhAG4AZwBjAGgAYQBuAC0AcwBlAGMAcgBlAHQAVgBlAHIAcwBpAG8AbgAgADEALgAwAGYAYQBuAGcAYwBoAGEAbgAtAHMAZQBjAHIAZQB0AEcAZQBuAGUAcgBhAHQAZQBkACAAYgB5ACAAcwB2AGcAMgB0AHQAZgAgAGYAcgBvAG0AIABGAG8AbgB0AGUAbABsAG8AIABwAHIAbwBqAGUAYwB0AC4AaAB0AHQAcAA6AC8ALwBmAG8AbgB0AGUAbABsAG8ALgBjAG8AbQAAAAIAAAAAAAAAFAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACwECAQMBBAEFAQYBBwEIAQkBCgELAQwAAAAAAAAAAAAAAAAAAAAA' b = base64.b64decode(font_face) with open('58.ttf','wb') as f: f.write(b)
在網上搜索下載並安裝字體處理軟件FontCreator,用軟件打開保存的解碼文件58.ttf


現在我們可以得到解決問題的思路了:
1、獲取自定義字體和正常字體的映射表,比如:9F92對應的數字是2,9EA3對應的是1。
2、把頁面上的自定義字體替換成正常字體,這樣就可以正常爬取了。
怎樣來獲取字體映射表呢?靜態的還好,我們用FontCreator工具解析后,直接寫死到字典中。但是如果字體映射關系是動態的呢?比如,我們刷新當前頁面后,再來查看頁面源碼:

字體映射關系變了,這樣的話,就只能請求一次頁面,就獲取一次映射關系,用第三方庫fontTools來實現。
安裝fontTools庫,直接pip install fontTools
先來看下ttf文件中有哪些信息,直接打開ttf文件那當然看不了,把它轉換成xml文件就可以查看了
from fontTools.ttLib import TTFont font = TTFont('58.ttf') # 打開本地的ttf文件 font.saveXML('58.xml') # 轉換成xml
打開xml文件,可以看到類似html標簽的文件結構:

點開GlyphOrder標簽,可以看到Id和name

點開glyf標簽,看到的是name和一些坐標點,這些座標點就是描繪字體形狀的,這里不需要關注這些坐標點。

點開cmap標簽,是編碼和name的對應關系

從這張圖我們可以發現,glyph00001對應的是數字0,glyph00002對應的是數字1,以此類推......glyph00010對應的是數字9
用代碼來獲取編碼和name的對應關系:
from fontTools.ttLib import TTFont font = TTFont('58.ttf') #打開本地的ttf文件 bestcmap = font['cmap'].getBestCmap() print(bestcmap)
輸出如下:
{38006: 'glyph00006', 38287: 'glyph00008', 39228: 'glyph00001', 39499: 'glyph00005', 40506: 'glyph00010', 40611: 'glyph00002', 40804: 'glyph00007', 40850: 'glyph00003', 40868: 'glyph00004', 40869: 'glyph00009'}
輸出的是一個字典,key是編碼的int型
我們把這個字典轉一下,變成編碼和正常字體的映射關系:
import re from fontTools.ttLib import TTFont font = TTFont('58.ttf') #打開本地的ttf文件 bestcmap = font['cmap'].getBestCmap() newmap = dict() for key in bestcmap.keys(): value = int(re.search(r'(\d+)', bestcmap[key]).group(1)) - 1 key = hex(key) newmap[key] = value print(newmap)
輸出:
{'0x9476': 5, '0x958f': 7, '0x993c': 0, '0x9a4b': 4, '0x9e3a': 9, '0x9ea3': 1, '0x9f64': 6, '0x9f92': 2, '0x9fa4': 3, '0x9fa5': 8}
現在就可以把頁面上的自定義字體替換成正常字體,再解析了,全部代碼如下:
import requests import re import base64 import io from lxml import etree from fontTools.ttLib import TTFont url = r'https://sz.58.com/chuzu/' headers = { 'User-Agent':'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)' } response = requests.get(url=url,headers=headers) # 獲取加密字符串 base64_str = re.search("base64,(.*?)'\)",response.text).group(1) b = base64.b64decode(base64_str) font = TTFont(io.BytesIO(b)) bestcmap = font['cmap'].getBestCmap() newmap = dict() for key in bestcmap.keys(): value = int(re.search(r'(\d+)', bestcmap[key]).group(1)) - 1 key = hex(key) newmap[key] = value # 把頁面上自定義字體替換成正常字體 response_ = response.text for key,value in newmap.items(): key_ = key.replace('0x','&#x') + ';' if key_ in response_: response_ = response_.replace(key_,str(value)) # 獲取標題 rec = etree.HTML(response_) lis = rec.xpath('//ul[@class="listUl"]/li') for li in lis: title = li.xpath('./div[@class="des"]/h2/a/text()') if title: title = title[0] print(title)
