Preface
This post records how to crack the font-based anti-scraping used by Autohome (汽車之家). The complete code is at the end.
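Analyzing the page
First, look at how the page renders: the affected characters all appear as "".
Then look at the page source, which shows "".
This is clearly font-based anti-scraping, so let's work through the font file step by step.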
Inspecting the font file
First, download the font file to your local machine.
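The download step needs the font's URL first. Below is a minimal sketch of pulling it out of the page HTML, assuming the page references the font through a protocol-relative `//....ttf` URL inside a `url('...') format` declaration, as the regular expression in the full source at the end suggests. The helper name `find_font_url` and the sample HTML are hypothetical.

```python
import re

# Assumed shape of the declaration: url('//host/path/font.ttf') format('...')
FONT_URL_RE = re.compile(r"url\('(//[^']+?\.ttf)'\)\s*format")


def find_font_url(page_html):
    """Return the absolute https URL of the first .ttf font, or None."""
    match = FONT_URL_RE.search(page_html)
    if match is None:
        return None
    # The page uses a protocol-relative URL, so prepend the scheme.
    return "https:" + match.group(1)


# Hypothetical sample, not copied from the real page.
sample = (
    "@font-face{src:url('//example.com/a.eot'),"
    "url('//example.com/fonts/demo.ttf') format('woff');}"
)
print(find_font_url(sample))
```

The returned URL can then be fetched with any HTTP client and written to disk in binary mode.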
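Use an online tool to inspect the contents of the font file (online viewer link).
Open it and take a look. Do these codes look familiar? They are the same codes that appear in the page source above.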
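Cracking the font file
Use fontTools to process the font file.
# Install fonttools
pip3 install fonttools
Reading the font's glyph table
# Parse the font file
from fontTools.ttLib import TTFont
font = TTFont('ChcCQ1sUz1WATUkcAABj5B4DyuQ43..ttff')
# Read the list of glyph names
uni_list = font.getGlyphOrder()
print(uni_list)
Output (the first entry is the blank glyph and is sliced off below):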
['.notdef', 'uniEDD2', 'uniEC30', 'uniED71', 'uniECBD', 'uniED0F', 'uniEC5C', 'uniED9C', 'uniEDEE', 'uniED3B', 'uniED8D', 'uniECD9', 'uniEC26', 'uniEC78', 'uniEDB8', 'uniED05', 'uniED57', 'uniECA3', 'uniECF5', 'uniEC42', 'uniED82', 'uniEDD4', 'uniED21', 'uniEC6D', 'uniECBF', 'uniEE00', 'uniEC5D', 'uniED9E', 'uniECEB', 'uniED3C', 'uniEC89', 'uniEDCA', 'uniEC27', 'uniED68', 'uniEDBA', 'uniED06', 'uniEC53', 'uniECA5', 'uniEDE5']
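Each glyph name in this list is `uni` followed by a hexadecimal code point. The sketch below shows how such a name relates to the entity form that the replacement code later in this post searches for (`&#x` + lowercase hex + `;`). The helper names are hypothetical, and the entity form is inferred from that replacement code rather than copied from the page.

```python
def _code_point(glyph_name):
    """Parse the hex code point out of a 'uniXXXX' glyph name."""
    if not glyph_name.startswith("uni"):
        raise ValueError("not a uniXXXX glyph name: %r" % glyph_name)
    return int(glyph_name[3:], 16)


def glyph_name_to_entity(glyph_name):
    """'uniED8D' -> '&#xed8d;' (lowercase hex, as used in the replacement step)."""
    return "&#x%x;" % _code_point(glyph_name)


def glyph_name_to_char(glyph_name):
    """'uniED8D' -> the single private-use character U+ED8D."""
    return chr(_code_point(glyph_name))


print(glyph_name_to_entity("uniED8D"))
print(repr(glyph_name_to_char("uniED8D")))
```

These code points fall in Unicode's Private Use Area, so without the site's font they carry no meaning of their own.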
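Building the glyph-to-character mapping
The prerequisite is a hand-written list of characters, in the same order as the glyphs (I don't know how to obtain this list automatically; advice is welcome).
word_list = [
    "壞", "少", "遠", "大", "九", "左", "近", "呢", "十", "高",
    "着", "矮", "八", "二", "右", "是", "得", "的", "小", "短",
    "很", "一", "了", "地", "好", "多", "七", "不", "長", "低",
    "三", "五", "六", "下", "更", "和", "四", "上"
]
# Strip the "uni" prefix from each glyph name and lowercase the hex code
utf_list = [uni[3:].lower() for uni in uni_list[1:]]
# Map each code to its character
utf_word_map = dict(zip(utf_list, word_list))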
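Because `zip` silently stops at the shorter input, a hand-written list that is one character short would produce a mapping that is missing entries without any error. The following is a small sketch of the same mapping step with a length check added. The function name `build_utf_word_map` is hypothetical, and it filters out `.notdef` by name instead of slicing off the first entry, which is a slight variation on the code above.

```python
def build_utf_word_map(glyph_order, word_list):
    """Map lowercase hex codes to characters, checking that the lengths agree."""
    # Drop the blank glyph by name rather than by position.
    glyph_names = [name for name in glyph_order if name != ".notdef"]
    if len(glyph_names) != len(word_list):
        raise ValueError(
            "font has %d glyphs but word_list has %d entries"
            % (len(glyph_names), len(word_list))
        )
    # 'uniEDD2' -> 'edd2'
    return {name[3:].lower(): word for name, word in zip(glyph_names, word_list)}


# Shortened example using the first three glyphs and characters from this post.
demo_map = build_utf_word_map(
    [".notdef", "uniEDD2", "uniEC30", "uniED71"],
    ["壞", "少", "遠"],
)
print(demo_map)
```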
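Replacing the source and extracting the content
The approach here is to replace the codes in the page source first and then extract the content.
import requests
from lxml import etree

# Request the page (url and headers are assumed to be defined already)
response = requests.get(url, headers=headers)
html = response.text
# Replace every font code in the source with its real character
for utf_code, word in utf_word_map.items():
    html = html.replace("&#x%s;" % utf_code, word)
# Use XPath to extract the main post
xp_html = etree.HTML(html)
subject_text = ''.join(xp_html.xpath('//div[@xname="content"]//div[@class="tz-paragraph"]//text()'))
print(subject_text)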
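The loop above calls `str.replace` once per mapping entry, so it scans the whole page once per glyph. As an alternative sketch (not the method used in this post), a single `re.sub` pass with a callback can do the same substitution and leave any entity that is not in the mapping untouched. The names `ENTITY_RE` and `replace_entities` are hypothetical.

```python
import re

# Matches hexadecimal character references such as &#xedd2;
ENTITY_RE = re.compile(r"&#x([0-9a-fA-F]+);")


def replace_entities(html, utf_word_map):
    """Replace known font codes in one pass; keep unknown entities unchanged."""
    def substitute(match):
        code = match.group(1).lower()
        # Fall back to the original text when the code is not in the map.
        return utf_word_map.get(code, match.group(0))

    return ENTITY_RE.sub(substitute, html)


demo_html = "<p>a&#xedd2;b&#x4e2d;</p>"
print(replace_entities(demo_html, {"edd2": "壞"}))
```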
Output (the font has been cracked successfully):
上個禮拜六碳罐索賠成功,更換時一直在傍邊,將整個后橋拆掉,然后更換。換完后回家發現后邊排氣管有“突突”聲,沒換之前沒有。正常嗎?
Full source for this page
# -*- coding: utf-8 -*-
# @Author: Mehaei
# @Date: 2020-01-09 10:01:59
# @Last Modified by: Mehaei
# @Last Modified time: 2020-01-10 11:52:19
import re
import requests
from lxml import etree
from fontTools.ttLib import TTFont


class NotFoundFontFileUrl(Exception):
    pass


class CarHomeFont(object):
    def __init__(self, url, *args, **kwargs):
        self.download_ttf_name = 'norm_font.ttf'
        self._download_ttf_file(url)
        self._making_code_map()

    def _download_ttf_file(self, url):
        self.page_html = self.download(url) or ""
        # Find the URL of the font file in the page
        font_file_name = (re.findall(r",url\('(//.*\.ttf)?'\) format", self.page_html) or [""])[0]
        if not font_file_name:
            raise NotFoundFontFileUrl("not found font file name")
        # Download the font file
        file_content = self.download("https:%s" % font_file_name, content=True)
        # Save the font file locally
        with open(self.download_ttf_name, 'wb') as f:
            f.write(file_content)
        print("font file download success")

    def _making_code_map(self):
        font = TTFont(self.download_ttf_name)
        uni_list = font.getGlyphOrder()
        # Strip the "uni" prefix and lowercase the hex code
        utf_list = [uni[3:].lower() for uni in uni_list[1:]]
        # Characters that the font codes stand for, in glyph order
        word_list = [
            "壞", "少", "遠", "大", "九", "左", "近", "呢", "十", "高",
            "着", "矮", "八", "二", "右", "是", "得", "的", "小", "短",
            "很", "一", "了", "地", "好", "多", "七", "不", "長", "低",
            "三", "五", "六", "下", "更", "和", "四", "上"
        ]
        self.utf_word_map = dict(zip(utf_list, word_list))

    def repalce_source_code(self):
        replaced_html = self.page_html
        for utf_code, word in self.utf_word_map.items():
            replaced_html = replaced_html.replace("&#x%s;" % utf_code, word)
        return replaced_html

    def get_subject_content(self):
        normal_html = self.repalce_source_code()
        # Use XPath to extract the main post
        xp_html = etree.HTML(normal_html)
        subject_text = ''.join(xp_html.xpath('//div[@xname="content"]//div[@class="tz-paragraph"]//text()'))
        return subject_text

    def download(self, url, *args, try_time=5, method="GET", content=False, **kwargs):
        kwargs.setdefault("headers", {})
        kwargs["headers"].update({"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"})
        while try_time:
            try:
                response = requests.request(method.upper(), url, *args, **kwargs)
                if response.ok:
                    if content:
                        return response.content
                    return response.text
            except Exception as e:
                print("download error: %s" % e)
            # Count every failed attempt, including non-OK responses,
            # so the loop cannot retry forever
            try_time -= 1


if __name__ == "__main__":
    url = "https://club.autohome.com.cn/bbs/thread/62c48ae0f0ae73ef/75904283-1.html"
    car = CarHomeFont(url)
    text = car.get_subject_content()
    print(text)
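Follow-up
I thought this was the end of it, but it turns out this scraper only works on this one page. On any other page the output is garbled again.
The next post explains how to handle that case.
Click here to view: cracking dynamic font files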