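The previous post dealt with font-based anti-scraping on a single page. This one records how to handle the harder case: a dynamic font file whose code points and glyph order differ from page to page.
The source code is at the end.
Calmly analyzing the page
Open a page and you find that the font file URL is dynamic. That part is easy: write a regular expression and the URL can be matched out each time.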
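As a rough illustration of that matching step, here is a minimal sketch. The HTML fragment and the helper name `find_font_url` are made up for the example; the pattern is modelled on the one in the full source at the end, and the real page's `@font-face` rule may look different.

```python
import re

# Hypothetical @font-face fragment; the real page markup may differ.
SAMPLE_HTML = (
    "@font-face{font-family:'myfont';"
    "src:url('//example.com/fonts/abc123.eot');"
    "src:url('//example.com/fonts/abc123.eot?#iefix') format('embedded-opentype')"
    ",url('//example.com/fonts/abc123.ttf') format('woff');}"
)

# Capture the protocol-relative .ttf URL that follows a comma
FONT_URL_RE = re.compile(r",url\('(//[^']+?\.ttf)'\) format")


def find_font_url(html):
    """Return the absolute font URL found in html, or None if there is none."""
    match = FONT_URL_RE.search(html)
    if match is None:
        return None
    return "https:" + match.group(1)


print(find_font_url(SAMPLE_HTML))
```

Returning `None` instead of an empty string makes the "no font found" case explicit for the caller.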
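First download the font file from a new page and compare the two, as shown in the figure
Ugh. The code points and the glyph order are different everywhere. That is going too far; ten thousand xxx were stampeding through my head
[brainstorming.gif]
(talking it over with a colleague...)
No rush. Calm down and think again about where else there might be an opening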
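For one and the same page the font file URL is dynamic, but the code points and glyph order inside the file do not change
So the font file from one page can be used to build a standard (reference) character mapping table!
It felt like finding the door to a new world, but before the door even opened I had blocked it myself: the mapping table is built, and then what? (Cue the stampede again)
I thought and thought and thought, and finally got a colleague to think with me
Then it hit us: so much is different, but the coordinate points of the same character are the same! The door swung open again
Leaving special characters aside, and only for this font file, about 60% of the characters have matching coordinate points
What about the rest? Never mind for now; get this 60% out first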
Extracting 60% of the font mapping table
Building the standard code-to-character mapping table (based on the font file of one page)
def extract_ttf_file(self, file_name, get_word_map=True):
    _font = TTFont(file_name)
    # Skip the first entry of the glyph order; the rest are the obfuscated glyphs
    uni_list = _font.getGlyphOrder()[1:]
    # The replaced characters, listed in the glyph order of the reference font
    word_list = [
        "壞", "少", "遠", "大", "九", "左", "近", "呢", "十", "高",
        "着", "矮", "八", "二", "右", "是", "得", "的", "小", "短",
        "很", "一", "了", "地", "好", "多", "七", "不", "長", "低",
        "三", "五", "六", "下", "更", "和", "四", "上"
    ]
    utf_word_map = {}
    utf_coordinates_map = {}
    for index, uni_code in enumerate(uni_list):
        # word_list only describes the reference font, so skip it for other fonts
        if get_word_map:
            utf_word_map[uni_code] = word_list[index]
        utf_coordinates_map[uni_code] = list(_font['glyf'][uni_code].coordinates)
    if get_word_map:
        return utf_word_map, utf_coordinates_map
    return utf_coordinates_map

# self.local_utf_word_map, self.local_utf_coordinates_map = self.extract_ttf_file(self.local_ttf_name)
Building the new standard code mapping table
Download the font file to be cracked and map its codes onto the standard mapping table
def replace_ttf_map(self):
    unicode_mlist_map = []
    new_utf_coordinates_map = self.extract_ttf_file(self.download_ttf_name, get_word_map=False)
    for local_unicode, local_coordinate in self.local_utf_coordinates_map.items():
        # Candidates: glyphs in the new font with the same number of points
        coordinate_equal_list = []
        for new_unicode, new_coordinate in new_utf_coordinates_map.items():
            if len(new_coordinate) == len(local_coordinate):
                coordinate_equal_list.append({
                    "norm_key": local_unicode, "norm_coordinate": local_coordinate,
                    "new_key": new_unicode, "new_coordinate": new_coordinate})
        # Accept the match only when exactly one candidate exists
        if len(coordinate_equal_list) == 1:
            unicode_mlist_map.append(coordinate_equal_list[0])
    for unicode_dict in unicode_mlist_map:
        self.new_unicode_map[unicode_dict["new_key"]] = self.local_utf_word_map[unicode_dict["norm_key"]]
    print("new unicode map extract success\n", self.new_unicode_map)
This yields a mapping for 22 of the 38 characters:
{'uniED23': '壞', 'uniED75': '少', 'uniEC18': '九', 'uniECB0': '左', 'uniECF7': '呢', 'uniEC7A': '高', 'uniECA7': '矮', 'uniEDD6': '八', 'uniEDA0': '二', 'uniEDF2': '右', 'uniEC96': '的', 'uniEDF0': '小', 'uniEC8B': '短', 'uniECDB': '一', 'uniEDD5': '地', 'uniED8F': '好', 'uniECC1': '多', 'uniEDBA': '七', 'uniEC2A': '長', 'uniED59': '下', 'uniEC94': '和', 'uniED73': '四'}
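To see why only part of the table comes out at this stage, it can help to look at how many glyphs share the same number of outline points. The sketch below is only an illustration on made-up glyph names and coordinates; `split_by_point_count` is a hypothetical helper, not part of the original code, and it assumes both fonts contain the same set of shapes.

```python
from collections import Counter


def split_by_point_count(coordinates_map):
    """Split glyph names into those whose point count is unique and the rest."""
    counts = Counter(len(points) for points in coordinates_map.values())
    unique, ambiguous = [], []
    for name, points in coordinates_map.items():
        if counts[len(points)] == 1:
            unique.append(name)
        else:
            ambiguous.append(name)
    return unique, ambiguous


# Made-up glyph names and outlines, for illustration only
toy_font = {
    "uniA": [(0, 0), (10, 0), (10, 10)],
    "uniB": [(0, 0), (5, 5), (9, 1), (3, 7)],
    "uniC": [(1, 1), (9, 0), (10, 11)],
}
unique, ambiguous = split_by_point_count(toy_font)
print(unique, ambiguous)
```

Glyphs in the `ambiguous` list are the ones that need the coordinate comparison described next.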
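Replacing the remaining 40% of the font mapping table
Rebuilding the new standard mapping table
Next, handle the rest with the coordinate points. The ideas were:
Judge by the difference between the coordinates of two points, but the right tolerance is hard to pin down
For the same character the coordinate points are almost identical, so the candidate with the smallest sum of absolute coordinate differences is the same character
Let's give it a try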
def get_distence(self, norm_coordinate, new_coordinate):
    # Sum of absolute x and y differences over all points (same point count assumed)
    distance_total = 0
    for index, coordinate_point in enumerate(norm_coordinate):
        distance_total += abs(new_coordinate[index][0] - coordinate_point[0]) + abs(new_coordinate[index][1] - coordinate_point[1])
    return distance_total
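A quick check of the idea on made-up numbers, written as a standalone function so it runs on its own. The glyph names, the coordinates and the name `coordinate_distance` are invented for this example.

```python
def coordinate_distance(norm_coordinate, new_coordinate):
    """Sum of absolute x and y differences between two equal-length outlines."""
    return sum(
        abs(new_x - x) + abs(new_y - y)
        for (x, y), (new_x, new_y) in zip(norm_coordinate, new_coordinate)
    )


# Made-up reference outline and two candidates with the same point count
reference = [(0, 0), (10, 0), (10, 10)]
candidates = {
    "uniX": [(1, 0), (10, 1), (9, 10)],  # same shape, slightly shifted points
    "uniY": [(0, 5), (4, 0), (10, 2)],   # a different shape
}

distances = {name: coordinate_distance(reference, points)
             for name, points in candidates.items()}
best = min(distances, key=distances.get)
print(distances, best)
```

Taking the minimum avoids having to choose a tolerance, which was the weak point of the first idea.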
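Then regroup the standard code, the standard coordinates, the new code and the new coordinates
(The idea: find the closest coordinates, use the new coordinates to get the standard code, use the standard code to look up the character, then store it under the code used by this page's font)
# Code/coordinate record prepared for replacement {"norm_key": local_unicode, "norm_coordinate": local_coordinate, "new_key": new_unicode, "new_coordinate": new_coordinate}
Pick the element whose coordinate differences sum to the smallest value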
def handle_subtraction(self, coordinate_equal_list):
    # Compute the distance for every candidate and return the closest one
    coordinate_min_list = []
    for coordinate_equal in coordinate_equal_list:
        n = self.get_distence(coordinate_equal.get('norm_coordinate'), coordinate_equal.get('new_coordinate'))
        coordinate_min_list.append(n)
    return coordinate_equal_list[coordinate_min_list.index(min(coordinate_min_list))]
Replace, producing the new standard mapping table
self.new_unicode_map[unicode_dict["new_key"]] = self.local_utf_word_map[unicode_dict["norm_key"]]
Adding the check
Add one more branch to the 60% mapping code above, changing it to the following
def replace_ttf_map(self):
    unicode_mlist_map = []
    new_utf_coordinates_map = self.extract_ttf_file(self.download_ttf_name, get_word_map=False)
    for local_unicode, local_coordinate in self.local_utf_coordinates_map.items():
        # Candidates: glyphs in the new font with the same number of points
        coordinate_equal_list = []
        for new_unicode, new_coordinate in new_utf_coordinates_map.items():
            if len(new_coordinate) == len(local_coordinate):
                coordinate_equal_list.append({
                    "norm_key": local_unicode, "norm_coordinate": local_coordinate,
                    "new_key": new_unicode, "new_coordinate": new_coordinate})
        if len(coordinate_equal_list) == 1:
            unicode_mlist_map.append(coordinate_equal_list[0])
        elif len(coordinate_equal_list) > 1:
            # Several candidates: keep the one with the smallest coordinate distance
            min_word = self.handle_subtraction(coordinate_equal_list)
            unicode_mlist_map.append(min_word)
    for unicode_dict in unicode_mlist_map:
        self.new_unicode_map[unicode_dict["new_key"]] = self.local_utf_word_map[unicode_dict["norm_key"]]
    print("new unicode map extract success\n", self.new_unicode_map)
This prints the complete mapping table. I won't post comparison screenshots here; after checking it against the page, I found no problems
{'uniED23': '壞', 'uniED75': '少', 'uniEDC5': '遠', 'uniED5A': '大', 'uniEC18': '九', 'uniECB0': '左', 'uniECA5': '近', 'uniECF7': '呢', 'uniED3F': '十', 'uniEC7A': '高', 'uniEC44': '着', 'uniECA7': '矮', 'uniEDD6': '八', 'uniEDA0': '二', 'uniEDF2': '右', 'uniED09': '是', 'uniEC32': '得', 'uniEC96': '的', 'uniEDF0': '小', 'uniEC8B': '短', 'uniED3D': '很', 'uniECDB': '一', 'uniEC60': '了', 'uniEDD5': '地', 'uniED8F': '好', 'uniECC1': '多', 'uniEDBA': '七', 'uniED2D': '不', 'uniEC2A': '長', 'uniED11': '低', 'uniEC5E': '三', 'uniECDD': '五', 'uniEDBC': '六', 'uniED59': '下', 'uniEE02': '更', 'uniEC94': '和', 'uniED73': '四', 'uniED6A': '上'}
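Besides comparing by eye, a small automated check is possible. Each reference glyph picks its closest candidate independently, so two reference glyphs could in principle land on the same new code, in which case one character would silently drop out of the table. The sketch below looks for such missing characters; `missing_words` is a hypothetical helper and the data is made up.

```python
def missing_words(unicode_map, expected_words):
    """Return the expected characters that do not appear in the mapping."""
    mapped = set(unicode_map.values())
    return [word for word in expected_words if word not in mapped]


# Made-up example: "多" was expected but never mapped
toy_map = {"uniE001": "大", "uniE002": "小"}
toy_expected = ["大", "小", "多"]
print(missing_words(toy_map, toy_expected))
```

An empty result means every character of the reference list received a code in the new font.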
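Additional notes
If anything above is wrong, I would be grateful if someone more experienced pointed it out
The rest is the same as in the previous post; see that post for details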
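For readers who have not seen the previous post, the remaining step is roughly to substitute the obfuscated entities in the page HTML using the mapping table. This is a minimal standalone sketch modelled on `repalce_source_code` in the source below; the HTML snippet is made up, and it assumes the page writes the entities as lowercase hex, as the `.lower()` call in the source implies.

```python
def replace_entities(html, unicode_map):
    """Replace entities such as &#xed8f; with the characters they stand for."""
    for glyph_name, word in unicode_map.items():
        # "uniED8F" -> "&#xed8f;"
        html = html.replace("&#x%s;" % glyph_name[3:].lower(), word)
    return html


# Two entries taken from the mapping printed above; the HTML is made up
sample_map = {"uniED8F": "好", "uniED5A": "大"}
sample_html = "<p>&#xed8f;&#xed5a;</p>"
print(replace_entities(sample_html, sample_map))
```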
Source code
# -*- coding: utf-8 -*-
# @Author: Mehaei
# @Date: 2020-01-10 14:51:53
# @Last Modified by: Mehaei
# @Last Modified time: 2020-01-13 10:10:13
import re
import os
import requests
from lxml import etree
from fontTools.ttLib import TTFont


class NotFoundFontFileUrl(Exception):
    pass


class CarHomeFont(object):
    def __init__(self, url, *args, **kwargs):
        self.local_ttf_name = "norm_font.ttf"
        self.download_ttf_name = 'new_font.ttf'
        self.new_unicode_map = {}
        self._making_local_code_map()
        self._download_ttf_file(url, self.download_ttf_name)

    def _download_ttf_file(self, url, file_name):
        self.page_html = self.download(url) or ""
        # Find the URL of the font file (non-greedy, so only one URL is captured)
        font_file_name = (re.findall(r",url\('(//.*?\.ttf)'\) format", self.page_html) or [""])[0]
        if not font_file_name:
            raise NotFoundFontFileUrl("not found font file name")
        # Download the font file
        file_content = self.download("https:%s" % font_file_name, content=True)
        # Save the font file locally
        with open(file_name, 'wb') as f:
            f.write(file_content)
        print("font file download success")

    def _making_local_code_map(self):
        if not os.path.exists(self.local_ttf_name):
            # This URL provides the standard font file; if you change it, update word_list by hand
            url = "https://club.autohome.com.cn/bbs/thread/62c48ae0f0ae73ef/75904283-1.html"
            self._download_ttf_file(url, self.local_ttf_name)
        self.local_utf_word_map, self.local_utf_coordinates_map = self.extract_ttf_file(self.local_ttf_name)
        print("local ttf load done")

    def get_distence(self, norm_coordinate, new_coordinate):
        # Sum of absolute x and y differences over all points (same point count assumed)
        distance_total = 0
        for index, coordinate_point in enumerate(norm_coordinate):
            distance_total += abs(new_coordinate[index][0] - coordinate_point[0]) + abs(new_coordinate[index][1] - coordinate_point[1])
        return distance_total

    def handle_subtraction(self, coordinate_equal_list):
        # Compute the distance for every candidate and return the closest one
        coordinate_min_list = []
        for coordinate_equal in coordinate_equal_list:
            n = self.get_distence(coordinate_equal.get('norm_coordinate'), coordinate_equal.get('new_coordinate'))
            coordinate_min_list.append(n)
        return coordinate_equal_list[coordinate_min_list.index(min(coordinate_min_list))]

    def replace_ttf_map(self):
        unicode_mlist_map = []
        new_utf_coordinates_map = self.extract_ttf_file(self.download_ttf_name, get_word_map=False)
        for local_unicode, local_coordinate in self.local_utf_coordinates_map.items():
            # Candidates: glyphs in the new font with the same number of points
            coordinate_equal_list = []
            for new_unicode, new_coordinate in new_utf_coordinates_map.items():
                if len(new_coordinate) == len(local_coordinate):
                    coordinate_equal_list.append({
                        "norm_key": local_unicode, "norm_coordinate": local_coordinate,
                        "new_key": new_unicode, "new_coordinate": new_coordinate})
            if len(coordinate_equal_list) == 1:
                unicode_mlist_map.append(coordinate_equal_list[0])
            elif len(coordinate_equal_list) > 1:
                # Several candidates: keep the one with the smallest coordinate distance
                min_word = self.handle_subtraction(coordinate_equal_list)
                unicode_mlist_map.append(min_word)
        for unicode_dict in unicode_mlist_map:
            self.new_unicode_map[unicode_dict["new_key"]] = self.local_utf_word_map[unicode_dict["norm_key"]]
        print("new unicode map extract success\n", self.new_unicode_map)

    def extract_ttf_file(self, file_name, get_word_map=True):
        _font = TTFont(file_name)
        # Skip the first entry of the glyph order; the rest are the obfuscated glyphs
        uni_list = _font.getGlyphOrder()[1:]
        # The replaced characters, listed in the glyph order of the reference font
        word_list = [
            "壞", "少", "遠", "大", "九", "左", "近", "呢", "十", "高",
            "着", "矮", "八", "二", "右", "是", "得", "的", "小", "短",
            "很", "一", "了", "地", "好", "多", "七", "不", "長", "低",
            "三", "五", "六", "下", "更", "和", "四", "上"
        ]
        utf_word_map = {}
        utf_coordinates_map = {}
        for index, uni_code in enumerate(uni_list):
            # word_list only describes the reference font, so skip it for other fonts
            if get_word_map:
                utf_word_map[uni_code] = word_list[index]
            utf_coordinates_map[uni_code] = list(_font['glyf'][uni_code].coordinates)
        if get_word_map:
            return utf_word_map, utf_coordinates_map
        return utf_coordinates_map

    def repalce_source_code(self):
        # Replace each obfuscated entity (e.g. &#xed8f;) with its real character
        replaced_html = self.page_html
        for utf_code, word in self.new_unicode_map.items():
            replaced_html = replaced_html.replace("&#x%s;" % utf_code[3:].lower(), word)
        return replaced_html

    def get_subject_content(self):
        normal_html = self.repalce_source_code()
        # Use xpath to get the main post
        xp_html = etree.HTML(normal_html)
        subject_text = ''.join(xp_html.xpath('//div[@xname="content"]//div[@class="tz-paragraph"]//text()'))
        return subject_text

    def download(self, url, *args, try_time=5, method="GET", content=False, **kwargs):
        kwargs.setdefault("headers", {})
        kwargs["headers"].update({"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"})
        while try_time:
            # Count every attempt, so a non-OK response cannot loop forever
            try_time -= 1
            try:
                response = requests.request(method.upper(), url, *args, **kwargs)
                if response.ok:
                    if content:
                        return response.content
                    return response.text
            except Exception as e:
                print("download error: %s" % e)


if __name__ == "__main__":
    url = "https://club.autohome.com.cn/bbs/thread/34d6bcc159b717a9/85794510-1.html#pvareaid=6830286"
    car = CarHomeFont(url)
    car.replace_ttf_map()
    text = car.get_subject_content()
    print(text)