The Road to Scraping: Font-File Anti-Scraping, Part 1


Preface

Today I'll document how to defeat Autohome's (汽车之家) font-based anti-scraping. The complete code is at the end.

Analysing the page

First, look at what the page displays: it's all ""

Then look at the page source: what's there is ""

 

This is clearly font-based anti-scraping. Next, let's lift the veil on the font file step by step.

 

Inspecting the font file

First, download the font file to your local disk.
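As a rough sketch, the font URL sits in the page's @font-face rule and can be pulled out with a regex. The CSS fragment and CDN host below are made up for illustration; the real values come from the page source:

```python
import re

# hypothetical @font-face fragment shaped like the one embedded in the page source
css = "@font-face{font-family:'myfont';src:url('//k3.autoimg.cn/g1/sample.ttf') format('ttf');}"

# capture the protocol-relative .ttf URL
match = re.search(r"url\('(//[^']+\.ttf)'\)", css)
font_url = "https:" + match.group(1)
print(font_url)

# the bytes would then be fetched and saved with, e.g.:
#   open('norm_font.ttf', 'wb').write(requests.get(font_url).content)
```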

Use an online tool to view the font file's contents (online viewer address).

Open it up. Do these codes look familiar? They're exactly the codes from the page source above.

 

Cracking the font file

We use fontTools to process the font file.

# install fonttools
pip3 install fonttools

Reading the font's glyph-code list

from fontTools.ttLib import TTFont

# parse the font file
font = TTFont('ChcCQ1sUz1WATUkcAABj5B4DyuQ43.ttf')

# read the ordered list of glyph names
uni_list = font.getGlyphOrder()
print(uni_list)

 

Output (the first entry is the blank glyph; it is sliced off below):

['.notdef', 'uniEDD2', 'uniEC30', 'uniED71', 'uniECBD', 'uniED0F', 'uniEC5C', 'uniED9C', 'uniEDEE', 'uniED3B', 'uniED8D', 'uniECD9', 'uniEC26', 'uniEC78', 'uniEDB8', 'uniED05', 'uniED57', 'uniECA3', 'uniECF5', 'uniEC42', 'uniED82', 'uniEDD4', 'uniED21', 'uniEC6D', 'uniECBF', 'uniEE00', 'uniEC5D', 'uniED9E', 'uniECEB', 'uniED3C', 'uniEC89', 'uniEDCA', 'uniEC27', 'uniED68', 'uniEDBA', 'uniED06', 'uniEC53', 'uniECA5', 'uniEDE5']
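A quick check on a truncated copy of that list shows why the page renders blanks: the uniXXXX glyph names decode to code points in Unicode's Private Use Area (U+E000–U+F8FF), which ordinary system fonts have no glyphs for:

```python
uni_list = ['.notdef', 'uniEDD2', 'uniEC30', 'uniED71']  # truncated sample from above

# 'uniEDD2' -> code point U+EDD2
code_points = [int(name[3:], 16) for name in uni_list[1:]]

# all of them fall in the Private Use Area, U+E000 .. U+F8FF
assert all(0xE000 <= cp <= 0xF8FF for cp in code_points)
print([hex(cp) for cp in code_points])
```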

 

Building the glyph-to-character map

One prerequisite: a hand-written list of the characters, in glyph order (I don't know how to obtain this list automatically; pointers welcome).

word_list = [
    "", "", "", "", "", "", "", "", "", "", "", 
    "", "", "", "", "", "", "", "", "", "", "", "", 
    "", "", "", "", "", "", "", "", "", "", "", "", 
    "", "", ""
]
# strip the 'uni' prefix and lowercase the hex codes
utf_list = [uni[3:].lower() for uni in uni_list[1:]]
# map each code to its character
utf_word_map = dict(zip(utf_list, word_list))
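With hypothetical stand-in characters (the real list has to be read off the rendered glyphs), the mapping comes together like this. Note that zip() truncates silently, so both lists must match exactly in length and order:

```python
uni_list = ['.notdef', 'uniEDD2', 'uniEC30', 'uniED71']   # truncated sample
word_list = ['七', '万', '上']                              # hypothetical stand-ins

# drop '.notdef', strip the 'uni' prefix, lowercase the hex codes
utf_list = [uni[3:].lower() for uni in uni_list[1:]]

# pair each code with the character its glyph draws
utf_word_map = dict(zip(utf_list, word_list))
print(utf_word_map)
```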

 

Replacing in the source, then extracting

The approach here is to replace the entities in the raw HTML first, then extract the content.

import requests
from lxml import etree

# fetch the page (url and headers are defined elsewhere)
response = requests.get(url, headers=headers)
html = response.text

# swap each &#x....; entity for its real character
for utf_code, word in utf_word_map.items():
    html = html.replace("&#x%s;" % utf_code, word)


# use XPath to grab the main post
xp_html = etree.HTML(html)
subject_text = ''.join(xp_html.xpath('//div[@xname="content"]//div[@class="tz-paragraph"]//text()'))
print(subject_text)
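The replacement step can be exercised on a made-up snippet (both the codes and the characters here are hypothetical):

```python
utf_word_map = {'edd2': '七', 'ec30': '万'}               # hypothetical mapping
html = '<span class="hs_kw">&#xedd2;&#xec30;</span>'     # made-up page fragment

# the entities appear literally in response.text, so plain str.replace works
for utf_code, word in utf_word_map.items():
    html = html.replace("&#x%s;" % utf_code, word)

print(html)
```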

Output — the font has been decoded successfully:

上個禮拜六碳罐索賠成功,更換時一直在傍邊,將整個后橋拆掉,然后更換。換完后回家發現后邊排氣管有“突突”聲,沒換之前沒有。正常嗎?
(Roughly: "Last Saturday my carbon-canister warranty claim went through; I stood by the whole time while they dropped the entire rear axle and swapped it. Driving home afterwards I noticed a 'putt-putt' sound from the rear exhaust that wasn't there before. Is that normal?")

 

Full source for this page

# -*- coding: utf-8 -*-
# @Author: Mehaei
# @Date:   2020-01-09 10:01:59
# @Last Modified by:   Mehaei
# @Last Modified time: 2020-01-10 11:52:19
import re
import requests
from lxml import etree
from fontTools.ttLib import TTFont


class NotFoundFontFileUrl(Exception):
    pass


class CarHomeFont(object):
    def __init__(self, url, *args, **kwargs):
        self.download_ttf_name = 'norm_font.ttf'
        self._download_ttf_file(url)
        self._making_code_map()

    def _download_ttf_file(self, url):
        self.page_html = self.download(url) or ""
        # pull the font file's URL out of the page source
        font_file_name = (re.findall(r",url\('(//[^']*\.ttf)'\) format", self.page_html) or [""])[0]
        if not font_file_name:
            raise NotFoundFontFileUrl("not found font file name")
        # download the font file
        file_content = self.download("https:%s" % font_file_name, content=True)
        # save the font file locally
        with open(self.download_ttf_name, 'wb') as f:
            f.write(file_content)
        print("font file download success")

    def _making_code_map(self):
        font = TTFont(self.download_ttf_name)
        uni_list = font.getGlyphOrder()
        # strip the 'uni' prefix and lowercase the hex codes
        utf_list = [uni[3:].lower() for uni in uni_list[1:]]
        # the hand-written characters, in glyph order
        word_list = [
            "", "", "", "", "", "", "", "", "", "", "", 
            "", "", "", "", "", "", "", "", "", "", "", "", 
            "", "", "", "", "", "", "", "", "", "", "", "", 
            "", "", ""
        ]
        self.utf_word_map = dict(zip(utf_list, word_list))

    def replace_source_code(self):
        replaced_html = self.page_html
        for utf_code, word in self.utf_word_map.items():
            replaced_html = replaced_html.replace("&#x%s;" % utf_code, word)
        return replaced_html

    def get_subject_content(self):
        normal_html = self.replace_source_code()
        # use XPath to grab the main post
        xp_html = etree.HTML(normal_html)
        subject_text = ''.join(xp_html.xpath('//div[@xname="content"]//div[@class="tz-paragraph"]//text()'))
        return subject_text

    def download(self, url, *args, try_time=5, method="GET", content=False, **kwargs):
        kwargs.setdefault("headers", {})
        kwargs["headers"].update({"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"})
        while try_time:
            # count every attempt, otherwise a persistent bad status loops forever
            try_time -= 1
            try:
                response = requests.request(method.upper(), url, *args, **kwargs)
                if response.ok:
                    return response.content if content else response.text
            except Exception as e:
                print("download error: %s" % e)


if __name__ == "__main__":
    url = "https://club.autohome.com.cn/bbs/thread/62c48ae0f0ae73ef/75904283-1.html"
    car = CarHomeFont(url)
    text = car.get_subject_content()
    print(text)

 

What's next

At this point I thought we were done, but it turns out this scraper only works on this one page; switch to another page and the output is garbled again.
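A plausible explanation, to be confirmed in the next post, is that each page serves a freshly shuffled font, so a map keyed to one font's glyph order decodes another page's entities into the wrong characters. A sketch with made-up data:

```python
words = ['七', '万']                  # hypothetical hand-written list, fixed order
order_page1 = ['uniEDD2', 'uniEC30']  # glyph order served with page 1 (made up)
order_page2 = ['uniEC30', 'uniEDD2']  # same glyphs, shuffled, served with page 2

# zipping each page's glyph order against the SAME hand-written list
map1 = dict(zip([u[3:].lower() for u in order_page1], words))
map2 = dict(zip([u[3:].lower() for u in order_page2], words))

# the same code now decodes to a different character -> garbled output
print(map1['edd2'], map2['edd2'])
```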

The next post covers how to handle that situation.

 

Click here for the follow-up: cracking dynamic font files.

