Web scraping: fetching Douyin user info (font obfuscation, static font file)


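Cracking the font obfuscation

Getting the user's URL

Find the target user

Copied share link: http://v.douyin.com/SBGFHC/

Open the link in a browser. The address it redirects to is the real profile URL: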
https://www.iesdouyin.com/share/user/88445518961?sec_uid=MS4wLjABAAAAWxLpO0Q437qGFpnEKBIIaU5-xOj2yAhH3MNJi-AUY04&u_code=g307keff&timestamp=1563956307
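
The redirect can also be followed from a script, and the resulting URL split into its parts. This is a minimal sketch using only the standard library. The helper names `resolve_share_url` and `parse_user_url` are made up for illustration, and the short User-Agent string is an assumption; the site may require different headers.

```python
from urllib.parse import urlparse, parse_qs
from urllib.request import Request, urlopen


def resolve_share_url(short_url, timeout=10):
    """Follow redirects from the short share link and return the final URL."""
    request = Request(short_url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return response.geturl()  # URL after all redirects were followed


def parse_user_url(url):
    """Split a redirected profile URL into its user ID and query parameters."""
    parts = urlparse(url)
    query = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {"user_id": parts.path.rstrip("/").split("/")[-1], **query}


# The redirected URL shown above, parsed offline (no network request).
info = parse_user_url(
    "https://www.iesdouyin.com/share/user/88445518961"
    "?sec_uid=MS4wLjABAAAAWxLpO0Q437qGFpnEKBIIaU5-xOj2yAhH3MNJi-AUY04"
    "&u_code=g307keff&timestamp=1563956307"
)
print(info["user_id"], info["u_code"])
```

Calling `resolve_share_url("http://v.douyin.com/SBGFHC/")` would perform the actual request; it is left out of the example so the block runs without network access.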

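Look at the information we want to extract

Inspect the page source

The site renders its numbers with a custom font, so the digits in the source are not the digits shown on screen. To scrape them we need to fetch the font file and decode the values according to the font's character mapping.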
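
One way to locate the font file is to search the page source for `url(...)` references ending in `.woff`. The sketch below assumes the font is declared in an `@font-face` rule; the sample CSS and the `example.com` host are hypothetical, and the real page may declare the font differently. `find_woff_urls` is an illustrative helper name.

```python
import re

# Hypothetical snippet: the real page's @font-face rule may look different.
SAMPLE_CSS = """
@font-face {
  font-family: "iconfont";
  src: url(//example.com/fonts/iconfont_9eb9a50.woff) format("woff");
}
"""

# Matches url(...) with optional quotes, capturing a path ending in .woff.
FONT_URL_RE = re.compile(r"""url\(\s*['"]?([^'")\s]+\.woff)['"]?\s*\)""")


def find_woff_urls(page_source):
    """Return every .woff URL referenced via url(...) in the page source."""
    urls = []
    for match in FONT_URL_RE.findall(page_source):
        # Protocol-relative URLs (//host/path) need a scheme before requesting.
        urls.append("https:" + match if match.startswith("//") else match)
    return urls


print(find_woff_urls(SAMPLE_CSS))
```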

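Open the downloaded font file in http://fontstore.baidu.com/static/editor/index.html to view its glyphs. Every request returns the same font file, which indicates that the site uses only this one custom font file and that the character mapping does not change.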
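
The claim that every request returns the same file can be checked by hashing each download and comparing the digests. A minimal sketch follows; the helper names are illustrative and the byte strings are placeholders for the real file contents.

```python
import hashlib


def font_fingerprint(font_bytes):
    """Return the SHA-256 hex digest of a downloaded font file's bytes."""
    return hashlib.sha256(font_bytes).hexdigest()


def fonts_are_identical(downloads):
    """True when there is at least one download and all share one digest."""
    return len({font_fingerprint(data) for data in downloads}) == 1


# Placeholder bytes stand in for the contents of repeated .woff downloads.
print(fonts_are_identical([b"same-font", b"same-font"]))   # True
print(fonts_are_identical([b"font-a", b"font-b"]))         # False
```

If the digests ever differ, the hard-coded glyph-to-digit table used later would have to be rebuilt for each new font file.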

Use the fontTools module to convert the font file into a readable XML file

from fontTools.ttLib import TTFont

font = TTFont('iconfont_9eb9a50.woff')
font.saveXML('iconfont_9eb9a50.xml')
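
In the saved XML, the `cmap` table lists each code point and glyph name as a `<map code="..." name="..."/>` element. The sketch below reads those pairs with the standard library. The embedded XML is a trimmed stand-in containing only the two entries this article shows, and its table attributes are illustrative; `read_cmap` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the saved XML; only two entries are included,
# and the cmap_format_4 attributes are illustrative.
SAMPLE_TTX = """<ttFont>
  <cmap>
    <cmap_format_4 platformID="3" platEncID="1" language="0">
      <map code="0xe604" name="num_2"/>
      <map code="0xe605" name="num_3"/>
    </cmap_format_4>
  </cmap>
</ttFont>"""


def read_cmap(xml_text):
    """Collect {code point: glyph name} from every <map> element."""
    root = ET.fromstring(xml_text)
    return {m.get("code"): m.get("name") for m in root.iter("map")}


print(read_cmap(SAMPLE_TTX))
```

For the real file, `ET.parse('iconfont_9eb9a50.xml').getroot()` would replace the `ET.fromstring` call.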

Full implementation

# -*- coding: utf-8 -*-
# @Time    : 2019/7/24 12:03
import re
import requests
from lxml import etree
from fontTools.ttLib import TTFont
# Load the font file from local disk (the .woff downloaded from the site)
ttfond = TTFont("iconfont_9eb9a50.woff")

def get_cmap_dict():
    """
    :return: mapping of hex code point -> glyph name
    """
    # Read the best cmap table from the local .woff font downloaded from the site
    best_cmap = ttfond["cmap"].getBestCmap()
    # Convert each integer code point key to its hex string form
    best_cmap_dict = {}
    for key, value in best_cmap.items():
        best_cmap_dict[hex(key)] = value
    return best_cmap_dict   # ...: 'num_1', '0xe604': 'num_2', '0xe605': 'num_3'

def get_num_cmap():
    """
    :return: mapping from glyph name to the real digit it renders
    """
    num_map = {
        "x":"", "num_":1, "num_1":0,
        "num_2":3, "num_3":2, "num_4":4,
        "num_5":5, "num_6":6, "num_7":9,
        "num_8":7, "num_9":8,
    }
    return num_map


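def map_cmap_num(get_cmap_dict,get_num_cmap):
    new_cmap = {}
    # Build the glyph name -> digit table once instead of on every iteration
    num_map = get_num_cmap()
    for key, value in get_cmap_dict().items():
        # '0xe604' -> '&#xe604;', the entity format used in the page source
        key = re.sub("0", "&#", key, count=1) + ";"
        new_cmap[key] = num_map[value]
    # Resulting format, e.g. {..., '&#xe604;': 3, '&#xe605;': 2}
    return new_cmap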


# Fetch the page source
def get_html(url):
    headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
    }
    # A timeout keeps the script from hanging if the server does not respond
    response = requests.get(url, headers=headers, timeout=10).text
    return response

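def replace_num_and_cmap(result,response):
    """
    Replace the obfuscated font entities in the page source with real digits
    :param result: mapping of entity (e.g. '&#xe604;') -> digit
    :param response: page source text
    :return: page source with the entities replaced
    """
    for key, value in result.items():
        # The key is a literal entity, not a pattern, so plain replace is enough
        response = response.replace(key, str(value))
    return response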

def manage(response):
    res = etree.HTML(response)
    douyin_name = res.xpath('//p[@class="nickname"]//text()')[0]
    douyin_id = 'ID:'+''.join(res.xpath('//p[@class="shortid"]/i//text()')).replace(' ','')
    guanzhu_num = ''.join(res.xpath('//span[@class="focus block"]//text()')).replace(' ','')
    fensi_num = ''.join(res.xpath('//span[@class="follower block"]//text()')).replace(' ','')
    dianzan = ''.join(res.xpath('//span[@class="liked-num block"]//text()')).replace(' ','')
    print(douyin_name,douyin_id,guanzhu_num,fensi_num,dianzan)
    # Sample output: Dear-迪麗熱巴 ID:274110380 0關注 5298.2w粉絲 15123.2w贊
    # (關注 = following, 粉絲 = followers, 贊 = likes, w = 10,000)

if __name__ == '__main__':
    new_cmap = map_cmap_num(get_cmap_dict, get_num_cmap)

    response = get_html("https://www.iesdouyin.com/share/user/88445518961?sec_uid=MS4wLjABAAAAWxLpO0Q437qGFpnEKBIIaU5-xOj2yAhH3MNJi-AUY04&u_code=g307keff&timestamp=1563956307")

    response = replace_num_and_cmap(new_cmap,response)
    manage(response)
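
The replacement step can be tried offline without the font file or a network request. The sketch below uses only the two glyph entries that appear in the comments above (`0xe604` and `0xe605`); the HTML fragment is hypothetical and much simpler than the real markup, and `decode_entities` is an illustrative name.

```python
# Only the two glyph entries shown in the article's comments are used here.
cmap_sample = {0xE604: "num_2", 0xE605: "num_3"}   # code point -> glyph name
glyph_digits = {"num_2": 3, "num_3": 2}            # glyph name -> real digit

# Build {'&#xe604;': '3', ...}, the entity form used in the page source.
entity_to_digit = {
    "&#x{:x};".format(code): str(glyph_digits[name])
    for code, name in cmap_sample.items()
}


def decode_entities(html, table):
    """Replace each obfuscated entity in html with its real digit."""
    for entity, digit in table.items():
        html = html.replace(entity, digit)
    return html


# Hypothetical fragment; the real markup wraps each entity in more tags.
sample = '<i class="icon"> &#xe604; </i><i class="icon"> &#xe605; </i>'
print(decode_entities(sample, entity_to_digit))
```

The leftover spaces around each digit are why the extraction code above joins the text nodes and strips spaces afterwards.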

 

