Cracking Maoyan Movie's Obfuscated Digits (Scraping Ratings, Box Office, and Ticket Prices)

This article is a repost (2018-12-26 18:03).

title: Cracking Maoyan Movie's Obfuscated Digits (Scraping Ratings, Box Office, and Ticket Prices)
toc: true
date: 2018-07-01 22:05:27
categories:
- methods
tags:
- crawler
- Python

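Background

While scraping data from Maoyan Movie, I found that the ratings, box-office figures, and ticket prices in the downloaded pages are not actual digits but strings of code points such as \uf5fb, which have to be decoded.

These code points are generated randomly on every visit, and their mapping to the digits 0-9 is random as well.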
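
As a small illustration of what these values look like: code points such as \uf5fb fall in the Unicode Private Use Area (U+E000 to U+F8FF), which has no standard glyphs, so the page's custom font alone decides what gets drawn. The helper name `is_private_use` and the sample string below are my own assumptions, not part of the original code; this is only a sketch for spotting such characters in scraped text.

```python
import unicodedata

def is_private_use(ch):
    """Return True if ch is a Private Use character (Unicode category 'Co')."""
    return unicodedata.category(ch) == 'Co'

# Hypothetical scraped value: two obfuscated digits around a decimal point.
sample = '\uf5fb.\ue184'
flags = [is_private_use(ch) for ch in sample]
print(flags)
```

Ordinary characters such as the decimal point are not flagged, so only the obfuscated digits need to go through the font-based mapping.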

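Solution

Download the dynamic font file and parse the mapping out of it.

Approach

First, locate the URL of the dynamic font file (inside the style tag within the head tag):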

<style>
    @font-face {
      font-family: stonefont;
      src: url('//vfile.meituan.net/colorstone/e954129d5204b4e8c783c95f7da4c2733168.eot');
      src: url('//vfile.meituan.net/colorstone/e954129d5204b4e8c783c95f7da4c2733168.eot?#iefix') format('embedded-opentype'),
           url('//vfile.meituan.net/colorstone/8f497cdb4e39d1f3dcbafa28a486aea42076.woff') format('woff');
    }

    .stonefont {
      font-family: stonefont;
    }
  </style>

The .woff file is the one we need.

The download code (using scrapy) is as follows:

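# Download the font file
import urllib.request

# sel is the scrapy Selector for the page; take the text of the style tag in head
font_url = sel.xpath('/html/head/style/text()').extract()[0]
# Slice from just after the last "url('" to the end of the first "woff", then add the scheme
font_url = 'http:'+font_url[font_url.rfind('url')+5:font_url.find('woff')+4]
print(font_url)
woff_path = 'tmp.woff'
f = urllib.request.urlopen(font_url)
data = f.read()
with open(woff_path, "wb") as code:
    code.write(data)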
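
The slicing above depends on the exact positions of 'url' and 'woff' in the style text. A regular expression is one possible way to make the extraction less position-sensitive. The sketch below assumes the style text has the shape shown earlier; `extract_woff_url` is a hypothetical helper name, not part of the original code.

```python
import re

def extract_woff_url(style_text, scheme='http:'):
    """Return the absolute URL of the first .woff file referenced in the CSS, or None."""
    match = re.search(r"url\('([^']+\.woff)'\)", style_text)
    if match is None:
        return None
    url = match.group(1)
    # Protocol-relative URLs (starting with //) need a scheme prepended.
    return scheme + url if url.startswith('//') else url

style_text = """
@font-face {
  font-family: stonefont;
  src: url('//vfile.meituan.net/colorstone/e954129d5204b4e8c783c95f7da4c2733168.eot');
  src: url('//vfile.meituan.net/colorstone/e954129d5204b4e8c783c95f7da4c2733168.eot?#iefix') format('embedded-opentype'),
       url('//vfile.meituan.net/colorstone/8f497cdb4e39d1f3dcbafa28a486aea42076.woff') format('woff');
}
"""
print(extract_woff_url(style_text))
```

Returning None when no .woff URL is present lets the caller decide how to handle a page whose layout has changed, instead of silently downloading a wrong URL.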

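Use TTFont (from the fontTools library) to convert the woff file into an xml file:

from fontTools.ttLib import TTFont

font1 = TTFont('tmp.woff')
font1.saveXML('tmp.xml')

Inspecting the xml file reveals a mapping: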

<GlyphOrder>
    <!-- The 'id' attribute is only for humans; it is ignored when parsed. -->
    <GlyphID id="0" name="glyph00000"/>
    <GlyphID id="1" name="x"/>
    <GlyphID id="2" name="uniF753"/>
    <GlyphID id="3" name="uniEA72"/>
    <GlyphID id="4" name="uniEE4E"/>
    <GlyphID id="5" name="uniECE6"/>
    <GlyphID id="6" name="uniE140"/>
    <GlyphID id="7" name="uniF4B0"/>
    <GlyphID id="8" name="uniE1B7"/>
    <GlyphID id="9" name="uniF245"/>
    <GlyphID id="10" name="uniE488"/>
    <GlyphID id="11" name="uniE6DA"/>
</GlyphOrder>

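However, decoding with this mapping produces the wrong digits, so GlyphOrder is not the mapping we need.

Scrolling further down the xml file, we find the glyph outline data: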

<TTGlyph name="uniF245" xMin="0" yMin="0" xMax="508" yMax="716">
  <contour>
    <pt x="323" y="0" on="1"/>
    <pt x="323" y="171" on="1"/>
    <pt x="13" y="171" on="1"/>
    <pt x="13" y="252" on="1"/>
    <pt x="339" y="716" on="1"/>
    <pt x="411" y="716" on="1"/>
    <pt x="411" y="252" on="1"/>
    <pt x="508" y="252" on="1"/>
    <pt x="508" y="171" on="1"/>
    <pt x="411" y="171" on="1"/>
    <pt x="411" y="0" on="1"/>
  </contour>
  <contour>
    <pt x="323" y="252" on="1"/>
    <pt x="323" y="575" on="1"/>
    <pt x="99" y="252" on="1"/>
  </contour>
  <instructions/>
</TTGlyph>

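At this point it occurred to me that no matter how the unicode code points change, the rendered shape of each digit stays the same, so we can work from the glyph data instead:

Each digit 0-9 has one corresponding TTGlyph element. First, analyze a font file whose mapping is known and record the glyph data for 0-9. Then, for every newly downloaded dynamic font file, compare its glyph data against that reference to recover the mapping.

We first need an xml file with a known mapping to serve as the reference; name it data.xml. Then open its corresponding woff in the Baidu font editor to read off the mapping. (I had already deleted the woff file that matches my data.xml, so the screenshot here shows the mapping of an arbitrary woff file and may differ from the mapping used in the code below.)

Create the dictionary that holds the mapping for data.xml: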

data_dict = {"uniE184":"4","uniE80B":"3","uniF22E":"8","uniE14C":"0",
		"uniF5FB":"6","uniEE59":"5","uniEBD3":"1","uniED85":"7","uniECB8":"2","uniE96A":"9"}

Comparing glyph data requires parsing the xml files, so we create a few xml helper functions:

Get the value of a given attribute of a node:

def getValue(node, attribute):
	return node.attributes[attribute].value

Glyph data is stored in TTGlyph tags. Create a function that returns all glyph nodes in an xml file:

import os
import xml.dom.minidom as xmldom

def getTTGlyphList(xml_path):
	dataXmlfilepath = os.path.abspath(xml_path)
	dataDomObj = xmldom.parse(dataXmlfilepath)
	dataElementObj = dataDomObj.documentElement
	dataTTGlyphList = dataElementObj.getElementsByTagName('TTGlyph')
	return dataTTGlyphList

A function that decides whether two TTGlyph nodes contain the same data:

def isEqual(ttglyph_a, ttglyph_b):
	a_pt_list = ttglyph_a.getElementsByTagName('pt')
	b_pt_list = ttglyph_b.getElementsByTagName('pt')
	a_len = len(a_pt_list)
	b_len = len(b_pt_list)
	if a_len != b_len:
		return False
	for pt_a, pt_b in zip(a_pt_list, b_pt_list):
		for attr in ('x', 'y', 'on'):
			if getValue(pt_a, attr) != getValue(pt_b, attr):
				return False
	return True

===============================================

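With the helper functions in place, we can continue:

Because the unicode code points are generated randomly each time, we also need to know which code points the new font uses for 0-9. For convenience, I fetch the GlyphOrder mentioned above (the one whose mapping is wrong) with a function call and turn it into a dictionary whose keys are the unicode glyph names and whose values are digits: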

decode_dict = dict(enumerate(font1.getGlyphOrder()[2:]))
decode_dict = dict(zip(decode_dict.values(),decode_dict.keys()))
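
To make the two lines above concrete, the sketch below runs them on a plain list that stands in for `font1.getGlyphOrder()`. The list copies the GlyphOrder listing shown earlier, so no font file is needed; `same_dict` is a name I introduce only for comparison.

```python
# Stand-in for font1.getGlyphOrder(), copied from the GlyphOrder listing above.
glyph_order = ['glyph00000', 'x', 'uniF753', 'uniEA72', 'uniEE4E', 'uniECE6',
               'uniE140', 'uniF4B0', 'uniE1B7', 'uniF245', 'uniE488', 'uniE6DA']

# Same two steps as the original code: index -> name, then inverted to name -> index.
decode_dict = dict(enumerate(glyph_order[2:]))
decode_dict = dict(zip(decode_dict.values(), decode_dict.keys()))

# An equivalent single-step form.
same_dict = {name: index for index, name in enumerate(glyph_order[2:])}

print(decode_dict['uniF753'], decode_dict['uniE6DA'])
```

The values are only positions in the glyph order, which is why they still have to be corrected against the glyph data.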

Load the glyph nodes of data.xml (whose mapping is known) and of the new dynamic font file:

dataTTGlyphList = getTTGlyphList("data.xml")
tmpTTGlyphList = getTTGlyphList("tmp.xml")

Use the glyph data to update the mapping dictionary:

decode_dict = refresh(decode_dict,tmpTTGlyphList,dataTTGlyphList)

The update function is implemented as follows:

def refresh(decode_dict, ttGlyphList_a, ttGlyphList_data):
	# Known mapping for the reference file data.xml
	data_dict = {"uniE184":"4","uniE80B":"3","uniF22E":"8","uniE14C":"0",
		"uniF5FB":"6","uniEE59":"5","uniEBD3":"1","uniED85":"7","uniECB8":"2","uniE96A":"9"}
	data_keys = data_dict.keys()
	for ttglyph_data in ttGlyphList_data:
		if getValue(ttglyph_data,'name') in data_keys:
			# Find the glyph in the new font with the same outline and copy the digit over
			for ttglyph_a in ttGlyphList_a:
				if isEqual(ttglyph_a, ttglyph_data):
					decode_dict[getValue(ttglyph_a,'name')] = data_dict[getValue(ttglyph_data,'name')]
					break
	return decode_dict
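
The nested loops above compare every reference glyph against every glyph of the new font. One possible alternative is to turn each glyph's points into a hashable signature so that matching becomes a dictionary lookup. The sketch below is not the original implementation: it uses `xml.etree.ElementTree` instead of minidom, the names `glyph_signatures` and `build_mapping` are hypothetical, and the two-point glyphs are toy data rather than real outlines.

```python
import xml.etree.ElementTree as ET

def glyph_signatures(xml_text):
    """Map each TTGlyph name to a hashable tuple of its (x, y, on) points."""
    root = ET.fromstring(xml_text)
    signatures = {}
    for glyph in root.iter('TTGlyph'):
        signatures[glyph.get('name')] = tuple(
            (pt.get('x'), pt.get('y'), pt.get('on')) for pt in glyph.iter('pt')
        )
    return signatures

def build_mapping(reference_xml, reference_digits, new_xml):
    """Return {glyph name in the new font: digit} by matching identical outlines."""
    ref_sigs = glyph_signatures(reference_xml)
    digit_by_outline = {
        ref_sigs[name]: digit
        for name, digit in reference_digits.items()
        if name in ref_sigs
    }
    return {
        name: digit_by_outline[sig]
        for name, sig in glyph_signatures(new_xml).items()
        if sig in digit_by_outline
    }

# Toy reference font with a known mapping (made-up outlines).
REFERENCE_XML = """<ttFont><glyf>
<TTGlyph name="uniE184"><contour><pt x="0" y="0" on="1"/><pt x="10" y="20" on="1"/></contour></TTGlyph>
<TTGlyph name="uniE80B"><contour><pt x="5" y="5" on="1"/><pt x="7" y="9" on="0"/></contour></TTGlyph>
</glyf></ttFont>"""

# Toy "new" font: same outlines, different glyph names and order.
NEW_XML = """<ttFont><glyf>
<TTGlyph name="uniEA72"><contour><pt x="5" y="5" on="1"/><pt x="7" y="9" on="0"/></contour></TTGlyph>
<TTGlyph name="uniF245"><contour><pt x="0" y="0" on="1"/><pt x="10" y="20" on="1"/></contour></TTGlyph>
</glyf></ttFont>"""

mapping = build_mapping(REFERENCE_XML, {"uniE184": "4", "uniE80B": "3"}, NEW_XML)
print(mapping)
```

Like the pairwise comparison, this only works when outlines are exactly identical between fonts; if the site ever perturbs point coordinates, a tolerance-based comparison would be needed instead.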

To handle decimal values, add a mapping for the decimal point:

decode_dict['.'] = '.'

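Implement the decode function (input: the mapping dictionary and a value to decode; output: the decoded result, e.g. 15.6):

def decode(decode_dict, code):
	# repr() turns each private-use character into a literal "\uXXXX" escape
	_lst_unicode = []
	for item in code.__repr__().split("\\u"):
		_lst_unicode.append("uni" + item[:4].upper())
		# Anything after the 4 hex digits (such as ".") is kept as its own key
		if item[4:]:
			_lst_unicode.append(item[4:])
	# Drop the entries produced by the quotes of repr();
	# this assumes the value starts and ends with an encoded digit
	_lst_unicode = _lst_unicode[1:-1]
	result = "".join([str(decode_dict[i]) for i in _lst_unicode])
	return result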
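
The function above depends on how repr() escapes private-use characters. As a sketch of an alternative, the same decoding can be done character by character with ord(). The name `decode_chars` and the sample input are my own assumptions. Unlike the original, this version passes unknown characters through unchanged instead of raising a KeyError.

```python
def decode_chars(decode_dict, text):
    """Decode text char by char: look up 'uniXXXX' keys, keep other characters as-is."""
    result = []
    for ch in text:
        key = 'uni%04X' % ord(ch)
        # Fall back to the character itself (e.g. '.') when it is not an obfuscated digit.
        result.append(str(decode_dict.get(key, ch)))
    return ''.join(result)

# Two entries taken from data_dict above; the input string is a made-up example.
sample_dict = {"uniF5FB": "6", "uniE184": "4"}
print(decode_chars(sample_dict, '\uf5fb.\ue184'))
```

Because the decimal point falls through the lookup, this variant does not need a separate '.' entry in the mapping dictionary.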

==================================================

Link to the full code
