1. Introduction
Device model strings collected from Android phones mostly look like this:
mi|5
mi|4c
mi 4c
2014022
kiw-al10
nem-tl00h
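The collected model strings are messy and inconsistent, which makes statistical analysis awkward, so labeling them with standard names is especially important.
ZOL (中關村在線) describes most phones sold in China, including the model string (e.g. nem-tl00h) and its common name (榮耀暢玩5C). This suggests the following strategy for labeling models automatically:
- Search for the model string on Sogou, adding the qualifier site:detail.zol.com.cn so that the first result comes from the ZOL site;
- Follow the link of the first result to the corresponding ZOL page, then parse out the label name and the phone's alias.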
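2. Implementation
Following the crawling strategy above, I implemented a simple crawler in Python. It parses HTML pages with PyQuery, which manipulates HTML elements with a jQuery-like syntax, so anyone familiar with jQuery can pick it up right away.
The code of the Sogou crawler (based on Python 3.5.2) is as follows: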
# -*- coding: utf-8 -*-
# @Time : 2016/8/8
# @Author : rain
import codecs
import csv
import logging
import re
import time
import urllib.parse
import urllib.request
import urllib.error
from pyquery import PyQuery as pq

def quote_url(model_name):
    base_url = "https://www.sogou.com/web?query=%s"
    site_zol = "site:detail.zol.com.cn "
    return base_url % (urllib.parse.quote(site_zol + model_name))

def parse_sogou(model_name):
    search_url = quote_url(model_name)
    # Send a browser-like User-Agent with the search request.
    request = urllib.request.Request(url=search_url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/45.0.2454.101 Safari/537.36'})
    sogou_html = urllib.request.urlopen(request).read()
    sogou_dom = pq(sogou_html)
    # Link of the first search result.
    goto_url = sogou_dom("div.results>.vrwrap>.vrTitle>a[target='_blank']").eq(0).attr("href")
    logging.warning("goto url: %s", goto_url)
    if goto_url is None:
        return None
    # Fetch the page behind the link and pull the quoted URL out of its <script> text.
    goto_dom = pq(url=goto_url)
    script_text = goto_dom("script").text()
    matches = re.findall(r'\("(.*)"\)', script_text)
    # Return None instead of raising IndexError when no URL is found.
    return matches[0] if matches else None

def parse_zol(model_name):
    zol_url = parse_sogou(model_name)
    if zol_url is None:
        return None, None
    try:
        zol_html = urllib.request.urlopen(zol_url).read()
    except urllib.error.HTTPError as e:
        logging.exception(e)
        return None, None
    zol_dom = pq(zol_html)
    title = zol_dom(".page-title.clearfix")
    name = title("h1").text()
    alias = title("h2").text()
    # Split titles of the form "name(alias)". The parentheses must be matched
    # literally; full-width parentheses are accepted as well (an assumption
    # about how ZOL formats its titles).
    match_result = re.match(r'(.*)[(（](.*)[)）]', name)
    if match_result:
        name = match_result.group(1)
        alias = match_result.group(2) + " " + alias
    return name, alias

if __name__ == "__main__":
    with codecs.open("./resources/data.txt", 'r', 'utf-8') as fr:
        with open("./resources/result.csv", 'w', newline='') as fw:
            writer = csv.writer(fw, delimiter=',')
            for model in fr.readlines():
                model = model.rstrip()
                label_name, label_alias = parse_zol(model)
                writer.writerow([model, label_name, label_alias])
                logging.warning("model: %s, name: %s, alias: %s", model, label_name, label_alias)
                time.sleep(10)
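To avoid being blocked by Sogou, the crawler sleeps for 10 s after every request. Crawling at this rate is of course very slow, so some optimization is needed.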
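One simple way to reduce the number of requests, given the 10 s pause per request, is to query each distinct model only once. The sketch below is a minimal illustration under stated assumptions: normalize_model, label_models and the lookup parameter are hypothetical names that do not appear in the original code, and treating mi|4c and mi 4c as the same device is an assumption, not something the data guarantees.

```python
import re


def normalize_model(raw):
    """Lower-case the string and collapse '|' and whitespace into one space.

    Assumption: 'mi|4c' and 'mi 4c' denote the same device.
    """
    return re.sub(r"[|\s]+", " ", raw.strip().lower())


def label_models(models, lookup):
    """Call lookup once per distinct normalized model and reuse the result."""
    cache = {}
    results = []
    for raw in models:
        key = normalize_model(raw)
        if key not in cache:
            # Only this branch would trigger a network request.
            cache[key] = lookup(key)
        results.append((raw, cache[key]))
    return results
```

In the crawler above, parse_zol could be passed as lookup, with the time.sleep(10) call moved inside the cache-miss branch so that repeated models cost no waiting time.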
3. Optimization
Downloading CAPTCHAs
Sogou blocks clients based on request frequency; when there are too many requests, it asks for a CAPTCHA:
<div class="content-box">
  <p class="ip-time-p">IP:61...<br/>Access time:2016.08.09 15:40:04</p>
  <p class="p2">Dear user, your requests are too frequent. To confirm that this visit is normal user behavior, we need your help with verification.</p>
  ...
  <form name="authform" method="POST" id="seccodeForm" action="/">
    <p class="p4">
      ...
      <input type="hidden" name="m" value="0"/> <span class="s1">
        <a onclick="changeImg2();" href="javascript:void(0)">
          <img id="seccodeImage" onload="setImgCode(1)" onerror="setImgCode(0)" src="util/seccode.php?tc=1470728404" width="100" height="40" alt="請輸入圖中的驗證碼" title="請輸入圖中的驗證碼"/>
        </a>
      </span>
      <a href="javascript:void(0);" id="change-img" onclick="changeImg2();" style="padding-left:50px;">Change image</a>
      <span class="s2" id="error-tips" style="display: none;"/>
    </p>
  </form>
  ...
</div>
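A crawler can tell that it has received this verification page by looking for the seccodeImage element. The following is a minimal sketch using only the standard library's html.parser module; SeccodeFinder and find_seccode_src are hypothetical helper names, not part of the original code, and the input is assumed to be the response body already decoded to a string.

```python
from html.parser import HTMLParser


class SeccodeFinder(HTMLParser):
    """Records the src attribute of <img id="seccodeImage">, if present."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags such as <img ... />.
        attributes = dict(attrs)
        if tag == "img" and attributes.get("id") == "seccodeImage":
            self.src = attributes.get("src")


def find_seccode_src(html_text):
    """Return the CAPTCHA image src, or None if this is not a CAPTCHA page."""
    finder = SeccodeFinder()
    finder.feed(html_text)
    return finder.src
```

A return value of None means the page is a normal result page; any other value is the relative path of the CAPTCHA image.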
Analyzing the HTML shows that the real CAPTCHA image URL has to be assembled as follows:
http://weixin.sogou.com/antispider/util/seccode.php?tc=1470728404
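The same URL can be built with urllib.parse.urljoin rather than plain string concatenation. This is a small sketch that takes the base path http://weixin.sogou.com/antispider/ from the URL above; seccode_url and seccode_token are hypothetical helper names introduced only for illustration.

```python
from urllib.parse import parse_qs, urljoin, urlparse

# Base path taken from the assembled URL shown above.
ANTISPIDER_BASE = "http://weixin.sogou.com/antispider/"


def seccode_url(img_src, base=ANTISPIDER_BASE):
    """Resolve the relative src of the CAPTCHA image against the base path."""
    return urljoin(base, img_src)


def seccode_token(img_src):
    """Return the value of the tc query parameter, or None if it is missing."""
    values = parse_qs(urlparse(img_src).query).get("tc")
    return values[0] if values else None
```

The tc value is convenient as a file name for the saved image, since it differs between the CAPTCHA pages seen here.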
Download the CAPTCHA images to local disk:
import urllib.request
from pyquery import PyQuery as pq
import re

for i in range(100):
    html = urllib.request.urlopen("https://www.sogou.com/web?query=treant").read()
    dom = pq(html)
    img_src = dom("#seccodeImage").attr("src")
    if img_src is not None:
        img_name = re.search("tc=(.*)", img_src).group(1)
        anti_img_url = "http://weixin.sogou.com/antispider/" + img_src
        urllib.request.urlretrieve(anti_img_url, "./images/" + img_name + ".jpg")
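I tried recognizing the CAPTCHAs with tesseract, but the results were only mediocre; I will look into other recognition methods when I have time.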
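For reference, one way to run tesseract on a downloaded image is through its command-line interface. This is only a sketch, assuming the tesseract executable is installed and on PATH; tesseract_command and recognize are hypothetical helper names, and no recognition-tuning options are set.

```python
import subprocess


def tesseract_command(image_path):
    """Build the command line; 'stdout' as output base prints the text."""
    return ["tesseract", image_path, "stdout"]


def recognize(image_path):
    """Run tesseract on one image and return the recognized text.

    Assumes the tesseract executable is installed and on PATH.
    """
    result = subprocess.run(tesseract_command(image_path),
                            stdout=subprocess.PIPE, check=True)
    return result.stdout.decode("utf-8").strip()
```

Calling recognize on a file saved by the download loop above, for example an image under ./images/, would return whatever text tesseract detects, which may well be wrong for distorted CAPTCHA images.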
