Python Crawler Study (10): Patent Search Demo

Reposted article, 2016-12-23 19:12. Tag: crawler

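This is a slightly more involved demo. It does two things:

Given a patent number, download the corresponding patent document
Given a keyword, download all related patent documents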

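0. Prerequisites

First, requests, the workhorse of crawling, which needs no introduction.
Second, install tesseract-ocr, pytesseract and PIL (distributed today as the Pillow package); these are used to recognize the CAPTCHA.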

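1. Simulated login

We need to analyze the site 專利檢索及分析 (Patent Search and Analysis). After a lot of poking around, the download link was nowhere to be found.
Tell me why? It turned out I was not logged in. Sure enough, after logging in the download button appears, and a manual download succeeds.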

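[Screenshot: CAPTCHA]

Note that both logging in and downloading also require typing in a CAPTCHA.

So the first problem to solve is CAPTCHA recognition and simulated login.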

apt-get install tesseract-ocr  
pip install pytesseract
pip install Pillow
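
Raw OCR output usually carries stray whitespace or punctuation, so it helps to clean it before putting it in the login form. The sketch below (Python 3) is one way to do that and to assemble the login fields that the script in section 6 posts. The helper names `clean_ocr_text` and `build_login_form` are made up for illustration, and the base64 step is an assumption based on the two credential strings in that script, which look base64-encoded.

```python
import base64
import re


def clean_ocr_text(raw):
    """Keep only ASCII letters and digits from raw OCR output."""
    return re.sub(r"[^0-9A-Za-z]", "", raw)


def build_login_form(username, password, captcha, success_url):
    """Assemble the login fields used by the script in section 6."""
    def b64(text):
        # Assumption: the site expects base64-encoded credentials.
        return base64.b64encode(text.encode("utf-8")).decode("ascii")

    return {
        "j_validation_code": clean_ocr_text(captcha),
        "j_loginsuccess_url": success_url,
        "j_username": b64(username),
        "j_password": b64(password),
    }


# Placeholder values, not real credentials.
form = build_login_form("user", "secret", " 4x7k\n", "http://example.invalid/")
print(form["j_validation_code"])
```

The returned dict can be passed as `data=` to a `requests` session's `post` call.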

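2. Keyword search request

Once logged in, we type a keyword and run a search.

[Screenshot js2: the search request in the browser's developer tools]

The browser sends a POST request to the URL shown in the screenshot above. In the form data, searchExp is our keyword.

So a POST request of our own is enough to get the search results.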
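
To make the step from response to results concrete, here is a small Python 3 sketch that pulls the hit count and the document IDs out of a result page. The two regular expressions are taken from the script in section 6; their Chinese literals are copied as they appear there and must match the page text character for character. The function name `parse_search_page` and the sample HTML (including the placeholder IDs) are made up for illustration.

```python
import re

# Patterns copied from the section 6 script.
TOTAL_RE = re.compile(r"&nbsp;共.*?頁&nbsp;(.*?)條數據", re.S)
ID_RE = re.compile(
    r"javascript:viewLitera_search\('.*?','(.*?)','single'\)", re.S
)


def parse_search_page(html):
    """Return (total_hits, [document ids]) for one result page."""
    match = TOTAL_RE.search(html)
    total = int(match.group(1)) if match else 0
    return total, ID_RE.findall(html)


# Hand-written stand-in for a real response body.
sample = (
    "<a href=\"javascript:viewLitera_search('0','CN2016100000010','single')\">A</a>"
    "<a href=\"javascript:viewLitera_search('1','US0000000A','single')\">B</a>"
    "&nbsp;共&nbsp;3&nbsp;頁&nbsp;25條數據"
)
print(parse_search_page(sample))
```

Returning 0 hits when the count is missing lets the caller stop cleanly instead of raising on an unexpected page, for example a login page served after the session expired.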

3. Pagination

Hover over "next page" and you can see that the page change is done by calling JavaScript.

[Screenshot fy2: the pagination request]

On closer inspection, a POST request to the URL in the screenshot above also turns the page.
Note the parameters in the form data:

"resultPagination.limit":"10",
"resultPagination.sumLimit":"10",
"resultPagination.start":cnt, # index of the first search result shown on the current page
"resultPagination.totalCount":total, # total number of search results
"searchCondition.searchType":"Sino_foreign",
"searchCondition.dbId":"",
"searchCondition.extendInfo['MODE']":"MODE_GENERAL",
"searchCondition.searchExp":keywords, # the keywords we typed
"wee.bizlog.modulelevel":"0200101",
"searchCondition.executableSearchExp":executableSearchExp, # we have to build this ourselves
"searchCondition.literatureSF":literatureSF, # we have to build this ourselves
"searchCondition.strategy":"",
"searchCondition.searchKeywords":"",
"searchCondition.searchKeywords":keywords # the keywords we typed

So to turn the page we only need to POST to showSearchResult-startWa.shtml,
updating the resultPagination.start parameter each time.
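
The offset arithmetic can be sketched as follows (Python 3). With 10 results per page, resultPagination.start takes the values 0, 10, 20, and so on, up to the last offset below the total. The helper names `page_starts` and `build_page_form` are made up for illustration, and the two constructed expressions follow the way the script in section 6 builds them.

```python
def page_starts(total, limit=10):
    """Offsets to send as resultPagination.start, one per page."""
    return range(0, total, limit)


def build_page_form(keywords, start, total, limit=10):
    """Form data for one POST to showSearchResult-startWa.shtml."""
    return {
        "resultPagination.limit": str(limit),
        "resultPagination.sumLimit": str(limit),
        "resultPagination.start": start,
        "resultPagination.totalCount": total,
        "searchCondition.searchType": "Sino_foreign",
        "searchCondition.dbId": "",
        "searchCondition.extendInfo['MODE']": "MODE_GENERAL",
        "searchCondition.searchExp": keywords,
        "wee.bizlog.modulelevel": "0200101",
        # Built the same way as in the section 6 script.
        "searchCondition.executableSearchExp": "VDB:(TBI='" + keywords + "')",
        "searchCondition.literatureSF": "復合文本=(" + keywords + ")",
        "searchCondition.strategy": "",
        "searchCondition.searchKeywords": keywords,
    }


for start in page_starts(25):
    form = build_page_form("robot", start, 25)
    print(form["resultPagination.start"])
```

Driving the loop from `page_starts` keeps the stop condition in one place, so no request is sent for an offset past the last result.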

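4. Viewing a document

To download a patent document we first have to go to another page, and that jump is again done through JavaScript.

Note that the argument passed in the jump is the patent number. We save these numbers, because the later POST requests need them.

[Screenshot xz2: the request that opens the document view]

As the screenshot shows, the JavaScript opens a new window, showViewList.
The field viewQC.viewLiteraQCList[0].searchCondition.executableSearchExp is critical here,
and it is exactly why we saved the patent numbers earlier.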

'viewQC.viewLiteraQCList[0].srcCnName':cnName,
'viewQC.viewLiteraQCList[0].srcEnName':srcEnName,
'viewQC.viewLiteraQCList[0].searchStrategy':'',
'viewQC.viewLiteraQCList[0].searchCondition.executableSearchExp':condition,
'viewQC.viewLiteraQCList[0].searchCondition.sortFields':'-APD,+PD',
'viewQC.needSearch':'true',
'viewQC.type':'SEARCH',
'wee.bizlog.modulelevel':'0200604'
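
As a sketch of how these fields can be filled in from a saved patent number (Python 3): the script in section 6 inserts a dot after the 14th character of IDs containing "CN" and wraps the result in a VDB:(ID='...') expression. The helper names `to_real_id` and `build_view_form` and the sample IDs are made up for illustration.

```python
def to_real_id(doc_id):
    """Insert the dot that the section 6 script adds to Chinese IDs."""
    if "CN" in doc_id:
        return doc_id[:14] + "." + doc_id[14:]
    return doc_id


def build_view_form(doc_id, keywords):
    """Form data for the POST to showViewList.shtml."""
    statement = "復合文本=(" + keywords + ")"
    prefix = "viewQC.viewLiteraQCList[0]."
    return {
        prefix + "srcCnName": "檢索式:" + statement,
        prefix + "srcEnName": "SearchStatement:" + statement,
        prefix + "searchStrategy": "",
        prefix + "searchCondition.executableSearchExp":
            "VDB:(ID='" + to_real_id(doc_id) + "')",
        prefix + "searchCondition.sortFields": "-APD,+PD",
        "viewQC.needSearch": "true",
        "viewQC.type": "SEARCH",
        "wee.bizlog.modulelevel": "0200604",
    }


# Placeholder ID, not a real patent number.
print(to_real_id("CN2016100000010"))
```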

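5. Downloading a document

[Screenshot: xz21]

Clicking the download button opens a new dialog, and three requests show up in the Network panel.
The third one is the CAPTCHA image.

[Screenshot: xz22]

[Screenshot: xz222]

After the CAPTCHA is typed in by hand, it is first checked through validateMask.shtml, which returns an encrypted mask string.
A POST request to downloadLitera.do then completes the download. Its parameters have to be worked out beforehand.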
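
The two-step exchange can be sketched like this (Python 3). The accepted reply is the sample quoted in a comment of the section 6 script; the shape of a rejected reply is an assumption. The download form repeats the checkItems field many times, so it is built as a list of tuples, which `requests` accepts for `data=`; a dict would keep only the last value per key. The helper names are made up, and the field order is assumed not to matter to the server.

```python
import json

CHECK_ITEMS = [
    "abstractCheck", "TIVIEW", "APO", "APD", "PN", "PD", "ICST", "PAVIEW",
    "INVIEW", "PR", "ABVIEW", "ABSIMG", "fullTextCheck", "fullImageCheck",
]
CHECKBOXES = ["abstractCheck", "fullTextCheck", "fullImageCheck"]


def extract_mask(body):
    """Return the encrypted mask, or None if the CAPTCHA was rejected."""
    reply = json.loads(body)
    return reply.get("mask") if reply.get("pass") else None


def build_download_form(idlist, mask):
    """Form data for downloadLitera.do, keeping the repeated fields."""
    form = [("wee.bizlog.modulelevel", "02016")]
    form += [("checkItems", item) for item in CHECK_ITEMS]
    form += [("__checkbox_checkItems", item) for item in CHECKBOXES]
    form += [
        ("idList[0].id", idlist[0]),
        ("idList[0].pn", idlist[3]),   # index mapping as in section 6
        ("idList[0].an", idlist[2]),
        ("idList[0].lang", idlist[4]),
        ("mask", mask),
    ]
    return form


accepted = ('{"downloadCount":2,"downloadItems":null,'
            '"mask":"1a75026a-5138-4460-a35e-5ef60258d1d0",'
            '"pass":true,"sid":null}')
print(extract_mask(accepted))
```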

6. Preliminary code

Many details still need attention, so treat this as a first draft only:

#coding=utf-8
import re
import requests
from PIL import Image  # provided by Pillow; a bare "import Image" fails with it
from pytesseract import image_to_string


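def get_verification_code(url):
    # Download the CAPTCHA image through the shared session `s`, save it,
    # and run Tesseract OCR on it. The OCR result is not guaranteed correct.
    src = s.get(url).content
    with open('temp_pic',"wb") as f:
        f.write(src)
    pic=Image.open(r'./temp_pic')
    return image_to_string(pic).strip()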

login_url = 'http://www.pss-system.gov.cn/sipopublicsearch/wee/platform/wee_security_check'
host_url = 'http://www.pss-system.gov.cn/sipopublicsearch/portal/index.shtml'
pic_url = 'http://www.pss-system.gov.cn/sipopublicsearch/portal/login-showPic.shtml'
pic_mask = 'http://www.pss-system.gov.cn/sipopublicsearch/search/validateCode-showPic.shtml?params=2595D550022F3AC2E5D76ED4CAFD4D8E'
search_url = 'http://www.pss-system.gov.cn/sipopublicsearch/search/smartSearch-executeSmartSearch.shtml'
show_page = 'http://www.pss-system.gov.cn/sipopublicsearch/search/showSearchResult-startWa.shtml'
show_list = 'http://www.pss-system.gov.cn/sipopublicsearch/search/search/showViewList.shtml'
mask_check_url = 'http://www.pss-system.gov.cn/sipopublicsearch/search/validateMask.shtml'
download_url = 'http://www.pss-system.gov.cn/sipopublicsearch/search/downloadLitera.do'

down_head = {
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding':'gzip, deflate',
    'Accept-Language':'en-US,en;q=0.8',
    'Cache-Control':'max-age=0',
    'Connection':'keep-alive',
    'Content-Type':'application/x-www-form-urlencoded',
    'Origin':'http://www.pss-system.gov.cn',
    'Referer':'http://www.pss-system.gov.cn/sipopublicsearch/search/search/showViewList.shtml',
    'Upgrade-Insecure-Requests':'1',
    'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36'
}


s = requests.session()
cookies = dict(cookies_are='working')
s.get(host_url)

vcode = get_verification_code(pic_url)
login_data = {
    'j_validation_code':vcode,
    'j_loginsuccess_url':'http://www.pss-system.gov.cn/sipopublicsearch/portal/index.shtml',
    'j_username':'emhhbmdzYW4xMjM=',
    'j_password':'emhhbmdzYW4xMjM='
}

s.post(login_url,data=login_data,cookies=cookies)
print "login!!\n-------------------"

keywords = raw_input("please input keywords: ")
cnt = 0

search_data = {
    'searchCondition.searchExp':keywords,
    'searchCondition.dbId':'VDB',
    'searchCondition.searchType':'Sino_foreign',
    'wee.bizlog.modulelevel':'0200101'
}


result = s.post(search_url,data=search_data,cookies=cookies).content
total = int(re.search(r'&nbsp;共.*?頁&nbsp;(.*?)條數據',result,re.S).group(1))
print "total:",total
all_result = re.findall('javascript:viewLitera_search\(\'.*?\',\'(.*?)\',\'single\'\)',result,re.S)

executableSearchExp = "VDB:(TBI=" + "'" + keywords + "')"
literatureSF = "復合文本=(" + keywords + ")"

while cnt <= total:
    cnt += 10
    for cur in all_result:
        real_id = cur
        if 'CN' in cur :
            real_id = cur[:14] + '.' + cur[14:]

        condition = r"VDB:(ID='" + real_id + r"')"
        cnName = '檢索式:復合文本=' + '(' + keywords + ')'
        srcEnName = 'SearchStatement:復合文本=' + '(' + keywords + ')'


        print real_id

        data_cur = {
            'viewQC.viewLiteraQCList[0].srcCnName':cnName,
            'viewQC.viewLiteraQCList[0].srcEnName':srcEnName,
            'viewQC.viewLiteraQCList[0].searchStrategy':'',
            'viewQC.viewLiteraQCList[0].searchCondition.executableSearchExp':condition,
            'viewQC.viewLiteraQCList[0].searchCondition.sortFields':'-APD,+PD',
            'viewQC.needSearch':'true',
            'viewQC.type':'SEARCH',
            'wee.bizlog.modulelevel':'0200604'
        }


        show = s.post(show_list,data = data_cur,cookies=cookies).content

        tmp = re.search('literaList\[0\] = \{(.*?)\};',show,re.S)
        if tmp is None:
            break

        idlist = re.findall('"(.*?)"',tmp.group(1).replace(' ',''),re.S)

        # read the download CAPTCHA with OCR
        vcode = get_verification_code(pic_mask)
        print vcode

        # submit the CAPTCHA text and get the encrypted mask back
        mask_data = {
            '':'',
            'wee.bizlog.modulelevel':'02016',
            'mask':vcode
        }
        kao = s.post(mask_check_url,data=mask_data,cookies=cookies).content
        #{"downloadCount":2,"downloadItems":null,"mask":"1a75026a-5138-4460-a35e-5ef60258d1d0","pass":true,"sid":null}
        mask_jm = re.search(r'"mask":"(.*?)"',kao,re.S).group(1)
        #print mask_jm

        # Repeated form fields go in a list of tuples: in a dict literal only
        # the last 'checkItems' and '__checkbox_checkItems' would survive.
        data_down = [
            ('wee.bizlog.modulelevel','02016'),
            ('checkItems','abstractCheck'),
            ('__checkbox_checkItems','abstractCheck'),
            ('checkItems','TIVIEW'),
            ('checkItems','APO'),
            ('checkItems','APD'),
            ('checkItems','PN'),
            ('checkItems','PD'),
            ('checkItems','ICST'),
            ('checkItems','PAVIEW'),
            ('checkItems','INVIEW'),
            ('checkItems','PR'),
            ('checkItems','ABVIEW'),
            ('checkItems','ABSIMG'),
            ('idList[0].id',idlist[0]),
            ('idList[0].pn',idlist[3]),
            ('idList[0].an',idlist[2]),
            ('idList[0].lang',idlist[4]),
            ('checkItems','fullTextCheck'),
            ('__checkbox_checkItems','fullTextCheck'),
            ('checkItems','fullImageCheck'),
            ('__checkbox_checkItems','fullImageCheck'),
            ('mask',mask_jm)
        ]


        down_page = s.post(download_url,data=data_down,headers=down_head,cookies=cookies).content

        with open(cur + ".zip", "wb") as f:
            f.write(down_page)

    kao_data = {
        "resultPagination.limit":"10",
        "resultPagination.sumLimit":"10",
        "resultPagination.start":cnt,
        "resultPagination.totalCount":total,
        "searchCondition.searchType":"Sino_foreign",
        "searchCondition.dbId":"",
        "searchCondition.extendInfo['MODE']":"MODE_GENERAL",
        "searchCondition.searchExp":keywords,
        "wee.bizlog.modulelevel":"0200101",
        "searchCondition.executableSearchExp":executableSearchExp,
        "searchCondition.literatureSF":literatureSF,
        "searchCondition.strategy":"",
        "searchCondition.searchKeywords":"",
        "searchCondition.searchKeywords":keywords
    }
    result = s.post(show_page,data=kao_data,cookies=cookies).content
    print "next page"
    all_result = re.findall('javascript:viewLitera_search\(\'.*?\',\'(.*?)\',\'single\'\)',result,re.S)
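
One of the details worth handling is a misread CAPTCHA: OCR is not reliable, and when the mask check fails, the `re.search(...).group(1)` call in the script above would likely raise because there is no mask to extract. A minimal Python 3 sketch of a retry wrapper follows. The name `retry` and the attempt limit are made up, and the stand-in action only simulates "fetch a new CAPTCHA, submit it, return the mask or None".

```python
def retry(action, attempts=5):
    """Call action() until it returns a non-None value or attempts run out."""
    for _ in range(attempts):
        result = action()
        if result is not None:
            return result
    return None


# Stand-in for the real CAPTCHA round trip: fails twice, then succeeds.
replies = iter([None, None, "mask-value"])
print(retry(lambda: next(replies)))
```

Capping the attempts matters here, since each retry costs two more requests to the site.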

7. Results

[Screenshot: xx1]

8. TODO

I may keep updating the code.
