Python Web Crawling: How to Recognize CAPTCHAs

Reposted article, 2018-01-21 20:37. Tag: Python web crawling

Some websites require a CAPTCHA at login. The site we test today, the Patent Search and Analysis system, is one example.

http://www.pss-system.gov.cn/sipopublicsearch/portal/uilogin-forwardLogin.shtml

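The key to logging in to a site like this is recognizing the CAPTCHA. So how is that done? Start with the page source. On this page the CAPTCHA is obtained by downloading an image, and the image address is src=/sipopublicsearch/portal/login-showPic.shtml

A Fiddler capture confirms this: the browser requests the image address above, receives a JPEG image, and displays it.

In Scrapy, then, the first step is to download the image to local disk, and the second is to recognize it.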

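import requests

def parse(self, response):
    # src attribute of the CAPTCHA element (id="codePic")
    ret = response.xpath('//*[@id="codePic"]/@src').extract()
    image_source = ret[0]
    # resolve the relative src against the page URL
    image_url = response.urljoin(image_source)
    r = requests.get(image_url)
    with open('E://scrapy_project/image2.JPEG', "wb") as code:
        code.write(r.content)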

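The code first extracts the value of src, then downloads the image with requests and saves it. Opening the saved file shows the downloaded CAPTCHA image.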
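The extraction and URL-joining step can also be tried outside Scrapy. The following is a minimal sketch using only the Python 3 standard library (the article itself ran on Python 2.7). The names `CodePicFinder`, `captcha_url`, and `SAMPLE_HTML` are hypothetical, and the sample markup assumes the CAPTCHA element is an `img` tag; only the `id` and `src` values come from the page described above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CodePicFinder(HTMLParser):
    """Collect the src attribute of the element whose id is "codePic"."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # Keep only the first matching element.
        if attr_map.get("id") == "codePic" and self.src is None:
            self.src = attr_map.get("src")


def captcha_url(page_url, html):
    """Return the absolute CAPTCHA image URL, or None if no codePic element exists."""
    finder = CodePicFinder()
    finder.feed(html)
    if finder.src is None:
        return None
    # Resolve the relative src against the page URL, as response.urljoin does.
    return urljoin(page_url, finder.src)


LOGIN_URL = "http://www.pss-system.gov.cn/sipopublicsearch/portal/uilogin-forwardLogin.shtml"
# Hypothetical markup: only the id and src values come from the article.
SAMPLE_HTML = '<img id="codePic" src="/sipopublicsearch/portal/login-showPic.shtml">'

print(captcha_url(LOGIN_URL, SAMPLE_HTML))
```

Returning None when the element is missing avoids the IndexError that `ret[0]` would raise on a page without a CAPTCHA.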

The next step is to recognize the CAPTCHA in the image, which requires the pytesseract and PIL libraries.

First install Tesseract-OCR: download the installer and run it. The default installation path is C:\Program Files\Tesseract-OCR. Add that path to the Path variable under System Properties.
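Before moving on, it can help to confirm that the executable is actually reachable. The sketch below is one possible check, not part of the original article; `find_tesseract` is a hypothetical helper, and the executable name `tesseract.exe` inside the installation directory is an assumption.

```python
import os
import shutil


def find_tesseract(fallback_dir):
    """Return a path to the tesseract executable, or None if it cannot be found."""
    # shutil.which searches the directories listed in PATH.
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Otherwise look in the given installation directory (assumed file name).
    candidate = os.path.join(fallback_dir, "tesseract.exe")
    return candidate if os.path.isfile(candidate) else None


# Default installation directory mentioned in the article.
print(find_tesseract(r"C:\Program Files\Tesseract-OCR"))
```

If PATH cannot be changed, pytesseract also lets a script point at the executable directly by setting `pytesseract.pytesseract.tesseract_cmd` to the path found here.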

Then install pytesseract and PIL with pip (PIL is distributed today as the Pillow package). Usage looks like this:

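from PIL import Image
from pytesseract import image_to_string

im = Image.open('E:\\scrapy_project\\image2.JPEG')
im = im.convert('L')  # convert() returns a new image; keep the grayscale copy
# Tesseract 3.x spells the flag -psm; Tesseract 4 and later expect --psm
ret = image_to_string(im, config='-psm 7')
print(ret)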

Running the script prints the CAPTCHA text recognized from the image.

image_to_string should be given a psm N setting. The values are explained below; for a CAPTCHA we generally choose mode 7.

-psm N

    Set Tesseract to only run a subset of layout analysis and assume a certain form of image. The options for N are:

    0 = Orientation and script detection (OSD) only.

    1 = Automatic page segmentation with OSD.

    2 = Automatic page segmentation, but no OSD, or OCR.

    3 = Fully automatic page segmentation, but no OSD. (Default)

    4 = Assume a single column of text of variable sizes.

    5 = Assume a single uniform block of vertically aligned text.

    6 = Assume a single uniform block of text.

    7 = Treat the image as a single text line.

    8 = Treat the image as a single word.

    9 = Treat the image as a single word in a circle.

    10 = Treat the image as a single character.
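Since the flag spelling depends on the Tesseract version (`-psm` in 3.x, `--psm` from 4.0 onward), the config string can be built by a small helper. This is a sketch only: `psm_config` and its `tesseract_major` parameter are hypothetical names, and the range check covers just the modes listed above.

```python
def psm_config(mode, tesseract_major=3):
    """Build the config string for a page segmentation mode between 0 and 10."""
    if not 0 <= mode <= 10:
        raise ValueError("psm mode must be between 0 and 10, got %r" % (mode,))
    # Tesseract 3.x uses -psm; 4.0 and later use --psm.
    flag = "-psm" if tesseract_major < 4 else "--psm"
    return "%s %d" % (flag, mode)


print(psm_config(7))                     # -psm 7
print(psm_config(8, tesseract_major=4))  # --psm 8
```

The returned string can be passed as the `config` argument of `image_to_string`. Mode 8 may be worth trying when the CAPTCHA is a single word with no spaces.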

