Python Crawler Primer (4): CAPTCHAs, Part 2 (Cracking a Simple CAPTCHA)

Reposted article, dated 2016-02-29 11:33. Tags: crawler / python

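　　I wrote the first CAPTCHA article before the Lunar New Year and had meant to write this follow-up much sooner. The holiday kept me busy, and CAPTCHA cracking is messy: different methods give different accuracy. I kept looking for a better approach, but the good ones are fairly specialized and involve image processing, which I only half understand, so this post was delayed. My apologies to everyone who has been waiting. If you are interested, you can read up on that material yourself. Since we are all beginners here, this time I will take a fairly simple CAPTCHA as the example and walk through the process. Enough preamble; let's get started.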

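　　1.) Obtaining the CAPTCHA

　　The previous article already covered how to fetch a CAPTCHA, so I will not repeat it here. Below is a CAPTCHA I fetched from another website (at the end I link an archive of CAPTCHA samples, so anyone who wants to practice can download it and look for a more accurate approach).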

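　　2.) Analyzing the CAPTCHA

　　a.) Analyzing the sample space

　　As the CAPTCHA above shows, the image contains five characters in total: operand 1, the operator, operand 2, and the two-character word "等於" ("equals"). Only the first three characters carry information, so those are the ones we extract. The operands range over 0~9, and the operator is one of two characters, "加" (plus) or "乘" (times). The sample space therefore has 12 classes: 10 for the operands and 2 for the operator.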

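　　b.) Determining the crop regions

　　Windows users can open the CAPTCHA in the built-in Paint program, which shows the following information.

　　First, the CAPTCHA is 80*30 pixels, that is, 80 pixels wide and 30 pixels high. If we lay a coordinate system over it, the origin (0,0) is the top-left corner, the x axis runs to the right (0<=x<80) and the y axis runs downward (0<=y<30). (10,17) is the coordinate of the current mouse position (the crosshair in the image), which helps us work out the crop regions. The crop regions I used are:

　　Operand 1 and operand 2 should preferably be cropped to the same size, so that the two operands can share sample data. For example, region = (3,4,16,17), where (3,4) is the top-left corner and (16,17) is the bottom-right corner, which defines a rectangle of size (16-3, 17-4), i.e. 13 pixels wide and 13 pixels high.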
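To double-check the crop boxes, the arithmetic can be scripted. The sketch below is only illustrative: `box_size` is a hypothetical helper that is not part of the original code, and the three boxes are the ones used in splitImage.py in the appendix. PIL's crop() takes a (left, upper, right, lower) box, so the width is right minus left and the height is lower minus upper.

```python
# Hypothetical helper (not in the original code): size of a PIL-style
# crop box given as (left, upper, right, lower).
def box_size(box):
    left, upper, right, lower = box
    return (right - left, lower - upper)

# Crop boxes taken from splitImage.py in the appendix.
REGIONS = {
    "operand1": (3, 4, 16, 17),
    "operator": (18, 4, 33, 17),
    "operand2": (33, 4, 46, 17),
}

for name, box in REGIONS.items():
    print(name, box_size(box))
```

Both operand boxes come out at 13 by 13, which is what lets them share one sample file; the operator box is wider.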

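　　3.) Processing the CAPTCHA (here I use Python's "PIL" image processing library)

　　　a.) Converting to grayscale

　　　　PIL supports this well. We can call:

　　　　img.convert("L")

　　　　This returns a copy of img converted to a 256-level grayscale image. convert() is a method of the image object; it takes a mode argument that specifies the color mode. mode can take values such as: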

　　　　· 1 (1-bit pixels, black and white, stored with one pixel per byte)

　　　　· L (8-bit pixels, black and white)

　　　　· P (8-bit pixels, mapped to any other mode using a colour palette)

　　　　· RGB (3x8-bit pixels, true colour)

　　　　· RGBA (4x8-bit pixels, true colour with transparency mask)

　　　　· CMYK (4x8-bit pixels, colour separation)

　　　　· YCbCr (3x8-bit pixels, colour video format)

　　　　· I (32-bit signed integer pixels)

　　　　· F (32-bit floating point pixels)

　　　　The code is as follows:

from PIL import Image
image = Image.open("H:\\authcode\\origin\\code3.jpg")
imgry = image.convert("L")
imgry.show()

　　　　Result:

　　　　Next, binarize the image:

from PIL import Image
image = Image.open("H:\\authcode\\origin\\code3.jpg")
imgry = image.convert("L")
# imgry.show()
threshold = 100
table = []
for i in range(256):
    if i < threshold:
        table.append(0)
    else:
        table.append(1)
out = imgry.point(table,'1')
out.show()

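　　　　Result:

　　　　At this point the image is a fairly clean black-and-white picture.

　　　　Notes on the code:

　　　　　　a). threshold = 100 is the cutoff: gray levels below it are mapped to 0 (black) and the rest to 1 (white). The right value depends on the image. A more rigorous approach is to derive it from the image's grayscale histogram; in practice you can simply try several values and see which works best.

　　　　　　b). The other functions all come with PIL; if anything is unclear, consult its documentation.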
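For picking the threshold from the grayscale histogram, one common option is Otsu's method, which chooses the cut that maximizes the between-class variance. This is not what the code above does (it hard-codes 100), so treat the following as a sketch under assumptions: `otsu_threshold` is a hypothetical helper that takes a 256-bin histogram, such as the list returned by `imgry.histogram()` on a mode "L" image. A synthetic histogram is used here so the sketch runs without an image file.

```python
def otsu_threshold(hist):
    """Return the threshold t (pixels < t count as dark) that maximizes
    the between-class variance of a 256-bin grayscale histogram."""
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_dark = 0      # number of pixels with value < t
    sum_dark = 0    # sum of the values of those pixels
    for t in range(1, 256):
        w_dark += hist[t - 1]
        sum_dark += (t - 1) * hist[t - 1]
        w_light = total - w_dark
        if w_dark == 0 or w_light == 0:
            continue  # one class is empty, no valid split here
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Synthetic two-peak histogram: 50 dark pixels at 30, 150 light pixels at 200.
hist = [0] * 256
hist[30] = 50
hist[200] = 150
threshold = otsu_threshold(hist)
print(threshold)
```

The returned value could stand in for the hard-coded threshold when building the lookup table, since it uses the same "below the threshold means black" convention.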

　　　　b.) Cropping the image

　　　　The code is as follows:

from PIL import Image
image = Image.open("H:\\authcode\\origin\\code3.jpg")
imgry = image.convert("L")
# imgry.show()
threshold = 100
table = []
for i in range(256):
    if i < threshold:
        table.append(0)
    else:
        table.append(1)
out = imgry.point(table,'1')
# out.show()
region = (3,4,16,17)
result = out.crop(region)
result.show()

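　　　　Result:

　　　　Changing the value of region crops out a different part of the image, and the crops can then be sorted by class. I put the crops of each digit into its own folder, with the following result:

　　4.) Extracting feature values

　　The feature extraction algorithm is a matter of personal choice. What I do here is draw two horizontal lines and two vertical lines across each cropped character and record how many black pixels of the character each line passes through. (Embarrassingly, the recognition rate of this scheme is not high; the point here is just to convey the idea.)

　　That is the idea. The four lines are: (vertical lines) x=3 and x=6, (horizontal lines) y=2 and y=11.

　　　The code is as follows: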

def yCount1(image):
    count = 0;
    x = 3
    for y in range(0,13):
        pixel = image.getpixel((x,y))
        if(pixel==0):
            count = count+1
    return count
def yCount2(image):
    count = 0;
    x = 6
    for y in range(0,13):
        pixel = image.getpixel((x,y))
        if(pixel==0):
            count = count+1
    return count
def xCount1(image):
    count = 0
    y = 2
    for x in range(0,13):
        pixel = image.getpixel((x,y))
        if(pixel==0):
            count = count+1
    return count
def xCount2(image):
    count = 0
    y = 11
    for x in range(0,13):
        pixel = image.getpixel((x,y))
        if(pixel==0):
            count = count+1
    return count
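
The four functions above differ only in which line they scan, so the same idea can be written once. The sketch below is a hypothetical generalization, not the author's code: `line_features` takes any callable that maps an (x, y) tuple to a pixel value (for a PIL image, `image.getpixel` fits), counts the black (0) pixels on the vertical lines x=3 and x=6 and on the horizontal lines y=2 and y=11, and joins the counts in the order used by the sample files. A stand-in pixel function replaces a real crop so that the sketch runs on its own.

```python
def line_features(getpixel, size=13, xs=(3, 6), ys=(2, 11)):
    """Count black (0) pixels along the vertical lines x in xs and the
    horizontal lines y in ys of a size-by-size binary image."""
    counts = []
    for x in xs:   # vertical lines: fix x, walk down y
        counts.append(sum(1 for y in range(size) if getpixel((x, y)) == 0))
    for y in ys:   # horizontal lines: fix y, walk across x
        counts.append(sum(1 for x in range(size) if getpixel((x, y)) == 0))
    return ":".join(str(c) for c in counts)

# Stand-in for a 13x13 binarized crop: white (1) everywhere except a
# black vertical stroke at x == 6, like a crude digit "1".
def fake_getpixel(xy):
    x, y = xy
    return 0 if x == 6 else 1

features = line_features(fake_getpixel)
print(features)
```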

　　Extracting feature values for the digits (0~9) gives results like the following:

2:5:3:3-0
2:2:2:3-0
5:2:2:4-0
2:2:2:0-0
2:4:2:0-0
6:2:3:3-0
0:3:3:2-0
2:5:3:3-0
2:1:3:5-1
2:1:3:5-1
1:6:3:4-1
1:8:3:2-1
1:8:3:3-1
1:6:3:4-1
1:5:3:3-1
1:3:3:5-1
2:1:3:5-1
1:6:3:3-1
1:7:3:2-1
1:5:3:3-1
1:7:3:4-1
1:8:3:2-1
2:1:2:5-1
2:1:1:2-1
1:8:3:2-1
2:1:2:5-1
1:7:0:1-1
2:1:2:5-1
6:1:2:1-1
0:6:3:1-1
0:6:2:1-1
1:7:2:1-1
5:1:2:3-1
1:3:3:5-1
2:7:2:2-1
6:1:2:1-1
2:1:2:3-1
5:1:1:0-1
1:6:3:3-1
1:7:3:2-1
1:7:3:4-1
5:1:2:3-1
2:1:1:1-1
1:6:0:1-1
4:1:2:3-1
1:1:2:4-1
5:1:2:1-1
0:5:2:2-1
2:1:2:4-1
1:5:3:5-1
5:1:3:3-1
1:8:3:2-1
1:5:3:3-1
2:1:2:5-1
2:1:1:2-1
2:1:2:5-1
2:1:2:5-1
2:1:2:5-1
2:1:2:5-1
1:8:3:2-1
2:1:2:5-1
1:5:3:3-1
2:1:3:5-1
3:2:2:2-2
4:1:1:1-2
3:3:2:6-2
3:3:4:4-2
2:3:2:3-2
3:3:2:6-2
2:3:3:3-2
2:3:3:3-2
3:5:3:6-2

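　　The last number is the label for that feature value. For example, 3:5:3:6-2 means that if an image yields 3:5:3:6, we take the value in the image to be 2.

　　This approach is error-prone.

　　First, a single feature value can belong to more than one digit. For example, 1:2:3:4 might belong to 2 or to 3, and that causes errors. (Workaround: take the most frequent result, although that can still be wrong.)

　　Second, a feature value may not be in our sample space at all. (Workaround: enlarge the sample space.)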
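
The "take the most frequent result" rule can be sketched with `collections.Counter`. The sample lines below are made up for illustration (only the `a:b:c:d-label` format and the 1:2:3:4 example come from the text), and `build_index` and `classify` are hypothetical names. Features are compared by exact match, and a feature that is absent from the samples yields None, which corresponds to the second kind of error.

```python
from collections import Counter, defaultdict

def build_index(lines):
    """Map each feature string to a Counter of the labels seen for it."""
    index = defaultdict(Counter)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        feature, label = line.rsplit("-", 1)
        index[feature][label] += 1
    return index

def classify(index, feature):
    """Return the most frequent label for a feature, or None if unseen."""
    votes = index.get(feature)
    if not votes:
        return None
    return votes.most_common(1)[0][0]

# Illustrative samples: 1:2:3:4 is ambiguous (label 2 once, label 3 twice).
samples = ["1:2:3:4-2", "1:2:3:4-3", "1:2:3:4-3", "3:5:3:6-2"]
index = build_index(samples)
print(classify(index, "1:2:3:4"))
print(classify(index, "3:5:3:6"))
print(classify(index, "9:9:9:9"))
```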

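　　5.) Verification

　　With the steps above completed, we can test the cracking.

　　The code is as follows (crackcode is something I wrote myself; see the appendix):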
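
A test run could measure how often the predicted answer matches the known one. The sketch below assumes a hypothetical list of (captcha, expected_answer) pairs and any `solve` callable with the same shape as `crackcode.getCodeResult`; neither the pairs nor the `accuracy` helper come from the original code. A stand-in solver replaces real images so that the sketch runs on its own.

```python
def accuracy(solve, labeled):
    """Fraction of (captcha, expected) pairs with solve(captcha) == expected."""
    if not labeled:
        return 0.0
    hits = 0
    for captcha, expected in labeled:
        try:
            if solve(captcha) == expected:
                hits += 1
        except Exception:
            # A captcha that cannot be processed counts as a miss.
            pass
    return hits / len(labeled)

# Stand-in solver and data: each "captcha" is already an (a, op, b) tuple.
def fake_solve(captcha):
    a, op, b = captcha
    return a * b if op == "*" else a + b

# The third expected answer is deliberately wrong to show a miss.
labeled = [((3, "+", 4), 7), ((2, "*", 5), 10), ((6, "+", 1), 8)]
rate = accuracy(fake_solve, labeled)
print(rate)
```

With real data, `solve` would open each image file and pass it to `crackcode.getCodeResult`, and the expected answers would have to be labeled by hand.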

　　Appendix:

　　crackcode.py

#encoding=utf8
import checknumber
import splitImage
import checkoperation
def getCodeResult(image):
    image1 = splitImage.getNumImage(image,1)
    image2 = splitImage.getNumImage(image,2)
    image3 = splitImage.getNumImage(image,3)
    num1 = checknumber.getnum(image1)
    num2 = checknumber.getnum(image2)
    operation =checkoperation.getoperation(image3)
    # print(str(num1) + ":" + str(operation) + ":" + str(num2))
    # Label 2 is treated as multiplication; any other label as addition.
    if int(operation) != 2:
        result = int(num1) + int(num2)
    else:
        result = int(num1) * int(num2)
    return result

　　checknumber.py　

#encoding=utf8
from PIL import Image
import test
import collections

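# Load the operand samples, one "a:b:c:d-label" entry per line.
with open("../src/school") as f:
    ips = [line.strip() for line in f if line.strip()]

def getnum(image):
    # newimage = test.handimage(image)
    newimage = image
    # Build the feature string; str() replaces the Python 2-only backticks.
    result = ":".join(str(n) for n in (
        test.yCount1(newimage), test.yCount2(newimage),
        test.xCount1(newimage), test.xCount2(newimage)))
    # Collect the labels of all samples whose feature part matches exactly
    # (a substring search would also match e.g. "11:2:3:4" for "1:2:3:4").
    result_ips = []
    for line in ips:
        parts = line.rsplit('-', 1)
        if len(parts) == 2 and parts[0] == result:
            result_ips.append(parts[1])
    d = collections.Counter(result_ips)
    if len(d.most_common(1)) == 0:
        return -1  # feature not found in the samples
    else:
        return d.most_common(1)[0][0]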

　　splitImage.py

#encoding=utf8
from PIL import Image

def getNumImage(image,type):
    imgry = image.convert("L")
    threshold = 100
    table = []
    for i in range(256):
        if i < threshold:
            table.append(0)
        else:
            table.append(1)
    out = imgry.point(table,'1')
    if type == 1:  # operand 1
        region = (3,4,16,17)
    elif type == 2:  # operand 2
        region = (33,4,46,17)
    else:  # operator
        region = (18,4,33,17)
    return out.crop(region)

　　checkoperation.py

#encoding=utf8
from PIL import Image
import test
import collections

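# Load the operator samples, one "a:b:c:d-label" entry per line.
with open("../src/operation") as f:
    ips = [line.strip() for line in f if line.strip()]

def getoperation(image):
    # newimage = test.handimage(image)
    newimage = image
    # Build the feature string; str() replaces the Python 2-only backticks.
    result = ":".join(str(n) for n in (
        test.yCount1(newimage), test.yCount2(newimage),
        test.xCount1(newimage), test.xCount2(newimage)))
    # Collect the labels of all samples whose feature part matches exactly
    # (a substring search would also match e.g. "11:2:3:4" for "1:2:3:4").
    result_ips = []
    for line in ips:
        parts = line.rsplit('-', 1)
        if len(parts) == 2 and parts[0] == result:
            result_ips.append(parts[1])
    d = collections.Counter(result_ips)
    if len(d.most_common(1)) == 0:
        return -1  # feature not found in the samples
    else:
        return d.most_common(1)[0][0]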

　　test.py

#encoding=utf8
from PIL import Image

def handimage(image):
    height = image.size[1]
    width = image.size[0]
    # print height,width
    for h in range(height):
        for w in range(width):
            pixel = image.getpixel((w,h))
            if(pixel<127):
                image.putpixel((w,h),0)
            else:
                image.putpixel((w,h),255)
    for h in range(height):
        for w in range(width):
            pixel = image.getpixel((w,h))
            # print pixel
    return image
def yCount1(image):
    count = 0;
    x = 3
    for y in range(0,13):
        pixel = image.getpixel((x,y))
        if(pixel==0):
            count = count+1
    return count
def yCount2(image):
    count = 0;
    x = 6
    for y in range(0,13):
        pixel = image.getpixel((x,y))
        if(pixel==0):
            count = count+1
    return count
def xCount1(image):
    count = 0
    y = 2
    for x in range(0,13):
        pixel = image.getpixel((x,y))
        if(pixel==0):
            count = count+1
    return count
def xCount2(image):
    count = 0
    y = 11
    for x in range(0,13):
        pixel = image.getpixel((x,y))
        if(pixel==0):
            count = count+1
    return count

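school and operation are the sample files for the operands and the operator respectively; you can build them yourself.
The CAPTCHA samples (500 of them) are on Baidu Cloud:
Link: http://pan.baidu.com/s/1hrv5w7y Password: igo6
That concludes the CAPTCHA cracking walkthrough.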

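　　Notes:

　　a). The code is for learning and discussion only.

　　b). If you spot mistakes, corrections are welcome.

　　c). Please credit the source when reposting.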
