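Web Data Scraping Core Techniques Series (3): How to Crack CAPTCHAs? Image Analysis? Feature Matching? Artificial Intelligence? Third-Party Integration? ... Which Is the Most Powerful?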

Reposted article. Originally posted 2012-08-06 16:16 under web數據采集和數據挖掘 (Web Data Scraping and Data Mining).

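First, a table of contents for easy navigation:

Web Data Scraping Core Techniques Series (1): What Do You Need to Build a Powerful Web Data Scraping System?

Web Data Scraping Core Techniques Series (2): How to Extract Information? Strings? Regex? XPath? XSLT? Custom? ... Which Is the Right Way?

Web Data Scraping Core Techniques Series (3): How to Crack CAPTCHAs? Image Analysis? Feature Matching? Artificial Intelligence? Third-Party Integration? ... Which Is the Most Powerful?

Web Data Scraping Core Techniques Series (4): Cracking Web CAPTCHAs with a Neural Network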

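At the request of several enthusiastic readers I have set up a QQ group: 254764602. Everyone is welcome to join, discuss, and learn from each other.

To join, enter the passphrase "數據采集" (data scraping); requests without it will not be accepted.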

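Straight to the topic. This one is a big subject and a fairly hard one, so a single post may not cover it. I will write one post first and then a second one based on your feedback.

As the saying goes, when virtue rises one foot, the devil rises ten. In the CAPTCHA field, the one-foot gain is easy and the ten-foot gain is very hard, so this post sticks to ordinary CAPTCHAs and does not dig into the unusual or truly nasty ones.

An ordinary CAPTCHA is usually an image containing a few characters, plus a background color, a foreground color, stray dots (noise), and interference lines. The characters may be slanted, warped, touching, deformed, or even handwritten. Cracking one comes down to a single sentence: remove the interference, simplify the features, match the features, and read off the code. I am not writing a textbook and cannot cover everything, so let's start with the simple cases.

As the flowchart in the original post shows, there are three ways to do the final step of guessing the code: simple image analysis plus feature matching; neural-network-based feature matching; and integrating a third-party Google component. Still more powerful approaches depend on integrating several third-party libraries (including C and C++ code) and are more complex. To keep things easy to follow, we start with the first approach.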

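Step 1: Get the CAPTCHA image

If you scrape by fetching the page source, extracting the image URL, and then requesting the image, there should be no problem. If instead you load the page in a browser, read the CAPTCHA URL from it, and then request that URL yourself, you will run into trouble: CAPTCHA URLs generally return a freshly generated image on each request, so a second request gets a different code. If you drive a browser, take the image directly from the browser.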
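For the page-source route, one way to keep the page request and the image request in the same session is a shared cookie container. The following is only a sketch, assuming .NET's HttpClient; the class name CaptchaFetchSketch, the method names, and both URL parameters are placeholders that do not appear in the original post. In practice the image URL would be extracted from the page HTML. The point it illustrates is that the image is requested exactly once and the returned bytes are reused afterwards.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper, not from the original post.
public static class CaptchaFetchSketch
{
    // Builds a client whose requests all share one cookie container,
    // so the page request and the image request belong to the same session.
    public static HttpClient CreateSessionClient(CookieContainer cookies)
    {
        var handler = new HttpClientHandler { CookieContainer = cookies, UseCookies = true };
        return new HttpClient(handler);
    }

    // Loads the page first (which typically sets the session cookie), then
    // requests the CAPTCHA image ONCE. Requesting imageUrl again would usually
    // produce a different code, so callers should keep the returned bytes.
    public static async Task<byte[]> FetchCaptchaOnceAsync(HttpClient client, Uri pageUrl, Uri imageUrl)
    {
        await client.GetStringAsync(pageUrl);
        return await client.GetByteArrayAsync(imageUrl);
    }
}
```

The same client should then be used to submit the form, so that the server sees the cookie that matches the image it generated.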

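Step 2: Undo the distortion

A slanted or deformed CAPTCHA is very hard to process further unless the distortion is undone first. How to undo it depends on the kind of distortion; no single method cures everything, so each case has to be handled on its own. For a slant, for example, find the four corners of the character region, compute how far its left and right edges lean, and then stretch the image in the opposite direction. (I did not study computer science or graphics algorithms; this is all personal experience, so please correct me where I am wrong.)

Here is a piece of code: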

using System;
using System.Drawing;

// The original post shows only the method body; the method name Deskew is ours.
// Assumes the text is pure black (R == 0) on a light background.
public static Bitmap Deskew(Bitmap input)
{
    int x = input.Width;
    int y = input.Height;
    int startPointsCount = 10;        // a row with more black pixels than this is text
    int[] yBlackCount = new int[y];   // black pixels per row

    Point leftTop = new Point(0, 0);
    Point leftBottom = new Point(0, 0);
    Point rightTop = new Point(0, 0);
    Point rightBottom = new Point(0, 0);

    // Count the black pixels in every row.
    for (int j = 0; j < y; j++)
    {
        for (int i = 0; i < x; i++)
        {
            if (input.GetPixel(i, j).R == 0)
            {
                yBlackCount[j]++;
            }
        }
    }

    for (int j = 1; j < y - 1; j++)
    {
        // Leftmost and rightmost black pixel of this row.
        Point letterStart = new Point(0, 0);
        Point letterEnd = new Point(0, 0);
        for (int i = 1; i < x - 1; i++)
        {
            if (input.GetPixel(i, j).R == 0)
            {
                letterStart = new Point(i, j);
                break;
            }
        }
        for (int i = x - 2; i > 0; i--)
        {
            if (input.GetPixel(i, j).R == 0)
            {
                letterEnd = new Point(i, j);
                break;
            }
        }
        if (yBlackCount[j] > startPointsCount && yBlackCount[j + 1] > yBlackCount[j] && leftTop.Y == 0)
        {
            // top of letters
            leftTop = letterStart;
            rightTop = letterEnd;
        }
        if (leftTop.Y > 0 && yBlackCount[j + 1] < startPointsCount && yBlackCount[j] > yBlackCount[j + 1] && leftBottom.Y == 0)
        {
            // bottom of letters
            leftBottom = letterStart;
            rightBottom = letterEnd;
        }
    }

    if (leftTop.Y == 0 || leftBottom.Y == 0)
    {
        return input; // no text block found, nothing to correct
    }

    // Horizontal drift of the left and right edges, scaled to the full image
    // height and clamped to +/-20 pixels.
    int lDistance = ((leftBottom.X - leftTop.X) * y) / (leftBottom.Y - leftTop.Y);
    int rDistance = ((rightBottom.X - rightTop.X) * y) / (rightBottom.Y - rightTop.Y);
    lDistance = Math.Max(-20, Math.Min(20, lDistance));
    rDistance = Math.Max(-20, Math.Min(20, rDistance));

    // The original drew onto the input itself and filled the background from
    // undefined `source` and `rectangle` variables. Here we draw onto a fresh
    // bitmap with a white background instead (our assumption).
    Bitmap output = new Bitmap(x, y);
    using (Graphics g = Graphics.FromImage(output))
    {
        g.Clear(Color.White);
        Point[] destinationPoints = {
            new Point(lDistance, 0),        // destination for upper-left point of original
            new Point(x + rDistance, 0),    // destination for upper-right point of original
            new Point(0, y)};               // destination for lower-left point of original
        g.DrawImage(input, destinationPoints);
    }

    return output;
}

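Other kinds of distortion are not covered here; each one has to be examined case by case.

Step 3: Grayscale conversion

On with the simple CAPTCHA pipeline. Honestly, any single point in web data scraping could fill a series of its own. A powerful scraping system is not something one person finishes in a month or two; you only appreciate how hard it is once you have built one and handed it to a customer to run. If everyone with a great solution open-sourced it along with the code, programmers' lives would be much easier. We are all in the same boat, so why not help each other? Sorry for the digression. Back to grayscale conversion: there is plenty of code for this online, but I will paste mine anyway for convenience. If you think simple code is not worth pasting, I will skip it next time.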

// Weighted luminance: green contributes most, blue least.
protected static Color Gray(Color c)
{
    int rgb = Convert.ToInt32((0.3 * c.R) + (0.59 * c.G) + (0.11 * c.B));
    return Color.FromArgb(rgb, rgb, rgb);
}

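Step 4: Binarization

Convert the image to pure black and white, commonly called binarization. Steps 3 and 4 both exist to simplify the features and lay the groundwork for later processing. The key to binarization is choosing the threshold, that is, deciding which pixels count as black and which count as white. Here is the code. (Why does the first or last line of pasted code always lose its formatting? Can anyone tell me?)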

// The original post shows only the method body; the method name Binarize is ours.
// ComputeThresholdValue is the author's helper and is not shown in the post.
public static Bitmap Binarize(Bitmap input)
{
    Bitmap output = new Bitmap(input.Width, input.Height);
    int tv = ComputeThresholdValue(input);
    int x = input.Width;
    int y = input.Height;
    int blackCount = 0;   // tallies, not used further in this excerpt
    int whiteCount = 0;
    for (int i = 0; i < x; i++)
    {
        for (int j = 0; j < y; j++)
        {
            // suppose the background is white, set the border to white
            if (i == 0 || i == x - 1 || j == 0 || j == y - 1)
            {
                output.SetPixel(i, j, Color.White);
                whiteCount++;
                continue;
            }
            if (input.GetPixel(i, j).R >= tv)
            {
                // white point, background
                output.SetPixel(i, j, Color.White);
                whiteCount++;
            }
            else
            {
                // black point, char
                output.SetPixel(i, j, Color.Black);
                blackCount++;
            }
        }
    }
    return output;
}
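The post does not show how ComputeThresholdValue picks the threshold. One common choice is Otsu's method, which selects the gray level that best separates the histogram into a dark class and a light class. The sketch below is an assumption about what such a helper could look like, not the author's implementation; it works on a plain array of gray values (0 to 255) rather than a Bitmap so that it stays self-contained, and the names ThresholdSketch and OtsuThreshold are ours.

```csharp
using System;

// Hypothetical stand-in for the threshold helper; not from the original post.
public static class ThresholdSketch
{
    // Otsu's method: try every gray level t as a split point and keep the one
    // that maximizes the between-class variance of "<= t" versus "> t".
    // Returns t + 1 so the result can be used with a ">= threshold is white" test.
    public static int OtsuThreshold(byte[] gray)
    {
        var hist = new int[256];
        foreach (byte g in gray) hist[g]++;

        long total = gray.Length;
        long sumAll = 0;
        for (int v = 0; v < 256; v++) sumAll += (long)v * hist[v];

        long countBelow = 0;      // number of pixels with value <= t
        long sumBelow = 0;        // sum of those pixel values
        double bestVariance = -1;
        int bestT = 0;
        for (int t = 0; t < 256; t++)
        {
            countBelow += hist[t];
            sumBelow += (long)t * hist[t];
            if (countBelow == 0) continue;        // dark class still empty
            long countAbove = total - countBelow;
            if (countAbove == 0) break;           // light class empty from here on

            double meanBelow = (double)sumBelow / countBelow;
            double meanAbove = (double)(sumAll - sumBelow) / countAbove;
            double diff = meanBelow - meanAbove;
            double variance = (double)countBelow * countAbove * diff * diff;
            if (variance > bestVariance)
            {
                bestVariance = variance;
                bestT = t;
            }
        }
        return bestT + 1;
    }
}
```

To use it with a Bitmap, one would collect the R channel of every grayscale pixel into the array first. A fixed threshold such as 128 also works for clean images; a histogram-based one simply adapts better when brightness varies between CAPTCHAs.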

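Step 5: Segmentation

The goal of segmentation is to pick the individual characters out of the string, because the features of a single character are much simpler to handle. The idea is to locate the character boundaries and then cut the image there. After the previous steps the image consists of black-and-white characters. Assuming a white background and black characters, counting the black pixels along the X and Y axes gives the character boundaries.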

using System.Collections.Generic;
using System.Drawing;

/// <summary>
/// Split picture, and get the codes into a list
/// </summary>
/// <param name="map">Binarized image: black characters on a white background.</param>
/// <returns>One 16x16 bitmap per character, left to right.</returns>
public static List<Bitmap> Split(Bitmap map)
{
    List<Bitmap> resultList = new List<Bitmap>();

    int x = map.Width;
    int y = map.Height;
    int maxNoisyWidth = 4; // a run narrower than 4 columns is treated as noise
    int maxNoisyCount = 4; // a row or column with fewer than 4 black pixels is treated as noise

    // black is char: count black pixels per column and per row
    int[] xBlackCount = new int[x];
    int[] yBlackCount = new int[y];
    for (int i = 0; i < x; i++)
    {
        for (int j = 0; j < y; j++)
        {
            if (map.GetPixel(i, j).R == 0)
            {
                xBlackCount[i]++;
                yBlackCount[j]++;
            }
        }
    }

    // Find the top and bottom of the text. -1 means "not found yet"; the
    // original used 0 as the marker, which missed text starting at row or column 0.
    int yStart = -1;
    int yEnd = -1;
    for (int j = 0; j < y; j++)
    {
        bool isText = yBlackCount[j] >= maxNoisyCount;
        if (isText && yStart < 0)
        {
            yStart = j;          // top of all chars
        }
        else if (!isText && yStart >= 0)
        {
            yEnd = j - 1;        // bottom of all chars
            break;
        }
    }
    if (yStart < 0)
    {
        return resultList;       // no text rows at all
    }
    if (yEnd < 0)
    {
        yEnd = y - 1;            // text runs to the last row
    }

    // Scan the columns; i == x acts as a closing sentinel so that a character
    // touching the right edge is still emitted.
    int xStart = -1;
    for (int i = 0; i <= x; i++)
    {
        bool isText = i < x && xBlackCount[i] >= maxNoisyCount;
        if (isText && xStart < 0)
        {
            xStart = i;          // start of a char
        }
        else if (!isText && xStart >= 0)
        {
            int xEnd = i - 1;    // end of a char
            if (xEnd - xStart + 1 >= maxNoisyWidth)
            {
                // crop the character and normalize it to 16x16
                Rectangle box = new Rectangle(xStart, yStart, xEnd - xStart + 1, yEnd - yStart + 1);
                using (Bitmap piece = map.Clone(box, map.PixelFormat))
                {
                    resultList.Add(new Bitmap(piece, 16, 16));
                }
            }
            xStart = -1;         // reset for the next char
        }
    }
    return resultList;
}
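To make the column-projection idea concrete without any imaging library, here is a small sketch that runs the same boundary search on a text grid. It is an illustration only; the names ProjectionSketch and FindColumnRuns, and the convention that '#' means a black pixel, are ours.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration of the projection technique; not from the original post.
public static class ProjectionSketch
{
    // rows: one string per image row, '#' = black pixel, anything else = white.
    // Returns the inclusive column ranges whose black-pixel count is at least minBlack.
    public static List<(int Start, int End)> FindColumnRuns(string[] rows, int minBlack)
    {
        int width = rows[0].Length;
        var counts = new int[width];   // black pixels per column
        foreach (string row in rows)
        {
            for (int i = 0; i < width; i++)
            {
                if (row[i] == '#') counts[i]++;
            }
        }

        var runs = new List<(int Start, int End)>();
        int start = -1;                // -1 means "not inside a character"
        for (int i = 0; i <= width; i++)   // i == width closes a run at the right edge
        {
            bool isText = i < width && counts[i] >= minBlack;
            if (isText && start < 0)
            {
                start = i;
            }
            else if (!isText && start >= 0)
            {
                runs.Add((start, i - 1));
                start = -1;
            }
        }
        return runs;
    }
}
```

For the three rows "##..#.", "##..#.", ".#..##" with minBlack set to 2, the column counts are 2, 3, 0, 0, 3, 1, so the runs are columns 0 to 1 and column 4. This approach only separates characters that have a clear gap between them; touching characters need a different strategy.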

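Step 6: Feature computation and matching

Segmentation leaves us with a set of images, one per character, and we compute a feature for each. The approach is to first turn the image into a matrix. (Don't know what a matrix is? Look it up; it is worth it.) Then use the power method to find the square matrix's largest eigenvalue and the corresponding eigenvector. (Don't know what a vector is either? Then I bet that you, like me, never passed college math. Ha.)

Next, compare that vector with the vectors in our knowledge base. The knowledge base is a set of vectors, each mapped to a character, and it is built by training the program by hand: you tell the program that this vector is a 2 and that one is a 3 (more on this later). Compute the distance between the vectors; the vector at the smallest distance has the most similar features, meaning the two characters look alike, so we treat them as the same character and that gives the result.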

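// The original post shows only the method body; the method name Recognize is ours.
// MatrixLab, SampleVector, _metric and _studyList belong to the author's code
// base and are not shown in the post.
public string Recognize(Bitmap input)
{
    string output = "";
    input = new Bitmap(input, 16, 16);

    // 16x16 matrix: 1 for a white pixel, 0 for anything else
    Double[,] doublemap = new Double[input.Width, input.Height];
    for (int i = 0; i < input.Width; i++)
    {
        for (int j = 0; j < input.Height; j++)
        {
            if (input.GetPixel(i, j).R == 255)
            {
                doublemap[i, j] = 1.0;
            }
            else
            {
                doublemap[i, j] = 0.0;
            }
        }
    }

    // Per the text above: W receives the eigenvector, max the largest eigenvalue.
    Double[] W = new double[input.Width];
    Double max = 0;
    MatrixLab mat = new MatrixLab(input.Width, 0.001, doublemap);
    mat.returnResult(ref W, ref max);

    // Find the trained sample whose vector is closest to this one.
    SampleVector vector = new SampleVector(W, "");
    Double minDistance = Double.MaxValue;
    SampleVector similarVector = null;
    foreach (SampleVector target in this._studyList)
    {
        double distance = _metric.Compute(vector, target);
        if (distance < minDistance)
        {
            similarVector = target;
            output = target.Code;
            minDistance = distance;
        }
    }
    return output;
}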
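Since MatrixLab and the distance metric are not shown, here is a self-contained sketch of the two techniques the text names: the power method for the dominant eigenvalue and eigenvector, and a nearest-neighbor lookup. Everything in it is an assumption made for illustration: the names EigenSketch, DominantEigenvector, Distance and Nearest are ours, Euclidean distance is our guess at the metric, and the tolerance parameter plays the role that 0.001 appears to play in the MatrixLab constructor.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for MatrixLab and _metric; not from the original post.
public static class EigenSketch
{
    // Power method: multiply a start vector by the matrix over and over,
    // rescaling each time. The scale factor approaches the dominant eigenvalue
    // and the vector approaches the matching eigenvector.
    public static double[] DominantEigenvector(double[,] a, double tolerance,
                                               int maxIterations, out double eigenvalue)
    {
        int n = a.GetLength(0);
        var v = new double[n];
        for (int i = 0; i < n; i++) v[i] = 1.0;
        eigenvalue = 0;

        for (int iter = 0; iter < maxIterations; iter++)
        {
            var next = new double[n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    next[i] += a[i, j] * v[j];

            // Scale by the component with the largest magnitude (sign kept).
            double scale = 0;
            for (int i = 0; i < n; i++)
                if (Math.Abs(next[i]) > Math.Abs(scale)) scale = next[i];
            if (scale == 0) break;   // the matrix maps v to zero; give up

            for (int i = 0; i < n; i++) next[i] /= scale;

            bool converged = Math.Abs(scale - eigenvalue) < tolerance;
            eigenvalue = scale;
            v = next;
            if (converged) break;
        }
        return v;
    }

    // Euclidean distance between two vectors of equal length.
    public static double Distance(double[] p, double[] q)
    {
        double sum = 0;
        for (int i = 0; i < p.Length; i++) sum += (p[i] - q[i]) * (p[i] - q[i]);
        return Math.Sqrt(sum);
    }

    // Returns the code of the trained sample closest to the query vector,
    // or an empty string when there are no samples.
    public static string Nearest(double[] query, List<(string Code, double[] Vector)> samples)
    {
        string best = "";
        double minDistance = double.MaxValue;
        foreach (var sample in samples)
        {
            double d = Distance(query, sample.Vector);
            if (d < minDistance) { minDistance = d; best = sample.Code; }
        }
        return best;
    }
}
```

The power method is not guaranteed to converge for every matrix (for example when two eigenvalues share the largest magnitude), which is why the sketch caps the iteration count. Note also that a single eigenvector is a very compact feature: two different characters can produce similar vectors, so recognition quality depends heavily on how many samples the knowledge base contains.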

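That covers the simplest principles; the next post goes deeper. As I said at the start, this is the first post, and I will write the second one based on your feedback. Discussion is welcome.

This series on core web data scraping techniques is about sharing ideas, and all the code is there to support the explanation. If you want to know how to build a complete scraping system, please be patient; later posts will cover that. If you are not interested in the ideas and just want to copy the code, press F5, and click your way to scraped data, I ask for your understanding.

PS: My abilities are limited. Although I have worked in web data scraping for many years, I cannot offer the best solution or approach for every aspect of it. Please offer criticism and corrections in a spirit of mutual learning and growing together. Comments are welcome.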

