Python 3: Handwritten Digit Recognition with Machine Learning Models

Reposted article, 2018-01-09 09:38. Tags: Python, machine learning, sklearn

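0. Introduction

This post shows how to generate handwritten-digit data, extract features, and use sklearn machine learning models to build and test classifiers that recognize the handwritten digits 1-9.

Models used:

1. LR, Logistic Regression (linear model)

2. Linear SVC, Linear Support Vector Classification (support vector machine)

3. MLPC, Multi-Layer Perceptron Classification (neural network)

4. SGDC, Stochastic Gradient Descent Classification (linear model trained with stochastic gradient descent)

Handwritten digit recognition is a classification problem: features extracted from the image are the model input, and the output is a digit label from 1 to 9.

Main content:

1. Generate a handwritten-digit dataset;

2. Extract image features and store them in a CSV file;

3. Build and test handwritten-digit recognition models with machine learning.

(If you want to generate your own dataset, see my other blog post: http://www.cnblogs.com/AdaminXie/p/8379749.html)

The source code is on my GitHub: https://github.com/coneypo/ML_handwritten_number . If you have questions, leave a comment or email me.

The result is a set of curves showing how the accuracy of each model changes with the number of samples:

Figure 0. Test accuracy of the four models for different sample counts (dataset size from 100 to 5800, in steps of 100)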

1. Development environment

Python: 3.6.3

import PIL, cv2, pandas, numpy, os, csv, random

sklearn imports needed:

from sklearn.linear_model import LogisticRegression     # logistic regression (linear model)
from sklearn.linear_model import SGDClassifier          # stochastic gradient descent classifier (linear model)
from sklearn.svm import LinearSVC                       # linear SVC (support vector machine)
from sklearn.neural_network import MLPClassifier        # multi-layer perceptron (neural network)

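2. Overall design

Figure 1. Overall framework

The goal of the project is to train machine learning models to recognize randomly generated CAPTCHA images (a single digit, 1-9), in three steps:

1. Generate the handwritten-digit dataset

2. Extract feature vectors and write them to a CSV file

3. Train and test the sklearn models

Figure 2. Overall design flow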

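3. Implementation

3.1 Generating single-digit CAPTCHA images ( generate_folders.py, generate_handwritten_numbers.py )

Figure 3. The generated single-digit CAPTCHA images

Dataset generation is described in detail in my other blog post ( Link: http://www.cnblogs.com/AdaminXie/p/8379749.html )

The idea is to pick a random digit from 1 to 9 with random, draw it with PIL's drawing tool, distort the image, and then save it into the folder that matches its true label 1-9, using the label plus a sequence number as the filename.

draw = ImageDraw.Draw(im)  # drawing tool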
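The generation script itself is in the linked post, so the following is only a minimal sketch of the steps just described, not the author's code. The function name make_digit_image, the default font, the small random rotation used as a stand-in for the distortion, and the "label_sequence.png" filename pattern are all assumptions made for illustration. It also assumes a Pillow version whose rotate() accepts fillcolor.

```python
import random

from PIL import Image, ImageDraw, ImageFont


def make_digit_image(digit, size=30, max_angle=15):
    """Draw one digit in black on a white size*size grayscale image."""
    im = Image.new("L", (size, size), 255)       # white background
    draw = ImageDraw.Draw(im)                    # drawing tool
    font = ImageFont.load_default()              # assumption: Pillow's default font
    draw.text((size // 3, size // 4), str(digit), fill=0, font=font)

    # Stand-in distortion: rotate by a small random angle, fill corners white
    im = im.rotate(random.uniform(-max_angle, max_angle), fillcolor=255)

    # Binarize so every pixel is exactly 0 (black) or 255 (white)
    return im.point(lambda p: 0 if p < 128 else 255)


digit = random.randint(1, 9)                     # true label
im = make_digit_image(digit)
filename = "{}_{}.png".format(digit, 1)          # hypothetical: label + sequence number
```

Binarizing at the end keeps the image compatible with a feature extractor that counts pixels equal to 0.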

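3.2 Extracting feature vectors and writing them to CSV ( get_features.py )

This step extracts features from the images. Each generated image is 30*30, that is, 900 pixels.

To reduce the dimensionality, the input is not the gray value of each of the 900 pixels. Instead it is the number of black pixels in each of the 30 rows plus the number of black pixels in each of the 30 columns, which brings the input down to 60 dimensions.

(a) Extracting 900-dimensional features

(b) Extracting 60-dimensional features

Figure 4. Extracting image features

Extracting the features is straightforward: walk through the rows and columns and count the black pixels: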

def get_features_single(img):
    # Extract a 60-dimensional feature vector from a 30*30 image:
    # the black-pixel count of each row, followed by that of each column.
    # Assumes a single-channel image in which black pixels have value 0.
    width, height = img.size

    # Kept as a module-level list because the CSV-writing code also reads it
    global pixel_cnt_list
    pixel_cnt_list = []

    # Count the black pixels in each row
    for y in range(height):
        pixel_cnt_x = 0
        for x in range(width):
            if img.getpixel((x, y)) == 0:  # black pixel
                pixel_cnt_x += 1
        pixel_cnt_list.append(pixel_cnt_x)

    # Count the black pixels in each column
    for x in range(width):
        pixel_cnt_y = 0
        for y in range(height):
            if img.getpixel((x, y)) == 0:  # black pixel
                pixel_cnt_y += 1
        pixel_cnt_list.append(pixel_cnt_y)

    return pixel_cnt_list
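The same row and column counts can also be computed without explicit pixel loops. The sketch below is an alternative, not the author's implementation: the name row_col_black_counts is hypothetical, and it assumes the image has already been converted to a 2-D array (for a single-channel PIL image, np.asarray(img) gives one with shape height*width).

```python
import numpy as np


def row_col_black_counts(pixels):
    """Return the row counts followed by the column counts of zero-valued pixels."""
    black = (np.asarray(pixels) == 0)        # True where the pixel is black
    row_counts = black.sum(axis=1)           # one count per row
    col_counts = black.sum(axis=0)           # one count per column
    return np.concatenate([row_counts, col_counts]).tolist()


# Tiny 3*3 example: 0 is black, 255 is white
demo = [[0, 255, 255],
        [0, 0, 255],
        [255, 255, 255]]
features = row_col_black_counts(demo)        # [1, 2, 0, 2, 1, 0]
```

For a 30*30 image this returns 60 values in the same order as the loops above: 30 row counts, then 30 column counts.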

The next job is to walk through every image file in the folders num_1 to num_9, extract its features, and write them to a CSV file:

# path_csv, path_images and sum_images are assumed to be defined earlier in the script
with open(path_csv + "tmp.csv", "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    # Visit the folders num_1 to num_9
    for i in range(1, 10):
        path_num = path_images + "num_" + str(i)
        print(path_num)
        # Only list the folder if it exists
        if os.path.isdir(path_num):
            num_list = os.listdir(path_num)
            print("num_list:", num_list)
            print("Number of samples:", len(num_list))
            sum_images = sum_images + len(num_list)

            # Traverse every single image to generate its features
            for j in range(0, len(num_list)):
                # Open one image file and extract its features
                img = Image.open(path_num + "/" + num_list[j])
                pixel_cnt_list = get_features_single(img)
                # The first character of the filename is the label
                pixel_cnt_list.append(num_list[j][0])

                # Write the row to the CSV
                writer.writerow(pixel_cnt_list)

Figure 5. The extracted CSV file (the first 60 columns are the input features, column 61 is the output label)
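To make the 61-column layout concrete, here is a small sketch that builds one row and reads it back. It uses an in-memory buffer in place of the real file; the helper make_csv_row and the filename "7_15.png" are hypothetical, chosen only to match the label-first naming described above.

```python
import csv
import io


def make_csv_row(features, filename):
    """Append the label (first character of the filename) to the features."""
    return list(features) + [filename[0]]


buffer = io.StringIO()                               # stands in for the CSV file
writer = csv.writer(buffer)
writer.writerow(make_csv_row([0] * 60, "7_15.png"))  # hypothetical filename

# Read the row back: 60 feature columns, then the label column
row = next(csv.reader(io.StringIO(buffer.getvalue())))
features = [int(v) for v in row[:60]]
label = int(row[60])
```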

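3.3 Training and testing the sklearn models ( ml_ana.py, test_single_images.py )

With the preparation done, we have a 61-column CSV file holding the 60-dimensional input features and the 1-dimensional output label;

this data can now be handed to sklearn's machine learning models.

3.3.1 Preparing the feature data

The first step is to load the data from the CSV file with pd.read_csv. When the CSV was written, the first 60 columns held the 60-dimensional feature vector and column 61 held the output label 1-9.

This uses the feature CSV extracted earlier: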

import pandas as pd
from sklearn.model_selection import train_test_split


# Read the data from the CSV
def pre_data():
    # Header names for the 61 CSV columns
    column_names = []

    for i in range(0, 60):
        column_names.append("feature_" + str(i))
    column_names.append("true_number")

    # Read the CSV
    path_csv = "../data/data_csvs/"
    data = pd.read_csv(path_csv + "data_10000.csv", names=column_names)

    # Split into training and test sets
    global X_train, X_test, y_train, y_test
    X_train, X_test, y_train, y_test = train_test_split(
        data[column_names[0:60]],
        data[column_names[60]],
        test_size=0.25,  # 75% for training, 25% for testing
        random_state=33
        )

The train_test_split function from sklearn splits the data into:

training data: X_train, y_train

test data: X_test, y_test
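As a quick check of what the split produces, the sketch below runs train_test_split with the same test_size and random_state on synthetic data. The arrays are random stand-ins, not the real data_10000.csv, so only the resulting shapes are meaningful.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.randint(0, 31, size=(200, 60))      # 200 fake samples, 60 counts each
y = rng.randint(1, 10, size=200)            # fake labels 1-9

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=33)

print(X_train.shape, X_test.shape)          # (150, 60) (50, 60)
```

Fixing random_state makes the split reproducible, so repeated runs compare the models on the same test rows.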

3.3.2 Training and testing the models

With all the preparation done, we can now build models with sklearn.

Each model is trained on the training data, evaluated on the test data, and then saved locally ( "/data/data_models/model_xxx.m" ).

ml_ana.py:

# created at 2018-01-29
# updated at 2018-09-28

# Author:   coneypo
# Blog:     http://www.cnblogs.com/AdaminXie
# GitHub:   https://github.com/coneypo/ML_handwritten_number


from sklearn.model_selection import train_test_split
import pandas as pd

from sklearn.preprocessing import StandardScaler     # standardization

# Models
from sklearn.linear_model import LogisticRegression  # logistic regression (linear model)
from sklearn.svm import LinearSVC                    # linear SVC (support vector machine)
from sklearn.neural_network import MLPClassifier     # multi-layer perceptron (neural network)
from sklearn.linear_model import SGDClassifier       # stochastic gradient descent classifier (linear model)

# Model persistence; joblib is a standalone package
# (sklearn.externals.joblib was removed in scikit-learn 0.23)
import joblib


# Read the data from the CSV
def pre_data():
    # Header names for the 61 CSV columns
    column_names = []

    for i in range(0, 60):
        column_names.append("feature_" + str(i))
    column_names.append("true_number")

    # Read the CSV
    path_csv = "../data/data_csvs/"
    data = pd.read_csv(path_csv + "data_10000.csv", names=column_names)

    # Split into training and test sets
    global X_train, X_test, y_train, y_test
    X_train, X_test, y_train, y_test = train_test_split(
        data[column_names[0:60]],
        data[column_names[60]],
        test_size=0.25,  # 75% for training, 25% for testing
        random_state=33
        )


path_saved_models = "../data/data_models/"


# LR, logistic regression classifier (linear model)
def way_LR():
    X_train_LR = X_train
    y_train_LR = y_train

    X_test_LR = X_test
    y_test_LR = y_test

    # Optional preprocessing (disabled)
    # ss_LR = StandardScaler()
    # X_train_LR = ss_LR.fit_transform(X_train_LR)
    # X_test_LR = ss_LR.transform(X_test_LR)

    # Initialize LogisticRegression
    LR = LogisticRegression()

    # Fit the model parameters on the training data
    LR.fit(X_train_LR, y_train_LR)

    # Predict X_test with the trained model
    # and store the result in y_predict_LR
    global y_predict_LR
    y_predict_LR = LR.predict(X_test_LR)

    # Mean accuracy on the test set
    global score_LR
    score_LR = LR.score(X_test_LR, y_test_LR)
    print("The accuracy of LR:", '\t', score_LR)

    # Save the model
    joblib.dump(LR, path_saved_models + "model_LR.m")

    return LR


# MLPC, multi-layer perceptron classifier (neural network)
def way_MLPC():
    X_train_MLPC = X_train
    y_train_MLPC = y_train

    X_test_MLPC = X_test
    y_test_MLPC = y_test

    # Optional preprocessing (disabled)
    # ss_MLPC = StandardScaler()
    # X_train_MLPC = ss_MLPC.fit_transform(X_train_MLPC)
    # X_test_MLPC = ss_MLPC.transform(X_test_MLPC)

    MLPC = MLPClassifier(hidden_layer_sizes=(13, 13, 13), max_iter=500)
    MLPC.fit(X_train_MLPC, y_train_MLPC)

    global y_predict_MLPC
    y_predict_MLPC = MLPC.predict(X_test_MLPC)

    global score_MLPC
    score_MLPC = MLPC.score(X_test_MLPC, y_test_MLPC)
    print("The accuracy of MLPC:", '\t', score_MLPC)

    # Save the model
    joblib.dump(MLPC, path_saved_models + "model_MLPC.m")

    return MLPC


# Linear SVC, linear support vector classifier (SVM)
def way_LSVC():
    X_train_LSVC = X_train
    y_train_LSVC = y_train

    X_test_LSVC = X_test
    y_test_LSVC = y_test

    # Optional preprocessing (disabled)
    # ss_LSVC = StandardScaler()
    # X_train_LSVC = ss_LSVC.fit_transform(X_train_LSVC)
    # X_test_LSVC = ss_LSVC.transform(X_test_LSVC)

    LSVC = LinearSVC()
    LSVC.fit(X_train_LSVC, y_train_LSVC)

    global y_predict_LSVC
    y_predict_LSVC = LSVC.predict(X_test_LSVC)

    global score_LSVC
    score_LSVC = LSVC.score(X_test_LSVC, y_test_LSVC)
    print("The accuracy of LSVC:", '\t', score_LSVC)

    # Save the model
    joblib.dump(LSVC, path_saved_models + "model_LSVC.m")

    return LSVC


# SGDC, linear classifier trained with stochastic gradient descent (linear model)
def way_SGDC():
    X_train_SGDC = X_train
    y_train_SGDC = y_train

    X_test_SGDC = X_test
    y_test_SGDC = y_test

    # Optional preprocessing (disabled)
    # ss_SGDC = StandardScaler()
    # X_train_SGDC = ss_SGDC.fit_transform(X_train_SGDC)
    # X_test_SGDC = ss_SGDC.transform(X_test_SGDC)

    SGDC = SGDClassifier(max_iter=5)

    SGDC.fit(X_train_SGDC, y_train_SGDC)

    global y_predict_SGDC
    y_predict_SGDC = SGDC.predict(X_test_SGDC)

    global score_SGDC
    score_SGDC = SGDC.score(X_test_SGDC, y_test_SGDC)
    print("The accuracy of SGDC:", '\t', score_SGDC)

    # Save the model
    joblib.dump(SGDC, path_saved_models + "model_SGDC.m")

    return SGDC


pre_data()
way_LR()
way_LSVC()
way_MLPC()
way_SGDC()
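The StandardScaler lines in ml_ana.py are commented out. If scaling were enabled, the fitted scaler would have to be applied again at prediction time, so it would need to be saved together with the model. One way to do that, sketched here as a variant rather than as the author's approach, is to wrap both in an sklearn pipeline and persist the pipeline. The synthetic data, the file name model_LR_scaled.m and the temporary directory are assumptions for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.randint(0, 31, size=(300, 60)).astype(float)   # synthetic features
y = rng.randint(1, 10, size=300)                        # synthetic labels 1-9

# The scaler and the classifier are fitted and saved as one object
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "model_LR_scaled.m")  # hypothetical name
joblib.dump(model, path)

# The restored pipeline scales and predicts in a single call
restored = joblib.load(path)
same = bool((restored.predict(X) == model.predict(X)).all())
```

With this layout the test script can call predict on raw feature vectors without knowing whether scaling was used during training.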

3.3.3 Testing ( test_single_images.py )

For a single handwritten digit image, extract its features and predict its label with the saved models:

# created at 2018-01-29
# updated at 2018-09-28

# Author:   coneypo
# Blog:     http://www.cnblogs.com/AdaminXie
# GitHub:   https://github.com/coneypo/ML_handwritten_number

# Use the trained models saved locally to predict the label of a single image

# joblib is a standalone package
# (sklearn.externals.joblib was removed in scikit-learn 0.23)
import joblib
from PIL import Image

img = Image.open("../test/test_1.png")

# Get features
from generate_datebase import get_features
features_test_png = get_features.get_features_single(img)

path_saved_models = "../data/data_models/"

# LR
LR = joblib.load(path_saved_models + "model_LR.m")
predict_LR = LR.predict([features_test_png])
print("LR:", predict_LR[0])

# LSVC
LSVC = joblib.load(path_saved_models + "model_LSVC.m")
predict_LSVC = LSVC.predict([features_test_png])
print("LSVC:", predict_LSVC[0])

# MLPC
MLPC = joblib.load(path_saved_models + "model_MLPC.m")
predict_MLPC = MLPC.predict([features_test_png])
print("MLPC:", predict_MLPC[0])

# SGDC
SGDC = joblib.load(path_saved_models + "model_SGDC.m")
predict_SGDC = SGDC.predict([features_test_png])
print("SGDC:", predict_SGDC[0])
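The four load-and-predict blocks follow the same pattern, so they could be folded into one helper. The sketch below is one possible shape for that, with a hypothetical predict_all function; to stay self-contained it trains two small models on synthetic data in place of the objects returned by joblib.load.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, SGDClassifier


def predict_all(models, features):
    """Return {model name: predicted label} for one feature vector."""
    sample = np.asarray(features).reshape(1, -1)   # predict() expects 2-D input
    return {name: model.predict(sample)[0] for name, model in models.items()}


# Synthetic stand-ins for the models loaded with joblib.load
rng = np.random.RandomState(0)
X = rng.randint(0, 31, size=(200, 60))
y = rng.randint(1, 10, size=200)
models = {
    "LR": LogisticRegression(max_iter=1000).fit(X, y),
    "SGDC": SGDClassifier(max_iter=1000, random_state=0).fit(X, y),
}

results = predict_all(models, X[0])
for name, label in results.items():
    print(name + ":", label)
```

The reshape to one row mirrors the [features_test_png] wrapping in the script: sklearn estimators predict on a 2-D array of samples, even when there is only one.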

3.3.4 Plotting accuracy against sample count

A plot shows the accuracy more intuitively:

# 2018-01-29
# By TimeStamp
# cnblogs: http://www.cnblogs.com/AdaminXie/
# plot_from_csv.py
# Read sample-count/accuracy data from a CSV file and plot it


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# CSV path
path_csv = "F:/***/P_ML_handwritten_number/data/score_csv/"

# Read the CSV data
column_names = ["samples", "acc_LR", "acc_LSVC", "acc_MLPC", "acc_SGDC"]
rd_csv = pd.read_csv(path_csv + "score_100to5800.csv", names=column_names)

print(rd_csv.shape)

# x-axis values (sample counts) and the accuracy of each model
xray = np.array(rd_csv["samples"], dtype=float)
y_LR = np.array(rd_csv["acc_LR"], dtype=float)
y_LSVC = np.array(rd_csv["acc_LSVC"], dtype=float)
y_MLPC = np.array(rd_csv["acc_MLPC"], dtype=float)
y_SGDC = np.array(rd_csv["acc_SGDC"], dtype=float)

################ 5th-degree polynomial fit ################
z1 = np.polyfit(xray, y_LR, 5)
z2 = np.polyfit(xray, y_LSVC, 5)
z3 = np.polyfit(xray, y_MLPC, 5)
z4 = np.polyfit(xray, y_SGDC, 5)

p1 = np.poly1d(z1)
p2 = np.poly1d(z2)
p3 = np.poly1d(z3)
p4 = np.poly1d(z4)

y_LR_vals = p1(xray)
y_LSVC_vals = p2(xray)
y_MLPC_vals = p3(xray)
y_SGDC_vals = p4(xray)
###########################################################

# Label the lines (a hand-placed legend)
plt.annotate("— LR", xy=(5030, 0.34), color='b', size=12)
plt.annotate("— LSVC", xy=(5030, 0.26), color='r', size=12)
plt.annotate("— MLPC", xy=(5030, 0.18), color='g', size=12)
plt.annotate("— SGDC", xy=(5030, 0.10), color='black', size=12)

# Draw the fitted curves
plt.plot(xray, y_LR_vals, color='b')
plt.plot(xray, y_LSVC_vals, color='r')
plt.plot(xray, y_MLPC_vals, color='g')
plt.plot(xray, y_SGDC_vals, color='black')

# Draw the measured points (markers only, no connecting line)
plt.plot(xray, y_LR, color='b', linestyle='None', marker='.')
plt.plot(xray, y_LSVC, color='r', linestyle='None', marker='.')
plt.plot(xray, y_MLPC, color='g', linestyle='None', marker='.')
plt.plot(xray, y_SGDC, color='black', linestyle='None', marker='.')

# Draw the y=1 reference line
plt.plot([0, 6000], [1, 1], 'k--')

# Set the y-axis range
plt.ylim(0, 1.1)

# Label the x and y axes
plt.xlabel('samples')
plt.ylabel('accuracy')

plt.show()
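The plotting script reads score_100to5800.csv, but the post does not show how that file was produced. One plausible way, sketched here purely as an assumption, is to retrain on growing subsets of the data and record the test accuracy for each size. The helper accuracy_for_size, the synthetic data and the in-memory buffer are all hypothetical, and only one model is scored to keep the example short (the real file has one accuracy column per model).

```python
import csv
import io

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def accuracy_for_size(X, y, n_samples):
    """Use the first n_samples rows, split 75/25, and return the test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X[:n_samples], y[:n_samples], test_size=0.25, random_state=33)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return model.score(X_test, y_test)


# Synthetic stand-in data; the label is encoded in feature 0
# so that there is something to learn
rng = np.random.RandomState(0)
y = rng.randint(1, 10, size=600)
X = rng.randint(0, 5, size=(600, 60)).astype(float)
X[:, 0] = y * 10

buffer = io.StringIO()                      # stands in for the score CSV
writer = csv.writer(buffer)
rows = []
for n in range(100, 601, 100):              # dataset sizes 100, 200, ..., 600
    rows.append([n, accuracy_for_size(X, y, n)])
    writer.writerow(rows[-1])
```

Each written row has the sample count first and the accuracy after it, which matches the column order the plotting script expects.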