Preface
This counts as the first small project I have implemented completely on my own (earlier homework doesn't count, since I handed it in without fully understanding it). This post walks through a simple iris classification. The language is Python 3.7 and the environment is a Jupyter notebook.
1. Dataset overview
The dataset is load_iris(), imported from sklearn. It has four feature columns: sepal length, sepal width, petal length and petal width. The target column holds the three species labels, 'setosa', 'versicolor' and 'virginica', encoded in the dataset as 0, 1 and 2. This experiment is a binary classification task, so we need to carve a suitable subset out of the data (working through this part improved my coding a lot; I feel I took a real step forward).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from pylab import mpl

iris = load_iris()
iris.feature_names, iris.target_names, iris.target, iris.data  # display all the data you want to inspect
2. A look at the data
Let's look at the actual numbers in the dataset, as shown in the figure below. The four columns are the feature_names data, i.e. the sepal and petal measurements; the target array holds the labels, the flower species, and you can see there are 50 zeros, 50 ones and 50 twos.
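If the screenshot is not available, a quick way to confirm these numbers directly in the notebook is a short check like the one below (a sketch of my own, not part of the original post; np.bincount simply counts how many samples there are per class):

import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)           # (150, 4): 150 samples, 4 feature columns
print(iris.feature_names)        # sepal length/width and petal length/width, in cm
print(np.bincount(iris.target))  # [50 50 50]: 50 samples of each species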
3. Filtering and visualizing the data
Next comes data filtering: we pick out the data needed for binary classification. Here we decide to classify labels 0 and 1, so we take the first one hundred samples and use sepal length and sepal width as the two features for prediction. The model output is roughly a = sigmoid(w1*x1 + w2*x2), where the w's are weights and the x's are features; with four features it extends the same way. A small numeric illustration follows, and then the filtering code.
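As a tiny illustration of that formula (the weights here are made up purely for demonstration; they are not the trained values):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w = np.array([1.0, -2.0])      # hypothetical weights, for illustration only
x = np.array([5.0, 3.5])       # sepal length and sepal width of one sample
prob = sigmoid(np.dot(w, x))   # model output, a probability between 0 and 1
print(prob, int(prob >= 0.5))  # thresholding at 0.5 gives the predicted class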
x = iris.data[0:100, 0:2]   # first 100 rows (classes 0 and 1), first two feature columns: sepal length and sepal width
y = iris.target[0:100]      # the corresponding first 100 labels
samples_0 = x[y == 0, :]    # samples_0: all samples whose label y == 0
samples_1 = x[y == 1, :]    # samples_1: all samples whose label y == 1
plt.scatter(samples_0[:, 0], samples_0[:, 1], marker='o', color='r')
plt.scatter(samples_1[:, 0], samples_1[:, 1], marker='o', color='b')  # scatter plot of the two classes
mpl.rcParams['font.sans-serif'] = ['SimHei']  # without this line the Chinese axis labels would render as garbled characters
plt.xlabel("花萼長度")
plt.ylabel("花萼寬度")
The resulting plot looks like this:
Next we split the data into a training set and a test set. Since the original data is sorted by class, we have to split it deliberately so that both classes appear in the training set and in the test set.
xtrain = np.vstack([x[0:40, :], x[60:100, :]])   # rows 0-39 and 60-99 of x (all of its columns, here just two) form the training features
ytrain = np.concatenate([y[0:40], y[60:100]])    # the matching labels for those rows
xtest = x[40:60, :]                              # the remaining 20 rows become the test features
ytest = y[40:60]                                 # and their labels the test labels
xtest.shape, ytest.shape
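For comparison (not part of the original post), the same kind of split can also be produced with sklearn's train_test_split; stratify=y keeps the two classes balanced between the parts:

from sklearn.model_selection import train_test_split

# Alternative split sketch; assumes x and y from the cells above
xtr, xte, ytr, yte = train_test_split(x, y, test_size=0.2, stratify=y, random_state=0)
xtr.shape, xte.shape  # (80, 2) and (20, 2)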
4. Building the model and producing the output
Next we define the logistic-regression class and its methods:
class LR():
    def __init__(self):
        self.w = None                      # w holds the weights we are going to train

    def sigmoid(self, z):
        a = 1 / (1 + np.exp(-z))           # could also be written as a lambda
        return a

    def output(self, x):                   # the output is the linear score passed through the sigmoid activation
        z = np.dot(self.w, x.T)
        a = self.sigmoid(z)
        return a

    def comloss(self, x, y):
        numtrain = x.shape[0]              # x is the data, y the labels; shape[0] is the number of rows (shape[1] would be the columns)
        a = self.output(x)
        loss = np.sum(-y * np.log(a) - (1 - y) * np.log(1 - a)) / numtrain  # mean cross-entropy loss; tweak it here if you like
        dw = np.dot((a - y), x) / numtrain # gradient of the loss w.r.t. w, used by gradient descent
        return loss, dw

    def train(self, x, y, learningrate=0.1, num_interations=10000):  # the learning rate sets the gradient-descent step size; 10000 is the iteration count
        numtrain, numfeatures = x.shape
        self.w = 0.001 * np.random.randn(1, numfeatures)
        loss = []
        for i in range(num_interations):
            error, dw = self.comloss(x, y)
            loss.append(error)
            self.w -= learningrate * dw    # weight update
            if i % 200 == 0:               # print the loss every 200 iterations
                print('steps:[%d/%d],loss: %f' % (i, num_interations, error))
        return loss

    def predict(self, x):
        a = self.output(x)
        ypred = np.where(a >= 0.5, 1, 0)
        return ypred

LR = LR()
loss = LR.train(xtrain, ytrain)
plt.plot(loss)

Output:

steps:[0/10000],loss: 0.692566
steps:[200/10000],loss: 0.237656
steps:[400/10000],loss: 0.155935
steps:[600/10000],loss: 0.121161
steps:[800/10000],loss: 0.101404
steps:[1000/10000],loss: 0.088442
steps:[1200/10000],loss: 0.079171
steps:[1400/10000],loss: 0.072148
steps:[1600/10000],loss: 0.066605
steps:[1800/10000],loss: 0.062095
steps:[2000/10000],loss: 0.058337
steps:[2200/10000],loss: 0.055146
steps:[2400/10000],loss: 0.052395
steps:[2600/10000],loss: 0.049991
steps:[2800/10000],loss: 0.047869
steps:[3000/10000],loss: 0.045979
steps:[3200/10000],loss: 0.044280
steps:[3400/10000],loss: 0.042744
steps:[3600/10000],loss: 0.041346
steps:[3800/10000],loss: 0.040068
steps:[4000/10000],loss: 0.038892
steps:[4200/10000],loss: 0.037806
steps:[4400/10000],loss: 0.036800
steps:[4600/10000],loss: 0.035864
steps:[4800/10000],loss: 0.034990
steps:[5000/10000],loss: 0.034173
steps:[5200/10000],loss: 0.033405
steps:[5400/10000],loss: 0.032683
steps:[5600/10000],loss: 0.032002
steps:[5800/10000],loss: 0.031358
steps:[6000/10000],loss: 0.030749
steps:[6200/10000],loss: 0.030170
steps:[6400/10000],loss: 0.029621
steps:[6600/10000],loss: 0.029097
steps:[6800/10000],loss: 0.028599
steps:[7000/10000],loss: 0.028122
steps:[7200/10000],loss: 0.027667
steps:[7400/10000],loss: 0.027231
steps:[7600/10000],loss: 0.026813
steps:[7800/10000],loss: 0.026413
steps:[8000/10000],loss: 0.026028
steps:[8200/10000],loss: 0.025657
steps:[8400/10000],loss: 0.025301
steps:[8600/10000],loss: 0.024958
steps:[8800/10000],loss: 0.024627
steps:[9000/10000],loss: 0.024308
steps:[9200/10000],loss: 0.023999
steps:[9400/10000],loss: 0.023701
steps:[9600/10000],loss: 0.023413
steps:[9800/10000],loss: 0.023134
The loss curve:
You can see that as gradient descent keeps iterating, the loss keeps decreasing.
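If you want to convince yourself that the dw returned by comloss really is the gradient of this loss, a quick finite-difference check is one way to do it (my own addition, not in the original post; it assumes the trained LR object, xtrain and ytrain from above, and the two-feature weight shape (1, 2)):

eps = 1e-6
w0 = LR.w.copy()
_, dw = LR.comloss(xtrain, ytrain)            # analytic gradient at the trained weights
LR.w = w0 + np.array([[eps, 0.0]])            # nudge the first weight up
loss_plus, _ = LR.comloss(xtrain, ytrain)
LR.w = w0 - np.array([[eps, 0.0]])            # nudge the first weight down
loss_minus, _ = LR.comloss(xtrain, ytrain)
LR.w = w0                                     # restore the trained weights
print(dw[0][0], (loss_plus - loss_minus) / (2 * eps))  # the two numbers should agree closely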
After training, the weights have been learned, so next we visualize the result. Since the model has no bias term, the decision boundary is the line where w1*x1 + w2*x2 = 0, i.e. x2 = -w1*x1 / w2, which is exactly what the code below plots:
plt.scatter(samples_0[:, 0], samples_0[:, 1], marker='o', color='r')
plt.scatter(samples_1[:, 0], samples_1[:, 1], marker='o', color='b')
mpl.rcParams['font.sans-serif'] = ['SimHei']
plt.xlabel("花萼長度")
plt.ylabel("花萼寬度")
x1 = np.arange(4, 7.5, 0.05)            # range chosen to match the data so the plot has little blank space
x2 = (-LR.w[0][0] * x1) / LR.w[0][1]    # decision boundary: solve w1*x1 + w2*x2 = 0 for x2
plt.plot(x1, x2, '-', color='black')
The output plot:
To make things clearer for the reader, here are a few more lines that evaluate the model:
numtest = xtest.shape[0]
prediction = LR.predict(xtest)
acc = np.sum(prediction == ytest) / numtest
print("准確率", acc)
LR.w  # the final trained weights
Output:
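As a cross-check (my own addition, not in the original post), the same accuracy can be computed with sklearn's metrics; ravel() flattens the (1, 20) prediction array so the shapes match:

from sklearn.metrics import accuracy_score, confusion_matrix

print(accuracy_score(ytest, prediction.ravel()))    # should match the manually computed accuracy
print(confusion_matrix(ytest, prediction.ravel()))  # rows are true classes, columns are predicted classes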
That's it: the predictions, the loss, the trained weights and the accuracy are all there. The experiment itself is simple, but it surfaces a lot of small details, so it is well worth working through for beginners.
Here is the complete code (everything above already is the complete code, but to save copy-paste effort for students who just want to grab it, here it is in one place; I've been there too):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from pylab import mpl

iris = load_iris()
iris.feature_names, iris.target_names

x = iris.data[0:100, 0:2]
y = iris.target[0:100]
samples_0 = x[y == 0, :]   # samples_0: all samples whose label y == 0
samples_1 = x[y == 1, :]   # samples_1: all samples whose label y == 1
plt.scatter(samples_0[:, 0], samples_0[:, 1], marker='o', color='r')
plt.scatter(samples_1[:, 0], samples_1[:, 1], marker='o', color='b')
mpl.rcParams['font.sans-serif'] = ['SimHei']
plt.xlabel("花萼長度")
plt.ylabel("花萼寬度")

# Train/test split
xtrain = np.vstack([x[0:40, :], x[60:100, :]])   # x holds the two feature columns
ytrain = np.concatenate([y[0:40], y[60:100]])    # y holds the labels
xtest = x[40:60, :]
ytest = y[40:60]
xtest.shape, ytest.shape

class LR():
    def __init__(self):
        self.w = None

    def sigmoid(self, z):
        a = 1 / (1 + np.exp(-z))
        return a

    def output(self, x):
        z = np.dot(self.w, x.T)
        a = self.sigmoid(z)
        return a

    def comloss(self, x, y):
        numtrain = x.shape[0]              # x is the data, y the labels; shape[0] is the number of rows
        a = self.output(x)
        loss = np.sum(-y * np.log(a) - (1 - y) * np.log(1 - a)) / numtrain
        dw = np.dot((a - y), x) / numtrain
        return loss, dw

    def train(self, x, y, learningrate=0.1, num_interations=10000):
        numtrain, numfeatures = x.shape
        self.w = 0.001 * np.random.randn(1, numfeatures)
        loss = []
        for i in range(num_interations):
            error, dw = self.comloss(x, y)
            loss.append(error)
            self.w -= learningrate * dw    # weight update
            if i % 200 == 0:
                print('steps:[%d/%d],loss: %f' % (i, num_interations, error))
        return loss

    def predict(self, x):
        a = self.output(x)
        ypred = np.where(a >= 0.5, 1, 0)
        return ypred

LR = LR()
loss = LR.train(xtrain, ytrain)
plt.plot(loss)

plt.scatter(samples_0[:, 0], samples_0[:, 1], marker='o', color='r')
plt.scatter(samples_1[:, 0], samples_1[:, 1], marker='o', color='b')
mpl.rcParams['font.sans-serif'] = ['SimHei']
plt.xlabel("花萼長度")
plt.ylabel("花萼寬度")
x1 = np.arange(4, 7.5, 0.05)
x2 = (-LR.w[0][0] * x1) / LR.w[0][1]
plt.plot(x1, x2, '-', color='black')

numtest = xtest.shape[0]
prediction = LR.predict(xtest)
acc = np.sum(prediction == ytest) / numtest
print("准確率", acc)
LR.w  # the final trained weights
I recommend reproducing it from top to bottom, in order; that's the best way to understand it.
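Once you have reproduced it, one optional sanity check (again my own addition) is to fit sklearn's built-in LogisticRegression on the same split. Its accuracy should be comparable, though its weights won't match ours exactly, because it also learns an intercept and applies regularization by default:

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(xtrain, ytrain)
print(clf.score(xtest, ytest))     # accuracy on the same 20-sample test set
print(clf.coef_, clf.intercept_)   # sklearn's weights and intercept, for comparison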