Binary Classification of Iris Flowers with Python


Preface

This is the first small project I have taken from start to finish on my own (earlier coursework doesn't count, since I handed it in without fully understanding it). This post walks through a simple binary classification of the iris dataset, implemented in Python 3.7 inside a Jupyter notebook.

1. Dataset Overview

The dataset is load_iris() from the sklearn library. The data has four columns: sepal length, sepal width, petal length, and petal width. The label (target) column contains the three species 'setosa', 'versicolor', and 'virginica', encoded in the dataset as 0, 1, and 2. Since this experiment is a binary classification task, we need to carve out a suitable subset of the data (working through this part of the code improved my skills considerably and felt like a real step forward).

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from pylab import mpl

iris = load_iris()
iris.feature_names, iris.target_names, iris.target, iris.data  # display all the data you want to inspect

2. A Look at the Data

Now let's look at the actual numbers in the dataset. The four columns are the feature_names data, i.e., the sepal and petal measurements, and the target array holds the labels, i.e., the flower species: 50 zeros, 50 ones, and 50 twos.
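
The original post showed a screenshot of the arrays at this point; to print the same numbers yourself, a small sketch using pandas (an extra dependency, not used elsewhere in this post) does the job:

import pandas as pd

df = pd.DataFrame(iris.data, columns=iris.feature_names)  # the four feature columns
df['target'] = iris.target                                # append the label column
print(df.head())                     # first rows: measurements plus label
print(df['target'].value_counts())   # confirms 50 samples each for classes 0, 1 and 2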

3. Filtering and Visualizing the Data

Next comes data filtering: we pick out the data the binary task needs. Here we classify classes 0 and 1, so we take the first one hundred samples and use sepal length and sepal width as the two features. With two features the prediction is roughly a = sigmoid(w1*x1 + w2*x2), where the w are weights and the x are features; more features extend the same pattern.

x = iris.data[0:100,0:2]  # rows 0-99, first two columns: the sepal length and width of the class-0 and class-1 samples
y = iris.target[0:100]    # the matching first 100 labels
samples_0 = x[y == 0,:]  # samples_0 holds the samples whose label y == 0
samples_1 = x[y == 1,:]  # samples_1 holds the samples whose label y == 1
plt.scatter(samples_0[:,0],samples_0[:,1],marker ='o',color = 'r')
plt.scatter(samples_1[:,0],samples_1[:,1],marker ='o',color = 'b')  # draw the scatter plot
mpl.rcParams['font.sans-serif'] = ['SimHei']  # without this line the Chinese axis labels would render as garbled boxes
plt.xlabel("花萼長度")  # sepal length
plt.ylabel("花萼寬度")  # sepal width

The resulting figure is a scatter plot with class 0 in red and class 1 in blue.

Next we split the data into training and test sets. Because the original data is neatly sorted by class, we deliberately draw rows from both classes for each split rather than cutting it in one piece:

xtrain = np.vstack([x[0:40,:],x[60:100,:]]) # stack rows 0-39 and 60-99 of x (all columns, i.e., both features)
ytrain = np.concatenate([y[0:40],y[60:100]]) # pick out the labels matching those training rows
xtest = x[40:60,:]  # the remaining rows become the test set
ytest = y[40:60]   # and their labels become the test labels
 
xtest.shape,ytest.shape  # should display ((20, 2), (20,))
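
As an aside, sklearn's train_test_split can shuffle and split in one call; a sketch of a roughly equivalent split (the variable names with a 2 suffix are mine, and stratify=y keeps the 0/1 class balance in both parts):

from sklearn.model_selection import train_test_split

# shuffle, then hold out 20% of the 100 samples as a test set
xtrain2, xtest2, ytrain2, ytest2 = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=0)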

4. Designing the Model and Producing Output

Next we define the logistic-regression class LR and its methods:

class LR():
    def __init__(self):
        self.w = None     # w holds the weight vector we are going to train
        
    def sigmoid(self,z):
        a = 1/(1+ np.exp(-z))  # could also be written as a lambda
        return a 
    
    def output(self,x):       # model output: linear combination of the features passed through the sigmoid
        z = np.dot(self.w,x.T)
        a = self.sigmoid(z)
        return a 
    
    def comloss(self,x,y):
        numtrain = x.shape[0] # x is the data, y the labels; shape[0] is the row count (shape[1] would be the column count)
        a = self.output(x)
        loss = np.sum(-y * np.log(a) - (1-y)*np.log(1-a))/numtrain  # binary cross-entropy averaged over the samples
        dw = np.dot((a-y),x) / numtrain  # gradient of the loss with respect to w, used by gradient descent
        
        return loss,dw
        
    def train(self,x,y,learningrate = 0.1,num_iterations = 10000 ):   # the learning rate directly sets the gradient-descent step size; 10000 is the iteration count
        numtrain,numfeatures = x.shape
        self.w = 0.001 * np.random.randn(1,numfeatures)
        loss = []
        
        for i in range(num_iterations):
            error,dw = self.comloss(x,y)
            loss.append(error)
            self.w -= learningrate * dw # update the weights
            
            if i % 200 == 0:        # print the loss every 200 iterations
                print('steps:[%d/%d],loss: %f' % (i,num_iterations,error))
                
        return loss
    
    def predict(self,x):
        a = self.output(x)
        ypred = np.where(a >= 0.5,1,0)  # threshold the sigmoid output at 0.5
        return ypred
    
LR = LR()  # note: this rebinds the name LR from the class to its instance
loss = LR.train(xtrain,ytrain)
plt.plot(loss)
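
One caveat: the weights are initialized with np.random.randn, so the exact loss values below come from one particular run and will differ slightly between runs. Fixing NumPy's seed beforehand makes them reproducible:

np.random.seed(0)  # optional: makes the weight initialization, and hence the printed losses, repeatable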


Output:
steps:[0/10000],loss: 0.692566
steps:[200/10000],loss: 0.237656
steps:[400/10000],loss: 0.155935
steps:[600/10000],loss: 0.121161
steps:[800/10000],loss: 0.101404
steps:[1000/10000],loss: 0.088442
steps:[1200/10000],loss: 0.079171
steps:[1400/10000],loss: 0.072148
steps:[1600/10000],loss: 0.066605
steps:[1800/10000],loss: 0.062095
steps:[2000/10000],loss: 0.058337
steps:[2200/10000],loss: 0.055146
steps:[2400/10000],loss: 0.052395
steps:[2600/10000],loss: 0.049991
steps:[2800/10000],loss: 0.047869
steps:[3000/10000],loss: 0.045979
steps:[3200/10000],loss: 0.044280
steps:[3400/10000],loss: 0.042744
steps:[3600/10000],loss: 0.041346
steps:[3800/10000],loss: 0.040068
steps:[4000/10000],loss: 0.038892
steps:[4200/10000],loss: 0.037806
steps:[4400/10000],loss: 0.036800
steps:[4600/10000],loss: 0.035864
steps:[4800/10000],loss: 0.034990
steps:[5000/10000],loss: 0.034173
steps:[5200/10000],loss: 0.033405
steps:[5400/10000],loss: 0.032683
steps:[5600/10000],loss: 0.032002
steps:[5800/10000],loss: 0.031358
steps:[6000/10000],loss: 0.030749
steps:[6200/10000],loss: 0.030170
steps:[6400/10000],loss: 0.029621
steps:[6600/10000],loss: 0.029097
steps:[6800/10000],loss: 0.028599
steps:[7000/10000],loss: 0.028122
steps:[7200/10000],loss: 0.027667
steps:[7400/10000],loss: 0.027231
steps:[7600/10000],loss: 0.026813
steps:[7800/10000],loss: 0.026413
steps:[8000/10000],loss: 0.026028
steps:[8200/10000],loss: 0.025657
steps:[8400/10000],loss: 0.025301
steps:[8600/10000],loss: 0.024958
steps:[8800/10000],loss: 0.024627
steps:[9000/10000],loss: 0.024308
steps:[9200/10000],loss: 0.023999
steps:[9400/10000],loss: 0.023701
steps:[9600/10000],loss: 0.023413
steps:[9800/10000],loss: 0.023134

The loss curve:

As the curve shows, the loss keeps decreasing as the gradient-descent iterations proceed.
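
For reference, the dw line in comloss is the standard cross-entropy gradient. With $a_i = \sigma(w \cdot x_i)$, differentiating the averaged loss gives

$$L = -\frac{1}{n}\sum_{i=1}^{n}\bigl[y_i\log a_i + (1-y_i)\log(1-a_i)\bigr], \qquad \frac{\partial L}{\partial w} = \frac{1}{n}\sum_{i=1}^{n}(a_i - y_i)\,x_i,$$

which is exactly dw = np.dot((a-y), x) / numtrain in vectorized form.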

Now that training has produced the weights, let's visualize the result:

plt.scatter(samples_0[:,0],samples_0[:,1],marker ='o',color = 'r')
plt.scatter(samples_1[:,0],samples_1[:,1],marker ='o',color = 'b')
mpl.rcParams['font.sans-serif'] = ['SimHei']
plt.xlabel("花萼長度")  # sepal length
plt.ylabel("花萼寬度")  # sepal width
x1 = np.arange(4,7.5,0.05)   # range chosen to fit the data so the plot has little empty space
x2 = (-LR.w[0][0]* x1)/LR.w[0][1]  # decision boundary: w0*x1 + w1*x2 = 0 solved for x2
plt.plot(x1,x2,'-',color = 'black')
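
The plotted line is the model's decision boundary. Because this LR class has no bias term, the boundary is the set of points where w0*x1 + w1*x2 = 0, which rearranges to x2 = -w0*x1/w1; samples whose sigmoid output lands above 0.5 fall on one side of this line and are predicted as class 1, the rest as class 0.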

The resulting figure:

 

To make this easier to follow, here are a few more lines that evaluate the model:

numtest = xtest.shape[0]
prediction = LR.predict(xtest)
acc = np.sum(prediction == ytest)/numtest  # fraction of the 20 test samples predicted correctly
print("Accuracy",acc)
LR.w  # display the final trained weights


Output:
Accuracy 0.95
array([[  6.4110022 , -11.06167206]])

And that's it: the predictions, the loss, the trained weights, and the accuracy are all in place. The experiment is simple, but it surfaces a lot of small details, which makes it a good exercise for beginners.
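
As a sanity check (not part of the original experiment), the same split can be fed to sklearn's built-in LogisticRegression; a minimal sketch, reusing the xtrain/xtest variables defined above:

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()   # note: unlike the LR class above, this also fits a bias term
clf.fit(xtrain, ytrain)      # train on the 80 hand-picked training samples
print("sklearn accuracy:", clf.score(xtest, ytest))  # mean accuracy on the 20 test samples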

Here is the complete code (everything above already is the complete code, but collecting it in one place saves readers the trouble of stitching it together; I went through that myself, haha):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from pylab import mpl

iris = load_iris()
iris.feature_names, iris.target_names
x = iris.data[0:100,0:2]
y = iris.target[0:100]
samples_0 = x[y == 0,:]  # samples_0 holds the samples whose label y == 0
samples_1 = x[y == 1,:]  # samples_1 holds the samples whose label y == 1
plt.scatter(samples_0[:,0],samples_0[:,1],marker ='o',color = 'r')
plt.scatter(samples_1[:,0],samples_1[:,1],marker ='o',color = 'b')
mpl.rcParams['font.sans-serif'] = ['SimHei']
plt.xlabel("花萼長度")
plt.ylabel("花萼寬度")
# split into training and test sets
xtrain = np.vstack([x[0:40,:],x[60:100,:]]) # x holds the two feature columns
ytrain = np.concatenate([y[0:40],y[60:100]]) # y holds the corresponding labels
xtest = x[40:60,:]
ytest = y[40:60]

xtest.shape,ytest.shape





class LR():
    def __init__(self):
        self.w = None
        
    def sigmoid(self,z):
        a = 1/(1+ np.exp(-z))
        return a 
    
    def output(self,x):
        z = np.dot(self.w,x.T)
        a = self.sigmoid(z)
        return a 
    
    def comloss(self,x,y):
        numtrain = x.shape[0] # x is the data, y the labels; shape[0] is the row count, shape[1] the column count
        a = self.output(x)
        loss = np.sum(-y * np.log(a) - (1-y)*np.log(1-a))/numtrain
        dw = np.dot((a-y),x) / numtrain
        
        return loss,dw
        
    def train(self,x,y,learningrate = 0.1,num_iterations = 10000 ):
        numtrain,numfeatures = x.shape
        self.w = 0.001 * np.random.randn(1,numfeatures)
        loss = []
        
        for i in range(num_iterations):
            error,dw = self.comloss(x,y)
            loss.append(error)
            self.w -= learningrate * dw # update the weights
            
            if i % 200 == 0:  # print the loss every 200 iterations
                print('steps:[%d/%d],loss: %f' % (i,num_iterations,error))
                
        return loss
    
    def predict(self,x):
        a = self.output(x)
        ypred = np.where(a >= 0.5,1,0)
        return ypred
    
LR = LR()  # rebinds the name LR from the class to its instance
loss = LR.train(xtrain,ytrain)
plt.plot(loss)   




plt.scatter(samples_0[:,0],samples_0[:,1],marker ='o',color = 'r')
plt.scatter(samples_1[:,0],samples_1[:,1],marker ='o',color = 'b')
mpl.rcParams['font.sans-serif'] = ['SimHei']
plt.xlabel("花萼長度")
plt.ylabel("花萼寬度")
x1 = np.arange(4,7.5,0.05)   # range chosen to fit the data so the plot has little empty space
x2 = (-LR.w[0][0]* x1)/LR.w[0][1]  # decision boundary: w0*x1 + w1*x2 = 0 solved for x2
plt.plot(x1,x2,'-',color = 'black')



numtest = xtest.shape[0]
prediction = LR.predict(xtest)
acc = np.sum(prediction == ytest)/numtest
print("准確率",acc)
LR.w  # 輸出最終預測的權重值

I recommend reproducing it from top to bottom in order to fully understand it.

 

