Python Machine Learning (Basics: Supervised Learning, Linear Classifiers)

Reposted article; originally published 2019-03-13 21:59.

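Classic Supervised Learning Models

The task of a supervised learning model is to predict the target (label) of unseen samples from existing experience, that is, from data whose labels are already known. Depending on the type of the target variable, supervised learning tasks fall broadly into two groups: classification and regression. The basic workflow is as follows. First, prepare the training data, which may be text, images, audio, and so on. Next, extract the required features to form feature vectors, and feed these vectors together with their labels/targets into a learning algorithm to train a predictive model. Then apply the same feature extraction to new test data to obtain the test feature vectors. Finally, use the trained model to predict on those test feature vectors and obtain the results.

1. Classification

The most basic case is binary classification: a yes/no decision in which one of two classes is chosen as the prediction. In multiclass classification, one class is chosen from more than two. In multi-label classification, the model decides whether a sample belongs to several different classes at the same time.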
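To make the three settings concrete, here is a minimal sketch of how their labels are often laid out. The class names and the helper `to_indicator_rows` are hypothetical and exist only for illustration; they are not part of the original article.

```python
# Illustrative label layouts for the three classification settings.
# All class names and the helper below are hypothetical examples.

# Binary: each sample gets exactly one of two classes.
binary_labels = [0, 1, 1, 0]

# Multiclass: each sample gets exactly one of more than two classes.
multiclass_labels = ["cat", "dog", "bird", "dog"]

# Multi-label: each sample may belong to several classes at once
# (or to none of them, as in the third sample).
multilabel_tags = [{"news", "sports"}, {"tech"}, set(), {"news", "tech"}]


def to_indicator_rows(tag_sets, classes):
    """Turn per-sample tag sets into 0/1 rows, one column per class."""
    return [[1 if c in tags else 0 for c in classes] for tags in tag_sets]


classes = ["news", "sports", "tech"]
indicator = to_indicator_rows(multilabel_tags, classes)
for row in indicator:
    print(row)
```

In the multi-label case each column of `indicator` can be read as its own yes/no question, which is why the problem is often treated as several binary decisions made side by side.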

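1.1 Linear Classifiers

Model overview: a linear classifier is a model that assumes a linear relationship between the features and the classification result. It makes its class decision by summing the products of each feature dimension and its corresponding weight.

Let x = <x1, x2, ..., xn> denote the n-dimensional feature column vector, and let the n-dimensional column vector w = <w1, w2, ..., wn> denote the corresponding weights. So that the decision boundary is not forced to pass through the origin, we also introduce an intercept b. The linear relationship can then be written as:

f(w, x, b) = w^T x + b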
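As a quick check of the formula, the following sketch evaluates f(w, x, b) in plain Python. The weights, feature values, and intercept are made-up numbers chosen for illustration, and `linear_score` is a hypothetical helper name.

```python
# Minimal sketch of f(w, x, b) = w^T x + b in plain Python.
# The weights, features and intercept are made-up illustrative values.

def linear_score(w, x, b):
    """Return the sum of w_i * x_i over all dimensions, plus the intercept b."""
    if len(w) != len(x):
        raise ValueError("w and x must have the same number of dimensions")
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + b


w = [0.5, -1.0, 2.0]   # one weight per feature dimension (hypothetical)
x = [2.0, 1.0, 0.5]    # feature vector of a single sample (hypothetical)
b = -0.5               # intercept

score = linear_score(w, x, b)
print(score)  # 0.5*2.0 - 1.0*1.0 + 2.0*0.5 - 0.5 = 0.5
```

The result is an unbounded real number, so on its own it cannot be read as a class label.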

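In the simple binary classification problem considered here, we want the prediction to take a value in {0, 1}, so we need a function that maps the original f ∈ R into the interval (0, 1). The logistic function does this:

g(z) = 1 / (1 + e^(-z))

Substituting f for z gives the logistic regression model:

h_{w,b}(x) = g(f(w, x, b)) = 1 / (1 + e^(-f)) = 1 / (1 + e^(-(w^T x + b)))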
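The model above can be sketched in a few lines of standard-library Python. The function names are hypothetical, and the 0.5 decision threshold is an assumption made here for illustration: the text only states that the output lies in (0, 1), not where to cut it.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z use the algebraically equivalent form e^z / (1 + e^z),
    # so that math.exp never receives a large positive argument.
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(w, x, b):
    """h_{w,b}(x) = g(w^T x + b): a value strictly between 0 and 1."""
    f = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return sigmoid(f)


def predict_label(w, x, b, threshold=0.5):
    """Turn the value in (0, 1) into a class in {0, 1}.

    The threshold of 0.5 is an assumption, not something the article fixes.
    """
    return 1 if predict_proba(w, x, b) >= threshold else 0


# Made-up parameters and sample, for illustration only.
w = [0.5, -1.0, 2.0]
x = [2.0, 1.0, 0.5]
b = -0.5

print(predict_proba(w, x, b))
print(predict_label(w, x, b))
```

Because g(0) = 0.5, thresholding at 0.5 is the same as asking whether w^T x + b is non-negative, which is why the resulting classifier is still linear in x.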

Example 1: Benign/Malignant Breast Cancer Tumor Prediction with a Logistic Regression Classifier

Data description:

Number of Instances: 699 (as of 15 July 1992)

Number of Attributes: 10 plus the class attribute
Attribute Information: (class attribute has been moved to last column)

   #  Attribute                     Domain

   -- -----------------------------------------

   1. Sample code number            id number

   2. Clump Thickness               1 - 10

   3. Uniformity of Cell Size       1 - 10

   4. Uniformity of Cell Shape      1 - 10

   5. Marginal Adhesion             1 - 10

   6. Single Epithelial Cell Size   1 - 10

   7. Bare Nuclei                   1 - 10

   8. Bland Chromatin               1 - 10

   9. Normal Nucleoli               1 - 10

  10. Mitoses                       1 - 10

  11. Class:                        (2 for benign, 4 for malignant)

Missing attribute values: 16

   There are 16 instances in Groups 1 to 6 that contain a single missing

   (i.e., unavailable) attribute value, now denoted by "?".

Class distribution:

   Benign: 458 (65.5%)

   Malignant: 241 (34.5%)

# Step 1: Preprocess the benign/malignant breast cancer tumor data

# Import pandas and numpy

import pandas as pd

import numpy as np

# Create the list of column names

column_names=['Sample code number','Clump Thickness','Uniformity of Cell Size',

              'Uniformity of Cell Shape','Marginal Adhesion',

              'Single Epithelial Cell Size','Bare Nuclei','Bland Chromatin',

              'Normal Nucleoli','Mitoses','Class']

# Use pandas.read_csv to read the data from the internet

data=pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data',names=column_names)

# print(data)#[699 rows x 11 columns]

# print(data[:5])

#Sample code number Clump Thickness Uniformity of Cell Size \

#0 1000025 5 1

#1 1002945 5 4

#2 1015425 3 1

#3 1016277 6 8

#4 1017023 4 1

#Uniformity of Cell Shape Marginal Adhesion Single Epithelial Cell Size \

#0 1 1 2

#1 4 5 7

#2 1 1 2

#3 8 1 3

#4 1 3 2

#Bare Nuclei Bland Chromatin Normal Nucleoli Mitoses Class

#0 1 3 1 1 2

#1 10 3 2 1 2

#2 2 3 1 1 2

#3 4 3 7 1 2

#4 1 3 1 1 2

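# Replace the '?' placeholder with the standard missing-value marker NaN

data=data.replace(to_replace='?',value=np.nan)

# Drop every row that has a missing value in any column

data=data.dropna(how='any')

print(data.shape)#(683, 11)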

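# Step 2: Prepare the training and test data

# Use train_test_split to split the data; it lives in sklearn.model_selection (the old sklearn.cross_validation module has been removed from scikit-learn)

from sklearn.model_selection import train_test_split

# Randomly sample 25% of the data for testing; the remaining 75% forms the training set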

X_train,X_test,y_train,y_test=train_test_split(data[column_names[1:10]],data[column_names[10]],test_size=0.25,random_state=33)

# Check the number of training samples and their class distribution

print(y_train.value_counts())

# 2 344

# 4 168

# Name: Class, dtype: int64

print(y_test.value_counts())

# 2 100

# 4 71

# Name: Class, dtype: int64

# Step 3: Use linear classification models to predict benign/malignant tumors

# Import StandardScaler from sklearn.preprocessing

from sklearn.preprocessing import StandardScaler

# Import LogisticRegression and SGDClassifier from sklearn.linear_model

from sklearn.linear_model import LogisticRegression

from sklearn.linear_model import SGDClassifier

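# Standardize the data so that every feature dimension has mean 0 and variance 1; this keeps features with large values from dominating the prediction

ss=StandardScaler()

X_train=ss.fit_transform(X_train)

# Reuse the mean and variance learned from the training set; refitting on the test set would scale the two sets inconsistently

X_test=ss.transform(X_test)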

# Initialize LogisticRegression and SGDClassifier

lr=LogisticRegression()

sgdc=SGDClassifier()

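# Call LogisticRegression's fit method to train the model parameters

lr.fit(X_train,y_train)

# Use the trained model lr to predict on X_test; the predictions are stored in lr_y_predict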

lr_y_predict=lr.predict(X_test)

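# Call SGDClassifier's fit method to train the model parameters

sgdc.fit(X_train,y_train)

# Use the trained model sgdc to predict on X_test; the predictions are stored in sgdc_y_predict

sgdc_y_predict=sgdc.predict(X_test)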

# Step 4: Evaluate the performance of the linear classifiers on the tumor prediction task

# Import classification_report from sklearn.metrics

from sklearn.metrics import classification_report

# Use the logistic regression model's built-in score method to get its accuracy on the test set

print('Accuracy of LR Classifier:',lr.score(X_test,y_test))

# Use classification_report to get the other three metrics (precision, recall, F1-score) for LogisticRegression

print(classification_report(y_test,lr_y_predict,target_names=['Benign','Malignant']))

# Use the stochastic gradient descent model's built-in score method to get its accuracy on the test set

print('Accuracy of SGD Classifier:',sgdc.score(X_test,y_test))

# Use classification_report to get the other three metrics (precision, recall, F1-score) for SGDClassifier

print(classification_report(y_test,sgdc_y_predict,target_names=['Benign','Malignant']))
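Besides accuracy, the report printed above lists precision, recall, and F1-score for each class. As a rough illustration of what those numbers mean, the sketch below computes them by hand for a single class. The label lists are made up and are not output from the models above; `precision_recall_f1` is a hypothetical helper, and treating malignant (4) as the positive class is a choice made for this example.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    # True positives: positive samples predicted as positive.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: other samples wrongly predicted as positive.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: positive samples the model missed.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Made-up labels using the dataset's coding: 2 = benign, 4 = malignant.
y_true = [2, 2, 4, 4, 4, 2, 4, 2]
y_pred = [2, 4, 4, 4, 2, 2, 4, 2]

precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive=4)
print(precision, recall, f1)
```

For a tumor screening task, recall on the malignant class is usually the figure to watch most closely, since a false negative means a malignant tumor was labeled benign.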
