(Notes) Introduction to Machine Learning 2.2: KNN and Naive Bayes, Precision and Recall, Tuning with GridSearchCV

Reposted article, 2021-09-15 20:14. Category: machine learning

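k-Nearest Neighbors (KNN)

Definition: if the majority of the k samples most similar to a given sample (that is, its k nearest neighbors in feature space) belong to one class, then the sample is assigned to that class as well.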
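The definition above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `knn_predict`, the Euclidean distance and the toy data are all made up here and are not part of the original notes.

```python
from collections import Counter

import numpy as np


def knn_predict(x_train, y_train, x_query, k=3):
    """Classify one query point by majority vote among its k nearest neighbors."""
    # Euclidean distance from the query to every training sample
    distances = np.linalg.norm(x_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): two clusters in 2-D
x_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(x_train, y_train, np.array([1.1, 1.0]), k=3)
pred_b = knn_predict(x_train, y_train, np.array([5.1, 5.0]), k=3)
print(pred_a, pred_b)
```

Note that there is no fitting step: all the work happens at prediction time, which is why the method is called lazy.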

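Origin: KNN is a classification algorithm first proposed by Cover and Hart.

Advantages:
Simple, easy to understand and implement; no parameters to estimate and no training phase.

Disadvantages:
It is a lazy algorithm: classifying a test sample is computationally expensive and memory-hungry.
K must be specified, and a poorly chosen K gives no guarantee on classification accuracy.

Use cases: small datasets, roughly a few thousand to tens of thousands of samples; evaluate case by case.

API: sklearn.neighbors.KNeighborsClassifier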

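KNN example: predicting check-in locations

Data source: Kaggle, check-in location prediction (facebook-v-predicting-check-ins)

x and y are the position coordinates, accuracy is the location accuracy (left as-is for now), time is the check-in timestamp, and place_id is the check-in place, which is the prediction target.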

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier  # KNN classifier
from sklearn.model_selection import train_test_split  # train/test split
from sklearn.preprocessing import StandardScaler  # standardization

data = pd.read_csv(r'C:\Users\Administrator\Downloads\機器學習代碼和資料\facebook-v-predicting-check-ins_2\train\train.csv')

# Step 1: reduce the data to what we need

# Use only a small region; the full dataset is too large to run locally
data = data.query('x > 1.25 & x < 1.5 & y > 2.5 & y < 2.75').copy()

# Optional time features (disabled in this run; the raw `time` column is used as-is)
# time_vue = pd.DatetimeIndex(pd.to_datetime(data['time'], unit='s'))  # exposes .day, .hour, .weekday
# data['day'] = time_vue.day
# data['hour'] = time_vue.hour
# data['weekday'] = time_vue.weekday
# data = data.drop(columns=['time'])

place_cnt = data.groupby('place_id').count()
tf = place_cnt[place_cnt.x > 3].reset_index()  # keep places with more than 3 check-ins, then reset the index
data = data[data['place_id'].isin(tf.place_id)]  # keep only rows whose place_id also appears in tf
data = data.drop(columns=['row_id'])  # row_id is only an identifier, not a feature

# Step 2: preprocessing

x_train, x_test, y_train, y_test = train_test_split(data.drop('place_id', axis=1), data['place_id'], test_size=.25)

std = StandardScaler()
x_train = std.fit_transform(x_train)  # fit the scaler on the training set only
x_test = std.transform(x_test)  # reuse the training-set mean and variance

# Step 3: predict with the API

knn = KNeighborsClassifier(n_neighbors=5)  # set k
knn.fit(x_train, y_train)  # "training" only stores the samples
y_predict = knn.predict(x_test)  # predict on the test set
print('Predictions:', y_predict)
print('Accuracy:', knn.score(x_test, y_test))
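The Kaggle CSV is not bundled with these notes, so the same split, standardize and classify steps can be tried on synthetic data. The sketch below is an assumption-laden stand-in: `make_blobs` with hand-picked centers replaces the check-in data, and a `Pipeline` is used so that the scaler is fitted on the training split only.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the check-in data: three well-separated "places"
x, y = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=0)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)

# The pipeline fits the scaler on the training data only, then applies it before KNN
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(x_train, y_train)
accuracy = model.score(x_test, y_test)
print("accuracy:", accuracy)
```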


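Naive Bayes

Conditional probability: the probability that event A occurs given that another event B has already occurred.

Written as: \(P(A|B) = \frac{P(AB)}{P(B)}\), and symmetrically \(P(B|A) = \frac{P(AB)}{P(A)}\)

Law of total probability: \(P(B) = \sum_i P(A_i)P(B|A_i)\), where the events \(A_i\) are mutually exclusive and cover all outcomes

Starting from \(P(A|B)=\frac{P(AB)}{P(B)}\), expand the numerator with the conditional probability \(P(AB) = P(B|A)P(A)\) and the denominator with the law of total probability to obtain

Bayes' theorem: \(P(A|B) = \frac{P(B|A)P(A)}{\sum_i P(A_i)P(B|A_i)}\)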
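A small numeric check of the formula may help. All numbers below are hypothetical and chosen only for illustration: two classes play the role of the events \(A_i\), and B is the event "the message contains the word 'free'".

```python
# Hypothetical numbers, chosen only to illustrate the formula
priors = {"spam": 0.3, "ham": 0.7}        # P(A_i)
likelihoods = {"spam": 0.8, "ham": 0.1}   # P(B | A_i)

# Law of total probability: P(B) = sum_i P(A_i) * P(B | A_i)
p_b = sum(priors[c] * likelihoods[c] for c in priors)

# Bayes' theorem: P(A_i | B) = P(B | A_i) * P(A_i) / P(B)
posteriors = {c: likelihoods[c] * priors[c] / p_b for c in priors}

print(p_b)
print(posteriors)
```

The posteriors sum to 1 because the denominator is exactly the sum of the numerators over all classes.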

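Laplace smoothing coefficient:

In the worked example these notes refer to (not reproduced here), the probability for the "entertainment" class came out as 0, which is unreasonable. If many entries in the word-frequency table have a count of 0, many of the computed results are likely to be 0 as well.

Solution: Laplace smoothing

\(P(F_1|C)=\frac{N_i+\alpha}{N+\alpha m}\)

\(\alpha\) is a chosen coefficient, usually 1, and \(m\) is the number of distinct feature words counted in the training documents. \(N_i\) is the number of times the word \(F_1\) occurs in documents of class \(C\), and \(N\) is the total number of word occurrences in class \(C\).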
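The effect of the smoothing term can be checked by hand. The helper name `smoothed_probability` and the word counts below are hypothetical, picked only to show that a zero count no longer produces a zero probability.

```python
def smoothed_probability(word_count, total_count, vocab_size, alpha=1.0):
    """Laplace-smoothed estimate (N_i + alpha) / (N + alpha * m)."""
    return (word_count + alpha) / (total_count + alpha * vocab_size)


# Hypothetical counts of m = 4 feature words within one class (N = 8 in total)
counts = [0, 3, 4, 1]
total = sum(counts)
vocab_size = len(counts)

unsmoothed = [c / total for c in counts]
smoothed = [smoothed_probability(c, total, vocab_size) for c in counts]

print(unsmoothed)
print(smoothed)
```

The smoothed values still sum to 1 over the vocabulary, since the numerators add up to \(N + \alpha m\).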

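Naive Bayes API: sklearn.naive_bayes.MultinomialNB

Advantages:
The naive Bayes model is rooted in classical mathematical theory and has stable classification performance.

It is not very sensitive to missing data, the algorithm is fairly simple, and it is commonly used for text classification.

It is fast and often accurate for classification.

Disadvantages:
It relies on an assumed model for P(F1,F2,…|C), namely that the features are conditionally independent given the class, so predictions can be poor when that assumption does not fit the data.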


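from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

news = fetch_20newsgroups(subset='all')

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(news.data, news.target, test_size=0.25)

# Extract TF-IDF features from the text
tf = TfidfVectorizer()

# Learn the vocabulary from the training set, e.g. ['a','b','c','d'], and weight each article's terms
x_train = tf.fit_transform(x_train)

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
print(tf.get_feature_names_out())

# Transform the test set with the vocabulary learned from the training set
x_test = tf.transform(x_test)

# Predict with naive Bayes
mlt = MultinomialNB(alpha=1.0)  # alpha is the Laplace smoothing coefficient

# The TF-IDF matrix is sparse; print its shape instead of calling toarray(),
# which would build a very large dense array
print(x_train.shape)

mlt.fit(x_train, y_train)

y_predict = mlt.predict(x_test)

print("Predicted article categories:", y_predict)

# Accuracy on the test set
print("Accuracy:", mlt.score(x_test, y_test))

print("Precision and recall per category:")
print(classification_report(y_test, y_predict, target_names=news.target_names))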

Precision and recall

Binary classification and some concepts:

Source: ICPC Shenyang, Problem K (the easy "sign-in" problem)

A binary classifier is an algorithm that predicts the classes of instances, which may be positive (\({+}\)) or negative (\({-}\)). A typical binary classifier consists of a scoring function \({S}\) that gives a score for every instance and a threshold \(\theta\) that determines the category. Specifically, if the score of an instance \(S(x) \geq \theta\), then the instance \({x}\) is classified as positive; otherwise, it is classified as negative. Clearly, choosing different thresholds may yield different classifiers.

Of course, a binary classifier may have misclassification: it could either classify a positive instance as negative (false negative) or classify a negative instance as positive (false positive).

Given a dataset and a classifier, we may define the true positive rate \({TPR}\) and the false positive rate \({FPR}\) as follows:
\({TPR} = \frac{\# {TP}} {\# {TP} + \# {FN}}, \quad {FPR} = \frac{\# {FP}} {\# {TN} + \# {FP}}\)

where \(\# TP\) is the number of true positives in the dataset; \(\# FP\), \(\#TN\), \(\#FN\) are defined likewise.
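As a sketch of these definitions, the snippet below counts TP, FN, FP and TN for a given threshold and returns the two rates. The function name `tpr_fpr`, the labels and the scores are made up for illustration.

```python
import numpy as np


def tpr_fpr(y_true, scores, theta):
    """Return (TPR, FPR) when instances with score >= theta are predicted positive."""
    y_true = np.asarray(y_true, dtype=bool)
    predicted_pos = np.asarray(scores) >= theta
    tp = np.sum(predicted_pos & y_true)
    fn = np.sum(~predicted_pos & y_true)
    fp = np.sum(predicted_pos & ~y_true)
    tn = np.sum(~predicted_pos & ~y_true)
    return tp / (tp + fn), fp / (fp + tn)


# Made-up labels (1 = positive) and scores
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

# Lowering the threshold raises both rates
for theta in (0.85, 0.5, 0.2):
    print(theta, tpr_fpr(y_true, scores, theta))
```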

Now you have trained a scoring function, and you want to evaluate the performance of your classifier. The classifier may exhibit different TPR and FPR if we change the threshold \(\theta\). Let \({TPR}(\theta), FPR(\theta)\) be the \({TPR, FPR}\) when the threshold is \(\theta\), and define the \({area\;under\;curve}\) (\({AUC}\)) as

\({AUC} = \int_{0}^{1} \max_{\theta \in \mathbb{R}} \{TPR(\theta)|FPR(\theta) \leq r\} \, dr\)

where the integrand, called the \({receiver\;operating\;characteristic}\) (ROC), is the maximum possible \({TPR}\) given that \(FPR \leq r\).
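For a finite dataset the integrand is a step function, so the integral can be evaluated by sweeping the threshold over the observed scores. The following is a minimal sketch on made-up data without tied scores; it cross-checks the result against scikit-learn's `roc_auc_score`, which agrees in this tie-free case.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels (1 = positive) and scores, with no ties
y_true = np.array([1, 1, 0, 1, 0, 0])
scores = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.1])

n_pos = np.sum(y_true == 1)
n_neg = np.sum(y_true == 0)

# One (FPR, TPR) point per candidate threshold; +inf gives the point (0, 0)
thresholds = np.concatenate(([np.inf], np.sort(scores)[::-1]))
points = []
for theta in thresholds:
    pred = scores >= theta
    tpr = np.sum(pred & (y_true == 1)) / n_pos
    fpr = np.sum(pred & (y_true == 0)) / n_neg
    points.append((fpr, tpr))

# Integrate r -> max{TPR(theta) : FPR(theta) <= r} over [0, 1] as a step function
fprs = sorted({f for f, _ in points})
auc = 0.0
for left, right in zip(fprs[:-1], fprs[1:]):
    best_tpr = max(t for f, t in points if f <= left)
    auc += (right - left) * best_tpr

print(auc, roc_auc_score(y_true, scores))
```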

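Precision and Recall

Precision is defined with respect to our predictions: it is the fraction of samples predicted positive that are truly positive. A positive prediction can arise in two ways: a positive sample predicted as positive (TP), or a negative sample predicted as positive (FP). That is,

\(P = \frac{TP}{TP+FP}\)

Recall is defined with respect to the original samples: it is the fraction of actual positives that are predicted correctly. Again there are two cases: an actual positive predicted as positive (TP), or an actual positive predicted as negative (FN).

\(R = \frac{TP}{TP+FN}\)

In information retrieval, precision and recall (also rendered in Chinese as 查准率 and 查全率) are defined as:

Precision = relevant items retrieved / total items retrieved
Recall = relevant items retrieved / total relevant items in the system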
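The two formulas can be verified on a small made-up prediction vector, counting TP, FP and FN by hand and comparing with scikit-learn's `precision_score` and `recall_score`. The labels below are hypothetical.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up ground truth and predictions (1 = positive)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)  # of everything predicted positive, how much is right
recall = tp / (tp + fn)     # of all actual positives, how much was found

print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
```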

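Tuning k-nearest neighbors with cross-validation and grid search

Cross-validation: makes the evaluation of a model more accurate and trustworthy.

Procedure: split the available data into training and validation sets. For example, divide the data into 5 parts and use one part as the validation set. Then run 5 rounds, using a different part as the validation set each time. This yields 5 sets of model results, whose average is taken as the final result. This is called 5-fold cross-validation.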
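The 5-fold procedure can be sketched with scikit-learn's `KFold` and `cross_val_score`. The iris dataset and the choice of `n_neighbors=5` are assumptions made here so that the example is self-contained; they are not from the original notes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

x, y = load_iris(return_X_y=True)

# 5 folds: each round trains on 4 parts and validates on the remaining part
folds = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), x, y, cv=folds)

print(scores)         # one accuracy per fold
print(scores.mean())  # the averaged result used for evaluation
```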

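Hyperparameter search with grid search: tuning KNN

Many parameters usually have to be specified by hand (such as the value of K in k-nearest neighbors); these are called hyperparameters. Tuning them manually is tedious, so several hyperparameter combinations are preset for the model, each combination is evaluated with cross-validation, and the best combination is then used to build the model.

In k-fold cross-validation, K-1 parts serve as training data and the remaining part serves as validation data.

This is repeated K times, and the K scores, computed with the scoring method chosen in advance, are averaged and returned. The parameter combination with the highest average score is then selected. This completes the cross-validation process.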
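Written out by hand, the search described above is just a loop over candidate values with a cross-validated score for each. The sketch below assumes the iris dataset and the candidate list `[3, 5, 10]` purely for illustration; it shows the idea, not how GridSearchCV is implemented internally.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

x, y = load_iris(return_X_y=True)

candidates = [3, 5, 10]  # preset hyperparameter values to try
mean_scores = {}
for k in candidates:
    # 5-fold cross-validation for this value of n_neighbors
    fold_scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), x, y, cv=5)
    mean_scores[k] = fold_scores.mean()

# Pick the value with the highest average validation score
best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best n_neighbors:", best_k)
```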

API: sklearn.model_selection.GridSearchCV (the name splits into two parts, GridSearch and CV, that is, grid search and cross-validation.)

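import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Read the data
data = pd.read_csv("./data/FBlocation/train.csv")

# print(data.head(10))

# Process the data
# 1. Shrink the data by filtering on a small region (copy so new columns can be added safely)
data = data.query("x > 1.0 &  x < 1.25 & y > 2.5 & y < 2.75").copy()

# Convert the timestamp column to datetimes
time_value = pd.to_datetime(data['time'], unit='s')

print(time_value)

# Wrap in a DatetimeIndex, which exposes .day, .hour and .weekday
time_value = pd.DatetimeIndex(time_value)

# Build some features
data['day'] = time_value.day
data['hour'] = time_value.hour
data['weekday'] = time_value.weekday

# Drop the raw timestamp feature
data = data.drop(['time'], axis=1)

print(data)

# Drop target places with 3 or fewer check-ins
place_count = data.groupby('place_id').count()

tf = place_count[place_count.row_id > 3].reset_index()

data = data[data['place_id'].isin(tf.place_id)]

# Separate the target from the features
y = data['place_id']

# row_id is only an identifier, so it is dropped together with the target
x = data.drop(['place_id', 'row_id'], axis=1)

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

# Feature engineering (standardization)
std = StandardScaler()

# Standardize the features: fit on the training set, reuse the same scaling for the test set
x_train = std.fit_transform(x_train)

x_test = std.transform(x_test)

# Run the algorithm; n_neighbors is the hyperparameter to tune
knn = KNeighborsClassifier()

# Candidate values to search over
param = {"n_neighbors": [3, 5, 10]}

# Grid search with 2-fold cross-validation
gc = GridSearchCV(knn, param_grid=param, cv=2)

gc.fit(x_train, y_train)

# Report accuracy
print("Accuracy on the test set:", gc.score(x_test, y_test))

print("Best cross-validation score:", gc.best_score_)

print("Best model:", gc.best_estimator_)

print("Cross-validation results for every hyperparameter value:", gc.cv_results_)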
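The raw `cv_results_` dictionary is hard to read when printed directly. One option, sketched here on the iris dataset as a self-contained stand-in for the check-in data, is to load it into a DataFrame and keep only a few columns; the column names used below are the ones GridSearchCV generates for a parameter called `n_neighbors`.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

x, y = load_iris(return_X_y=True)

gc = GridSearchCV(KNeighborsClassifier(),
                  param_grid={"n_neighbors": [3, 5, 10]}, cv=5)
gc.fit(x, y)

# One row per candidate value, with its mean validation score and rank
results = pd.DataFrame(gc.cv_results_)[
    ["param_n_neighbors", "mean_test_score", "rank_test_score"]]
print(results)
print(gc.best_params_)
```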
