What is AdaBoost?
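Boosting (rendered in Chinese as 增強學習 or 提升法) is an important ensemble learning technique. It turns weak learners, whose predictive accuracy is only slightly better than random guessing, into a strong learner with high accuracy. When building a strong learner directly is very difficult, this gives the design of learning algorithms an effective alternative approach. As a meta-algorithm framework, Boosting can be applied to almost all currently popular machine learning algorithms to further improve their accuracy, so it is widely used and has been highly influential. AdaBoost is its most successful representative and was named one of the top ten data mining algorithms. In the ten-plus years since AdaBoost was proposed, many well-known machine learning researchers have worked on its theory, and that solid theory has laid the foundation for its successful application. AdaBoost's success lies not only in being an effective learning algorithm, but also in the following:
1) It turned Boosting from an initial conjecture into an algorithm of real practical value;
2) Some of the techniques it uses, such as altering the original sample distribution, have given important insights to the design of other statistical learning algorithms;
3) The related theoretical results have greatly advanced the development of ensemble learning.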
Algorithm description
Algorithm flow
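The algorithm is essentially a process of boosting simple weak classifiers: through repeated training, it improves the ability to classify the data. The overall process is as follows:
1. Learn a first weak classifier from N training samples;
2. Combine the samples it misclassified with other new data to form a new training set of N samples, and learn a second weak classifier from it;
3. Combine the samples misclassified by both 1 and 2 with other new samples to form another training set of N samples, and learn a third weak classifier from it;
4. Combine them into the final boosted strong classifier: the class assigned to a sample is decided by a weighted vote of the individual classifiers.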
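The weighted vote in the last step can be sketched roughly as follows. This is a minimal illustration, not the implementation quoted later in this post: the function name `weighted_vote` and its arguments are hypothetical, and it assumes labels in {-1, +1} and that each weak classifier already has a weight (its alpha).

```python
import numpy as np

def weighted_vote(alphas, weak_predictions):
    """Combine weak-classifier outputs by a weighted vote.

    alphas:            one weight per weak classifier, shape (n_clf,)
    weak_predictions:  outputs in {-1, +1}, shape (n_clf, n_samples)
    Both names are hypothetical helpers for this sketch.
    """
    alphas = np.asarray(alphas, dtype=float)
    preds = np.asarray(weak_predictions, dtype=float)
    # Each sample's score is the alpha-weighted sum of the votes;
    # the sign of that score is the final class.
    return np.sign(alphas @ preds)

# Three weak classifiers voting on two samples.
alphas = [0.4, 0.6, 0.7]
votes = [[1, -1],
         [-1, -1],
         [1, 1]]
print(weighted_vote(alphas, votes))
```

A classifier with a larger weight can outvote several weaker ones, which is why the second sample is labelled -1 only because the two lighter classifiers together outweigh the third.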
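As the description shows, the algorithm initializes the sample weights according to the size of the training set so that they follow a uniform distribution, and in later iterations it updates and normalizes the weights by formula. A misclassified sample has its weight increased and a correctly classified one has its weight decreased, so the misclassified training samples carry more weight. The next round of training therefore focuses on the samples that are hard to classify, and the next weak classifier is learned with more emphasis on the previously misclassified samples, until the samples are correctly classified [36]. The strong classifier is complete once the specified number of iterations or the target error rate is reached.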
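One round of this weight update can be sketched as below. It is a minimal sketch under stated assumptions: labels and predictions are in {-1, +1}, the weighted error lies strictly between 0 and 1, and the classifier coefficient uses the common form alpha = 0.5 * ln((1 - err) / err). The function name `update_weights` is hypothetical.

```python
import numpy as np

def update_weights(w, y, pred):
    """Return (error, alpha, new_weights) for one boosting round.

    w:    current sample weights, summing to 1
    y:    true labels in {-1, +1}
    pred: weak-classifier predictions in {-1, +1}
    Assumes 0 < error < 1 so that the logarithm is defined.
    """
    w = np.asarray(w, dtype=float)
    y = np.asarray(y)
    pred = np.asarray(pred)
    # Weighted error: total weight of the misclassified samples.
    error = np.sum(w[y != pred])
    alpha = 0.5 * np.log((1.0 - error) / error)
    # y * pred is +1 when correct and -1 when wrong, so correct samples
    # are scaled by exp(-alpha) and wrong ones by exp(+alpha).
    new_w = w * np.exp(-alpha * y * pred)
    # Normalize so the weights again sum to 1.
    new_w /= new_w.sum()
    return error, alpha, new_w

w0 = np.full(4, 0.25)  # uniform initial weights, 1/N
error, alpha, w1 = update_weights(w0, [1, 1, -1, -1], [1, 1, -1, 1])
print(error, alpha, w1)
```

With this choice of alpha, the misclassified samples together hold half of the total weight after normalization, which is what forces the next weak classifier to pay attention to them.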
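A concrete example:

There are ten samples, each with initial weight 1/10 = 0.1. Choose a threshold: a sample is classified as 1 when x < 2.5 and as -1 when x > 2.5. Three samples are then misclassified, those numbered 7, 8 and 9, so the error rate is 3/10 = 0.3.

Next, compute the coefficient of the first weak classifier, then update the weight of every sample.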
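The numbers in this example can be reproduced with a short script. The data table is not shown in the text, so the values below are an assumption: x = 0..9 with the labels of Example 8.1 in 統計學習方法, which the explanation appears to follow. The coefficient formula alpha = 0.5 * ln((1 - e) / e) is likewise the one used in that book.

```python
import numpy as np

# ASSUMED data: x = 0..9 with the labels of Example 8.1 in 統計學習方法.
# The original table is not reproduced in the text above.
x = np.arange(10)
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])

w = np.full(10, 0.1)                 # initial weights, 1/10 each
pred = np.where(x < 2.5, 1, -1)      # first weak classifier, threshold 2.5

wrong = pred != y
error = w[wrong].sum()               # weighted error rate
alpha = 0.5 * np.log((1 - error) / error)  # coefficient of this classifier

# Raise the weights of misclassified samples, lower the others,
# then renormalize so the weights sum to 1.
w_new = w * np.exp(-alpha * y * pred)
w_new /= w_new.sum()

print("misclassified sample numbers:", (np.flatnonzero(wrong) + 1).tolist())
print("error = %.1f, alpha = %.4f" % (error, alpha))
print("new weights:", np.round(w_new, 5).tolist())
```

Under these assumed labels, the misclassified samples are numbers 7, 8 and 9, the coefficient comes out at about 0.4236, and the weights of the misclassified samples grow from 0.1 to about 0.1667 while the others shrink to about 0.0714.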
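The explanation above is drawn from Baidu Baike (百度百科) and 統計學習方法 (Statistical Learning Methods).
The code implementation follows. Source: https://github.com/eriklindernoren/ML-From-Scratch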
from __future__ import division, print_function
import numpy as np
import math
from sklearn import datasets
import matplotlib.pyplot as plt
import pandas as pd

# Import helper functions
from mlfromscratch.utils import train_test_split, accuracy_score, Plot


# Decision stump used as weak classifier in this impl. of Adaboost
class DecisionStump():
    def __init__(self):
        # Determines if sample shall be classified as -1 or 1 given threshold
        self.polarity = 1
        # The index of the feature used to make classification
        self.feature_index = None
        # The threshold value that the feature should be measured against
        self.threshold = None
        # Value indicative of the classifier's accuracy
        self.alpha = None


class Adaboost():
    """Boosting method that uses a number of weak classifiers in
    ensemble to make a strong classifier. This implementation uses decision
    stumps, which is a one level Decision Tree.

    Parameters:
    -----------
    n_clf: int
        The number of weak classifiers that will be used.
    """
    def __init__(self, n_clf=5):
        self.n_clf = n_clf

    def fit(self, X, y):
        n_samples, n_features = np.shape(X)

        # Initialize weights to 1/N
        w = np.full(n_samples, (1 / n_samples))

        self.clfs = []
        # Iterate through classifiers
        for _ in range(self.n_clf):
            clf = DecisionStump()
            # Minimum error given for using a certain feature value threshold
            # for predicting sample label
            min_error = float('inf')
            # Iterate through every unique feature value and see what value
            # makes the best threshold for predicting y
            for feature_i in range(n_features):
                feature_values = np.expand_dims(X[:, feature_i], axis=1)
                unique_values = np.unique(feature_values)
                # Try every unique feature value as threshold
                for threshold in unique_values:
                    p = 1
                    # Set all predictions to '1' initially
                    prediction = np.ones(np.shape(y))
                    # Label the samples whose values are below threshold as '-1'
                    prediction[X[:, feature_i] < threshold] = -1
                    # Error = sum of weights of misclassified samples
                    error = sum(w[y != prediction])

                    # If the error is over 50% we flip the polarity so that samples that
                    # were classified as -1 are classified as 1, and vice versa
                    # E.g error = 0.8 => (1 - error) = 0.2
                    if error > 0.5:
                        error = 1 - error
                        p = -1

                    # If this threshold resulted in the smallest error we save the
                    # configuration
                    if error < min_error:
                        clf.polarity = p
                        clf.threshold = threshold
                        clf.feature_index = feature_i
                        min_error = error
            # Calculate the alpha which is used to update the sample weights,
            # Alpha is also an approximation of this classifier's proficiency
            clf.alpha = 0.5 * math.log((1.0 - min_error) / (min_error + 1e-10))
            # Set all predictions to '1' initially
            predictions = np.ones(np.shape(y))
            # The indexes where the sample values are below threshold
            negative_idx = (clf.polarity * X[:, clf.feature_index] < clf.polarity * clf.threshold)
            # Label those as '-1'
            predictions[negative_idx] = -1
            # Calculate new weights
            # Misclassified samples get larger weights and correctly classified samples smaller
            w *= np.exp(-clf.alpha * y * predictions)
            # Normalize to one
            w /= np.sum(w)

            # Save classifier
            self.clfs.append(clf)

    def predict(self, X):
        n_samples = np.shape(X)[0]
        y_pred = np.zeros((n_samples, 1))
        # For each classifier => label the samples
        for clf in self.clfs:
            # Set all predictions to '1' initially
            predictions = np.ones(np.shape(y_pred))
            # The indexes where the sample values are below threshold
            negative_idx = (clf.polarity * X[:, clf.feature_index] < clf.polarity * clf.threshold)
            # Label those as '-1'
            predictions[negative_idx] = -1
            # Add predictions weighted by the classifiers alpha
            # (alpha indicative of classifier's proficiency)
            y_pred += clf.alpha * predictions

        # Return sign of prediction sum
        y_pred = np.sign(y_pred).flatten()

        return y_pred


def main():
    data = datasets.load_digits()
    X = data.data
    y = data.target

    digit1 = 1
    digit2 = 8
    idx = np.append(np.where(y == digit1)[0], np.where(y == digit2)[0])
    y = data.target[idx]
    # Change labels to {-1, 1}
    y[y == digit1] = -1
    y[y == digit2] = 1
    X = data.data[idx]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

    # Adaboost classification with 5 weak classifiers
    clf = Adaboost(n_clf=5)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    print("Accuracy:", accuracy)

    # Reduce dimensions to 2d using pca and plot the results
    Plot().plot_in_2d(X_test, y_pred, title="Adaboost", accuracy=accuracy)


if __name__ == "__main__":
    main()
Output:
Accuracy: 0.9213483146067416