sklearn
Table of Contents
- home http://scikit-learn.org/stable/index.html
- docs http://scikit-learn.org/stable/documentation.html
- a diagram illustrating how to choose the right algorithm
1 Overview
- supervised learning
- classification # Identifying which category a new observation belongs to.
- regression # Predicting a continuous value for a new example.
- unsupervised learning
- clustering # Automatic grouping of similar objects into sets.
- dimensionality reduction # Reducing the number of random variables to consider.
- model selection and evaluation # Comparing, validating and choosing parameters and models.
- dataset transformations # Feature extraction and normalization.
- dataset loading utilities
2 Building Blocks
http://scipy-lectures.github.io/ — the three workhorses underlying sklearn: numpy, scipy, matplotlib.
https://docs.scipy.org/doc/numpy-dev/user/quickstart.html
http://cs231n.github.io/python-numpy-tutorial/
numpy. Array/matrix representation and computation. # import numpy as np
numpy provides:
- extension package to Python for multi-dimensional arrays
- closer to hardware (efficiency)
- designed for scientific computation (convenience)
- also known as array oriented computing
array attributes
- ndim # number of dimensions
- shape # size of each dimension
- dtype # element storage type
- T # transposed array
- size # number of elements
- itemsize # memory size of each element
- nbytes # total memory used
array indexing
- a[d1, d2, …] # multi-dimensional access
- a[<array>, …] # fancy indexing
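A minimal sketch of the attributes and indexing forms listed above (the output values in the comments assume a default integer dtype, which is platform dependent):
import numpy as np

a = np.arange(12).reshape(3, 4)   # 3x4 array containing 0..11
print(a.ndim)        # 2      -> number of dimensions
print(a.shape)       # (3, 4) -> size of each dimension
print(a.dtype)       # int64 on most platforms -> element storage type
print(a.size)        # 12     -> number of elements
print(a.itemsize)    # 8      -> bytes per element (for int64)
print(a.nbytes)      # 96     -> total bytes (size * itemsize)
print(a.T.shape)     # (4, 3) -> transposed view

print(a[1, 2])       # multi-dimensional access -> 6
print(a[[0, 2], :])  # fancy indexing: rows 0 and 2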
pylab. Plotting. # import pylab as plt
pylab is powerful but fairly low-level; you still have to write a lot of code to get nice-looking plots. Consider other visualization libraries such as ggplot or seaborn instead. See the reference links.
scipy. Advanced numerical processing and computation.
The scipy package contains various toolboxes dedicated to common issues in scientific computing. Its different submodules correspond to different applications, such as interpolation, integration, optimization, image processing, statistics, special functions, etc. scipy can be compared to other standard scientific-computing libraries, such as the GSL (GNU Scientific Library for C and C++), or Matlab’s toolboxes. scipy is the core package for scientific routines in Python; it is meant to operate efficiently on numpy arrays, so that numpy and scipy work hand in hand.
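As a small, hedged illustration of how scipy operates hand in hand with numpy arrays (the quadratic being minimized is just an example, not from the original note):
import numpy as np
from scipy import optimize

# minimize a simple quadratic f(x) = (x - 3)^2 starting from x = 0
result = optimize.minimize(lambda x: (x[0] - 3.0) ** 2, x0=np.array([0.0]))
print(result.x)   # should be close to [3.]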
3 Supervised Learning
3.1 Support Vector Machines
http://scikit-learn.org/stable/modules/svm.html
SVMs can be used for classification, regression, and outlier detection.
In sklearn, SVMs are split into SVC/SVR and NuSVC/NuSVR. The difference between the two is described here, but it should be minor: "It can be shown that the Nu-SVC formulation is a reparametrization of the C-SVC and therefore mathematically equivalent."
For classification there are three classifiers: SVC, NuSVC, and LinearSVC. LinearSVC is equivalent to SVC with the 'linear' kernel, except that SVC is built on libsvm while LinearSVC uses liblinear; LinearSVC also does not expose support_ (the support vectors) in its results. For multi-class problems SVC uses one-vs-one, i.e. it builds C(n,2) classifiers, whereas LinearSVC uses one-vs-rest and builds n classifiers. LinearSVC also offers a more elaborate algorithm that handles multi-class with a single classifier. For regression there are two regressors, SVR and NuSVR. Both the classifiers and regressors allow outputting probability estimates directly. For outlier detection there is OneClassSVM.
Supported kernel functions: 1. linear 2. polynomial 3. rbf 4. sigmoid (tanh). For unbalanced problems, the sklearn implementation lets you specify 1. class_weight 2. sample_weight. class_weight is the weight of each class and must be set when constructing the classifier; if unsure, set it to 'auto'. sample_weight is the weight of each individual sample and can be passed when calling the training method fit. Another important parameter is C (the penalty cost); setting it to 1.0 is usually enough, but if the data is very noisy it is better to reduce it.
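A minimal sketch of the two weighting mechanisms (the toy data and weight values are illustrative assumptions, not recommendations):
import numpy as np
from sklearn import svm

X = np.array([[0.0], [0.2], [1.0], [1.2], [1.4]])
y = np.array([0, 0, 1, 1, 1])

# class_weight is set when constructing the classifier
clf = svm.SVC(kernel='linear', C=1.0, class_weight={0: 2.0, 1: 1.0})

# sample_weight is passed to fit(); here the last sample counts more
clf.fit(X, y, sample_weight=np.array([1.0, 1.0, 1.0, 1.0, 5.0]))
print(clf.predict([[0.5]]))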
In terms of computational efficiency, SVMs are solved via QP. The libsvm-based implementation has time complexity between O(d * n^2) and O(d * n^3), depending on how the cache is used, so if there is enough memory, increasing cache_size speeds things up. Here d is the number of features; for sparse datasets d can be taken as the average number of non-zero features per sample. libsvm is best limited to datasets of at most about 10k samples. By comparison, liblinear is far more efficient and can easily train on millions of samples.
#!/usr/bin/env python
#coding:utf-8
#Copyright (C) dirlt

from sklearn import datasets
from sklearn import svm
from sklearn import cross_validation
from sklearn.metrics import classification_report

iris = datasets.load_iris()
digits = datasets.load_digits()

clf = svm.SVC(gamma = 0.001, C = 1.0)
# (data, target) = (iris.data, iris.target)
(data, target) = (digits.data, digits.target)
X_tr, X_tt, y_tr, y_tt = cross_validation.train_test_split(data, target, test_size = 0.3, random_state = 0)
clf.fit(X_tr, y_tr)
y_true, y_pred = y_tt, clf.predict(X_tt)
print(classification_report(y_true, y_pred))
3.2 Ensemble methods
http://scikit-learn.org/stable/modules/ensemble.html
Ensemble methods generally fall into two categories:
- averaging methods. Build several hypotheses independently and then average their predictions. Each individual hypothesis is fairly strong but prone to overfitting; averaging reduces variance. Usually each model is trained on only part of the data. Examples include Bagging and Random Forest. sklearn provides a bagging meta-estimator that accepts a base estimator and does the averaging automatically (see the sketch after this list). RF also comes in two variants; the second one additionally randomizes the threshold chosen when growing the decision trees.
- boosting methods. Apply the same algorithm repeatedly, correcting the model iteratively, and combine the results. Each individual hypothesis is weak, but the combination gives a strong hypothesis. Usually each model is trained on the full data. Examples include AdaBoost and Gradient Boosting. In sklearn's AdaBoost the default base estimator is a DecisionTree; in GBDT the base estimator is fixed to a decision tree, but the loss function can be customized.
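A minimal sketch of the bagging meta-estimator mentioned in the averaging bullet, wrapping a decision tree as the base estimator (parameter values are illustrative only; BaggingClassifier requires sklearn >= 0.15):
from sklearn import datasets
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
# each tree is trained on a random subset of samples and features
clf = BaggingClassifier(DecisionTreeClassifier(),
                        n_estimators = 10,
                        max_samples = 0.8,
                        max_features = 0.8)
clf.fit(iris.data, iris.target)
print(clf.score(iris.data, iris.target))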
Another benefit of using decision trees for classification and regression is that you can see how important each feature is: the higher a feature sits in the decision tree, the more important it is. My understanding, though, is that this notion of feature importance only applies to decision-tree-style training.
#note: judging from the program below, GBDT is slightly worse than RF, and GBDT's running time is noticeably longer than RF's. On the iris dataset the two perform about the same.
#!/usr/bin/env python
#coding:utf-8
#Copyright (C) dirlt

from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn import cross_validation
from sklearn.metrics import classification_report

iris = datasets.load_iris()
digits = datasets.load_digits()

# (data, target) = (iris.data, iris.target)
(data, target) = (digits.data, digits.target)
X_tr, X_tt, y_tr, y_tt = cross_validation.train_test_split(data, target, test_size = 0.3, random_state = 0)

print '----------RandomForest----------'
clf = RandomForestClassifier(n_estimators = 100, bootstrap = True, oob_score = True)
clf.fit(X_tr, y_tr)
print 'OOB Score = %.4f' % clf.oob_score_
print 'Feature Importance = %s' % clf.feature_importances_
y_true, y_pred = y_tt, clf.predict(X_tt)
print(classification_report(y_true, y_pred))

print '----------GradientBoosting----------'
clf = GradientBoostingClassifier(n_estimators = 100, learning_rate = 0.6, random_state = 0)
clf.fit(X_tr, y_tr)
print 'Feature Importance = %s' % clf.feature_importances_
y_true, y_pred = y_tt, clf.predict(X_tt)
print(classification_report(y_true, y_pred))
3.3 Nearest Neighbors
http://scikit-learn.org/stable/modules/neighbors.html
NN can be used for both supervised and unsupervised learning; the unsupervised flavor is the foundation of several other learning methods.
For finding nearest points, sklearn provides several algorithms: 1. brute-force 2. kd-tree 3. ball-tree 4. auto, where auto picks an algorithm automatically based on the data size. Brute-force does an exhaustive search, while kd-tree and ball-tree build internal data structures to speed up the search. With d the data dimensionality and N the dataset size, the time complexities of the three are roughly O(dN), O(d*logN), O(d*logN); however, when d is large, kd-tree degrades to O(dN).
When the dataset is small, 1 beats 2 and 3, so the kd-tree/ball-tree implementations fall back to brute-force for small datasets; this threshold is called leaf_size. leaf_size affects 1. index construction time (inversely related) 2. query time (an appropriate leaf_size gives the optimum) 3. memory usage (inversely related). So increase leaf_size as much as possible without hurting query time.
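A minimal sketch of the underlying unsupervised neighbor search with an explicit algorithm and leaf_size (the values are illustrative):
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(1000, 3)          # 1000 points in 3 dimensions
nn = NearestNeighbors(n_neighbors = 5, algorithm = 'ball_tree', leaf_size = 30)
nn.fit(X)
# distances and indices of the 5 nearest neighbors of one query point
dist, ind = nn.kneighbors([[0.5, 0.5, 0.5]])
print(dist)
print(ind)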
The classifiers and regressors are basically thin wrappers on top of these data structures. We can specify the distance function as well as the aggregation function applied once the nearest points are found. The default distance is minkowski (p=2, i.e. Euclidean distance); the aggregation weights can be uniform or distance (inversely proportional to distance). KNeighborsClassifier picks the k nearest points, whereas RadiusNeighborsClassifier picks all points within radius. There is also a NearestCentroid classifier: if y has k classes, group the data into k groups, compute each group's centroid, and predict the class whose centroid is closest.
#!/usr/bin/env python
#coding:utf-8
#Copyright (C) dirlt

from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn import cross_validation
from sklearn.metrics import classification_report

iris = datasets.load_iris()
digits = datasets.load_digits()

# (data, target) = (iris.data, iris.target)
(data, target) = (digits.data, digits.target)
X_tr, X_tt, y_tr, y_tt = cross_validation.train_test_split(data, target, test_size = 0.3, random_state = 0)
clf = KNeighborsClassifier(n_neighbors = 10)
clf.fit(X_tr, y_tr)
y_true, y_pred = y_tt, clf.predict(X_tt)
print(classification_report(y_true, y_pred))
3.4 Naive Bayes
http://scikit-learn.org/stable/modules/naive_bayes.html
Naive Bayes is used for classification. Its two main tasks are estimating 1. P(X|y) 2. P(y), both via MLE (maximum likelihood estimation). P(y) is relatively easy to compute; for P(X|y) there are three approaches:
- If Xi is a continuous quantity: Gaussian Naive Bayes. Take all the Xi data points with y=k and assume they follow a Gaussian distribution; once that Gaussian's mean and std are computed, P(X|y=k) can be evaluated. This model has d * k parameters.
- If Xi is a discrete quantity: Multinomial Naive Bayes. Then P(X=u|y=k) = P(X=u, y=k) / P(y=k). This model has k * ∑{Xi} parameters. It also has a smoothing parameter.
- Further, if Xi is binary (0/1): Bernoulli Naive Bayes. We usually need to provide the binarize parameter, which is used to convert X into (0,1) values.
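A minimal sketch of the binarize parameter (the 0.5 threshold and the toy data are illustrative assumptions):
import numpy as np
from sklearn.naive_bayes import BernoulliNB

X = np.random.rand(6, 4)             # continuous features in [0, 1)
y = np.array([0, 0, 0, 1, 1, 1])
# features are thresholded at 0.5 into {0, 1} before fitting
clf = BernoulliNB(binarize = 0.5)
clf.fit(X, y)
print(clf.predict(X))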
#!/usr/bin/env python
#coding:utf-8
#Copyright (C) dirlt

from sklearn import datasets
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn import cross_validation
from sklearn.metrics import classification_report

iris = datasets.load_iris()
digits = datasets.load_digits()

(data, target) = (iris.data, iris.target)
clf = GaussianNB()
# (data, target) = (digits.data, digits.target)
# clf = MultinomialNB()
X_tr, X_tt, y_tr, y_tt = cross_validation.train_test_split(data, target, test_size = 0.3, random_state = 0)
clf.fit(X_tr, y_tr)
y_true, y_pred = y_tt, clf.predict(X_tt)
print(classification_report(y_true, y_pred))
4 Model selection and evaluation
4.1 Cross-validation: evaluating estimator performance
http://scikit-learn.org/stable/modules/cross_validation.html
- Use train_test_split to separate the training_set and test_set.
- Use k-fold and similar schemes to carve a validation_set out of the training_set for cross_validation.
- Use cross_val_score to run cross_validation and score its results.
#!/usr/bin/env python
#coding:utf-8
#Copyright (C) dirlt

import numpy as np
from sklearn import cross_validation
from sklearn import datasets
from sklearn import svm

# iris.data.shape = (150, 4); n_samples = 150, n_features = 4
iris = datasets.load_iris()

# hold out 40% as the test set; random_state is the random seed
X_train, X_test, y_train, y_test = cross_validation.train_test_split(iris.data, iris.target, test_size = 0.4, random_state = 0)

# assume we have already finished searching the parameter space
clf = svm.SVC(gamma = 0.001, C = 100., kernel = 'linear')

# use cross_validation to check how well the parameters work
scores = cross_validation.cross_val_score(clf, X_train, y_train, cv = 3)
print("Accuracy on cv: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

# if the results look good, use this model on the test data
clf.fit(X_train, y_train)
print(np.mean(clf.predict(X_test) == y_test))
4.2 Grid Search: searching for estimator parameters
http://scikit-learn.org/stable/modules/grid_search.html
Parameter-space search strategies fall roughly into three categories: 1. exhaustive (grid) search 2. random search 3. ad hoc. 2 and 3 are tied to specific algorithms.
We take grid search as the example here. We only need to supply, as a dict, the candidate list for each parameter to search. Because the search internally uses cross_validation for validation, we only need to pass the cross_validation parameters. The code below is taken from this example.
#!/usr/bin/env python
#coding:utf-8
#Copyright (C) dirlt

from __future__ import print_function

from sklearn import datasets
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC

# Loading the Digits dataset
digits = datasets.load_digits()

# To apply a classifier on this data, we need to flatten the images and
# turn the data into a (samples, features) matrix:
(n_samples, h, w) = digits.images.shape
# digits.data and digits.target could be used directly here; digits.data is already the reshaped result.
X = digits.images.reshape((n_samples, -1))
y = digits.target

# Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

# Set the parameters by cross-validation
# candidate lists for each parameter to search
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

# the code in the linked example also searches over the scoring metric used for cross_validation
clf = GridSearchCV(SVC(), tuned_parameters, cv=5)  # use k-fold to carve out the validation_set; k = 5
clf.fit(X_train, y_train)

print("Best parameters set found on development set:")
print(clf.best_estimator_)
print("Grid scores on development set:")
for params, mean_score, scores in clf.grid_scores_:
    print("%0.3f (+/-%0.03f) for %r" % (mean_score, scores.std() / 2, params))
print("Detailed classification report:")
print("The model is trained on the full development set.")
print("The scores are computed on the full evaluation set.")
y_true, y_pred = y_test, clf.predict(X_test)
print(classification_report(y_true, y_pred))
Finally, the code applies the best model to the test data and prints the scores with classification_report.
Best parameters set found on development set:
SVC(C=10, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.001,
  kernel=rbf, max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
Grid scores on development set:
0.986 (+/-0.001) for {'kernel': 'rbf', 'C': 1, 'gamma': 0.001}
0.963 (+/-0.004) for {'kernel': 'rbf', 'C': 1, 'gamma': 0.0001}
0.989 (+/-0.003) for {'kernel': 'rbf', 'C': 10, 'gamma': 0.001}
0.985 (+/-0.003) for {'kernel': 'rbf', 'C': 10, 'gamma': 0.0001}
0.989 (+/-0.003) for {'kernel': 'rbf', 'C': 100, 'gamma': 0.001}
0.983 (+/-0.003) for {'kernel': 'rbf', 'C': 100, 'gamma': 0.0001}
0.989 (+/-0.003) for {'kernel': 'rbf', 'C': 1000, 'gamma': 0.001}
0.983 (+/-0.003) for {'kernel': 'rbf', 'C': 1000, 'gamma': 0.0001}
0.976 (+/-0.005) for {'kernel': 'linear', 'C': 1}
0.976 (+/-0.005) for {'kernel': 'linear', 'C': 10}
0.976 (+/-0.005) for {'kernel': 'linear', 'C': 100}
0.976 (+/-0.005) for {'kernel': 'linear', 'C': 1000}
Detailed classification report:
The model is trained on the full development set.
The scores are computed on the full evaluation set.
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        60
          1       0.95      1.00      0.97        73
          2       1.00      0.97      0.99        71
          3       1.00      1.00      1.00        70
          4       1.00      1.00      1.00        63
          5       0.99      0.97      0.98        89
          6       0.99      1.00      0.99        76
          7       0.98      1.00      0.99        65
          8       1.00      0.96      0.98        78
          9       0.97      0.99      0.98        74

avg / total       0.99      0.99      0.99       719
4.3 Pipeline: chaining estimators
http://scikit-learn.org/stable/modules/pipeline.html
Chain several processing stages (e.g. feature transformation followed by an estimator) into a single automated workflow.
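A minimal sketch, assuming a scaler followed by an SVM as the two stages (this particular chain is an illustration, not from the original note):
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

digits = datasets.load_digits()
# scale features first, then feed them into the SVM; the whole chain
# behaves like a single estimator (fit / predict / use inside GridSearchCV)
pipe = Pipeline([('scale', StandardScaler()), ('svm', SVC(gamma = 0.001, C = 1.0))])
pipe.fit(digits.data, digits.target)
print(pipe.score(digits.data, digits.target))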
4.4 Model evaluation: quantifying the quality of predictions
http://scikit-learn.org/stable/modules/model_evaluation.html
There are 3 different approaches to evaluate the quality of predictions of a model: # three different ways to evaluate a model's predictions
- Estimator score method: Estimators have a score method providing a default evaluation criterion for the problem they are designed to solve. # the model's own internal evaluation, e.g. a loss/score function
- Scoring parameter: Model-evaluation tools using cross-validation (such as cross_validation.cross_val_score and grid_search.GridSearchCV) rely on an internal scoring strategy. # the cross-validation score, usually a single number, e.g. 'f1'.
- Metric functions: The metrics module implements functions assessing prediction errors for specific purposes. # evaluation applied to the test data; can be numeric or text/graphical, e.g. classification_report.
2 and 3 are closely related. The difference is that 3 is applied to the test data, which we want to analyze further, so it offers a wider range of evaluations, while 2 is used during model selection, where a single numeric score is preferred.
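A minimal sketch contrasting the three approaches on the same model (accuracy is used here for simplicity instead of 'f1'):
from sklearn import datasets, svm, cross_validation
from sklearn.metrics import accuracy_score

digits = datasets.load_digits()
X_tr, X_tt, y_tr, y_tt = cross_validation.train_test_split(
    digits.data, digits.target, test_size = 0.3, random_state = 0)

clf = svm.SVC(gamma = 0.001, C = 1.0)

# 2. scoring parameter: used by cross-validation tools during model selection
print(cross_validation.cross_val_score(clf, X_tr, y_tr, cv = 3, scoring = 'accuracy'))

# 1. the estimator's own score method (mean accuracy for classifiers)
clf.fit(X_tr, y_tr)
print(clf.score(X_tt, y_tt))

# 3. a metric function applied to predictions on the held-out test data
print(accuracy_score(y_tt, clf.predict(X_tt)))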
sklearn also provides dummy estimators (DummyClassifier and DummyRegressor). They implement only a few very simple strategies and are mainly used to establish a baseline (a minimal sketch follows after the two lists below).
DummyClassifier implements four such simple strategies for classification:
- 'stratified' generates randomly predictions by respecting the training set’s class distribution,
- 'most_frequent' always predicts the most frequent label in the training set,
- 'uniform' generates predictions uniformly at random.
- 'constant' always predicts a constant label that is provided by the user.
DummyRegressor also implements three simple rules of thumb for regression:
- 'mean' always predicts the mean of the training targets.
- 'median' always predicts the median of the training targets.
- 'constant' always predicts a constant value that is provided by the user.
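A minimal sketch of using a dummy baseline next to a real classifier (the strategy and dataset are illustrative choices):
from sklearn import datasets, svm, cross_validation
from sklearn.dummy import DummyClassifier

digits = datasets.load_digits()
X_tr, X_tt, y_tr, y_tt = cross_validation.train_test_split(
    digits.data, digits.target, test_size = 0.3, random_state = 0)

# baseline: always predict the most frequent label in the training set
baseline = DummyClassifier(strategy = 'most_frequent')
baseline.fit(X_tr, y_tr)
print('baseline accuracy = %.4f' % baseline.score(X_tt, y_tt))

clf = svm.SVC(gamma = 0.001, C = 1.0)
clf.fit(X_tr, y_tr)
print('SVC accuracy = %.4f' % clf.score(X_tt, y_tt))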
4.5 Model persistence
http://scikit-learn.org/stable/modules/model_persistence.html
You can use Python's built-in pickle module or sklearn's joblib. joblib serializes to disk more efficiently than pickle, but unlike pickle it cannot serialize to a string.
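A minimal sketch of both options (the file name is a placeholder; the sklearn.externals.joblib import path matches the sklearn versions this note targets, newer releases use the standalone joblib package):
import pickle
from sklearn import datasets, svm
from sklearn.externals import joblib

iris = datasets.load_iris()
clf = svm.SVC().fit(iris.data, iris.target)

# pickle: can serialize to an in-memory string/bytes object
s = pickle.dumps(clf)
clf2 = pickle.loads(s)

# joblib: more efficient on disk for objects carrying large numpy arrays
joblib.dump(clf, 'model.pkl')
clf3 = joblib.load('model.pkl')
print(clf2.predict(iris.data[:1]))
print(clf3.predict(iris.data[:1]))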
4.6 Validation curves: plotting scores to evaluate models
http://scikit-learn.org/stable/modules/learning_curve.html
Every estimator has its advantages and drawbacks. Its generalization error can be decomposed in terms of bias, variance and noise. The bias of an estimator is its average error for different training sets. The variance of an estimator indicates how sensitive it is to varying training sets. Noise is a property of the data. # bias is the model's average error across different training sets; variance is how sensitive the model is to varying training sets; noise is a property of the data itself. All three contribute to prediction error.
#note: this feature seems to have appeared in 0.15; the sklearn 0.14.1 I previously installed via apt-get has no learning_curve module.
validation curve
Observe how varying one model parameter affects the scores on the training_set and validation_set, to determine whether the model is underfitting or overfitting. See this example for the plotting code.
If the training score and the validation score are both low, the estimator will be underfitting. If the training score is high and the validation score is low, the estimator is overfitting and otherwise it is working very well. A low training score and a high validation score is usually not possible. All three cases can be found in the plot below where we vary the parameter gamma on the digits dataset.
You can see that around gamma = 5 * 10^-4 the cross-validation score starts to drop while the training score stays high, which indicates overfitting.
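A minimal sketch of computing (not plotting) a validation curve over gamma (module path as of sklearn 0.15; newer versions moved it to sklearn.model_selection):
import numpy as np
from sklearn import datasets
from sklearn.learning_curve import validation_curve
from sklearn.svm import SVC

digits = datasets.load_digits()
param_range = np.logspace(-6, -1, 5)
train_scores, valid_scores = validation_curve(
    SVC(), digits.data, digits.target,
    param_name = 'gamma', param_range = param_range, cv = 5)
# one row per gamma value, one column per cv fold
print(train_scores.mean(axis = 1))
print(valid_scores.mean(axis = 1))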
learning curve
Observe whether adding more data improves performance. More data usually makes the training score and validation score converge. If they converge at a low score (high bias), adding more data will not help and we need to switch models; conversely, if they converge at a high score, adding more data can improve performance. See this example for the plotting code.
The first plot is the learning curve of Naive Bayes, showing the high-bias case. The second plot is the learning curve of an SVM (RBF kernel), which clearly learns better than Naive Bayes.
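A minimal sketch of computing (not plotting) a learning curve for Naive Bayes (same module-path caveat as above):
import numpy as np
from sklearn import datasets
from sklearn.learning_curve import learning_curve
from sklearn.naive_bayes import GaussianNB

digits = datasets.load_digits()
train_sizes, train_scores, valid_scores = learning_curve(
    GaussianNB(), digits.data, digits.target,
    train_sizes = np.linspace(0.1, 1.0, 5), cv = 5)
# watch whether the two curves converge, and at what score level
print(train_sizes)
print(train_scores.mean(axis = 1))
print(valid_scores.mean(axis = 1))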
Source: http://dirlt.com/sklearn.html