The data used in this post form a multi-output classification problem: every record is described by 9 categories of attributes, each category is further divided into several labels, and the task is to predict the label of every record under each of the 9 categories (note: a record carries exactly one label per category). A detailed description of the data and the problem can be found at this link.
This post summarizes and extends the methods covered in the DataCamp course Machine Learning with the Experts: School Budgets.
1. Build a function that splits the data into training and test sets
The function operates on the 2-D label matrix of the data [shape: (number of samples, number of labels)], in which every entry is 0 or 1. It guarantees that, for every label (i.e. for every column of the label matrix), a certain number of samples taking the value 1 in that column are assigned to the test set.

import numpy as np
import pandas as pd
from warnings import warn


### Takes a label matrix 'y' and returns the indices for a sample with size
### 'size' if 'size' > 1 or 'size' * len(y) if 'size' <= 1.
### The sample is guaranteed to have > 'min_count' of each label.
def multilabel_sample(y, size=1000, min_count=5, seed=None):
    try:
        if (np.unique(y).astype(int) != np.array([0, 1])).any():
            raise ValueError()
    except (TypeError, ValueError):
        raise ValueError('multilabel_sample only works with binary indicator matrices')

    if (y.sum(axis=0) < min_count).any():
        raise ValueError('Some labels do not have enough examples. Change min_count if necessary.')

    if size <= 1:
        size = np.floor(y.shape[0] * size)

    if y.shape[1] * min_count > size:
        msg = "Size less than number of columns * min_count, returning {} items instead of {}."
        warn(msg.format(y.shape[1] * min_count, size))
        size = y.shape[1] * min_count  # size should be at least this value for having > min_count of each label

    rng = np.random.RandomState(seed if seed is not None else np.random.randint(1))

    if isinstance(y, pd.DataFrame):
        choices = y.index
        y = y.values
    else:
        choices = np.arange(y.shape[0])

    sample_idxs = np.array([], dtype=choices.dtype)

    # first, guarantee > min_count of each label
    for j in range(y.shape[1]):
        label_choices = choices[y[:, j] == 1]
        label_idxs_sampled = rng.choice(label_choices, size=min_count, replace=False)
        sample_idxs = np.concatenate([label_idxs_sampled, sample_idxs])

    sample_idxs = np.unique(sample_idxs)

    # now that we have at least min_count of each, we can just random sample
    sample_count = int(size - sample_idxs.shape[0])

    # get sample_count indices from remaining choices
    remaining_choices = np.setdiff1d(choices, sample_idxs)
    remaining_sampled = rng.choice(remaining_choices, size=sample_count, replace=False)

    return np.concatenate([sample_idxs, remaining_sampled])


### Takes a features matrix X and a label matrix Y.
### Returns (X_train, X_test, Y_train, Y_test) where each label in Y is represented at least min_count times.
def multilabel_train_test_split(X, Y, size, min_count=5, seed=None):
    index = Y.index if isinstance(Y, pd.DataFrame) else np.arange(Y.shape[0])
    test_set_idxs = multilabel_sample(Y, size=size, min_count=min_count, seed=seed)
    test_set_mask = index.isin(test_set_idxs)
    train_set_mask = ~test_set_mask
    return (X[train_set_mask], X[test_set_mask], Y[train_set_mask], Y[test_set_mask])
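
As a quick illustration of the splitter's guarantee (a toy sketch with made-up data, not part of the original workflow), a small synthetic indicator matrix can be used:

import numpy as np
import pandas as pd

# toy data: 40 samples, 3 binary indicator columns (hypothetical example)
rng = np.random.RandomState(0)
Y_toy = pd.DataFrame(rng.randint(0, 2, size=(40, 3)), columns=['a', 'b', 'c'])
X_toy = pd.DataFrame({'feat': rng.randn(40)})

X_tr, X_te, Y_tr, Y_te = multilabel_train_test_split(X_toy, Y_toy, size=0.3, min_count=2, seed=1)
print(Y_te.sum(axis=0))  # every indicator column has at least min_count positives in the test set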
2. Define the evaluation metric
For every sample, the predicted label within each category is scored separately, and the scores of all categories are then averaged.

### The data have 9 categories; each category contains a different number of labels
LABEL_INDICES = [range(0, 37), range(37, 48), range(48, 51), range(51, 76), range(76, 79),
                 range(79, 82), range(82, 87), range(87, 96), range(96, 104)]


### Logarithmic Loss metric
### predicted, actual: 2D numpy array
def multi_multi_log_loss(predicted, actual, label_column_indices=LABEL_INDICES, eps=1e-15):
    label_scores = np.ones(len(label_column_indices), dtype=np.float64)

    # calculate log loss for each set of columns that belong to a category
    for k, this_label_indices in enumerate(label_column_indices):
        # get just the columns for this label
        preds_k = predicted[:, this_label_indices].astype(np.float64)

        # normalize so probabilities sum to one (unless sum is zero, then we clip)
        preds_k /= np.clip(preds_k.sum(axis=1).reshape(-1, 1), eps, np.inf)

        actual_k = actual[:, this_label_indices]

        # shrink predictions
        y_hats = np.clip(preds_k, eps, 1 - eps)
        sum_logs = np.sum(actual_k * np.log(y_hats))
        label_scores[k] = (-1.0 / actual.shape[0]) * sum_logs

    return np.average(label_scores)
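
As a quick sanity check (a toy sketch with two hypothetical categories, not from the original post), a perfect one-hot prediction should score close to zero, while an uninformative uniform prediction scores much higher:

import numpy as np

# two hypothetical categories: columns 0-1 and columns 2-4
toy_indices = [range(0, 2), range(2, 5)]
actual = np.array([[1, 0, 0, 1, 0],
                   [0, 1, 1, 0, 0]], dtype=np.float64)

perfect = actual.copy()               # predicted probabilities equal to the truth
uniform = np.full_like(actual, 0.2)   # no information at all

print(multi_multi_log_loss(perfect, actual, label_column_indices=toy_indices))  # close to 0
print(multi_multi_log_loss(uniform, actual, label_column_indices=toy_indices))  # about 0.9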
3. Load and split the dataset

### Load and split the data
df = pd.read_csv('TrainingData.csv', index_col=0)

LABELS = ['Function', 'Use', 'Sharing', 'Reporting', 'Student_Type', 'Position_Type',
          'Object_Type', 'Pre_K', 'Operating_Status']
NON_LABELS = [c for c in df.columns if c not in LABELS]
NUMERIC_COLUMNS = ['FTE', 'Total']

label_dummies = pd.get_dummies(df[LABELS], prefix_sep='__')
X_train, X_test, y_train, y_test = multilabel_train_test_split(df[NON_LABELS], label_dummies,
                                                               size=0.2, seed=123)
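
Note that the hard-coded LABEL_INDICES in step 2 relies on the column order produced by pd.get_dummies above (dummy columns are grouped per category, in the order of LABELS). If in doubt, the ranges can be rebuilt from label_dummies itself; a small sketch (the variable label_indices is introduced here only for illustration):

# rebuild the per-category column ranges from the dummy-encoded labels,
# so they stay in sync with label_dummies instead of being hard-coded
label_indices = []
start = 0
for label in LABELS:
    n_cols = sum(col.startswith(label + '__') for col in label_dummies.columns)
    label_indices.append(range(start, start + n_cols))
    start += n_cols
# label_indices should reproduce LABEL_INDICES from step 2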
4. Process the text features
Combine all text features of each sample into a single string.

from sklearn.preprocessing import FunctionTransformer


def combine_text_columns(data_frame, to_drop=NUMERIC_COLUMNS + LABELS):
    to_drop = set(to_drop) & set(data_frame.columns.tolist())  # Drop non-text columns that are in the df
    text_data = data_frame.drop(to_drop, axis=1)
    text_data.fillna('', inplace=True)
    # Join all text items in a row that have a space in between
    return text_data.apply(lambda x: ' '.join(x), axis=1)


get_text_data = FunctionTransformer(combine_text_columns, validate=False)
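
For example (a toy frame, not the actual budget data), the transformer concatenates all non-numeric, non-label columns of each row into one space-separated string:

toy = pd.DataFrame({'Text_1': ['science supplies', 'bus driver'],
                    'Text_2': ['elementary', None],
                    'FTE': [1.0, 0.5]})
print(combine_text_columns(toy).tolist())
# ['science supplies elementary', 'bus driver ']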
Tokenize the combined string and extract new text features from 1-grams and 2-grams: record every unigram and bigram that occurs in any sample's string, then count how many times each of these unigrams and bigrams occurs in every sample's string.

from sklearn.feature_extraction.text import CountVectorizer

TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=[!"#$%&\'()*+,-./:;<=>?@[\\\\\]^_`{|}~\\s]+)'
# (?=re) is a lookahead: the part of the pattern before '(' is returned as the match only when re also matches right after it
text_vectorizer = CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC, ngram_range=(1, 2))

### Alternative Option: HashingVectorizer(token_pattern=TOKENS_ALPHANUMERIC, ngram_range=(1, 2),
###                                        alternate_sign=False, norm=None, binary=False, n_features=...)
### If CountVectorizer generates too many features, replacing it with HashingVectorizer keeps the number
### of features under control without sacrificing much accuracy.
### HashingVectorizer maps every unigram and bigram found in the strings to a hash value (several
### n-grams may map to the same value) and counts how many times each hash value appears in every
### sample's string.
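
A quick illustration of the vectorizer on toy strings (a separate demo instance is used so the pipeline's text_vectorizer stays untouched); note that the lookahead in TOKENS_ALPHANUMERIC requires every token to be followed by whitespace or punctuation, hence the trailing spaces:

demo_vectorizer = CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC, ngram_range=(1, 2))
docs = ['pre k teacher salary ', 'teacher retirement benefits ']
counts = demo_vectorizer.fit_transform(docs)
print(sorted(demo_vectorizer.vocabulary_))  # unigrams such as 'teacher' and bigrams such as 'teacher salary'
print(counts.toarray())                     # per-document counts of each unigram/bigram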
Use a chi-square test to select among the generated text features. The chi-square test proceeds as follows:
(1) Compute the observation matrix O, where Oij is the sum of feature j over all samples that carry label i.
(2) Compute the expectation matrix E, where Eij is the portion of the total of feature j that label i would receive if it were allocated according to label i's share of the samples.
(3) Compute a chi-square statistic for each feature; under the null hypothesis that the feature is independent of the labels, this statistic follows a chi-square distribution with (number of labels - 1) degrees of freedom.
(4) The larger a feature's chi-square statistic, the more important the feature is for model training and prediction.

from sklearn.feature_selection import chi2, SelectKBest

### observation O = np.dot(y.T, X); X has shape (num_sample, num_feature) with elements >= 0,
### y has shape (num_sample, num_label) with elements 0 or 1; O is a 2D matrix (num_label, num_feature)
### expectation E = np.outer(y.mean(axis=0), X.sum(axis=0)); E is a 2D matrix (num_label, num_feature)
### chi-square statistic = ((O - E)**2 / E).sum(axis=0), a 1D array (num_feature,)
chi_k = 300  # select 300 features
text_feature_selector = SelectKBest(chi2, k=chi_k)
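
To make the formulas in the comments concrete, here is a small numeric check against sklearn's chi2 on toy data (a sketch under the assumptions stated above, not part of the original post):

import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.RandomState(0)
X_toy = rng.randint(0, 5, size=(50, 4)).astype(np.float64)  # non-negative feature counts
y_toy = np.eye(3)[rng.randint(0, 3, size=50)]               # one-hot labels for 3 classes

O = y_toy.T.dot(X_toy)                               # observed,  shape (num_label, num_feature)
E = np.outer(y_toy.mean(axis=0), X_toy.sum(axis=0))  # expected,  shape (num_label, num_feature)
manual_chi2 = ((O - E) ** 2 / E).sum(axis=0)         # one statistic per feature

sk_chi2, _ = chi2(X_toy, y_toy.argmax(axis=1))
print(np.allclose(manual_chi2, sk_chi2))             # True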
5. Combine the numeric and text features, and generate new features via feature interactions

from sklearn.preprocessing import Imputer
from sklearn.pipeline import FeatureUnion, Pipeline

get_numeric_data = FunctionTransformer(lambda x: x[NUMERIC_COLUMNS], validate=False)
num_text_feature = FeatureUnion([('numeric_features', Pipeline([('selector', get_numeric_data),
                                                                ('imputer', Imputer())])),
                                 ('text_features', Pipeline([('selector', get_text_data),
                                                             ('vectorizer', text_vectorizer),
                                                             ('dim_red', text_feature_selector)]))])

### Feature Interaction
### Same idea as sklearn's PolynomialFeatures, but CountVectorizer / HashingVectorizer produce sparse
### matrices, so PolynomialFeatures cannot be applied directly
from itertools import combinations
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin


class SparseInteractions(BaseEstimator, TransformerMixin):
    def __init__(self, degree=2, feature_name_separator='&&'):
        self.degree = degree
        self.feature_name_separator = feature_name_separator

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if not sparse.isspmatrix_csc(X):
            X = sparse.csc_matrix(X)

        if hasattr(X, "columns"):
            self.orig_col_names = X.columns
        else:
            self.orig_col_names = np.array([str(i) for i in range(X.shape[1])])

        spi = self._create_sparse_interactions(X)
        return spi

    def get_feature_names(self):
        return self.feature_names

    def _create_sparse_interactions(self, X):
        out_mat = []
        self.feature_names = self.orig_col_names.tolist()

        for sub_degree in range(2, self.degree + 1):
            for col_ixs in combinations(range(X.shape[1]), sub_degree):
                # add name for new column
                name = self.feature_name_separator.join(self.orig_col_names[list(col_ixs)])
                self.feature_names.append(name)

                # get column multiplications value
                out = X[:, col_ixs[0]]
                for j in col_ixs[1:]:
                    out = out.multiply(X[:, j])

                out_mat.append(out)

        return sparse.hstack([X] + out_mat)
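
A quick look at what SparseInteractions produces for degree=2 (a toy sparse matrix, not the actual pipeline output): the original columns are kept and every pairwise product is appended.

from scipy import sparse

toy = sparse.csr_matrix(np.array([[1., 2., 0.],
                                  [0., 3., 4.]]))
inter = SparseInteractions(degree=2)
print(inter.fit_transform(toy).toarray())
# columns: the original 3 features followed by the products 0*1, 0*2 and 1*2
print(inter.get_feature_names())
# ['0', '1', '2', '0&&1', '0&&2', '1&&2']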
6. Build the model with logistic regression

from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MaxAbsScaler

pl = Pipeline([('union', num_text_feature),
               ('inter', SparseInteractions(degree=2)),
               ('scale', MaxAbsScaler()),
               ('clf', OneVsRestClassifier(LogisticRegression()))])

pl.fit(X_train, y_train)
predictions = pl.predict_proba(X_test)
print("Test Logloss: {}".format(multi_multi_log_loss(predictions, y_test.values)))
This completes the full model-building process. The problem can be explored further along the following lines:
- Natural language processing: for example, removing high-frequency but uninformative words such as the, a, an (stop-word removal)
- Choice of model: for example, training and predicting with random forests, XGBoost, and other models
- Optimization: for example, tuning the hyperparameters of each part of the final Pipeline with grid or randomized search (a short sketch follows this list)
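
A minimal tuning sketch, assuming the pipeline pl and the step names from section 6; the parameter grid below is purely illustrative:

from sklearn.model_selection import GridSearchCV

param_grid = {
    'union__text_features__dim_red__k': [100, 300],  # number of chi2-selected text features
    'clf__estimator__C': [0.1, 1.0, 10.0],           # inverse regularization strength of LogisticRegression
}

search = GridSearchCV(pl, param_grid, cv=3)
search.fit(X_train, y_train)
print(search.best_params_)

With a custom scorer built around multi_multi_log_loss, the search could optimize the competition metric directly instead of the default subset accuracy; stop-word removal from the first bullet can likewise be tried by passing stop_words='english' to CountVectorizer.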