A Detailed Summary of the Text Classification Workflow (Keras)

Reposted article, 2021-09-09 18:17, NLP

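1. Background

Deep learning work requires preprocessing and data conversion before a model can be trained. This post records the steps and methods involved so they are easy to look up and reuse later. Following the order of the workflow, it covers dataset handling, label conversion, text vectorization, model construction, and adding evaluation metrics.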

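2. Contents

2.1 Reading the data

Data is usually read directly with pandas. The main thing to watch for is the encoding: files often turn up in an encoding that cannot be decoded. The notes below summarize the common cases and how to handle them so the cause can be found quickly. The original post also pointed to a Zhihu article on the differences between these encodings.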

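1. Encoding problems
	UTF-8 is the encoding used most often. Occasionally a file cannot be decoded with it.
	Character-set size: GB2312 < GBK < GB18030. See the encoding comparison for details.
2. Handling special cases
	Use Notepad++ to check the file's byte encoding, then read the file with the matching encoding.
These steps resolve most character-encoding problems.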
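
One way to apply the advice above is to try a short list of candidate encodings in order and keep the first one that decodes cleanly. The sketch below is an illustration only and is not part of the original post; the function name decode_with_fallback and the candidate order are assumptions. GB18030 is tried after UTF-8 because it covers both GBK and GB2312.

```python
from typing import Iterable, Tuple


def decode_with_fallback(
    raw: bytes, encodings: Iterable[str] = ("utf-8", "gb18030")
) -> Tuple[str, str]:
    """Return (text, encoding) for the first candidate encoding that decodes raw."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # this encoding does not fit; try the next candidate
    raise ValueError("none of the candidate encodings could decode the data")


# Bytes as they would appear in a GBK-encoded file
sample = "文本分类".encode("gbk")
text, used = decode_with_fallback(sample)
print(text, used)
```

The encoding name that worked can then be passed on to pandas, for example pd.read_csv(path, encoding=used).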

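2.2 Cleaning the data

Cleaning is usually done with the apply method in pandas. Below is a character-stripping function (modeled on the filtering in the Tokenizer source code). Any other cleaning steps can be added to the same function.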

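def data_detail(text: str) -> str:
    # Characters to strip (the filter set used by the Keras Tokenizer)
    filters = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
    for ch in filters:
        # str.replace returns a new string, so the result must be reassigned
        text = text.replace(ch, "")
    return text

# pf is the pandas DataFrame that holds the raw text in its "text" column
pf["text"] = pf["text"].apply(data_detail)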
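
The same stripping can be done in a single pass with a translation table. This is a sketch of an alternative, not the author's method; strip_filters is a hypothetical name, and the filter string is the one used in data_detail above.

```python
# Same filter characters as in data_detail above
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

# Map every filter character to None so that str.translate deletes it
_DELETE_TABLE = str.maketrans("", "", FILTERS)


def strip_filters(text: str) -> str:
    """Remove every character listed in FILTERS from text."""
    return text.translate(_DELETE_TABLE)


print(strip_filters("Hello, world! (test)"))
```

Because the table is built once, this avoids creating one intermediate string per filter character, which may matter on large text columns.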

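2.3 Text vectorization

Text data must be vectorized before training. Image data and other numeric data do not need this step and can go straight into the model. Vectorizing text means fitting a text tokenizer and then converting each text into its corresponding sequence of ids. This can be done with the built-in utilities, or by building your own vocabulary and turning the result into arrays with numpy, after which training can start.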

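# Import paths assume tensorflow.keras; adjust them to the Keras version in use
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Create the tokenizer
tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
# Fit it on the corpus
tokenizer.fit_on_texts(text_c)
# Convert the texts to id sequences.
# By default, words not seen during fitting get no id and are dropped. To keep them,
# pass oov_token when creating the Tokenizer; unseen words are then replaced by that token.
sequences = tokenizer.texts_to_sequences(texts)
# Pad to a uniform length. "post" pads at the end (the default pads at the front);
# the padding value can be set and defaults to 0.
# maxlen is the sequence length (MAX_LEN, 400 in the model below), not the vocabulary size.
data = pad_sequences(sequences, maxlen=MAX_LEN, padding="post")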
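
For the do-it-yourself route mentioned above, the following is a minimal pure-Python sketch of building a vocabulary and padding id sequences. The names build_vocab and texts_to_padded are hypothetical. Whitespace splitting is assumed (Chinese text would first need word segmentation or character-level splitting), and over-long sequences are cut at the end, which differs from the pad_sequences default of truncating at the front.

```python
from collections import Counter
from typing import Dict, List, Optional, Sequence


def build_vocab(texts: Sequence[str], max_words: Optional[int] = None) -> Dict[str, int]:
    """Assign ids starting at 1 to words by descending frequency; 0 is kept for padding."""
    counts = Counter(word for text in texts for word in text.split())
    most_common = counts.most_common(max_words)
    return {word: idx for idx, (word, _) in enumerate(most_common, start=1)}


def texts_to_padded(
    texts: Sequence[str], vocab: Dict[str, int], maxlen: int, pad_value: int = 0
) -> List[List[int]]:
    """Convert texts to id lists, drop unknown words, and pad at the end to maxlen."""
    rows = []
    for text in texts:
        ids = [vocab[w] for w in text.split() if w in vocab]
        ids = ids[:maxlen]  # truncate at the end
        rows.append(ids + [pad_value] * (maxlen - len(ids)))
    return rows


vocab = build_vocab(["a b a", "b c a"])
print(texts_to_padded(["a c d", "b"], vocab, maxlen=4))
```

The resulting lists can be turned into an array with np.array(rows, dtype="int32") before training.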

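2.4 One-hot encoding the labels

Labels also have to be encoded for training, so they are converted to one-hot vectors. There are two cases. First, if the labels are already numeric and do not need to be remapped, to_categorical can be applied directly (it expects non-negative integer class ids). Second, if the labels are not numeric, or must be consecutive and start from 0 or 1, fit a LabelEncoder() first, use it to convert the labels to integers, and then apply to_categorical. The fitted encoder can also map predicted ids back to the original labels. Usage is shown below: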

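# Note: this works the same way for multi-class labels; to_categorical handles them directly.
# Import path assumes tensorflow.keras; adjust it to the Keras version in use
from tensorflow.keras.utils import to_categorical

# Numeric labels:
# label = to_categorical(x)

# String labels, or labels that must be remapped to consecutive ids:
from sklearn import preprocessing
# Create the label encoder
label_coder = preprocessing.LabelEncoder()
# Fit it on the labels
label_coder.fit(labels)
# Convert the labels to integer ids
result_tr = label_coder.transform(labels)
# One-hot encode
labels = to_categorical(result_tr)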
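
To make the two steps concrete, here is a small pure-Python sketch of what the label encoder followed by to_categorical produces: the classes are sorted, numbered from 0, and each label becomes a row containing a single 1. The helper one_hot_labels is hypothetical and exists only for illustration; in practice the library calls above are the ones to use.

```python
from typing import Dict, List, Sequence, Tuple


def one_hot_labels(labels: Sequence[str]) -> Tuple[List[List[float]], Dict[str, int]]:
    """Return (one-hot rows, class-to-id mapping) for a list of string labels."""
    classes = sorted(set(labels))  # sorted order, as LabelEncoder does
    class_to_id = {c: i for i, c in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0.0] * len(classes)
        row[class_to_id[label]] = 1.0  # a single 1 at the class position
        rows.append(row)
    return rows, class_to_id


rows, mapping = one_hot_labels(["sports", "tech", "sports", "finance"])
print(mapping)
print(rows)
```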

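2.5 Splitting the data

Before training, the data needs to be split and, optionally, shuffled. Both steps can be done once the features and labels have been fully processed.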

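import random
import numpy as np
from sklearn.model_selection import train_test_split

# Split the data.
# The call returns four objects, in this order: training data, test data,
# training labels, test labels. random_state fixes the shuffle so that every
# run produces the same split and results remain comparable.
train_x, test_x, train_y, test_y = train_test_split(data, labels, random_state=42)

# Shuffle the data.
# Reorders the training set so that sample order does not influence the result.
# (train_test_split already shuffles by default, so this step is optional.)
def none_shuffle(x_1, y_1, state_random=42):
    _array = list(zip(x_1, y_1))
    random.seed(state_random)
    random.shuffle(_array)
    x_, y_ = zip(*_array)
    return np.array(x_, dtype="int32"), np.array(y_, dtype="float32")

train_x, train_y = none_shuffle(train_x, train_y)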
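
A variation on the shuffle helper above: shuffling a list of indices with a private random.Random instance keeps every sample paired with its label without reseeding the global generator. This is a sketch of that idea only; shuffle_together is a hypothetical name and it returns plain lists rather than numpy arrays.

```python
import random
from typing import List, Sequence, Tuple


def shuffle_together(xs: Sequence, ys: Sequence, seed: int = 42) -> Tuple[List, List]:
    """Shuffle xs and ys with the same permutation, reproducibly for a given seed."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    order = list(range(len(xs)))
    random.Random(seed).shuffle(order)  # private generator; global state is untouched
    return [xs[i] for i in order], [ys[i] for i in order]


features = list(range(10))
targets = [v * 10 for v in features]
shuffled_x, shuffled_y = shuffle_together(features, targets)
print(shuffled_x)
print(shuffled_y)
```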

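2.6 Building the model

With the data prepared, the next step is to build the model. There are two common ways to do it. The first uses Sequential and stacks layers with add. The second uses Model and states the inputs (via Input) and the outputs explicitly. The difference: a Sequential model is simple and can predict classes directly, but it extends poorly and cannot express multi-model or otherwise complex structures. A Model needs np.argmax to find the position of the largest output, but its layers are wired one at a time, so it extends well and can express complex stacked or combined models.

# Be careful how Keras is imported: objects imported from keras and from
# tensorflow.keras are not interchangeable, and mixing the two can raise
# errors about unrecognized layers.
# The import paths below assume tensorflow.keras; adjust them to your installation.
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, Flatten, Dropout, Dense

# Option 1: build the model with Sequential
# Attention and FM appear to be custom layers whose definitions are not included in this post.
# Embedding(input_dim, output_dim): input_dim is the vocabulary size.
model = Sequential()
model.add(Embedding(400, 256))
model.add(Bidirectional(LSTM(256, return_sequences=True)))
model.add(Attention())
model.add(FM(256))
model.add(Flatten())
model.add(Dropout(0.2))
model.add(Dense(len(label_coder.classes_), activation='softmax'))

# Option 2: build the model with Model
# id2char (the vocabulary) and df_train (the training DataFrame) are defined elsewhere.
MAX_LEN = 400
input_layer = Input(shape=(MAX_LEN,))
layer = Embedding(input_dim=len(id2char), output_dim=256)(input_layer)
layer = Bidirectional(LSTM(256, return_sequences=True))(layer)
layer = Flatten()(layer)
output_layer = Dense(len(df_train['label'].unique()), activation='softmax')(layer)
model = Model(inputs=input_layer, outputs=output_layer)
model.summary()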

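2.7 Training the model

Training requires deciding what to optimize and what to track: the loss function, the optimizer, the evaluation metrics, and so on.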

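# Choosing the loss function:
# categorical_crossentropy is for multi-class classification and pairs with a softmax activation
# binary_crossentropy is for binary classification and pairs with a sigmoid activation

# Compiling the model. A model must be compiled before it can be trained.
# The following settings control how training runs:
# optimizer: the optimizer, given either as a string or as an object from keras.optimizers.
# callbacks: passed to fit rather than compile; keras.callbacks provides ModelCheckpoint,
#            EarlyStopping, TensorBoard, ReduceLROnPlateau and others for finer control of training.
# metrics: the metrics reported during training. accuracy is available by name;
#          the other metrics used here have to be written by hand.
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy', recall_threshold(0.5), precision_threshold(0.5)])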

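# For a multi-input model, pass the inputs as one list, in the order the model declares them.
# Validation data can be supplied through validation_data. Examples follow.
# Single input
history = model.fit(x_train, y_train,
                    batch_size=256,
                    epochs=100,
                    validation_data=(x_test, y_test),
                    verbose=1,
                    callbacks=callbacks)
# Multiple inputs
# The three arrays correspond, in order, to the inputs argument of the model, which must
# also list three inputs. The other arguments are the same as for single-input training.
history = model.fit([x_train, wordnet_train, kg_train], y_train)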

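The custom metric functions follow: F1, recall, and precision. Note that the first function, although named F1_macro, computes the Matthews correlation coefficient, as its own comment indicates.

# K is the Keras backend module (use tensorflow.keras.backend if the project imports from tensorflow.keras)
from keras import backend as K

def F1_macro(y_true, y_pred):
    # Despite the name, this computes the Matthews correlation coefficient
    y_pred_pos = K.round(K.clip(y_pred, 0, 1))
    y_pred_neg = 1 - y_pred_pos
    y_pos = K.round(K.clip(y_true, 0, 1))
    y_neg = 1 - y_pos
    tp = K.sum(y_pos * y_pred_pos)
    tn = K.sum(y_neg * y_pred_neg)
    fp = K.sum(y_neg * y_pred_pos)
    fn = K.sum(y_pos * y_pred_neg)
    numerator = (tp * tn - fp * fn)
    denominator = K.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / (denominator + K.epsilon())
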
def recall_threshold(threshold = 0.5):
    def recall(y_true, y_pred):
        """Recall metric.
        Computes the recall over the whole batch using threshold_value.
        """
        threshold_value = threshold
        # Adaptation of the "round()" used before to get the predictions. Clipping to make sure that the predicted raw values are between 0 and 1.
        y_pred = K.cast(K.greater(K.clip(y_pred, 0, 1), threshold_value), K.floatx())
        # Compute the number of true positives. Rounding in prevention to make sure we have an integer.
        true_positives = K.round(K.sum(K.clip(y_true * y_pred, 0, 1)))
        # Compute the number of positive targets.
        possible_positives = K.sum(K.clip(y_true, 0, 1))
        recall_ratio = true_positives / (possible_positives + K.epsilon())
        return recall_ratio
    return recall

def precision_threshold(threshold=0.5):
    def precision(y_true, y_pred):
        """Precision metric.
        Computes the precision over the whole batch using threshold_value.
        """
        threshold_value = threshold
        # Adaptation of the "round()" used before to get the predictions. Clipping to make sure that the predicted raw values are between 0 and 1.
        y_pred = K.cast(K.greater(K.clip(y_pred, 0, 1), threshold_value), K.floatx())
        # Compute the number of true positives. Rounding in prevention to make sure we have an integer.
        true_positives = K.round(K.sum(K.clip(y_true * y_pred, 0, 1)))
        # count the predicted positives
        predicted_positives = K.sum(y_pred)
        # Get the precision ratio
        precision_ratio = true_positives / (predicted_positives + K.epsilon())
        return precision_ratio
    return precision
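
The Keras metrics above operate on tensors, batch by batch. The plain-Python sketch below shows the same arithmetic on flat lists of 0/1 targets and predicted probabilities, and adds the F1 score derived from precision and recall. The helper precision_recall_f1 is hypothetical; unlike the Keras versions, it returns 0 when a denominator is zero instead of adding an epsilon.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    y_true: Sequence[int], y_prob: Sequence[float], threshold: float = 0.5
) -> Tuple[float, float, float]:
    """Threshold the probabilities, then compute precision, recall and F1."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


print(precision_recall_f1([1, 0, 1, 1, 0], [0.9, 0.6, 0.4, 0.8, 0.1]))
```

In this example two positives are found, one negative is wrongly flagged, and one positive is missed, so precision, recall and F1 all come out at 2/3.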

The callbacks are defined as follows.

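import os
import time
# Import path assumes tensorflow.keras; adjust it to the Keras version in use
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau

# main_model_dir is the base output directory, defined elsewhere
model_dir = main_model_dir + time.strftime('%Y-%m-%d %H-%M-%S') + "/"
os.makedirs(model_dir, exist_ok=True)  # make sure the target directory exists
# Newer Keras versions name this metric val_accuracy instead of val_acc
model_file = model_dir + "{epoch:02d}-val_acc-{val_acc:.2f}-val_loss-{val_loss:.2f}.hdf5"
# Save the best model
checkpoint = ModelCheckpoint(
    model_file,
    monitor='val_acc',
    save_best_only=True)

# Stop early
early_stopping = EarlyStopping(
    monitor='val_loss',
    patience=5,
    verbose=1,
    restore_best_weights=True)

# Reduce the learning rate
reduce_lr = ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,
    patience=5,
    verbose=1)

callbacks = [checkpoint, reduce_lr, early_stopping]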
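
The patience logic behind EarlyStopping can be illustrated with a few lines of plain Python. This is a simplified sketch of the idea (no min_delta, no baseline), not the Keras implementation, and simulate_early_stopping is a hypothetical name.

```python
from typing import Optional, Sequence, Tuple


def simulate_early_stopping(
    val_losses: Sequence[float], patience: int
) -> Tuple[Optional[int], int]:
    """Return (stop_epoch, best_epoch), both 0-based.

    stop_epoch is None when the loss keeps improving often enough to finish training.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch


print(simulate_early_stopping([1.0, 0.8, 0.9, 0.85, 0.7], patience=2))
```

With restore_best_weights=True, the weights from the best epoch are the ones kept when training stops.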

2.8 Predicting

Prediction also comes in two forms, one for Sequential and one for Model. For a multi-input model, pass the test arrays together as a single list.

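# With Sequential
# A Sequential model can call predict_classes, which returns the predicted class ids directly.
# Note: predict_classes exists only in older Keras/TensorFlow releases; where it is
# missing, use the argmax approach shown for Model below.
y_pre = model.predict_classes(test_x)

# With Model
# predict returns an n-dimensional vector of scores for each sample; argmax gives the
# position of the largest score, which is the predicted class id.
predict = model.predict(x_test, verbose=1, batch_size=40)
sc = np.argmax(predict, axis=1)

# Map the class ids back to the original labels with the label encoder.
show = label_coder.inverse_transform(y_pre)
print(show[:5])

# Multiple inputs; converting the result works the same way as for a single input
scores = model.predict([x_test, wordnet_test, kg_test], verbose=1, batch_size=40)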
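
What np.argmax followed by inverse_transform does can be shown without any libraries. In this sketch decode_predictions is a hypothetical helper, and the classes argument plays the role of label_coder.classes_.

```python
from typing import List, Sequence


def decode_predictions(
    probabilities: Sequence[Sequence[float]], classes: Sequence[str]
) -> List[str]:
    """Pick the highest-scoring position in each row and map it to its class name."""
    decoded = []
    for row in probabilities:
        best_index = max(range(len(row)), key=row.__getitem__)  # argmax of the row
        decoded.append(classes[best_index])
    return decoded


class_names = ["finance", "sports", "tech"]
scores = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]]
print(decode_predictions(scores, class_names))
```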

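2.9 Analyzing the results

Analysis of classification results mainly covers evaluating the results, examining the training curves, and producing a classification report.

Evaluating the results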

# Evaluation
# Check which metrics the model reports, then pass in the test set to get the overall scores.
print(model.metrics_names)
# Single input
score, acc, recall, precision = model.evaluate(x_test, y_test, batch_size=128)
print("\nTest loss score: %.4f, accuracy: %.4f, recall: %.4f, precision: %.4f" % (score, acc, recall, precision))

# Multiple inputs
# As with training and prediction, pass the test arrays together as a single list.
# BATCH_SIZE is defined elsewhere.
score = model.evaluate([x_test, wordnet_test, kg_test], y_test, batch_size=BATCH_SIZE)
print("ACCURACY:", score[1])
print("LOSS:", score[0])

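Training-curve analysis
The training results are kept in the history object returned by fit, and the training curves can be plotted from it. For a detailed discussion of how to read the curves, see the guide referenced in the original post.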

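# List the keys recorded in history
print(history.history.keys())
# Plot the curves
# loss and val_loss should fall, which means the error is shrinking
# acc and val_acc should rise, which means the results are improving
# Other cases need to ... [sentence truncated in the source]
plot_performance(history=history)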

Plotting code

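import time
import matplotlib.pyplot as plt

def date_time(_mode):
    # Stand-in for a helper the original post calls but does not define;
    # assumed here to return a timestamp string for the plot titles.
    return time.strftime('%Y-%m-%d %H:%M:%S')

# Newer Keras versions record 'accuracy'/'val_accuracy' rather than 'acc'/'val_acc';
# check history.history.keys() and adjust the keys below if needed.
def plot_performance(history=None, figure_directory=None, ylim_pad=(0, 0)):
    xlabel = 'Epoch'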
    legends = ['Training', 'Validation']
    plt.figure(figsize=(20, 5))
    y1 = history.history['acc']
    y2 = history.history['val_acc']
    min_y = min(min(y1), min(y2))-ylim_pad[0]
    max_y = max(max(y1), max(y2))+ylim_pad[0]
    plt.subplot(121)
    plt.plot(y1)
    plt.plot(y2)
    plt.title('Model Accuracy\n'+date_time(1), fontsize=17)
    plt.xlabel(xlabel, fontsize=15)
    plt.ylabel('Accuracy', fontsize=15)
    plt.ylim(min_y, max_y)
    plt.legend(legends, loc='upper left')
    plt.grid()
    y1 = history.history['loss']
    y2 = history.history['val_loss']
    min_y = min(min(y1), min(y2))-ylim_pad[1]
    max_y = max(max(y1), max(y2))+ylim_pad[1]
    plt.subplot(122)
    plt.plot(y1)
    plt.plot(y2)
    plt.title('Model Loss\n'+date_time(1), fontsize=17)
    plt.xlabel(xlabel, fontsize=15)
    plt.ylabel('Loss', fontsize=15)
    plt.ylim(min_y, max_y)
    plt.legend(legends, loc='upper left')
    plt.grid()
    if figure_directory:
        plt.savefig(figure_directory+"/history")
    plt.show()

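Classification report
The evaluation above gives overall scores. The classification report shows precision and other details for each individual class.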

from sklearn.metrics import f1_score, accuracy_score, confusion_matrix, classification_report, recall_score, precision_score
# x_ture_t holds the true class ids of the test set; sc holds the predicted ids from np.argmax
re_sult = classification_report(x_ture_t, sc, zero_division=1)
print(re_sult)

# average selects how the per-class scores are combined (micro, macro, binary, and so on); see the source for details
print(f"recall_score:{recall_score(x_ture_t, sc, average='macro')}")
print(f"f1_score:{f1_score(x_ture_t, sc, average='macro')}")
print(f"accuracy_score:{accuracy_score(x_ture_t, sc)}")
print(f"precision_score:{precision_score(x_ture_t, sc, average='macro')}")
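
The difference between the macro and micro settings of average can be seen in a small pure-Python sketch for precision. The helper macro_micro_precision is hypothetical and only illustrates the arithmetic: macro averages the per-class precisions with equal weight, while micro pools all predictions before dividing, which for single-label classification equals accuracy.

```python
from typing import Sequence, Tuple


def macro_micro_precision(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (macro precision, micro precision) for single-label class ids."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    total_tp = 0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if p == c and t == c)
        predicted = sum(1 for p in y_pred if p == c)
        # A class that is never predicted contributes 0 to the macro average
        per_class.append(tp / predicted if predicted else 0.0)
        total_tp += tp
    macro = sum(per_class) / len(per_class)
    micro = total_tp / len(y_pred)
    return macro, micro


print(macro_micro_precision([0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 1, 0]))
```

Here class 0 has precision 3/4 and class 1 has precision 1/2, so the macro value is 0.625, while the micro value is 4 correct out of 6 predictions.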

2.10 Saving the model

A model can be saved in two ways: the full model, or the weights only.

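# Full model: architecture, weights, and compile settings
model.save("xxx.h5")
# Weights only; use a different file name if both are saved
model.save_weights("xxx.h5")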

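3. Summary

The main purpose of these notes is easy reference and reinforcing what was learned. The methods are all general-purpose, and keeping them in electronic form makes them easy to look up and reuse directly later.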
