Text Classification with the fastText Model

Reposted on 2020-08-16 20:46. Tags: natural language processing, text classification.

Source: https://mp.weixin.qq.com/s/m01J5Mi25txyRkKo7_BAuw

1. Data and Background

https://tianchi.aliyun.com/competition/entrance/531810/information (Alibaba Tianchi: introductory NLP competition for beginners)

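2. Anatomy of the fastText Model

2.1 Concept

fastText is a lightweight neural method for learning word vectors and classifying text. Its core idea is to average the vectors of all the words and n-grams in a document to obtain a single document vector, and then to feed that vector to a softmax classifier for multi-class prediction. Two techniques are involved: character-level n-gram features and hierarchical softmax.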

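2.2 Model Architecture

The fastText architecture closely resembles the CBOW architecture of word2vec.

[Figure: fastText model architecture]

Note that the architecture diagram does not show how the word vectors are trained. Like CBOW, fastText has only three layers: an input layer, a hidden layer, and an output layer (hierarchical softmax). In both models the input is a set of words represented as vectors, the output is a single target, and the hidden layer averages the input vectors.

The differences are as follows. CBOW takes the context of a target word as input, whereas fastText takes the words of a document together with their n-gram features, which jointly represent that document. CBOW's input words are one-hot encoded, whereas fastText's input features are embedded. CBOW outputs the target word, whereas fastText outputs the class label of the document.

Two details are worth noting. On the input side, fastText adds the character-level n-gram vectors of each word as extra features. On the output side, it uses hierarchical softmax, which greatly reduces training time.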
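To make the average-then-softmax idea concrete, here is a minimal pure-Python sketch of the forward pass: look up an embedding for each input feature, average the embeddings into a document vector, then apply a linear layer followed by softmax. The toy embedding table, the weights, and the function names are invented for illustration. This is not fastText's actual implementation, which uses hierarchical softmax and hashed n-gram features.

```python
import math

def average_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(feature_ids, embeddings, weights):
    """Average the feature embeddings, score each class, and normalize."""
    doc_vec = average_vectors([embeddings[i] for i in feature_ids])
    scores = [sum(w * x for w, x in zip(row, doc_vec)) for row in weights]
    return softmax(scores)

# Hypothetical toy parameters: 4 features, 2-dimensional embeddings, 3 classes.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

probs = forward([0, 1, 2], embeddings, weights)
print(probs)  # three class probabilities that sum to 1
```

Because the hidden layer is a plain average, the order of the input features does not affect the result; any word-order information has to come from the n-gram features.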

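2.3 Character-level n-grams

word2vec treats every word in the corpus as an atomic unit and learns one vector per word. This ignores the internal morphology of words. Take "apple" and "apples", or "達觀數據" and "達觀": in both pairs the two words share most of their characters, so their internal forms are similar, but classic word2vec loses this information because each word is mapped to a different id.

To overcome this, fastText represents a word by its character-level n-grams. For the word "apple" with n = 3, the trigrams are:

<ap, app, ppl, ple, le>

Here < marks the start of the word and > marks its end. These trigrams can be used to represent the word "apple"; going one step further, the vector of "apple" can be formed by summing the vectors of these 5 trigrams.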

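This brings two benefits:

Vectors for low-frequency words improve, because their n-grams are shared with other words.
Vectors can still be built for words outside the training vocabulary, by summing their character-level n-gram vectors.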
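The n-gram extraction described above can be sketched in a few lines of Python. The function name `char_ngrams` is made up for this example. Real fastText extracts n-grams over a range of lengths and hashes them into buckets instead of keeping the strings, so treat this only as an illustration of the idea.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of `word` wrapped in '<' and '>' boundary markers."""
    wrapped = "<" + word + ">"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

print(char_ngrams("apple"))
# ['<ap', 'app', 'ppl', 'ple', 'le>']

# "apple" and "apples" share most of their trigrams, so their vectors share most components.
shared = set(char_ngrams("apple")) & set(char_ngrams("apples"))
print(len(shared))
# 4
```

An unseen word such as "apples" would still get a usable vector as long as most of its n-grams were seen during training.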

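2.4 Hierarchical softmax

The structure of fastText:

The tokenized text is arranged as a sequence and used as input.
A lookup table maps the input to the desired hidden-layer dimension.
The hidden layer is followed by a Huffman tree. This tree is the key to how hierarchical softmax reduces computation.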
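As a rough illustration of why a Huffman tree reduces the work, the sketch below computes Huffman code lengths from class frequencies with Python's heapq. The code length of a class is the number of binary decisions needed to reach its leaf, so frequent classes are reached in fewer steps, whereas a flat softmax must score every class. The class counts and the function name are invented for this example; fastText's own tree construction is not shown.

```python
import heapq
from itertools import count

def huffman_code_lengths(freqs):
    """Map each label to the depth of its leaf in a Huffman tree built from `freqs`."""
    tiebreak = count()  # keeps heap entries comparable when frequencies are equal
    heap = [(f, next(tiebreak), [label]) for label, f in freqs.items()]
    heapq.heapify(heap)
    depths = {label: 0 for label in freqs}
    while len(heap) > 1:
        f1, _, labels1 = heapq.heappop(heap)
        f2, _, labels2 = heapq.heappop(heap)
        # Merging two subtrees pushes every leaf in them one level deeper.
        for label in labels1 + labels2:
            depths[label] += 1
        heapq.heappush(heap, (f1 + f2, next(tiebreak), labels1 + labels2))
    return depths

# Hypothetical class counts for a skewed label distribution.
class_counts = {"tech": 40, "stock": 30, "sports": 15, "games": 10, "home": 5}
depths = huffman_code_lengths(class_counts)
print(depths)
# {'tech': 1, 'stock': 2, 'sports': 3, 'games': 4, 'home': 4}
```

With only a handful of classes the saving is small; the benefit grows with the number of output classes, since the expected path length scales roughly with the logarithm of the class count.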

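3. A Simple fastText Implementation

To simplify the task:

Word vectors are trained in the plain word2vec manner, whereas real fastText also adds character-level n-grams as input features;
The output layer uses a plain softmax classifier, whereas real fastText uses hierarchical softmax.

First, define a few constants: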

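VOCAB_SIZE = 2000
EMBEDDING_DIM = 100
MAX_WORDS = 500
CLASS_NUM = 5

VOCAB_SIZE is the vocabulary size, set here simply to 2000;

EMBEDDING_DIM is the dimension of the distributed vector that the embedding layer outputs for each word, set here to 100. For example, the word "達觀" would be represented by a real-valued vector of length 100 such as [ 0.97860014, 5.93589592, 0.22342691, -3.83102846, -0.23053935, …];

MAX_WORDS is the maximum number of words used per document. Documents differ in length (word count), so to feed them into a neural network of fixed dimension we set a maximum word count, and documents with fewer words than this threshold are padded with an "unknown word". For instance, index 0 in the vocabulary can be reserved for the "unknown word", and 0 is used to fill the missing positions;

CLASS_NUM is the number of classes. This is a multi-class problem, and it is set here simply to 5.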

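The model is built in the following steps:

Add the input layer (embedding layer). The input to the Embedding layer is a batch of documents, each consisting of a sequence of vocabulary indices. For example, [10, 30, 80, 1000] might represent the short text "我昨天來到達觀數據" ("I came to 達觀數據 yesterday"), where "我", "昨天", "來到", and "達觀數據" have vocabulary indices 10, 30, 80, and 1000. The Embedding layer maps each word to an EMBEDDING_DIM-dimensional vector, so input_shape=(BATCH_SIZE, MAX_WORDS) and output_shape=(BATCH_SIZE, MAX_WORDS, EMBEDDING_DIM);
Add the hidden layer (projection layer). The projection layer averages the vectors of all words in a document. The GlobalAveragePooling1D class provided by keras does this. Its input_shape is the output_shape of the Embedding layer, and its output_shape=(BATCH_SIZE, EMBEDDING_DIM);
Add the output layer (softmax layer). In real fastText this layer is hierarchical softmax; keras has no native support for hierarchical softmax, so plain softmax is used instead. This layer is given CLASS_NUM; for each document it produces CLASS_NUM probabilities, one per class, each indicating how likely the document is to belong to that class. Its output_shape=(BATCH_SIZE, CLASS_NUM);
Specify the loss function, optimizer, and evaluation metric, then compile the model. The loss is categorical_crossentropy, the standard loss for softmax regression; the optimizer is SGD, stochastic gradient descent; the metric is accuracy.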

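When feeding training data to the model, you need to:

Tokenize the documents and build a vocabulary. Each word in the vocabulary is replaced by an integer index, with one index reserved for the "unknown word", say 0;
One-hot encode the class labels. Suppose the text data has 3 classes with labels 1, 2, and 3; the corresponding one-hot vectors are [1, 0, 0], [0, 1, 0], and [0, 0, 1];
For a batch of texts, convert each text to a sequence of word indices and each label to a one-hot vector. In the earlier example, "我昨天來到達觀數據" might be converted to [10, 30, 80, 1000]; it belongs to class 1, so its label is [1, 0, 0]. Because MAX_WORDS=500, this short text vector must be padded with 496 zeros, giving [10, 30, 80, 1000, 0, 0, 0, …, 0]. Consequently batch_xs has shape (BATCH_SIZE, MAX_WORDS) and batch_ys has shape (BATCH_SIZE, CLASS_NUM).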
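The padding and one-hot steps above can be sketched in plain Python as follows. The helper names `pad_sequence` and `one_hot` are hypothetical, and truncating documents longer than MAX_WORDS is an assumption added here; the text above only describes padding.

```python
MAX_WORDS = 500   # same value as in the text
PAD_INDEX = 0     # index reserved for the "unknown word"

def pad_sequence(indices, max_words=MAX_WORDS, pad_index=PAD_INDEX):
    """Truncate to max_words (an assumption), then right-pad with pad_index."""
    clipped = indices[:max_words]
    return clipped + [pad_index] * (max_words - len(clipped))

def one_hot(label, labels):
    """One-hot vector for `label`, with positions ordered as in `labels`."""
    return [1 if label == candidate else 0 for candidate in labels]

doc = pad_sequence([10, 30, 80, 1000])
print(len(doc), doc[:6], doc.count(0))
# 500 [10, 30, 80, 1000, 0, 0] 496
print(one_hot(1, [1, 2, 3]))
# [1, 0, 0]
```

Stacking the padded rows gives batch_xs with shape (BATCH_SIZE, MAX_WORDS), and stacking the one-hot rows gives batch_ys.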

The code is as follows:

# coding: utf-8
from __future__ import unicode_literals

from keras.models import Sequential
from keras.layers import Embedding
from keras.layers import GlobalAveragePooling1D
from keras.layers import Dense

VOCAB_SIZE = 2000
EMBEDDING_DIM = 100
MAX_WORDS = 500
CLASS_NUM = 5


def build_fastText():
    model = Sequential()
    # Map each of the VOCAB_SIZE word indices to an EMBEDDING_DIM-dimensional vector
    model.add(Embedding(VOCAB_SIZE, EMBEDDING_DIM, input_length=MAX_WORDS))
    # Average the embeddings of all words in the document
    model.add(GlobalAveragePooling1D())
    # Softmax classification
    model.add(Dense(CLASS_NUM, activation='softmax'))
    # Define the loss function, optimizer, and classification metric
    model.compile(loss='categorical_crossentropy', optimizer='SGD', metrics=['accuracy'])
    return model

if __name__ == '__main__':
    model = build_fastText()
    print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 500, 100)          200000    
_________________________________________________________________
global_average_pooling1d (Gl (None, 100)               0         
_________________________________________________________________
dense (Dense)                (None, 5)                 505       
=================================================================
Total params: 200,505
Trainable params: 200,505
Non-trainable params: 0
_________________________________________________________________
None

4. Text Classification with fastText

4.1 Loading libraries

import time
import numpy as np
import fasttext
import pandas as pd

from sklearn.metrics import f1_score
from sklearn.utils import shuffle
from sklearn.model_selection import StratifiedKFold

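4.2 fastText classification

Main hyperparameters:

lr: learning rate
dim: dimension of the word vectors
epoch: number of training epochs (passes over the training data)
wordNgrams: maximum length of word n-grams, typically set to 2 or 3
loss: loss function, one of ns (negative sampling), hs (hierarchical softmax), softmax, or ova (one-vs-all)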
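To make wordNgrams concrete: with wordNgrams=2 the model sees word bigrams in addition to single words, which gives it some local word-order information. The sketch below simply lists the contiguous word n-grams of a token sequence. The function name is hypothetical, and real fastText hashes these n-grams into a fixed number of buckets instead of storing them as strings.

```python
def word_ngrams(tokens, max_n=2):
    """All contiguous word n-grams of length 1..max_n, ordered by n and then by position."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = ["10", "30", "80", "1000"]
print(word_ngrams(tokens, max_n=2))
# ['10', '30', '80', '1000', '10 30', '30 80', '80 1000']
```

Raising the maximum length adds more features per document, which increases model capacity and memory use along with the risk of overfitting.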

def fasttext_model(nrows, train_num, lr=1.0, wordNgrams=2, minCount=1, epoch=25, loss='hs', dim=100):
    start_time = time.time()

    # Load the data; it is converted below to the format fastText expects
    train_df = pd.read_csv('/content/drive/My Drive/nlpdata/news/train_set.csv', sep='\t', nrows=nrows)

    # shuffle
    train_df = shuffle(train_df, random_state=666)

    train_df['label_ft'] = '__label__' + train_df['label'].astype('str')
    train_df[['text', 'label_ft']].iloc[:train_num].to_csv('/content/drive/My Drive/nlpdata/news/fastText_train.csv', index=None, header=None, sep='\t')

    model = fasttext.train_supervised('/content/drive/My Drive/nlpdata/news/fastText_train.csv', lr=lr, wordNgrams=wordNgrams, verbose=2, 
                                      minCount=minCount, epoch=epoch, loss=loss, dim=dim)

    train_pred = [model.predict(x)[0][0].split('__')[-1] for x in train_df.iloc[:train_num]['text']]
    print('Train f1_score:', f1_score(train_df['label'].values[:train_num].astype(str), train_pred, average='macro'))
    val_pred = [model.predict(x)[0][0].split('__')[-1] for x in train_df.iloc[train_num:]['text']]
    print('Val f1_score:', f1_score(train_df['label'].values[train_num:].astype(str), val_pred, average='macro'))
    train_time = time.time()
    print('Train time: {:.2f}s'.format(train_time - start_time))

    # Predict on the test set and save
    test_df = pd.read_csv('/content/drive/My Drive/nlpdata/news/test_a.csv')

    test_pred = [model.predict(x)[0][0].split('__')[-1] for x in test_df['text']]
    test_pred = pd.DataFrame(test_pred, columns=['label'])
    test_pred.to_csv('/content/drive/My Drive/nlpdata/news/test_fastText_ridgeclassifier.csv', index=False)
    print('Test predict saved.')
    end_time = time.time()
    print('Predict time:{:.2f}s'.format(end_time - train_time))


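if __name__ == '__main__':
    nrows = 200000
    train_num = int(nrows * 0.7)
    # NOTE: the settings below are defined but never passed to fasttext_model,
    # so the call uses the function defaults (lr=1.0, wordNgrams=2, minCount=1,
    # epoch=25, loss='hs', dim=100).
    lr = 0.01
    wordNgrams = 2
    minCount = 1
    epoch = 25
    loss = 'hs'

    fasttext_model(nrows, train_num)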

Train f1_score: 0.998663548149514
Val f1_score: 0.911468448971427
Train time: 257.32s
Test predict saved.
Predict time:13.40s
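The label handling inside fasttext_model can be isolated as a small sketch. fastText's supervised mode recognizes labels by the `__label__` prefix, and the function above recovers the raw class id by splitting the predicted token on `__`. The helper names below are hypothetical and are not part of the fasttext package.

```python
LABEL_PREFIX = "__label__"

def to_fasttext_label(label):
    """Turn a raw class id such as 3 into the token '__label__3'."""
    return LABEL_PREFIX + str(label)

def from_fasttext_label(token):
    """Recover the raw class id string from a token such as '__label__3'."""
    return token.split("__")[-1]

def to_training_line(text, label):
    """One training example: the text and its label token separated by a tab."""
    return text + "\t" + to_fasttext_label(label)

print(to_training_line("10 30 80 1000", 3))
print(from_fasttext_label("__label__3"))
# 3
```

Splitting on `__` works here because the class ids are plain integers; a label that itself contained `__` would be cut short, in which case stripping the prefix by length would be the safer choice.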

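4.3 K-fold cross-validation

fastText has several model parameters that must be chosen, and they affect model accuracy to some degree. How should they be chosen? There are two approaches:

Read the documentation to understand what each parameter means and which ones increase model complexity;
Measure accuracy on a validation set to determine whether the model is overfitting or underfitting.

Here we take the second approach and tune the parameters with K-fold cross-validation. Note: each fold must be split so that its label distribution matches that of the whole dataset.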
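To show what matching the label distribution means in practice, here is a hand-rolled pure-Python sketch that deals the samples of each class round-robin across the folds. It is only an illustration of the stratification idea and not scikit-learn's StratifiedKFold, which the code in this article uses and which also supports shuffling. The label list and the function name are made up for the example.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, n_splits):
    """Assign each sample index to a fold so every fold gets a near-equal share of each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        # Deal the samples of one class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

# Hypothetical imbalanced labels: 6 samples of class 0 and 3 of class 1.
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1]
folds = stratified_folds(labels, n_splits=3)
for fold in folds:
    print(sorted(fold), Counter(labels[i] for i in fold))
```

Each fold ends up with two samples of class 0 and one of class 1, the same 2:1 ratio as the full label list. A plain random split gives no such guarantee, which matters most for rare classes.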

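# Fragment: train_df, n_splits, start_time, and the hyperparameters are defined in the full code below
models = []
scores = []
pred_list = []

# K-fold cross-validation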
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=666)
for train_index, test_index in skf.split(train_df['text'], train_df['label_ft']):

    train_df[['text', 'label_ft']].iloc[train_index].to_csv('/content/drive/My Drive/nlpdata/news/fastText_train.csv', index=None, header=None, sep='\t')

    model = fasttext.train_supervised('/content/drive/My Drive/nlpdata/news/fastText_train.csv', lr=lr, wordNgrams=wordNgrams, verbose=2, 
                                          minCount=minCount, epoch=epoch, loss=loss)
    models.append(model)

    val_pred = [model.predict(x)[0][0].split('__')[-1] for x in train_df.iloc[test_index]['text']]
    score = f1_score(train_df['label'].values[test_index].astype(str), val_pred, average='macro')
    print('score', score)
    scores.append(score)

print('mean score: ', np.mean(scores))
train_time = time.time()
print('Train time: {:.2f}s'.format(train_time - start_time))

Full code

def fasttext_kfold_model(nrows, train_num, n_splits, lr=1.0, wordNgrams=2, minCount=1, epoch=25, loss='hs', dim=100):
    start_time = time.time()

    # Load the data; it is converted below to the format fastText expects
    train_df = pd.read_csv('/content/drive/My Drive/nlpdata/news/train_set.csv', sep='\t', nrows=nrows)

    # shuffle
    train_df = shuffle(train_df, random_state=666)

    train_df['label_ft'] = '__label__' + train_df['label'].astype('str')

    models = []
    train_scores = []
    val_scores = []

    # K-fold cross-validation
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=666)
    for train_index, test_index in skf.split(train_df['text'], train_df['label_ft']):
        train_df[['text', 'label_ft']].iloc[train_index].to_csv('/content/drive/My Drive/nlpdata/news/fastText_train.csv', index=None, header=None, sep='\t')

        # NOTE: dim is accepted by this function but not forwarded here, so fastText's default dimension is used
        model = fasttext.train_supervised('/content/drive/My Drive/nlpdata/news/fastText_train.csv', lr=lr, wordNgrams=wordNgrams, verbose=2, 
                                          minCount=minCount, epoch=epoch, loss=loss)
        models.append(model)

        train_pred = [model.predict(x)[0][0].split('__')[-1] for x in train_df.iloc[train_index]['text']]
        train_score = f1_score(train_df['label'].values[train_index].astype(str), train_pred, average='macro')
        # print('Train length: ', len(train_pred))
        print('Train score: ', train_score)
        train_scores.append(train_score)

        val_pred = [model.predict(x)[0][0].split('__')[-1] for x in train_df.iloc[test_index]['text']]
        val_score = f1_score(train_df['label'].values[test_index].astype(str), val_pred, average='macro')
        # print('Val length: ', len(val_pred))
        print('Val score', val_score)
        val_scores.append(val_score)

    print('mean train score: ', np.mean(train_scores))
    print('mean val score: ', np.mean(val_scores))
    train_time = time.time()
    print('Train time: {:.2f}s'.format(train_time - start_time))

    return models

def fasttext_kfold_predict(models, n_splits):

    pred_list = []

    start_time = time.time()
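    # Predict on the test set and save
    test_df = pd.read_csv('/content/drive/My Drive/nlpdata/news/test_a.csv')

    # This loop is slow: every model predicts the whole test set
    for model in models:
        test_pred = [model.predict(x)[0][0].split('__')[-1] for x in test_df['text']]
        pred_list.append(test_pred)

    # Majority vote across the fold models. Predicted labels are strings, so cast them to int for np.bincount.
    test_pred_label = pd.DataFrame(pred_list).T.apply(lambda row: np.argmax(np.bincount([int(row[i]) for i in range(n_splits)])), axis=1)
    # The result is a Series, so set its name to get a 'label' header in the CSV
    test_pred_label.name = 'label'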

    test_pred_label.to_csv('/content/drive/My Drive/nlpdata/news/test_fastText_ridgeclassifier.csv', index=False)
    print('Test predict saved.')
    end_time = time.time()
    print('Predict time:{:.2f}s'.format(end_time - start_time))


if __name__ == '__main__':
    nrows = 200000
    train_num = int(nrows * 0.7)
    n_splits = 3
    lr = 0.1
    wordNgrams = 2
    minCount = 1
    epoch = 25
    loss = 'hs'
    dim = 200

    """
    Train score:  0.9635013320936988
    Val score 0.9086640111428032
    Train score:  0.9623510782430645
    Val score 0.9094998879044359
    Train score:  0.9628121318772955
    Val score 0.9096191534698315
    mean train score:  0.9628881807380196
    mean val score:  0.9092610175056901
    Train time: 740.60s
    """

    models = fasttext_kfold_model(nrows, train_num, n_splits, lr=lr, wordNgrams=wordNgrams, minCount=minCount, epoch=epoch, loss=loss, dim=dim)
    fasttext_kfold_predict(models, n_splits=n_splits)
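The voting step in fasttext_kfold_predict can also be sketched without pandas or NumPy. The helper below is hypothetical. It takes one list of predicted labels per fold model and returns the most frequent label for each test document, breaking ties in favor of the smallest label, which is how np.argmax over np.bincount behaves. The prediction lists are invented for the example.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model label lists into one list by majority vote per sample.

    Ties go to the smallest label, mirroring np.argmax(np.bincount(...)).
    """
    voted = []
    for sample_preds in zip(*predictions_per_model):
        counts = Counter(int(p) for p in sample_preds)
        best = max(counts.values())
        voted.append(min(label for label, c in counts.items() if c == best))
    return voted

# Hypothetical predictions from 3 fold models for 4 test documents.
preds = [["0", "2", "1", "3"],
         ["0", "1", "1", "4"],
         ["5", "2", "1", "5"]]
print(majority_vote(preds))
# [0, 2, 1, 3]
```

The last document shows the tie case: all three models disagree, so the smallest label wins. With an odd number of folds and few classes such ties are rare but still possible.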
