Training and Using fastText

Reposted article, 2021-02-22 12:56

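fastText is a text classifier open-sourced by Facebook AI Research in 2016. Its defining trait is speed: compared with other text classification models such as SVM, logistic regression, and neural networks, fastText greatly shortens training time while maintaining classification quality.
fastText focuses on text classification and performs very well on many standard problems.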

Training fastText

 
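    import fasttext

    # One example per line: "__label__<class> token token ..."
    trainDataFile = 'train.txt'

    # Keyword names follow the camelCase command-line options
    # listed at the end of this article.
    classifier = fasttext.train_supervised(
        input=trainDataFile,
        label='__label__',   # prefix that marks a label
        dim=256,             # size of the word vectors
        epoch=50,
        lr=1,
        lrUpdateRate=50,
        minCount=3,          # drop words seen fewer than 3 times
        loss='softmax',
        wordNgrams=2,        # word unigrams and bigrams
        bucket=1000000)      # number of hash buckets for n-grams

    classifier.save_model('Model.bin')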

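Two settings deserve particular attention when training fastText: the word n-gram length (word_ngrams, the -wordNgrams option) and the loss function (loss). These two are the essence of fastText and are discussed further below.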

Text must be segmented into words before it is used to train fastText. The word n-grams are built over the tokens that the segmentation produces.
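As a rough illustration of this preprocessing step, the sketch below formats one already segmented example as a training line and lists the word bigrams over its tokens. The helper names `to_fasttext_line` and `word_ngrams` are hypothetical, the token list stands in for the output of whatever segmenter is used, and fastText itself hashes n-grams internally rather than building such a list.

```python
def to_fasttext_line(label, tokens, prefix="__label__"):
    """Format one pre-segmented example as a supervised training line."""
    return prefix + label + " " + " ".join(tokens)


def word_ngrams(tokens, n=2):
    """List the contiguous word n-grams over the segmented tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Tokens as a word segmenter might produce them (illustrative only).
tokens = ["我", "喜欢", "自然", "语言", "处理"]

line = to_fasttext_line("tech", tokens)
bigrams = word_ngrams(tokens, 2)

print(line)
print(bigrams)
```

Because the n-grams are formed over tokens, a different segmentation of the same sentence produces different bigram features.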

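In fact, training the text classifier yields word vectors as a by-product, much like word2vec. The fastText classifier is very similar to CBOW: it combines the word vectors of a text into one vector (by averaging them) and passes the result through a single fully connected layer to classify it.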
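The structure described here can be sketched in a few lines of plain Python. This is only a toy forward pass under made-up numbers: the embeddings, the weight matrix, and the function names `softmax` and `classify` are assumptions for illustration, not fastText's internals.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(tokens, embeddings, weights):
    """Average the token vectors, apply one linear layer, return class probabilities."""
    dim = len(next(iter(embeddings.values())))
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if vectors:
        hidden = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    else:
        hidden = [0.0] * dim
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
    return softmax(scores)


# Toy 2-dimensional embeddings and a 2-class weight matrix (made-up values).
embeddings = {"good": [1.0, 0.0], "bad": [0.0, 1.0], "movie": [0.5, 0.5]}
weights = [[2.0, -2.0],   # class 0, e.g. "positive"
           [-2.0, 2.0]]   # class 1, e.g. "negative"

probs = classify(["good", "movie"], embeddings, weights)
print(probs)
```

During training both the embeddings and the weight matrix are learned, which is why word vectors fall out as a by-product of classification.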

Prediction with fastText

Using a trained fastText model is very simple. The code below loads the saved model and evaluates it on a test file.

 
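    import fasttext

    testDataFile = 'test.txt'

    classifier = fasttext.load_model('Model.bin')

    # test() returns (number of examples, precision@1, recall@1)
    result = classifier.test(testDataFile)

    print('Number of test examples:', result[0])
    print('Precision on the test set:', result[1])
    print('Recall on the test set:', result[2])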
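Beyond evaluating on a file, a single text can be classified with the model's `predict` method. The sketch below assumes the official fasttext Python package, where `predict(text, k=1)` returns a pair of label and probability sequences. `top_label` and `f1_score` are hypothetical helper names introduced here.

```python
def top_label(model, text, prefix="__label__"):
    """Return (label, probability) of the best prediction for one text."""
    # predict() handles one line at a time, so newlines are replaced first.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    label = labels[0]
    if label.startswith(prefix):
        label = label[len(prefix):]
    return label, float(probs[0])


def f1_score(precision, recall):
    """Combine the precision and recall returned by test() into an F1 score."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Typical use (requires the fasttext package and a trained Model.bin):
#   import fasttext
#   model = fasttext.load_model('Model.bin')
#   print(top_label(model, 'some pre-segmented text'))
#   n, p, r = model.test('test.txt')
#   print(f1_score(p, r))
```

The text passed to `predict` should be segmented the same way as the training data.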

Bag of tricks for efficient text classification

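1) Hierarchical softmax: when the number of classes is large, fastText can replace the plain softmax with cross-entropy by a hierarchical softmax (the hs value of the loss option), which greatly speeds up both training and prediction.

2) n-grams: fastText can represent a word by its character-level n-grams (controlled by the minn and maxn options, which are separate from the word-level word_ngrams setting used in training above). For the word "apple" with n = 3, its trigrams are: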

“<ap”, “app”, “ppl”, “ple”, “le>”

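Here < marks the beginning of the word and > marks its end. These trigrams can be used to represent the word "apple"; going one step further, the sum of the vectors of these 5 trigrams can serve as the word vector of "apple".

This brings two benefits:

1. Word vectors for low-frequency words are better, because their n-grams are shared with other words.

2. Word vectors can still be built for words outside the training vocabulary, by summing the vectors of their character-level n-grams.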
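The character n-gram idea above can be sketched as follows. The functions `char_ngrams` and `word_vector` and the toy vectors are assumptions for illustration; the real library hashes n-grams into buckets instead of keeping a dictionary keyed by n-gram.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of a word wrapped in < and > markers."""
    wrapped = "<" + word + ">"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]


def word_vector(word, ngram_vectors, dim, n=3):
    """Sum the vectors of the word's n-grams; unknown n-grams contribute zero."""
    vec = [0.0] * dim
    for gram in char_ngrams(word, n):
        for i, x in enumerate(ngram_vectors.get(gram, [0.0] * dim)):
            vec[i] += x
    return vec


# Toy n-gram vectors, as if learned from words such as "apple" (made-up values).
ngram_vectors = {"app": [1.0, 0.0], "ple": [0.0, 2.0]}

trigrams = char_ngrams("apple", 3)
apple_vec = word_vector("apple", ngram_vectors, dim=2)
apply_vec = word_vector("apply", ngram_vectors, dim=2)  # not in any vocabulary

print(trigrams)
print(apple_vec, apply_vec)
```

The unseen word "apply" still receives a non-zero vector because it shares the trigram "app" with "apple".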

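Why fastText runs fast

1) Multi-threaded training: fastText trains with multiple threads. The training threads update the parameters without taking a lock. This adds some noise to the updates, but in practice it does not hurt the final result. Neither Google's word2vec implementation nor the fastText library uses locks. The default number of threads is 12, and it can be set manually.

2) Hierarchical softmax: computing the output with a hierarchical softmax instead of a full softmax greatly improves efficiency.

All fastText options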
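To see where the saving from hierarchical softmax comes from, the sketch below builds a Huffman tree over class frequencies and reports how deep each class sits. Scoring a class then needs one binary decision per tree level instead of one score per class. The function name `huffman_depths` and the label counts are made up for illustration, and this is a simplified stand-in, not fastText's own code.

```python
import heapq
from itertools import count


def huffman_depths(freqs):
    """Return {label: depth} for a Huffman tree built from label counts."""
    tie = count()  # tie-breaker so heapq never has to compare the dicts
    heap = [(f, next(tie), {label: 0}) for label, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees; every label inside
        # them moves one level further from the root.
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {label: depth + 1 for label, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


# Made-up label counts from a training set.
label_counts = {"sports": 50, "finance": 30, "tech": 15, "travel": 5}
depths = huffman_depths(label_counts)
print(depths)
```

Frequent classes end up near the root, so the most common predictions are also the cheapest to compute.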

 
    The following arguments are mandatory:
      -input              training file path
      -output             output file path

    The following arguments are optional:
      -verbose            verbosity level [2]

    The following arguments for the dictionary are optional:
      -minCount           minimal number of word occurrences [1]
      -minCountLabel      minimal number of label occurrences [0]
      -wordNgrams         max length of word ngram [1]
      -bucket             number of buckets [2000000]
      -minn               min length of char ngram [0]
      -maxn               max length of char ngram [0]
      -t                  sampling threshold [0.0001]
      -label              labels prefix [__label__]

    The following arguments for training are optional:
      -lr                 learning rate [0.1]
      -lrUpdateRate       change the rate of updates for the learning rate [100]
      -dim                size of word vectors [100]
      -ws                 size of the context window [5]
      -epoch              number of epochs [5]
      -neg                number of negatives sampled [5]
      -loss               loss function {ns, hs, softmax} [softmax]
      -thread             number of threads [12]
      -pretrainedVectors  pretrained word vectors for supervised learning []
      -saveOutput         whether output params should be saved [0]

    The following arguments for quantization are optional:
      -cutoff             number of words and ngrams to retain [0]
      -retrain            finetune embeddings if a cutoff is applied [0]
      -qnorm              quantizing the norm separately [0]
      -qout               quantizing the classifier [0]
      -dsub               size of each sub-vector [2]
