fastText parameters and usage
fastText was open-sourced by Facebook and is based on the ideas in the fastText papers. It is mainly used for two tasks: training word vectors and text classification.
Download and documentation: the official fastText website.
Main features of fastText:
Training Supervised Classifier [supervised] Supervised classifier training for text classification. Training a classifier, i.e. text classification, is fastText's main business.
Training SkipGram Model [skipgram] Learning word representations/word vectors using the skipgram technique.
Quantization [quantize] Quantization is a process applied to a model to reduce memory usage during prediction; it compresses the model and shrinks its size.
Predictions [predict] Predicting labels for a given text: text classification.
Predictions with Probabilities [predict-prob] Predicting probabilities in addition to labels for a given text: prediction with class probabilities.
Training of CBOW model [cbow] Learning word representations/word vectors using the CBOW (Continuous Bag Of Words) technique.
Print Word Vectors [print-word-vectors] Printing word vectors from a trained model, one word vector per line.
Print Sentence Vectors [print-sentence-vectors] Printing sentence vectors from a trained model, one vector per text. Every text gets a vector of the same length, summarizing the features of all its words.
Query Nearest Neighbors [nn] Finding the nearest neighbors of a given word.
Query for Analogies [analogies] Finding analogies of the form A - B + C, e.g. Berlin - Germany + France = Paris.
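Several of these query functions are also exposed in the Python module. A minimal sketch, assuming a model has already been trained and saved as xxxmodel.bin (a hypothetical path):

import fasttext

model = fasttext.load_model('xxxmodel.bin')          # load a trained .bin model
model.get_sentence_vector('a tokenized sentence')    # fixed-length sentence vector
model.get_analogies('berlin', 'germany', 'france')   # analogies: A - B + C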
Using fastText from the command line:
1. Train word vectors on your own corpus
fasttext skipgram -input xxxcorpus -output xxxmodel
Training produces two files, xxxmodel.bin and xxxmodel.vec: the binary model file and the word vectors in plain-text form.
The subcommand can be skipgram or cbow, corresponding to the skip-gram and CBOW models respectively.
2. Query a word's nearest neighbors with a trained model
fasttext nn xxxmodel.bin
At the "Query word?" prompt, enter a word to get its nearest neighbor words.
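The same lookup is available from the Python module. A minimal sketch, reusing the hypothetical xxxmodel.bin from above:

import fasttext

model = fasttext.load_model('xxxmodel.bin')
print(model.get_nearest_neighbors('king'))  # list of (similarity, word) pairs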
3. Some other flags:
-minn and -maxn: the length range of subword character n-grams; the defaults are 3 and 6.
-epoch and -lr: the number of epochs and the learning rate; the defaults are 5 and 0.05.
-dim: the dimensionality of the word vectors. Larger is more powerful, but uses more memory and slows computation down.
-thread: the number of threads to run; self-explanatory.
Using the Python module:
The parameters have essentially the same meanings and functions; usage is as follows.
Here is an example:
import fasttext

def train_word_vector(train_fname, test_fname, epoch, lr, save_model_fname, thr):
    """Train a supervised text classifier and save the model.

    test_fname and thr are accepted for interface consistency but unused here.
    """
    dim = 500            # size of word vectors [100]
    ws = 5               # size of the context window [5]
    minCount = 500       # minimal number of word occurrences [1]
    minCountLabel = 1    # minimal number of label occurrences [1]
    minn = 1             # min length of char ngram [0]
    maxn = 2             # max length of char ngram [0]
    neg = 5              # number of negatives sampled [5]
    wordNgrams = 2       # max length of word ngram [1]
    loss = 'softmax'     # loss function {ns, hs, softmax, ova} [softmax]
    lrUpdateRate = 100   # change the rate of updates for the learning rate [100]
    t = 0.0001           # sampling threshold [0.0001]
    label = '__label__'  # label prefix ['__label__']
    model = fasttext.train_supervised(train_fname, lr=lr, epoch=epoch, dim=dim, ws=ws,
                                      minCount=minCount, minCountLabel=minCountLabel,
                                      minn=minn, maxn=maxn, neg=neg,
                                      wordNgrams=wordNgrams, loss=loss,
                                      lrUpdateRate=lrUpdateRate,
                                      t=t, label=label, verbose=2)
    model.save_model(save_model_fname)
    return model
if __name__ == "__main__":
    # example parameter settings (placeholder values)
    train_fname = 'train.txt'
    test_fname = 'test.txt'
    save_model_fname = 'model.bin'
    epoch, lr, thr = 5, 0.1, 0.0
    model = train_word_vector(train_fname, test_fname,
                              epoch, lr, save_model_fname, thr)
    model.get_nearest_neighbors('some_word')  # nearest neighbors of a word
    model.predict('sentence')                 # returns the predicted label(s)
    model.test(test_fname)                    # returns (number of samples, precision@1, recall@1);
                                              # for single-label data the two metrics coincide
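The quantize function from the feature list is exposed on the trained model object as well. A minimal sketch, following the usage shown in the official Python README and reusing the placeholder train_fname from above:

model.quantize(input=train_fname, retrain=True)  # compress the model to cut memory use at prediction time
model.save_model('model.ftz')                    # quantized models are conventionally saved as .ftz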
The APIs for unsupervised word-vector training and supervised text-classification training are as follows:
train_unsupervised parameters
input # training file path (required)
model # unsupervised fasttext model {cbow, skipgram} [skipgram]
lr # learning rate [0.05]
dim # size of word vectors [100]
ws # size of the context window [5]
epoch # number of epochs [5]
minCount # minimal number of word occurrences [5]
minn # min length of char ngram [3]
maxn # max length of char ngram [6]
neg # number of negatives sampled [5]
wordNgrams # max length of word ngram [1]
loss # loss function {ns, hs, softmax, ova} [ns]
bucket # number of buckets [2000000]
thread # number of threads [number of cpus]
lrUpdateRate # change the rate of updates for the learning rate [100]
t # sampling threshold [0.0001]
verbose # verbose [2]
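A minimal train_unsupervised sketch, assuming a whitespace-tokenized corpus at corpus.txt (hypothetical path):

import fasttext

model = fasttext.train_unsupervised('corpus.txt', model='skipgram', dim=100, epoch=5)
vec = model.get_word_vector('example')  # numpy array of length dim
model.save_model('vectors.bin')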
train_supervised parameters
input # training file path (required)
lr # learning rate [0.1]
dim # size of word vectors [100]
ws # size of the context window [5]
epoch # number of epochs [5]
minCount # minimal number of word occurrences [1]
minCountLabel # minimal number of label occurrences [1]
minn # min length of char ngram [0]
maxn # max length of char ngram [0]
neg # number of negatives sampled [5]
wordNgrams # max length of word ngram [1]
loss # loss function {ns, hs, softmax, ova} [softmax]
bucket # number of buckets [2000000]
thread # number of threads [number of cpus]
lrUpdateRate # change the rate of updates for the learning rate [100]
t # sampling threshold [0.0001]
label # label prefix ['__label__']
verbose # verbose [2]
pretrainedVectors # pretrained word vectors (.vec file) for supervised learning []
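A minimal train_supervised sketch with hypothetical file names. Each line of train.txt must start with the label prefix, e.g. "__label__positive great movie"; if pretrainedVectors is given, dim must match the dimensionality of the .vec file:

import fasttext

model = fasttext.train_supervised(input='train.txt',
                                  pretrainedVectors='pretrained.vec',  # hypothetical 100-dim .vec file
                                  dim=100, wordNgrams=2, loss='softmax')
print(model.predict('some text to classify', k=2))  # top-2 labels with their probabilities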