Training and Using fastText

Reposted article, 2021-02-22 12:56

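fastText is a text classifier that Facebook AI Research open-sourced in 2016. Its defining feature is speed: compared with other text classification models such as SVM, logistic regression, and neural networks, fastText greatly shortens training time while keeping classification quality comparable.
fastText focuses on text classification and performs very well on many standard tasks.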

Training fastText

 
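    import fasttext

    trainDataFile = 'train.txt'

    # Keyword names follow the official Python binding, which mirrors the
    # command-line options listed at the end of this article.
    classifier = fasttext.train_supervised(
        input=trainDataFile,
        label='__label__',   # prefix that marks labels in the training file
        dim=256,
        epoch=50,
        lr=1.0,
        lrUpdateRate=50,
        minCount=3,
        loss='softmax',
        wordNgrams=2,
        bucket=1000000)

    classifier.save_model('Model.bin')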

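Two settings deserve special attention when training fastText: the word n-gram length (word_ngrams, the wordNgrams option) and the loss. These two are the essence of fastText and are discussed later.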

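Text must be tokenized before it is used to train fastText (for Chinese, this means word segmentation), because the word n-grams are built from the resulting tokens.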
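To make the tokenization requirement concrete, here is a minimal sketch of how a tokenized example could be written as a training line and which word bigrams would be derived from it. The helper names `to_fasttext_line` and `word_ngrams` are hypothetical, not part of the fastText API, and the sketch ignores implementation details such as hashing the n-grams into buckets.

```python
def to_fasttext_line(label, tokens, label_prefix="__label__"):
    """Format one training example as '<prefix><label> tok1 tok2 ...'."""
    return label_prefix + label + " " + " ".join(tokens)


def word_ngrams(tokens, max_n):
    """List the word n-grams of length 2..max_n built from adjacent tokens."""
    grams = []
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


# Tokens are assumed to come from a tokenizer or word segmenter run beforehand.
tokens = ["the", "team", "won"]
line = to_fasttext_line("sports", tokens)
bigrams = word_ngrams(tokens, 2)
print(line)
print(bigrams)
```

Because the n-grams are formed from adjacent tokens, a different segmentation of the same sentence yields different n-gram features.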

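In fact, training the text classifier produces word vectors as a by-product. The fastText classifier is very similar to CBOW in word2vec: the word vectors of the text are averaged and passed through a single fully connected layer for classification.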
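The architecture described above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions (a toy embedding table, no n-gram features, plain softmax); the names `classify`, `embeddings`, and `class_weights` are made up for the example and are not fastText identifiers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(tokens, embeddings, class_weights):
    """Average the token vectors, apply one linear layer, return class probabilities."""
    dim = len(class_weights[0])
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if vectors:
        hidden = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    else:
        hidden = [0.0] * dim  # no known token: fall back to a zero vector
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in class_weights]
    return softmax(scores)


# Toy 2-dimensional embeddings and a 2-class linear layer (illustrative values).
embeddings = {"good": [1.0, 0.0], "bad": [0.0, 1.0]}
class_weights = [[2.0, 0.0], [0.0, 2.0]]
probs = classify(["good", "good", "bad"], embeddings, class_weights)
print(probs)
```

The embedding table learned this way is the by-product mentioned above: it is trained only to make the averaged vector useful for classification.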

Predicting with fastText

Using a trained fastText model is straightforward; the following code loads the model and evaluates it on a test file.

 
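    import fasttext

    testDataFile = 'test.txt'
    classifier = fasttext.load_model('Model.bin')

    # test() returns (number of examples, precision@1, recall@1)
    result = classifier.test(testDataFile)
    print('Examples in test set:', result[0])
    print('Precision on test set:', result[1])
    print('Recall on test set:', result[2])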
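To clarify what the two scores reported by the evaluation mean, the sketch below computes precision and recall over top-k predictions in plain Python. The function name `precision_recall_at_k` and the toy data are assumptions made for illustration; this is not fastText's own code, only the same definition: correct predictions divided by the number of predictions, and by the number of true labels.

```python
def precision_recall_at_k(predicted, gold):
    """predicted: list of top-k label lists; gold: list of sets of true labels."""
    correct = sum(len(set(p) & g) for p, g in zip(predicted, gold))
    n_predicted = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    return correct / n_predicted, correct / n_gold


# Three examples with one prediction each (k = 1); the last has two true labels.
predicted = [["a"], ["b"], ["a"]]
gold = [{"a"}, {"a"}, {"a", "c"}]
precision, recall = precision_recall_at_k(predicted, gold)
print(precision, recall)
```

When every example has exactly one true label and k = 1, the two values coincide and equal the accuracy.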

Bag of tricks for efficient text classification

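1) Hierarchical softmax: when the number of classes is large, fastText can replace the plain softmax with cross-entropy by a hierarchical softmax (loss set to hs), which greatly speeds up both training and prediction. Note that the training example above sets loss to 'softmax', so it does not use this trick.

2) n-grams: fastText can use character-level n-grams to represent a word (controlled by the minn and maxn options). These are distinct from the word-level n-grams set by word_ngrams in the training example. For the word "apple" with n = 3, the trigrams are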

“<ap”, “app”, “ppl”, “ple”, “le>”

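Here < marks the beginning of the word and > marks its end. These trigrams can be used to represent the word "apple"; more precisely, the word vector of "apple" can be formed by summing the vectors of these 5 trigrams.

This brings two benefits:

1. Better vectors for low-frequency words, because their n-grams are shared with other words.

2. Vectors can still be built for words outside the training vocabulary, by summing the vectors of their character n-grams.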
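The character n-gram scheme described above can be sketched as follows. The function names and the tiny n-gram table are hypothetical and chosen only for illustration; the real library hashes n-grams into buckets and learns the vectors during training.

```python
def char_ngrams(word, n):
    """Return the character n-grams of a word wrapped in '<' and '>' markers."""
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def word_vector(word, ngram_vectors, n=3, dim=2):
    """Sum the vectors of the word's known n-grams; unknown n-grams add nothing."""
    vec = [0.0] * dim
    for gram in char_ngrams(word, n):
        for i, value in enumerate(ngram_vectors.get(gram, [0.0] * dim)):
            vec[i] += value
    return vec


print(char_ngrams("apple", 3))

# Toy n-gram vectors, as if learned from "apple"; "apply" is out of vocabulary
# but shares "<ap" and "app", so it still gets a non-zero vector.
ngram_vectors = {"<ap": [1.0, 0.0], "app": [0.0, 1.0], "ple": [0.5, 0.5]}
print(word_vector("apply", ngram_vectors))
```

The second call shows the out-of-vocabulary benefit: the unseen word borrows whatever its shared n-grams have already learned.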

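Why fastText runs fast

1) Multithreaded training: fastText trains with multiple threads. The threads update the parameters without taking locks, which adds some noise to the updates but in practice does not hurt the final result. Neither Google's word2vec implementation nor the fastText library uses locks. The default number of threads is 12, and it can be set manually.

2) Hierarchical softmax: when the loss is set to hierarchical softmax, the output layer is much cheaper to compute than a full softmax, which greatly improves efficiency.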
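The hierarchical softmax idea can be illustrated with a small sketch. It assumes a Huffman tree built from label counts, with one score per inner node; the function names and the toy counts are made up for this example and do not reproduce fastText's implementation. The probability of a label is the product of binary decisions along its path, so the cost grows with the path length rather than with the number of classes.

```python
import heapq
import itertools
import math


def build_huffman_paths(label_counts):
    """Return {label: [(inner_node_id, bit), ...]} giving each root-to-leaf path."""
    tiebreak = itertools.count()  # keeps heap comparisons away from the node tuples
    heap = [(count, next(tiebreak), ("leaf", label))
            for label, count in label_counts.items()]
    heapq.heapify(heap)
    inner_ids = itertools.count()
    while len(heap) > 1:
        c1, _, n1 = heapq.heappop(heap)
        c2, _, n2 = heapq.heappop(heap)
        node = ("inner", next(inner_ids), n1, n2)
        heapq.heappush(heap, (c1 + c2, next(tiebreak), node))

    paths = {}

    def walk(node, path):
        if node[0] == "leaf":
            paths[node[1]] = path
        else:
            _, node_id, left, right = node
            walk(left, path + [(node_id, 0)])
            walk(right, path + [(node_id, 1)])

    walk(heap[0][2], [])
    return paths


def label_probability(path, node_scores):
    """Multiply the binary (sigmoid) decisions taken along the path."""
    prob = 1.0
    for node_id, bit in path:
        p_right = 1.0 / (1.0 + math.exp(-node_scores[node_id]))
        prob *= p_right if bit == 1 else 1.0 - p_right
    return prob


label_counts = {"a": 50, "b": 30, "c": 15, "d": 5}  # toy label frequencies
paths = build_huffman_paths(label_counts)
node_scores = {0: 0.3, 1: -1.2, 2: 0.7}             # toy scores, one per inner node
for label, path in paths.items():
    print(label, len(path), label_probability(path, node_scores))
```

Frequent labels end up near the root and need fewer binary decisions, and the leaf probabilities still sum to one without ever computing a score for every class.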

All fastText options

 
    The following arguments are mandatory:
      -input              training file path
      -output             output file path

    The following arguments are optional:
      -verbose            verbosity level [2]

    The following arguments for the dictionary are optional:
      -minCount           minimal number of word occurrences [1]
      -minCountLabel      minimal number of label occurrences [0]
      -wordNgrams         max length of word ngram [1]
      -bucket             number of buckets [2000000]
      -minn               min length of char ngram [0]
      -maxn               max length of char ngram [0]
      -t                  sampling threshold [0.0001]
      -label              labels prefix [__label__]

    The following arguments for training are optional:
      -lr                 learning rate [0.1]
      -lrUpdateRate       change the rate of updates for the learning rate [100]
      -dim                size of word vectors [100]
      -ws                 size of the context window [5]
      -epoch              number of epochs [5]
      -neg                number of negatives sampled [5]
      -loss               loss function {ns, hs, softmax} [softmax]
      -thread             number of threads [12]
      -pretrainedVectors  pretrained word vectors for supervised learning []
      -saveOutput         whether output params should be saved [0]

    The following arguments for quantization are optional:
      -cutoff             number of words and ngrams to retain [0]
      -retrain            finetune embeddings if a cutoff is applied [0]
      -qnorm              quantizing the norm separately [0]
      -qout               quantizing the classifier [0]
      -dsub               size of each sub-vector [2]
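As a small illustration of how the options above fit together on the command line, the sketch below assembles an argument list for the supervised subcommand from a dictionary. The helper name `build_supervised_command`, the binary path `./fasttext`, and the output name `model` are assumptions made for the example; the option names are taken from the list above.

```python
def build_supervised_command(input_path, output_path, options, binary="./fasttext"):
    """Build an argv list such as: ./fasttext supervised -input ... -output ... -lr ..."""
    cmd = [binary, "supervised", "-input", input_path, "-output", output_path]
    for name, value in options.items():
        cmd += ["-" + name, str(value)]
    return cmd


# Same settings as the Python training example, expressed as command-line options.
options = {"lr": 1.0, "epoch": 50, "dim": 256, "wordNgrams": 2, "minCount": 3}
cmd = build_supervised_command("train.txt", "model", options)
print(" ".join(cmd))
```

The resulting list could be passed to subprocess.run once the binary path matches a local build.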
