[sphinx] Training a Chinese Language Model

Reposted article, 2015-09-15 17:21. Tags: sphinx / language model / speech recognition

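I. Training a language model on short phrases without word segmentation

Reference: http://cmusphinx.sourceforge.net/wiki/tutoriallm (the official sphinx tutorial)

1) Preparing the text

Create a text file with one word per line. Each line begins with <s> and ends with </s>, as shown below, with a space before and after the word. The file is UTF-8 encoded and named test.txt.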

<s> 蘇菲 </s>
<s> 百事 </s>
<s> 雀巢 </s>
<s> 寶潔 </s>
<s> 殼牌 </s>
<s> 統一 </s>
<s> 高通 </s>
<s> 科勒 </s>
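
For a longer word list, a file in this shape can be generated rather than typed by hand. The following is a minimal Python sketch, not part of the original workflow; the helper names to_corpus_lines and write_corpus are made up for illustration.

```python
from pathlib import Path


def to_corpus_lines(words):
    """Wrap each non-empty word as '<s> word </s>', one entry per line."""
    return [f"<s> {w.strip()} </s>" for w in words if w.strip()]


def write_corpus(words, path="test.txt"):
    """Write the wrapped words as a UTF-8 text file ending in a newline."""
    text = "\n".join(to_corpus_lines(words)) + "\n"
    Path(path).write_text(text, encoding="utf-8")
```

Calling write_corpus(["蘇菲", "百事", "雀巢"]) would write the first three lines of the example above to test.txt.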

2) Upload the file to the server and generate the word-frequency (vocabulary) file

text2wfreq < test.txt | wfreq2vocab > test.vocab

The intermediate output is:

text2wfreq : Reading text from standard input...
wfreq2vocab : Will generate a vocabulary containing the most
              frequent 20000 words. Reading wfreq stream from stdin...
text2wfreq : Done.
wfreq2vocab : Done.

The resulting file is test.vocab, in the following format:

## Vocab generated by v2 of the CMU-Cambridge Statistcal
## Language Modeling toolkit.
##
## Includes 178 words ##
</s>
<s>
一號店
上好佳
上海灘
絲塔芙
絲芙蘭
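
To make the vocabulary step less opaque, here is a rough Python approximation of what the text2wfreq | wfreq2vocab pipeline appears to do, judging from its log and output: count whitespace-separated tokens, keep the most frequent ones (20000 by default, per the log), and list them in sorted order. It is an illustration only, not the toolkit's actual implementation, and build_vocab is a made-up name.

```python
from collections import Counter


def build_vocab(lines, top=20000):
    """Approximate the vocabulary step: count tokens, keep the `top`
    most frequent, and return them sorted by code point."""
    counts = Counter(tok for line in lines for tok in line.split())
    most_frequent = [word for word, _ in counts.most_common(top)]
    return sorted(most_frequent)
```

Sorting by code point puts </s> before <s> and both before the Chinese words, which matches the order seen in test.vocab above.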

3) Generate the ARPA file

text2idngram -vocab test.vocab -idngram test.idngram < test.txt
idngram2lm -vocab_type 0 -idngram test.idngram -vocab test.vocab -arpa test.lm

The first command prints:

text2idngram
Vocab                  : test.vocab
Output idngram         : test.idngram
N-gram buffer size     : 100
Hash table size        : 2000000
Temp directory         : cmuclmtk-Mtadbf
Max open files         : 20
FOF size               : 10
n                      : 3
Initialising hash table...
Reading vocabulary... 
Allocating memory for the n-gram buffer...
Reading text into the n-gram buffer...
20,000 n-grams processed for each ".", 1,000,000 for each line.

Sorting n-grams...
Writing sorted n-grams to temporary file cmuclmtk-Mtadbf/1
Merging 1 temporary files...

2-grams occurring:      N times         > N times       Sug. -spec_num value
      0                                             351             364
      1                             348               3              13
      2                               2               1              11
      3                               0               1              11
      4                               0               1              11
      5                               0               1              11
      6                               0               1              11
      7                               0               1              11
      8                               0               1              11
      9                               0               1              11
     10                               0               1              11

3-grams occurring:      N times         > N times       Sug. -spec_num value
      0                                             525             540
      1                             522               3              13
      2                               3               0              10
      3                               0               0              10
      4                               0               0              10
      5                               0               0              10
      6                               0               0              10
      7                               0               0              10
      8                               0               0              10
      9                               0               0              10
     10                               0               0              10
text2idngram : Done.

The resulting file is test.idngram. It is a binary file, so opening it in a text editor shows something like this:

^@^@^@^A^@^@^@^B^@^@^@^C^@^@^@^A^@^@^@^A^@^@^@^B^@^@^@^D^@^@^@^A^@^@^@^A^@^@^@^B^@^@^@^E^@^@^@^A^@^@^@^A^@^@^@^B^@^@^@^F^@^@^@^A^@^@^@^A^@^@^@^B^@^@^@^G^@^@^@^A^@^@^@^A^@^@^@^B^@^@^@^H^@^@^@^A^@^@^@^A^@^@^@^B^@^@^@  ^@^@^@^A^@^@^@^A^@^@^@^B^@^@^@
@
@
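
The dump is easier to read once decoded. Its byte pattern suggests fixed-size records of four big-endian 32-bit integers: three word ids followed by a count. That layout is inferred from the dump above and is an assumption, not taken from the toolkit documentation. Under that assumption, a minimal Python sketch (read_idngrams is a made-up name) could decode the file:

```python
import struct


def read_idngrams(data, n=3):
    """Decode records of n word ids followed by one count.

    Assumes every field is a big-endian 32-bit integer, as the dump
    above suggests; trailing bytes that do not fill a record are ignored.
    """
    record = struct.Struct(f">{n + 1}i")
    usable = len(data) - len(data) % record.size
    return [(fields[:n], fields[n]) for fields in record.iter_unpack(data[:usable])]
```

Reading the file with open("test.idngram", "rb") and passing its bytes to read_idngrams would, if the assumption holds, give the first records as ((1, 2, 3), 1), ((1, 2, 4), 1) and so on: the id triple followed by how often that 3-gram occurred.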

The second command produces many warnings along the way but still ends with "Done". The language model is probably flawed at this point.

Warning : P(2) = 0 (0 / 177)
ncount = 1
Warning : P(2) = 0 (0 / 177)
ncount = 1
Warning : P(2) = 0 (0 / 177)
ncount = 1
Warning : P(2) = 0 (0 / 177)
ncount = 1
......

Writing out language model...
ARPA-style 3-gram will be written to test.lm
idngram2lm : Done.

The resulting file is test.lm. Opening it shows:

This is a CLOSED-vocabulary model
  (OOVs eliminated from training data and are forbidden in test data)
Good-Turing discounting was applied.
1-gram frequency of frequency : 174
2-gram frequency of frequency : 348 2 0 0 0 0 0
3-gram frequency of frequency : 522 3 0 0 0 0 0
1-gram discounting ratios : 0.99
2-gram discounting ratios : 0.00
3-gram discounting ratios : 0.00
This file is in the ARPA-standard format introduced by Doug Paul.

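This seems to mean that only the 1-grams are effective and that the 2-grams and 3-grams are lacking: their discounting ratios are 0.00. In fact, further down in the lm file, the 2-gram pairs and 3-grams that are listed are delimited by the lines of the input.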
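
One quick way to check how many n-grams of each order actually ended up in the model is to read the header of the ARPA file, which lists them as "ngram N=count" lines after the \data\ marker. The following Python sketch does this; arpa_ngram_counts is a made-up helper, and it only looks at the header, not at the entries themselves.

```python
import re


def arpa_ngram_counts(lines):
    """Return {order: count} from the 'ngram N=count' lines that follow
    the \\data\\ marker of an ARPA-format language model."""
    counts = {}
    in_data = False
    for line in lines:
        line = line.strip()
        if line == "\\data\\":
            in_data = True
        elif in_data:
            match = re.fullmatch(r"ngram\s+(\d+)\s*=\s*(\d+)", line)
            if match:
                counts[int(match.group(1))] = int(match.group(2))
            elif line:
                break  # first non-empty, non-header line ends the section
    return counts
```

Passing the lines of test.lm (opened with encoding="utf-8") would return a dictionary keyed by n-gram order, which can be compared against the frequency-of-frequency figures in the file's preamble.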

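II. Using the language model

The setup uses the Chinese acoustic model and Chinese dictionary provided on the sphinx website, together with the language model trained here, to recognize a specific set of strings. There are 160 words, a dictionary built from the pronunciations of those 160 words, and a large, rich acoustic model that covers them. So in principle, once recognition has found each individual character, the words that this language model forms from combinations of characters should allow the correct phrases to be recognized.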
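
Since this reasoning depends on every vocabulary word having a pronunciation, it may be worth checking the dictionary against the vocabulary before running recognition. The sketch below assumes the usual pocketsphinx dictionary layout of one entry per line, with the word first and its phones after it; missing_from_dict is a made-up helper name.

```python
def missing_from_dict(vocab_lines, dict_lines):
    """List vocabulary words that have no entry in the pronunciation
    dictionary. Comment lines (##) and sentence markers are skipped."""
    dict_words = {line.split()[0] for line in dict_lines if line.strip()}
    missing = []
    for line in vocab_lines:
        word = line.strip()
        if not word or word.startswith("##") or word in ("<s>", "</s>"):
            continue
        if word not in dict_words:
            missing.append(word)
    return missing
```

An empty result for test.vocab and test.dic would at least rule out missing dictionary entries as the cause of recognition problems.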

pocketsphinx is installed on Windows and invoked as follows:

pocketsphinx_continuous.exe -inmic yes -lm test.lm -dict test.dic -hmm zh_broadcastnews_ptm256_8000

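Here the model passed with -lm is the generated .lm file used directly, whereas the "secret manual" guide (武林秘籍) first converts the lm model to a DMP model and then uses that here. I do not know whether this is where the problem lies.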
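
To test that hypothesis, the ARPA file could be converted to DMP and the recognizer rerun with the converted file. The sketch below builds the command line for sphinx_lm_convert, the converter shipped with sphinxbase. It assumes the tool is installed and on PATH and that the output format is picked from the .DMP extension; the function name and the output file name are made up for illustration.

```python
def lm_convert_command(arpa_path, dmp_path):
    """Build the argument list for converting an ARPA model to DMP.

    Assumes sphinx_lm_convert (from sphinxbase) is on PATH; pass the
    result to subprocess.run(..., check=True) to actually run it.
    """
    return ["sphinx_lm_convert", "-i", arpa_path, "-o", dmp_path]
```

For example, lm_convert_command("test.lm", "test.lm.DMP") gives the arguments for converting the model trained above; the resulting file would then be passed to -lm instead of test.lm.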

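III. Next plan

1) Use all of the word strings, run each of them through word segmentation, train a language model, and then use it together with the existing acoustic model

Online word segmentation tools that can be used directly, leaving aside how well they perform:

PHP word segmentation system demo: http://www.phpbone.com/phpanalysis/demo.php?ac=done

SCWS Chinese word segmentation: http://www.xunsearch.com/scws/demo.php

NLPIR, NLP from the Institute of Computing Technology, Chinese Academy of Sciences: http://ictclas.nlpir.org/nlpir/ (I just want to say this is what I think of as the fun side of NLP)

Their output still needs post-processing, so it is not very practical for now.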
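
As a stopgap that needs no post-processing, segmentation against the existing vocabulary can be sketched locally. The Python below uses simple forward maximum matching: at each position it takes the longest vocabulary word that fits and falls back to a single character. This is a swapped-in baseline technique, not what the online tools above do, and the function names and the max_len default are assumptions for illustration.

```python
def forward_max_match(text, vocab, max_len=4):
    """Segment text greedily: at each position take the longest piece
    (up to max_len characters) found in vocab, else one character."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def to_lm_line(text, vocab):
    """Format one segmented string as a training line for the corpus."""
    return "<s> " + " ".join(forward_max_match(text, vocab)) + " </s>"
```

With a vocabulary containing 上海灘 and 百事, to_lm_line("上海灘百事", vocab) yields a line in the same <s> ... </s> shape as test.txt, with the words separated by spaces.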

2) Record 300 sentences, train an acoustic model, and use it together with the corresponding language model.
