[Sphinx] Sphinx4 study notes

Reposted article. 2015-12-22 18:58. Tag: sphinx

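The sphinx-core project is a Java project that ships with several examples covering a range of functions, such as recording and alignment. (I have not yet tried them one by one.)
sphinx-test contains two demos: helloworld and hellongram, which does speech recognition. The available parameter files are hellongram0.xml, hellongram9.xml, and hellongram1.xml. helloworld.xml does not use a language model; instead it uses JSGF to define the grammar rules for sentences. It appears to work like a regular expression, restricting the sentences to be recognized to combinations such as (hello)(jim|kate|tom).
hellongram0.xml is the example for an n-gram language model. The XML file defines the locations of the acoustic model, the language model, and the dictionary.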
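The restriction that the helloworld grammar imposes can be pictured as enumerating every sentence the grammar accepts. The sketch below is plain Java, not the Sphinx4 grammar API. The class name, the method name, and the JSGF text in the comment are illustrative assumptions; only the word lists come from the example above.

```java
import java.util.ArrayList;
import java.util.List;

class GrammarSentences {
    // A JSGF rule of roughly this shape would express the same restriction
    // (hypothetical, not copied from the demo):
    //   #JSGF V1.0;
    //   grammar hello;
    //   public <greet> = hello (jim | kate | tom);

    /** Returns every sentence formed by one greeting followed by one name. */
    static List<String> expand(String[] greetings, String[] names) {
        List<String> sentences = new ArrayList<>();
        for (String greeting : greetings) {
            for (String name : names) {
                sentences.add(greeting + " " + name);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String[] greetings = {"hello"};
        String[] names = {"jim", "kate", "tom"};
        // Print each sentence the recognizer would be allowed to return.
        for (String sentence : expand(greetings, names)) {
            System.out.println(sentence);
        }
    }
}
```

Anything outside this small set cannot be returned by a grammar-constrained recognizer, which is the practical difference from the n-gram setup.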
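When recognize starts, it first loads all available models. At the stage shown below, it loads the model definition file mdef, then allocates pools, and their sizes, for the means, variances, and transition matrices. It then builds a pool for the senones (distFloor: the minimum score value, which looks like a lower threshold for recognition; varianceFloor: the minimum variance value).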

        variancePool = loadDensityFile(dataLocation + "variances",
                varianceFloor);
        mixtureWeightsPool = loadMixtureWeights(dataLocation
                + "mixture_weights", mixtureWeightFloor);
        transitionsPool = loadTransitionMatrices(dataLocation
                + "transition_matrices");
        transformMatrix = loadTransformMatrix(dataLocation
                + "feature_transform");

        senonePool = createSenonePool(distFloor, varianceFloor);
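A floor such as varianceFloor is generally a lower bound applied to each loaded value, so that no variance or weight ends up too close to zero. The sketch below shows that idea in isolation. The applyFloor helper and the floor value are hypothetical, written for illustration, and are not taken from the Sphinx4 loader.

```java
import java.util.Arrays;

class FloorSketch {
    /** Raises every value that is below the floor up to the floor, in place. */
    static void applyFloor(float[] values, float floor) {
        for (int i = 0; i < values.length; i++) {
            if (values[i] < floor) {
                values[i] = floor;
            }
        }
    }

    public static void main(String[] args) {
        // Example variances; two of them are smaller than the floor.
        float[] variances = {0.00001f, 0.5f, 0.0f};
        applyFloor(variances, 0.0001f); // arbitrary example floor
        System.out.println(Arrays.toString(variances));
    }
}
```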

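Current problem: with the wsj model directory that ships with the demo, loading succeeds and the demo runs. With the model I trained myself, loading fails. Debugging showed the following differences between the two models.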

================
wsg-0.xml model format:
-------------------------------
senones: 4147
Gaussians per senone: 8
means: 33176 = 4147*8
variances: 33176
streams: 1

=================

male_result (my) model format:
-----------------------------------
senones: 186
Gaussians per senone: 256
means: 1024
variances: 1024
streams: 4

Analysis: for models loaded by sphinx4, are some parameters fixed, such as the number of streams and the number of Gaussians?
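One way to read the two listings is to check the arithmetic behind the means counts. In the first model each senone appears to own its Gaussians, so the count is senones × Gaussians × streams. In my model, 1024 equals 256 × 4, which would fit a layout where all senones share one codebook of 256 Gaussians per stream. That second reading is an assumption drawn from the numbers, not something confirmed from the Sphinx4 source, and the class and method names below are made up for the sketch.

```java
class DensityCount {
    /** Count when every senone has its own Gaussians in every stream. */
    static int perSenoneDensities(int senones, int gaussiansPerSenone, int streams) {
        return senones * gaussiansPerSenone * streams;
    }

    /** Count when all senones share one codebook per stream (assumed layout). */
    static int sharedCodebookDensities(int gaussiansPerCodebook, int streams) {
        return gaussiansPerCodebook * streams;
    }

    public static void main(String[] args) {
        // First listing: 4147 senones, 8 Gaussians, 1 stream.
        System.out.println(perSenoneDensities(4147, 8, 1));
        // Second listing: 256 Gaussians, 4 streams, shared across senones.
        System.out.println(sharedCodebookDensities(256, 4));
    }
}
```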

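[Solution]

In the sphinx-config training parameter file, change semi to cont. Note the comment that follows the setting: pocketsphinx uses the semi format, while sphinx3 uses the cont format, for which the number of streams is 1 and the number of Gaussians is 8. Training this way yields a cont model. I loaded it into the sphinx4 environment, compiled, and it was OK! It ran smoothly with my own model. Testing with my own voice gave the following result: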

Start speaking. Press Ctrl-C to quit.

resultList.size=1
bestFinalToken=0050 -6.8291255E06 0.0000000E00 -1.0008177E04 lt-WordNode </s>(*SIL ) p 0.0 -10008.177{[長城][</s>]}
50 </s> -10008.177 0.0
50 長城 68886.47 0.0
4 <sil> 0.0 0.0
0 <s> 0.0 0.0
0result=<s> <sil> 長城 </s>
resultList.size=1
bestFinalToken=0050 -6.8291255E06 0.0000000E00 -1.0008177E04 lt-WordNode </s>(*SIL ) p 0.0 -10008.177{[長城][</s>]}
50 </s> -10008.177 0.0
50 長城 68886.47 0.0
4 <sil> 0.0 0.0
0 <s> 0.0 0.0
resultList.size=2
bestFinalToken=0077 -7.2286605E06 0.0000000E00 -1.0008177E04 lt-WordNode </s>(*SIL ) p 0.0 -10008.177{[長城][</s>]}
77 </s> -10008.177 0.0
59 長城 68886.47 0.0
4 <sil> 0.0 0.0
0 <s> 0.0 0.0
1result=<s> <sil> 長城 </s>
resultList.size=2
bestFinalToken=0077 -7.2286605E06 0.0000000E00 -1.0008177E04 lt-WordNode </s>(*SIL ) p 0.0 -10008.177{[長城][</s>]}
best token=0077 -7.2286605E06 0.0000000E00 -1.0008177E04 lt-WordNode </s>(*SIL ) p 0.0 -10008.177{[長城][</s>]}
77 </s> -10008.177 0.0
59 長城 68886.47 0.0
4 <sil> 0.0 0.0
0 <s> 0.0 0.0
You said: [長城]

Start speaking. Press Ctrl-C to quit.

Sphinx4 code analysis

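@Override: indicates that the method below overrides a method of the parent class.
@Component annotation: specific to Spring. It avoids having to declare the class in the XML configuration file. With this annotation, Spring automatically scans for the annotated classes at initialization and registers them as if they had been declared in the XML configuration, which reduces the amount of configuration and improves efficiency.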
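As a small illustration of @Override only (the Spring annotation needs the Spring framework and is left out), the sketch below uses two hypothetical classes, BaseDecoder and LiveDecoder, that are not part of Sphinx4. If the annotated method did not actually match a parent method, the compiler would reject it.

```java
class OverrideDemo {
    public static void main(String[] args) {
        // The declared type is the parent, but the child's method runs.
        BaseDecoder decoder = new LiveDecoder();
        System.out.println(decoder.describe());
    }
}

class BaseDecoder {
    /** Parent implementation. */
    String describe() {
        return "generic decoder";
    }
}

class LiveDecoder extends BaseDecoder {
    /** Replaces the parent implementation; checked by the compiler. */
    @Override
    String describe() {
        return "live decoder";
    }
}
```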
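Introduction to javax.sound.sampled, the Java package for audio processing

javax.sound.sampled provides interfaces and classes for capturing, processing, and playing back sampled audio data.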


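Fields:
protected AudioFormat AudioInputStream.format: the format of the audio data contained in the stream.

Methods:
AudioFormat AudioFileFormat.getFormat(): obtains the format of the audio data contained in the audio file.
AudioFormat AudioInputStream.getFormat(): obtains the audio format of the sound data in this audio input stream.
AudioFormat DataLine.getFormat(): obtains the current format (encoding, sample rate, number of channels, etc.) of the data line's audio data.
AudioFormat[] DataLine.Info.getFormats(): obtains the set of audio formats supported by the data line.
static AudioFormat[] AudioSystem.getTargetFormats(AudioFormat.Encoding targetEncoding, AudioFormat sourceFormat): obtains the formats that have a particular encoding and that the system can obtain from a stream of the specified format, using the set of installed format converters.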
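To see AudioInputStream.getFormat() at work without needing a file on disk, the sketch below writes a short stretch of silence to an in-memory WAV and reads the format back. The class name, helper names, and the chosen format (16 kHz, 16-bit, mono) are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.sound.sampled.AudioFileFormat;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.UnsupportedAudioFileException;

class FormatInspection {
    /** Builds an in-memory WAV file containing silence in the given format. */
    static byte[] silentWav(AudioFormat format, int frames) {
        byte[] pcm = new byte[frames * format.getFrameSize()];
        AudioInputStream in =
                new AudioInputStream(new ByteArrayInputStream(pcm), format, frames);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            AudioSystem.write(in, AudioFileFormat.Type.WAVE, out);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return out.toByteArray();
    }

    /** Reads the audio format back from WAV bytes. */
    static AudioFormat formatOf(byte[] wav) {
        try (AudioInputStream in =
                AudioSystem.getAudioInputStream(new ByteArrayInputStream(wav))) {
            return in.getFormat();
        } catch (IOException | UnsupportedAudioFileException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // 16 kHz, 16-bit, mono, signed, little-endian.
        AudioFormat source = new AudioFormat(16000f, 16, 1, true, false);
        AudioFormat read = formatOf(silentWav(source, 1600));
        System.out.println(read.getEncoding() + ", " + read.getSampleRate() + " Hz, "
                + read.getSampleSizeInBits() + " bit, " + read.getChannels() + " channel(s)");
    }
}
```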

Reference links:

Uses of class javax.sound.sampled.AudioFormat http://www.766.com/doc/javax/sound/sampled/class-use/AudioFormat.html

Capturing and playing audio in pure Java http://www.cnblogs.com/haore147/p/3662536.html

Introduction to javax sound http://blog.csdn.net/kangojian/article/details/4449956

Sphinx4 demo analysis

helloworld
hellongram
transcriber
Sphinx4 model training analysis

I. Model training requirements

1 Must be able to train any kind of model (with respect to tying)
2 Variable number of states per model
3 Mixtures with different numbers of Gaussians
4 Different kinds of densities at different states
5 Any kind of context
6 Simultaneous training of multiple feature streams or models

II. Points of focus for language model training

[Language model steps]
Create a generic NGram language model interface.
Create an ARPA LM loader that can load language models in the ARPA format. This loader will be optimized for simplicity, not for efficiency.
Create a language model for the alphabet.
Incorporate use of the language model into the Linguist/Search.
Test the natural spell with the LM to see what kind of improvement we get.
Create a more efficient version of the LM loader that can load very large language models.
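The first two steps can be sketched as follows: a minimal n-gram interface and a loader that favors simplicity over efficiency by keeping every n-gram in one hash map. The names NGramModel, SimpleArpaModel, and ArpaDemo are hypothetical and this is not the Sphinx4 implementation. Backoff weights are read past and ignored, and the sample ARPA text is invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

class ArpaDemo {
    /** Tiny made-up model in ARPA text format. */
    static final String SAMPLE =
            "\\data\\\n" + "ngram 1=3\n" + "ngram 2=1\n" + "\n"
            + "\\1-grams:\n" + "-0.5 <s> -0.3\n" + "-0.7 hello -0.2\n" + "-0.9 </s>\n" + "\n"
            + "\\2-grams:\n" + "-0.1 <s> hello\n" + "\n"
            + "\\end\\\n";

    public static void main(String[] args) {
        NGramModel model = SimpleArpaModel.parse(SAMPLE);
        System.out.println("max order: " + model.maxOrder());
        System.out.println("log10 P(hello): " + model.logProb("hello"));
        System.out.println("log10 P(hello | <s>): " + model.logProb("<s>", "hello"));
    }
}

/** Hypothetical minimal n-gram language model interface. */
interface NGramModel {
    /** Returns the stored log10 probability of the n-gram, or null if absent. */
    Float logProb(String... words);

    /** Returns the highest n-gram order found in the model. */
    int maxOrder();
}

/** Simple, unoptimized ARPA reader. */
class SimpleArpaModel implements NGramModel {
    private final Map<String, Float> logProbs = new HashMap<>();
    private int maxOrder = 0;

    static SimpleArpaModel parse(String arpaText) {
        SimpleArpaModel model = new SimpleArpaModel();
        int order = 0; // 0 means no "\N-grams:" section has started yet
        for (String raw : arpaText.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty() || line.equals("\\data\\") || line.startsWith("ngram ")) {
                continue;
            }
            if (line.equals("\\end\\")) {
                break;
            }
            if (line.startsWith("\\") && line.endsWith("-grams:")) {
                order = Integer.parseInt(line.substring(1, line.indexOf('-')));
                model.maxOrder = Math.max(model.maxOrder, order);
                continue;
            }
            String[] fields = line.split("\\s+");
            // Fields: log10 probability, then `order` words, then optional backoff.
            if (order == 0 || fields.length < order + 1) {
                continue;
            }
            StringBuilder key = new StringBuilder();
            for (int i = 1; i <= order; i++) {
                if (i > 1) {
                    key.append(' ');
                }
                key.append(fields[i]);
            }
            model.logProbs.put(key.toString(), Float.parseFloat(fields[0]));
        }
        return model;
    }

    @Override
    public Float logProb(String... words) {
        return logProbs.get(String.join(" ", words));
    }

    @Override
    public int maxOrder() {
        return maxOrder;
    }
}
```

Holding everything in one map is what makes this version unsuitable for very large models, which is the motivation for the last step in the list.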

III. Acoustic model format

IV. Decoding

(1) Search methods: breadth-first search, Viterbi decoding, depth-first search, A* decoding
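Of the methods listed, Viterbi decoding is the easiest to show on a toy problem. The sketch below finds the most likely state sequence of a small hidden Markov model by dynamic programming over log probabilities. It is a textbook illustration with made-up numbers and a hypothetical class name, and it says nothing about how the Sphinx4 search manager is actually organized. It assumes at least one observation.

```java
import java.util.Arrays;

class ViterbiSketch {
    /**
     * Returns the most likely state sequence for the observations.
     * start[s], trans[from][to], and emit[state][symbol] are plain probabilities;
     * obs holds observation symbol indices and must not be empty.
     */
    static int[] decode(double[] start, double[][] trans, double[][] emit, int[] obs) {
        int n = start.length;
        int t = obs.length;
        double[][] score = new double[t][n]; // best log score ending in each state
        int[][] back = new int[t][n];        // best predecessor of each state

        for (int s = 0; s < n; s++) {
            score[0][s] = Math.log(start[s]) + Math.log(emit[s][obs[0]]);
        }
        for (int i = 1; i < t; i++) {
            for (int s = 0; s < n; s++) {
                int best = 0;
                for (int p = 1; p < n; p++) {
                    if (score[i - 1][p] + Math.log(trans[p][s])
                            > score[i - 1][best] + Math.log(trans[best][s])) {
                        best = p;
                    }
                }
                score[i][s] = score[i - 1][best] + Math.log(trans[best][s])
                        + Math.log(emit[s][obs[i]]);
                back[i][s] = best;
            }
        }

        // Pick the best final state, then follow the back pointers.
        int last = 0;
        for (int s = 1; s < n; s++) {
            if (score[t - 1][s] > score[t - 1][last]) {
                last = s;
            }
        }
        int[] path = new int[t];
        path[t - 1] = last;
        for (int i = t - 1; i > 0; i--) {
            path[i - 1] = back[i][path[i]];
        }
        return path;
    }

    public static void main(String[] args) {
        double[] start = {0.6, 0.4};
        double[][] trans = {{0.7, 0.3}, {0.4, 0.6}};
        double[][] emit = {{0.5, 0.4, 0.1}, {0.1, 0.3, 0.6}};
        int[] obs = {0, 1, 2};
        System.out.println(Arrays.toString(decode(start, trans, emit, obs)));
    }
}
```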

V. Sphinx4 usage guide

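The sphinx4-5prealpha directory ships with a pom.xml file, which means the project is a Maven project. To load it in Eclipse, import it as an existing Maven project, then build.
sphinx4 comes with three recognition modes:
- LiveSpeechRecognizer: live recognition; captures speech from the microphone and recognizes it
- StreamSpeechRecognizer: stream recognition; the audio to recognize is passed in as an input parameter and the result is returned
- SpeechAligner: speech alignment; aligns the audio with a given text
- Whichever of the three modes is used, the requirements are the same: an acoustic model, a language model or grammar, a dictionary, and the input audio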

sphinx4 ships with a basic demo, transcriber, which recognizes the content of speech as text. This is called transcription.

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;

public class TranscriberDemo {

    public static void main(String[] args) throws Exception {

        Configuration configuration = new Configuration();

        // Set the paths to the acoustic model, the dictionary, and the language model.
        configuration
                .setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
        configuration
                .setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
        configuration
                .setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

        StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(
                configuration);
        // The audio file whose content will be recognized.
        InputStream stream = new FileInputStream(new File("test.wav"));

        // Start recognition.
        recognizer.startRecognition(stream);
        SpeechResult result;
        // Fetch results until getResult() returns null.
        while ((result = recognizer.getResult()) != null) {
            System.out.format("Hypothesis: %s\n", result.getHypothesis());
        }
        // Stop recognition.
        recognizer.stopRecognition();
    }
}

Live recognition: create an object of the live recognizer class, which then captures speech from the microphone.

LiveSpeechRecognizer recognizer = new LiveSpeechRecognizer(configuration);

Stream recognition: the recognizer reads the audio to recognize from a stream.

StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);

Note that the input audio stream must be in one of the following formats:

RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 16000 Hz
or
 RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 8000 Hz
In the latter case, the sample rate must also be set explicitly:
   configuration.setSampleRate(8000);
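Before handing audio to the recognizer, it can help to check that the format matches one of the two accepted layouts. The sketch below does this with javax.sound.sampled.AudioFormat. The class and method names are hypothetical, and treating "Microsoft PCM, 16 bit" as signed little-endian PCM is an assumption of the sketch.

```java
import javax.sound.sampled.AudioFormat;

class InputFormatCheck {
    /** True for 16-bit signed little-endian mono PCM at 16 kHz or 8 kHz. */
    static boolean isAccepted(AudioFormat format) {
        boolean rateOk = format.getSampleRate() == 16000f
                || format.getSampleRate() == 8000f;
        return AudioFormat.Encoding.PCM_SIGNED.equals(format.getEncoding())
                && format.getSampleSizeInBits() == 16
                && format.getChannels() == 1
                && !format.isBigEndian()
                && rateOk;
    }

    public static void main(String[] args) {
        // Arguments: sample rate, bits per sample, channels, signed, big-endian.
        AudioFormat wideband = new AudioFormat(16000f, 16, 1, true, false);
        AudioFormat cdStereo = new AudioFormat(44100f, 16, 2, true, false);
        System.out.println("16 kHz mono accepted: " + isAccepted(wideband));
        System.out.println("44.1 kHz stereo accepted: " + isAccepted(cdStereo));
    }
}
```

For a real file, the AudioFormat would come from AudioSystem.getAudioInputStream(file).getFormat() rather than being constructed by hand.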

sphinx4-api comes with four demos:

Transcriber: demonstrates how to transcribe a file, that is, recognize speech as text
Dialog: demonstrates how to lead a dialog with a user (a dialog system)
SpeakerID: speaker identification (who is speaking?)
Aligner: demonstrates audio-to-transcription timestamping (aligning audio with text)

[References]

Summary of some factors that affect performance http://www.codes51.com/article/detail_159118.html

Searchable discussion of CMU Sphinx issues http://sourceforge.net/p/cmusphinx/discussion/sphinx4/
