How to Prepare Your Own Data for Kaldi


Introduction

After running some of Kaldi's example scripts, you may want to run Kaldi on your own dataset. This section explains how to prepare the data.

The top part of each run.sh deals with data preparation; the scripts under local/ are specific to each dataset.

For example, for the RM dataset:

local/rm_data_prep.sh /export/corpora5/LDC/LDC93S3A/rm_comp || exit 1;

utils/prepare_lang.sh data/local/dict '!SIL' data/local/lang data/lang || exit 1;

local/rm_prepare_grammar.sh || exit 1;

 

And for the WSJ dataset:

wsj0=/export/corpora5/LDC/LDC93S6B
wsj1=/export/corpora5/LDC/LDC94S13B

local/wsj_data_prep.sh $wsj0/??-{?,??}.? $wsj1/??-{?,??}.?  || exit 1;

local/wsj_prepare_dict.sh || exit 1;

utils/prepare_lang.sh data/local/dict "<SPOKEN_NOISE>" data/local/lang_tmp data/lang || exit 1;

local/wsj_format_data.sh || exit 1;

The WSJ setup has more commands than RM because it also trains language models, but the most important commands are the ones the two setups share.

The output of the data preparation stage has two parts: one relates to "the data" (a directory such as data/train/) and one relates to "the language" (a directory such as data/lang/). The data part concerns the specific recordings you have, while the lang part concerns the language itself (the lexicon, the phone set, and so on). If you only want to decode prepared data with an existing system and language model, the data part is the main thing you need to become familiar with.

Data preparation-- the "data" part.

As an example, look at the data/train directory in egs/swbd/s5:

s5# ls data/train
cmvn.scp  feats.scp  reco2file_and_channel  segments  spk2utt  text  utt2spk  wav.scp
Not all of these files are equally important. For a simple setup where each utterance corresponds to one audio file, there is no segmentation information and no segments file. The files you have to create yourself are "utt2spk", "text" and "wav.scp", plus "segments" and "reco2file_and_channel" if you do have segmentation information; the remaining files can be generated by standard scripts. If your files follow a consistent naming convention, much of this can be produced automatically, as in the sketch below.
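For example, here is a minimal sketch (the corpus/audio directory and the <speaker>-<utterance>.wav naming convention are assumptions for illustration, not part of any recipe) that derives utt2spk and wav.scp from the file names:

# Assumes each file is named <speaker>-<utterance>.wav, e.g. spkr001-utt0001.wav
for wav in corpus/audio/*.wav; do
  utt=$(basename "$wav" .wav)   # e.g. spkr001-utt0001
  spk=${utt%%-*}                # e.g. spkr001
  echo "$utt $spk" >> data/train/utt2spk
  echo "$utt $wav" >> data/train/wav.scp
done

The text file still has to come from your transcriptions, and all of these files must end up sorted in C order (see the note on LC_ALL=C below).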

Files you need to create yourself

The text file contains the transcription of each utterance, indexed by utterance-id:

s5# head -3 data/train/text
sw02001-A_000098-001156 HI UM YEAH I'D LIKE TO TALK ABOUT HOW YOU DRESS FOR WORK AND
sw02001-A_001980-002131 UM-HUM
sw02001-A_002736-002893 AND IS

The first token on each line is the utterance-id. Here it is formed from the recording name (sw02001), the conversation side, i.e. the speaker information (A), and the time range of the segment (000098-001156). It is important that the speaker information appears as a prefix of the utterance-id, because this keeps the sorting of the speaker-related files (utt2spk and spk2utt) consistent. If you need a separator between the speaker-id and the rest of the utterance-id, the dash "-" is a safe choice: it sorts before the alphanumeric characters and "_" in ASCII (C) order, so the utterance and speaker orderings stay consistent even when speaker-ids vary in length. With an unlucky separator and variable-length speaker-ids, the sorting assumptions of some scripts break and they can crash.

Another important file is wav.scp.

s5# head -3 data/train/wav.scp
sw02001-A /home/dpovey/kaldi-trunk/tools/sph2pipe_v2.5/sph2pipe -f wav -p -c 1 /export/corpora3/LDC/LDC97S62/swb1/sw02001.sph |
sw02001-B /home/dpovey/kaldi-trunk/tools/sph2pipe_v2.5/sph2pipe -f wav -p -c 2 /export/corpora3/LDC/LDC97S62/swb1/sw02001.sph |

The format is:

<recording-id> <extended-filename>

The extended-filename may be an actual wav file or, as here, a command (ending in "|") that writes a wav-format file to its standard output. If there is no segments file, the first token on each line of wav.scp is the utterance-id rather than the recording-id. The audio referenced by wav.scp must be single-channel (mono); if the underlying files have multiple channels, you can use sox to extract the desired channel on the fly.
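A sketch of such a wav.scp entry (the recording-id and path are hypothetical): sox extracts channel 1 of a stereo file and writes wav data to stdout, and the trailing "|" marks it as a piped command, just like the sph2pipe entries above.

rec001 sox /path/to/stereo/rec001.wav -t wav - remix 1 |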
"segments" 文件:
s5# head -3 data/train/segments
sw02001-A_000098-001156 sw02001-A 0.98 11.56
sw02001-A_001980-002131 sw02001-A 19.8 21.31
sw02001-A_002736-002893 sw02001-A 27.36 28.93
<utterance-id> <recording-id> <segment-begin> <segment-end>
The segment begin and end times are measured in seconds.

 "reco2file_and_channel" 僅僅會被用到,當NIST sclite工具評分時。
s5# head -3 data/train/reco2file_and_channel
sw02001-A sw02001 A
sw02001-B sw02001 B
sw02005-A sw02005 A
<recording-id> <filename> <recording-side (A or B)>
If you do not have an stm file (or do not know what that is), you do not need the reco2file_and_channel file either.
utt2spk"文件
s5# head -3 data/train/utt2spk
sw02001-A_000098-001156 2001-A
sw02001-A_001980-002131 2001-A
sw02001-A_002736-002893 2001-A

The format is

<utterance-id> <speaker-id>

The spk2gender file maps each speaker-id to its gender (this example is from the RM setup):
s5# head -3 ../../rm/s5/data/train/spk2gender
adg0 f
ahh0 m
ajp0 m
<speaker-id> <gender (m or f)>

All of these files must be sorted in C (byte) order. Set export LC_ALL=C in your shell before sorting; otherwise the order produced by sort will not match what Kaldi's tools expect and later steps will fail.
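For example, one way (a sketch; adjust the file names to your setup) to sort the hand-created files in place:

export LC_ALL=C
sort -o data/train/text    data/train/text
sort -o data/train/wav.scp data/train/wav.scp
sort -o data/train/utt2spk data/train/utt2spk

The script utils/fix_data_dir.sh, mentioned below, will also re-sort and clean up a data directory for you.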

Files you don't need to create yourself

"spk2utt" file 
utils/utt2spk_to_spk2utt.pl data/train/utt2spk > data/train/spk2utt
The feats.scp file maps each utterance-id to the location of its extracted MFCC features in an archive (.ark) file:
s5# head -3 data/train/feats.scp
sw02001-A_000098-001156 /home/dpovey/kaldi-trunk/egs/swbd/s5/mfcc/raw_mfcc_train.1.ark:24
sw02001-A_001980-002131 /home/dpovey/kaldi-trunk/egs/swbd/s5/mfcc/raw_mfcc_train.1.ark:54975
sw02001-A_002736-002893 /home/dpovey/kaldi-trunk/egs/swbd/s5/mfcc/raw_mfcc_train.1.ark:62762
<utterance-id> <extended-filename-of-features>

The feats.scp file is created by a command like:
steps/make_mfcc.sh --nj 20 --cmd "$train_cmd" data/train exp/make_mfcc/train $mfccdir

The cmvn.scp file maps each speaker-id to the location of that speaker's cepstral mean and variance (CMVN) statistics in an archive file:
s5# head -3 data/train/cmvn.scp
2001-A /home/dpovey/kaldi-trunk/egs/swbd/s5/mfcc/cmvn_train.ark:7
2001-B /home/dpovey/kaldi-trunk/egs/swbd/s5/mfcc/cmvn_train.ark:253
2005-A /home/dpovey/kaldi-trunk/egs/swbd/s5/mfcc/cmvn_train.ark:499
cmvn.scp is created by a command like:
steps/compute_cmvn_stats.sh data/train exp/make_mfcc/train $mfccdir

Because it is easy to make mistakes during data preparation, there are scripts to validate and, where possible, repair a data directory:
utils/validate_data_dir.sh data/train
utils/fix_data_dir.sh data/train

Data preparation-- the "lang" directory.

The contents of the lang directory are:

s5# ls data/lang
L.fst  L_disambig.fst  oov.int	oov.txt  phones  phones.txt  topo  words.txt

There may also be directories like data/lang_test, which contain the same files plus G.fst:

s5# ls data/lang_test
G.fst  L.fst  L_disambig.fst  oov.int  oov.txt	phones	phones.txt  topo  words.txt

lang_test/ was created by copying lang/ and adding the language-model FST G.fst.

s5# ls data/lang/phones
context_indep.csl  disambig.txt         nonsilence.txt        roots.txt    silence.txt
context_indep.int  extra_questions.int  optional_silence.csl  sets.int     word_boundary.int
context_indep.txt  extra_questions.txt  optional_silence.int  sets.txt     word_boundary.txt
disambig.csl       nonsilence.csl       optional_silence.txt  silence.csl

There are quite a few files in the phones directory. Fortunately you, as a Kaldi user, do not have to create them yourself: the script utils/prepare_lang.sh creates all of them for you from a few simpler inputs.
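Those simpler inputs live in a dictionary directory, conventionally data/local/dict. In the Switchboard setup it typically contains files like the following (a setup may provide lexiconp.txt, with pronunciation probabilities, instead of lexicon.txt):

s5# ls data/local/dict
extra_questions.txt  lexicon.txt  nonsilence_phones.txt  optional_silence.txt  silence_phones.txt

prepare_lang.sh is then run on this directory exactly as in the RM and WSJ examples at the top of this page.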



For decoding, the models are compiled into a single decoding graph, HCLG = H ∘ C ∘ L ∘ G. The graph is built from the inside out, in the order G, then L, then C, then H:

1.G.fst: The Language Model FST

G is an acceptor (FSA) representing the grammar or language model, typically built from an n-gram language model; it assigns weights to word sequences.

2.L_disambig.fst: The Phonetic Dictionary with Disambiguation Symbols FST

L.fst is the finite-state-transducer form of the lexicon, with phone symbols on the input and word symbols on the output; L_disambig.fst is the same lexicon with disambiguation symbols (#1, #2, ...) added. Composing the lexicon with G yields LG, which maps phone sequences to weighted word sequences.

3.C.fst: The Context FST

C adds phonetic context: its input symbols represent context-dependent phones (e.g. triphones) and its output symbols are context-independent (mono)phones. Composing it with LG yields CLG, which maps context-dependent phone sequences to word sequences.

4.H.fst: The HMM FST

H encodes the HMM structure: its input symbols are transition-ids (which encode the pdf-id together with the HMM state and transition taken) and its output symbols are context-dependent phones. Composing it with CLG yields HCLG, whose inputs are transition-ids and whose outputs are words.

HCLG.fst: final graph

Composing steps 1 through 4 gives HCLG, the final WFST used for decoding.

Its input symbols are transition-ids (which resolve to pdf-ids) and its output symbols are the corresponding words.
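In practice you do not perform these compositions by hand: once data/lang_test (containing G.fst) and a trained acoustic model directory exist, a single script builds the graph. A typical invocation (exp/tri1 is just an example model directory) is:

utils/mkgraph.sh data/lang_test exp/tri1 exp/tri1/graph

which writes HCLG.fst and related files into exp/tri1/graph.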

Contents of the "lang" directory

phones.txt and words.txt are symbol-table files in the OpenFst format; each line contains the text form followed by the integer form:

s5# head -3 data/lang/phones.txt
<eps> 0
SIL 1
SIL_B 2
s5# head -3 data/lang/words.txt
<eps> 0
!SIL 1
-'S 2

These files are mostly accessed only by the scripts utils/int2sym.pl and utils/sym2int.pl, and by the OpenFst programs fstcompile and fstprint.
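For example (a sketch, assuming data/train/text has been prepared as described earlier), sym2int.pl can convert the transcriptions to integer form while leaving the utterance-id in field 1 untouched:

utils/sym2int.pl -f 2- data/lang/words.txt < data/train/text | head -1

The training scripts call it with an extra --map-oov option (passing the integer from oov.int, described below) so that out-of-vocabulary words are mapped to the OOV word instead of causing an error.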

 

L.fst is the finite-state-transducer form of the lexicon, with phone symbols on the input side and word symbols on the output side (see "Speech Recognition with Weighted Finite-State Transducers" by Mohri, Pereira and Riley, in the Springer Handbook on Speech Processing and Speech Communication, 2008).

L_disambig.fst is the same lexicon, but including the disambiguation symbols #1, #2 and so on.

The file data/lang/oov.txt contains just a single line:

s5# cat data/lang/oov.txt
<UNK>
This is the word to which all out-of-vocabulary words are mapped during training. Here it is <UNK>, which in the lexicon maps to a single phone SPN (short for "spoken noise"); we treat SPN as a "garbage phone" that ends up aligned with various kinds of spoken noise:

s5# grep -w UNK data/local/dict/lexicon.txt
<UNK> SPN

The file oov.int contains the integer form of this word (its entry in words.txt).


The file data/lang/topo specifies the topology of the HMMs:
s5# cat data/lang/topo
<Topology>
<TopologyEntry>
<ForPhones>
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188
</ForPhones>
<State> 0 <PdfClass> 0 <Transition> 0 0.75 <Transition> 1 0.25 </State>
<State> 1 <PdfClass> 1 <Transition> 1 0.75 <Transition> 2 0.25 </State>
<State> 2 <PdfClass> 2 <Transition> 2 0.75 <Transition> 3 0.25 </State>
<State> 3 </State>
</TopologyEntry>
<TopologyEntry>
<ForPhones>
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
</ForPhones>
<State> 0 <PdfClass> 0 <Transition> 0 0.25 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 </State>
<State> 1 <PdfClass> 1 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 <Transition> 4 0.25 </State>
<State> 2 <PdfClass> 2 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 <Transition> 4 0.25 </State>
<State> 3 <PdfClass> 3 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 <Transition> 4 0.25 </State>
<State> 4 <PdfClass> 4 <Transition> 4 0.75 <Transition> 5 0.25 </State>
<State> 5 </State>
</TopologyEntry>
</Topology>
The first entry gives the standard three-state left-to-right topology used for the ordinary ("real") phones, which here are phones 21 through 188; the second entry gives the five-state topology used for the silence and noise phones, phones 1 through 20.

Looking at a few of the files in data/lang/phones/:
s5# head -3 data/lang/phones/context_indep.txt   # context-independent phones (text form)
SIL
SIL_B
SIL_E
s5# head -3 data/lang/phones/context_indep.int   # the same, in integer form
1
2
3
s5# cat data/lang/phones/context_indep.csl       # the same, as a colon-separated list of integers
1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:17:18:19:20


context_indep.txt contains the list of phones for which we build context-independent models, i.e. for which we do not build a decision tree that asks questions about the left and right phonetic context. In fact we do build smaller trees for these phones, which only ask questions about the central phone and the HMM state; how this works depends on the roots.txt file, described next.

The file roots.txt contains information about how we build the phonetic-context decision tree. On each line, "shared split" means the word-position-dependent variants of that phone share a single tree root, which is then split by asking questions (see the output below):

s5# head data/lang/phones/roots.txt
shared split SIL SIL_B SIL_E SIL_I SIL_S
shared split SPN SPN_B SPN_E SPN_I SPN_S
shared split NSN NSN_B NSN_E NSN_I NSN_S
shared split LAU LAU_B LAU_E LAU_I LAU_S
...
shared split B_B B_E B_I B_S
In addition, all three states of an HMM (or all five states, for the silence phones) share the tree root.

Returning to context_indep.txt, its full contents in this setup are the silence and noise phones and their word-position-dependent variants:


# cat data/lang/phones/context_indep.txt
SIL
SIL_B
SIL_E
SIL_I
SIL_S
SPN
SPN_B
SPN_E
SPN_I
SPN_S
NSN
NSN_B
NSN_E
NSN_I
NSN_S
LAU
LAU_B
LAU_E
LAU_I
LAU_S

There are many variants of these phones because of word-position dependency; for example SIL has the variants SIL_B (word-begin), SIL_I (word-internal), SIL_E (word-end) and SIL_S (singleton, i.e. a whole word). Not all of these variants will necessarily ever be used.

The distinction between silence.txt and nonsilence.txt is as follows: the "nonsilence" phones are those on which we intend to estimate various kinds of linear transforms, that is, global transforms such as LDA and MLLT, and speaker-adaptation transforms such as fMLLR. It does not pay to include silence in the estimation of such transforms.

s5# head -3 data/lang/phones/silence.txt
SIL
SIL_B
SIL_E
s5# head -3 data/lang/phones/nonsilence.txt
IY_B
IY_E
IY_I
s5# head -3 data/lang/phones/disambig.txt
#0
#1
#2

The file optional_silence.txt contains a single phone that can optionally appear between words:
s5# cat data/lang/phones/optional_silence.txt
SIL
Each line of sets.txt groups together all the word-position-dependent versions of one phone:
s5# head -3 data/lang/phones/sets.txt
SIL SIL_B SIL_E SIL_I SIL_S
SPN SPN_B SPN_E SPN_I SPN_S
NSN NSN_B NSN_E NSN_I NSN_S
 
extra_questions.txt contains extra questions that supplement the automatically generated ones when building the decision tree; here they relate mainly to word position:
s5# cat data/lang/phones/extra_questions.txt
IY_B B_B D_B F_B G_B K_B SH_B L_B M_B N_B OW_B AA_B TH_B P_B OY_B R_B UH_B AE_B S_B T_B AH_B V_B W_B Y_B Z_B CH_B AO_B DH_B UW_B ZH_B EH_B AW_B AX_B EL_B AY_B EN_B HH_B ER_B IH_B JH_B EY_B NG_B
IY_E B_E D_E F_E G_E K_E SH_E L_E M_E N_E OW_E AA_E TH_E P_E OY_E R_E UH_E AE_E S_E T_E AH_E V_E W_E Y_E Z_E CH_E AO_E DH_E UW_E ZH_E EH_E AW_E AX_E EL_E AY_E EN_E HH_E ER_E IH_E JH_E EY_E NG_E
IY_I B_I D_I F_I G_I K_I SH_I L_I M_I N_I OW_I AA_I TH_I P_I OY_I R_I UH_I AE_I S_I T_I AH_I V_I W_I Y_I Z_I CH_I AO_I DH_I UW_I ZH_I EH_I AW_I AX_I EL_I AY_I EN_I HH_I ER_I IH_I JH_I EY_I NG_I
IY_S B_S D_S F_S G_S K_S SH_S L_S M_S N_S OW_S AA_S TH_S P_S OY_S R_S UH_S AE_S S_S T_S AH_S V_S W_S Y_S Z_S CH_S AO_S DH_S UW_S ZH_S EH_S AW_S AX_S EL_S AY_S EN_S HH_S ER_S IH_S JH_S EY_S NG_S
SIL SPN NSN LAU
SIL_B SPN_B NSN_B LAU_B
SIL_E SPN_E NSN_E LAU_E
SIL_I SPN_I NSN_I LAU_I
SIL_S SPN_S NSN_S LAU_S
The first four questions ask about word position for the regular phones, and the last five do the same for the silence and noise phones.


The file word_boundary.txt explains how the phones relate to word positions:

s5# head  data/lang/phones/word_boundary.txt
SIL nonword
SIL_B begin
SIL_E end
SIL_I internal
SIL_S singleton
SPN nonword
SPN_B begin
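The integer version of this file, word_boundary.int, is what Kaldi binaries read when they need to recover word boundaries from aligned lattices, for example when producing ctm output for scoring. A hedged sketch of such a pipeline (the exp/tri4 paths and the LM scale are illustrative, not taken from this recipe; this is roughly what steps/get_ctm.sh does in the standard recipes):

lattice-1best --lm-scale=12 "ark:gunzip -c exp/tri4/decode/lat.1.gz|" ark:- | \
  lattice-align-words data/lang/phones/word_boundary.int exp/tri4/final.mdl ark:- ark:- | \
  nbest-to-ctm ark:- - | \
  utils/int2sym.pl -f 5 data/lang/words.txt > decode.ctm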
 

Creating the language model or grammar

The lang directory produced by prepare_lang.sh does not include G.fst; the grammar or language model is created separately and added to a copy of lang (giving, for example, data/lang_test). In the example scripts at the top of this page this is done by commands such as local/rm_prepare_grammar.sh (RM) or local/wsj_format_data.sh (WSJ), which compile the grammar or ARPA-format language model into an FST.
Source: http://www.kaldi-asr.org/doc/data_prep.html

