HMM原理簡述和使用說明

本文轉載自查看原文 2017-10-24 15:46 1916

1、原理簡述　　

為了對GMM-HMM在語音識別上的應用有個宏觀認識，花了些時間讀了下HTK（用htk完成簡單的孤立詞識別）的部分源碼，對該算法總算有了點大概認識，達到了預期我想要的。不得不說，網絡上關於語音識別的通俗易懂教程太少，都是各種公式滿天飛，很少有說具體細節的，當然了，那需要有實戰經驗才行。下面總結以下幾點，對其有個宏觀印象即可（以孤立詞識別為例）。

　　一、每個單詞的讀音都對應一個HMM模型，大家都知道HMM模型中有個狀態集S，那么每個狀態用什么來表示呢，數字？向量？矩陣？其實這個狀態集中的狀態沒有具體的數學要求，只是一個名稱而已，你可以用’1’, ’2’, ‘3’…表示，也可以用’a’, ‘b’, ’c ’表示。另外每個HMM模型中到底該用多少個狀態，是通過先驗知識人為設定的。

　　二、HMM的每一個狀態都對應有一個觀察值，這個觀察值可以是一個實數，也可以是個向量，且每個狀態對應的觀察值的維度應該相同。假設現在有一個單詞的音頻文件，首先需要將其進行采樣得到數字信息（A/D轉換），然后分幀進行MFCC特征提取，假設每一幀音頻對應的MFCC特征長度為39，則每個音頻文件就轉換成了N個MFCC向量（不同音頻文件對應的N可能不同），這就成了一個序列，而在訓練HMM模型的參數時（比如用Baum-Welch算法），每次輸入到HMM中的數據要求就是一個觀測值序列。這時，每個狀態對應的觀測值為39維的向量，因為向量中元素的取值是連續的，需要用多維密度函數來模擬，通常情況下用的是多維高斯函數。在GMM-HMM體系中，這個擬合函數是用K個多維高斯混合得到的。假設知道了每個狀態對應的K個多維高斯的所有參數，則該GMM生成該狀態上某一個觀察向量（一幀音頻的MFCC系數）的概率就可以求出來了。

　　三、對每個單詞建立一個HMM模型，需要用到該單詞的訓練樣本，這些訓練樣本是提前標注好的，即每個樣本對應一段音頻，該音頻只包含這個單詞的讀音。當有了該單詞的多個訓練樣本后，就用這些樣本結合Baum-Welch算法和EM算法來訓練出GMM-HMM的所有參數，這些參數包括初始狀態的概率向量，狀態之間的轉移矩陣，每個狀態對應的觀察矩陣（這里對應的是GMM，即每個狀態對應的K個高斯的權值，每個高斯的均值向量和方差矩陣）。

　　四、在識別階段，輸入一段音頻，如果該音頻含有多個單詞，則可以手動先將其分割開（考慮的是最簡單的方法），然后提取每個單詞的音頻MFCC特征序列，將該序列輸入到每個HMM模型（已提前訓練好的）中，采用前向算法求出每個HMM模型生成該序列的概率，最后取最大概率對應的那個模型，而那個模型所表示的單詞就是我們識別的結果。

　　五、在建立聲學模型時，可以用Deep Learning的方法來代替GMM-HMM中的GMM，因為GMM模擬任意函數的功能取決於混合高斯函數的個數，所以具有一定的局限性，屬於淺層模型。而Deep Network可以模擬任意的函數，因而表達能力更強。注意，這里用來代替GMM的Deep Nets模型要求是產生式模型，比如DBN，DBM等，因為在訓練HMM-DL網絡時，需要用到HMM的某個狀態產生一個樣本的概率。

　　六、GMM-HMM在具體實現起來還是相當復雜的。

　　七、一般涉及到時間序列時才會使用HMM，比如這里音頻中的語音識別，視頻中的行為識別等。如果我們用GMM-HMM對靜態的圖片分類，因為這里沒涉及到時間信息，所以HMM的狀態數可設為1，那么此時的GMM-HMM算法就退化成GMM算法了。

2、使用說明

一、離散輸出的隱馬爾科夫模型（DHMM，HMM with discrete outputs）

最大似然參數估計EM（Baum Welch算法）

The script dhmm_em_demo.m gives an example of how to learn an HMM with discrete outputs. Let there be Q=2 states and O=3 output symbols. We create random stochastic matrices as follows.

O = 3;

Q = 2;

prior0 = normalise(rand(Q,1));

transmat0 = mk_stochastic(rand(Q,Q));

obsmat0 = mk_stochastic(rand(Q,O));

Now we sample nex=20 sequences of length T=10 each from this model, to use as training data.

T=10; %序列長度

nex=20; %樣本序列數目

data = dhmm_sample(prior0, transmat0, obsmat0, nex, T);

Here data is 20x10. Now we make a random guess as to what the parameters are,

prior1 = normalise(rand(Q,1)); %初始狀態概率

transmat1 = mk_stochastic(rand(Q,Q)); %初始狀態轉移矩陣

obsmat1 = mk_stochastic(rand(Q,O)); %初始觀測狀態到隱藏狀態間的概率轉移矩陣

and improve our guess using 5 iterations of EM...

[LL, prior2, transmat2, obsmat2] = dhmm_em(data, prior1, transmat1, obsmat1, 'max_iter', 5);

%prior2, transmat2, obsmat2 為訓練好后的初始概率，狀態轉移矩陣及混合狀態概率轉移矩陣

LL(t) is the log-likelihood after iteration t, so we can plot the learning curve.

序列分類

To evaluate the log-likelihood of a trained model given test data, proceed as follows:

loglik = dhmm_logprob(data, prior2, transmat2, obsmat2) %HMM測試

Note: the discrete alphabet is assumed to be {1, 2, ..., O}, where O = size(obsmat, 2). Hence data cannot contain any 0s.

To classify a sequence into one of k classes, train up k HMMs, one per class, and then compute the log-likelihood that each model gives to the test sequence; if the i'th model is the most likely, then declare the class of the sequence to be class i.

Computing the most probable sequence (Viterbi)

First you need to evaluate B(i,t) = P(y_t | Q_t=i) for all t,i:

B = multinomial_prob(data, obsmat);

Then you can use

[path] = viterbi_path(prior, transmat, B)

二、具有高斯混合輸出的隱馬爾科夫模型（GHMM，HMM with mixture of Gaussians outputs）

Maximum likelihood parameter estimation using EM (Baum Welch)

Let us generate nex=50 vector-valued sequences of length T=50; each vector has size O=2.

O = 2;

T = 50;

nex = 50;

data = randn(O,T,nex);

Now let use fit a mixture of M=2 Gaussians for each of the Q=2 states using K-means.

M = 2;

Q = 2;

left_right = 0;

prior0 = normalise(rand(Q,1));

transmat0 = mk_stochastic(rand(Q,Q));

[mu0, Sigma0] = mixgauss_init(Q*M, reshape(data, [O T*nex]), cov_type);

mu0 = reshape(mu0, [O Q M]);

Sigma0 = reshape(Sigma0, [O O Q M]);

mixmat0 = mk_stochastic(rand(Q,M));

Finally, let us improve these parameter estimates using EM.

[LL, prior1, transmat1, mu1, Sigma1, mixmat1] = ...

mhmm_em(data, prior0, transmat0, mu0, Sigma0, mixmat0, 'max_iter', 2);

Since EM only finds a local optimum, good initialisation is crucial. The initialisation procedure illustrated above is very crude, and is probably not adequate for real applications... Click here for a real-world example of EM with mixtures of Gaussians using BNT.

What to do if the log-likelihood becomes positive?

It is possible for p(x) > 1 if p(x) is a probability density function, such as a Gaussian. (The requirements for a density are p(x)>0 for all x and int_x p(x) = 1.) In practice this usually means your covariance is shrinking to a point/delta function, so you should increase the width of the prior (see below), or constrain the matrix to be spherical or diagonal, or clamp it to a large fixed constant (not learn it at all). It is also very helpful to ensure the components of the data vectors have small and comparable magnitudes (use e.g., KPMstats/standardize).

This is a well-known pathology of maximum likelihood estimation for Gaussian mixtures: the global optimum may place one mixture component on a single data point, and give it 0 covariance, and hence infinite likelihood. One usually relies on the fact that EM cannot find the global optimum to avoid such pathologies.

What to do if the log-likelihood decreases during EM?

Since I implicitly add a prior to every covariance matrix (see below), what increases is loglik + log(prior), but what I print is just loglik, which may occasionally decrease. This suggests that one of your mixture components is not getting enough data. Try a better initialization or fewer clusters (states).

What to do if the covariance matrix becomes singular?

Estimates of the covariance matrix often become singular if you have too little data, or if too few points are assigned to a cluster center due to a bad initialization of the means. In this case, you should constrain the covariance to be spherical or diagonal, or adjust the prior (see below), or try a better initialization.

How do I add a prior to the covariance matrix?

Buried inside of KPMstats/mixgauss_Mstep you will see that cov_prior is initialized to 0.01*I. This is added to the maximum likelihood estimate after every M step. To change this, you will need to modify the mhmm_em function so it calls mixgauss_Mstep with a different value.

Sequence classification

To classify a sequence (e.g., of speech) into one of k classes (e.g., the digits 0-9), proceed as in the DHMM case above, but use the following procedure to compute likelihood:

loglik = mhmm_logprob(data, prior, transmat, mu, Sigma, mixmat);

Computing the most probable sequence (Viterbi)

First you need to evaluate B(t,i) = P(y_t | Q_t=i) for all t,i:

B = mixgauss_prob(data(:,:,ex), mu, Sigma, mixmat);

where data(:,:,ex) is OxT where O is the size of the observation vector. Finally, use

[path] = viterbi_path(prior, transmat, B);

三、具有高斯輸出的HMM

This is just like the mixture of Gaussians case, except we have M=1, and hence there is no mixing matrix.

Online EM for discrete HMMs/ POMDPs

For some applications (e.g., reinforcement learning/ adaptive control), it is necessary to learn a model online. The script dhmm_em_online_demo gives an example of how to do this.

　　MFCC:

　　MFCC的matlab實現教程可參考：張智星老師的網頁教程mfcc. 最基本的12維特征。

[ruby] view plain copy

function mfcc=frame2mfcc(frame, fs, filterNum, mfccNum, plotOpt)
% frame2mfcc: Frame to MFCC conversion.
% Usage: mfcc=frame2mfcc(frame, fs, filterNum, mfccNum, plotOpt)
%
% For example:
% waveFile='what_movies_have_you_seen_recently.wav';
% [y, fs, nbits]=wavReadInt(waveFile);
% startIndex=12000;
% frameSize=512;
% frame=y(startIndex:startIndex+frameSize-1);
% frame2mfcc(frame, fs, 20, 12, 1);
% Roger Jang 20060417
if nargin<1, selfdemo; return; end
if nargin<2, fs=16000; end
if nargin<3, filterNum=20; end
if nargin<4, mfccNum=12; end
if nargin<5, plotOpt=0; end
frameSize=length(frame);
% ====== Preemphasis should be done at wave level
%a=0.95;
%frame2 = filter([1, -a], 1, frame);
frame2=frame;
% ====== Hamming windowing
frame3=frame2.*hamming(frameSize);
% ====== FFT
[fftMag, fftPhase, fftFreq, fftPowerDb]=fftOneSide(frame3, fs);
% ====== Triangular band-pass filter bank
triFilterBankPrm=getTriFilterBankPrm(fs, filterNum); % Get parameters for triangular band-pass filter bank
% Triangular bandpass filter.
for i=1:filterNum
tbfCoef(i)=dot(fftPowerDb, trimf(fftFreq, triFilterBankPrm(:,i)));%得到filterNum個濾波系數
end
% ====== DCT
mfcc=zeros(mfccNum, 1); %DCT變換的前后個數也沒有變
for i=1:mfccNum
coef = cos((pi/filterNum)*i*((1:filterNum)-0.5))'; %mfcc中的前mfccNum個系數
mfcc(i) = sum(coef.*tbfCoef');%直接按照DCT公式
end
% ====== Log energy
%logEnergy=10*log10(sum(frame.*frame));
%mfcc=[logEnergy; mfcc];
if plotOpt
subplot(2,1,1);
plot(frame, '.-');
set(gca, 'xlim', [-inf inf]);
title('Input frame');
subplot(2,1,2);
plot(mfcc, '.-');
set(gca, 'xlim', [-inf inf]);
title('MFCC vector');
end
% ====== trimf.m (from fuzzy toolbox)
function y = trimf(x, prm) %由頻率的橫坐標算出三角形內的縱坐標,0~1
a = prm(1); b = prm(2); c = prm(3);
y = zeros(size(x));
% Left and right shoulders (y = 0)
index = find(x <= a | c <= x);
y(index) = zeros(size(index)); %只考慮三角波內的量
% Left slope
if (a ~= b)
index = find(a < x & x < b);
y(index) = (x(index)-a)/(b-a);
end
% right slope
if (b ~= c)
index = find(b < x & x < c);
y(index) = (c-x(index))/(c-b);
end
% Center (y = 1)
index = find(x == b);
y(index) = ones(size(index));
% ====== Self demo
function selfdemo
waveFile='what_movies_have_you_seen_recently.wav';
[y, fs, nbits]=wavReadInt(waveFile);
startIndex=12000;
frameSize=512;
frame=y(startIndex:startIndex+frameSize-1);
feval(mfilename, frame, fs, 20, 12, 1);

ZCR:

　　過0檢測，用於判斷每一幀中過零點的數量情況，最簡單的版本可參考：zeros cross rate.

[ruby] view plain copy

waveFile='csNthu.wav';
frameSize=256;
overlap=0;
[y, fs, nbits]=wavread(waveFile);
frameMat=enframe(y, frameSize, overlap);
frameNum=size(frameMat, 2);
for i=1:frameNum
frameMat(:,i)=frameMat(:,i)-mean(frameMat(:,i)); % mean justification
end
zcr=sum(frameMat(1:end-1, :).*frameMat(2:end, :)<0);
sampleTime=(1:length(y))/fs;
frameTime=((0:frameNum-1)*(frameSize-overlap)+0.5*frameSize)/fs;
subplot(2,1,1); plot(sampleTime, y); ylabel('Amplitude'); title(waveFile);
subplot(2,1,2); plot(frameTime, zcr, '.-');
xlabel('Time (sec)'); ylabel('Count'); title('ZCR');

EPD:

　　端點檢測，檢測聲音的起始點和終止點，可參考：EPD in Time Domain,在時域中的最簡單檢測方法。

[ruby] view plain copy

waveFile='sunday.wav';
[wave, fs, nbits] = wavread(waveFile);
frameSize = 256;
overlap = 128;
wave=wave-mean(wave); % zero-mean substraction
frameMat=buffer2(wave, frameSize, overlap); % frame blocking,每一列代表一幀
frameNum=size(frameMat, 2); % no. of frames
volume=frame2volume(frameMat); % volume,求每一幀的能量，絕對值或者平方和,volume為行向量
volumeTh1=max(volume)*0.1; % volume threshold 1
volumeTh2=median(volume)*0.1; % volume threshold 2
volumeTh3=min(volume)*10; % volume threshold 3
volumeTh4=volume(1)*5; % volume threshold 4
index1 = find(volume>volumeTh1); %找出volume大於閾值的那些幀序號
index2 = find(volume>volumeTh2);
index3 = find(volume>volumeTh3);
index4 = find(volume>volumeTh4);
%frame2sampleIndex()為從幀序號找到樣本點的序號(即每一個采樣點的序號)
%endPointX長度為2,包含了起點和終點的樣本點序號
endPoint1=frame2sampleIndex([index1(1), index1(end)], frameSize, overlap);
endPoint2=frame2sampleIndex([index2(1), index2(end)], frameSize, overlap);
endPoint3=frame2sampleIndex([index3(1), index3(end)], frameSize, overlap);
endPoint4=frame2sampleIndex([index4(1), index4(end)], frameSize, overlap);
subplot(2,1,1);
time=(1:length(wave))/fs;
plot(time, wave);
ylabel('Amplitude'); title('Waveform');
axis([-inf inf -1 1]);
line(time(endPoint1( 1))*[1 1], [-1, 1], 'color', 'm');%標起點終點線
line(time(endPoint2( 1))*[1 1], [-1, 1], 'color', 'g');
line(time(endPoint3( 1))*[1 1], [-1, 1], 'color', 'k');
line(time(endPoint4( 1))*[1 1], [-1, 1], 'color', 'r');
line(time(endPoint1(end))*[1 1], [-1, 1], 'color', 'm');
line(time(endPoint2(end))*[1 1], [-1, 1], 'color', 'g');
line(time(endPoint3(end))*[1 1], [-1, 1], 'color', 'k');
line(time(endPoint4(end))*[1 1], [-1, 1], 'color', 'r');
legend('Waveform', 'Boundaries by threshold 1', 'Boundaries by threshold 2', 'Boundaries by threshold 3', 'Boundaries by threshold 4');
subplot(2,1,2);
frameTime=frame2sampleIndex(1:frameNum, frameSize, overlap);
plot(frameTime, volume, '.-');
ylabel('Sum of Abs.'); title('Volume');
axis tight;
line([min(frameTime), max(frameTime)], volumeTh1*[1 1], 'color', 'm');
line([min(frameTime), max(frameTime)], volumeTh2*[1 1], 'color', 'g');
line([min(frameTime), max(frameTime)], volumeTh3*[1 1], 'color', 'k');
line([min(frameTime), max(frameTime)], volumeTh4*[1 1], 'color', 'r');
legend('Volume', 'Threshold 1', 'Threshold 2', 'Threshold 3', 'Threshold 4');

GMM:

　 GMM用在擬合數據分布上，本質上是先假設樣本的概率分布為GMM，然后用多個樣本去學習這些GMM的參數。GMM建模在語音中可用於某個單詞的發音，某個人的音色等。其訓練過程可參考:speaker recognition.

[ruby] view plain copy

function [M, V, W, logProb] = gmmTrain(data, gaussianNum, dispOpt)
% gmmTrain: Parameter training for gaussian mixture model (GMM)
% Usage: function [M, V, W, logProb] = gmm(data, gaussianNum, dispOpt)
% data: dim x dataNum matrix where each column is a data point
% gaussianNum: No. of Gaussians or initial centers
% dispOpt: Option for displaying info during training
% M: dim x meanNum matrix where each column is a mean vector
% V: 1 x gaussianNum vector where each element is a variance for a Gaussian
% W: 1 x gaussianNum vector where each element is a weighting factor for a Gaussian
% Roger Jang 20000610
if nargin==0, selfdemo; return; end
if nargin<3, dispOpt=0; end
maxLoopCount = 50; % Max. iteration
minImprove = 1e-6; % Min. improvement
minVariance = 1e-6; % Min. variance
logProb = zeros(maxLoopCount, 1); % Array for objective function
[dim, dataNum] = size(data);
% Set initial parameters
% Set initial M
%M = data(1+floor(rand(gaussianNum,1)*dataNum),:); % Randomly select several data points as the centers
if length(gaussianNum)==1,
% Using vqKmeans to find initial centers
fprintf('Start KMEANS to find the initial mu...\n');
% M = vqKmeansMex(data, gaussianNum, 0);
M = vqKmeans(data, gaussianNum, 0); %利用聚類的方法求均值,聚成gaussianNum類
% M = vqLBG(data, gaussianNum, 0);
fprintf('Start GMM training...\n');
if any(any(~isfinite(M))); keyboard; end
else
% gaussianNum is in fact the initial centers
M = gaussianNum;
gaussianNum = size(M, 2);
end
% Set initial V as the distance to the nearest center
if gaussianNum==1
V=1;
else
distance=pairwiseSqrDist(M);%pairwiseSqrDist是dll
%distance=pairwiseSqrDist2(M);
distance(1:(gaussianNum+1):gaussianNum^2)=inf; % Diagonal elements are inf
[V, index]=min(distance); % Initial variance for each Gaussian
end
% Set initial W
W = ones(1, gaussianNum)/gaussianNum; % Weight for each Gaussian,初始化時是均分權值
if dispOpt & dim==2, displayGmm(M, V, data); end
for i = 1:maxLoopCount %開始迭代訓練參數,EM算法
% Expectation step:
% P(i,j) is the probability of data(:,j) to the i-th Gaussian
% Prob為每個樣本在GMM下的概率
[prob, P]=gmmEval(data, M, V, W);
logProb(i)=sum(log(prob)); %所有樣本的聯合概率
if dispOpt
fprintf('i = %d, log prob. = %f\n',i-1, logProb(i));
end
PW = diag(W)*P;
BETA=PW./(ones(gaussianNum,1)*sum(PW)); % BETA(i,j) is beta_i(x_j)
sumBETA=sum(BETA,2);
% Maximization step: eqns (2.96) to (2.98) from Bishop p.67:
M = (data*BETA')./(ones(dim,1)*sumBETA');
DISTSQ = pairwiseSqrDist(M, data); % Distance of M to data
%DISTSQ = pairwiseSqrDist2(M, data); % Distance of M to data
V = max((sum(BETA.*DISTSQ, 2)./sumBETA)/dim, minVariance); % (2.97)
W = (1/dataNum)*sumBETA; % (2.98)
if dispOpt & dim==2, displayGmm(M, V, data); end
if i>1, if logProb(i)-logProb(i-1)<minImprove, break; end; end
end
[prob, P]=gmmEval(data, M, V, W);
logProb(i)=sum(log(prob));
fprintf('Iteration count = %d, log prob. = %f\n',i, logProb(i));
logProb(i+1:maxLoopCount) = [];
% ====== Self Demo ======
function selfdemo
%[data, gaussianNum] = dcdata(2);
data = rand(1000,2);
gaussianNum = 8;
data=data';
plotOpt=1;
[M, V, W, lp] = feval(mfilename, data, gaussianNum, plotOpt);
pointNum = 40;
x = linspace(min(data(1,:)), max(data(1,:)), pointNum);
y = linspace(min(data(2,:)), max(data(2,:)), pointNum);
[xx, yy] = meshgrid(x, y);
data = [xx(:) yy(:)]';
z = gmmEval(data, M, V, W);
zz = reshape(z, pointNum, pointNum);
figure; mesh(xx, yy, zz); axis tight; box on; rotate3d on
figure; contour(xx, yy, zz, 30); axis image
% ====== Other subfunctions ======
function displayGmm(M, V, data)
% Display function for EM algorithm
figureH=findobj(0, 'tag', mfilename);
if isempty(figureH)
figureH=figure;
set(figureH, 'tag', mfilename);
colordef black
plot(data(1,:), data(2,:),'.r'); axis image
theta=linspace(-pi, pi, 21);
x=cos(theta); y=sin(theta);
sigma=sqrt(V);
for i=1:length(sigma)
circleH(i)=line(x*sigma(i)+M(1,i), y*sigma(i)+M(2,i), 'color', 'y');
end
set(circleH, 'tag', 'circleH', 'erasemode', 'xor');
else
circleH=findobj(figureH, 'tag', 'circleH');
theta=linspace(-pi, pi, 21);
x=cos(theta); y=sin(theta);
sigma=sqrt(V);
for i=1:length(sigma)
set(circleH(i), 'xdata', x*sigma(i)+M(1,i), 'ydata', y*sigma(i)+M(2,i));
end
drawnow
end

Speaker identification:

　給N個人的語音資料，用GMM可以訓練這N個人的聲音模型，然后給定一段語音，判斷該語音與這N個人中哪個最相似。方法是求出該語音在N個GMM模型下的概率，選出概率最大的那個。可參考:speaker recognition.

[ruby] view plain copy

function [recogRate, confusionMatrix, speakerData]=speakerIdentify(speakerData, speakerGmm, useIntGmm)
% speakerIdentify: speaker identification using GMM parameters
% Usage: [recogRate, confusionMatrix, speakerData]=speakerIdentify(speakerData, speakerGmm, useIntGmm)
% speakerData: structure array generated by speakerDataRead.m
% speakerGmm: speakerGmm(i).gmmPrm is the GMM parameters for speaker i.
% useIntGmm: use fixed-point GMM
% Roger Jang, 20070517, 20080726
if nargin<3, useIntGmm=0; end
% ====== Speaker identification using GMM parameters
speakerNum=length(speakerData);
for i=1:speakerNum
% fprintf('%d/%d: Recognizing wave files by %s\n', i, speakerNum, speakerData(i).name);
for j=1:length(speakerData(i).sentence)
% fprintf('\tSentece %d...\n', j);
frameNum=size(speakerData(i).sentence(j).fea, 2);
logProb=zeros(speakerNum, frameNum); %logProb(i,m)表示第i個人第j個句子中第m幀在GMM模型下的log概率
%找出一個句子，看它屬於哪個speaker
for k=1:speakerNum,
% fprintf('\t\tSpeaker %d...\n', k);
% logProb(k, :)=gmmEval(speakerData(i).sentence(j).fea, speakerGmm(k).gmmPrm);
if ~useIntGmm
% logProb(k, :)=gmmEvalMex(speakerData(i).sentence(j).fea, gmm(k).mean, gmm(k).covariance, gmm(k).weight);
logProb(k, :)=gmmEval(speakerData(i).sentence(j).fea, speakerGmm(k).gmmPrm);
else
% logProb(k, :)=gmmEvalIntMex(speakerData(i).sentence(j).fea, gmm(k).mean, gmm(k).covariance, gmm(k).weight);
logProb(k, :)=gmmEvalIntMex(speakerData(i).sentence(j).fea, speakerGmm(i).gmmPrm);
end
end
cumLogProb=sum(logProb, 2);
[maxProb, index]=max(cumLogProb);
speakerData(i).sentence(j).predictedSpeaker=index; %找出身份
speakerData(i).sentence(j).logProb=logProb;
end
end
% ====== Compute confusion matrix and recognition rate
confusionMatrix=zeros(speakerNum);
for i=1:speakerNum,
predictedSpeaker=[speakerData(i).sentence.predictedSpeaker];
[index, count]=elementCount(predictedSpeaker);
confusionMatrix(i, index)=count;
end
recogRate=sum(diag(confusionMatrix))/sum(sum(confusionMatrix));

GMM-HMM:

　　訓練階段：給出HMM的k個狀態，每個狀態下的觀察樣本的生成可以用一個概率分布來擬合，這里是采用GMM擬合的。其實，可以把GMM-HMM整體看成是一個生成模型。給定該模型的5個初始參數(結合隨機和訓練樣本獲得)，啟動EM算法的E步：獲得訓練樣本分布，即計算訓練樣本在各個狀態下的概率。M步：用這些訓練樣本重新評估那5個參數。

　　測試階段：(以孤立詞識別為例)給定每個詞發音的frame矩陣，取出某一個GMM-HMM模型，算出該發音每一幀數據在取出的GMM-HMM模型各個state下的概率，結合模型的轉移概率和初始概率，獲得對應的clique tree，可用圖模型的方法inference出生成該語音的概率。比較多個GMM-HMM模型，取最大概率的模型對應的詞。

　 參考資料：

機器學習&數據挖掘筆記_13（用htk完成簡單的孤立詞識別）

http://htk.eng.cam.ac.uk/extensions/

張智星老師的網頁教程mfcc.

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 drizzleDumper的原理分析和使用說明 Shadows 使用說明 htmlcleaner 使用說明 VRTK使用說明 jgit使用說明 Podfile使用說明 MIPSsim使用說明 quicker使用說明 xtrabackup 使用說明 Kibana 使用說明