The concept of an energy model comes from statistical mechanics, where it describes the state of an entire system: the more ordered the system, the smaller its energy fluctuations and the closer it is to equilibrium; the more disordered, the larger the fluctuations. For example, in an isolated body whose internal temperature differs from place to place, heat flows from the hotter regions to the colder ones until the temperature is the same everywhere, i.e. the state of thermal equilibrium. In statistical mechanics, the relative probability of the system being in state $i$ is the Boltzmann factor $e^{-E_i/(k_B T)}$, where $T$ is the temperature, $k_B$ is the Boltzmann constant, and $E_i$ is the energy of state $i$. The Boltzmann factor is not itself a probability, because it is not normalized. To normalize it into a probability, we divide it by the sum of the Boltzmann factors over all possible states of the system, $Z = \sum_j e^{-E_j/(k_B T)}$, called the partition function. This gives the Boltzmann distribution $p_i = e^{-E_i/(k_B T)} / Z$.
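As a quick numerical sketch (my own illustration, not from the original post), the Boltzmann distribution can be computed directly from a list of state energies:

```python
import numpy as np

def boltzmann_distribution(energies, kT=1.0):
    """Normalize Boltzmann factors exp(-E/kT) into probabilities."""
    factors = np.exp(-np.asarray(energies, dtype=float) / kT)
    return factors / factors.sum()  # dividing by the partition function Z

# Three hypothetical states: lower energy -> higher probability
p = boltzmann_distribution([0.0, 1.0, 2.0], kT=1.0)
print(p, p.sum())
```

Lower-energy states receive higher probability, and the division by Z makes the probabilities sum to 1.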
A Boltzmann Machine (BM) is a special form of log-linear Markov Random Field (MRF), i.e. one whose energy function is linear in its free parameters. By introducing hidden units, we can increase the model's expressive power and represent very complex probability distributions. The Restricted Boltzmann Machine (RBM) adds further constraints: in an RBM there are no visible-to-visible connections and no hidden-to-hidden connections, as shown in the figure below.
In the restricted Boltzmann machine, the energy function is defined as $E(v,h) = -b^{T}v - c^{T}h - h^{T}Wv$, where $b$, $c$, and $W$ are the model parameters: $b$ and $c$ are the biases of the visible and hidden layers respectively, and $W$ is the weight matrix connecting the visible and hidden layers.
With the three formulas above we can use maximum likelihood estimation to solve for the model parameters. Define the free energy $F(v) = -\log \sum_{h} e^{-E(v,h)}$; the probability $p(v)$ can then be rewritten as $p(v) = e^{-F(v)} / Z$, with $Z = \sum_{v} e^{-F(v)}$.
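For binary hidden units the sum over $h$ in the free energy can be carried out analytically, giving $F(v) = -b^{T}v - \sum_i \log(1 + e^{c_i + W_i v})$. The following sketch (toy sizes and random weights are my own assumptions) checks this identity against brute-force enumeration of all hidden states:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(3, 4))   # hidden x visible weights (toy sizes)
b = np.zeros(4)                          # visible biases
c = np.zeros(3)                          # hidden biases

def energy(v, h):
    """E(v, h) = -b'v - c'h - h'Wv for a binary RBM."""
    return -b @ v - c @ h - h @ W @ v

def free_energy_analytic(v):
    """F(v) = -b'v - sum_i log(1 + exp(c_i + W_i v)), binary hidden units."""
    return -b @ v - np.sum(np.logaddexp(0.0, c + W @ v))

def free_energy_bruteforce(v):
    """F(v) = -log sum_h exp(-E(v, h)), enumerating all 2^3 hidden states."""
    hs = [np.array(h) for h in itertools.product([0, 1], repeat=3)]
    return -np.log(sum(np.exp(-energy(v, h)) for h in hs))

v = np.array([1, 0, 1, 1])
print(free_energy_analytic(v), free_energy_bruteforce(v))  # the two agree
```

With only a handful of hidden units the brute-force sum is feasible, which makes it a convenient sanity check for the analytic form.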
Because the units within each layer are conditionally independent given the other layer (the visible units are independent given $h$, and the hidden units are independent given $v$), we can use this property to obtain $p(h \mid v) = \prod_i p(h_i \mid v)$ and $p(v \mid h) = \prod_j p(v_j \mid h)$.
The probability of a unit taking the value 1 is given by the logistic function, as in logistic regression: $P(h_i = 1 \mid v) = \mathrm{sigm}(c_i + W_i v)$ and $P(v_j = 1 \mid h) = \mathrm{sigm}(b_j + W_j^{T} h)$, where $\mathrm{sigm}(x) = 1/(1+e^{-x})$.
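These conditionals translate directly into code; the sketch below uses made-up toy dimensions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
W = rng.normal(scale=0.1, size=(3, 4))  # hidden x visible weights (toy sizes)
b = np.zeros(4)
c = np.zeros(3)

def p_h_given_v(v):
    """P(h_i = 1 | v) = sigm(c_i + W_i v), one value per hidden unit."""
    return sigmoid(c + W @ v)

def p_v_given_h(h):
    """P(v_j = 1 | h) = sigm(b_j + W'_j h), one value per visible unit."""
    return sigmoid(b + W.T @ h)

v = np.array([1, 0, 1, 1])
probs = p_h_given_v(v)
h_sample = (rng.random(3) < probs).astype(int)  # Bernoulli sample of h
print(probs, h_sample)
```

Sampling a layer then amounts to comparing these probabilities against uniform random numbers, exactly as the MATLAB code at the end of the post does with `rand`.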
With the above in hand, we can derive the gradients for the parameter updates. Differentiating the negative log-likelihood gives, for each parameter $\theta$, $-\frac{\partial \log p(v)}{\partial \theta} = \frac{\partial F(v)}{\partial \theta} - \sum_{\tilde{v}} p(\tilde{v}) \frac{\partial F(\tilde{v})}{\partial \theta}$, which for the RBM parameters reduces to $\Delta w_{ij} \propto \langle v_j h_i \rangle_{\text{data}} - \langle v_j h_i \rangle_{\text{model}}$, $\Delta b_j \propto \langle v_j \rangle_{\text{data}} - \langle v_j \rangle_{\text{model}}$, and $\Delta c_i \propto \langle h_i \rangle_{\text{data}} - \langle h_i \rangle_{\text{model}}$. The model expectation is intractable to compute exactly, which is why sampling is needed.
We use Gibbs sampling, a Markov-chain method. For a $d$-dimensional random vector $x = (x_1, x_2, \ldots, x_d)$, suppose we cannot obtain the joint distribution $p(x)$ directly, but we do know the conditional distribution of each component $x_i$ given the others, $p(x_i \mid x_{-i})$, where $x_{-i} = (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_d)$. Then, starting from an arbitrary state $(x_1^{(0)}, x_2^{(0)}, \ldots, x_d^{(0)})$, we iteratively resample each component of the state from its conditional distribution. As the number of sweeps $n$ increases, the distribution of $(x_1^{(n)}, x_2^{(n)}, \ldots, x_d^{(n)})$ converges at a geometric rate in $n$ to the joint distribution $p(x)$.
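As a toy illustration of Gibbs sampling in general (my own example, not from the post), consider a 2-D Gaussian with correlation $\rho$, chosen because its conditionals are known in closed form:

```python
import numpy as np

def gibbs_bivariate_normal(n_sweeps, rho, rng):
    """Gibbs-sample a standard bivariate normal with correlation rho."""
    x1, x2 = 0.0, 0.0
    samples = []
    for _ in range(n_sweeps):
        # p(x1 | x2) = N(rho * x2, 1 - rho^2), and symmetrically for x2
        x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))
        x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))
        samples.append((x1, x2))
    return np.array(samples)

rho = 0.8
rng = np.random.default_rng(3)
s = gibbs_bivariate_normal(20000, rho, rng)
print(np.corrcoef(s[:, 0], s[:, 1])[0, 1])  # close to rho after many sweeps
```

Even though each step only ever samples one coordinate from its conditional, the empirical correlation of the chain's samples recovers the joint structure.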
Thanks to the RBM's symmetric structure and the conditional independence of its units, we can use Gibbs sampling to draw random samples from the distribution the RBM defines. Concretely, k-step Gibbs sampling in an RBM works as follows: initialize the visible units with a training sample (or a random visible state) $v_0$, then alternate the two sampling steps $h_t \sim p(h \mid v_t)$ and $v_{t+1} \sim p(v \mid h_t)$ for $t = 0, \ldots, k-1$.
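The alternating chain can be sketched as follows (the sizes and weights here are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_k(v0, W, b, c, k, rng):
    """k steps of alternating Gibbs sampling in a binary RBM.

    W: (n_visible, n_hidden) weights; b, c: visible/hidden biases.
    Follows the chain v0 -> h0 -> v1 -> h1 -> ... -> vk.
    """
    v = v0
    for _ in range(k):
        h = (rng.random(W.shape[1]) < sigmoid(v @ W + c)).astype(float)  # h_t ~ p(h|v_t)
        v = (rng.random(W.shape[0]) < sigmoid(h @ W.T + b)).astype(float)  # v_{t+1} ~ p(v|h_t)
    return v

rng = np.random.default_rng(5)
W = 0.1 * rng.normal(size=(6, 4))  # toy sizes, not from the post
b = np.zeros(6)
c = np.zeros(4)
v0 = (rng.random(6) < 0.5).astype(float)
print(gibbs_k(v0, W, b, c, k=3, rng=rng))
```

Each sweep touches a whole layer at once rather than one component at a time, which is exactly where the RBM's conditional independence pays off.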
In theory, each parameter update would require running the chain above all the way to convergence; the performance cost this implies is clearly unacceptable.
Professor Hinton proposed an improvement called Contrastive Divergence, i.e. CD-k. He pointed out that there is no need to wait for the chain to converge: samples can be obtained after just k steps of Gibbs sampling, and very few steps (a single step in the experiments) already give sufficiently good results.
Below is pseudocode for the CD-k algorithm as used by the RBM.
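A CD-1 update can also be sketched in NumPy (sizes, learning rate, and data below are made up; the structure parallels the MATLAB code further down):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(data, W, b, c, lr, rng):
    """One CD-1 update for a binary RBM; returns the reconstruction error.

    data: (batch, n_visible); W: (n_visible, n_hidden); b, c: biases.
    """
    n = data.shape[0]
    # Positive phase: hidden probabilities and states driven by the data
    pos_h_probs = sigmoid(data @ W + c)
    pos_h_states = (rng.random(pos_h_probs.shape) < pos_h_probs).astype(float)
    # Negative phase: one Gibbs step back to the visible layer and up again
    neg_v_probs = sigmoid(pos_h_states @ W.T + b)
    neg_h_probs = sigmoid(neg_v_probs @ W + c)
    # Gradient estimates: <v h>_data - <v h>_recon, and bias analogues
    W += lr * (data.T @ pos_h_probs - neg_v_probs.T @ neg_h_probs) / n
    b += lr * (data - neg_v_probs).mean(axis=0)
    c += lr * (pos_h_probs - neg_h_probs).mean(axis=0)
    return np.sum((data - neg_v_probs) ** 2)  # squared reconstruction error

rng = np.random.default_rng(4)
W = 0.1 * rng.normal(size=(6, 4))  # toy RBM: 6 visible, 4 hidden units
b = np.zeros(6)
c = np.zeros(4)
data = (rng.random((8, 6)) < 0.5).astype(float)
errs = [cd1_step(data, W, b, c, lr=0.1, rng=rng) for _ in range(200)]
print(errs[0], errs[-1])  # reconstruction error tends to decrease
```

Note that, as in Hinton's code, only the hidden states of the positive phase are binarized; the negative phase keeps probabilities, which lowers the variance of the update.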
A C++ implementation of deep learning has been put on GitHub. Due to time constraints only the rough framework is implemented and the details still need improvement; everyone is welcome to participate: https://github.com/loujiayu/deeplearning
Attached below is Geoff Hinton's MATLAB code for the RBM.
% Version 1.000
%
% Code provided by Geoff Hinton and Ruslan Salakhutdinov
%
% Permission is granted for anyone to copy, use, modify, or distribute this
% program and accompanying programs and documents for any purpose, provided
% this copyright notice is retained and prominently displayed, along with
% a note saying that the original programs are available from our
% web page.
% The programs and documents are distributed without any warranty, express or
% implied. As the programs were written for research purposes only, they have
% not been tested to the degree that would be advisable in any important
% application. All use of these programs is entirely at the user's own risk.

% This program trains Restricted Boltzmann Machine in which
% visible, binary, stochastic pixels are connected to
% hidden, binary, stochastic feature detectors using symmetrically
% weighted connections. Learning is done with 1-step Contrastive Divergence.
% The program assumes that the following variables are set externally:
% maxepoch  -- maximum number of epochs
% numhid    -- number of hidden units
% batchdata -- the data that is divided into batches (numcases numdims numbatches)
% restart   -- set to 1 if learning starts from beginning

epsilonw  = 0.1;   % Learning rate for weights
epsilonvb = 0.1;   % Learning rate for biases of visible units
epsilonhb = 0.1;   % Learning rate for biases of hidden units
weightcost = 0.0002;
initialmomentum = 0.5;
finalmomentum   = 0.9;

[numcases numdims numbatches]=size(batchdata);

if restart ==1,
  restart=0;
  epoch=1;

  % Initializing symmetric weights and biases.
  vishid    = 0.1*randn(numdims, numhid);
  hidbiases = zeros(1,numhid);
  visbiases = zeros(1,numdims);

  poshidprobs = zeros(numcases,numhid);
  neghidprobs = zeros(numcases,numhid);
  posprods    = zeros(numdims,numhid);
  negprods    = zeros(numdims,numhid);
  vishidinc   = zeros(numdims,numhid);
  hidbiasinc  = zeros(1,numhid);
  visbiasinc  = zeros(1,numdims);
  batchposhidprobs=zeros(numcases,numhid,numbatches);
end

for epoch = epoch:maxepoch,
  fprintf(1,'epoch %d\r',epoch);
  errsum=0;
  for batch = 1:numbatches,
    fprintf(1,'epoch %d batch %d\r',epoch,batch);

    %%%%%%%%% START POSITIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    data = batchdata(:,:,batch);
    poshidprobs = 1./(1 + exp(-data*vishid - repmat(hidbiases,numcases,1)));
    batchposhidprobs(:,:,batch)=poshidprobs;
    posprods  = data' * poshidprobs;
    poshidact = sum(poshidprobs);
    posvisact = sum(data);
    %%%%%%%%% END OF POSITIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    poshidstates = poshidprobs > rand(numcases,numhid);

    %%%%%%%%% START NEGATIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    negdata = 1./(1 + exp(-poshidstates*vishid' - repmat(visbiases,numcases,1)));
    neghidprobs = 1./(1 + exp(-negdata*vishid - repmat(hidbiases,numcases,1)));
    negprods  = negdata'*neghidprobs;
    neghidact = sum(neghidprobs);
    negvisact = sum(negdata);
    %%%%%%%%% END OF NEGATIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

    err = sum(sum( (data-negdata).^2 ));
    errsum = err + errsum;

    if epoch>5,
      momentum=finalmomentum;
    else
      momentum=initialmomentum;
    end;

    %%%%%%%%% UPDATE WEIGHTS AND BIASES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    vishidinc = momentum*vishidinc + ...
                epsilonw*( (posprods-negprods)/numcases - weightcost*vishid);
    visbiasinc = momentum*visbiasinc + (epsilonvb/numcases)*(posvisact-negvisact);
    hidbiasinc = momentum*hidbiasinc + (epsilonhb/numcases)*(poshidact-neghidact);

    vishid = vishid + vishidinc;
    visbiases = visbiases + visbiasinc;
    hidbiases = hidbiases + hidbiasinc;
    %%%%%%%%%%%%%%%% END OF UPDATES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  end
  fprintf(1, 'epoch %4i error %6.1f \n', epoch, errsum);
end;