I. Problem Introduction
Some probability models contain only observable variables, such as the single-coin toss model: for every trial, whether the coin lands heads or tails is directly observed. Other models also contain unobservable variables, such as the three-coin model. That problem is set up as follows: first toss coin A; if it comes up heads, toss coin B, otherwise toss coin C; only the outcome (heads or tails) of coin B or C is recorded. The outcome of B or C is therefore an observable variable, while the outcome of A is an unobservable variable. Here we write Y for the observable variable and Z for the latent (unobservable) variable; Y and Z together are called the complete data, and Y alone is called the incomplete data. In the text-classification setting, the features of the unlabeled data are the observable variable Y, and the unobserved class labels are the latent variable Z.
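To make the three-coin model concrete, write $\pi$, $p$, and $q$ (notation introduced here for illustration, not from the original post) for the heads probabilities of coins A, B, and C. The probability of a single observed outcome $y \in \{1, 0\}$ (heads/tails) is obtained by summing out the unobserved result of coin A:

$$
P(y \mid \theta) = \sum_{z} P(z \mid \theta)\, P(y \mid z, \theta) = \pi\, p^{y}(1-p)^{1-y} + (1-\pi)\, q^{y}(1-q)^{1-y}, \qquad \theta = (\pi, p, q).
$$

The latent variable $z$ (the result of coin A) appears only inside the sum, which is exactly why the parameters cannot be read off directly from the observed data.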
For a probability model that contains only observable variables, the prior distribution can be estimated from the observed class labels (in text classification, the proportion of documents belonging to each class), and the conditional distributions can be estimated from the observed class labels together with the observed features (the probability that a given word appears given the class). Such a model can therefore be solved directly with naive Bayes. When the model contains latent variables, however, naive Bayes no longer applies, because neither the prior nor the conditional distributions can be computed directly. This motivates an iterative method for maximum likelihood estimation from incomplete data: the Expectation-Maximization (EM) algorithm. The rest of this post derives and explains the algorithm in detail.
II. Algorithm Details
1. Maximizing the likelihood function
Since the only data we can observe is the incomplete data Y, the parameters are estimated by maximizing the likelihood of Y.
The distribution of Y given Z can be understood as the model of Y once the class is fixed. In a Gaussian mixture model, once the class is fixed the features follow a Gaussian distribution; in a text model, once the class is fixed the features follow the class-conditional distribution (the features form a set, and since all features are conditionally independent, the joint distribution is the product of the individual feature distributions, each a Bernoulli). The prior distribution of Z can be understood as the proportion of each class, i.e. the class prior. For incomplete data, Z cannot be observed, so the difficulty lies precisely in determining these conditional and prior distributions.
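In the document-clustering code later in this post, the class-conditional model is a unigram (multinomial) language model over word counts. With mixing weights $\rho_k$ and per-cluster word distributions $\mu_k$ (the same symbols used in the code), the likelihood of a document with word-count vector $\mathbf{c} = (c_1, \dots, c_W)$ is

$$
P(\mathbf{c} \mid \theta) = \sum_{k=1}^{K} P(z = k \mid \theta)\, P(\mathbf{c} \mid z = k, \theta) = \sum_{k=1}^{K} \rho_k \prod_{w=1}^{W} \mu_{k,w}^{\,c_w}.
$$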
In the lower-bound function B(θ^(t), θ) introduced below, the two arguments have different meanings: the first is the parameter used for the posterior distribution of Z, and the second is the free parameter of the joint distribution of the complete data.
2. Convergence analysis
To maximize the likelihood L, we can first construct a lower-bound function B of L and then find the maximizer of B; moving to that point also moves L closer to its maximum. By iterating this step we keep approaching a maximum of L, as illustrated by the EM iteration figure.
First we initialize the posterior of Z using prior knowledge, i.e. we use the labeled data to compute the distribution of Z given Y, which gives $P(Z \mid Y, \theta^{(0)})$. With this, the lower-bound function at the initial point (viewed as a function whose argument is the parameter and whose value is the likelihood) can be computed. At step t of the iteration the lower bound is $B(\theta^{(t)}, \theta)$, and at the point $\theta^{(t)}$ the likelihood equals its lower bound: $L(\theta^{(t)}) = B(\theta^{(t)}, \theta^{(t)})$. Maximizing B then yields the parameters for the next iteration.
3. Algorithm flow
The EM algorithm alternates between two steps, the E (Expectation) step and the M (Maximization) step. The E step constructs the current lower-bound function B; because B is derived as an expectation, it is called the expectation step. The M step maximizes the current lower bound B and substitutes the maximizing parameters back into the likelihood; hence it is called the maximization step. The overall flow is:
1. Use prior knowledge to estimate the posterior distribution of the latent variable and obtain an initial parameter value $\theta^{(0)}$ to start the EM algorithm.
2. E step: compute the expectation $E_Z[\log P(Z, Y \mid \theta) \mid Y, \theta^{(t)}]$.
3. M step: maximize that expectation to obtain the new parameter value $\theta^{(t+1)}$.
4. Repeat steps 2 and 3 until convergence or for a fixed number of iterations.
The mathematical derivation is as follows.
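The lower bound B comes from Jensen's inequality applied to the log marginal likelihood; a sketch of the standard derivation, using the notation above:

$$
\begin{aligned}
L(\theta) &= \log P(Y \mid \theta) = \log \sum_{Z} P(Y, Z \mid \theta) \\
&= \log \sum_{Z} P(Z \mid Y, \theta^{(t)}) \, \frac{P(Y, Z \mid \theta)}{P(Z \mid Y, \theta^{(t)})} \\
&\ge \sum_{Z} P(Z \mid Y, \theta^{(t)}) \log \frac{P(Y, Z \mid \theta)}{P(Z \mid Y, \theta^{(t)})} \;\triangleq\; B(\theta^{(t)}, \theta).
\end{aligned}
$$

At $\theta = \theta^{(t)}$ the bound is tight, $L(\theta^{(t)}) = B(\theta^{(t)}, \theta^{(t)})$, so any $\theta$ that increases B also increases L. Dropping the term of B that does not depend on $\theta$ leaves exactly the expectation maximized in the M step:

$$
\theta^{(t+1)} = \arg\max_{\theta} B(\theta^{(t)}, \theta) = \arg\max_{\theta} E_{Z}\!\left[\log P(Z, Y \mid \theta) \mid Y, \theta^{(t)}\right].
$$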


Following the algorithm above, we can now implement it. The code is as follows.
## Helper function: log-sum-exp, computed stably by subtracting the maximum
logSum <- function(v) {
  m <- max(v)
  return (m + log(sum(exp(v - m))))
}
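logSum implements the log-sum-exp trick: subtracting the maximum before exponentiating avoids numerical underflow when the log posteriors are very negative, which is common for long documents. A quick illustration (values chosen only for illustration):

# The naive computation underflows, the helper does not
log(sum(exp(c(-1000, -1001))))   # -Inf, because exp(-1000) underflows to 0
logSum(c(-1000, -1001))          # approximately -999.69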
# Hard E step
hard.E.step <- function(gamma, model, counts){
  # Model parameter setting
  N <- dim(counts)[2]   # number of documents
  K <- dim(model$mu)[1] # number of clusters
  # E step:
  for (n in 1:N){
    for (k in 1:K){
      ## calculate the posterior based on the estimated mu and rho in the "log space"
      gamma[n,k] <- log(model$rho[k,1]) + sum(counts[,n] * log(model$mu[k,]))
    }
    # normalisation to sum to 1 in the log space
    logZ <- logSum(gamma[n,])
    gamma[n,] <- gamma[n,] - logZ
    ## Compared with soft EM, here we make a hard assignment of each document
    ## based on its posterior probabilities:
    ## find the cluster with the maximum posterior probability for z_nk
    max.index <- which.max(gamma[n,])
    ## set all posterior probabilities to 0 and the largest to 1 to finish the hard assignment
    gamma[n,] <- 0
    gamma[n,max.index] <- 1
  }
  return (gamma)
}
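For comparison with the "soft EM" mentioned in the comments above, a soft E step would keep the full responsibilities $P(z_n = k \mid \mathbf{c}_n, \theta^{(t)})$ instead of a one-hot assignment. A minimal sketch (not part of the original post):

# Soft E step (sketch): same posteriors as hard.E.step, but no hard assignment
soft.E.step <- function(gamma, model, counts){
  N <- dim(counts)[2]   # number of documents
  K <- dim(model$mu)[1] # number of clusters
  for (n in 1:N){
    for (k in 1:K){
      ## posterior of cluster k for document n, in log space
      gamma[n,k] <- log(model$rho[k,1]) + sum(counts[,n] * log(model$mu[k,]))
    }
    ## normalise and map back from log space to probabilities (responsibilities)
    gamma[n,] <- exp(gamma[n,] - logSum(gamma[n,]))
  }
  return (gamma)
}

The same M.step below works unchanged with these soft responsibilities, since it only uses gamma as per-document weights.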
# M step
M.step <- function(gamma, model, counts, eps=1e-10){
  # Model parameter setting
  N <- dim(counts)[2]   # number of documents
  W <- dim(counts)[1]   # number of words, i.e. vocabulary size
  K <- dim(model$mu)[1] # number of clusters
  ## Update the parameters in the M step
  for (k in 1:K) {
    ## Update the mixing coefficients
    model$rho[k,1] <- sum(gamma[,k])/N
    ## Update the language model mu: smoothed total weighted word count for cluster k (denominator)
    total <- sum(gamma[,k] * t(counts)) + W * eps
    ## For each word w, compute the language model probability
    for (w in 1:W){
      model$mu[k,w] <- (sum(gamma[,k] * counts[w,]) + eps)/total
    }
  }
  # Return the result
  return (model)
}
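Written out, the updates this M step implements are (with $\gamma_{nk}$ the E-step assignment of document $n$ to cluster $k$, $c_{w,n}$ the count of word $w$ in document $n$, and $\varepsilon$ the smoothing constant eps):

$$
\rho_k = \frac{1}{N} \sum_{n=1}^{N} \gamma_{nk}, \qquad
\mu_{k,w} = \frac{\sum_{n=1}^{N} \gamma_{nk}\, c_{w,n} + \varepsilon}{\sum_{w'=1}^{W} \sum_{n=1}^{N} \gamma_{nk}\, c_{w',n} + W \varepsilon}.
$$

These are the maximizers of the expected complete-data log-likelihood under the constraints $\sum_k \rho_k = 1$ and $\sum_w \mu_{k,w} = 1$; the $\varepsilon$ terms only add a tiny amount of smoothing so that no word probability is exactly zero.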
##--- Initialize model parameters randomly --------------------------------------------
##
initial.param <- function(vocab_size, K=4, seed=123456){
  set.seed(seed) # use the provided seed so the random initialization is reproducible
  rho <- matrix(1/K, nrow = K, ncol = 1) # assume all clusters have the same size (we will update this later on)
  mu <- matrix(runif(K*vocab_size), nrow = K, ncol = vocab_size) # initialize mu randomly
  mu <- prop.table(mu, margin = 1) # normalize so that each row sums to 1
  return (list("rho" = rho, "mu" = mu))
}
Next we combine the E and M steps into the full EM loop.
## Hard EM
##--- EM for Document Clustering --------------------------------------------
hard.EM <- function(counts, K=4, max.epoch=10, seed=123456){
  # INPUTS:
  ## counts: word count matrix
  ## K: the number of clusters
  # OUTPUTS:
  ## model: a list of model parameters
  ## gamma: the hard cluster assignments
  # Model parameter setting
  N <- dim(counts)[2] # number of documents
  W <- dim(counts)[1] # number of unique words (in all documents)
  # Initialization
  model <- initial.param(W, K=K, seed=seed)
  gamma <- matrix(0, nrow = N, ncol = K)
  print(train_obj(model, counts))
  # Build the model
  for(epoch in 1:max.epoch){
    # E step
    gamma <- hard.E.step(gamma, model, counts)
    # M step
    model <- M.step(gamma, model, counts)
    print(train_obj(model, counts))
  }
  # Return the model and the assignments
  return(list("model"=model, "gamma"=gamma))
}
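hard.EM prints a training objective via train_obj, which is not listed in the post. Here is a minimal sketch, assuming the printed objective is the negative log-likelihood of the observed documents under the current model (consistent with the decreasing values shown further below):

## Training objective (sketch): negative log-likelihood of the observed documents
train_obj <- function(model, counts){
  N <- dim(counts)[2]   # number of documents
  K <- dim(model$mu)[1] # number of clusters
  nloglike <- 0
  for (n in 1:N){
    ## log( rho_k * prod_w mu_{k,w}^{c_{w,n}} ) for each cluster k
    log.prob <- rep(0, K)
    for (k in 1:K){
      log.prob[k] <- log(model$rho[k,1]) + sum(counts[,n] * log(model$mu[k,]))
    }
    ## accumulate -log P(c_n | theta) using the log-sum-exp helper
    nloglike <- nloglike - logSum(log.prob)
  }
  return (nloglike)
}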
Next, we load the text data, do some simple preprocessing, and then use it to check the EM implementation above.
## Load the libraries
library(tm)
library(SnowballC)
eps <- 1e-10
# Function for reading the data
read.data <- function(file.name='Task2A.txt', sample.size=1000, seed=100, pre.proc=TRUE, spr.ratio= 0.90) {
  # INPUTS:
  ## file.name: name of the input .txt file
  ## sample.size: if == 0 reads all docs, otherwise only reads a subset of the corpus
  ## seed: random seed for sampling (see above)
  ## pre.proc: if TRUE performs the preprocessing (recommended)
  ## spr.ratio: used to reduce the sparsity of the data by removing very infrequent words
  # OUTPUTS:
  ## docs: the unlabeled corpus (each row is a document)
  ## word.doc.mat: the count matrix (rows correspond to words, columns to documents)
  ## labels: the real cluster labels (used for visualization/validation only, not for clustering)
  # Read the data
  text <- readLines(file.name)
  # select a subset of the data if sample.size > 0
  if (sample.size > 0){
    set.seed(seed)
    text <- text[sample(length(text), sample.size)]
  }
  ## the terms before the first '\t' are the labels (the newsgroup names) and all the remaining text after '\t' are the actual documents
  docs <- strsplit(text, '\t')
  # store the labels for evaluation
  labels <- unlist(lapply(docs, function(x) x[1]))
  # store the unlabeled texts
  docs <- data.frame(unlist(lapply(docs, function(x) x[2])))
  # create a corpus
  docs <- DataframeSource(docs)
  corp <- Corpus(docs)
  # Preprocessing:
  if (pre.proc){
    corp <- tm_map(corp, removeWords, stopwords("english")) # remove stop words (the most common words in a language, found in almost any document)
    corp <- tm_map(corp, removePunctuation)                 # remove punctuation
    corp <- tm_map(corp, stemDocument)                      # perform stemming (reduce inflected and derived words to their root form)
    corp <- tm_map(corp, removeNumbers)                     # remove all numbers
    corp <- tm_map(corp, stripWhitespace)                   # remove redundant spaces
  }
  # Create a document-term matrix whose rows are documents and columns are words
  dtm <- DocumentTermMatrix(corp)
  ## reduce the sparsity of the dtm
  dtm <- removeSparseTerms(dtm, spr.ratio)
  ## convert the dtm to a word-by-document matrix
  word.doc.mat <- t(as.matrix(dtm))
  # Return the result
  return (list("docs" = docs, "word.doc.mat" = word.doc.mat, "labels" = labels))
}
Train the model:
# Read the documents
## Note: sample.size=0 means read all documents!
## (for development and debugging use a smaller subset, e.g. sample.size = 40)
data <- read.data(file.name='Task2A.txt', sample.size=0, seed=100, pre.proc=TRUE, spr.ratio= .99)
# word-document frequency matrix
counts <- data$word.doc.mat
# run the hard EM algorithm on the data with K = 4
hard.res <- hard.EM(counts, K = 4, max.epoch = 50)
This produces the following output (the training objective printed after initialization and after each epoch):
[1] 2171715 [1] 1952192 [1] 1942383 [1] 1938631 [1] 1937321 [1] 1936228 [1] 1935571 [1] 1935383 [1] 1935195 [1] 1935073 [1] 1935032 [1] 1934910 [1] 1934876 [1] 1934764 [1] 1934700 [1] 1934629 [1] 1934559 [1] 1934515 [1] 1934494 [1] 1934387 [1] 1934331 [1] 1934249 [1] 1934181 [1] 1934101 [1] 1933877 [1] 1933044 [1] 1929635 [1] 1927475 [1] 1926070 [1] 1925825 [1] 1925707 [1] 1925570 [1] 1925531 [1] 1925507 [1] 1925477 [1] 1925468 [1] 1925456 [1] 1925431 [1] 1925385 [1] 1925271 [1] 1925170 [1] 1925055 [1] 1924912 [1] 1924732 [1] 1924470 [1] 1924196 [1] 1923888 [1] 1923562 [1] 1923348 [1] 1923261 [1] 1923162
Let's visualize the result:
##--- Cluster Visualization -------------------------------------------------
cluster.viz <- function(doc.word.mat, color.vector, title=' '){
  # project the documents onto the first two principal components and color them by cluster
  p.comp <- prcomp(doc.word.mat, scale. = TRUE, center = TRUE)
  plot(p.comp$x, col=color.vector, pch=1, main=title)
}
# hard EM clustering visualization
## recover the assigned cluster for each document (with hard EM, gamma is already a one-hot assignment, so which.max reads it off)
label.hat <- apply(hard.res$gamma, 1, which.max)
## normalize the count matrix for better visualization
counts <- scale(counts) # only use when the dimensionality of the data (number of words) is large enough
## visualize the estimated clusters
cluster.viz(t(counts), label.hat, 'Estimated Clusters (Hard EM)')
Finally, we visualize the documents colored by their true labels and compare with the clusters found by EM.
## visualize the real clusters
cluster.viz(t(counts), factor(data$labels), 'Real Clusters')

Comparing the two plots, we can see that EM recovers the document categories quite well, essentially accomplishing the clustering task.
