I. Problem Introduction
Some probability models contain only observable variables, such as the single-coin toss model: for every trial, whether the coin lands heads or tails is directly observed. Other models also contain unobservable variables, such as the three-coin model. That problem is set up as follows: first toss coin A; if it comes up heads, toss coin B, otherwise toss coin C; only the outcome (heads or tails) of coin B or C is recorded. The outcome of B or C is therefore an observable variable, while the outcome of A is an unobservable variable. Here we write Y for the observable variable and Z for the latent (unobservable) variable; Y and Z together are called the complete data, and Y alone is called the incomplete data. In the text-classification setting, the features of the unlabeled data are the observable variable Y, and the unobserved class labels are the latent variable Z.
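To make the three-coin model concrete, write $\pi$, $p$, and $q$ (notation introduced here for illustration, not from the original post) for the heads probabilities of coins A, B, and C. The probability of a single observed outcome $y \in \{1, 0\}$ (heads/tails) is obtained by summing out the unobserved result of coin A:

$$
P(y \mid \theta) = \sum_{z} P(z \mid \theta)\, P(y \mid z, \theta) = \pi\, p^{y}(1-p)^{1-y} + (1-\pi)\, q^{y}(1-q)^{1-y}, \qquad \theta = (\pi, p, q).
$$

The latent variable $z$ (the result of coin A) appears only inside the sum, which is exactly why the parameters cannot be read off directly from the observed data.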
For a probability model that contains only observable variables, the prior distribution can be estimated from the observed class labels (in text classification, the proportion of documents belonging to each class), and the conditional distributions can be estimated from the observed class labels together with the observed features (the probability that a given word appears given the class). Such a model can therefore be solved directly with naive Bayes. When the model contains latent variables, however, naive Bayes no longer applies, because neither the prior nor the conditional distributions can be computed directly. This motivates an iterative method for maximum likelihood estimation from incomplete data: the Expectation-Maximization (EM) algorithm. The rest of this post derives and explains the algorithm in detail.
II. Algorithm Details
1. Maximizing the likelihood function
Since the only data we can observe is the incomplete data Y, the parameters are estimated by maximizing the likelihood of Y.
The distribution of Y given Z can be understood as the model of Y once the class is fixed. In a Gaussian mixture model, once the class is fixed the features follow a Gaussian distribution; in a text model, once the class is fixed the features follow the class-conditional distribution (the features form a set, and since all features are conditionally independent, the joint distribution is the product of the individual feature distributions, each a Bernoulli). The prior distribution of Z can be understood as the proportion of each class, i.e. the class prior. For incomplete data, Z cannot be observed, so the difficulty lies precisely in determining these conditional and prior distributions.
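In the document-clustering code later in this post, the class-conditional model is a unigram (multinomial) language model over word counts. With mixing weights $\rho_k$ and per-cluster word distributions $\mu_k$ (the same symbols used in the code), the likelihood of a document with word-count vector $\mathbf{c} = (c_1, \dots, c_W)$ is

$$
P(\mathbf{c} \mid \theta) = \sum_{k=1}^{K} P(z = k \mid \theta)\, P(\mathbf{c} \mid z = k, \theta) = \sum_{k=1}^{K} \rho_k \prod_{w=1}^{W} \mu_{k,w}^{\,c_w}.
$$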
In the lower-bound function B(θ^(t), θ) introduced below, the two arguments have different meanings: the first is the parameter used for the posterior distribution of Z, and the second is the free parameter of the joint distribution of the complete data.
2. Convergence analysis
To maximize the likelihood L, we can first construct a lower-bound function B of L and then find the maximizer of B; moving to that point also moves L closer to its maximum. By iterating this step we keep approaching a maximum of L, as illustrated by the EM iteration figure.
First we initialize the posterior of Z using prior knowledge, i.e. we use the labeled data to compute the distribution of Z given Y, which gives $P(Z \mid Y, \theta^{(0)})$. With this, the lower-bound function at the initial point (viewed as a function whose argument is the parameter and whose value is the likelihood) can be computed. At step t of the iteration the lower bound is $B(\theta^{(t)}, \theta)$, and at the point $\theta^{(t)}$ the likelihood equals its lower bound: $L(\theta^{(t)}) = B(\theta^{(t)}, \theta^{(t)})$. Maximizing B then yields the parameters for the next iteration.
3. Algorithm flow
The EM algorithm alternates between two steps, the E (Expectation) step and the M (Maximization) step. The E step constructs the current lower-bound function B; because B is derived as an expectation, it is called the expectation step. The M step maximizes the current lower bound B and substitutes the maximizing parameters back into the likelihood; hence it is called the maximization step. The overall flow is:
1. Use prior knowledge to estimate the posterior distribution of the latent variable and obtain an initial parameter value $\theta^{(0)}$ to start the EM algorithm.
2. E step: compute the expectation $E_Z[\log P(Z, Y \mid \theta) \mid Y, \theta^{(t)}]$.
3. M step: maximize that expectation to obtain the new parameter value $\theta^{(t+1)}$.
4. Repeat steps 2 and 3 until convergence or for a fixed number of iterations.
The mathematical derivation is as follows.
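The lower bound B comes from Jensen's inequality applied to the log marginal likelihood; a sketch of the standard derivation, using the notation above:

$$
\begin{aligned}
L(\theta) &= \log P(Y \mid \theta) = \log \sum_{Z} P(Y, Z \mid \theta) \\
&= \log \sum_{Z} P(Z \mid Y, \theta^{(t)}) \, \frac{P(Y, Z \mid \theta)}{P(Z \mid Y, \theta^{(t)})} \\
&\ge \sum_{Z} P(Z \mid Y, \theta^{(t)}) \log \frac{P(Y, Z \mid \theta)}{P(Z \mid Y, \theta^{(t)})} \;\triangleq\; B(\theta^{(t)}, \theta).
\end{aligned}
$$

At $\theta = \theta^{(t)}$ the bound is tight, $L(\theta^{(t)}) = B(\theta^{(t)}, \theta^{(t)})$, so any $\theta$ that increases B also increases L. Dropping the term of B that does not depend on $\theta$ leaves exactly the expectation maximized in the M step:

$$
\theta^{(t+1)} = \arg\max_{\theta} B(\theta^{(t)}, \theta) = \arg\max_{\theta} E_{Z}\!\left[\log P(Z, Y \mid \theta) \mid Y, \theta^{(t)}\right].
$$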


Following the algorithm above, we can now implement it. The code is as follows.
## Helper function: log-sum-exp, computed stably by subtracting the maximum
logSum <- function(v) {
  m <- max(v)
  return (m + log(sum(exp(v - m))))
}
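logSum implements the log-sum-exp trick: subtracting the maximum before exponentiating avoids numerical underflow when the log posteriors are very negative, which is common for long documents. A quick illustration (values chosen only for illustration):

# The naive computation underflows, the helper does not
log(sum(exp(c(-1000, -1001))))   # -Inf, because exp(-1000) underflows to 0
logSum(c(-1000, -1001))          # approximately -999.69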
# Hard E step
hard.E.step <- function(gamma, model, counts){
  # Model parameter setting
  N <- dim(counts)[2]   # number of documents
  K <- dim(model$mu)[1] # number of clusters
  # E step:
  for (n in 1:N){
    for (k in 1:K){
      ## calculate the posterior based on the estimated mu and rho in the "log space"
      gamma[n,k] <- log(model$rho[k,1]) + sum(counts[,n] * log(model$mu[k,]))
    }
    # normalisation to sum to 1 in the log space
    logZ <- logSum(gamma[n,])
    gamma[n,] <- gamma[n,] - logZ
    ## Compared with soft EM, here we make a hard assignment of each document
    ## based on its posterior probabilities:
    ## find the cluster with the maximum posterior probability for z_nk
    max.index <- which.max(gamma[n,])
    ## set all posterior probabilities to 0 and the largest to 1 to finish the hard assignment
    gamma[n,] <- 0
    gamma[n,max.index] <- 1
  }
  return (gamma)
}
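For comparison with the "soft EM" mentioned in the comments above, a soft E step would keep the full responsibilities $P(z_n = k \mid \mathbf{c}_n, \theta^{(t)})$ instead of a one-hot assignment. A minimal sketch (not part of the original post):

# Soft E step (sketch): same posteriors as hard.E.step, but no hard assignment
soft.E.step <- function(gamma, model, counts){
  N <- dim(counts)[2]   # number of documents
  K <- dim(model$mu)[1] # number of clusters
  for (n in 1:N){
    for (k in 1:K){
      ## posterior of cluster k for document n, in log space
      gamma[n,k] <- log(model$rho[k,1]) + sum(counts[,n] * log(model$mu[k,]))
    }
    ## normalise and map back from log space to probabilities (responsibilities)
    gamma[n,] <- exp(gamma[n,] - logSum(gamma[n,]))
  }
  return (gamma)
}

The same M.step below works unchanged with these soft responsibilities, since it only uses gamma as per-document weights.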
# M step
M.step <- function(gamma, model, counts, eps=1e-10){
  # Model parameter setting
  N <- dim(counts)[2]   # number of documents
  W <- dim(counts)[1]   # number of words, i.e. vocabulary size
  K <- dim(model$mu)[1] # number of clusters
  ## Update the parameters in the M step
  for (k in 1:K) {
    ## Update the mixing coefficients
    model$rho[k,1] <- sum(gamma[,k])/N
    ## Update the language model mu: smoothed total weighted word count for cluster k (denominator)
    total <- sum(gamma[,k] * t(counts)) + W * eps
    ## For each word w, compute the language model probability
    for (w in 1:W){
      model$mu[k,w] <- (sum(gamma[,k] * counts[w,]) + eps)/total
    }
  }
  # Return the result
  return (model)
}
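Written out, the updates this M step implements are (with $\gamma_{nk}$ the E-step assignment of document $n$ to cluster $k$, $c_{w,n}$ the count of word $w$ in document $n$, and $\varepsilon$ the smoothing constant eps):

$$
\rho_k = \frac{1}{N} \sum_{n=1}^{N} \gamma_{nk}, \qquad
\mu_{k,w} = \frac{\sum_{n=1}^{N} \gamma_{nk}\, c_{w,n} + \varepsilon}{\sum_{w'=1}^{W} \sum_{n=1}^{N} \gamma_{nk}\, c_{w',n} + W \varepsilon}.
$$

These are the maximizers of the expected complete-data log-likelihood under the constraints $\sum_k \rho_k = 1$ and $\sum_w \mu_{k,w} = 1$; the $\varepsilon$ terms only add a tiny amount of smoothing so that no word probability is exactly zero.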
##--- Initialize model parameters randomly --------------------------------------------
##
initial.param <- function(vocab_size, K=4, seed=123456){
  set.seed(seed) # use the provided seed so the random initialization is reproducible
  rho <- matrix(1/K, nrow = K, ncol = 1) # assume all clusters have the same size (we will update this later on)
  mu <- matrix(runif(K*vocab_size), nrow = K, ncol = vocab_size) # initialize mu randomly
  mu <- prop.table(mu, margin = 1) # normalize so that each row sums to 1
  return (list("rho" = rho, "mu" = mu))
}
Next we combine the E and M steps into the full EM loop.
## Hard EM
##--- EM for Document Clustering --------------------------------------------
hard.EM <- function(counts, K=4, max.epoch=10, seed=123456){
  # INPUTS:
  ## counts: word count matrix
  ## K: the number of clusters
  # OUTPUTS:
  ## model: a list of model parameters
  ## gamma: the hard cluster assignments
  # Model parameter setting
  N <- dim(counts)[2] # number of documents
  W <- dim(counts)[1] # number of unique words (in all documents)
  # Initialization
  model <- initial.param(W, K=K, seed=seed)
  gamma <- matrix(0, nrow = N, ncol = K)
  print(train_obj(model, counts))
  # Build the model
  for(epoch in 1:max.epoch){
    # E step
    gamma <- hard.E.step(gamma, model, counts)
    # M step
    model <- M.step(gamma, model, counts)
    print(train_obj(model, counts))
  }
  # Return the model and the assignments
  return(list("model"=model, "gamma"=gamma))
}
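hard.EM prints a training objective via train_obj, which is not listed in the post. Here is a minimal sketch, assuming the printed objective is the negative log-likelihood of the observed documents under the current model (consistent with the decreasing values shown further below):

## Training objective (sketch): negative log-likelihood of the observed documents
train_obj <- function(model, counts){
  N <- dim(counts)[2]   # number of documents
  K <- dim(model$mu)[1] # number of clusters
  nloglike <- 0
  for (n in 1:N){
    ## log( rho_k * prod_w mu_{k,w}^{c_{w,n}} ) for each cluster k
    log.prob <- rep(0, K)
    for (k in 1:K){
      log.prob[k] <- log(model$rho[k,1]) + sum(counts[,n] * log(model$mu[k,]))
    }
    ## accumulate -log P(c_n | theta) using the log-sum-exp helper
    nloglike <- nloglike - logSum(log.prob)
  }
  return (nloglike)
}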
Next, we load the text data, do some simple preprocessing, and then use it to check the EM implementation above.
## Load the libraries
library(tm)
library(SnowballC)
eps <- 1e-10
# Function for reading the data
read.data <- function(file.name='Task2A.txt', sample.size=1000, seed=100, pre.proc=TRUE, spr.ratio= 0.90) {
  # INPUTS:
  ## file.name: name of the input .txt file
  ## sample.size: if == 0 reads all docs, otherwise only reads a subset of the corpus
  ## seed: random seed for sampling (see above)
  ## pre.proc: if TRUE performs the preprocessing (recommended)
  ## spr.ratio: used to reduce the sparsity of the data by removing very infrequent words
  # OUTPUTS:
  ## docs: the unlabeled corpus (each row is a document)
  ## word.doc.mat: the count matrix (rows correspond to words, columns to documents)
  ## labels: the real cluster labels (used for visualization/validation only, not for clustering)
  # Read the data
  text <- readLines(file.name)
  # select a subset of the data if sample.size > 0
  if (sample.size > 0){
    set.seed(seed)
    text <- text[sample(length(text), sample.size)]
  }
  ## the terms before the first '\t' are the labels (the newsgroup names) and all the remaining text after '\t' are the actual documents
  docs <- strsplit(text, '\t')
  # store the labels for evaluation
  labels <- unlist(lapply(docs, function(x) x[1]))
  # store the unlabeled texts
  docs <- data.frame(unlist(lapply(docs, function(x) x[2])))
  # create a corpus
  docs <- DataframeSource(docs)
  corp <- Corpus(docs)
  # Preprocessing:
  if (pre.proc){
    corp <- tm_map(corp, removeWords, stopwords("english")) # remove stop words (the most common words in a language, found in almost any document)
    corp <- tm_map(corp, removePunctuation)                 # remove punctuation
    corp <- tm_map(corp, stemDocument)                      # perform stemming (reduce inflected and derived words to their root form)
    corp <- tm_map(corp, removeNumbers)                     # remove all numbers
    corp <- tm_map(corp, stripWhitespace)                   # remove redundant spaces
  }
  # Create a document-term matrix whose rows are documents and columns are words
  dtm <- DocumentTermMatrix(corp)
  ## reduce the sparsity of the dtm
  dtm <- removeSparseTerms(dtm, spr.ratio)
  ## convert the dtm to a word-by-document matrix
  word.doc.mat <- t(as.matrix(dtm))
  # Return the result
  return (list("docs" = docs, "word.doc.mat" = word.doc.mat, "labels" = labels))
}
Train the model:
# Read the documents
## Note: sample.size=0 means read all documents!
## (for development and debugging use a smaller subset, e.g. sample.size = 40)
data <- read.data(file.name='Task2A.txt', sample.size=0, seed=100, pre.proc=TRUE, spr.ratio= .99)
# word-document frequency matrix
counts <- data$word.doc.mat
# run the hard EM algorithm on the data with K = 4
hard.res <- hard.EM(counts, K = 4, max.epoch = 50)
This produces the following output (the training objective printed after initialization and after each epoch):
[1] 2171715 [1] 1952192 [1] 1942383 [1] 1938631 [1] 1937321 [1] 1936228 [1] 1935571 [1] 1935383 [1] 1935195 [1] 1935073 [1] 1935032 [1] 1934910 [1] 1934876 [1] 1934764 [1] 1934700 [1] 1934629 [1] 1934559 [1] 1934515 [1] 1934494 [1] 1934387 [1] 1934331 [1] 1934249 [1] 1934181 [1] 1934101 [1] 1933877 [1] 1933044 [1] 1929635 [1] 1927475 [1] 1926070 [1] 1925825 [1] 1925707 [1] 1925570 [1] 1925531 [1] 1925507 [1] 1925477 [1] 1925468 [1] 1925456 [1] 1925431 [1] 1925385 [1] 1925271 [1] 1925170 [1] 1925055 [1] 1924912 [1] 1924732 [1] 1924470 [1] 1924196 [1] 1923888 [1] 1923562 [1] 1923348 [1] 1923261 [1] 1923162
Let's visualize the result:
##--- Cluster Visualization -------------------------------------------------
cluster.viz <- function(doc.word.mat, color.vector, title=' '){
  # project the documents onto the first two principal components and color them by cluster
  p.comp <- prcomp(doc.word.mat, scale. = TRUE, center = TRUE)
  plot(p.comp$x, col=color.vector, pch=1, main=title)
}
# hard EM clustering visualization
## recover the assigned cluster for each document (with hard EM, gamma is already a one-hot assignment, so which.max reads it off)
label.hat <- apply(hard.res$gamma, 1, which.max)
## normalize the count matrix for better visualization
counts <- scale(counts) # only use when the dimensionality of the data (number of words) is large enough
## visualize the estimated clusters
cluster.viz(t(counts), label.hat, 'Estimated Clusters (Hard EM)')
Finally, we visualize the documents colored by their true labels and compare with the clusters found by EM.
## visualize the real clusters
cluster.viz(t(counts), factor(data$labels), 'Real Clusters')

Comparing the two plots, we can see that EM recovers the document categories quite well, essentially accomplishing the clustering task.
