PLSA

Reposted from the original post, 2016-07-13 19:43. Tags: EM / DataMining / PLSA

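The PLSA Model

PLSA closely resembles LDA: both are topic models. Both assume that when God writes a document, he first picks a topic with some probability, then picks a word under that topic with some probability, and repeats this process until the document is complete. Summing over the $K$ possible topics gives $p(w_j|d_i)=\sum_k{p(z_k|d_i)p(w_j|z_k)}$, and hence $p(d_i,w_j)=p(d_i)\sum_k{p(z_k|d_i)p(w_j|z_k)}$, where $d$ denotes a document, $w$ a word, and $z$ a topic.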
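The generative story above can be sketched in a few lines of Java. This is only a toy illustration, not part of the implementation given later: the class name `PlsaGenerativeSketch`, its methods, and the probability tables in `main` are all made up for the example. It draws a topic from $p(z|d)$, then a word from $p(w|z)$, and also computes the marginal word probability by summing over the topics.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Toy sketch of the PLSA generative story for a single document (hypothetical names). */
class PlsaGenerativeSketch {

    /** Draws an index from a discrete distribution whose entries sum to 1. */
    static int sample(double[] probs, Random rng) {
        double u = rng.nextDouble();
        double cumulative = 0.0;
        for (int i = 0; i < probs.length; i++) {
            cumulative += probs[i];
            if (u < cumulative) {
                return i;
            }
        }
        return probs.length - 1; // guard against floating-point round-off
    }

    /** Generates `length` word ids: pick a topic from p(z|d), then a word from p(w|z). */
    static List<Integer> generateDocument(double[] pzGivenD, double[][] pwGivenZ,
                                          int length, Random rng) {
        List<Integer> words = new ArrayList<>();
        for (int n = 0; n < length; n++) {
            int z = sample(pzGivenD, rng);
            words.add(sample(pwGivenZ[z], rng));
        }
        return words;
    }

    /** Marginal word probability p(w|d) = sum_k p(z_k|d) p(w|z_k). */
    static double wordProbability(double[] pzGivenD, double[][] pwGivenZ, int w) {
        double p = 0.0;
        for (int z = 0; z < pzGivenD.length; z++) {
            p += pzGivenD[z] * pwGivenZ[z][w];
        }
        return p;
    }

    public static void main(String[] args) {
        double[] pzGivenD = {0.7, 0.3};                            // made-up p(z|d), 2 topics
        double[][] pwGivenZ = {{0.5, 0.5, 0.0}, {0.0, 0.2, 0.8}};  // made-up p(w|z), 3 words
        System.out.println(generateDocument(pzGivenD, pwGivenZ, 10, new Random(42)));
        System.out.println(wordProbability(pzGivenD, pwGivenZ, 1)); // 0.7*0.5 + 0.3*0.2
    }
}
```

With these made-up tables, word 1 can be reached through either topic, so its probability in the document is roughly 0.41.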

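Solving the Model

Solving the model means estimating every $p(z_k|d_i)$ and $p(w_j|z_k)$; with these in hand we can generate any number of documents.

A basic distinction is worth adding here: conditional probability versus posterior probability. A conditional probability goes "from cause to effect": in $p(z_k|d_i)$, $d_i$ is the cause and $z_k$ is the effect, so $p(z_k|d_i)$ is a conditional probability, and so is $p(w_j|z_k)$. A posterior probability goes "from effect back to cause": having observed the input and output of a system, we try to infer how it works internally. In PLSA this means that, having observed the word $w_j$ in document $d_i$ (documents and words are observed variables), we ask which topic connects $d_i$ to $w_j$ (the topic is a latent variable). As Figure 1 shows, this amounts to computing $p(z_k|d_i,w_j)$.

Figure 1. The probabilistic graphical model of PLSA

Next we use the EM algorithm to estimate the model parameters $p(z_k|d_i)$ and $p(w_j|z_k)$.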

E-Step

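E stands for Expectation: using the model parameters from the previous round, we compute the expectation of the latent variables. In PLSA this means using the previous round's $p(z_k|d_i)$ and $p(w_j|z_k)$ to compute, for every word in every document, the probability of each topic behind that word, $p(z_k|d_i,w_j)$. Looking back at Figure 1, there are $K$ paths from $d_i$ to $w_j$, and the probability that the path passes through $z_k$ is

\begin{equation}p(z_k|d_i,w_j)=\frac{p(z_k|d_i)p(w_j|z_k)}{\sum_{k'}{p(z_{k'}|d_i)p(w_j|z_{k'})}}\label{post1}\end{equation}

The conditional probabilities $p(z_k|d_i)$ and $p(w_j|z_k)$ here come from the previous round's M-Step; initially they are assigned random values and normalized so that each distribution sums to 1.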
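For a single (document, word) pair, the E-step is just a normalization over the $K$ paths. A minimal sketch, assuming the two inputs are already given as arrays indexed by topic; the class `PlsaEStepSketch`, the method `posterior`, the numbers in `main`, and the uniform fallback for a zero normalizer are all assumptions made for illustration:

```java
/** Toy sketch of the PLSA E-step for one (document, word) pair (hypothetical names). */
class PlsaEStepSketch {

    /**
     * Returns p(z_k|d,w) for every topic k.
     * pzGivenD[k] = p(z_k|d); pwGivenEachZ[k] = p(w|z_k) for the word in question.
     */
    static double[] posterior(double[] pzGivenD, double[] pwGivenEachZ) {
        int K = pzGivenD.length;
        double[] post = new double[K];
        double norm = 0.0;
        for (int k = 0; k < K; k++) {
            post[k] = pzGivenD[k] * pwGivenEachZ[k]; // weight of the path d -> z_k -> w
            norm += post[k];
        }
        for (int k = 0; k < K; k++) {
            // Assumption: if no topic can emit the word, fall back to a uniform posterior.
            post[k] = norm > 0.0 ? post[k] / norm : 1.0 / K;
        }
        return post;
    }

    public static void main(String[] args) {
        double[] pzGivenD = {0.5, 0.5};     // made-up p(z_k|d_i)
        double[] pwGivenEachZ = {0.1, 0.3}; // made-up p(w_j|z_k)
        double[] post = posterior(pzGivenD, pwGivenEachZ);
        System.out.println(post[0] + " " + post[1]);
    }
}
```

Here the two path weights are 0.05 and 0.15, so the posterior comes out at about 0.25 for the first topic and 0.75 for the second.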

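Deriving the posterior purely from Bayes' rule gives

\begin{equation}p(z_k|d_i,w_j)=\frac{p(z_k)p(d_i,w_j|z_k)}{p(d_i,w_j)}=\frac{p(z_k)p(d_i|z_k)p(w_j|z_k)}{\sum_{k'}{p(z_{k'})p(d_i|z_{k'})p(w_j|z_{k'})}}\label{post2}\end{equation}

Compared with Equation \ref{post1}, the numerator of Equation \ref{post2} carries an extra $p(z_k)$, but there it multiplies $p(d_i|z_k)$ rather than $p(z_k|d_i)$. Since $p(z_k)p(d_i|z_k)=p(d_i)p(z_k|d_i)$ and $p(d_i)$ cancels between numerator and denominator, the two equations are equivalent and differ only in parameterization. A real discrepancy arises only when $p(z_k)$ is multiplied onto $p(z_k|d_i)$, which is how two different versions of the computation of $p(z_k|d_i,w_j)$ come about. I have seen both versions in code, but the original PLSA formulation agrees with Equation \ref{post1}.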

M-Step

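M stands for Maximization: with the posterior probabilities known, we re-estimate the conditional probabilities by maximum likelihood estimation (MLE).

Once every $p(z_k|d_i,w_j)$ is known, count (in expectation) how many times topic $z_k$ leads to word $w_j$ across all documents, then count how many times $z_k$ leads to any word $w$ across all documents; dividing the first count by the second gives $p(w_j|z_k)$:

\begin{equation}p(w_j|z_k)=\frac{\sum_i{p(z_k|d_i,w_j)}}{\sum_i{\sum_{j'}{p(z_k|d_i,w_{j'})}}}\label{cond1}\end{equation}

Likewise, count how many times topic $z_k$ occurs in document $d_i$, then count the occurrences of all topics $z$ in $d_i$; dividing the first count by the second gives $p(z_k|d_i)$:

\begin{equation}p(z_k|d_i)=\frac{\sum_j{p(z_k|d_i,w_j)}}{\sum_j{\sum_{k'}{p(z_{k'}|d_i,w_j)}}}\label{cond2}\end{equation}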
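The two counting rules can be checked on a tiny corpus. The sketch below is an illustration only: the class `PlsaMStepSketch`, the array layout (`words[d][n]` is the word id at position `n` of document `d`, `post[d][n][k]` is the posterior of topic `k` for that occurrence) and the toy numbers are assumptions, and the code assumes every topic and every document receives some nonzero mass.

```java
/** Toy sketch of the unweighted PLSA M-step (hypothetical names and data layout). */
class PlsaMStepSketch {

    /** p(w|z): soft count of (topic k -> word w) divided by the soft count of (topic k -> any word). */
    static double[][] updatePwGivenZ(int[][] words, double[][][] post, int K, int V) {
        double[][] pw = new double[K][V];
        for (int k = 0; k < K; k++) {
            double norm = 0.0;
            for (int d = 0; d < words.length; d++) {
                for (int n = 0; n < words[d].length; n++) {
                    pw[k][words[d][n]] += post[d][n][k];
                    norm += post[d][n][k];
                }
            }
            for (int w = 0; w < V; w++) {
                pw[k][w] /= norm; // assumes norm > 0
            }
        }
        return pw;
    }

    /** p(z|d): soft count of topic k in document d divided by the soft count of all topics in d. */
    static double[][] updatePzGivenD(double[][][] post, int K) {
        double[][] pz = new double[post.length][K];
        for (int d = 0; d < post.length; d++) {
            double norm = 0.0;
            for (int n = 0; n < post[d].length; n++) {
                for (int k = 0; k < K; k++) {
                    pz[d][k] += post[d][n][k];
                    norm += post[d][n][k];
                }
            }
            for (int k = 0; k < K; k++) {
                pz[d][k] /= norm; // assumes the document is not empty
            }
        }
        return pz;
    }

    public static void main(String[] args) {
        int[][] words = {{0, 1}, {1, 2}}; // two documents, two words each
        double[][][] post = {{{1.0, 0.0}, {0.5, 0.5}}, {{0.5, 0.5}, {0.0, 1.0}}};
        double[][] pw = updatePwGivenZ(words, post, 2, 3);
        double[][] pz = updatePzGivenD(post, 2);
        System.out.println(java.util.Arrays.deepToString(pw));
        System.out.println(java.util.Arrays.deepToString(pz));
    }
}
```

In this toy input, topic 0 collects a soft count of 1.0 for word 0 and 1.0 for word 1, so its word distribution is (0.5, 0.5, 0); document 0 collects 1.5 for topic 0 and 0.5 for topic 1, so its topic distribution is (0.75, 0.25).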

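But wait: wasn't the M-Step supposed to obtain the conditional probabilities by MLE? What does simply counting frequencies and dividing one by the other have to do with MLE? In fact, the frequency ratio is exactly what MLE yields. A simple example shows that the two are equivalent (proving it directly for PLSA is more involved and requires Lagrange multipliers). Suppose a coin is tossed 10 times, landing heads 6 times and tails 4 times; what is the probability of heads? Dividing frequencies immediately gives $\frac{6}{10}$. With MLE, let the probability of heads be $p$; the likelihood is $p^6{(1-p)^4}$ and the log-likelihood is $\ln{p^6}+\ln{(1-p)^4}=6\ln{p}+4\ln{(1-p)}$. To find its maximum we set the derivative to zero, $\frac{6}{p}-\frac{4}{1-p}=0$, which gives $p=\frac{6}{10}$. So the two methods agree.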
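The coin example can also be checked numerically. The following sketch is an illustration rather than a proof: it scans $p$ over a grid and picks the value with the highest log-likelihood. The class `CoinMleSketch` and the grid resolution are choices made for this example.

```java
/** Numerical check that the MLE for the coin example equals the frequency ratio. */
class CoinMleSketch {

    /** Log-likelihood heads*ln(p) + tails*ln(1-p), for 0 < p < 1. */
    static double logLikelihood(double p, int heads, int tails) {
        return heads * Math.log(p) + tails * Math.log(1.0 - p);
    }

    /** Scans p = 1/steps, 2/steps, ..., (steps-1)/steps and returns the maximizer. */
    static double argmaxOnGrid(int heads, int tails, int steps) {
        double bestP = Double.NaN;
        double bestLl = Double.NEGATIVE_INFINITY;
        for (int i = 1; i < steps; i++) {
            double p = (double) i / steps;
            double ll = logLikelihood(p, heads, tails);
            if (ll > bestLl) {
                bestLl = ll;
                bestP = p;
            }
        }
        return bestP;
    }

    public static void main(String[] args) {
        System.out.println("grid MLE:        " + argmaxOnGrid(6, 4, 1000));
        System.out.println("frequency ratio: " + 6.0 / 10);
    }
}
```

On a grid of 1000 steps the maximizer lands on the grid point 0.6, matching the frequency ratio from the closed-form derivation.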

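Equations \ref{cond1} and \ref{cond2} ignore the fact that a word's weight actually depends on where it appears in the document. For example, an occurrence in the body text may count as 1, while an occurrence in the title should count as 1.5. The improved formulas for the conditional probabilities are therefore

\begin{equation}p(w_j|z_k)=\frac{\sum_i{weight_{ij}\cdot p(z_k|d_i,w_j)}}{\sum_i{\sum_{j'}{weight_{ij'}\cdot p(z_k|d_i,w_{j'})}}}\label{cond3}\end{equation}

\begin{equation}p(z_k|d_i)=\frac{\sum_j{weight_{ij}\cdot p(z_k|d_i,w_j)}}{\sum_j{\sum_{k'}{weight_{ij}\cdot p(z_{k'}|d_i,w_j)}}}\label{cond4}\end{equation}

where $weight_{ij}$ is the weight of $w_j$ in $d_i$.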
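To see what the weights change, consider one document with a title word (weight 1.5) and a body word (weight 1.0), each fully attributed to a different topic. This is a minimal sketch under those made-up numbers; the class `WeightedMStepSketch` and its method are hypothetical names used only here.

```java
/** Toy sketch of the weighted update of p(z|d) for a single document (hypothetical names). */
class WeightedMStepSketch {

    /**
     * weights[n] is the weight of the word at position n;
     * post[n][k] = p(z_k|d,w_n) is the posterior of topic k for that word.
     */
    static double[] updatePzGivenD(double[] weights, double[][] post, int K) {
        double[] pz = new double[K];
        double norm = 0.0;
        for (int n = 0; n < weights.length; n++) {
            for (int k = 0; k < K; k++) {
                double contribution = weights[n] * post[n][k];
                pz[k] += contribution;
                norm += contribution;
            }
        }
        for (int k = 0; k < K; k++) {
            pz[k] /= norm; // assumes at least one word with positive weight
        }
        return pz;
    }

    public static void main(String[] args) {
        double[] weights = {1.5, 1.0};             // title word, body word
        double[][] post = {{1.0, 0.0}, {0.0, 1.0}}; // each word belongs to one topic
        double[] pz = updatePzGivenD(weights, post, 2);
        System.out.println(pz[0] + " " + pz[1]);
    }
}
```

Without weights the document would be split evenly between the two topics; with the title weighted 1.5, the split becomes roughly 0.6 to 0.4.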

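PLSA for Recommendation

PLSA is a bag-of-words (BOW) model: it ignores the order in which words appear in a document, but it can take each word's weight in the document into account. These concepts carry over directly to recommender systems: treat a user's purchase history as a document, each purchased item as a word, and the user's rating of the item as the word's weight in the document. Applying PLSA then yields each user's vector representation over the latent topics, $p(z_k|d_i)$. From these vectors we can find similar users and then apply collaborative filtering to recommend items to the user.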
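The "find similar users" step might look like the sketch below, assuming cosine similarity as the measure and a plain array of per-user topic vectors. The class `UserSimilaritySketch`, its methods, and the three toy users are illustrative assumptions; the collaborative-filtering step that follows is not shown.

```java
/** Toy sketch: nearest user by cosine similarity of topic vectors p(z|d) (hypothetical names). */
class UserSimilaritySketch {

    /** Cosine similarity of two vectors of equal length; 0 if either vector is all zeros. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns the index of the user most similar to `target`, or -1 if there is no other user. */
    static int mostSimilarUser(double[][] userTopics, int target) {
        int best = -1;
        double bestSim = Double.NEGATIVE_INFINITY;
        for (int u = 0; u < userTopics.length; u++) {
            if (u == target) {
                continue;
            }
            double sim = cosine(userTopics[target], userTopics[u]);
            if (sim > bestSim) {
                bestSim = sim;
                best = u;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up topic vectors for three users over two latent topics
        double[][] userTopics = {{0.9, 0.1}, {0.8, 0.2}, {0.1, 0.9}};
        System.out.println("nearest to user 0: user " + mostSimilarUser(userTopics, 0));
    }
}
```

Items bought by the nearest users but not yet by the target user would then be the recommendation candidates.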

Java Implementation

PLSA.java

package plsa;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

/**
 * The original code comes from https://code.google.com/archive/p/mltool4j/. That code used p(z) when
 * computing p(z|d,w), but p(z) does not appear in the p(z|d), p(w|z) form of PLSA used here,
 * so I modified the source code.
 *
 * @author orisun
 * @date 2016-07-13
 */
public class PLSA {
    /** Orders map entries by value, largest first. */
    private static final Comparator<Entry<String, Double>> BY_VALUE_DESC =
        new Comparator<Entry<String, Double>>() {
            @Override
            public int compare(Entry<String, Double> o1, Entry<String, Double> o2) {
                return Double.compare(o2.getValue(), o1.getValue());
            }
        };

    private Dataset dataset = null;
    private Posting[][] invertedIndex = null;
    private int M = -1; // number of documents
    private int V = -1; // vocabulary size
    private int K = -1; // number of topics

    public boolean doPLSA(String datafilePath, int ntopics, int iters) {
        try {
            this.dataset = new Dataset(datafilePath);
        } catch (IOException e) {
            e.printStackTrace();
            return false;
        }
        this.M = this.dataset.size();
        this.V = this.dataset.getFeatureNum();
        this.K = ntopics;

        // Build a term --> doc inverted index to speed up the computation of p(w|z)
        this.buildInvertedIndex(this.dataset);
        this.runEM(iters);
        return true;
    }

    /**
     * Builds a term --> doc inverted index to speed up the computation of p(w|z).
     */
    @SuppressWarnings("unchecked")
    private boolean buildInvertedIndex(Dataset ds) {
        ArrayList<Posting>[] list = new ArrayList[this.V];
        for (int k = 0; k < this.V; ++k) {
            list[k] = new ArrayList<Posting>();
        }

        for (int m = 0; m < this.M; m++) {
            Data d = ds.getDataAt(m);
            for (int position = 0; position < d.size(); position++) {
                int w = d.getFeatureAt(position).dim;
                list[w].add(new Posting(m, position));
            }
        }
        this.invertedIndex = new Posting[this.V][];
        for (int w = 0; w < this.V; w++) {
            this.invertedIndex[w] = list[w].toArray(new Posting[0]);
        }
        return true;
    }

    private boolean runEM(int iters) {
        // p(z|d), size: M x K
        double[][] Pz_d = new double[this.M][this.K];

        // p(w|z), size: K x V
        double[][] Pw_z = new double[this.K][this.V];

        // p(z|d,w), size: M x K x doc.size()
        double[][][] Pz_dw = new double[this.M][this.K][];

        // L: log-likelihood value
        double L = -1;

        // Make sure the output directory "model" exists (a plain file of that name is replaced)
        File modelPath = new File("model");
        if (modelPath.isFile()) {
            modelPath.delete();
        }
        if (!modelPath.exists()) {
            modelPath.mkdirs();
        }

        // Start from randomly initialized parameters
        this.init(Pz_d, Pw_z, Pz_dw);
        for (int it = 0; it < iters; it++) {
            System.out.println("iteration " + it);
            // E-step
            if (!this.Estep(Pz_d, Pw_z, Pz_dw)) {
                System.out.println("EM failed in E-step");
            }

            // M-step
            if (!this.Mstep(Pz_dw, Pw_z, Pz_d)) {
                System.out.println("EM failed in M-step");
            }

            // During the final iterations, save the parameters
            if (it > iters - 10) {
                L = calcLoglikelihood(Pz_d, Pw_z);
                System.out.println("[" + it + "]" + "\tlikelihood: " + L);
                outputPzd(Pz_d, "model/doc_topic." + it); // the document vectors
                outputPwz(Pw_z, "model/topic_word." + it);
            }
        }

        return true;
    }

    /**
     * Takes the computed document vectors and computes the similarity of every document to the
     * first document, as a sanity check that the vectors produced by PLSA are reasonable.
     */
    public void test(String docVecFile) {
        try (BufferedReader br = new BufferedReader(new FileReader(docVecFile))) {
            String line = br.readLine();
            if (line == null) {
                return;
            }
            String[] arr = line.split("\\s+");
            if (arr.length < 1 + this.K) {
                System.err.println("1st doc vector's length is less than " + this.K);
                return;
            }
            double[] vec1 = new double[this.K];
            double norm1 = 0.0; // vector norm
            for (int i = 1; i < 1 + this.K; i++) {
                vec1[i - 1] = Double.parseDouble(arr[i]);
                norm1 += vec1[i - 1] * vec1[i - 1];
            }
            norm1 = Math.sqrt(norm1);
            Map<String, Double> simMap = new HashMap<String, Double>();
            while ((line = br.readLine()) != null) {
                arr = line.split("\\s+");
                if (arr.length == 1 + this.K) {
                    String docName = arr[0];
                    double norm2 = 0.0; // vector norm
                    double prod = 0.0;  // inner product
                    for (int i = 1; i < 1 + this.K; i++) {
                        double v = Double.parseDouble(arr[i]);
                        norm2 += v * v;
                        prod += vec1[i - 1] * v;
                    }
                    norm2 = Math.sqrt(norm2);
                    double sim = prod / (norm1 * norm2); // cosine similarity
                    simMap.put(docName, sim);
                }
            }

            // Sort by similarity, largest first
            List<Entry<String, Double>> simList = new ArrayList<Entry<String, Double>>(
                simMap.entrySet());
            Collections.sort(simList, BY_VALUE_DESC);
            // Print the (up to) 100 documents most similar to document 1
            for (int i = 0; i < 100 && i < simList.size(); i++) {
                System.out.println(simList.get(i).getKey() + "\t" + simList.get(i).getValue());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private boolean init(double[][] Pz_d, double[][] Pw_z, double[][][] Pz_dw) {
        // p(z|d), size: M x K
        for (int m = 0; m < this.M; m++) {
            double norm = 0.0;
            for (int z = 0; z < this.K; z++) {
                Pz_d[m][z] = Math.random();
                norm += Pz_d[m][z];
            }

            for (int z = 0; z < this.K; z++) {
                Pz_d[m][z] /= norm;
            }
        }

        // p(w|z), size: K x V
        for (int z = 0; z < this.K; z++) {
            double norm = 0.0;
            for (int w = 0; w < this.V; w++) {
                Pw_z[z][w] = Math.random();
                norm += Pw_z[z][w];
            }

            for (int w = 0; w < this.V; w++) {
                Pw_z[z][w] /= norm;
            }
        }

        // p(z|d,w), size: M x K x doc.size()
        for (int m = 0; m < this.M; m++) {
            for (int z = 0; z < this.K; z++) {
                Pz_dw[m][z] = new double[this.dataset.getDataAt(m).size()];
            }
        }
        return true;
    }

    private boolean Estep(double[][] Pz_d, double[][] Pw_z, double[][][] Pz_dw) {
        for (int m = 0; m < this.M; m++) {
            Data data = this.dataset.getDataAt(m);
            for (int position = 0; position < data.size(); position++) {
                // get word(dimension) at current position of document m
                int w = data.getFeatureAt(position).dim;
                double norm = 0.0;
                for (int z = 0; z < this.K; z++) {
                    double val = Pz_d[m][z] * Pw_z[z][w];
                    Pz_dw[m][z][position] = val;
                    norm += val;
                }
                if (norm == 0.0) {
                    continue; // no topic explains this word; keep the zeros rather than produce NaN
                }
                // Normalize the topic distribution of the current word in the current document
                for (int z = 0; z < this.K; z++) {
                    Pz_dw[m][z][position] /= norm;
                }
            }
        }
        return true;
    }

    private boolean Mstep(double[][][] Pz_dw, double[][] Pw_z, double[][] Pz_d) {
        // p(z|d)
        for (int m = 0; m < this.M; m++) {
            double norm = 0.0;
            Data d = this.dataset.getDataAt(m);
            for (int z = 0; z < this.K; z++) {
                double sum = 0.0;
                for (int position = 0; position < d.size(); position++) {
                    double n = d.getFeatureAt(position).weight;
                    sum += n * Pz_dw[m][z][position];
                }
                Pz_d[m][z] = sum;
                norm += sum;
            }

            // normalization (skipped for empty documents to avoid dividing by zero)
            if (norm > 0.0) {
                for (int z = 0; z < this.K; z++) {
                    Pz_d[m][z] /= norm;
                }
            }
        }

        // p(w|z)
        for (int z = 0; z < this.K; z++) {
            double norm = 0.0;
            for (int w = 0; w < this.V; w++) {
                double sum = 0.0;
                Posting[] postings = this.invertedIndex[w];
                for (Posting posting : postings) {
                    int m = posting.docID;
                    int position = posting.pos;
                    double n = this.dataset.getDataAt(m).getFeatureAt(position).weight;
                    sum += n * Pz_dw[m][z][position];
                }
                Pw_z[z][w] = sum;
                norm += sum;
            }
            // normalization (skipped for topics with no mass to avoid dividing by zero)
            if (norm > 0.0) {
                for (int w = 0; w < this.V; w++) {
                    Pw_z[z][w] /= norm;
                }
            }
        }

        return true;
    }

    /** Weighted log-likelihood of the corpus (base-10 logarithm, as in the original code). */
    private double calcLoglikelihood(double[][] Pz_d, double[][] Pw_z) {
        double L = 0.0;
        for (int m = 0; m < this.M; m++) {
            Data d = this.dataset.getDataAt(m);
            for (int position = 0; position < d.size(); position++) {
                Feature f = d.getFeatureAt(position);
                int w = f.dim;
                double n = f.weight;

                double sum = 0.0;
                for (int z = 0; z < this.K; z++) {
                    sum += Pz_d[m][z] * Pw_z[z][w];
                }
                L += n * Math.log10(sum);
            }
        }
        return L;
    }

    /**
     * Writes each document's probability distribution over the topics.
     *
     * @param outFile
     */
    private void outputPzd(double[][] Pz_d, String outFile) {
        try (BufferedWriter bw = new BufferedWriter(new FileWriter(outFile))) {
            for (int i = 0; i < this.M; i++) {
                String docName = this.dataset.getDataAt(i).docName;
                bw.write(docName);
                for (int j = 0; j < this.K; j++) {
                    bw.write("\t");
                    bw.write(String.valueOf(Pz_d[i][j]));
                }
                bw.newLine();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Writes the top 100 words of each topic.
     *
     * @param outFile
     */
    private void outputPwz(double[][] Pw_z, String outFile) {
        try (BufferedWriter bw = new BufferedWriter(new FileWriter(outFile))) {
            for (int i = 0; i < Pw_z.length; i++) {
                // weight of each word under topic i
                Map<String, Double> wordWeight = new HashMap<String, Double>();
                for (int j = 0; j < Pw_z[i].length; j++) {
                    String word = this.dataset.features.get(j);
                    wordWeight.put(word, Pw_z[i][j]);
                }
                List<Entry<String, Double>> wordWeightList = new ArrayList<Entry<String, Double>>(
                    wordWeight.entrySet());
                Collections.sort(wordWeightList, BY_VALUE_DESC);
                for (int j = 0; j < wordWeightList.size() && j < 100; j++) {
                    bw.write(wordWeightList.get(j).getKey() + ":" + wordWeightList.get(j).getValue()
                             + "\t");
                }
                bw.newLine();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        int nTopic = 50;
        int nIter = 100;
        String dataPath = "docs/user2vec"; // default training data location
        PLSA plsa = new PLSA();
        if (args.length >= 1) {
            dataPath = args[0];
        }
        if (args.length >= 2) {
            nTopic = Integer.parseInt(args[1]);
        }
        if (args.length >= 3) {
            nIter = Integer.parseInt(args[2]);
        }
        System.out.println("train data in " + dataPath);
        if (!plsa.doPLSA(dataPath, nTopic, nIter)) {
            System.err.println("failed to load train data from " + dataPath);
            return;
        }
        System.out.println("end PLSA");

        String docVecFile = "model/doc_topic." + (nIter - 1);
        plsa.test(docVecFile);
    }
    // Example: nohup java -cp .:plsa.jar plsa.PLSA /data/orisun/cf/data/user_graph.txt 50 100 &
}

Dataset.java

package plsa;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Document collection
 *
 * @author orisun
 * @date 2016-07-10
 */
public class Dataset {

    /** The documents **/
    List<Data> datas = new ArrayList<Data>();
    /** Maps each word to its index **/
    Map<String, Integer> featureIndex = new HashMap<String, Integer>();
    /** Maps each index back to its word **/
    List<String> features = new ArrayList<String>();

    int size() {
        return datas.size();
    }

    int getFeatureNum() {
        return featureIndex.size();
    }

    Data getDataAt(int i) {
        return datas.get(i);
    }

    /**
     *
     * @param dataDir
     *            If dataDir is a directory holding the document set, each file is one document and
     *            each line stores a word and its weight in the file, separated by whitespace.
     *            A word may repeat within a document.<br>
     *            If all documents are stored in the single file dataDir, each line has the format:
     *            docName\tword:weight\tword:weight……
     * @throws IOException
     */
    Dataset(String dataDir) throws IOException {
        File path = new File(dataDir);
        if (!path.exists()) {
            System.err.println("data path not found: " + dataDir);
            return; // leaves the dataset empty, as in the original code
        }
        if (path.isDirectory()) {
            File[] files = path.listFiles();
            if (files == null) {
                throw new IOException("cannot list directory " + dataDir);
            }
            for (File file : files) {
                Data data = new Data();
                data.docName = file.getName();
                try (BufferedReader br = new BufferedReader(new FileReader(file))) {
                    String line;
                    while ((line = br.readLine()) != null) {
                        String[] arr = line.trim().split("\\s+");
                        if (arr.length == 2) {
                            addFeature(data, arr[0], Double.parseDouble(arr[1]));
                        }
                    }
                }
                datas.add(data);
            }
        } else if (path.isFile()) {
            try (BufferedReader br = new BufferedReader(new FileReader(path))) {
                String line;
                while ((line = br.readLine()) != null) {
                    String[] arr = line.trim().split("\\s+");
                    if (arr.length >= 2) {
                        Data data = new Data();
                        data.docName = arr[0];
                        for (int i = 1; i < arr.length; i++) {
                            String[] brr = arr[i].split(":");
                            if (brr.length == 2) {
                                addFeature(data, brr[0], Double.parseDouble(brr[1]));
                            }
                        }
                        datas.add(data);
                    }
                }
            }
        }
    }

    /** Appends a (word, weight) pair to the document; a word gets a new index when first seen. */
    private void addFeature(Data data, String word, double weight) {
        Integer index = featureIndex.get(word);
        if (index == null) {
            index = featureIndex.size();
            featureIndex.put(word, index);
            features.add(word);
        }
        data.features.add(new Feature(index, weight));
    }

}

Data.java

package plsa;

import java.util.ArrayList;
import java.util.List;

/**
 * Document
 *
 * @author orisun
 * @date 2016-07-10
 */
public class Data {

    /** All words in the document **/
    List<Feature> features = new ArrayList<Feature>();
    /** Document name **/
    String docName;

    int size() {
        return features.size();
    }

    Feature getFeatureAt(int i) {
        return features.get(i);
    }
}

Feature.java

package plsa;

/**
 * Word
 *
 * @author orisun
 * @date 2016-07-10
 */
public class Feature {

    /** Index of this word in the vocabulary **/
    int dim;
    /** Weight of this word in the given document **/
    double weight;

    Feature(int index, double weight) {
        this.dim = index;
        this.weight = weight;
    }
}

Posting.java

package plsa;

/**
 * Inverted-index entry
 *
 * @author orisun
 * @date 2016-07-10
 */
public class Posting {

    /** Document index **/
    int docID;
    /** Position of the word within the document **/
    int pos;

    Posting(int docID, int pos) {
        this.docID = docID;
        this.pos = pos;
    }
}
