Topic Evolution Experiments with DTM (Dynamic Topic Models)

Reposted article; original published 2014-10-28 19:13. Topic: Dynamic Topic Models

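I have recently wanted to study Dynamic Topic Models (DTM). I read the paper, but with a humanities background I honestly could not follow it, so I decided to run an experiment instead. Conveniently, Blei's homepage provides a C++ tool for this: http://www.cs.princeton.edu/~blei/topicmodeling.html. The dtm code itself is hosted on Google Code, so downloading it requires getting past the firewall.

After downloading it I looked through the source. I really do not know C++, but a search on GitHub did not turn up a satisfactory Java version either, so I had to grit my teeth and use the C++ one.

I also searched online to see whether anyone had done similar work. Some people had, but not many:

1. One person's experiment, on Linux: http://www.jgoodwin.net/experimenting-with-dynamic-topic-models/

2. A question about whether a Python version exists; the answer at the time was no. http://stackoverflow.com/questions/22469506/are-there-any-efficient-python-libraries-for-dynamic-topic-models-preferably-ex

3. The mailing list: https://lists.cs.princeton.edu/pipermail/topic-models/

Even after reading all of that I did not really understand it, so I worked through it slowly on my own.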

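Step 1: Install the system

Download CentOS 5.5. The readme notes that this is the version the author compiled on. I installed it in VMware. Two other versions can of course be used as well.

Step 2: Compile

The readme covers this too: put the files in the appropriate directory and run make.

Step 3: Try the small example bundled with the code first, as the author recommends.

It is described in dtm/sample.sh.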

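(1) Input files.

Two input files are required. Examples can be found in the dtm/example folder: test-mult.dat and test-seq.dat.

a: foo-mult.dat, which describes the relationship between documents and words

One document per line, each line in the form: unique_word_count index1:count1 index2:count2 ... indexn:countn

In plain words: the number of distinct words in the document, then word 1's ID (an integer):word 1's frequency, word 2's ID:word 2's frequency, ..., word n's ID:word n's frequency.

Example: 11 288:1 1248:1 5:1 1063:2 269:1 654:1 656:2 532:1 373:1 1247:1 543:1

This document contains 11 distinct words; word 288 occurs once, word 1248 occurs once, and so on. Word IDs are assigned once and shared across all documents.

Note: the documents in this file must be in chronological order, with the earliest at the top and the latest at the bottom.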
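
To make the line format concrete, here is a minimal Python sketch that parses one line of a -mult.dat file and checks that the declared number of distinct words matches the number of index:count pairs. The function name parse_mult_line is my own and is not part of the DTM distribution.

```python
def parse_mult_line(line):
    """Parse one -mult.dat line into (declared_count, {word_id: frequency})."""
    fields = line.split()
    declared = int(fields[0])
    counts = {}
    for pair in fields[1:]:
        word_id, freq = pair.split(":")
        counts[int(word_id)] = int(freq)
    # The leading number should equal the number of distinct words listed.
    if declared != len(counts):
        raise ValueError(
            f"line declares {declared} distinct words but lists {len(counts)}"
        )
    return declared, counts


# The example line quoted in the text above.
example = "11 288:1 1248:1 5:1 1063:2 269:1 654:1 656:2 532:1 373:1 1247:1 543:1"
declared, counts = parse_mult_line(example)
total_tokens = sum(counts.values())  # total word occurrences in this document
```

Running a check like this over every line is a cheap way to catch a malformed corpus file before starting a long fit.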

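b: foo-seq.dat, which divides the documents into time windows.

The file format is:

        Number_Timestamps (total number of time windows)
        number_docs_time_1 (number of documents in the first time window, i.e. how many documents counted from the top of foo-mult.dat belong to the first window; when splitting by year, write each year's document count here)
        ...
        number_docs_time_i
        ...
        number_docs_time_NumberTimestamps

In the author's example, the first line says there are 10 time windows and the second line says the first window holds 25 documents (it looks as though this one is split by year too):

        10
        25
        50
        75
        100
        100
        100
        100
        125
        150
        175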
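
As a sanity check on this format, the sketch below (Python, with helper names of my own choosing) parses the contents of a -seq.dat file, verifies that the first number matches how many counts follow, and works out which lines of the -mult.dat file fall into each time window. It assumes, as described above, that windows are filled by taking documents from the top of the -mult.dat file in order.

```python
def parse_seq(text):
    """Return the per-window document counts from the text of a -seq.dat file."""
    values = [int(tok) for tok in text.split()]
    n_windows, doc_counts = values[0], values[1:]
    if n_windows != len(doc_counts):
        raise ValueError(
            f"header says {n_windows} windows but {len(doc_counts)} counts follow"
        )
    return doc_counts


def window_ranges(doc_counts):
    """Map each window to a half-open, 0-based (start, end) range of document lines."""
    ranges, start = [], 0
    for n_docs in doc_counts:
        ranges.append((start, start + n_docs))
        start += n_docs
    return ranges


# The author's example file, as listed above.
example_seq = "10\n25\n50\n75\n100\n100\n100\n100\n125\n150\n175\n"
doc_counts = parse_seq(example_seq)
ranges = window_ranges(doc_counts)
total_docs = sum(doc_counts)  # should equal the number of lines in the -mult.dat file
```

The total is worth comparing against the line count of the -mult.dat file, since the two files have to agree on how many documents exist.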

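Once those two files are in place, the author mentions two more files that are not required but are very useful.

c: Vocabulary file

Every word that appears in the document collection, ordered by the word IDs used above.

d: Document information file

One line of basic information per document, in the same order as file a.

All of the files above can be generated with text2ldac, which can be downloaded from https://github.com/JoKnopp/text2ldac and is run with Python.

Usage: on the command line, change to the directory containing text2ldac.py and run python text2ldac.py -o ./out -e txt ./in

Here out is the output folder, in is the input folder, and -e txt restricts processing to .txt files.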
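
For readers who want to see what such a conversion involves, here is a minimal Python sketch that turns already-tokenized, dated documents into the lines of a -mult.dat file, a -seq.dat file, and a vocabulary file. It is not text2ldac's code. The function name build_dtm_inputs, the 0-based word IDs, and the use of one time window per distinct time label are all my own assumptions, and a label with no documents simply produces no window.

```python
from collections import Counter


def build_dtm_inputs(dated_docs):
    """dated_docs: list of (time_label, tokens). Returns (mult_lines, seq_lines, vocab)."""
    # Documents must appear in chronological order; sorted() is stable,
    # so documents sharing a label keep their relative order.
    ordered = sorted(dated_docs, key=lambda doc: doc[0])
    word_ids = {}
    mult_lines = []
    docs_per_window = Counter()
    for label, tokens in ordered:
        counts = Counter()
        for token in tokens:
            # Assign the next free ID the first time a word is seen (0-based here).
            word_id = word_ids.setdefault(token, len(word_ids))
            counts[word_id] += 1
        pairs = " ".join(f"{i}:{c}" for i, c in sorted(counts.items()))
        mult_lines.append(f"{len(counts)} {pairs}".rstrip())
        docs_per_window[label] += 1
    labels = sorted(docs_per_window)
    seq_lines = [str(len(labels))] + [str(docs_per_window[l]) for l in labels]
    vocab = sorted(word_ids, key=word_ids.get)  # words listed in ID order
    return mult_lines, seq_lines, vocab


docs = [
    (2001, ["topic", "model", "topic"]),
    (2000, ["dynamic", "topic"]),
    (2001, ["model"]),
]
mult_lines, seq_lines, vocab = build_dtm_inputs(docs)
```

Writing each returned list to a file, one entry per line, gives inputs in the shapes described above. Tokenization, stop-word removal, and the document information file are left out of the sketch.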

(2) Run the program

The readme says that running ./main --help lists every option with an explanation. The output of that command follows; it is long.

　　

Flag initialize_lda is of type bool, but its default value is not a boolean. NOTE: This will soon be a compilations error!
main: Warning: SetUsageMessage() never called

Flags from ../lib/util/gflags-1.1/src/gflags.cc:
-flagfile (load flags from file) type: string default: ""
-fromenv (set flags from the environment [use 'export FLAGS_flag1=value'])
type: string default: ""
-tryfromenv (set flags from the environment if present) type: string
default: ""
-undefok (comma-separated list of flag names that it is okay to specify on
the command line even if the program does not define a flag with that
name. IMPORTANT: flags in this list that have arguments MUST use the
flag=value format) type: string default: ""

Flags from ../lib/util/gflags-1.1/src/gflags_completions.cc:
-tab_completion_columns (Number of columns to use in output for tab
completion) type: int32 default: 80
-tab_completion_word (If non-empty, HandleCommandLineCompletions() will
hijack the process and attempt to do bash-style command line flag
completion on this value.) type: string default: ""

Flags from ../lib/util/gflags-1.1/src/gflags_reporting.cc:
-help (show help on all flags [tip: all flags can have two dashes])
type: bool default: true
-helpfull (show help on all flags -- same as -help) type: bool
default: false
-helpmatch (show help on modules whose name contains the specified substr)
type: string default: ""
-helpon (show help on the modules named by this flag value) type: string
default: ""
-helppackage (show help on all modules in the main package) type: bool
default: false
-helpshort (show help on only the main module for this program) type: bool
default: false
-helpxml (produce an xml version of help) type: bool default: false
-version (show version and build info and exit) type: bool default: false

Flags from data.c:
-influence_flat_years (How many years is the influence nonzero?If nonpositive, a lognormal distribution is used.)
type: int32 default: -1
-influence_mean_years (How many years is the mean number of citations?)
type: double default: 20
-influence_stdev_years (How many years is the stdev number of citations?)
type: double default: 15
-max_number_time_points (Used for the influence window.) type: int32
default: 200
-resolution (The resolution. Used to determine how far out the beta mean should be.)
type: double default: 1
-sigma_c (c stdev.) type: double default: 0.050000000000000003
-sigma_cv (Variational c stdev.) type: double
default: 9.9999999999999995e-07
-sigma_d (If true, use the new phi calculation.) type: double
default: 0.050000000000000003
-sigma_l (If true, use the new phi calculation.) type: double
default: 0.050000000000000003
-time_resolution (This is the number of years per time slice.) type: double
default: 0.5

Flags from gsl-wrappers.c:
-rng_seed (Specifies the random seed. If 0, seeds pseudo-randomly.)
type: int64 default: 0

Flags from lda-seq.c:
-fix_topics (Fix a set of this many topics. This amounts to fixing these topics' variance at 1e-10.)
type: int32 default: 0
-forward_window (The forward window for deltas. If negative, we use a beta with mean 5.)
type: int32 default: 1
-lda_sequence_max_iter (The maximum number of iterations.)
type: int32 default: 20
-lda_sequence_min_iter (The maximum number of iterations.)
type: int32 default: 1
-normalize_docs (Describes how documents's wordcounts are considered for finding influence. Options are "normalize", "none", "occurrence", "log", or "log_norm".)
type: string default: "normalize"
-save_time (Save a specific time. If -1, save all times.)
type: int32 default: 2147483647

Flags from lda.c:
-lambda_convergence (Specifies the level of convergence required for lambda in the phi updates.)
type: double default: 0.01

Flags from main.c:
-alpha () type: double default: -10
-corpus_prefix (The function to perform. Can be dtm or dim.)
type: string default: ""
-end () type: int32 default: -1
-heldout_corpus_prefix () type: string default: ""
-heldout_time (A time up to (but not including) which we wish to train, and at which we wish to test.) type: int32 default: -1
-initialize_lda (If true, initialize the model with lda.) type: bool
default: false
-lda_max_em_iter () type: int32 default: 20
-lda_model_prefix (The name of a fit model to be used for testing likelihood. Appending "info.dat" to this should give the name of the
file.) type: string default: ""
-mode (The function to perform. Can be fit, est, or time.) type: string
default: "fit"
-model (The function to perform. Can be dtm or dim.)
type: string default: "dtm"
-ntopics () type: double default: -1
-outname () type: string default: ""
-output_table () type: string default: ""
-params_file (A file containing parameters for this run.) type: string
default: "settings.txt"
-start () type: int32 default: -1
-top_chain_var () type: double default: 0.0050000000000000001
-top_obs_var () type: double default: 0.5

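Enter the following command. The lines starting with # are my own notes. A comment cannot follow a trailing backslash without breaking the line continuation, so the notes sit above the command:

# ./main: the compiled DTM program
# --ntopics=20: fit 20 topics, each tracked across the time windows
# --mode=fit: per the help output above, mode can be fit, est, or time (dtm and dim are values of --model)
./main \
--ntopics=20 \
--mode=fit \
--rng_seed=0 \
--initialize_lda=true \
--corpus_prefix=example/test \
--outname=example/model_run \
--top_chain_var=0.005 \
--alpha=0.01 \
--lda_sequence_min_iter=6 \
--lda_sequence_max_iter=20 \
--lda_max_em_iter=10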
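
When running the same fit repeatedly with different settings, it can be easier to assemble the command from a script. The following Python sketch builds the argument list for the run above. The helper build_dtm_command is hypothetical; the flag names and values are the ones used in this post.

```python
import shlex


def build_dtm_command(binary="./main", **flags):
    """Return the argument list for one DTM run; each flag becomes --name=value."""
    args = [binary]
    for name, value in flags.items():
        if isinstance(value, bool):
            # gflags expects lowercase true/false for boolean flags.
            value = "true" if value else "false"
        args.append(f"--{name}={value}")
    return args


cmd = build_dtm_command(
    ntopics=20,
    mode="fit",
    rng_seed=0,
    initialize_lda=True,
    corpus_prefix="example/test",
    outname="example/model_run",
    top_chain_var=0.005,
    alpha=0.01,
    lda_sequence_min_iter=6,
    lda_sequence_max_iter=20,
    lda_max_em_iter=10,
)
print(shlex.join(cmd))  # the equivalent single-line shell command
# To actually launch it from the dtm directory: subprocess.run(cmd, check=True)
```

Passing the list form to subprocess avoids shell quoting issues and makes it straightforward to loop over, say, several values of ntopics.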

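(3) Output. Once the input files above are ready, running the program produces the files below. The results can be inspected in R and then used for analysis.

a: topic-???-var-e-log-prob.dat

Mainly the e-betas (the distribution of words within each topic for each time period), one row per word.

In the file itself, each line holds just a single number.

An example can be found in dtm\example\model_run\lda-seq. That file appears to have 48,240 lines, which would correspond to 4,824 words with one value for each of the 10 time windows (this is my guess). The author also shows how to view these matrices in R, for example to look up the probability of a given word in a given topic at a given time period.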
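
The same lookup can be sketched in Python. The code below assumes the layout guessed at above: the file is a flat list of log probabilities, one per line, that fills a matrix row by row with one row per word and one column per time window. That layout is an assumption to check against the readme, and the function names are my own.

```python
import math


def reshape_log_probs(values, n_times):
    """Group a flat list of log probabilities into rows of n_times (one row per word)."""
    if len(values) % n_times != 0:
        raise ValueError("number of values is not a multiple of the number of time windows")
    return [values[i:i + n_times] for i in range(0, len(values), n_times)]


def read_log_prob_file(path, n_times):
    """Read a topic-???-var-e-log-prob.dat style file: one number per line."""
    with open(path) as handle:
        values = [float(line) for line in handle if line.strip()]
    return reshape_log_probs(values, n_times)


# Toy data standing in for a real file: 3 words x 2 time windows.
flat = [math.log(p) for p in (0.5, 0.2, 0.3, 0.3, 0.2, 0.5)]
rows = reshape_log_probs(flat, n_times=2)
prob_word1_time0 = math.exp(rows[1][0])  # probability of word 1 at time window 0
column_sums = [sum(math.exp(row[t]) for row in rows) for t in range(2)]
```

If the layout assumption holds for a real output file, each column should sum to roughly 1 after exponentiating, which makes a useful check that the reshape is the right way round.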

b: gam.dat

The gammas, which express the association between documents and topics.
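
To turn the gammas into per-document topic proportions, one option is to normalize each document's values so they sum to 1. The Python sketch below assumes that gam.dat stores one number per line, with the values for each document stored consecutively (n_topics values per document). Both points are assumptions to verify against the readme, and the function name is my own.

```python
def topic_mixtures(gammas, n_topics):
    """Split a flat list of gammas into per-document rows and normalize each row."""
    if len(gammas) % n_topics != 0:
        raise ValueError("number of gammas is not a multiple of n_topics")
    rows = [gammas[i:i + n_topics] for i in range(0, len(gammas), n_topics)]
    return [[g / sum(row) for g in row] for row in rows]


# Toy data standing in for a real gam.dat: 2 documents x 2 topics.
toy_gammas = [1.0, 3.0, 2.0, 2.0]
mixtures = topic_mixtures(toy_gammas, n_topics=2)
dominant_topic = [row.index(max(row)) for row in mixtures]  # most prominent topic per document
```

Combined with the window boundaries from the -seq.dat file, averaging these rows within each window gives a rough picture of how much attention each topic receives over time.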
