Notes on the basic workflow of building a translation system with Moses. Each stage will later be described in detail, together with the parameters used at each step.


Software requirements:

First you need Moses itself, GIZA++ for word alignment (used when running train-model.perl), and IRSTLM for building the language model.

Overview of the steps:

  1. Prepare the parallel data (it must be sentence-aligned): the corpus has to go through tokenisation, truecasing and cleaning before it can be used by the MT system.
  2. Train your language model (with IRSTLM): this again involves several steps, described in detail below.
  3. Train your translation model (this can take an hour or two). Internally, training runs through: (1) prepare data, (2) run GIZA++, (3) align words, (4) learn lexical translation, (5) extract phrases, (6) score phrases, (7) learn reordering model, (8) learn generation model, (9) create decoder config file.

  4. Then comes the tedious tuning step (you can also tune the weights by hand), which can take several hours (a typical tuning command is sketched after this list).
  5. Finally you can run the decoder. If start-up feels slow, you can convert the models to a binarised form, which loads much faster; this requires a few small changes to the config file, but they are simple (see the binarisation sketch after this list).
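A quick preview of steps 4 and 5, which are not covered again in the detailed sections below. For tuning, a typical command modelled on the manual's baseline walkthrough looks roughly like this; the development set (news-test2008 here) and the ~/working directory are assumptions about your setup, not something produced by the steps in this note:

 cd ~/working
 ~/mosesdecoder/scripts/training/mert-moses.pl \
    ~/corpus/news-test2008.true.fr ~/corpus/news-test2008.true.en \
    ~/mosesdecoder/bin/moses train/model/moses.ini \
    --mertdir ~/mosesdecoder/bin/ &> mert.out &

For binarisation, a minimal sketch, assuming Moses was built with the compact phrase-table tools and that training produced train/model/phrase-table.gz plus a lexicalised reordering table (the exact reordering file name depends on your training options):

 mkdir ~/working/binarised-model
 ~/mosesdecoder/bin/processPhraseTableMin \
    -in ~/working/train/model/phrase-table.gz -nscores 4 \
    -out ~/working/binarised-model/phrase-table
 ~/mosesdecoder/bin/processLexicalTableMin \
    -in ~/working/train/model/reordering-table.wbe-msd-bidirectional-fe.gz \
    -out ~/working/binarised-model/reordering-table

Afterwards the phrase-table and reordering-table entries in moses.ini have to be pointed at the binarised files.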

Detailed steps and explanations:

1. Preparing the corpus:

  First, find a parallel corpus that matches the translation system you want to build, for example English-French news parallel data. The corpus then has to go through three processing steps before it can be used: tokenisation, truecasing and cleaning.

  The first step is tokenisation, i.e. splitting the text into tokens, done with mosesdecoder/scripts/tokenizer/tokenizer.perl. It tokenises every sentence of the parallel corpus (which is really just two text files, one per language).

Example:

 ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \
    < ~/corpus/training/news-commentary-v8.fr-en.en    \
    > ~/corpus/news-commentary-v8.fr-en.tok.en
 ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr \
    < ~/corpus/training/news-commentary-v8.fr-en.fr    \
    > ~/corpus/news-commentary-v8.fr-en.tok.fr

 Note: -l specifies the language of the text. Looking at the script, it appears to support only en, de, fr and it; tokenising Chinese would need additional configuration. The input text and the output location are given via shell redirection, so the "<" and ">" are required!
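As a quick sanity check (the output shown is what I would expect; details can vary between Moses versions), the tokenizer mainly separates punctuation from the surrounding words:

 echo "Hello, world!" | ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en
 # expected output (roughly): Hello , world !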

The full usage is:

Usage ./tokenizer.perl (-l [en|de|...]) (-threads 4) < textfile > tokenizedfile
Options:
  -q     ... quiet.
  -a     ... aggressive hyphen splitting.
  -b     ... disable Perl buffering.
  -time  ... enable processing time calculation.
  -penn  ... use Penn treebank-like tokenization.
  -protected FILE  ... specify file with patters to be protected in tokenisation.
  -no-escape ... don't perform HTML escaping on apostrophy, quotes, etc.

 

Next comes truecasing, i.e. adjusting the case of the words.

Truecasing converts the initial word of each sentence to its most probable casing, which helps reduce data sparsity. The scripts used are mosesdecoder/scripts/recaser/train-truecaser.perl and mosesdecoder/scripts/recaser/truecase.perl. Example:
~/mosesdecoder/scripts/recaser/train-truecaser.perl \
     --model ~/corpus/truecase-model.en --corpus     \
     ~/corpus/news-commentary-v8.fr-en.tok.en
 ~/mosesdecoder/scripts/recaser/train-truecaser.perl \
     --model ~/corpus/truecase-model.fr --corpus     \
     ~/corpus/news-commentary-v8.fr-en.tok.fr

 ~/mosesdecoder/scripts/recaser/truecase.perl \
    --model ~/corpus/truecase-model.en \
    < ~/corpus/news-commentary-v8.fr-en.tok.en \
    > ~/corpus/news-commentary-v8.fr-en.true.en
 ~/mosesdecoder/scripts/recaser/truecase.perl \
    --model ~/corpus/truecase-model.fr \
    < ~/corpus/news-commentary-v8.fr-en.tok.fr \
    > ~/corpus/news-commentary-v8.fr-en.true.fr

 The train-truecaser script trains the truecasing model. Its input is the tokenised corpus file; the model it produces is essentially a list of words with the frequencies of their different surface forms, which is later used to rewrite the first word of each sentence.

The USER GUIDE in the manual says:

Instead of lowercasing all training and test data, we may also want to keep words in their natural case, and only change the words at the beginning of their sentence to their most frequent form. This is what we mean by truecasing. Again, this requires first the training of a truecasing model, which is a list of words and the frequency of their different forms.

 The final step is then to truecase the corpus with the model we just trained:

truecase.perl --model MODEL [-b] < in > out

 -b means unbuffered output; I am not yet sure when it is useful.

 For Chinese, the truecasing step should not be needed.

 

The last preprocessing step is cleaning, which drops empty lines and overlong sentences (together with their counterparts on the other side), since these can cause problems for GIZA++ training. The trailing 1 and 80 in the command below are the minimum and maximum sentence lengths.

Finally we clean, limiting sentence length to 80:

 ~/mosesdecoder/scripts/training/clean-corpus-n.perl \
    ~/corpus/news-commentary-v8.fr-en.true fr en \
    ~/corpus/news-commentary-v8.fr-en.clean 1 80

 


 

2. Training the language model (with IRSTLM)

Now that the corpus is finally ready, we can move on to the next topic: training the language model.

The most basic role of the language model is to make your output more fluent, more like native text. It is trained on target-language text only (no sentence alignment is needed); here we simply reuse the truecased English side of our training corpus, although adding extra monolingual data in the target language usually helps more than relying on the same corpus used for the translation model.

Following the baseline system description in the manual, we use the following tools to train the language model.

 

First, add-start-end.sh:

 ~/irstlm/bin/add-start-end.sh                 \
   < ~/corpus/news-commentary-v8.fr-en.true.en \
   > news-commentary-v8.fr-en.sb.en

  This wraps every sentence of your corpus with sentence-start and sentence-end markers (in practice, the <s> and </s> tag pair).
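As a quick illustration of what this does (assuming the <s>/</s> markers described above; exact whitespace may differ):

 echo "this is a small test" | ~/irstlm/bin/add-start-end.sh
 # expected output (roughly): <s> this is a small test </s>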

Next, use build-lm.sh to build the language model itself; it writes the LM out as a gzipped file, which is why it is referred to as news-commentary-v8.fr-en.lm.en.gz below:

export IRSTLM=$HOME/irstlm; ~/irstlm/bin/build-lm.sh \
   -i news-commentary-v8.fr-en.sb.en                  \
   -t ./tmp  -p -s improved-kneser-ney -o news-commentary-v8.fr-en.lm.en

Finally, compile the LM into a plain-text ARPA file:

~/irstlm/bin/compile-lm  \
   --text  \
   news-commentary-v8.fr-en.lm.en.gz \
   news-commentary-v8.fr-en.arpa.en

Note: do not put "yes" after --text here; the manual gets this wrong. This produces an ARPA file, which can then be queried or converted into a binary IRSTLM or KenLM model (I always binarise with KenLM, because for some reason the IRSTLM binary model did not work for me).

You can directly create an IRSTLM binary LM (for faster loading in Moses) by replacing the last command with the following:

 ~/irstlm/bin/compile-lm news-commentary-v8.fr-en.lm.en.gz \
   news-commentary-v8.fr-en.blm.en


You can transform an arpa LM (*.arpa.en file) into an IRSTLM binary LM as follows:

 ~/irstlm/bin/compile-lm \
   news-commentary-v8.fr-en.arpa.en \
   news-commentary-v8.fr-en.blm.en


or vice versa, you can transform an IRSTLM binary LM into an arpa LM as follows:

 ~/irstlm/bin/compile-lm \
   --text yes \
   news-commentary-v8.fr-en.blm.en \
   news-commentary-v8.fr-en.arpa.en


This instead binarises (for faster loading) the *.arpa.en file using KenLM:

 ~/mosesdecoder/bin/build_binary \
   news-commentary-v8.fr-en.arpa.en \
   news-commentary-v8.fr-en.blm.en


You can check the language model by querying it, e.g.

 $ echo "is this an English sentence ?"                       \
   | ~/mosesdecoder/bin/query news-commentary-v8.fr-en.blm.en

3. Training the translation model

First, a look at the training parameters (an example invocation is given after the list):


 

Reference: All Training Parameters

 

  • --root-dir -- root directory, where output files are stored
  • --corpus -- corpus file name (full pathname), excluding extension
  • --e -- extension of the English corpus file
  • --f -- extension of the foreign corpus file
  • --lm -- language model: <factor>:<order>:<filename> (option can be repeated)
  • --first-step -- first step in the training process (default 1)
  • --last-step -- last step in the training process (default 7)
  • --parts -- break up corpus in smaller parts before GIZA++ training
  • --corpus-dir -- corpus directory (default $ROOT/corpus)
  • --lexical-dir -- lexical translation probability directory (default $ROOT/model)
  • --model-dir -- model directory (default $ROOT/model)
  • --extract-file -- extraction file (default $ROOT/model/extract)
  • --giza-f2e -- GIZA++ directory (default $ROOT/giza.$F-$E)
  • --giza-e2f -- inverse GIZA++ directory (default $ROOT/giza.$E-$F)
  • --alignment -- heuristic used for word alignment: intersect, union, grow, grow-final, grow-diag, grow-diag-final (default), grow-diag-final-and, srctotgt, tgttosrc
  • --max-phrase-length -- maximum length of phrases entered into phrase table (default 7)
  • --giza-option -- additional options for GIZA++ training
  • --verbose -- prints additional word alignment information
  • --no-lexical-weighting -- only use conditional probabilities for the phrase table, not lexical weighting
  • --parts -- prepare data for GIZA++ by running snt2cooc in parts
  • --direction -- run training step 2 only in direction 1 or 2 (for parallelization)
  • --reordering -- specifies which reordering models to train using a comma-separated list of config-strings, see FactoredTraining.BuildReorderingModel. (default distance)
  • --reordering-smooth -- specifies the smoothing constant to be used for training lexicalized reordering models. If the letter "u" follows the constant, smoothing is based on actual counts. (default 0.5)
  • --alignment-factors --
  • --translation-factors --
  • --reordering-factors --
  • --generation-factors --
  • --decoding-steps --
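
For reference, here is a typical train-model.perl invocation, modelled on the baseline walkthrough in the manual. The ~/working and ~/lm directories are assumptions about your layout, and the LM file is the KenLM-binarised model built in the previous section (adjust the path to wherever yours actually lives):

 mkdir ~/working
 cd ~/working
 nohup nice ~/mosesdecoder/scripts/training/train-model.perl -root-dir train \
    -corpus ~/corpus/news-commentary-v8.fr-en.clean \
    -f fr -e en -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
    -lm 0:3:$HOME/lm/news-commentary-v8.fr-en.blm.en:8 \
    -external-bin-dir ~/mosesdecoder/tools >& training.out &

The -lm argument follows the <factor>:<order>:<filename> pattern above, with a trailing :8 selecting the KenLM implementation; -external-bin-dir must point to the directory containing the GIZA++ and snt2cooc binaries.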

 

