Notes on the basic workflow of building a translation system with Moses. Each stage will later be described in detail, together with the parameters used at each step.


Software requirements:

First you need Moses itself, GIZA++ for word alignment (used when running train-model.perl), and IRSTLM for building the language model.

Overview of the steps:

  1. Prepare the parallel data (it must be sentence-aligned): the corpus has to go through tokenisation, truecasing and cleaning before it can be used by the MT system.
  2. Train your language model (with IRSTLM): this again involves several steps, described in detail below.
  3. Train your translation model (this can take an hour or two). Internally, training runs through: (1) prepare data, (2) run GIZA++, (3) align words, (4) learn lexical translation, (5) extract phrases, (6) score phrases, (7) learn reordering model, (8) learn generation model, (9) create decoder config file.

  4. Then comes the tedious tuning step (you can also tune the weights by hand), which can take several hours (a typical tuning command is sketched after this list).
  5. Finally you can run the decoder. If start-up feels slow, you can convert the models to a binarised form, which loads much faster; this requires a few small changes to the config file, but they are simple (see the binarisation sketch after this list).
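A quick preview of steps 4 and 5, which are not covered again in the detailed sections below. For tuning, a typical command modelled on the manual's baseline walkthrough looks roughly like this; the development set (news-test2008 here) and the ~/working directory are assumptions about your setup, not something produced by the steps in this note:

 cd ~/working
 ~/mosesdecoder/scripts/training/mert-moses.pl \
    ~/corpus/news-test2008.true.fr ~/corpus/news-test2008.true.en \
    ~/mosesdecoder/bin/moses train/model/moses.ini \
    --mertdir ~/mosesdecoder/bin/ &> mert.out &

For binarisation, a minimal sketch, assuming Moses was built with the compact phrase-table tools and that training produced train/model/phrase-table.gz plus a lexicalised reordering table (the exact reordering file name depends on your training options):

 mkdir ~/working/binarised-model
 ~/mosesdecoder/bin/processPhraseTableMin \
    -in ~/working/train/model/phrase-table.gz -nscores 4 \
    -out ~/working/binarised-model/phrase-table
 ~/mosesdecoder/bin/processLexicalTableMin \
    -in ~/working/train/model/reordering-table.wbe-msd-bidirectional-fe.gz \
    -out ~/working/binarised-model/reordering-table

Afterwards the phrase-table and reordering-table entries in moses.ini have to be pointed at the binarised files.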

Detailed steps and explanations:

1. Preparing the corpus:

  First, find a parallel corpus that matches the translation system you want to build, for example English-French news parallel data. The corpus then has to go through three processing steps before it can be used: tokenisation, truecasing and cleaning.

  The first step is tokenisation, i.e. splitting the text into tokens, done with mosesdecoder/scripts/tokenizer/tokenizer.perl. It tokenises every sentence of the parallel corpus (which is really just two text files, one per language).

Example:

 ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \
    < ~/corpus/training/news-commentary-v8.fr-en.en    \
    > ~/corpus/news-commentary-v8.fr-en.tok.en
 ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr \
    < ~/corpus/training/news-commentary-v8.fr-en.fr    \
    > ~/corpus/news-commentary-v8.fr-en.tok.fr

 Note: -l specifies the language of the text. Looking at the script, it appears to support only en, de, fr and it; tokenising Chinese would need additional configuration. The input text and the output location are given via shell redirection, so the "<" and ">" are required!
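As a quick sanity check (the output shown is what I would expect; details can vary between Moses versions), the tokenizer mainly separates punctuation from the surrounding words:

 echo "Hello, world!" | ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en
 # expected output (roughly): Hello , world !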

The full usage is:

Usage ./tokenizer.perl (-l [en|de|...]) (-threads 4) < textfile > tokenizedfile
Options:
  -q     ... quiet.
  -a     ... aggressive hyphen splitting.
  -b     ... disable Perl buffering.
  -time  ... enable processing time calculation.
  -penn  ... use Penn treebank-like tokenization.
  -protected FILE  ... specify file with patters to be protected in tokenisation.
  -no-escape ... don't perform HTML escaping on apostrophy, quotes, etc.

 

Next comes truecasing, i.e. adjusting the case of the words.

Truecasing converts the initial word of each sentence to its most probable casing, which helps reduce data sparsity. The scripts used are mosesdecoder/scripts/recaser/train-truecaser.perl and mosesdecoder/scripts/recaser/truecase.perl. Example:
~/mosesdecoder/scripts/recaser/train-truecaser.perl \
     --model ~/corpus/truecase-model.en --corpus     \
     ~/corpus/news-commentary-v8.fr-en.tok.en
 ~/mosesdecoder/scripts/recaser/train-truecaser.perl \
     --model ~/corpus/truecase-model.fr --corpus     \
     ~/corpus/news-commentary-v8.fr-en.tok.fr

 ~/mosesdecoder/scripts/recaser/truecase.perl \
    --model ~/corpus/truecase-model.en \
    < ~/corpus/news-commentary-v8.fr-en.tok.en \
    > ~/corpus/news-commentary-v8.fr-en.true.en
 ~/mosesdecoder/scripts/recaser/truecase.perl \
    --model ~/corpus/truecase-model.fr \
    < ~/corpus/news-commentary-v8.fr-en.tok.fr \
    > ~/corpus/news-commentary-v8.fr-en.true.fr

 The train-truecaser script trains the truecasing model. Its input is the tokenised corpus file; the model it produces is essentially a list of words with the frequencies of their different surface forms, which is later used to rewrite the first word of each sentence.

The USER GUIDE in the manual says:

Instead of lowercasing all training and test data, we may also want to keep words in their natural case, and only change the words at the beginning of their sentence to their most frequent form. This is what we mean by truecasing. Again, this requires first the training of a truecasing model, which is a list of words and the frequency of their different forms.

 The final step is then to truecase the corpus with the model we just trained:

truecase.perl --model MODEL [-b] < in > out

 -b means unbuffered output; I am not yet sure when it is useful.

 For Chinese, the truecasing step should not be needed.

 

The last preprocessing step is cleaning, which drops empty lines and overlong sentences (together with their counterparts on the other side), since these can cause problems for GIZA++ training. The trailing 1 and 80 in the command below are the minimum and maximum sentence lengths.

Finally we clean, limiting sentence length to 80:

 ~/mosesdecoder/scripts/training/clean-corpus-n.perl \
    ~/corpus/news-commentary-v8.fr-en.true fr en \
    ~/corpus/news-commentary-v8.fr-en.clean 1 80

 


 

2. Training the language model (with IRSTLM)

Now that the corpus is finally ready, we can move on to the next topic: training the language model.

The most basic role of the language model is to make your output more fluent, more like native text. It is trained on target-language text only (no sentence alignment is needed); here we simply reuse the truecased English side of our training corpus, although adding extra monolingual data in the target language usually helps more than relying on the same corpus used for the translation model.

Following the baseline system description in the manual, we use the following tools to train the language model.

 

First, add-start-end.sh:

 ~/irstlm/bin/add-start-end.sh                 \
   < ~/corpus/news-commentary-v8.fr-en.true.en \
   > news-commentary-v8.fr-en.sb.en

  This wraps every sentence of your corpus with sentence-start and sentence-end markers (in practice, the <s> and </s> tag pair).
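As a quick illustration of what this does (assuming the <s>/</s> markers described above; exact whitespace may differ):

 echo "this is a small test" | ~/irstlm/bin/add-start-end.sh
 # expected output (roughly): <s> this is a small test </s>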

Next, use build-lm.sh to build the language model itself; it writes the LM out as a gzipped file, which is why it is referred to as news-commentary-v8.fr-en.lm.en.gz below:

export IRSTLM=$HOME/irstlm; ~/irstlm/bin/build-lm.sh \
   -i news-commentary-v8.fr-en.sb.en                  \
   -t ./tmp  -p -s improved-kneser-ney -o news-commentary-v8.fr-en.lm.en

Finally, compile the LM into a plain-text ARPA file:

~/irstlm/bin/compile-lm  \
   --text  \
   news-commentary-v8.fr-en.lm.en.gz \
   news-commentary-v8.fr-en.arpa.en

Note: do not put "yes" after --text here; the manual gets this wrong. This produces an ARPA file, which can then be queried or converted into a binary IRSTLM or KenLM model (I always binarise with KenLM, because for some reason the IRSTLM binary model did not work for me).

You can directly create an IRSTLM binary LM (for faster loading in Moses) by replacing the last command with the following:

 ~/irstlm/bin/compile-lm news-commentary-v8.fr-en.lm.en.gz \
   news-commentary-v8.fr-en.blm.en


You can transform an arpa LM (*.arpa.en file) into an IRSTLM binary LM as follows:

 ~/irstlm/bin/compile-lm \
   news-commentary-v8.fr-en.arpa.en \
   news-commentary-v8.fr-en.blm.en


or vice versa, you can transform an IRSTLM binary LM into an arpa LM as follows:

 ~/irstlm/bin/compile-lm \
   --text yes \
   news-commentary-v8.fr-en.blm.en \
   news-commentary-v8.fr-en.arpa.en


This instead binarises (for faster loading) the *.arpa.en file using KenLM:

 ~/mosesdecoder/bin/build_binary \
   news-commentary-v8.fr-en.arpa.en \
   news-commentary-v8.fr-en.blm.en


You can check the language model by querying it, e.g.

 $ echo "is this an English sentence ?"                       \
   | ~/mosesdecoder/bin/query news-commentary-v8.fr-en.blm.en

3. Training the translation model

First, a look at the training parameters (an example invocation is given after the list):


 

Reference: All Training Parameters

 

  • --root-dir -- root directory, where output files are stored
  • --corpus -- corpus file name (full pathname), excluding extension
  • --e -- extension of the English corpus file
  • --f -- extension of the foreign corpus file
  • --lm -- language model: <factor>:<order>:<filename> (option can be repeated)
  • --first-step -- first step in the training process (default 1)
  • --last-step -- last step in the training process (default 7)
  • --parts -- break up corpus in smaller parts before GIZA++ training
  • --corpus-dir -- corpus directory (default $ROOT/corpus)
  • --lexical-dir -- lexical translation probability directory (default $ROOT/model)
  • --model-dir -- model directory (default $ROOT/model)
  • --extract-file -- extraction file (default $ROOT/model/extract)
  • --giza-f2e -- GIZA++ directory (default $ROOT/giza.$F-$E)
  • --giza-e2f -- inverse GIZA++ directory (default $ROOT/giza.$E-$F)
  • --alignment -- heuristic used for word alignment: intersect, union, grow, grow-final, grow-diag, grow-diag-final (default), grow-diag-final-and, srctotgt, tgttosrc
  • --max-phrase-length -- maximum length of phrases entered into phrase table (default 7)
  • --giza-option -- additional options for GIZA++ training
  • --verbose -- prints additional word alignment information
  • --no-lexical-weighting -- only use conditional probabilities for the phrase table, not lexical weighting
  • --parts -- prepare data for GIZA++ by running snt2cooc in parts
  • --direction -- run training step 2 only in direction 1 or 2 (for parallelization)
  • --reordering -- specifies which reordering models to train using a comma-separated list of config-strings, see FactoredTraining.BuildReorderingModel. (default distance)
  • --reordering-smooth -- specifies the smoothing constant to be used for training lexicalized reordering models. If the letter "u" follows the constant, smoothing is based on actual counts. (default 0.5)
  • --alignment-factors --
  • --translation-factors --
  • --reordering-factors --
  • --generation-factors --
  • --decoding-steps --
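
For reference, here is a typical train-model.perl invocation, modelled on the baseline walkthrough in the manual. The ~/working and ~/lm directories are assumptions about your layout, and the LM file is the KenLM-binarised model built in the previous section (adjust the path to wherever yours actually lives):

 mkdir ~/working
 cd ~/working
 nohup nice ~/mosesdecoder/scripts/training/train-model.perl -root-dir train \
    -corpus ~/corpus/news-commentary-v8.fr-en.clean \
    -f fr -e en -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
    -lm 0:3:$HOME/lm/news-commentary-v8.fr-en.blm.en:8 \
    -external-bin-dir ~/mosesdecoder/tools >& training.out &

The -lm argument follows the <factor>:<order>:<filename> pattern above, with a trailing :8 selecting the KenLM implementation; -external-bin-dir must point to the directory containing the GIZA++ and snt2cooc binaries.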

 

