MIT Natural Language Processing, Lecture 4: Tagging


I. Introduction

a) The tagging problem
 i. Task: assign each word in a sentence its appropriate part of speech
 ii. 輸入(Input): Our enemies are innovative and resourceful , and so are we. They never stop thinking about new ways to harm our country and our people, and neither do we.
 iii. Output: Our/PRP$ enemies/NNS are/VBP innovative/JJ and/CC resourceful/JJ ,/, and/CC so/RB are/VBP we/PRP ./. They/PRP never/RB stop/VB thinking/VBG about/IN new/JJ ways/NNS to/TO harm/VB our/PRP$ country/NN and/CC our/PRP$ people/NNS ,/, and/CC neither/DT do/VB we/PRP ./.
b) Motivation
 i. Part-of-speech tagging matters for many applications
  1. Parsing
  2. Language modeling
  3. Q&A and information extraction
  4. Text-to-speech
 ii. Tagging techniques carry over to a variety of other tasks
  1. Semantic tagging
  2. Dialogue tagging
c) How do we choose the tag set?
 i. “The definition [of the parts of speech] are very far from having attained the degree of exactitude found in Euclidean geometry” Jespersen, The Philosophy of Grammar
 ii. There is rough agreement on the coarse lexical categories, at least for some languages
  1. Closed classes: prepositions, determiners, pronouns, particles, auxiliary verbs
  2. Open classes: nouns, verbs, adjectives, adverbs
 iii. Many tag sets, at various granularities
  1. Penn tag set (45 tags), Brown tag set (87 tags), CLAWS2 tag set (132 tags)
  2. Example: Penn Treebank tags
   Tag   Description        Example
   CC    conjunction        and, but
   DT    determiner         a, the
   JJ    adjective          red
   NN    noun, sing.        rose
   RB    adverb             quickly
   VBD   verb, past tense   grew
d) Is tagging hard?
 i. Example: “Time flies like an arrow”
 ii. Many words can occur in several different categories
 iii. Yet most words appear predominantly in a single category
  1. A “dumb” tagger that assigns each word its most frequent tag already reaches about 90% accuracy (see the baseline sketch at the end of this subsection)
  2. Are we satisfied with 90% accuracy?
 iv. Sources of information for tagging:
  1. Lexical: look at the word itself
   Word    Noun   Verb   Preposition
   flies   21     23     0
   like    10     30     21
  2. Syntagmatic: look at nearby words
  – which sequence is more likely: “DT JJ NN” or “DT JJ VBP”?
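
To make the baseline concrete, here is a minimal Python sketch of the “dumb” most-frequent-tag tagger; the toy corpus, function names, and the NN default for unknown words are illustrative assumptions, not part of the lecture.

from collections import Counter, defaultdict

def train_baseline(tagged_corpus):
    # Learn each word's most frequent tag from (word, tag) pairs.
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_baseline(words, word2tag, default="NN"):
    # Assign each word its most frequent training tag (the "dumb" tagger).
    return [(w, word2tag.get(w, default)) for w in words]

toy = [("flies", "NNS"), ("flies", "VBZ"), ("flies", "NNS"),
       ("like", "VBP"), ("like", "IN")]
model = train_baseline(toy)
print(tag_baseline(["time", "flies", "like"], model))
# [('time', 'NN'), ('flies', 'NNS'), ('like', 'VBP')]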
II. Transformation-Based Learning (TBL)
a) Overview:
i. TBL sits between purely symbolic and purely corpus-based methods;
ii. TBL exploits a wide range of lexical and syntactic regularities, with very little parameter estimation
iii. The key components of TBL:
1. a specification of the permissible “error-correcting” transformations
2. a learning algorithm
b) Transformations
i. Rewrite rule: tag1 → tag2 if condition C holds
– Templates are hand-selected
ii. Triggering environment (C):
1. tag-triggered
2. word-triggered
3. morphology-triggered
c) Transformation Templates
i. (Figure omitted.)
ii. Note: the templates below are from Eric Brill, who introduced TBL (“Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part of Speech Tagging”, 1995):
1. The preceding (following) word is tagged z.
2. The word two before (after) is tagged z.
3. One of the two preceding (following) words is tagged z.
4. One of the three preceding (following) words is tagged z.
5. The preceding word is tagged z and the following word is tagged w.
6. The preceding (following) word is tagged z and the word two before (after) is tagged w.
When the triggering condition holds, change tag a to tag b, where a, b, z, and w are variables over the set of parts of speech.
iii. Examples:
Source tag   Target tag   Triggering condition
NN           VB           previous tag is TO
VBP          VB           one of the previous three tags is MD
JJR          RBR          next tag is JJ
VBP          VB           one of the previous two words is “n’t”
d) The learning component of TBL:
i. A greedy search for the best sequence of transformations:
1. choose the best transformation;
2. determine the order in which transformations are applied;
e) Algorithm
Notation:
1. Ck — the corpus tagging at iteration k
2. E(Ck) — the number of mistakes in the tagged corpus Ck

C0 := corpus with each word tagged with its most frequent tag
for k := 0 step 1 do
    v := the transformation ui that minimizes E(ui(Ck))
    if (E(Ck) − E(v(Ck))) < ε then break fi
    Ck+1 := v(Ck)
    τk+1 := v
end
Output sequence: τ1, ..., τn
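
The loop above translates into a short Python sketch. Here candidate_rules, apply_rule(rule, tags), and the stopping threshold epsilon are assumed interfaces supplied by the caller; they are not part of the original pseudocode.

def tbl_learn(initial_tags, gold_tags, candidate_rules, apply_rule, epsilon=1):
    # Greedy TBL learning: repeatedly pick the transformation that most
    # reduces the number of tagging errors, then apply it to the corpus.
    def errors(tags):
        return sum(t != g for t, g in zip(tags, gold_tags))

    learned = []
    current = list(initial_tags)                 # C0: initial tagging
    while True:
        # v := the transformation u that minimizes E(u(Ck))
        best = min(candidate_rules,
                   key=lambda u: errors(apply_rule(u, current)))
        if errors(current) - errors(apply_rule(best, current)) < epsilon:
            break                                # gain below threshold: stop
        current = apply_rule(best, current)      # Ck+1 := v(Ck)
        learned.append(best)                     # τk+1 := v
    return learned                               # output sequence τ1, ..., τn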
f) Initialization
i. Alternative approaches:
1. random
2. most frequent tag
3. ...
ii. In practice, TBL is not sensitive to the initial assignment
g) Rule Application:
i. Rules are applied left to right
ii. Immediate vs. delayed effect:
Consider “A → B if the preceding tag is A”
– Immediate: AAAA → ABAB (each change is visible to the positions that follow)
– Delayed: AAAA → ABBB (every position is matched against the original tagging); see the sketch below
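
A small Python sketch contrasting the two regimes for the rule “A → B if the preceding tag is A” (the function names are illustrative):

def apply_immediate(tags, src, dst, trigger):
    # Left to right; each change is visible to the positions that follow.
    out = list(tags)
    for i in range(1, len(out)):
        if out[i] == src and out[i - 1] == trigger:
            out[i] = dst
    return out

def apply_delayed(tags, src, dst, trigger):
    # Every triggering context is evaluated against the original tagging.
    return [dst if i > 0 and tags[i] == src and tags[i - 1] == trigger else t
            for i, t in enumerate(tags)]

print(apply_immediate(list("AAAA"), "A", "B", "A"))  # ['A', 'B', 'A', 'B']
print(apply_delayed(list("AAAA"), "A", "B", "A"))    # ['A', 'B', 'B', 'B']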
h) Rule Selection:
i. we choose a template together with its instantiation;
ii. each rule modifies the given tagging:
1. improving it in some places: $C_{\text{improved}}(\tau)$
2. worsening it in others: $C_{\text{worsened}}(\tau)$
3. leaving the rest of the data untouched
iii. the contribution of a rule is:
$$\text{contrib}(\tau) = C_{\text{improved}}(\tau) - C_{\text{worsened}}(\tau)$$
iv. rule selection at iteration i:
$$\tau_{\text{selected}}(i) = \arg\max_{\tau} \text{contrib}(\tau)$$
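
The contribution score is easy to compute by comparing the current tagging, the proposed tagging, and the gold standard; a sketch, again assuming an apply_rule(rule, tags) helper:

def contrib(rule, current, gold, apply_rule):
    # contrib(rule) = C_improved - C_worsened: the net number of tags fixed.
    proposed = apply_rule(rule, current)
    improved = sum(c != g and p == g
                   for c, p, g in zip(current, proposed, gold))
    worsened = sum(c == g and p != g
                   for c, p, g in zip(current, proposed, gold))
    return improved - worsened

Maximizing this contribution is equivalent to minimizing E(v(Ck)) in the learning loop above, since untouched positions contribute nothing.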
i) The Tagger:
i. Input:
1. untagged data;
2. the rules learned by the learner (S);
ii. Tagging:
1. start from the same initial assignment used by the learner
2. apply every learned rule, preserving the order of application
3. the final intermediate tagging is the output
j) Discussion
i. What is the time complexity of TBL?
ii. Is it possible to build an unsupervised TBL tagger?
k) Relation to Other Models:
i. Probabilistic models:
1. “k-best” tagging;
2. encoding of prior knowledge;
ii. Decision trees:
1. TBL is more powerful (Brill, 1995);
2. TBL is immune to overfitting.
For more on TBL, Chapter 8 of Speech and Language Processing (Jurafsky & Martin) gives a more accessible explanation and a more detailed description of the algorithm.
III. Markov Models
a) Intuition: pick the most likely tag for each word of a sequence
 i. We will model P(T, S), where T is a tag sequence and S is a word sequence
 ii. $$P(T \mid S) = \frac{P(T, S)}{\sum_{T'} P(T', S)}$$
 $$\text{Tagger}(S) = \arg\max_{T \in \mathcal{T}^n} \log P(T \mid S) = \arg\max_{T \in \mathcal{T}^n} \log P(T, S)$$
b) Parameter Estimation
 i. Apply the chain rule:
 $$P(T, S) = \prod_{j=1}^{n} P(T_j \mid S_1, \dots, S_{j-1}, T_1, \dots, T_{j-1}) \cdot P(S_j \mid S_1, \dots, S_{j-1}, T_1, \dots, T_j)$$
 ii. Assume independence (the Markov assumption):
 $$P(T, S) = \prod_{j=1}^{n} P(T_j \mid T_{j-2}, T_{j-1}) \cdot P(S_j \mid T_j)$$
c) Example
 i. They/PRP never/RB stop/VB thinking/VBG about/IN new/JJ ways/NNS to/TO harm/VB our/PRP$ country/NN and/CC our/PRP$ people/NNS ,/, and/CC neither/DT do/VB we/PRP ./.
 ii. P(T, S) = P(PRP | S, S) ∗ P(They | PRP) ∗ P(RB | S, PRP) ∗ P(never | RB) ∗ ..., where S here denotes the sentence-start symbol
d) Estimating Transition Probabilities (by linear interpolation of trigram, bigram, and unigram estimates):
 $$P(T_j \mid T_{j-2}, T_{j-1}) = \lambda_1 \frac{\text{Count}(T_{j-2}, T_{j-1}, T_j)}{\text{Count}(T_{j-2}, T_{j-1})} + \lambda_2 \frac{\text{Count}(T_{j-1}, T_j)}{\text{Count}(T_{j-1})} + \lambda_3 \frac{\text{Count}(T_j)}{\sum_i \text{Count}(T_i)}$$
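
A Python sketch of this interpolated estimate, assuming tag n-gram counters built from the training tag sequences; the λ values shown are placeholders (in practice the weights satisfy λ1 + λ2 + λ3 = 1 and are tuned, e.g. on held-out data):

from collections import Counter

def count_tag_ngrams(tag_seqs):
    # Unigram, bigram, and trigram tag counts, with <s> start padding.
    uni, bi, tri = Counter(), Counter(), Counter()
    for seq in tag_seqs:
        s = ["<s>", "<s>"] + list(seq)
        for i in range(2, len(s)):
            uni[s[i]] += 1
            bi[(s[i - 1], s[i])] += 1
            tri[(s[i - 2], s[i - 1], s[i])] += 1
    return uni, bi, tri

def transition_prob(t2, t1, t0, uni, bi, tri, lambdas=(0.6, 0.3, 0.1)):
    # P(t0 | t2, t1) as a lambda-weighted mix of trigram, bigram,
    # and unigram maximum-likelihood estimates.
    l1, l2, l3 = lambdas
    p3 = tri[(t2, t1, t0)] / bi[(t2, t1)] if bi[(t2, t1)] else 0.0
    p2 = bi[(t1, t0)] / uni[t1] if uni[t1] else 0.0
    p1 = uni[t0] / sum(uni.values())
    return l1 * p3 + l2 * p2 + l3 * p1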
e) Estimating Emission Probabilities
 $$P(S_j \mid T_j) = \frac{\text{Count}(S_j, T_j)}{\text{Count}(T_j)}$$
 i. Problem: unknown or rare words
  1. Proper names
  “King Abdullah of Jordan, the King of Morocco, I mean, there’s a series of places — Qatar, Oman – I mean, places that are developing— Bahrain — they’re all developing the habits of free societies.”
  2. New words
  “They misunderestimated me.”
f) Dealing with Low-Frequency Words
 i. Split the vocabulary into two sets
  1. Frequent words – words occurring more than 5 times in training
  2. Low-frequency words – all other words
 ii. Map the low-frequency words into a small, finite set of classes based on prefixes, suffixes, and similar surface features (a sketch follows)
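
Such a mapping might look like the following sketch; the specific pseudo-word classes and the order in which they are tested are illustrative assumptions:

def map_rare_word(word):
    # Replace a low-frequency or unknown word with a coarse pseudo-word
    # class based on its surface form.
    if word[0].isupper():
        return "<INIT-CAP>"
    if any(ch.isdigit() for ch in word):
        return "<HAS-DIGIT>"
    if word.endswith("ing"):
        return "<SUF-ING>"
    if word.endswith("ed"):
        return "<SUF-ED>"
    return "<OTHER>"

print(map_rare_word("Qatar"))              # <INIT-CAP>
print(map_rare_word("misunderestimated"))  # <SUF-ED>

Emission probabilities for these classes can then be estimated reliably from the pooled counts of all the low-frequency words mapped to each class.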
g) Efficient Tagging
 i. How do we find the most likely tag sequence for a given word sequence?
  1. Brute-force search is hopeless: with N tags and W words it costs N^W.
  2. Idea: use memoization (the Viterbi algorithm)
  – sequences ending in the same tag can be collapsed together, because the next tag depends only on the current tag of the sequence
  (Figure omitted.)
h) The Viterbi Algorithm
 i. Base case:
 $$\pi[0, \text{START}] = \log 1 = 0$$
 $$\pi[0, t_{-1}] = \log 0 = -\infty \quad \text{for all other } t_{-1}$$
 ii. Recursive case:
  1. For i = 1 ... S.length and for all $t_{-1} \in \mathcal{T}$:
 $$\pi[i, t_{-1}] = \max_{t \in \mathcal{T} \cup \{\text{START}\}} \left( \pi[i-1, t] + \log P(t_{-1} \mid t) + \log P(S_i \mid t_{-1}) \right)$$
  2. Backpointers let us recover the maximum-probability sequence:
 $$BP[i, t_{-1}] = \arg\max_{t \in \mathcal{T} \cup \{\text{START}\}} \left( \pi[i-1, t] + \log P(t_{-1} \mid t) + \log P(S_i \mid t_{-1}) \right)$$
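
A bigram-Viterbi sketch of these recurrences in Python; log_trans(prev, tag) and log_emit(tag, word) are assumed callables returning log probabilities (for instance, logs of the estimates from d) and e)), and "<s>" plays the role of START:

def viterbi(words, tags, log_trans, log_emit):
    # pi[i][t]: best log-probability of any tag sequence for words[:i+1]
    # that ends in tag t; bp stores the backpointers.
    n = len(words)
    pi = [{t: float("-inf") for t in tags} for _ in range(n)]
    bp = [{t: None for t in tags} for _ in range(n)]
    for t in tags:                        # base case: transition from START
        pi[0][t] = log_trans("<s>", t) + log_emit(t, words[0])
    for i in range(1, n):
        for t in tags:
            prev = max(tags, key=lambda u: pi[i - 1][u] + log_trans(u, t))
            pi[i][t] = (pi[i - 1][prev] + log_trans(prev, t)
                        + log_emit(t, words[i]))
            bp[i][t] = prev
    seq = [max(tags, key=lambda t: pi[n - 1][t])]
    for i in range(n - 1, 0, -1):         # follow the backpointers
        seq.append(bp[i][seq[-1]])
    return list(reversed(seq))

Because sequences ending in the same tag are collapsed at every position, the running time is O(W · N^2) for W words and N tags, instead of the N^W of brute-force search.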
i) Performance
 i. HMM taggers are very simple to train
 ii. They perform relatively well (over 90% performance on named entities)
 iii. The main difficulty is modeling P(word | tag)
IV. Conclusions
a) Tagging is a relatively easy task, at least for English in a supervised framework
b) Factors that affect tagger performance include:
 i. the amount of training data available
 ii. the tag set
 iii. the difference in vocabulary between the training and testing data
 iv. unknown words
c) The TBL and HMM frameworks can be used for other NLP tasks

