NLP (7): Information Extraction and Text Classification


Original article: http://www.one2know.cn/nlp7/

  • Named entities
    Proper nouns: person names, place names, product names

    Example sentence                                      Named entities
    Hampi is on the south bank of the Tungabhadra river   Hampi, Tungabhadra River
    Paris is famous for fashion                           Paris
    Burj Khalifa is one of the skyscrapers in Dubai       Burj Khalifa, Dubai
    Jeff Weiner is the CEO of LinkedIn                    Jeff Weiner, LinkedIn

A named entity is a noun that refers to a single, unique thing.
Typical categories: TIMEZONE, LOCATION, RIVERS, COSMETICS, CURRENCY, DATE, TIME, PERSON

  • Recognizing named entities with NLTK
    The data used here has already gone through the following preprocessing (covered earlier):
    1. Split the large document into sentences
    2. Split each sentence into words
    3. POS-tag each sentence
    4. Extract chunks (phrases) of consecutive, non-overlapping words from the sentences
    5. Tag the words in these chunks with IOB labels (see the sketch after this list)
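    The following is a minimal sketch (not part of the original walkthrough) showing what IOB labels look like, using nltk.chunk.tree2conlltags on the chunked treebank corpus:

import nltk

# Sketch: convert a chunked sentence into (word, POS tag, IOB label) triples.
# B-NP marks the first word of a noun-phrase chunk, I-NP the following words,
# and O marks words outside any chunk.
chunked_sent = nltk.corpus.treebank_chunk.chunked_sents()[0]
print(nltk.chunk.tree2conlltags(chunked_sent))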
    Analyzing the treebank corpus:
import nltk

def sampleNE():
    sent = nltk.corpus.treebank.tagged_sents()[0] # first tagged sentence of the corpus
    print(nltk.ne_chunk(sent)) # nltk.ne_chunk() identifies the named entities in a tagged sentence

def sampleNE2():
    sent = nltk.corpus.treebank.tagged_sents()[0]
    print(nltk.ne_chunk(sent,binary=True))  # binary=True marks named entities without assigning a category

if __name__ == "__main__":
    sampleNE()
    sampleNE2()

Output:

(S
  (PERSON Pierre/NNP)
  (ORGANIZATION Vinken/NNP)
  ,/,
  61/CD
  years/NNS
  old/JJ
  ,/,
  will/MD
  join/VB
  the/DT
  board/NN
  as/IN
  a/DT
  nonexecutive/JJ
  director/NN
  Nov./NNP
  29/CD
  ./.)
(S
  (NE Pierre/NNP Vinken/NNP)
  ,/,
  61/CD
  years/NNS
  old/JJ
  ,/,
  will/MD
  join/VB
  the/DT
  board/NN
  as/IN
  a/DT
  nonexecutive/JJ
  director/NN
  Nov./NNP
  29/CD
  ./.)
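The chunk tree returned by nltk.ne_chunk can also be traversed programmatically instead of printed. Below is a minimal sketch (the helper name extractEntities is my own) that collects (entity, label) pairs from the tree:

import nltk

def extractEntities(taggedSentence):
    entities = []
    for node in nltk.ne_chunk(taggedSentence):
        # named-entity chunks are subtrees; plain (word, tag) tuples are skipped
        if isinstance(node, nltk.Tree):
            entity = " ".join(word for word, tag in node.leaves())
            entities.append((entity, node.label()))
    return entities

if __name__ == "__main__":
    sent = nltk.corpus.treebank.tagged_sents()[0]
    print(extractEntities(sent))  # e.g. [('Pierre', 'PERSON'), ('Vinken', 'ORGANIZATION')]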
  • Creating a dictionary, a reverse dictionary, and using them
    Dictionary: map each word to its POS tag so it can be looked up efficiently later; the reverse dictionary maps each POS tag back to the words that carry it.
import nltk

class LearningDictionary():
    def __init__(self,sentence): # runs on instantiation and builds both dictionaries
        self.words = nltk.word_tokenize(sentence)
        self.tagged = nltk.pos_tag(self.words)
        self.buildDictionary()
        self.buildReverseDictionary()

    # store each word with its POS tag in the dictionary
    def buildDictionary(self):
        self.dictionary = {}
        for (word,pos) in self.tagged:
            self.dictionary[word] = pos

    # build a second dictionary with keys and values swapped (POS tag => list of words)
    def buildReverseDictionary(self):
        self.rdictionary = {}
        for key in self.dictionary.keys():
            value = self.dictionary[key]
            if value not in self.rdictionary:
                self.rdictionary[value] = [key]
            else:
                self.rdictionary[value].append(key)

    # check whether a word is in the dictionary
    def isWordPresent(self,word):
        return 'Yes' if word in self.dictionary else 'No'

    # word => POS tag
    def getPOSForWord(self,word):
        return self.dictionary[word] if word in self.dictionary else None

    # POS tag => words
    def getWordsForPOS(self,pos):
        return self.rdictionary[pos] if pos in self.rdictionary else None

# Test
if __name__ == "__main__":
    # instantiate an object with this sentence
    sentence = 'All the flights got delayed due to bad weather'
    learning = LearningDictionary(sentence)

    words = ['chair','flights','delayed','pencil','weather']
    pos = ['NN','VBS','NNS']
    for word in words:
        status = learning.isWordPresent(word)
        print("Is '{}' present in the dictionary? : '{}'".format(word,status))
        if status == 'Yes':
            print("\tPOS for '{}' is '{}'".format(word,learning.getPOSForWord(word)))
    for pword in pos:
        print("POS '{}' has '{}' words".format(pword,learning.getWordsForPOS(pword)))

Output:

Is 'chair' present in the dictionary? : 'No'
Is 'flights' present in the dictionary? : 'Yes'
	POS for 'flights' is 'NNS'
Is 'delayed' present in the dictionary? : 'Yes'
	POS for 'delayed' is 'VBN'
Is 'pencil' present in the dictionary? : 'No'
Is 'weather' present in the dictionary? : 'Yes'
	POS for 'weather' is 'NN'
POS 'NN' has '['weather']' words
POS 'VBS' has 'None' words
POS 'NNS' has '['flights']' words
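As a side note, the reverse dictionary is a one-to-many mapping, so the membership check in buildReverseDictionary can be avoided with collections.defaultdict. The following is only an alternative sketch of the same idea, not part of the original class:

from collections import defaultdict

def buildReverseDictionary(dictionary):
    # invert a word -> POS dict into a POS -> [words] dict;
    # missing keys start out as an empty list, so no membership check is needed
    rdictionary = defaultdict(list)
    for word, pos in dictionary.items():
        rdictionary[pos].append(word)
    return dict(rdictionary)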
  • Feature set selection
import nltk
import random

sampledata = [
    ('KA-01-F 1034 A','rtc'),
    ('KA-02-F 1030 B','rtc'),
    ('KA-03-FA 1200 C','rtc'),
    ('KA-01-G 0001 A','gov'),
    ('KA-02-G 1004 A','gov'),
    ('KA-03-G 0204 A','gov'),
    ('KA-04-G 9230 A','gov'),
    ('KA-27 1290','oth')
]
random.shuffle(sampledata) # shuffle the sample data
testdata = [
    'KA-01-G 0109',
    'KA-02-F 9020 AC',
    'KA-02-FA 0801',
    'KA-01 9129'
]

def learnSimpleFeatures():
    def vehicleNumberFeature(vnumber):
        return {'vehicle_class':vnumber[6]} # use the 7th character as the only feature
    # list of (feature dict built from the 7th character, class label) tuples
    featuresets = [(vehicleNumberFeature(vn),cls) for (vn,cls) in sampledata]
    # train a naive Bayes classifier and keep it in classifier
    classifier = nltk.NaiveBayesClassifier.train(featuresets)
    # classify the test data
    for num in testdata:
        feature = vehicleNumberFeature(num)
        print('(simple) %s is type of %s'%(num,classifier.classify(feature)))

def learnFeatures(): # use the 6th and 7th characters as features
    def vehicleNumberFeature(vnumber):
        return {
            'vehicle_class':vnumber[6],
            'vehicle_prev':vnumber[5],
        }
    featuresets = [(vehicleNumberFeature(vn),cls) for (vn,cls) in sampledata]
    classifier = nltk.NaiveBayesClassifier.train(featuresets)
    for num in testdata:
        feature = vehicleNumberFeature(num)
        print('(dual) %s is type of %s'%(num,classifier.classify(feature)))

if __name__ == "__main__":
    learnSimpleFeatures()
    learnFeatures()

Output:

(simple) KA-01-G 0109 is type of gov
(simple) KA-02-F 9020 AC is type of rtc
(simple) KA-02-FA 0801 is type of rtc
(simple) KA-01 9129 is type of gov
(dual) KA-01-G 0109 is type of gov
(dual) KA-02-F 9020 AC is type of rtc
(dual) KA-02-FA 0801 is type of rtc
(dual) KA-01 9129 is type of oth
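To see which features drive these decisions, the trained NaiveBayesClassifier can report its most informative features. The sketch below is an optional addition, not part of the original script; it assumes sampledata from above is in scope, vehicleFeature is my own helper name, and its output varies with the random shuffle:

import nltk

# Sketch: inspect which features the dual-feature classifier relies on.
def vehicleFeature(vnumber):
    return {
        'vehicle_class':vnumber[6],
        'vehicle_prev':vnumber[5],
    }

featuresets = [(vehicleFeature(vn),cls) for (vn,cls) in sampledata]
classifier = nltk.NaiveBayesClassifier.train(featuresets)
# prints feature/value pairs ranked by how strongly they separate the classes
classifier.show_most_informative_features(5)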
  • Segmenting sentences with a classifier
    Heuristic: a sentence ends with '.' and the next word starts with an uppercase letter
import nltk

# Feature extractor: returns (feature dict, label = whether the next word starts with an uppercase letter)
def featureExtractor(words,i):
    return ({'current-word':words[i],'next-is-upper':words[i+1][0].isupper()},words[i+1][0].isupper())

# Build the feature sets from a text
def getFeaturesets(sentence):
    words = nltk.word_tokenize(sentence) # tokenize the text into a list of words
    featuresets = [featureExtractor(words,i) for i in range(1,len(words)-1) if words[i] == '.']
    return featuresets

# Split a text into sentences and print them
def segmentTextAndPrintSentences(data):
    words = nltk.word_tokenize(data) # tokenize the whole text
    for i in range(0,len(words)-1):
        if words[i] == '.':
            if classifier.classify(featureExtractor(words,i)[0]) == True:
                print(".")
            else:
                print(words[i],end='')
        else:
            print("{} ".format(words[i]),end='')
    print(words[-1]) # print the final token
traindata = "The train and test data consist of three columns separated by spaces.Each word has been put on a separate line and there is an empty line after each sentence. The first column contains the current word, the second its part-of-speech tag as derived by the Brill tagger and the third its chunk tag as derived from the WSJ corpus. The chunk tags contain the name of the chunk type, for example I-NP for noun phrase words and I-VP for verb phrase words. Most chunk types have two types of chunk tags, B-CHUNK for the first word of the chunk and I-CHUNK for each other word in the chunk. Here is an example of the file format."
testdata = "The baseline result was obtained by selecting the chunk tag which was most frequently associated with the current part-of-speech tag. At the workshop, all 11 systems outperformed the baseline. Most of them (six of the eleven) obtained an F-score between 91.5 and 92.5. Two systems performed a lot better: Support Vector Machines used by Kudoh and Matsumoto [KM00] and Weighted Probability Distribution Voting used by Van Halteren [Hal00]. The papers associated with the participating systems can be found in the reference section below."
traindataset = getFeaturesets(traindata)
classifier = nltk.NaiveBayesClassifier.train(traindataset)
segmentTextAndPrintSentences(testdata)

Output:

The baseline result was obtained by selecting the chunk tag which was most frequently associated with the current part-of-speech tag .
At the workshop , all 11 systems outperformed the baseline .
Most of them ( six of the eleven ) obtained an F-score between 91.5 and 92.5 .
Two systems performed a lot better : Support Vector Machines used by Kudoh and Matsumoto [ KM00 ] and Weighted Probability Distribution Voting used by Van Halteren [ Hal00 ] .
The papers associated with the participating systems can be found in the reference section below .
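For comparison, NLTK also ships a pre-trained Punkt sentence tokenizer that does the same job without training a classifier. This is only a side-by-side sketch, assuming the testdata string from the script above is in scope:

import nltk

# Sketch: segment the same test text with the built-in Punkt tokenizer.
for sentence in nltk.sent_tokenize(testdata):
    print(sentence)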
  • Text classification
    Example: classifying RSS (Rich Site Summary) feeds
import nltk
import random
import feedparser

# two Yahoo Sports RSS feeds
urls = {
    'mlb':'http://sports.yahoo.com/mlb/rss.xml',
    'nfl':'http://sports.yahoo.com/nfl/rss.xml',
}

feedmap = {} # dictionary holding the parsed RSS feeds
stopwords = nltk.corpus.stopwords.words('english') # English stopwords

# Take a list of words and return a feature dict: each non-stopword becomes a key with value True
def featureExtractor(words):
    features = {}
    for word in words:
        if word not in stopwords:
            features["word({})".format(word)] = True
    return features

# list for the labelled sentences
sentences = []

for category in urls.keys():
    feedmap[category] = feedparser.parse(urls[category]) # download the feed and store it in feedmap
    print("downloading {}".format(urls[category]))
    for entry in feedmap[category]['entries']: # iterate over all RSS entries
        data = entry['summary']
        words = data.split()
        sentences.append((category,words)) # store the category together with the words as a tuple

# convert each (category, word list) pair into a (feature dict, category) pair
featuresets = [(featureExtractor(words),category) for category,words in sentences]

# shuffle, then use half for training and half for testing
random.shuffle(featuresets)
total = len(featuresets)
off = int(total/2)
trainset = featuresets[off:]
testset = featuresets[:off]

# build a classifier with NaiveBayesClassifier.train()
classifier = nltk.NaiveBayesClassifier.train(trainset)

# print the accuracy on the test set
print(nltk.classify.accuracy(classifier,testset))

# print the most informative features
classifier.show_most_informative_features(5)

for (i,entry) in enumerate(feedmap['nfl']['entries']):
    if i < 4: # test on the first 4 nfl entries
        features = featureExtractor(entry['title'].split())
        category = classifier.classify(features)
        print('{} -> {}'.format(category,entry['summary']))

Output:

downloading http://sports.yahoo.com/mlb/rss.xml
downloading http://sports.yahoo.com/nfl/rss.xml
0.9148936170212766
Most Informative Features
               word(NFL) = True              nfl : mlb    =      8.6 : 1.0
       word(quarterback) = True              nfl : mlb    =      3.7 : 1.0
              word(team) = True              nfl : mlb    =      2.9 : 1.0
               word(two) = True              mlb : nfl    =      2.4 : 1.0
         word(Wednesday) = True              mlb : nfl    =      2.4 : 1.0
nfl -> The Cowboys RB will not be suspended for his role in an incident in May in Las Vegas.
nfl -> Giants defensive lineman Dexter Lawrence was 6 years old when Eli Manning began his NFL career. Manning is entering his 16th season, while Lawrence is arriving as a first-round draft pick. Age isn't always "just a number." "In the locker room, I feel their age," Manning said,
nfl -> Hue Jackson compiled a 3-36-1 record in two-and-a-half seasons with the Cleveland Browns before later joining division rival the Cincinnati Bengals.
nfl -> NFL Network's David Carr and free agent defensive lineman Andre Fluellen predict every game on the Minnesota Vikings' 2019 schedule.
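Accuracy alone does not show which category gets confused with which. A small follow-up sketch (assuming classifier and testset from the script above are still in scope) builds a confusion matrix over the held-out half with nltk.ConfusionMatrix:

import nltk

# Sketch: confusion matrix for the held-out test set.
reference = [category for (features,category) in testset]
predicted = [classifier.classify(features) for (features,category) in testset]
print(nltk.ConfusionMatrix(reference,predicted))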
  • POS tagging using context
import nltk

# Example sentences in which 'address' and 'laugh' occur with more than one part of speech
sentences = [
    "What is your address when you're in Beijing?",
    "the president's address on the state of economy.",
    "He addressed his remarks to the lawyers in the audience.",
    "In order to address an assembly, we should be ready",
    "He laughed inwardly at the scene.",
    "After all the advance publicity, the prizefight turned out to be a laugh.",
    "We can learn to laugh a little at even our most serious foibles.",
]

# Put the (word, POS) pairs of each sentence into a list, producing a list of tagged sentences
def getSentenceWords():
    sentwords = []
    for sentence in sentences:
        words = nltk.pos_tag(nltk.word_tokenize(sentence))
        sentwords.append(words)
    return sentwords

# POS tagging without context
def noContextTagger():
    # build a baseline tagger
    tagger = nltk.UnigramTagger(getSentenceWords())
    print(tagger.tag('the little remarks towards assembly are laughable'.split()))

# POS tagging with context
def withContextTagger():
    # return a dict of four feature:value pairs
    def wordFeatures(words,wordPosInSentence):
        # the last 1, 2, and 3 characters of the word are features
        endFeatures = {
            'last(1)':words[wordPosInSentence][-1],
            'last(2)':words[wordPosInSentence][-2:],
            'last(3)':words[wordPosInSentence][-3:],
        }
        # use the preceding word as an additional feature (a placeholder is used near the start of the sentence)
        if wordPosInSentence > 1:
            endFeatures['prev'] = words[wordPosInSentence - 1]
        else:
            endFeatures['prev'] = '|NONE|'
        return endFeatures
    allsentences = getSentenceWords() # list of tagged sentences
    featureddata = [] # will hold (feature dict, tag) tuples
    for sentence in allsentences:
        untaggedSentence = nltk.tag.untag(sentence)
        featuredsentence = [(wordFeatures(untaggedSentence,index),tag) for index,(word,tag) in enumerate(sentence)]
        featureddata.extend(featuredsentence)
    breakup = int(len(featureddata) * 0.5)
    traindata = featureddata[breakup:]
    testdata = featureddata[:breakup]
    classifier = nltk.NaiveBayesClassifier.train(traindata)
    print("Classifier accuracy : {}".format(nltk.classify.accuracy(classifier,testdata)))

if __name__ == "__main__":
    noContextTagger()
    withContextTagger()

Output:

[('the', 'DT'), ('little', 'JJ'), ('remarks', 'NNS'), ('towards', None), ('assembly', 'NN'), ('are', None), ('laughable', None)]
Classifier accuracy : 0.38461538461538464
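The None tags in the Unigram output above come from words the tiny training set never contained. A common NLTK remedy is a backoff chain; the following minimal sketch (the choice of 'NN' as the fallback tag is mine) backs the UnigramTagger off to a DefaultTagger so unseen words still get a guess:

import nltk

# Sketch: a UnigramTagger that falls back to a DefaultTagger for unseen words.
# Assumes getSentenceWords() from the script above is in scope.
def backoffTagger():
    fallback = nltk.DefaultTagger('NN') # tag every unknown word as NN
    tagger = nltk.UnigramTagger(getSentenceWords(), backoff=fallback)
    print(tagger.tag('the little remarks towards assembly are laughable'.split()))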

