Natural Language 18: Named-entity recognition

Reposted article, 2016-11-19 11:20. Tags: python, Named-entity recognition, nltk, natural language, pickle

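Named entity category recognition

Besides predicting user intent, query logs can also be used to recognize named entities. Named-entity recognition means identifying entities with a specific meaning in text, mainly person names, place names, organization names, times, dates, monetary amounts and other proper nouns. It is an important part of making natural language processing practical, and it plays a foundational role in applications such as information extraction, syntactic parsing and machine translation. Named-entity recognition has to identify entity boundaries on the one hand and entity categories (person, place, organization or other) on the other. For Chinese systems, determining entity boundaries is mainly a matter of word segmentation. The basic way to discover named entities is usually to look first for cue phrases associated with definitions, for example "what is XX", "XX is what", "this is XX". Once query strings matching such patterns have been found, named entities can be located in the query log through frequency statistics and similar methods. The discussion here focuses on the second aspect, category recognition.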
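The pattern-plus-frequency idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the English cue patterns, the sample query log and the `min_count` threshold are hypothetical and do not come from the original text.

```python
import re
from collections import Counter

# Hypothetical cue patterns modeled on "what is XX", "XX is what", "this is XX".
DEFINITION_PATTERNS = [
    re.compile(r"^what is (?P<name>.+)$"),
    re.compile(r"^(?P<name>.+) is what$"),
    re.compile(r"^this is (?P<name>.+)$"),
]


def candidate_entities(queries, min_count=2):
    """Count the XX part of definition-style queries and keep frequent ones."""
    counts = Counter()
    for query in queries:
        normalized = query.strip().lower()
        for pattern in DEFINITION_PATTERNS:
            match = pattern.match(normalized)
            if match:
                counts[match.group("name")] += 1
                break  # count each query at most once
    return [(name, n) for name, n in counts.most_common() if n >= min_count]


# Hypothetical query log, for illustration only.
query_log = [
    "what is harry potter",
    "harry potter is what",
    "this is harry potter",
    "what is lda",
    "harry potter movie",
]
candidates = candidate_entities(query_log)
print(candidates)
```

With this toy log, only "harry potter" appears in enough definition-style queries to pass the threshold; "lda" is seen once and is dropped.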

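Query logs are used for category recognition because the set of categories of a named entity is not closed but keeps changing. The same named entity often takes on different attributes over time. Take the familiar "Harry Potter" as an example: it started as a novel, then a film of the same name was released, and later a game appeared. This happened over time, which means that in different periods these categories receive different amounts of attention in users' queries.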

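Category analysis begins by counting the queries related to a named entity, as shown in Figure 6-9 [Jiang 2010].

Figure 6-9: Query counts for a named entity

To map these queries onto the three corresponding types (book, game and movie), the WS-LDA (Weakly Supervised Latent Dirichlet Allocation) model [Guo 2009b] can be used. Training yields a probabilistic model of how likely a query containing the entity is to belong to each type. At query time, the trained model is used to judge the latent category of a query containing the entity, as shown in Figure 6-10.

Figure 6-10: LDA model for entity classification

Words that accompany entities, such as movie, picture, novel, summary and review, together with a pre-labeled sample set, serve as the training data and yield the probability distributions shown in Figure 6-11.

Figure 6-11: Training set for entity classification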

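When an entity is actually queried, the frequencies of the words that accompany it in queries containing the entity are taken as w, and the model above is used to predict the probability that the query refers to a movie, a book or a game.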
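The prediction step can be illustrated with a deliberately simplified sketch. It is not WS-LDA inference: it treats each class as a fixed word distribution and applies Bayes' rule to the context-word counts w. Every probability in the tables below is a made-up placeholder, not a value from the figures.

```python
import math

# Hypothetical p(word | class) tables; a trained model would supply these.
WORD_GIVEN_CLASS = {
    "movie": {"trailer": 0.5, "review": 0.3, "download": 0.2},
    "book": {"trailer": 0.05, "review": 0.55, "download": 0.4},
    "game": {"trailer": 0.2, "review": 0.1, "download": 0.7},
}
# Hypothetical prior p(class | entity).
CLASS_PRIOR = {"movie": 0.4, "book": 0.4, "game": 0.2}


def class_posterior(context_counts):
    """Return p(class | w) for a dict mapping context word -> frequency."""
    log_scores = {}
    for label, prior in CLASS_PRIOR.items():
        score = math.log(prior)
        for word, count in context_counts.items():
            if word not in WORD_GIVEN_CLASS[label]:
                continue  # words outside the toy vocabulary are ignored
            score += count * math.log(WORD_GIVEN_CLASS[label][word])
        log_scores[label] = score
    # Normalize in log space to avoid underflow with large counts.
    max_score = max(log_scores.values())
    unnormalized = {k: math.exp(v - max_score) for k, v in log_scores.items()}
    total = sum(unnormalized.values())
    return {k: v / total for k, v in unnormalized.items()}


posterior = class_posterior({"trailer": 3, "review": 1})
best_class = max(posterior, key=posterior.get)
print(best_class, posterior)
```

With these placeholder numbers, a context dominated by "trailer" pushes most of the probability mass to the movie class.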

Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Most research on NER systems has been structured as taking an unannotated block of text, such as this one:

Jim bought 300 shares of Acme Corp. in 2006.

And producing an annotated block of text that highlights the names of entities:

[Jim] _Person bought 300 shares of [Acme Corp.] _Organization in [2006] _Time.

In this example, a person name consisting of one token, a two-token company name and a temporal expression have been detected and classified.
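As a small illustration, the annotated form shown above can be produced from character spans. The `(start, end, label)` span representation and the `annotate` helper are assumptions made for this sketch; NER tools use a variety of output formats.

```python
def annotate(text, entities):
    """Wrap each (start, end, label) character span as "[span] _Label"."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(entities):
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append("[" + text[start:end] + "] _" + label)
        cursor = end
    pieces.append(text[cursor:])  # trailing text after the last span
    return "".join(pieces)


text = "Jim bought 300 shares of Acme Corp. in 2006."
entities = [
    (0, 3, "Person"),          # "Jim"
    (25, 35, "Organization"),  # "Acme Corp."
    (39, 43, "Time"),          # "2006"
]
annotated = annotate(text, entities)
print(annotated)
```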

State-of-the-art NER systems for English produce near-human performance. For example, the best system entering MUC-7 achieved an F-measure of 93.39%, while human annotators scored 97.60% and 96.95%.^[1]^[2]

Problem definition

In the expression named entity, the word named restricts the task to those entities for which one or many rigid designators, as defined by Kripke, stand for the referent. For instance, the automotive company created by Henry Ford in 1903 is referred to as Ford or Ford Motor Company. Rigid designators include proper names as well as terms for certain biological species and substances.^[3]

Full named-entity recognition is often broken down, conceptually and possibly also in implementations,^[4] into two distinct problems: detection of names, and classification of the names by the type of entity they refer to (e.g. person, organization, location and other^[5]). The first phase is typically simplified to a segmentation problem: names are defined to be contiguous spans of tokens, with no nesting, so that "Bank of America" is a single name, disregarding the fact that inside this name, the substring "America" is itself a name. This segmentation problem is formally similar to chunking.
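One common way to express this flat segmentation is a per-token BIO encoding (B = begins a name, I = inside a name, O = outside). The text above does not name BIO, so treat it as an assumption of this sketch. The helper below rejects nested or overlapping spans, mirroring the "no nesting" simplification.

```python
def spans_to_bio(num_tokens, spans):
    """Convert non-nested (start, end, label) token spans to BIO tags."""
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError("spans must not overlap or nest")
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags


tokens = ["Bank", "of", "America", "opened", "in", "Charlotte"]
# "Bank of America" is one ORG span; the inner name "America" is not tagged.
bio_tags = spans_to_bio(len(tokens), [(0, 3, "ORG"), (5, 6, "LOC")])
for token, tag in zip(tokens, bio_tags):
    print(token, tag)
```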

Temporal expressions and some numerical expressions (e.g., money and percentages) may also be considered as named entities in the context of the NER task. While some instances of these types are good examples of rigid designators (e.g., the year 2001), there are also many invalid ones (e.g., I take my vacations in “June”). In the first case, the year 2001 refers to the 2001st year of the Gregorian calendar. In the second case, the month June may refer to the month of an undefined year (past June, next June, June 2020, etc.). It is arguable that the named entity definition is loosened in such cases for practical reasons. The definition of the term named entity is therefore not strict and often has to be explained in the context in which it is used.^[6]

Certain hierarchies of named entity types have been proposed in the literature. The BBN categories, proposed in 2002, are used for question answering and consist of 29 types and 64 subtypes.^[7] Sekine's extended hierarchy, proposed in 2002, is made of 200 subtypes.^[8] More recently, in 2011 Ritter used a hierarchy based on common Freebase entity types in ground-breaking experiments on NER over social media text.^[9]

Formal evaluation

To evaluate the quality of an NER system's output, several measures have been defined. While accuracy on the token level is one possibility, it suffers from two problems: the vast majority of tokens in real-world text are not part of entity names as usually defined, so the baseline accuracy (always predict "not an entity") is extravagantly high, typically >90%; and mispredicting the full span of an entity name is not properly penalized (finding only a person's first name when their last name follows is scored as ½ accuracy).
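Both problems show up in a toy calculation. The 20-token sentence and its tags below are invented for illustration: two tokens form a person name and the other 18 are not entities.

```python
def token_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("tag sequences must have the same length")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Hypothetical data: "Hans Blick" followed by 18 non-entity tokens.
gold = ["PER", "PER"] + ["O"] * 18
always_not_entity = ["O"] * 20
first_name_only = ["PER", "O"] + ["O"] * 18

baseline = token_accuracy(gold, always_not_entity)  # 18 of 20 tokens correct
partial = token_accuracy(gold, first_name_only)     # 19 of 20 tokens correct
print(baseline, partial)
```

The do-nothing baseline already reaches 90% on this toy data, and the prediction that gets only half of the name right scores even higher, although it has not recovered the entity.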

In academic conferences such as CoNLL, a variant of the F1 score has been defined as follows:^[5]

Precision is the fraction of predicted entity name spans that line up exactly with spans in the gold standard evaluation data. For example, when [_Person Hans] [_Person Blick] is predicted but [_Person Hans Blick] was required, precision for the predicted name is zero. Precision is then averaged over all predicted entity names.
Recall is similarly the fraction of names in the gold standard that appear at exactly the same location in the predictions.
F1 score is the harmonic mean of these two.

It follows from the above definition that any prediction that misses a single token, includes a spurious token, or has the wrong class, is a hard error and does not contribute positively to either precision or recall.
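A minimal sketch of this exact-match scoring follows, assuming entities are represented as `(start, end, label)` token spans. It is not the official CoNLL evaluation script, and the example data is hypothetical; it reuses the Hans Blick case, where the name is wrongly split into two predictions.

```python
def exact_match_scores(gold, predicted):
    """Precision, recall and F1 where only identical spans and labels count."""
    gold_set = set(gold)
    predicted_set = set(predicted)
    true_positives = len(gold_set & predicted_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Gold: "Hans Blick" is one Person span; tokens 5-6 form an Organization.
gold = [(0, 2, "Person"), (5, 7, "Organization")]
# Prediction splits the person name in two but gets the organization right.
predicted = [(0, 1, "Person"), (1, 2, "Person"), (5, 7, "Organization")]

precision, recall, f1 = exact_match_scores(gold, predicted)
print(precision, recall, f1)
```

Only the organization counts as correct, so precision is 1/3, recall is 1/2 and F1 is 0.4; the two partial person predictions earn no credit.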

Evaluation models based on token-by-token matching have been proposed.^[10] Such models can also give credit for partially overlapping matches while fully rewarding only exact matches. They allow a finer-grained evaluation and comparison of extraction systems, because they also take into account the degree of mismatch in non-exact predictions.

Approaches

NER systems have been created that use linguistic grammar-based techniques as well as statistical models, i.e. machine learning. Hand-crafted grammar-based systems typically obtain better precision, but at the cost of lower recall and months of work by experienced computational linguists. Statistical NER systems typically require a large amount of manually annotated training data. Semisupervised approaches have been suggested to avoid part of the annotation effort.^[11]^[12]

Many different classifier types have been used to perform machine-learned NER, with conditional random fields being a typical choice.^[13]

Problem domains

Research indicates that even state-of-the-art NER systems are brittle, meaning that NER systems developed for one domain do not typically perform well on other domains.^[14] Considerable effort is involved in tuning NER systems to perform well in a new domain; this is true for both rule-based and trainable statistical systems.

Early work in NER systems in the 1990s was aimed primarily at extraction from journalistic articles. Attention then turned to processing of military dispatches and reports. Later stages of the automatic content extraction (ACE) evaluation also included several types of informal text styles, such as weblogs and transcripts of conversational telephone speech. Since about 1998, there has been a great deal of interest in entity identification in the molecular biology, bioinformatics, and medical natural language processing communities. The most common entity of interest in that domain has been names of genes and gene products. There has also been considerable interest in the recognition of chemical entities and drugs in the context of the CHEMDNER competition, with 27 teams participating in this task.^[15]

Current challenges and research

Despite the high F1 numbers reported on the MUC-7 dataset, the problem of named-entity recognition is far from being solved. The main efforts are directed at reducing annotation labor through semi-supervised learning,^[11]^[16] achieving robust performance across domains,^[17]^[18] and scaling up to fine-grained entity types.^[8]^[19] In recent years, many projects have turned to crowdsourcing, which is a promising way to obtain high-quality aggregate human judgments for supervised and semi-supervised machine learning approaches to NER.^[20] Another challenging task is devising models to deal with linguistically complex contexts such as Twitter and search queries.^[21]

A recently emerging task of identifying "important expressions" in text and cross-linking them to Wikipedia^[22]^[23]^[24] can be seen as an instance of extremely fine-grained named entity recognition, where the types are the actual Wikipedia pages describing the (potentially ambiguous) concepts. Below is an example output of a Wikification system:

<ENTITY url="http://en.wikipedia.org/wiki/Michael_I._Jordan"> Michael Jordan </ENTITY> is a professor at <ENTITY url="http://en.wikipedia.org/wiki/University_of_California,_Berkeley"> Berkeley </ENTITY>
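Output in this form can be turned into (mention, target page) pairs with a small regular expression. The sketch assumes the markup looks exactly like the sample above (double-quoted `url` attribute, no nested `ENTITY` tags); real Wikification systems may emit other formats.

```python
import re

# Matches <ENTITY url="..."> mention </ENTITY>, trimming surrounding spaces.
ENTITY_PATTERN = re.compile(r'<ENTITY url="([^"]+)">\s*(.*?)\s*</ENTITY>')


def extract_links(marked_up):
    """Return (mention, url) pairs in order of appearance."""
    return [(m.group(2), m.group(1)) for m in ENTITY_PATTERN.finditer(marked_up)]


sample = (
    '<ENTITY url="http://en.wikipedia.org/wiki/Michael_I._Jordan"> '
    "Michael Jordan </ENTITY> is a professor at "
    '<ENTITY url="http://en.wikipedia.org/wiki/University_of_California,_Berkeley"> '
    "Berkeley </ENTITY>"
)
links = extract_links(sample)
for mention, url in links:
    print(mention, "->", url)
```

The example also shows why the task is hard: the ambiguous mention "Michael Jordan" is resolved to the page for the professor rather than to any other person of that name.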

Software

GATE supports NER across many languages and domains out of the box, usable via a graphical interface and a Java API
OpenNLP includes rule-based and statistical named-entity recognition
Stanford University also has the Stanford Named Entity Recognizer