Some basic concepts in NLP


1, corpus

A corpus is a computer-readable collection of text or speech.

2, utterance

An utterance is the spoken counterpart of a sentence. Take the following one as an example: I do uh main- mainly business data processing

Words like uh and um are called fillers or filled pauses. The broken-off word main- is called a fragment.
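
To make the filler and fragment labels concrete, here is a minimal sketch that tags each token of the example utterance. The filler list and the rule "a trailing hyphen marks a fragment" are assumptions for illustration, and `classify_utterance_tokens` is a hypothetical helper name, not something from the book.

```python
# Assumed, non-exhaustive filler inventory; real transcripts need a larger list.
FILLERS = {"uh", "um"}

def classify_utterance_tokens(utterance):
    """Label each whitespace-separated token as filler, fragment, or word."""
    labels = []
    for token in utterance.split():
        lowered = token.lower()
        if lowered in FILLERS:
            labels.append((token, "filler"))
        elif lowered.endswith("-"):
            # Assumption: the transcription marks a broken-off word with a trailing hyphen.
            labels.append((token, "fragment"))
        else:
            labels.append((token, "word"))
    return labels

labels = classify_utterance_tokens("I do uh main- mainly business data processing")
for token, label in labels:
    print(token, label)
```

Whether fillers and fragments are kept or dropped depends on the task: a speech recognizer may want them, while a text summarizer usually does not.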

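3, types and tokens

Types are the distinct words in a corpus; the number of types is the size of the vocabulary.

Given a sentence, how many words does it contain? Does punctuation count as a word? Do words that share a lemma count as the same word? Take "he is a boy and you are a girl": "is" and "are" both have the lemma be, and "a" appears twice. So how many words are there? It depends on the counting convention: 9 if every running word is counted, 8 if only distinct wordforms are counted, and 7 if "is" and "are" are merged under their lemma.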

Tokens are the total number N of running words. 
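
The three counting conventions can be checked with a few lines of Python. This is only a sketch: it splits on whitespace, and the `LEMMAS` table is an assumed toy mapping that covers just the two inflected forms in the example sentence.

```python
sentence = "he is a boy and you are a girl"

# Assumed toy lemma table covering only the inflected forms in this sentence.
LEMMAS = {"is": "be", "are": "be"}

tokens = sentence.split()                              # every running word
types = set(tokens)                                    # distinct wordforms
lemma_types = {LEMMAS.get(t, t) for t in tokens}       # distinct lemmas

print(len(tokens), len(types), len(lemma_types))  # prints: 9 8 7
```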

4, morphemes

A morpheme is the smallest division of text that has meaning. Prefixes and suffixes are examples of morphemes.

Morphemes are the smallest meaningful units of a word. For example, in "bounded", "bound" is one morpheme and the suffix "ed" is another.

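5, lemma (dictionary form) and wordform (inflected form)

Cat and cats share the same lemma but are different wordforms.

Lemma and stem have similar meanings; the next item compares them.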

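6, stem

Stemming is the process of finding the word stem of a word. For example, walking, walked, and walks share the same stem: walk.

A concept related to stemming is lemmatization, which determines the base form of a word; that base form is called the lemma. For example, for the word operating, a stemmer such as Porter's yields oper, while its lemma is operate.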

Lemmatization is a more refined process than stemming and uses vocabulary and morphological techniques to find a lemma. This can result in more precise analysis in some situations.

The lemmatization process determines the lemma of a word. A lemma can be thought of as the dictionary form of a word.

Both stemming and lemmatization try to find the "root" of a word, but lemmatization is the more complex of the two because it relies on vocabulary and morphological analysis (finding the morphological or vocabulary meaning of a token).

Stemming and lemmatization both alter words to get to their "roots". Lemmatization is similar to stemming: it is the process of finding a word's lemma, its form as found in a dictionary.

Stemming is frequently viewed as a more primitive technique, where the attempt to get to the "root" of a word involves cutting off parts of the beginning and/or ending of a token. 

Lemmatization can be thought of as a more sophisticated approach where effort is devoted to finding the morphological or vocabulary meaning of a token.

For example, the stem of having is hav, but its lemma is have.

Likewise, was and been have different stems but share the same lemma: be.
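
The contrast between the two can be sketched in Python. This is a toy illustration under stated assumptions, not the Porter algorithm or any library's lemmatizer: `crude_stem` only strips a short assumed suffix list, and `lemmatize` looks words up in a small hand-written table (`LEMMA_TABLE`), where a real lemmatizer would use a full vocabulary and morphological analysis.

```python
# Assumed suffix list, tried in order; this is NOT the Porter algorithm.
STEM_SUFFIXES = ("ing", "ed", "s")

# Assumed hand-written lemma table; a real lemmatizer uses a full vocabulary.
LEMMA_TABLE = {
    "having": "have",
    "was": "be",
    "been": "be",
    "walking": "walk",
    "walked": "walk",
    "walks": "walk",
}

def crude_stem(word):
    """Chop off the first matching suffix, keeping at least three letters."""
    for suffix in STEM_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def lemmatize(word):
    """Look the word up in the table; unknown words are returned unchanged."""
    return LEMMA_TABLE.get(word, word)

for word in ("having", "was", "been", "walking"):
    print(word, crude_stem(word), lemmatize(word))
```

The stemmer never needs to know that hav is not a word, which is why it is cheap; the lemmatizer gets was and been right only because the table encodes that knowledge.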

7, affix (prefixes and suffixes)

For example, the present progressive form of a verb adds ing, so ing is a suffix.

An affix precedes or follows the root of a word. For example, ation is the suffix of the word graduation.
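
As a rough illustration of prefixes and suffixes, the sketch below peels known affixes off a word. The affix lists and the helper name `split_affixes` are assumptions made up for this example; purely string-based matching like this will mis-split many real words (it would treat the "re" in "reading" as a prefix), so it shows the idea rather than real morphological analysis.

```python
# Assumed, tiny affix inventories chosen for illustration only.
KNOWN_PREFIXES = ("un", "re")
KNOWN_SUFFIXES = ("ation", "ing", "ed")

def split_affixes(word):
    """Return (prefix, root, suffix); an empty string means nothing matched."""
    prefix = next((p for p in KNOWN_PREFIXES if word.startswith(p)), "")
    rest = word[len(prefix):]
    suffix = next(
        (s for s in KNOWN_SUFFIXES if rest.endswith(s) and len(rest) > len(s)),
        "",
    )
    root = rest[: len(rest) - len(suffix)]
    return prefix, root, suffix

print(split_affixes("graduation"))
print(split_affixes("unbounded"))
```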

8, tokenization

The process of breaking text apart is called tokenization; a typical case is splitting an article into individual words.
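
A minimal word tokenizer can be sketched with Python's standard `re` module. The pattern here (runs of ASCII letters, digits, or apostrophes) is an assumption that suits simple English text; other languages and tasks need different rules.

```python
import re

def tokenize(text):
    """Return word tokens: runs of ASCII letters, digits, or apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)

tokens = tokenize("The process of breaking text apart is called tokenization.")
print(tokens)
```

Because the pattern describes what a token looks like rather than what separates tokens, the trailing period is simply never matched.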

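9, delimiters

To split a sentence into individual words you need delimiters. Common ones are the space and the tab character (\t), as well as the comma and the period; which ones to use depends on the task at hand.

The elements of the text that determine where elements should be split are called delimiters.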
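
The effect of the delimiter choice can be seen in a short sketch. `split_on_delimiters` is a hypothetical helper built on the standard `re.split`; its default delimiter set (space, tab, comma, period) is just the list mentioned above, not a recommendation.

```python
import re

def split_on_delimiters(text, delimiters=" \t,."):
    """Split text on runs of any of the given delimiter characters."""
    # re.escape keeps characters such as "." or "-" literal inside the class.
    pattern = "[" + re.escape(delimiters) + "]+"
    return [piece for piece in re.split(pattern, text) if piece]

text = "he is a boy,\tand you are a girl."
print(split_on_delimiters(text))
print(split_on_delimiters(text, delimiters=" "))
```

With only the space as a delimiter, the comma, tab, and period stay glued to the neighbouring words, which is why the delimiter set has to match the task.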

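10, categorization

Categorization extracts the key terms of a text and assigns the text to a category that describes what it is about. For example, after writing a blog post you file it under one of your personal categories so it is easy to find later.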

This is the process of assigning some text element into one of the several possible groups.  
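
One very simple way to assign a text to a group is keyword overlap. The sketch below is only that: the category names and keyword sets in `CATEGORY_KEYWORDS` are hypothetical, and practical classifiers are usually trained on labeled data rather than hand-written lists.

```python
# Hypothetical categories and keyword sets, chosen only for illustration.
CATEGORY_KEYWORDS = {
    "nlp": {"corpus", "token", "lemma", "stemming"},
    "databases": {"sql", "index", "transaction", "query"},
}

def categorize(text):
    """Pick the category whose keywords overlap most with the text."""
    words = set(text.lower().split())
    scores = {
        category: len(words & keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # No keyword matched at all: do not force the text into a category.
    return best if scores[best] > 0 else "uncategorized"

print(categorize("Stemming reduces each token in a corpus"))
print(categorize("a slow sql query without an index"))
```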

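11, stopwords

Some NLP tasks need to drop frequent words that carry little meaning. For example, if you count the 100 most frequent words in an article, the list is likely to be dominated by words such as "is", "a", and "the"; these are stopwords.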

Commonly used words might not be important for some NLP tasks such as general searches. These common words are called stopwords 

Since most texts contain stopwords, it is usually best to remove them before text classification.
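
The "most frequent words" example can be sketched with `collections.Counter` from the standard library. The `STOPWORDS` set is an assumed, deliberately tiny list, and `top_words` is a hypothetical helper; real stopword lists are much longer and task-specific.

```python
from collections import Counter

# Assumed, deliberately tiny stopword list for illustration.
STOPWORDS = {"is", "a", "the", "and", "are"}

def top_words(text, n, stopwords=STOPWORDS):
    """Return the n most frequent words after dropping stopwords."""
    words = [w for w in text.lower().split() if w not in stopwords]
    return Counter(words).most_common(n)

text = "the cat is a cat and the dog is a dog and the cat sleeps"
print(top_words(text, 2))                   # stopwords removed
print(top_words(text, 2, stopwords=set()))  # stopwords kept
```

With the stopwords kept, "the" is as frequent as "cat" and pushes "dog" out of the top two, which is the effect described above.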

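12, normalization

Normalization converts a sequence of words into a uniform form. For example, if the words in a sentence mix upper and lower case, they are all converted to lower case; or if some words are abbreviations, they are all expanded to their full forms.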

Normalization is a process that converts a list of words to a more uniform sequence.

Normalization operations can include the following:

converting characters to lowercase, expanding abbreviations, removing stopwords, stemming, and lemmatization.
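
The first three operations can be chained in a short sketch. The abbreviation table and stopword set are assumptions invented for the example, and stemming or lemmatization would be an extra step at the end of the loop.

```python
# Assumed lookup tables for illustration; real ones are task-specific.
ABBREVIATIONS = {"nlp": "natural language processing"}
STOPWORDS = {"is", "a", "the"}

def normalize(words):
    """Lowercase, expand abbreviations, then drop stopwords."""
    normalized = []
    for word in words:
        lowered = word.lower()
        expanded = ABBREVIATIONS.get(lowered, lowered)
        # An expansion may be several words, so filter each part separately.
        for part in expanded.split():
            if part not in STOPWORDS:
                normalized.append(part)
    return normalized

print(normalize(["NLP", "is", "a", "Fun", "Field"]))
```

The order matters: lowercasing first lets both the abbreviation table and the stopword set use lowercase keys only.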


References

Natural Language Processing with Java

 

Original post: http://www.cnblogs.com/hapjin/p/7581335.html

