1,corpus
A computer-readable collection of text or speech.
2,utterance
An utterance is the spoken counterpart of a sentence. For example: I do uh main- mainly business data processing
Here uh is a filler. (Words like uh and um are called fillers or filled pauses.) The broken-off word main- is called a fragment.
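3,types and tokens
Types are the number of distinct words in a corpus.
Tokens are the total number N of running words.
How many words does a given sentence contain? Does punctuation count as a word? Do words that share a lemma count as repeats? Take "he is a boy and you are a girl": "is" and "are" both have the lemma be, and "a" appears twice. The answer depends on how the words are counted. This sentence has 9 tokens and 8 types when wordforms are counted, or 7 types when words are grouped by lemma.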
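The counting question above can be made concrete with a small sketch. The `count_words` helper and the `TOY_LEMMAS` table are hypothetical names introduced for illustration; the lemma table covers only the words in the example sentence and is not a real lemmatizer.

```python
import re

# Toy lemma table covering only this example (an assumption, not a real lemmatizer).
TOY_LEMMAS = {"is": "be", "are": "be"}


def count_words(sentence):
    """Return (tokens, wordform types, lemma types) for a sentence."""
    # Keep alphabetic runs only, so punctuation is not counted as a word.
    words = re.findall(r"[a-z]+", sentence.lower())
    wordform_types = set(words)
    # Map each word to its lemma when the toy table knows it.
    lemma_types = {TOY_LEMMAS.get(w, w) for w in words}
    return len(words), len(wordform_types), len(lemma_types)


print(count_words("he is a boy and you are a girl"))  # (9, 8, 7)
```

Whether punctuation or repeated lemmas count as words is a choice made in the regular expression and the lemma table, which is why different tools can report different counts for the same text.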
4,morphemes
A morpheme is the smallest division of text that has meaning. Prefixes and suffixes are examples of morphemes.
Morphemes are the smallest meaningful units of a word. For example, "bounded" consists of two morphemes: the root "bound" and the suffix "ed".
5,lemma and wordform
Cat and cats share the same lemma but are different wordforms.
Lemma and stem are closely related concepts:
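6,stem
Stemming is the process of finding the stem of a word. For example, walking, walked, and walks share the same stem: walk.
A related concept is lemmatization, which determines the base form of a word; that base form is called the lemma. For example, the word operating has the stem oper, but its lemma is operate.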
Lemmatization is a more refined process than stemming and uses vocabulary and morphological techniques to find a lemma. This can result in more precise analysis in some situations.
The lemmatization process determines the lemma of a word. A lemma can be thought of as the dictionary form of a word.
(Both stemming and lemmatization look for the "root" of a word, but lemmatization is the more complex of the two because it relies on the morphological or vocabulary meaning of a token.)
Stemming and lemmatization: these processes alter words to get to their "roots". Lemmatization is similar to stemming, but it finds the lemma of a word, its form as found in a dictionary.
Stemming is frequently viewed as a more primitive technique, where the attempt to get to the "root" of a word involves cutting off parts of the beginning and/or ending of a token.
Lemmatization can be thought of as a more sophisticated approach where effort is devoted to finding the morphological or vocabulary meaning of a token.
For example, the stem of having is hav, but its lemma is have.
Similarly, was and been have different stems but the same lemma: be.
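The contrast between the two approaches can be sketched with toy code. This is a minimal illustration under stated assumptions, not a real stemmer or lemmatizer: `toy_stem` only chops a few hard-coded suffixes, and `toy_lemma` is a lookup in a hand-written table (`TOY_LEMMAS`) that covers only the words used in this section. All names are hypothetical.

```python
# Suffixes the toy stemmer knows about, checked in this order (an assumption).
SUFFIXES = ("ing", "ed", "s")

# Hand-written dictionary forms for the example words only (an assumption).
TOY_LEMMAS = {
    "walking": "walk", "walked": "walk", "walks": "walk",
    "having": "have", "was": "be", "been": "be",
}


def toy_stem(word):
    """Cut off a known suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemma(word):
    """Look the word up in the toy dictionary; fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)


for w in ["walking", "walked", "walks", "having", "was", "been"]:
    print(w, toy_stem(w), toy_lemma(w))
```

The stemmer turns having into hav because it only removes characters, while the lookup returns have. It also leaves was and been untouched, so only the dictionary-based step can map both to be.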
7,affix (prefixes and suffixes)
An affix precedes or follows the root of a word. For example, the present progressive form of a verb is built by adding ing, so ing is a suffix. Likewise, ation is the suffix of the word graduation.
8,tokenization
The process of breaking text apart is called tokenization; for example, splitting an article into individual words.
9,delimiters
The elements of the text that determine where elements should be split are called delimiters.
Splitting a sentence into words requires delimiters. Common choices are the space and the tab (\t), as well as commas and periods; the right set depends on the task at hand.
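Tokenization with a configurable set of delimiters can be sketched as follows. The `tokenize` function and its default delimiter set (space, tab, comma, period) are assumptions chosen for illustration; a real task may need a different set.

```python
import re


def tokenize(text, delimiters=" \t,."):
    """Split text on any run of the given delimiter characters."""
    # Build a character class; re.escape keeps characters such as "." literal.
    pattern = "[" + re.escape(delimiters) + "]+"
    # Drop the empty strings that appear when text starts or ends with a delimiter.
    return [token for token in re.split(pattern, text) if token]


print(tokenize("he is a boy, and you are a girl."))
# ['he', 'is', 'a', 'boy', 'and', 'you', 'are', 'a', 'girl']
```

Because the hyphen is not in the default delimiter set, a fragment such as main- stays a single token; adding "-" to `delimiters` would strip it.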
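10,categorization
This is the process of assigning some text element into one of several possible groups.
For example, the key terms of a text can be extracted and used to label what the text is about. A blog post is typically filed under one of the author's categories so that it is easy to find later.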
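A very rough sketch of categorization is to score each category by how many of its keywords appear in the text. The categories and keyword sets in `CATEGORY_KEYWORDS` are invented for illustration, and real categorizers usually rely on trained models rather than fixed keyword lists.

```python
# Hypothetical categories and keywords, chosen only for this example.
CATEGORY_KEYWORDS = {
    "nlp": {"corpus", "token", "lemma", "stemming"},
    "databases": {"sql", "index", "transaction", "query"},
}


def categorize(text):
    """Return the category with the most keyword hits, or None if none match."""
    words = set(text.lower().split())
    # Score each category by the size of its keyword overlap with the text.
    scores = {name: len(words & keywords) for name, keywords in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(categorize("stemming reduces each token to a stem"))  # nlp
```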
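11,stopwords
Commonly used words might not be important for some NLP tasks such as general searches. These common words are called stopwords.
Some NLP tasks need these frequent, low-information words removed. For example, a list of the 100 most frequent words in an article is likely to be dominated by words such as "is", "a", and "the".
Because most texts contain stopwords, it is usually best to remove them before text classification. See also a reference article on stopwords.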
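The word-frequency example can be sketched with `collections.Counter`. The `STOPWORDS` set here is a tiny illustrative list, not a standard one, and `top_words` is a hypothetical helper.

```python
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"is", "a", "the", "and", "are"}


def top_words(text, n, stopwords=STOPWORDS):
    """Return the n most frequent words that are not stopwords."""
    words = text.lower().split()
    kept = [word for word in words if word not in stopwords]
    return Counter(kept).most_common(n)


print(top_words("the cat is a cat and the dog is a dog the cat", 2))
# [('cat', 3), ('dog', 2)]
```

Passing an empty set as `stopwords` shows the problem described above: "the" then ties with "cat" for the top spot, with three occurrences each.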
12,normalization
Normalization is a process that converts a list of words to a more uniform sequence. For example, a sentence may mix uppercase and lowercase words, which can all be converted to lowercase; abbreviations can likewise be expanded to their full forms.
Normalization operations can include the following:
converting characters to lowercase, expanding abbreviations, removing stopwords, stemming, and lemmatization.
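The first three operations can be chained into a small pipeline, sketched below. The `ABBREVIATIONS` and `STOPWORDS` tables are hypothetical and deliberately tiny; stemming and lemmatization are left out because they normally come from an NLP library.

```python
# Hypothetical lookup tables, kept tiny for illustration.
ABBREVIATIONS = {"nlp": "natural language processing"}
STOPWORDS = {"is", "a", "the"}


def normalize(words):
    """Lowercase each word, expand abbreviations, then drop stopwords."""
    result = []
    for word in words:
        word = word.lower()
        # An abbreviation may expand to several words.
        expanded = ABBREVIATIONS.get(word, word)
        for part in expanded.split():
            if part not in STOPWORDS:
                result.append(part)
    return result


print(normalize(["NLP", "is", "The", "Future"]))
# ['natural', 'language', 'processing', 'future']
```

Lowercasing comes first so that the abbreviation and stopword lookups only need lowercase keys.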
References
Natural Language Processing with Java (Chinese edition: 《JAVA自然語言處理》)