Text Processing Tools - TextBlob

This article is a repost. 2019-03-12 11:09, Python


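Introduction to TextBlob

TextBlob is an open-source text processing library written in Python. It can be used for many natural language processing tasks, such as part-of-speech tagging, noun phrase extraction, sentiment analysis, and text translation. You can read about all of TextBlob's features in the official documentation.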

Features

Noun phrase extraction
Part-of-speech tagging
Sentiment analysis
Classification (Naive Bayes, Decision Tree)
Language translation and detection powered by Google Translate
Tokenization (splitting text into words and sentences)
Word and phrase frequencies
Parsing
n-grams
Word inflection (pluralization and singularization) and lemmatization
Spelling correction
Add new models or languages through extensions
WordNet integration

Quickstart:

Create a TextBlob

First, import the TextBlob class.

>>> from textblob import TextBlob

Let’s create our first TextBlob.

>>> wiki = TextBlob("Python is a high-level, general-purpose programming language.")

Part-of-speech Tagging

Part-of-speech tags can be accessed through the tags property.

>>> wiki.tags
[('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('high-level', 'JJ'), ('general-purpose', 'JJ'), ('programming', 'NN'), ('language', 'NN')]
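
Because tags is an ordinary list of (word, tag) pairs, it can be filtered with plain Python. The sketch below hard-codes the output shown above instead of calling TextBlob, and the helper name words_with_tag_prefix is made up for illustration.

```python
# Tagged output copied from the example above; in practice this would be wiki.tags.
tags = [('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('high-level', 'JJ'),
        ('general-purpose', 'JJ'), ('programming', 'NN'), ('language', 'NN')]


def words_with_tag_prefix(tagged, prefix):
    """Return the words whose part-of-speech tag starts with the given prefix."""
    return [word for word, tag in tagged if tag.startswith(prefix)]


nouns = words_with_tag_prefix(tags, 'NN')       # matches NN, NNS, NNP, NNPS
adjectives = words_with_tag_prefix(tags, 'JJ')  # matches JJ, JJR, JJS
print(nouns)
print(adjectives)
```

Matching on a tag prefix keeps singular, plural, and proper nouns together without listing every Penn Treebank tag.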

Noun Phrase Extraction

Similarly, noun phrases are accessed through the noun_phrases property. Note that only noun phrases are extracted.

>>> wiki.noun_phrases
WordList(['python'])

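Sentiment Analysis

The sentiment property returns a namedtuple of the form Sentiment(polarity, subjectivity).

The polarity score is a float within the range [-1.0, 1.0], where -1.0 is most negative and 1.0 is most positive.

The subjectivity is a float within the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective.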

>>> testimonial = TextBlob("Textblob is amazingly simple to use. What great fun!")
>>> testimonial.sentiment
Sentiment(polarity=0.39166666666666666, subjectivity=0.4357142857142857)
>>> testimonial.sentiment.polarity
0.39166666666666666
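
A polarity score is often turned into a coarse label. The following is a minimal sketch of one way to do that. It rebuilds the Sentiment tuple locally so it runs without TextBlob. The function polarity_label and its 0.1 threshold are illustrative assumptions, not part of the TextBlob API.

```python
from collections import namedtuple

# Local stand-in for the namedtuple that TextBlob's sentiment property returns.
Sentiment = namedtuple("Sentiment", ["polarity", "subjectivity"])


def polarity_label(sentiment, threshold=0.1):
    """Map a polarity in [-1.0, 1.0] to a label; the threshold is an arbitrary choice."""
    if sentiment.polarity > threshold:
        return "positive"
    if sentiment.polarity < -threshold:
        return "negative"
    return "neutral"


# Values copied from the testimonial example above.
testimonial_sentiment = Sentiment(polarity=0.39166666666666666,
                                  subjectivity=0.4357142857142857)
print(polarity_label(testimonial_sentiment))
```

The width of the neutral band around zero is a design choice and would normally be tuned on labeled data.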

Tokenization

You can break TextBlobs into words or sentences.

>>> zen = TextBlob("Beautiful is better than ugly. "
... "Explicit is better than implicit. "
... "Simple is better than complex.")
>>> zen.words
WordList(['Beautiful', 'is', 'better', 'than', 'ugly', 'Explicit', 'is', 'better', 'than', 'implicit', 'Simple', 'is', 'better', 'than', 'complex'])
>>> zen.sentences
[Sentence("Beautiful is better than ugly."), Sentence("Explicit is better than implicit."), Sentence("Simple is better than complex.")]

Sentence objects have the same properties and methods as TextBlobs.

>>> for sentence in zen.sentences:
...     print(sentence.sentiment)

Words Inflection and Lemmatization

Each word in TextBlob.words or Sentence.words is a Word object (a subclass of str in Python 3, unicode in Python 2) with useful methods, e.g. for word inflection.

singularize() converts a noun to its singular form and pluralize() converts it to its plural form. Both handle irregular nouns.

>>> sentence = TextBlob('Use 4 spaces per indentation level.')
>>> sentence.words
WordList(['Use', '4', 'spaces', 'per', 'indentation', 'level'])
>>> sentence.words[2].singularize()
'space'
>>> sentence.words[-1].pluralize()
'levels'

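The lemmatize() method of the Word class reduces a word to its lemma: the singular form for nouns and the base form for verbs. The part of speech defaults to noun, so pass "v" explicitly to lemmatize a verb.

>>> from textblob import Word
>>> w = Word("octopi")
>>> w.lemmatize() # treated as a noun by default
'octopus'
>>> w = Word("went")
>>> w.lemmatize("v") # lemmatize as a verb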
'go'

WordNet Integration

You can access the synsets for a Word via the synsets property, or use the get_synsets method to optionally restrict the results to one part of speech.

>>> from textblob import Word
>>> from textblob.wordnet import VERB
>>> word = Word("octopus")
>>> word.synsets
[Synset('octopus.n.01'), Synset('octopus.n.02')]
>>> Word("hack").get_synsets(pos=VERB) # verb synsets only; with no argument this matches the synsets property
[Synset('chop.v.05'), Synset('hack.v.02'), Synset('hack.v.03'), Synset('hack.v.04'), Synset('hack.v.05'), Synset('hack.v.06'), Synset('hack.v.07'), Synset('hack.v.08')]

You can access the definitions for each synset via the definitions property or the define() method, which can also take an optional part-of-speech argument.

>>> Word("octopus").definitions # definitions of the word "octopus"
['tentacles of octopus prepared as food', 'bottom-living cephalopod having a soft oval body with eight long tentacles']

You can also create synsets directly.

>>> from textblob.wordnet import Synset
>>> octopus = Synset('octopus.n.02')
>>> shrimp = Synset('shrimp.n.03')
>>> octopus.path_similarity(shrimp)
0.1111111111111111
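
NLTK's path similarity is defined as 1 / (1 + d), where d is the length of the shortest path between the two synsets in the hypernym hierarchy, so the value 0.111... above corresponds to a path of 8 edges. The sketch below reproduces the formula on a tiny hand-made taxonomy. The PARENT table is hypothetical and does not mirror WordNet's real structure.

```python
# Hypothetical child -> parent (hypernym) links; NOT WordNet's actual hierarchy.
PARENT = {
    "octopus": "cephalopod",
    "cephalopod": "mollusk",
    "mollusk": "invertebrate",
    "shrimp": "crustacean",
    "crustacean": "arthropod",
    "arthropod": "invertebrate",
    "invertebrate": "animal",
}


def hypernym_path(node, parent):
    """Return [node, parent, grandparent, ..., root]."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def path_similarity(a, b, parent):
    """Return 1 / (1 + shortest path length), or None if the nodes share no ancestor."""
    depth_from_a = {n: d for d, n in enumerate(hypernym_path(a, parent))}
    for depth_from_b, n in enumerate(hypernym_path(b, parent)):
        if n in depth_from_a:  # first hit is the lowest common ancestor in a tree
            return 1.0 / (1 + depth_from_a[n] + depth_from_b)
    return None


print(path_similarity("octopus", "shrimp", PARENT))
```

In this toy tree the two words meet at "invertebrate" after three steps each, so the path has 6 edges and the score is 1/7. A word compared with itself scores 1.0.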

For more information on the WordNet API, see the NLTK documentation on the Wordnet Interface.

WordLists

A WordList is just a Python list with additional methods. The words property returns a WordList containing the tokenized words of the text.

>>> animals = TextBlob("cat dog octopus")
>>> animals.words
WordList(['cat', 'dog', 'octopus'])
>>> animals.words.pluralize()
WordList(['cats', 'dogs', 'octopodes'])

Spelling Correction

Use the correct() method to attempt spelling correction.

>>> b = TextBlob("I havv goood speling!")
>>> print(b.correct())
I have good spelling!

Word objects have a Word.spellcheck() method that returns a list of (word, confidence) tuples with spelling suggestions.

>>> from textblob import Word
>>> w = Word('falibility')
>>> w.spellcheck()
[('fallibility', 1.0)]

Spelling correction is based on Peter Norvig’s “How to Write a Spelling Corrector”[1] as implemented in the pattern library. It is about 70% accurate [2].
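
To give a feel for the approach, here is a simplified sketch in the spirit of Norvig's corrector. It only considers candidates within one edit of the input (Norvig's original also tries two edits) and picks the most frequent known word. The WORD_COUNTS table is a made-up toy corpus, and this is not the code TextBlob or pattern actually runs.

```python
import string

# Toy word frequencies, invented for this example; a real corrector
# would count words in a large text corpus.
WORD_COUNTS = {"i": 50, "have": 40, "good": 30, "spelling": 10, "spewing": 1}


def edits1(word):
    """Return every string that is one delete, transpose, replace, or insert away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    """Keep only the words that appear in the frequency table."""
    return {w for w in words if w in WORD_COUNTS}


def correct(word):
    """Prefer the word itself if known, else the most frequent known one-edit candidate."""
    candidates = known([word]) or known(edits1(word)) or {word}
    return max(candidates, key=lambda w: WORD_COUNTS.get(w, 0))


print(" ".join(correct(w) for w in ["i", "havv", "goood", "speling"]))
```

For "speling" both "spelling" and "spewing" are one edit away, and the frequency table breaks the tie in favor of "spelling". Unknown words with no close match are returned unchanged.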

Get Word and Noun Phrase Frequencies

There are two ways to get the frequency of a word or noun phrase in a TextBlob.

The first is through the word_counts dictionary.

>>> monty = TextBlob("We are no longer the Knights who say Ni. "
... "We are now the Knights who say Ekki ekki ekki PTANG.")
>>> monty.word_counts['ekki']
3

If you access the frequencies this way, the search will not be case sensitive, and words that are not found will have a frequency of 0.

The second way is to use the count() method.

>>> monty.words.count('ekki') # word frequency
3

You can specify whether or not the search should be case-sensitive (default is False).

>>> monty.words.count('ekki', case_sensitive=True) # case-sensitive search; the default is case-insensitive
2

Each of these methods can also be used with noun phrases.

>>> wiki.noun_phrases.count('python') # noun phrase frequency
1
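
The counting behavior described above can be approximated with the standard library alone. This is a rough sketch, assuming a simple regex tokenizer rather than TextBlob's own. The count helper is a hypothetical function written to mirror the case_sensitive flag.

```python
import re
from collections import Counter

text = ("We are no longer the Knights who say Ni. "
        "We are now the Knights who say Ekki ekki ekki PTANG.")

# Naive tokenization: runs of word characters, punctuation dropped.
words = re.findall(r"\w+", text)

# Lowercased counts, so lookups are case-insensitive; missing words give 0.
word_counts = Counter(w.lower() for w in words)


def count(words, target, case_sensitive=False):
    """Count occurrences of target in words, optionally matching case exactly."""
    if case_sensitive:
        return sum(1 for w in words if w == target)
    return sum(1 for w in words if w.lower() == target.lower())


print(word_counts["ekki"], word_counts["shrubbery"])
print(count(words, "ekki"), count(words, "ekki", case_sensitive=True))
```

A Counter returns 0 for keys it has never seen, which matches the "words that are not found will have a frequency of 0" behavior.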

Translation and Language Detection

New in version 0.5.0.

TextBlobs can be translated between languages.

>>> en_blob = TextBlob(u'Simple is better than complex.')
>>> en_blob.translate(to='es')
TextBlob("Simple es mejor que complejo.")

If no source language is specified, TextBlob will attempt to detect the language. translate() raises TranslatorError if the TextBlob cannot be translated into the requested language, or NotTranslated if the translated result is the same as the input string. You can specify the source language explicitly, like so.

>>> chinese_blob = TextBlob(u"美麗優於丑陋")
>>> chinese_blob.translate(from_lang="zh-CN", to='en')
TextBlob("Beautiful is better than ugly")

You can also attempt to detect a TextBlob’s language using TextBlob.detect_language().

>>> b = TextBlob(u"بسيط هو أفضل من مجمع")
>>> b.detect_language()
'ar'

As a reference, language codes can be found here.

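Language translation and detection is powered by the Google Translate API. Note that translate() and detect_language() were deprecated in TextBlob 0.16.0, so the examples above may not work with recent releases.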

Parsing

Use the parse() method to parse the text.

>>> b = TextBlob("And now for something completely different.")
>>> print(b.parse())
And/CC/O/O now/RB/B-ADVP/O for/IN/B-PP/B-PNP something/NN/B-NP/I-PNP completely/RB/B-ADJP/O different/JJ/I-ADJP/O ././O/O

By default, TextBlob uses pattern’s parser [3].
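
The parse output is a single string of slash-separated fields per token. The sketch below splits the string shown above into named fields. The field names (word, pos, chunk, pnp) are an assumption based on pattern's default word/part-of-speech/chunk/prepositional-noun-phrase layout, and the Token type and split_parse helper are invented for this example.

```python
from collections import namedtuple

# Assumed field layout: word / POS tag / chunk tag / PNP tag.
Token = namedtuple("Token", ["word", "pos", "chunk", "pnp"])

# Output string copied from the parse() example above.
parsed = ("And/CC/O/O now/RB/B-ADVP/O for/IN/B-PP/B-PNP something/NN/B-NP/I-PNP "
          "completely/RB/B-ADJP/O different/JJ/I-ADJP/O ././O/O")


def split_parse(parsed):
    """Split a parse string into Token tuples, one per whitespace-separated item."""
    # Split from the right so a slash inside the word itself would stay in the word.
    return [Token(*item.rsplit("/", 3)) for item in parsed.split()]


tokens = split_parse(parsed)
for token in tokens:
    print(token)

# Words that sit inside a noun-phrase chunk (B-NP or I-NP).
print([t.word for t in tokens if t.chunk.endswith("-NP")])
```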

TextBlobs Are Like Python Strings!

You can use Python’s substring syntax.

>>> zen[0:19]
TextBlob("Beautiful is better")

You can use common string methods.

>>> zen.upper()
TextBlob("BEAUTIFUL IS BETTER THAN UGLY. EXPLICIT IS BETTER THAN IMPLICIT. SIMPLE IS BETTER THAN COMPLEX.")
>>> zen.find("Simple")
65

You can make comparisons between TextBlobs and strings.

>>> apple_blob = TextBlob('apples')
>>> banana_blob = TextBlob('bananas')
>>> apple_blob < banana_blob
True
>>> apple_blob == 'apples'
True

You can concatenate and interpolate TextBlobs and strings.

>>> apple_blob + ' and ' + banana_blob
TextBlob("apples and bananas")
>>> "{0} and {1}".format(apple_blob, banana_blob)
'apples and bananas'

n-grams

The TextBlob.ngrams() method returns a list of WordList objects, each containing n successive words of the text.

>>> blob = TextBlob("Now is better than never.")
>>> blob.ngrams(n=3)
[WordList(['Now', 'is', 'better']), WordList(['is', 'better', 'than']), WordList(['better', 'than', 'never'])]
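
The same sliding-window idea can be written in a few lines of plain Python. This is a minimal sketch that assumes a naive regex tokenizer and returns plain lists instead of WordList objects. It is not TextBlob's implementation.

```python
import re


def ngrams(text, n=3):
    """Return every run of n consecutive words in text as a list of lists."""
    words = re.findall(r"\w+", text)  # naive tokenization, punctuation dropped
    # A text with k words has k - n + 1 windows; range() is empty when k < n.
    return [words[i:i + n] for i in range(len(words) - n + 1)]


for gram in ngrams("Now is better than never.", n=3):
    print(gram)
```

A five-word sentence yields three trigrams, and a text shorter than n yields an empty list.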

Get Start and End Indices of Sentences

Use sentence.start and sentence.end to get the indices where a sentence starts and ends within a TextBlob.

>>> for s in zen.sentences:
...     print(s)
...     print("---- Starts at index {}, Ends at index {}".format(s.start, s.end))
Beautiful is better than ugly.
---- Starts at index 0, Ends at index 30
Explicit is better than implicit.
---- Starts at index 31, Ends at index 64
Simple is better than complex.
---- Starts at index 65, Ends at index 95
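
The indices are character offsets into the original text, with the end index exclusive, so text[start:end] gives back the sentence. The sketch below recovers the same offsets with a regular expression. The splitter is deliberately naive (it would break on abbreviations such as "Dr.") and is not how TextBlob segments sentences. The name sentence_spans is made up.

```python
import re

text = ("Beautiful is better than ugly. "
        "Explicit is better than implicit. "
        "Simple is better than complex.")


def sentence_spans(text):
    """Return (sentence, start, end) for each run of text ending in . ! or ?"""
    # \S skips the whitespace between sentences so each span starts on a character.
    pattern = r"\S[^.!?]*[.!?]+"
    return [(m.group(), m.start(), m.end()) for m in re.finditer(pattern, text)]


for sentence, start, end in sentence_spans(text):
    print(sentence)
    print("---- Starts at index {}, Ends at index {}".format(start, end))
```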

Documentation

TextBlob is a Python library for processing textual data. It provides a simple API for diving into common (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.


from textblob import TextBlob
text = '''
The titular threat of The Blob has always struck me as the ultimate movie
monster: an insatiably hungry, amoeba-like mass able to penetrate
virtually any safeguard, capable of--as a doomed doctor chillingly
describes it--"assimilating flesh on contact.
Snide comparisons to gelatin be damned, it's a concept with the most
devastating of potential consequences, not unlike the grey goo scenario
proposed by technological theorists fearful of
artificial intelligence run rampant.
'''
blob = TextBlob(text)
blob.tags # [('The', 'DT'), ('titular', 'JJ'),
# ('threat', 'NN'), ('of', 'IN'), ...]
blob.noun_phrases # WordList(['titular threat', 'blob',
# 'ultimate movie monster',
# 'amoeba-like mass', ...])
for sentence in blob.sentences:
    print(sentence.sentiment.polarity)
# 0.060
# -0.341
blob.translate(to="es") # 'La amenaza titular de The Blob...

TextBlob stands on the giant shoulders of NLTK and pattern, and plays nicely with both.

