Text Tokenization
Earlier we discussed text structure, components, and representation. Specifically, a token is the smallest independent unit of text that carries its own syntactic and semantic meaning. A piece of text or a text document has several components, including sentences that can be further broken down into clauses, phrases, and words. The most popular text tokenization techniques are sentence tokenization and word tokenization, which split a text corpus into sentences and each sentence into words. Text tokenization can therefore be defined as the process of breaking or splitting text data into smaller, meaningful components, i.e., tokens.
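As a quick preview (an illustrative sketch, not from the original text), the snippet below uses NLTK to split a small piece of text first into sentences and then into words; it assumes NLTK and its punkt model are installed:

import nltk
# nltk.download('punkt')  # required once, on the first run

preview_text = ("Python is a really powerful programming language! "
                "It also has a clean, readable syntax.")
sentences = nltk.sent_tokenize(preview_text)              # first-level tokens: sentences
words = [nltk.word_tokenize(sent) for sent in sentences]  # second-level tokens: words
print(sentences)
print(words)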
Sentence Tokenization
Sentence tokenization is the process of splitting a text corpus into sentences, which form the first level of tokens in the corpus. This process is also known as sentence segmentation, since we attempt to split the text into meaningful sentences. Any text corpus is a collection of text in which each paragraph contains multiple sentences.
There are several techniques for sentence tokenization. The basic one is to look for specific delimiters between sentences, such as a period ( . ), a newline ( \n ), or a semicolon ( ; ) (a naive delimiter-based sketch follows the tokenizer list below). We will use the NLTK framework, which provides various interfaces for sentence tokenization, and focus mainly on the following sentence tokenizers:
- sent_tokenize
- PunktSentenceTokenizer
- RegexpTokenizer
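Before turning to these tokenizers, here is a minimal sketch of the naive delimiter-based approach mentioned above, splitting on periods, semicolons, and newlines with the standard re module (an illustrative baseline only; it will break on abbreviations and decimal numbers):

import re

raw_text = "This is one sentence. Here is another; and a third one.\nA fourth on a new line."
# Split on periods, semicolons, or newlines and drop empty fragments.
naive_sentences = [piece.strip()
                   for piece in re.split(r'[.;\n]', raw_text)
                   if piece.strip()]
print(naive_sentences)
# ['This is one sentence', 'Here is another', 'and a third one', 'A fourth on a new line']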
Before splitting text into sentences, we need some text to test the system on. Below we load some sample text as well as part of the Gutenberg corpus available through NLTK. The necessary dependencies can be loaded with the following snippet:
import nltk
from nltk.corpus import gutenberg
from pprint import pprint
Note: if this is your first run, you also need to execute:

import nltk
nltk.download('gutenberg')
This will download the required list of books. Once the download succeeds, run the following code to inspect it:
In [7]: nltk.corpus.gutenberg.fileids()
Out[7]:
['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']
If the download fails with an error, it indicates a network problem; use a machine that can reach the external NLTK data servers.
alice = gutenberg.raw(fileids='carroll-alice.txt')
sample_text = 'We will discuss briefly about the basic syntax, structure and design philosophies. There is a defined hierarchical syntax for Python code which you should remember when writing code! Python is a really powerful programming language!'
We can check the length of the Alice in Wonderland corpus and its first few lines with the following code:
In [12]: print(len(alice))
144395

In [13]: print(alice[0:100])
[Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I. Down the Rabbit-Hole

Alice was
The nltk.sent_tokenize function is NLTK's default and recommended sentence tokenizer. Internally it uses an instance of the PunktSentenceTokenizer class. However, it is not just an ordinary object or instance: it has been pretrained on models for several languages and works well on many languages besides English.
The following snippet shows the basic use of this function on our sample texts:
Note: on the first run you also need to execute:

nltk.download('punkt')
default_st = nltk.sent_tokenize
alice_sentences = default_st(text=alice)
sample_sentences = default_st(text=sample_text)

print('Total sentences in sample_text:', len(sample_sentences))
print('Sample text sentences :-')
pprint(sample_sentences)
print('\nTotal sentences in alice:', len(alice_sentences))
pprint(alice_sentences[0:5])
Running the snippet above produces the following output, which gives the total number of sentences and what those sentences look like in the text corpus:
Total sentences in sample_text: 3
Sample text sentences :-
['We will discuss briefly about the basic syntax, structure and design '
 'philosophies.',
 'There is a defined hierarchical syntax for Python code which you should '
 'remember when writing code!',
 'Python is a really powerful programming language!']

Total sentences in alice: 1625
["[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I.",
 'Down the Rabbit-Hole\n'
 '\n'
 'Alice was beginning to get very tired of sitting by her sister on the\n'
 'bank, and of having nothing to do: once or twice she had peeped into the\n'
 'book her sister was reading, but it had no pictures or conversations in\n'
 "it, 'and what is the use of a book,' thought Alice 'without pictures or\n"
 "conversation?'",
 'So she was considering in her own mind (as well as she could, for the\n'
 'hot day made her feel very sleepy and stupid), whether the pleasure\n'
 'of making a daisy-chain would be worth the trouble of getting up and\n'
 'picking the daisies, when suddenly a White Rabbit with pink eyes ran\n'
 'close by her.',
 'There was nothing so VERY remarkable in that; nor did Alice think it so\n'
 "VERY much out of the way to hear the Rabbit say to itself, 'Oh dear!",
 'Oh dear!']
You should now be able to see that the sentence tokenizer is actually quite intelligent: it does not rely on periods alone to split sentences, but also takes other punctuation and word capitalization into account.
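For instance, a naive split on '. ' breaks abbreviations apart, while the pretrained Punkt model usually keeps them inside the same sentence (an illustrative snippet with hypothetical sample text; the exact splits depend on the trained model):

abbrev_text = "Mr. Brown works at A.B.C. Corp. He is rarely late."
print(abbrev_text.split('. '))           # naive split: breaks right after 'Mr.'
pprint(nltk.sent_tokenize(abbrev_text))  # Punkt keeps most abbreviations intact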
We can also tokenize text in other languages into sentences. If we are working with German text, we can either use sent_tokenize, which is already trained, or load a pretrained tokenization model for German into a PunktSentenceTokenizer instance and perform the same operation. The following snippets show sentence tokenization for German.
First, load the German text corpus and inspect it:
Note: on the first run you also need to execute:

nltk.download('europarl_raw')
In [35]: from nltk.corpus import europarl_raw
In [36]: german_text = europarl_raw.german.raw(fileids='ep-00-01-17.de')
In [37]: print(len(german_text))
157171
In [38]: print(german_text[0:100])
 Wiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sit
Then, split the text corpus into sentences using both the default sent_tokenize tokenizer and a pretrained German tokenizer loaded from the NLTK resources:
In [40]: german_sentences_def = default_st(text=german_text, language='german')
In [41]: german_tokenizer = nltk.data.load(resource_url='tokenizers/punkt/german.pickle')
In [42]: german_sentences = german_tokenizer.tokenize(german_text)
In [43]: print(type(german_tokenizer))
<class 'nltk.tokenize.punkt.PunktSentenceTokenizer'>
This shows that german_tokenizer is an instance of PunktSentenceTokenizer dedicated to handling German.
Next, check whether the sentences obtained from the default tokenizer are the same as those obtained from the pretrained tokenizer; ideally this should be True. After that, print a few of the tokenized sample sentences:
In [45]: print(german_sentences_def == german_sentences)
True
In [46]: for sent in german_sentences[0:5]:
   ....:     print(sent)
   ....:
 Wiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen , wünsche Ihnen nochmals alles Gute zum Jahreswechsel und hoffe , daß Sie schöne Ferien hatten .
Wie Sie feststellen konnten , ist der gefürchtete " Millenium-Bug " nicht eingetreten .
Doch sind Bürger einiger unserer Mitgliedstaaten Opfer von schrecklichen Naturkatastrophen geworden .
Im Parlament besteht der Wunsch nach einer Aussprache im Verlauf dieser Sitzungsperiode in den nächsten Tagen .
Heute möchte ich Sie bitten - das ist auch der Wunsch einiger Kolleginnen und Kollegen - , allen Opfern der Stürme , insbesondere in den verschiedenen Ländern der Europäischen Union , in einer Schweigeminute zu gedenken .
The results confirm our earlier assumption: sentences in languages other than English can be tokenized in either of these two ways. Sentence tokenization can also be performed conveniently by instantiating the default PunktSentenceTokenizer class directly, as shown below:
In [47]: punkt_st = nltk.tokenize.PunktSentenceTokenizer()
In [48]: sample_sentences = punkt_st.tokenize(sample_text)
In [49]: pprint(sample_sentences)
['We will discuss briefly about the basic syntax, structure and design '
 'philosophies.',
 'There is a defined hierarchical syntax for Python code which you should '
 'remember when writing code!',
 'Python is a really powerful programming language!']
As you can see, we get the output we expected. The last sentence tokenization approach covered here uses an instance of the RegexpTokenizer class, which splits text into sentences based on a regular-expression pattern.
The following code shows how to use a regular expression to separate sentences:
In [50]: SENTENCE_TOKENS_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'
In [51]: regex_st = nltk.tokenize.RegexpTokenizer(
   ....:     pattern=SENTENCE_TOKENS_PATTERN,
   ....:     gaps=True)
In [52]: sample_sentences = regex_st.tokenize(sample_text)
In [53]: pprint(sample_sentences)
['We will discuss briefly about the basic syntax, structure and design '
 'philosophies.',
 'There is a defined hierarchical syntax for Python code which you should '
 'remember when writing code!',
 'Python is a really powerful programming language!']
The output above shows that we obtain the same tokenization results as with the other tokenizers.
Word Tokenization
Word tokenization is the process of breaking or splitting a sentence into its constituent words. A sentence is a collection of words, and word tokenization essentially splits a sentence into a list of words, which in turn can be used to rebuild the sentence (a short reconstruction sketch follows the list below). Word tokenization matters in many tasks, especially text cleaning and normalization, where operations such as stemming and lemmatization, which rely on stems and token information, are applied to each individual word. As with sentence tokenization, nltk provides several useful interfaces for word tokenization:
- word_tokenize
- TreebankWordTokenizer
- RegexpTokenizer
- Tokenizers inherited from RegexpTokenizer
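On the claim that the word list can rebuild the sentence: a plain join is only approximate, while TreebankWordDetokenizer, available in newer NLTK releases, reattaches contractions and punctuation more cleanly (a minimal sketch for illustration):

import nltk
from nltk.tokenize.treebank import TreebankWordDetokenizer

tokens = nltk.word_tokenize("The brown fox wasn't that quick and he couldn't win the race")
print(' '.join(tokens))                              # approximate: "was n't" stays split
print(TreebankWordDetokenizer().detokenize(tokens))  # much closer to the original sentence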
We will use the example sentence "The brown fox wasn't that quick and he couldn't win the race" as input to the various tokenizers. The nltk.word_tokenize function is NLTK's default and recommended word tokenizer. This tokenizer is actually an instance or object of the TreebankWordTokenizer class and acts as a wrapper around that core class. The following code illustrates its usage:
In [9]: sentence = "The brown fox wasn't that quick and he couldn't win the race"
In [10]: default_wt = nltk.word_tokenize
In [11]: words = default_wt(sentence)
In [12]: print(words)
['The', 'brown', 'fox', 'was', "n't", 'that', 'quick', 'and', 'he', 'could', "n't", 'win', 'the', 'race']
TreebankWordTokenizer is based on the Penn Treebank and uses various regular expressions to tokenize text. A major assumption here, of course, is that sentence tokenization has already been performed. The original tokenizer used by the Penn Treebank is a sed script, which can be downloaded from https://catalog.ldc.upenn.edu/ldc99t42 to get an idea of the patterns it uses to split sentences into words. Some of the main features of this tokenizer include:
- It splits off and separates periods that appear at the end of a sentence.
- It splits off and separates commas and single quotes that are followed by whitespace.
- It separates most punctuation characters into independent tokens.
- It splits standard contractions, for example "don't" into "do" and "n't".
The following snippet shows the use of TreebankWordTokenizer for word tokenization:
In [13]: treebank_wt = nltk.TreebankWordTokenizer()
In [14]: words = treebank_wt.tokenize(sentence)
In [15]: print(words)
['The', 'brown', 'fox', 'was', "n't", 'that', 'quick', 'and', 'he', 'could', "n't", 'win', 'the', 'race']
As expected, the output of this snippet is similar to that of word_tokenize(), since both use the same tokenization mechanism.
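The sample sentence above only exercises the contraction rule; a sentence with commas and a trailing period (a hypothetical example, continuing the session above) also shows the punctuation-splitting rules from the earlier list:

In [16]: print(treebank_wt.tokenize("The fox, quick and brown, jumped over the lazy dog."))
['The', 'fox', ',', 'quick', 'and', 'brown', ',', 'jumped', 'over', 'the', 'lazy', 'dog', '.']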
Now let's look at how to tokenize words using the regular-expression-based RegexpTokenizer class. Remember that there are two main parameters in word tokenization: the pattern parameter, used to build the tokenizer, and the gaps parameter, which, if set to True, makes the pattern match the gaps between tokens; otherwise the pattern matches the tokens themselves.
The following snippets show some examples of word tokenization with regular expressions:
In [21]: TOKEN_PATTERN = r'\w+'
In [22]: regex_wt = nltk.RegexpTokenizer(pattern=TOKEN_PATTERN, gaps=False)
In [23]: words = regex_wt.tokenize(sentence)
In [24]: print(words)
['The', 'brown', 'fox', 'wasn', 't', 'that', 'quick', 'and', 'he', 'couldn', 't', 'win', 'the', 'race']
In [25]: GAP_PATTERN = r'\s+'
In [26]: regex_wt = nltk.RegexpTokenizer(pattern=GAP_PATTERN, gaps=True)
In [27]: words = regex_wt.tokenize(sentence)
In [28]: print(words)
['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'race']
In [29]: word_indices = list(regex_wt.span_tokenize(sentence))
In [30]: print(word_indices)
[(0, 3), (4, 9), (10, 13), (14, 20), (21, 25), (26, 31), (32, 35), (36, 38), (39, 47), (48, 51), (52, 55), (56, 60)]
In [31]: print([sentence[start:end] for start, end in word_indices])
['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'race']
Besides the base RegexpTokenizer class, there are several derived classes that perform different kinds of word tokenization. WordPunctTokenizer uses the pattern r'\w+|[^\w\s]+' to split a sentence into independent alphabetic and non-alphabetic tokens. WhitespaceTokenizer splits a sentence into words based on whitespace characters such as tabs, newlines, and spaces.
The following code illustrates the usage of these derived classes:
In [32]: wordpunkt_wt = nltk.WordPunctTokenizer()
In [33]: words = wordpunkt_wt.tokenize(sentence)
In [34]: print(words)
['The', 'brown', 'fox', 'wasn', "'", 't', 'that', 'quick', 'and', 'he', 'couldn', "'", 't', 'win', 'the', 'race']
In [35]: whitespace_wt = nltk.WhitespaceTokenizer()
In [36]: words = whitespace_wt.tokenize(sentence)
In [37]: print(words)
['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'race']
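Finally, as noted above, WordPunctTokenizer is essentially a RegexpTokenizer configured with the pattern r'\w+|[^\w\s]+'; the quick check below (an illustrative sketch continuing the same session) should print True:

In [38]: regex_equiv_wt = nltk.RegexpTokenizer(pattern=r'\w+|[^\w\s]+', gaps=False)
In [39]: print(regex_equiv_wt.tokenize(sentence) == wordpunkt_wt.tokenize(sentence))
True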