Introduction to Machine Learning - Text Data - Building an N-gram Bag-of-Words Model: 1. CountVectorizer(ngram_range) builds the N-gram bag-of-words model

Reposted article, 2019-01-26 19:37. Category: Python basics

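Function description:

1. CountVectorizer(ngram_range=(2, 2)) combines adjacent words in a string to build new bag-of-words features.

Parameter description: ngram_range=(2, 2) means that every feature is formed from 2 consecutive words.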

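A plain term-frequency bag-of-words model counts each word on its own. An N-gram model instead counts sequences of N consecutive words. Here we pass ngram_range=(2, 2) to CountVectorizer so that it counts two-word combinations (bigrams) rather than single words.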

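Take the sentence 'I like you' as an example.

With ngram_range = (2, 2), only pairs of adjacent words are used. The resulting features are 'I like' and 'like you'.

With ngram_range = (1, 3), every run of 1 to 3 consecutive words is used. The resulting features are 'I', 'like', 'you', 'I like', 'like you', 'I like you', and these become the columns of the count matrix.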
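
To make the two settings concrete, here is a minimal pure-Python sketch of the word-combination step. The helper name word_ngrams is hypothetical and this is not scikit-learn's implementation; it simply splits on whitespace. Keep in mind that CountVectorizer's default settings lowercase the text and keep only tokens of two or more characters, so with the defaults the single-letter word 'I' would be dropped from this sentence.

```python
def word_ngrams(sentence, ngram_range):
    """Return every run of consecutive words whose length lies in ngram_range (inclusive)."""
    tokens = sentence.split()
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n words across the sentence
        for start in range(len(tokens) - n + 1):
            ngrams.append(' '.join(tokens[start:start + n]))
    return ngrams


bigrams = word_ngrams('I like you', (2, 2))
one_to_three = word_ngrams('I like you', (1, 3))
print(bigrams)       # ['I like', 'like you']
print(one_to_three)  # ['I', 'like', 'you', 'I like', 'like you', 'I like you']
```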

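Code:

Step 1: Build a DataFrame and convert the data to an array.

Step 2: Define a function that tokenizes the text, removes stop words, and joins the remaining tokens with spaces in preparation for the bag-of-words model.

Step 3: Wrap the function with np.vectorize, then call it to tokenize and remove stop words.

Step 4: Use CountVectorizer(ngram_range=(2, 2)) to build the combined word features.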

import pandas as pd
import numpy as np
import re
import nltk #pip install nltk


corpus = ['The sky is blue and beautiful.',
          'Love this blue and beautiful sky!',
          'The quick brown fox jumps over the lazy dog.',
          'The brown fox is quick and the blue dog is lazy!',
          'The sky is very blue and the sky is very beautiful today',
          'The dog is lazy but the brown fox is quick!'
]

labels = ['weather', 'weather', 'animals', 'animals', 'weather', 'animals']

# Step 1: build the data in DataFrame format
corpus = np.array(corpus)
corpus_df = pd.DataFrame({'Document': corpus, 'Category': labels})

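# Step 2: define a function that tokenizes and removes stop words
# Load the English stop-word list (requires a one-time nltk.download('stopwords'))
stopwords = nltk.corpus.stopwords.words('english')
# Create the tokenizer
cut_model = nltk.WordPunctTokenizer()
# Define the function that tokenizes and removes stop words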
def Normalize_corpus(doc):
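    # Remove every character that is not a letter, digit, or whitespace (i.e. punctuation)
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', string=doc)
    # Convert the string to lowercase
    doc = doc.lower()
    # Strip leading and trailing whitespace
    doc = doc.strip()
    # Tokenize
    tokens = cut_model.tokenize(doc)
    # Drop tokens that appear in the stop-word list
    doc = [token for token in tokens if token not in stopwords]
    # Join the remaining tokens with ' ' so the result can be fed to the bag-of-words model
    doc = ' '.join(doc)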

    return doc

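# Step 3: vectorize the function and call it
# np.vectorize wraps the function so that, given an array, it is applied to each element in turn and the results come back as an array
Normalize_corpus = np.vectorize(Normalize_corpus)
# Call the function to tokenize and remove stop words
corpus_norm = Normalize_corpus(corpus)
print(corpus_norm)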

# Step 4: use ngram_range to build the combined bag-of-words features
from sklearn.feature_extraction.text import CountVectorizer
CV = CountVectorizer(ngram_range=(2, 2))
CV.fit(corpus_norm)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0 onward
vocs = CV.get_feature_names_out()
print(vocs)
corpus_array = CV.transform(corpus_norm).toarray()
corpus_norm_df = pd.DataFrame(corpus_array, columns=vocs)
print(corpus_norm_df.head())

Part of the combined results
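
For readers who want to check the bigram table without scikit-learn, the counting can be approximated in plain Python. This is only a sketch under assumptions: the normalized strings below are what step 3 is expected to print with NLTK's English stop-word list, and the helper name bigram_counts is hypothetical. The vocabulary is sorted alphabetically to mirror the column order CountVectorizer produces.

```python
from collections import Counter

# Assumed output of step 3 (punctuation and stop words already removed)
corpus_norm = [
    'sky blue beautiful',
    'love blue beautiful sky',
    'quick brown fox jumps lazy dog',
    'brown fox quick blue dog lazy',
    'sky blue sky beautiful today',
    'dog lazy brown fox quick',
]


def bigram_counts(doc):
    """Count each pair of adjacent words in one whitespace-separated document."""
    tokens = doc.split()
    return Counter(' '.join(pair) for pair in zip(tokens, tokens[1:]))


per_doc = [bigram_counts(doc) for doc in corpus_norm]
# Alphabetically sorted vocabulary of every bigram seen in the corpus
vocab = sorted(set().union(*per_doc))
# One row per document, one column per bigram
matrix = [[counts[term] for term in vocab] for counts in per_doc]

print(vocab)
for row in matrix:
    print(row)
```

Each row has as many non-zero entries as the document has distinct bigrams, so the first document ('sky blue beautiful') contributes a 1 only in the 'blue beautiful' and 'sky blue' columns.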
