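Introduction to Machine Learning - Text Features - Building Labels with an LDA Topic Model 1. LatentDirichletAllocation (builds the LDA topic model) 2. LDA.components_ (the weight of each word in each topic)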

Reposted article. 2019-01-27 00:28. Category: Python basics

Function overview

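1. LatentDirichletAllocation(n_components, max_iter, random_state) builds an LDA topic model that groups documents into topics.

Parameters: n_components is the number of topics (older scikit-learn releases named this parameter n_topics), max_iter is the maximum number of iterations, and random_state is the random seed.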
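As a minimal sketch of these parameters, assuming a recent scikit-learn in which the topic count is passed as n_components (the toy corpus and variable names below are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only to illustrate the constructor arguments.
toy_docs = [
    "blue sky beautiful sky",
    "beautiful blue sky today",
    "quick brown fox lazy dog",
    "lazy dog quick brown fox",
]

# Word counts: one row per document, one column per vocabulary term.
counts = CountVectorizer().fit_transform(toy_docs)

lda = LatentDirichletAllocation(
    n_components=2,   # number of topics
    max_iter=100,     # upper bound on the number of training iterations
    random_state=42,  # fixed seed so repeated runs give the same result
)

# fit_transform returns the document-topic matrix: one row per document,
# one column per topic, with each row summing to 1.
doc_topic = lda.fit_transform(counts)
print(doc_topic.shape)
print(doc_topic.sum(axis=1))
```

Changing random_state can swap which column corresponds to which topic, so the seed is worth fixing whenever the labels are reused later.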

2. LDA.components_ holds the weight of every input feature (word) for each topic.

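LDA topic model: it can be used for a rough, unsupervised kind of classification. With two topics, for example, the documents are effectively split into two groups. The topic-word weights can also be used to find the keywords of each topic.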
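To make the "two topics means two classes" idea concrete, here is a small sketch that turns a document-topic matrix into hard labels. The matrix values are invented for illustration; in practice they would come from fit_transform.

```python
import numpy as np

# Hypothetical document-topic matrix, shaped like the output of fit_transform:
# one row per document, one column per topic.
doc_topic = np.array([
    [0.90, 0.10],
    [0.20, 0.80],
    [0.35, 0.65],
])

# Hard label = index of the most probable topic in each row.
# This also works unchanged for more than two topics.
topic_labels = doc_topic.argmax(axis=1)
print(topic_labels)  # [0 1 1]
```

The topic indices are arbitrary: nothing guarantees that topic 0 lines up with a human category such as 'weather', so the mapping has to be checked by looking at each topic's keywords.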

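Importing from sklearn

from sklearn.decomposition import LatentDirichletAllocation; as with other sklearn transformers, the model is applied with fit_transform.

LDA.components_ gives the weights of each topic. Within each row, the weights follow the column order of the input features, so they line up with the vocabulary.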

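Code:

Step 1: Put the data into a DataFrame.

Step 2: Tokenize the text and remove stop words, then rejoin the tokens with ' '.join to prepare for the bag-of-words model.

Step 3: Vectorize the function with np.vectorize, then call it to tokenize every document and remove its stop words.

Step 4: Build the TF-IDF weighted bag-of-words model with TfidfVectorizer.

Step 5: Build the LDA model with LatentDirichletAllocation and map each document to a 0/1 label.

Step 6: Print the per-feature weights from LDA.components_, keeping only those greater than 0.6, to see which words are the main keywords of each topic.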
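Step 3 relies on np.vectorize, which is a convenience wrapper rather than a speed-up: it calls the wrapped function once per array element. A minimal sketch, using a hypothetical stand-in function instead of the real normalizer:

```python
import numpy as np

def shout(text):
    # Hypothetical per-document function standing in for the normalizer.
    return text.strip().upper()

docs = np.array([" blue sky ", "lazy dog"])

# np.vectorize wraps the function so it is applied to each element in turn.
vectorized_shout = np.vectorize(shout)
via_vectorize = vectorized_shout(docs)

# The same result with an explicit list comprehension.
via_loop = np.array([shout(d) for d in docs])

print(via_vectorize)
print((via_vectorize == via_loop).all())
```

Either form works for a corpus of this size; the list comprehension just makes the per-document loop visible.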

import pandas as pd
import numpy as np
import re
import nltk #pip install nltk


corpus = ['The sky is blue and beautiful.',
          'Love this blue and beautiful sky!',
          'The quick brown fox jumps over the lazy dog.',
          'The brown fox is quick and the blue dog is lazy!',
          'The sky is very blue and the sky is very beautiful today',
          'The dog is lazy but the brown fox is quick!'
]

labels = ['weather', 'weather', 'animals', 'animals', 'weather', 'animals']

# Step 1: put the data into a DataFrame
corpus = np.array(corpus)
corpus_df = pd.DataFrame({'Document': corpus, 'category': labels})

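# Step 2: define a function that tokenizes and removes stop words
# Load the English stop word list (requires a one-time nltk.download('stopwords'))
stopwords = nltk.corpus.stopwords.words('english')
# Create the tokenizer
cut_model = nltk.WordPunctTokenizer()
# Define the tokenization and stop word removal function
def Normalize_corpus(doc):
    # Remove punctuation and any other character that is not a letter, digit, or whitespace
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', string=doc)
    # Convert the string to lowercase
    doc = doc.lower()
    # Strip whitespace from both ends of the string
    doc = doc.strip()
    # Tokenize
    tokens = cut_model.tokenize(doc)
    # Remove stop words using the stop word list
    doc = [token for token in tokens if token not in stopwords]
    # Rejoin the remaining tokens with ' ' to prepare for the bag-of-words model
    doc = ' '.join(doc)

    return doc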

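# Step 3: vectorize the function and call it
# np.vectorize lets the function take an array: each element is passed in one at a time and the results come back as an array
Normalize_corpus = np.vectorize(Normalize_corpus)
# Call the function to tokenize each document and remove its stop words
corpus_norm = Normalize_corpus(corpus)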

# Step 4: build the TF-IDF bag-of-words model with TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

Tf = TfidfVectorizer(use_idf=True)
Tf.fit(corpus_norm)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vocs = Tf.get_feature_names_out()
corpus_array = Tf.transform(corpus_norm).toarray()
corpus_norm_df = pd.DataFrame(corpus_array, columns=vocs)
print(corpus_norm_df.head())

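# Step 5: build the LDA topic model
from sklearn.decomposition import LatentDirichletAllocation

# n_components sets the number of topics (called n_topics in older scikit-learn releases)
LDA = LatentDirichletAllocation(n_components=2, max_iter=100, random_state=42)
# Document-topic matrix: one row per document, one column per topic
LDA_corpus = np.array(LDA.fit_transform(corpus_array))
# Label a document 1 when topic 1 outweighs topic 0, otherwise 0
LDA_corpus_one = np.zeros([LDA_corpus.shape[0]])
LDA_corpus_one[LDA_corpus[:, 0] < LDA_corpus[:, 1]] = 1
corpus_norm_df['LDA_labels'] = LDA_corpus_one
print(corpus_norm_df.head())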

# Step 6: print the weight of each word in each topic
tt_matrix = LDA.components_
for tt_m in tt_matrix:
    tt_dict = [(name, tt) for name, tt in zip(vocs, tt_m)]
    tt_dict = sorted(tt_dict, key=lambda x: x[1], reverse=True)
    # Keep only the topic words whose weight is greater than 0.6
    tt_dict = [tt_threshold for tt_threshold in tt_dict if tt_threshold[1] > 0.6]
    print(tt_dict)

The output lists, for each topic, the features whose weight is greater than 0.6.
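The 0.6 cutoff is applied to raw components_ values, which behave like pseudo-counts and therefore scale with the corpus. One alternative, sketched here with invented weights and a made-up vocabulary, is to normalize each topic row into a distribution over words and keep the top few words per topic:

```python
import numpy as np

# Hypothetical vocabulary and topic-word weights, shaped like components_.
vocab = np.array(["beautiful", "blue", "brown", "dog", "fox", "sky"])
weights = np.array([
    [2.0, 3.0, 0.5, 0.5, 0.5, 3.5],
    [0.5, 1.5, 2.5, 3.0, 2.0, 0.5],
])

# Normalize each row so it sums to 1 (a distribution over words per topic).
topic_word = weights / weights.sum(axis=1, keepdims=True)

top_n = 3
top_words = []
for topic_idx, row in enumerate(topic_word):
    # Indices of the top_n largest probabilities, highest first.
    best = row.argsort()[::-1][:top_n]
    top_words.append([str(w) for w in vocab[best]])
    print(topic_idx, list(zip(vocab[best], row[best].round(2))))
```

Picking a fixed number of words per topic avoids having to retune a threshold when the corpus size changes.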
