Text Clustering with a Pre-trained word2vec Model


I tried k-means clustering with simple term-frequency word representations and the results were poor, so I wanted to see whether word2vec representations would do any better.

1. Loading the word2vec model

import gensim

# load the pre-trained vectors (word2vec text format)
model = gensim.models.KeyedVectors.load_word2vec_format('word2vector.bigram-char')

The file was downloaded from the web: 300-dimensional word vectors trained on a Baidu Baike corpus. A quick sanity check:

model.most_similar(['男人'])   # nearest neighbors of 男人 ("man")

[('女人', 0.874478816986084),
('老男人', 0.7225901484489441),
('大男人', 0.7179129123687744),
('女孩', 0.6780898571014404),
('臭男人', 0.6778838038444519),
('中年男人', 0.6763597726821899),
('男孩', 0.6762259006500244),
('真男人', 0.6674383878707886),
('好男人', 0.6661351919174194),
('單身男人', 0.6624549031257629)]

len(model.vocab)   # 635974
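One caveat: `model.vocab` is the gensim 3.x API; gensim 4.0 removed it in favor of `key_to_index`. A small version-agnostic lookup, if you need to run this code on a newer gensim:

import gensim

# gensim 4.0+ replaced KeyedVectors.vocab with key_to_index
major = int(gensim.__version__.split('.')[0])
vocab = model.key_to_index if major >= 4 else model.vocab

print(len(vocab))       # 635974 with this pre-trained file
print('男人' in vocab)  # True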

2. Embedding the documents

Embed our own corpus (about 30,000 news articles, with keywords already extracted) using the word2vec vectors:

# Embed each document: sum the word2vec vectors of its keywords
from datetime import datetime
import numpy as np

start = datetime.now()
embedding = []

for idx, line in enumerate(keywords):
    vector = np.zeros(300)
    for word in line:
        if word in model.vocab:    # skip out-of-vocabulary words
            vector += model[word]
    embedding.append(vector / 20)  # 20 keywords per document
    if idx % 100 == 0:
        print(idx)                 # progress marker

end = datetime.now()
print(end - start)

Since I take 20 keywords from each document, I divide the summed vector by 20, using the mean as the document vector. (Note that a document with out-of-vocabulary keywords sums fewer than 20 vectors, so the fixed divisor shrinks its vector slightly.)
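An alternative is to average only over the keywords actually found in the vocabulary, so documents with missing words are not shrunk toward zero. A minimal sketch, reusing the `keywords` and `model` objects from above:

import numpy as np

def doc_vector(words, model, dim=300):
    """Mean vector over the keywords that exist in the vocabulary."""
    vecs = [model[w] for w in words if w in model.vocab]
    if not vecs:
        return np.zeros(dim)        # no known word: fall back to zeros
    return np.mean(vecs, axis=0)

embedding = np.array([doc_vector(line, model) for line in keywords])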

3. Clustering with sklearn's KMeans

from collections import Counter
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=100, random_state=1).fit(embedding)
y_pred = kmeans.labels_                    # cluster label for each document
cluster_center = kmeans.cluster_centers_   # 100 centroids, 300 dims each

center_dict = Counter(y_pred)              # documents per cluster
center_dict

Check the number of documents in each cluster:

Counter({16: 314,
         27: 384,
         21: 160,
         30: 370,
         99: 223,
         15: 158,
         36: 882,
         48: 180,
         14: 184,
         43: 447,
         98: 726,
         88: 601,
         52: 195,
         53: 351,
         13: 565,
         5: 523,
         22: 417,
         23: 365,
         71: 604,
         37: 740,
         63: 355,
         29: 492,
         25: 554,
         82: 335,
         50: 727,
         41: 676,
         47: 344,
         4: 141,
         70: 274,
         12: 559,
         78: 481,
         84: 820,
         40: 237,
         75: 340,
         3: 394,
         10: 574,
         56: 564,
         59: 414,
         51: 301,
         73: 503,
         6: 560,
         60: 268,
         86: 405,
         2: 611,
         28: 485,
         66: 489,
         76: 334,
         77: 296,
         33: 226,
         65: 464,
         97: 501,
         18: 188,
         7: 218,
         54: 251,
         35: 511,
         92: 404,
         19: 454,
         74: 228,
         67: 325,
         49: 591,
         24: 306,
         69: 547,
         72: 330,
         11: 280,
         95: 374,
         81: 464,
         58: 636,
         32: 274,
         79: 115,
         87: 205,
         62: 425,
         34: 281,
         38: 330,
         96: 269,
         64: 445,
         68: 416,
         9: 382,
         91: 113,
         80: 251,
         20: 517,
         44: 264,
         93: 276,
         26: 240,
         17: 381,
         55: 129,
         57: 470,
         0: 501,
         83: 167,
         8: 261,
         89: 134,
         85: 69,
         31: 200,
         90: 147,
         46: 188,
         94: 492,
         1: 91,
         42: 401,
         45: 124,
         61: 189,
         39: 91})

The cluster sizes look fairly even. Pulling out a random cluster and reading its article titles, some clusters are coherent and some are not, but overall the results are acceptable.
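For that spot check, a sketch of printing one cluster's titles, assuming a hypothetical `titles` list aligned one-to-one with `keywords` (not built above):

import numpy as np

labels = np.asarray(y_pred)
cluster_id = 36                            # the largest cluster above (882 docs)

# indices of all documents assigned to this cluster
members = np.where(labels == cluster_id)[0]
for i in members[:10]:                     # sample ten titles
    print(titles[i])                       # titles is assumed, not defined above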

I also wanted to try DBSCAN, but it ran far too long; making it feasible would require reducing the vector dimensionality first, so I dropped it.
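For reference, the dimensionality reduction such a DBSCAN run would need could be done with PCA; a sketch with illustrative, untuned parameters:

from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

# project the 300-dim document vectors down to 50 dims first
reduced = PCA(n_components=50, random_state=1).fit_transform(embedding)

# eps and min_samples are placeholders and would need tuning
db = DBSCAN(eps=0.5, min_samples=5).fit(reduced)
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print(n_clusters, 'clusters found (label -1 is noise)')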

