[ML] Using word2vec for k-means clustering


This post clusters text with k-means on word2vec vectors (100 dimensions). Each line of the training file is one record, already word-segmented. The code is as follows:

from sklearn.cluster import KMeans  
from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
#from sklearn.decomposition import PCA
from gensim.models import Word2Vec
import nltk
from nltk.corpus import stopwords
#from sklearn.model_selection import train_test_split
import random
import matplotlib.pyplot as plt
%matplotlib inline
#from sklearn.datasets import make_blobs

Load the text:

sents = []
#sents: a pre-segmented file; one record per line, already tokenized with stop words removed
with open('generate_data/sents_for_kmeans.txt','r',encoding='utf-8') as f:
    for line in f:
        sents.append(line.replace('\n',''))

Deduplicate the text:

sents = list(set(sents))
print(len(sents))
print(sents[10])

Output:

67760
含羞草 芒果 500g 大禮包 散裝 無絲 軟糯 芒果 100g

Train the word2vec model:

all_words = [sent.split(' ') for sent in sents]
word2vec = Word2Vec(all_words, size=100)  # size=100 matches the 100-dim vectors mentioned above (gensim < 4.0; use vector_size=100 in 4.x)

Inspect the vocabulary:

vocabulary = word2vec.wv.vocab  # gensim < 4.0; in 4.x use word2vec.wv.key_to_index
print(vocabulary.keys())
len(vocabulary)

Collect all the word vectors into one list:

vectors = []
for item in vocabulary:
    vectors.append(word2vec.wv[item])
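Since `vectors` is a plain Python list, it is often convenient to stack it into a NumPy matrix before handing it to scikit-learn, which also makes shape checks easy. A minimal sketch, with random 100-dim vectors standing in for the word2vec output:

```python
import numpy as np

# synthetic stand-ins for the word2vec vectors (the real ones come from word2vec.wv)
rng = np.random.default_rng(0)
vectors = [rng.standard_normal(100) for _ in range(5)]

# stack the list into an (n_words, 100) matrix that KMeans.fit accepts directly
X = np.vstack(vectors)
print(X.shape)  # (5, 100)
```

With the real data, `X` would have one row per vocabulary word.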

Train the kmeans model:

num_clusters = 2
km_cluster = KMeans(n_clusters=num_clusters, max_iter=300, n_init=40,
                    init='k-means++', n_jobs=-1)  # n_jobs was removed in scikit-learn 0.25+
# fit_predict returns the cluster index assigned to each vector
#result = km_cluster.fit_predict(vectors)
#print("Predicting result: ", result)
km_cluster.fit(vectors)
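The choice of `num_clusters = 2` above is not justified anywhere. Since KMeans exposes `inertia_` (read in the next snippet), one common way to pick k is the elbow method: fit for several values of k and look for the point where inertia stops dropping sharply. A sketch on synthetic blobs (`make_blobs` stands in for the real word vectors, which are not bundled with this post):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic 100-dim data with 3 true clusters, standing in for the word vectors
X, _ = make_blobs(n_samples=300, centers=3, n_features=100, random_state=42)

inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

# inertia always decreases as k grows; the "elbow" marks a reasonable k
print([round(v) for v in inertias])
```

On real word vectors the elbow is usually less pronounced, but the same loop applies.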

Visualize:

cents = km_cluster.cluster_centers_
labels = km_cluster.labels_
inertia = km_cluster.inertia_
mark = ['or', 'ob']  # red/blue circle markers, one per cluster
for j, i in enumerate(labels):
    # note: this plots all 100 components of each vector against their indices,
    # not one point per word
    plt.plot(vectors[j], mark[i], markersize=5)
plt.show()
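The loop above draws every component of every 100-dim vector, so the picture is hard to read. The commented-out PCA import at the top hints at the usual fix: project the vectors to 2-D first and scatter one point per word, colored by cluster. A sketch with synthetic data in place of the word vectors:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# synthetic 100-dim vectors standing in for the word2vec output
X, _ = make_blobs(n_samples=200, centers=2, n_features=100, random_state=0)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# project to 2-D so each vector becomes a single point
X2 = PCA(n_components=2).fit_transform(X)
print(X2.shape)  # (200, 2)

plt.scatter(X2[:, 0], X2[:, 1], c=labels, s=10)
plt.savefig('kmeans_pca.png')
```

With the real data, replace `X` by the stacked word vectors and `labels` by `km_cluster.labels_`.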

 

