最近對clustering感興趣就自己寫了一個k mediods的實現. 這個算法據說是比kmeans要robust. 我覺得關鍵的不同就是cluster的中心點是一個真實的數據點 而不是構想出來的mean.
寫起來倒是很簡單, 最后vectorize用了cdist() 函數 很好用.
先看結果 : 這是3個類 總共3000 個點的結果.
上代碼:
各種imports:
import matplotlib as plt from pylab import * import collections import copy import pdb import numpy as np from scipy.spatial.distance import cdist import random
這里介紹一下我主要的數據結構 :
medoids 是一個字典,
-2 是total cost
-1: 一個numpy array 存了medoids的index.
0 - k 存了各個cluster的成員的index
data是一個array, 每一行是一個數據點.
下面這個函數計算total cost 根據當前的情況:
def total_cost(data, medoids): ''' compute the total cost based on current setting. ''' med_idx = medoids[-1]; k = len(med_idx); cost = 0.0; med = data[ med_idx] dis = cdist( data, med, 'euclidean') cost = dis.min(axis = 1).sum() # rewrite using the cdist() function, which should be way faster # for i in range(k): # med = data[med_idx[i]] # for j in medoids[i]: # cost = cost + np.linalg.norm(med - data[j]) # medoids[-2] = [cost]
clustering()函數 分配每一點的歸屬 根據當前的情況.
def clustering(data, medoids): ''' compute the belonging of each data point according to current medoids centers, and eucludiean distance. ''' # pdb.set_trace() med_idx = medoids[-1] med = data[med_idx] k = len(med_idx) dis = cdist(data, med) best_med_it_belongs_to = dis.argmin(axis = 1) for i in range(k): medoids[i] =where(best_med_it_belongs_to == i)
最重要的kmedoids() 函數, 用來循環找到最優的clutering
old... 和cur... 比較 直到沒有變化了 就退出循環
cur...每次交換一對 ( 中心點, 非中心點) 構成tmp.... , 從所有的tmp...中找到最優的best....
def kmedoids( data, k): ''' given the data and # of clusters, compute the best clustering based on the algorithm provided in wikipedia: google pam algorithm. ''' # cur_medoids compare with old_medoids, convergence achieved if no change in the list of medoids in consecutive iterations. # tmp_medoids is cur_medoids swapped only one pair of medoid and non-medoid data point. # best_medoids is the best tmp_medoids through all possible swaps. N = len(data) cur_medoids = {} cur_medoids[-1] = range(k) clustering(data, cur_medoids) total_cost(data, cur_medoids) old_medoids = {} old_medoids[-1] = [] iter_counter = 1 # stop if not improvement. while not set(old_medoids[-1]) == set(cur_medoids[-1]): print 'iteration couter:' , iter_counter iter_counter = iter_counter + 1 best_medoids = copy.deepcopy(cur_medoids) old_medoids = copy.deepcopy(cur_medoids) # pdb.set_trace() # iterate over all medoids and non-medoids for i in range(N): for j in range(k): if not i ==j : # swap only a pair tmp_medoids = copy.deepcopy(cur_medoids) tmp_medoids[-1][j] = i clustering(data, tmp_medoids) total_cost(data, tmp_medoids) # pick out the best configuration. if( best_medoids[-2] > tmp_medoids[-2]): best_medoids = copy.deepcopy(tmp_medoids) cur_medoids = copy.deepcopy(best_medoids) print 'current total cost is ', cur_medoids[-2] return cur_medoids
最后寫一個test() 函數, 3個gaussian類 總共1000個點.
def test(): dim = 2 N =1000 # create datas with different normal distributions. d1 = np.random.normal(1, .2, (N,dim)) d2 = np.random.normal(2, .5, (N,dim)) d3 = np.random.normal(3, .3, (N,dim)) data = np.vstack((d1,d2,d3)) # need to change if more clusters are needed . k = 3 medoids = kmedoids(data, k) # plot different clusters with different colors. scatter( data[medoids[0], 0] ,data[medoids[0], 1], c = 'r') scatter( data[medoids[1], 0] ,data[medoids[1], 1], c = 'g') scatter( data[medoids[2], 0] ,data[medoids[2], 1], c = 'y') scatter( data[medoids[-1], 0],data[medoids[-1], 1] , marker = 'x' , s = 500) show()
最后執行代碼就好了 只要2min
if __name__ =='__main__': test()
另外, 我還找到了python很好使的scikit-learn toolbox , 是machine learning相關的. 就練了下手 寫個kmeans
先上圖: 這是gaussian 3 個類, 每個類3000 個點
代碼: 首先還是imports
import time import numpy as np import pylab as pl from sklearn.cluster import KMeans from sklearn.metrics.pairwise import euclidean_distances from sklearn.datasets.samples_generator import make_blobs
然后是生成數據, 並且很簡單的train kmeans:
np.random.seed(0) centers = [[1,1], [-1,-1], [1, -1]] k = len(centers) x , labels = make_blobs(n_samples=3000, centers=centers, cluster_std=.7) kmeans = KMeans(init='k-means++', n_clusters=3, n_init = 10) t0 = time.time() kmeans.fit(x) t_end = time.time() - t0
最后是畫圖:
colors = ['r', 'b', 'g'] for k , col in zip( range(k) , colors): members = (kmeans.labels_ == k ) pl.plot( x[members, 0] , x[members,1] , 'w', markerfacecolor=col, marker='.') pl.plot(kmeans.cluster_centers_[k,0], kmeans.cluster_centers_[k,1], 'o', markerfacecolor=col,\ markeredgecolor='k', markersize=10) pl.show()
很多時候都覺得python比matlab好用多了, 最重要的是matlab和vim的結合不是很好 用起來很不順手.