PAM for the k-medoids algorithm: a PAM implementation, plus a k-means example using the scikit-learn toolbox.


I've recently become interested in clustering, so I wrote my own implementation of k-medoids. This algorithm is said to be more robust than k-means. I think the key difference is that each cluster center is an actual data point, rather than a computed mean.

It turned out to be quite simple to write; in the end I vectorized it with the cdist() function, which is very handy.
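
As a quick toy illustration of cdist() (my own example, not part of the original code): it takes two sets of points and returns the full pairwise distance matrix in one call, which is what replaces the explicit loops below.

import numpy as np
from scipy.spatial.distance import cdist

a = np.array([[0.0, 0.0], [3.0, 4.0]])                # 2 points
b = np.array([[0.0, 0.0], [3.0, 0.0], [6.0, 8.0]])    # 3 points
print(cdist(a, b, 'euclidean'))
# [[ 0.  3. 10.]
#  [ 5.  4.  5.]]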

First, the result: this is the output for 3 clusters, 3000 points in total.

Now the code:

The various imports:

import matplotlib.pyplot as plt
from pylab import *
import collections
import copy
import pdb
import numpy as np
from scipy.spatial.distance import cdist
import random

Here's an introduction to my main data structure:

medoids is a dictionary:

-2 holds the total cost

-1 holds a numpy array with the indices of the medoids

0 to k-1 hold the indices of each cluster's member points

data is an array where each row is a data point.
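
For example, after one clustering pass with k = 2 on five points, the dictionary might look like this (a hypothetical snapshot, just to show the layout):

medoids = {
    -2: 3.17,                    # total cost of the current configuration
    -1: np.array([0, 3]),        # indices of the 2 medoid points
     0: np.array([0, 1, 2]),     # indices of the members of cluster 0
     1: np.array([3, 4]),        # indices of the members of cluster 1
}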

The function below computes the total cost for the current configuration:

def total_cost(data, medoids):
    '''
    Compute the total cost for the current configuration.
    '''
    med_idx = medoids[-1]
    med = data[med_idx]
    # distance from every point to every medoid, then sum the row minima
    dis = cdist(data, med, 'euclidean')
    cost = dis.min(axis=1).sum()

    # rewritten using the cdist() function, which is much faster than the
    # original double loop:
    # for i in range(k):
    #     med = data[med_idx[i]]
    #     for j in medoids[i]:
    #         cost = cost + np.linalg.norm(med - data[j])
    medoids[-2] = cost
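
A minimal sanity check of total_cost() on hand-made data (my own toy example):

data = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
medoids = {-1: [0, 2]}    # medoids at points 0 and 2
total_cost(data, medoids)
print(medoids[-2])        # 2.0 -- points 1 and 3 are each 1.0 from their nearest medoid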

The clustering() function assigns each point to a cluster based on the current medoids:

def clustering(data, medoids):
    '''
    Assign each data point to its nearest medoid, using Euclidean distance.
    '''
    med_idx = medoids[-1]
    med = data[med_idx]
    k = len(med_idx)

    dis = cdist(data, med)
    best_med_it_belongs_to = dis.argmin(axis=1)
    for i in range(k):
        medoids[i] = np.where(best_med_it_belongs_to == i)[0]
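
Continuing the toy example from above (same hypothetical data and medoids):

clustering(data, medoids)
print(medoids[0])    # [0 1] -- points assigned to medoid 0
print(medoids[1])    # [2 3] -- points assigned to medoid 1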

 

The most important function is kmedoids(), which loops until it finds the best clustering.

old_medoids is compared against cur_medoids; the loop exits once they no longer change.

Each tmp_medoids is cur_medoids with one (medoid, non-medoid) pair swapped; the best of all the tmp_medoids becomes best_medoids.

def kmedoids(data, k):
    '''
    Given the data and the number of clusters, compute the best clustering
    using the PAM algorithm (see the Wikipedia article on PAM).
    '''
    # cur_medoids is compared with old_medoids; convergence is reached when
    # the set of medoids no longer changes between consecutive iterations.
    # tmp_medoids is cur_medoids with one (medoid, non-medoid) pair swapped.
    # best_medoids is the best tmp_medoids over all possible swaps.

    N = len(data)
    cur_medoids = {}
    cur_medoids[-1] = list(range(k))    # initialize medoids with the first k points
    clustering(data, cur_medoids)
    total_cost(data, cur_medoids)
    old_medoids = {}
    old_medoids[-1] = []

    iter_counter = 1
    # stop when there is no improvement
    while not set(old_medoids[-1]) == set(cur_medoids[-1]):
        print('iteration counter:', iter_counter)
        iter_counter = iter_counter + 1
        best_medoids = copy.deepcopy(cur_medoids)
        old_medoids = copy.deepcopy(cur_medoids)
        # iterate over all (medoid, non-medoid) pairs
        for i in range(N):
            for j in range(k):
                # only consider points that are not already medoids
                if i not in cur_medoids[-1]:
                    # swap exactly one pair
                    tmp_medoids = copy.deepcopy(cur_medoids)
                    tmp_medoids[-1][j] = i

                    clustering(data, tmp_medoids)
                    total_cost(data, tmp_medoids)
                    # keep the best configuration seen so far
                    if best_medoids[-2] > tmp_medoids[-2]:
                        best_medoids = copy.deepcopy(tmp_medoids)
        cur_medoids = copy.deepcopy(best_medoids)
        print('current total cost is', cur_medoids[-2])
    return cur_medoids

Finally, a test() function: 3 Gaussian clusters with 1000 points each.

def test():
    dim = 2
    N = 1000

    # create data from three different normal distributions
    d1 = np.random.normal(1, .2, (N, dim))
    d2 = np.random.normal(2, .5, (N, dim))
    d3 = np.random.normal(3, .3, (N, dim))
    data = np.vstack((d1, d2, d3))

    # change this if more clusters are needed
    k = 3
    medoids = kmedoids(data, k)
    # plot each cluster in a different color, and the medoids as crosses
    scatter(data[medoids[0], 0], data[medoids[0], 1], c='r')
    scatter(data[medoids[1], 0], data[medoids[1], 1], c='g')
    scatter(data[medoids[2], 0], data[medoids[2], 1], c='y')
    scatter(data[medoids[-1], 0], data[medoids[-1], 1], marker='x', s=500)
    show()

Finally, just run the code; it only takes about 2 minutes.

if __name__ =='__main__':
    test()

Also, I found the very handy scikit-learn toolbox for machine learning in Python, so to get some practice I wrote a k-means example.

First the figure: 3 Gaussian clusters, 3000 points in total.

The code: first, the imports again:

import time
import numpy as np
import pylab as pl
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import euclidean_distances
# sklearn.datasets.samples_generator was removed in newer scikit-learn;
# make_blobs now lives directly in sklearn.datasets
from sklearn.datasets import make_blobs

Then generate the data and train k-means, which takes very little code:

np.random.seed(0)
centers = [[1, 1], [-1, -1], [1, -1]]
k = len(centers)
x, labels = make_blobs(n_samples=3000, centers=centers, cluster_std=.7)

kmeans = KMeans(init='k-means++', n_clusters=3, n_init=10)
t0 = time.time()
kmeans.fit(x)
t_end = time.time() - t0
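
For a quick check of the fit, the elapsed time and the final inertia (the sum of squared distances of samples to their closest center; inertia_ is a standard attribute of a fitted KMeans) can be printed:

print('fit took %.3f s, inertia: %.2f' % (t_end, kmeans.inertia_))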

And finally, the plotting:

colors = ['r', 'b', 'g']
for i, col in zip(range(k), colors):    # avoid shadowing k, the number of clusters
    members = (kmeans.labels_ == i)
    pl.plot(x[members, 0], x[members, 1], 'w', markerfacecolor=col, marker='.')
    pl.plot(kmeans.cluster_centers_[i, 0], kmeans.cluster_centers_[i, 1], 'o',
            markerfacecolor=col, markeredgecolor='k', markersize=10)
pl.show()

A lot of the time I find Python much nicer to use than MATLAB; most importantly, MATLAB doesn't integrate well with vim, which makes it awkward to work with.