K-均值聚類算法


一.k均值聚類算法

對於樣本集。"k均值"算法就是針對聚類划分最小化平方誤差: 

其中是簇Ci的均值向量。從上述公式中可以看出,該公式刻畫了簇內樣本圍繞簇均值向量的緊密程度,E值越小簇內樣本的相似度越高。

 

工作流程:

 

k-均值算法的描述如下:

創建k個點作為起始質心(通常隨機選擇)
當任意一個點的簇分配結果發生改變時:
        對數據集中的每個點:
                對每個質心:
                計算質心與數據點之間的距離
         將數據點分配到距離其最近的簇
對每一個簇,計算簇中所有點的均值並將均值作為質心

接下來是對於數據集testSet.txt的代碼實現:

1.658985	4.285136
-3.453687	3.424321
4.838138	-1.151539
-5.379713	-3.362104
0.972564	2.924086
-3.567919	1.531611
0.450614	-3.302219
-3.487105	-1.724432
2.668759	1.594842
-3.156485	3.191137
3.165506	-3.999838
-2.786837	-3.099354
4.208187	2.984927
-2.123337	2.943366
0.704199	-0.479481
-0.392370	-3.963704
2.831667	1.574018
-0.790153	3.343144
2.943496	-3.357075
-3.195883	-2.283926
2.336445	2.875106
-1.786345	2.554248
2.190101	-1.906020
-3.403367	-2.778288
1.778124	3.880832
-1.688346	2.230267
2.592976	-2.054368
-4.007257	-3.207066
2.257734	3.387564
-2.679011	0.785119
0.939512	-4.023563
-3.674424	-2.261084
2.046259	2.735279
-3.189470	1.780269
4.372646	-0.822248
-2.579316	-3.497576
1.889034	5.190400
-0.798747	2.185588
2.836520	-2.658556
-3.837877	-3.253815
2.096701	3.886007
-2.709034	2.923887
3.367037	-3.184789
-2.121479	-4.232586
2.329546	3.179764
-3.284816	3.273099
3.091414	-3.815232
-3.762093	-2.432191
3.542056	2.778832
-1.736822	4.241041
2.127073	-2.983680
-4.323818	-3.938116
3.792121	5.135768
-4.786473	3.358547
2.624081	-3.260715
-4.009299	-2.978115
2.493525	1.963710
-2.513661	2.642162
1.864375	-3.176309
-3.171184	-3.572452
2.894220	2.489128
-2.562539	2.884438
3.491078	-3.947487
-2.565729	-2.012114
3.332948	3.983102
-1.616805	3.573188
2.280615	-2.559444
-2.651229	-3.103198
2.321395	3.154987
-1.685703	2.939697
3.031012	-3.620252
-4.599622	-2.185829
4.196223	1.126677
-2.133863	3.093686
4.668892	-2.562705
-2.793241	-2.149706
2.884105	3.043438
-2.967647	2.848696

  

from numpy import *

#load data
def loadDataSet(fileName):
    dataMat = []
    fr = open(fileName)
    for line in fr.readlines(): #for each line
        curLine = line.strip().split('\t')
        fltLine = list(map(float,curLine)) #這里和書中不同 和上一章一樣修改
        dataMat.append(fltLine)
    return dataMat

#distance func
def distEclud(vecA,vecB):
    return sqrt(sum(power(vecA - vecB, 2)))  # la.norm(vecA-vecB) 向量AB的歐式距離

#init K points randomly
def randCent(dataSet, k):
    n = shape(dataSet)[1]
    centroids = mat(zeros((k,n)))#create centroid mat
    for j in range(n):#create random cluster centers, within bounds of each dimension
        minJ = min(dataSet[:,j])
        rangeJ = float(max(dataSet[:,j]) - minJ)
        centroids[:,j] = mat(minJ + rangeJ * random.rand(k,1))
    return centroids

 

計算出矩陣中的最大值、最小值、距離:

>>> import mykmeans
>>> from numpy import *
>>> datMat=mat(mykmeans.loadDataSet('testSet.txt'))
>>> min(datMat[:,0])
matrix([[-5.379713]])
>>> min(datMat[:,1])
matrix([[-4.232586]])
>>> max(datMat[:,0])
matrix([[4.838138]])
>>> max(datMat[:,1])
matrix([[5.1904]])
>>> mykmeans.randCent(datMat,2)
matrix([[ 0.94526572, -2.73986112],
        [-1.56707725,  4.14438101]])
>>> mykmeans.distEclud(datMat[0],datMat[1])
5.184632816681332

 

接下來是K-均值算法的核心程序:

def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    m = shape(dataSet)[0]
    clusterAssment = mat(zeros((m,2)))#create mat to assign data points 
                                      #to a centroid, also holds SE of each point
    centroids = createCent(dataSet, k)
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        for i in range(m):#for each data point assign it to the closest centroid
            minDist = inf; minIndex = -1
            for j in range(k):
                distJI = distMeas(centroids[j,:],dataSet[i,:])
                if distJI < minDist:
                    minDist = distJI; minIndex = j
            if clusterAssment[i,0] != minIndex: clusterChanged = True
            clusterAssment[i,:] = minIndex,minDist**2
        print (centroids)
        for cent in range(k):#recalculate centroids
            ptsInClust = dataSet[nonzero(clusterAssment[:,0].A==cent)[0]]#get all the point in this cluster
            centroids[cent,:] = mean(ptsInClust, axis=0) #assign centroid to mean 
    return centroids, clusterAssment

datMat=mat(loadDataSet('testSet.txt'))
myCentroids,clustAssing=kMeans(datMat,4)

 

取k為4,得到4個質心:

[[ 4.80480327  0.7928886 ]
 [ 1.2776812  -3.82602902]
 [ 2.56286444  4.00821365]
 [-1.43428372  3.0160483 ]]
[[ 4.007951   -0.28653243]
 [-0.64297274 -3.04016156]
 [ 2.51964406  3.40459212]
 [-2.605345    2.35623864]]
[[ 3.4363324  -2.41850987]
 [-2.26682904 -2.92971896]
 [ 2.54391447  3.21299611]
 [-2.46154315  2.78737555]]
[[ 2.926737   -2.70147753]
 [-3.19984738 -2.96423548]
 [ 2.6265299   3.10868015]
 [-2.46154315  2.78737555]]
[[ 2.80293085 -2.7315146 ]
 [-3.38237045 -2.9473363 ]
 [ 2.6265299   3.10868015]
 [-2.46154315  2.78737555]]

二.后處理提高聚類性能

聚類算法中,k的值是由用戶初始定義的,如何才能判斷k值定義是否合適,就需要用誤差來評價聚類效果的好壞,誤差是各個點與其所屬類別質心的距離決定的。K-均值聚類的方法效果較差的原因是會收斂到局部最小值,而且全局最小。一種評價聚類效果的方法是SSE(Sum of Squared Error)誤差平方和的方法,取平方的結果是使得遠離中心的點變得更加突出。

一種降低SSE的方法是增加簇的個數,即提高k值,但是違背了聚類的目標,聚類的目標是在不改變簇數目的前提下提高簇的質量。可選的改進的方法是對生成的簇進行后處理,將最大SSE值的簇划分成兩個(K=2的K-均值算法),然后再進行相鄰的簇合並。具體方法有兩種:1、合並最近的兩個質心(合並使得SSE增幅最小的兩個質心)2、遍歷簇 合並兩個然后計算SSE的值,找到使得SSE最小的情況。

 

三.二分K-均值算法

這個算法的思想是:初始狀態所有數據點屬於一個大簇,之后每次選擇一個簇切分成兩個簇,這個切分滿足使SSE值最大程度降低,直到簇數目達到k。另一種思路是每次選擇SSE值最大的一個簇進行切分。

 

滿足使SSE值最大程度降低偽代碼如下:

將所有點看成一個簇
   當簇數目小於k時
        對於每一個簇:
            計算總誤差
            在給定的簇上面進行K-均值聚類(k=2)
    計算將該簇一分為二后的總誤差
    選擇使得誤差最小的那個簇進行划分操作

代碼如下:

def biKmeans(dataSet,k,distMeas=distEclud):
    m=shape(dataSet)[0]
    clusterAssment=mat(zeros((m,2)))
    centroid0 = mean(dataSet, axis=0).tolist()[0]
    centList = [centroid0]  # create a list with one centroid
    for j in range(m):  # calc initial Error for each point
        clusterAssment[j, 1] = distMeas(mat(centroid0), dataSet[j, :]) ** 2
    while (len(centList) < k):
        lowestSSE = inf #init SSE
        for i in range(len(centList)):#for every centroid
            ptsInCurrCluster = dataSet[nonzero(clusterAssment[:, 0].A == i)[0],:]  # get the data points currently in cluster i
            centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2, distMeas)# k=2,kMeans
            sseSplit = sum(splitClustAss[:, 1])  # compare the SSE to the currrent minimum
            sseNotSplit = sum(clusterAssment[nonzero(clusterAssment[:, 0].A != i)[0], 1])
            print("sseSplit, and notSplit: ", sseSplit, sseNotSplit)
            if (sseSplit + sseNotSplit) < lowestSSE: #judge the error
                bestCentToSplit = i
                bestNewCents = centroidMat
                bestClustAss = splitClustAss.copy()
                lowestSSE = sseSplit + sseNotSplit
        #new cluster and split cluster
        bestClustAss[nonzero(bestClustAss[:, 0].A == 1)[0], 0] = len(centList)  # change 1 to 3,4, or whatever
        bestClustAss[nonzero(bestClustAss[:, 0].A == 0)[0], 0] = bestCentToSplit
        print('the bestCentToSplit is: ', bestCentToSplit)
        print('the len of bestClustAss is: ', len(bestClustAss))
        centList[bestCentToSplit] = bestNewCents[0, :].tolist()[0]  # replace a centroid with two best centroids
        centList.append(bestNewCents[1, :].tolist()[0])
        clusterAssment[nonzero(clusterAssment[:, 0].A == bestCentToSplit)[0],:] = bestClustAss  # reassign new clusters, and SSE
    return mat(centList), clusterAssment

datMat3=mat(loadDataSet('testSet2.txt'))
centList,myNewAssments=biKmeans(datMat3,3)
centList

運行結果:

[[4.22282288 2.88343006]
 [0.53706656 4.28802843]]
[[ 2.37096343  0.78503778]
 [-1.72960876  1.49902535]]
[[ 2.22969593  1.291217  ]
 [-2.24671409  1.16767909]]
[[ 2.40575527  1.66726781]
 [-2.11802947  0.88737776]]
[[ 2.93386365  3.12782785]
 [-1.70351595  0.27408125]]
sseSplit, and notSplit:  541.2976292649145 0.0
the bestCentToSplit is:  0
the len of bestClustAss is:  60
[[2.17022341 1.96268662]
 [3.12589387 4.12081081]]
[[2.38650938 2.40163825]
 [3.2987665  3.61195425]]
[[2.32716778 2.52893967]
 [3.43025118 3.61782727]]
sseSplit, and notSplit:  27.63707606832937 501.7683305828214
[[-2.80259011  4.25652642]
 [-0.51490661  3.66748635]]
[[-3.14707283  3.35698672]
 [-0.52242395 -2.24829595]]
[[-2.94737575  3.3263781 ]
 [-0.45965615 -2.7782156 ]]
sseSplit, and notSplit:  67.2202000797829 39.52929868209309
the bestCentToSplit is:  1
the len of bestClustAss is:  40

四.應用:對地圖上的點進行聚類

portlandClubs.txt

Dolphin II    10860 SW Beaverton-Hillsdale Hwy    Beaverton, OR
Hotties    10140 SW Canyon Rd.    Beaverton, OR
Pussycats    8666a SW Canyon Road    Beaverton, OR
Stars Cabaret    4570 Lombard Ave    Beaverton, OR
Sunset Strip    10205 SW Park Way    Beaverton, OR
Vegas VIP Room    10018 SW Canyon Rd    Beaverton, OR
Full Moon Bar and Grill    28014 Southeast Wally Road    Boring, OR
505 Club    505 Burnside Rd    Gresham, OR
Dolphin    17180 McLoughlin Blvd    Milwaukie, OR
Dolphin III    13305 SE McLoughlin BLVD    Milwaukie, OR
Acropolis    8325 McLoughlin Blvd    Portland, OR
Blush    5145 SE McLoughlin Blvd    Portland, OR
Boom Boom Room    8345 Barbur Blvd    Portland, OR
Bottoms Up    16900 Saint Helens Rd    Portland, OR
Cabaret II    17544 Stark St    Portland, OR
Cabaret Lounge    503 W Burnside    Portland, OR
Carnaval    330 SW 3rd Avenue    Portland, OR
Casa Diablo    2839 NW St. Helens Road    Portland, OR
Chantilly Lace    6723 Killingsworth St    Portland, OR
Club 205    9939 Stark St    Portland, OR
Club Rouge    403 SW Stark    Portland, OR
Dancin' Bare    8440 Interstate Ave    Portland, OR
Devil's Point    5305 SE Foster Rd    Portland, OR
Double Dribble    13550 Southeast Powell Boulevard    Portland, OR
Dream on Saloon    15920 Stark St    Portland, OR
DV8    5003 Powell Blvd    Portland, OR
Exotica    240 Columbia Blvd    Portland, OR
Frolics    8845 Sandy Blvd    Portland, OR
G-Spot Airport    8654 Sandy Blvd    Portland, OR
G-Spot Northeast    3400 NE 82nd Ave    Portland, OR
G-Spot Southeast    5241 SE 72nd Ave    Portland, OR
Glimmers    3532 Powell Blvd    Portland, OR
Golden Dragon Exotic Club    324 SW 3rd Ave    Portland, OR
Heat    12131 SE Holgate Blvd.    Portland, OR
Honeysuckle's Lingerie    3520 82nd Ave    Portland, OR
Hush Playhouse    13560 Powell Blvd    Portland, OR
JD's Bar &amp; Grill    4523 NE 60th Ave    Portland, OR
Jody's Bar And Grill    12035 Glisan St    Portland, OR
Landing Strip    6210 Columbia Blvd    Portland, OR
Lucky Devil Lounge    633 SE Powell Blvd    Portland, OR
Lure    11051 Barbur Blvd    Portland, OR
Magic Garden    217 4th Ave    Portland, OR
Mary's Club    129 Broadway    Portland, OR
Montego's    15826 SE Division    Portland, OR
Mr. Peeps    709 122nd Ave    Portland, OR
Mynt Gentlemen's Club    3390 NE Sandy Blvd    Portland, OR
Mystic    9950 SE Stark St.    Portland, OR
Nicolai Street Clubhouse    2460 24th Ave    Portland, OR
Oh Zone    6218 Columbia Blvd    Portland, OR
Pallas Club    13639 Powell Blvd    Portland, OR
Pirates Cove    7427 Sandy Blvd    Portland, OR
Private Pleasures    10931 53rd Ave    Portland, OR
Pussycats    3414 Northeast 82nd Avenue    Portland, OR
Riverside Corral    545 Tacoma St    Portland, OR
Rooster's    605 Columbia Blvd    Portland, OR
Rose City Strip    3620 35th Pl    Portland, OR
Safari Show Club    3000 SE Powell Blvd    Portland, OR
Sassy's Bar &amp; Grill    927 Morrison St    Portland, OR
Secret Rendezvous    12503 Division St    Portland, OR
Shimmers    7944 Foster Rd    Portland, OR
Soobie's    333 SE 122nd Ave    Portland, OR
Spyce Gentleman's Club    33 NW 2nd Ave    Portland, OR
Sugar Shack    6732 Killingsworth St    Portland, OR
The Hawthorne Strip    1008 Hawthorne Blvd    Portland, OR
Tommy's Too    10335 Foster Rd    Portland, OR
Union Jacks    938 Burnside St    Portland, OR
Video Visions    6723 Killingsworth St    Portland, OR
Stars Cabaret Bridgeport    17939 SW McEwan Rd    Tigard, OR
Jiggles    7455 SW Nyberg St    Tualatin, OR

 

首先,我們需要將地址轉換為經度和緯度

import urllib
import json
def geoGrab(stAddress, city):
    apiStem = 'http://where.yahooapis.com/geocode?'  #create a dict and constants for the goecoder
    params = {}
    params['flags'] = 'J'#JSON return type
    params['appid'] = 'aaa0VN6k'
    params['location'] = '%s %s' % (stAddress, city)
    url_params = urllib.parse.urlencode(params)
    yahooApi = apiStem + url_params      #print url_params
    print (yahooApi)
    c=urllib.parse.urlopen(yahooApi)
    return json.loads(c.read())

from time import sleep
def massPlaceFind(fileName):
    fw = open('places.txt', 'w')
    for line in open(fileName).readlines():
        line = line.strip()
        lineArr = line.split('\t')
        retDict = geoGrab(lineArr[1], lineArr[2])
        if retDict['ResultSet']['Error'] == 0:
            lat = float(retDict['ResultSet']['Results'][0]['latitude'])
            lng = float(retDict['ResultSet']['Results'][0]['longitude'])
            print ("%s\t%f\t%f" % (lineArr[0], lat, lng))
            fw.write('%s\t%f\t%f\n' % (line, lat, lng))
        else: print ("error fetching")
        sleep(1)
    fw.close()
    
massPlaceFind('portlandClubs.txt')

 

這會在工作目錄下生成一個places.txt文本文件,接下來將使用這些點進行聚類,並將俱樂部以及他們的簇中心畫在城市地圖上。

def distSLC(vecA, vecB):#Spherical Law of Cosines
    a = sin(vecA[0,1]*pi/180) * sin(vecB[0,1]*pi/180)
    b = cos(vecA[0,1]*pi/180) * cos(vecB[0,1]*pi/180) * \
                      cos(pi * (vecB[0,0]-vecA[0,0]) /180)
    return arccos(a + b)*6371.0 #pi is imported with numpy

import matplotlib
import matplotlib.pyplot as plt
def clusterClubs(numClust=5):
    datList = []
    for line in open('places.txt').readlines():
        lineArr = line.split('\t')
        datList.append([float(lineArr[4]), float(lineArr[3])])
    datMat = mat(datList)
    myCentroids, clustAssing = biKmeans(datMat, numClust, distMeas=distSLC)
    fig = plt.figure()
    rect=[0.1,0.1,0.8,0.8]
    scatterMarkers=['s', 'o', '^', '8', 'p', \
                    'd', 'v', 'h', '>', '<']
    axprops = dict(xticks=[], yticks=[])
    ax0=fig.add_axes(rect, label='ax0', **axprops)
    imgP = plt.imread('Portland.png')
    ax0.imshow(imgP)
    ax1=fig.add_axes(rect, label='ax1', frameon=False)
    for i in range(numClust):
        ptsInCurrCluster = datMat[nonzero(clustAssing[:,0].A==i)[0],:]
        markerStyle = scatterMarkers[i % len(scatterMarkers)]
        ax1.scatter(ptsInCurrCluster[:,0].flatten().A[0], ptsInCurrCluster[:,1].flatten().A[0], marker=markerStyle, s=90)
    ax1.scatter(myCentroids[:,0].flatten().A[0], myCentroids[:,1].flatten().A[0], marker='+', s=300)
    plt.show()

clusterClubs(5)

 

當 將簇的數目改為6時:

 

當將簇的數目改為7時

 參考:https://blog.csdn.net/sinat_17196995/article/details/70332664

            https://blog.csdn.net/u013266697/article/details/52832215


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM