一.k均值聚類算法
對於樣本集
。"k均值"算法就是針對聚類划分
最小化平方誤差:
其中
是簇Ci的均值向量。從上述公式中可以看出,該公式刻畫了簇內樣本圍繞簇均值向量的緊密程度,E值越小簇內樣本的相似度越高。
工作流程:

k-均值算法的描述如下:
創建k個點作為起始質心(通常隨機選擇)
當任意一個點的簇分配結果發生改變時:
對數據集中的每個點:
對每個質心:
計算質心與數據點之間的距離
將數據點分配到距離其最近的簇
對每一個簇,計算簇中所有點的均值並將均值作為質心
接下來是對於數據集testSet.txt的代碼實現:
1.658985 4.285136 -3.453687 3.424321 4.838138 -1.151539 -5.379713 -3.362104 0.972564 2.924086 -3.567919 1.531611 0.450614 -3.302219 -3.487105 -1.724432 2.668759 1.594842 -3.156485 3.191137 3.165506 -3.999838 -2.786837 -3.099354 4.208187 2.984927 -2.123337 2.943366 0.704199 -0.479481 -0.392370 -3.963704 2.831667 1.574018 -0.790153 3.343144 2.943496 -3.357075 -3.195883 -2.283926 2.336445 2.875106 -1.786345 2.554248 2.190101 -1.906020 -3.403367 -2.778288 1.778124 3.880832 -1.688346 2.230267 2.592976 -2.054368 -4.007257 -3.207066 2.257734 3.387564 -2.679011 0.785119 0.939512 -4.023563 -3.674424 -2.261084 2.046259 2.735279 -3.189470 1.780269 4.372646 -0.822248 -2.579316 -3.497576 1.889034 5.190400 -0.798747 2.185588 2.836520 -2.658556 -3.837877 -3.253815 2.096701 3.886007 -2.709034 2.923887 3.367037 -3.184789 -2.121479 -4.232586 2.329546 3.179764 -3.284816 3.273099 3.091414 -3.815232 -3.762093 -2.432191 3.542056 2.778832 -1.736822 4.241041 2.127073 -2.983680 -4.323818 -3.938116 3.792121 5.135768 -4.786473 3.358547 2.624081 -3.260715 -4.009299 -2.978115 2.493525 1.963710 -2.513661 2.642162 1.864375 -3.176309 -3.171184 -3.572452 2.894220 2.489128 -2.562539 2.884438 3.491078 -3.947487 -2.565729 -2.012114 3.332948 3.983102 -1.616805 3.573188 2.280615 -2.559444 -2.651229 -3.103198 2.321395 3.154987 -1.685703 2.939697 3.031012 -3.620252 -4.599622 -2.185829 4.196223 1.126677 -2.133863 3.093686 4.668892 -2.562705 -2.793241 -2.149706 2.884105 3.043438 -2.967647 2.848696
from numpy import * #load data def loadDataSet(fileName): dataMat = [] fr = open(fileName) for line in fr.readlines(): #for each line curLine = line.strip().split('\t') fltLine = list(map(float,curLine)) #這里和書中不同 和上一章一樣修改 dataMat.append(fltLine) return dataMat #distance func def distEclud(vecA,vecB): return sqrt(sum(power(vecA - vecB, 2))) # la.norm(vecA-vecB) 向量AB的歐式距離 #init K points randomly def randCent(dataSet, k): n = shape(dataSet)[1] centroids = mat(zeros((k,n)))#create centroid mat for j in range(n):#create random cluster centers, within bounds of each dimension minJ = min(dataSet[:,j]) rangeJ = float(max(dataSet[:,j]) - minJ) centroids[:,j] = mat(minJ + rangeJ * random.rand(k,1)) return centroids
計算出矩陣中的最大值、最小值、距離:
>>> import mykmeans >>> from numpy import * >>> datMat=mat(mykmeans.loadDataSet('testSet.txt')) >>> min(datMat[:,0]) matrix([[-5.379713]]) >>> min(datMat[:,1]) matrix([[-4.232586]]) >>> max(datMat[:,0]) matrix([[4.838138]]) >>> max(datMat[:,1]) matrix([[5.1904]]) >>> mykmeans.randCent(datMat,2) matrix([[ 0.94526572, -2.73986112], [-1.56707725, 4.14438101]]) >>> mykmeans.distEclud(datMat[0],datMat[1]) 5.184632816681332
接下來是K-均值算法的核心程序:
def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent): m = shape(dataSet)[0] clusterAssment = mat(zeros((m,2)))#create mat to assign data points #to a centroid, also holds SE of each point centroids = createCent(dataSet, k) clusterChanged = True while clusterChanged: clusterChanged = False for i in range(m):#for each data point assign it to the closest centroid minDist = inf; minIndex = -1 for j in range(k): distJI = distMeas(centroids[j,:],dataSet[i,:]) if distJI < minDist: minDist = distJI; minIndex = j if clusterAssment[i,0] != minIndex: clusterChanged = True clusterAssment[i,:] = minIndex,minDist**2 print (centroids) for cent in range(k):#recalculate centroids ptsInClust = dataSet[nonzero(clusterAssment[:,0].A==cent)[0]]#get all the point in this cluster centroids[cent,:] = mean(ptsInClust, axis=0) #assign centroid to mean return centroids, clusterAssment datMat=mat(loadDataSet('testSet.txt')) myCentroids,clustAssing=kMeans(datMat,4)
取k為4,得到4個質心:
[[ 4.80480327 0.7928886 ] [ 1.2776812 -3.82602902] [ 2.56286444 4.00821365] [-1.43428372 3.0160483 ]] [[ 4.007951 -0.28653243] [-0.64297274 -3.04016156] [ 2.51964406 3.40459212] [-2.605345 2.35623864]] [[ 3.4363324 -2.41850987] [-2.26682904 -2.92971896] [ 2.54391447 3.21299611] [-2.46154315 2.78737555]] [[ 2.926737 -2.70147753] [-3.19984738 -2.96423548] [ 2.6265299 3.10868015] [-2.46154315 2.78737555]] [[ 2.80293085 -2.7315146 ] [-3.38237045 -2.9473363 ] [ 2.6265299 3.10868015] [-2.46154315 2.78737555]]
二.后處理提高聚類性能

聚類算法中,k的值是由用戶初始定義的,如何才能判斷k值定義是否合適,就需要用誤差來評價聚類效果的好壞,誤差是各個點與其所屬類別質心的距離決定的。K-均值聚類的方法效果較差的原因是會收斂到局部最小值,而且全局最小。一種評價聚類效果的方法是SSE(Sum of Squared Error)誤差平方和的方法,取平方的結果是使得遠離中心的點變得更加突出。
一種降低SSE的方法是增加簇的個數,即提高k值,但是違背了聚類的目標,聚類的目標是在不改變簇數目的前提下提高簇的質量。可選的改進的方法是對生成的簇進行后處理,將最大SSE值的簇划分成兩個(K=2的K-均值算法),然后再進行相鄰的簇合並。具體方法有兩種:1、合並最近的兩個質心(合並使得SSE增幅最小的兩個質心)2、遍歷簇 合並兩個然后計算SSE的值,找到使得SSE最小的情況。
三.二分K-均值算法

這個算法的思想是:初始狀態所有數據點屬於一個大簇,之后每次選擇一個簇切分成兩個簇,這個切分滿足使SSE值最大程度降低,直到簇數目達到k。另一種思路是每次選擇SSE值最大的一個簇進行切分。
滿足使SSE值最大程度降低偽代碼如下:
將所有點看成一個簇 當簇數目小於k時 對於每一個簇: 計算總誤差 在給定的簇上面進行K-均值聚類(k=2) 計算將該簇一分為二后的總誤差 選擇使得誤差最小的那個簇進行划分操作
代碼如下:
def biKmeans(dataSet,k,distMeas=distEclud): m=shape(dataSet)[0] clusterAssment=mat(zeros((m,2))) centroid0 = mean(dataSet, axis=0).tolist()[0] centList = [centroid0] # create a list with one centroid for j in range(m): # calc initial Error for each point clusterAssment[j, 1] = distMeas(mat(centroid0), dataSet[j, :]) ** 2 while (len(centList) < k): lowestSSE = inf #init SSE for i in range(len(centList)):#for every centroid ptsInCurrCluster = dataSet[nonzero(clusterAssment[:, 0].A == i)[0],:] # get the data points currently in cluster i centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2, distMeas)# k=2,kMeans sseSplit = sum(splitClustAss[:, 1]) # compare the SSE to the currrent minimum sseNotSplit = sum(clusterAssment[nonzero(clusterAssment[:, 0].A != i)[0], 1]) print("sseSplit, and notSplit: ", sseSplit, sseNotSplit) if (sseSplit + sseNotSplit) < lowestSSE: #judge the error bestCentToSplit = i bestNewCents = centroidMat bestClustAss = splitClustAss.copy() lowestSSE = sseSplit + sseNotSplit #new cluster and split cluster bestClustAss[nonzero(bestClustAss[:, 0].A == 1)[0], 0] = len(centList) # change 1 to 3,4, or whatever bestClustAss[nonzero(bestClustAss[:, 0].A == 0)[0], 0] = bestCentToSplit print('the bestCentToSplit is: ', bestCentToSplit) print('the len of bestClustAss is: ', len(bestClustAss)) centList[bestCentToSplit] = bestNewCents[0, :].tolist()[0] # replace a centroid with two best centroids centList.append(bestNewCents[1, :].tolist()[0]) clusterAssment[nonzero(clusterAssment[:, 0].A == bestCentToSplit)[0],:] = bestClustAss # reassign new clusters, and SSE return mat(centList), clusterAssment datMat3=mat(loadDataSet('testSet2.txt')) centList,myNewAssments=biKmeans(datMat3,3) centList
運行結果:
[[4.22282288 2.88343006] [0.53706656 4.28802843]] [[ 2.37096343 0.78503778] [-1.72960876 1.49902535]] [[ 2.22969593 1.291217 ] [-2.24671409 1.16767909]] [[ 2.40575527 1.66726781] [-2.11802947 0.88737776]] [[ 2.93386365 3.12782785] [-1.70351595 0.27408125]] sseSplit, and notSplit: 541.2976292649145 0.0 the bestCentToSplit is: 0 the len of bestClustAss is: 60 [[2.17022341 1.96268662] [3.12589387 4.12081081]] [[2.38650938 2.40163825] [3.2987665 3.61195425]] [[2.32716778 2.52893967] [3.43025118 3.61782727]] sseSplit, and notSplit: 27.63707606832937 501.7683305828214 [[-2.80259011 4.25652642] [-0.51490661 3.66748635]] [[-3.14707283 3.35698672] [-0.52242395 -2.24829595]] [[-2.94737575 3.3263781 ] [-0.45965615 -2.7782156 ]] sseSplit, and notSplit: 67.2202000797829 39.52929868209309 the bestCentToSplit is: 1 the len of bestClustAss is: 40
四.應用:對地圖上的點進行聚類
portlandClubs.txt
Dolphin II 10860 SW Beaverton-Hillsdale Hwy Beaverton, OR Hotties 10140 SW Canyon Rd. Beaverton, OR Pussycats 8666a SW Canyon Road Beaverton, OR Stars Cabaret 4570 Lombard Ave Beaverton, OR Sunset Strip 10205 SW Park Way Beaverton, OR Vegas VIP Room 10018 SW Canyon Rd Beaverton, OR Full Moon Bar and Grill 28014 Southeast Wally Road Boring, OR 505 Club 505 Burnside Rd Gresham, OR Dolphin 17180 McLoughlin Blvd Milwaukie, OR Dolphin III 13305 SE McLoughlin BLVD Milwaukie, OR Acropolis 8325 McLoughlin Blvd Portland, OR Blush 5145 SE McLoughlin Blvd Portland, OR Boom Boom Room 8345 Barbur Blvd Portland, OR Bottoms Up 16900 Saint Helens Rd Portland, OR Cabaret II 17544 Stark St Portland, OR Cabaret Lounge 503 W Burnside Portland, OR Carnaval 330 SW 3rd Avenue Portland, OR Casa Diablo 2839 NW St. Helens Road Portland, OR Chantilly Lace 6723 Killingsworth St Portland, OR Club 205 9939 Stark St Portland, OR Club Rouge 403 SW Stark Portland, OR Dancin' Bare 8440 Interstate Ave Portland, OR Devil's Point 5305 SE Foster Rd Portland, OR Double Dribble 13550 Southeast Powell Boulevard Portland, OR Dream on Saloon 15920 Stark St Portland, OR DV8 5003 Powell Blvd Portland, OR Exotica 240 Columbia Blvd Portland, OR Frolics 8845 Sandy Blvd Portland, OR G-Spot Airport 8654 Sandy Blvd Portland, OR G-Spot Northeast 3400 NE 82nd Ave Portland, OR G-Spot Southeast 5241 SE 72nd Ave Portland, OR Glimmers 3532 Powell Blvd Portland, OR Golden Dragon Exotic Club 324 SW 3rd Ave Portland, OR Heat 12131 SE Holgate Blvd. Portland, OR Honeysuckle's Lingerie 3520 82nd Ave Portland, OR Hush Playhouse 13560 Powell Blvd Portland, OR JD's Bar & Grill 4523 NE 60th Ave Portland, OR Jody's Bar And Grill 12035 Glisan St Portland, OR Landing Strip 6210 Columbia Blvd Portland, OR Lucky Devil Lounge 633 SE Powell Blvd Portland, OR Lure 11051 Barbur Blvd Portland, OR Magic Garden 217 4th Ave Portland, OR Mary's Club 129 Broadway Portland, OR Montego's 15826 SE Division Portland, OR Mr. Peeps 709 122nd Ave Portland, OR Mynt Gentlemen's Club 3390 NE Sandy Blvd Portland, OR Mystic 9950 SE Stark St. Portland, OR Nicolai Street Clubhouse 2460 24th Ave Portland, OR Oh Zone 6218 Columbia Blvd Portland, OR Pallas Club 13639 Powell Blvd Portland, OR Pirates Cove 7427 Sandy Blvd Portland, OR Private Pleasures 10931 53rd Ave Portland, OR Pussycats 3414 Northeast 82nd Avenue Portland, OR Riverside Corral 545 Tacoma St Portland, OR Rooster's 605 Columbia Blvd Portland, OR Rose City Strip 3620 35th Pl Portland, OR Safari Show Club 3000 SE Powell Blvd Portland, OR Sassy's Bar & Grill 927 Morrison St Portland, OR Secret Rendezvous 12503 Division St Portland, OR Shimmers 7944 Foster Rd Portland, OR Soobie's 333 SE 122nd Ave Portland, OR Spyce Gentleman's Club 33 NW 2nd Ave Portland, OR Sugar Shack 6732 Killingsworth St Portland, OR The Hawthorne Strip 1008 Hawthorne Blvd Portland, OR Tommy's Too 10335 Foster Rd Portland, OR Union Jacks 938 Burnside St Portland, OR Video Visions 6723 Killingsworth St Portland, OR Stars Cabaret Bridgeport 17939 SW McEwan Rd Tigard, OR Jiggles 7455 SW Nyberg St Tualatin, OR
首先,我們需要將地址轉換為經度和緯度
import urllib import json def geoGrab(stAddress, city): apiStem = 'http://where.yahooapis.com/geocode?' #create a dict and constants for the goecoder params = {} params['flags'] = 'J'#JSON return type params['appid'] = 'aaa0VN6k' params['location'] = '%s %s' % (stAddress, city) url_params = urllib.parse.urlencode(params) yahooApi = apiStem + url_params #print url_params print (yahooApi) c=urllib.parse.urlopen(yahooApi) return json.loads(c.read()) from time import sleep def massPlaceFind(fileName): fw = open('places.txt', 'w') for line in open(fileName).readlines(): line = line.strip() lineArr = line.split('\t') retDict = geoGrab(lineArr[1], lineArr[2]) if retDict['ResultSet']['Error'] == 0: lat = float(retDict['ResultSet']['Results'][0]['latitude']) lng = float(retDict['ResultSet']['Results'][0]['longitude']) print ("%s\t%f\t%f" % (lineArr[0], lat, lng)) fw.write('%s\t%f\t%f\n' % (line, lat, lng)) else: print ("error fetching") sleep(1) fw.close() massPlaceFind('portlandClubs.txt')
這會在工作目錄下生成一個places.txt文本文件,接下來將使用這些點進行聚類,並將俱樂部以及他們的簇中心畫在城市地圖上。
def distSLC(vecA, vecB):#Spherical Law of Cosines a = sin(vecA[0,1]*pi/180) * sin(vecB[0,1]*pi/180) b = cos(vecA[0,1]*pi/180) * cos(vecB[0,1]*pi/180) * \ cos(pi * (vecB[0,0]-vecA[0,0]) /180) return arccos(a + b)*6371.0 #pi is imported with numpy import matplotlib import matplotlib.pyplot as plt def clusterClubs(numClust=5): datList = [] for line in open('places.txt').readlines(): lineArr = line.split('\t') datList.append([float(lineArr[4]), float(lineArr[3])]) datMat = mat(datList) myCentroids, clustAssing = biKmeans(datMat, numClust, distMeas=distSLC) fig = plt.figure() rect=[0.1,0.1,0.8,0.8] scatterMarkers=['s', 'o', '^', '8', 'p', \ 'd', 'v', 'h', '>', '<'] axprops = dict(xticks=[], yticks=[]) ax0=fig.add_axes(rect, label='ax0', **axprops) imgP = plt.imread('Portland.png') ax0.imshow(imgP) ax1=fig.add_axes(rect, label='ax1', frameon=False) for i in range(numClust): ptsInCurrCluster = datMat[nonzero(clustAssing[:,0].A==i)[0],:] markerStyle = scatterMarkers[i % len(scatterMarkers)] ax1.scatter(ptsInCurrCluster[:,0].flatten().A[0], ptsInCurrCluster[:,1].flatten().A[0], marker=markerStyle, s=90) ax1.scatter(myCentroids[:,0].flatten().A[0], myCentroids[:,1].flatten().A[0], marker='+', s=300) plt.show() clusterClubs(5)

當 將簇的數目改為6時:

當將簇的數目改為7時

參考:https://blog.csdn.net/sinat_17196995/article/details/70332664
https://blog.csdn.net/u013266697/article/details/52832215
