手寫高斯混合聚類算法


參考資料:《機器學習》

原理

n維樣本的高斯分布為:

 

 ∑為協方差矩陣

由貝葉斯定理,樣本Xj屬於i類的后驗概率為:

 

 將上式簡寫為γji

則樣本Xj分類公式為

 

 給每一個分類一個系數,采用對數似然,得

 

 上式分別對∑,μ求導,令導數為0,得

 

 

 

 系數求和為1,引入此約束,對數似然的拉格朗日形式為

 

上式對系數α求導,令導數為0,得

 

 以上,紅框部分即為參數更新公式,具體求導涉及標量對向量/矩陣的求導,一般采用微分法求解,請自行查閱求導規則。

算法流程為:

 

 實現代碼:

數據集

采用西瓜數據集4.0,如下

序號,密度,含糖量
1,0.697,0.460 2,0.774,0.376 3,0.634,0.264 4,0.608,0.318 5,0.556,0.215 6,0.403,0.237 7,0.481,0.149 8,0.437,0.211 9,0.666,0.091 10,0.243,0.267 11,0.245,0.057 12,0.343,0.099 13,0.639,0.161 14,0.657,0.198 15,0.360,0.370 16,0.593,0.042 17,0.719,0.103 18,0.359,0.188 19,0.339,0.241 20,0.282,0.257 21,0.748,0.232 22,0.714,0.346 23,0.483,0.312 24,0.478,0.437 25,0.525,0.369 26,0.751,0.489 27,0.532,0.472 28,0.473,0.376 29,0.725,0.445 30,0.446,0.459

 代碼

# 1 讀取數據
file='xigua4.txt'
x=[]
with open(file) as f:
    f.readline()
    lines=f.read().split('\n')
    for line in lines:
        data=line.split(',')
        x.append([float(data[-2]),float(data[-1])])
y=np.array(x)
# 2 算法部分
import numpy as np
import random

def probability(x,u,cov):
    cov_inv=np.linalg.inv(cov)
    cov_det=np.linalg.det(cov)
    return np.exp(-1/2*((x-u).T.dot(cov_inv.dot(x-u))))/np.sqrt(cov_det)


def gauss_mixed_clustering(x,k=3,epochs=50,reload_params=None):
    features_num=len(x[0])
    r=np.empty(shape=(len(x),k))
#     初始化系數,均值向量和協方差矩陣
    if reload_params!=None:
        a,u,cov=reload_params
    else:
        a=np.random.uniform(size=k)
        a/=np.sum(a)
        u=np.array(random.sample(list(x),k))
        cov=np.empty(shape=(k,features_num,features_num))
#         初始化為只有對角線不為0
        for i in range(k):
            for j in range(features_num):
                cov[i][j]=[0]*j+[0.5]+[0]*(features_num-j-1)
    step=0
    while step<epochs:
#         E步:計算r_ji
        for j in range(len(x)):
            for i in range(k):
                r[j,i]=a[i]*probability(x[j],u[i],cov[i])
            r[j]/=np.sum(r[j])
            
        for i in range(k):
            r_toal=np.sum(r[:,i])
            u[i]=np.sum([x[j]*r[j,i] for j in range(len(x))],axis=0)/r_toal
            cov[i]=np.sum([r[j,i]*((x[j]-u[i]).reshape((features_num,1)).dot((x[j]-u[i]).reshape((1,features_num)))) for j in range(len(x))],axis=0)/r_toal
            a[i]=r_toal/len(x)
        step+=1
    C=[]
    for i in range(k):
        C.append([])
    for j in range(len(x)):
        c_j=np.argmax(r[j,:])
        C[c_j].append(x[j])
    return C,a,u,cov

 驗證

res,A,U,COV=gauss_mixed_clustering(y)
import matplotlib.pyplot as plt
%matplotlib inline
colors=['green','blue','red','black','yellow','orange']
for i in range(len(res)):
    plt.scatter([d[0] for d in res[i]],[d[1] for d in res[i]],color=colors[i],label=str(i))
plt.scatter([d[0] for d in U],[d[1] for d in U],color=colors[-1],marker='^',label='center')

plt.xlabel('density')
plt.ylabel('suger')
plt.legend()

 50輪后

# 100輪
res,A,U,COV=gauss_mixed_clustering(y,reload_params=[A,U,COV])
for i in range(len(res)):
    plt.scatter([d[0] for d in res[i]],[d[1] for d in res[i]],color=colors[i],label=str(i))
plt.scatter([d[0] for d in U],[d[1] for d in U],color=colors[-1],marker='^',label='center')
plt.xlabel('density')
plt.ylabel('suger')
plt.legend()

 

# 200輪后不再變化
res,A,U,COV=gauss_mixed_clustering(y,epochs=100,reload_params=[A,U,COV])
for i in range(len(res)):
    plt.scatter([d[0] for d in res[i]],[d[1] for d in res[i]],color=colors[i],label=str(i))
plt.scatter([d[0] for d in U],[d[1] for d in U],color=colors[-1],marker='^',label='center')
plt.xlabel('density')
plt.ylabel('suger')
plt.legend()

 

 總結

這個算法本身不復雜,可能涉及到矩陣求導的部分會麻煩一點。西瓜數據集太小了,收斂非常快。然后,這個算法同樣對於初值敏感。

 


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM