Basic concepts:
In data processing we often run into situations where the number of feature dimensions is far larger than the number of samples. Feeding such data directly into a model in a real project does not necessarily work well: first, redundant features introduce noise that distorts the results; second, irrelevant features inflate the amount of computation and waste time and resources. So we usually transform the data before fitting a model. The purpose of such transformations is not only dimensionality reduction; they can also remove correlations between features and reveal latent feature variables.
Purpose of PCA:
PCA is a method for reducing the dimensionality of data while losing as little information as possible. Typically, we want to project the original feature space (n samples of dimension d) onto a smaller subspace that still represents the data well (i.e. with minimal information loss). A common use is in pattern recognition: by reducing the dimensionality of the feature space and extracting the data in the subspace, we obtain a compact representation of the data and reduce the error of parameter estimation. Note that PCA is usually computed from either the covariance matrix or the correlation matrix, both of which can be calculated from the raw data. The covariance matrix contains the sums of squares and cross-products of the centered variables; the correlation matrix is essentially the covariance matrix computed on standardized data. If the variances of the variables differ greatly, or the variables are measured in different units, the data must be standardized before running PCA.
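For concreteness, here is a minimal sketch (using made-up NumPy data, not the machine data analyzed below) of the relationship just described: the correlation matrix is simply the covariance matrix of the standardized data, which is why standardization is needed when variable scales differ.

import numpy as np

# Made-up data: 100 samples, 3 features with very different scales.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3)) * np.array([1.0, 10.0, 1000.0])

# Covariance matrix of the raw data: dominated by the large-scale feature.
print(np.cov(X_demo, rowvar=False))

# Standardize each column, then take the covariance matrix: this equals
# the correlation matrix of the raw data (up to floating-point error).
X_demo_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0, ddof=1)
print(np.allclose(np.cov(X_demo_std, rowvar=False),
                  np.corrcoef(X_demo, rowvar=False)))  # True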
Steps of a small PCA example:
Raw data

Covariance matrix

Eigenvalues and eigenvectors

Diagonalize the covariance matrix: set every off-diagonal element to 0 and arrange the diagonal elements (the eigenvalues) in descending order from top to bottom

Dimensionality reduction: project the data onto the eigenvectors of the largest eigenvalues

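The matrices from the original worked example are not reproduced above, so here is a small self-contained sketch (with made-up 2-D data, purely illustrative) that runs through the same steps in NumPy: covariance matrix, eigendecomposition, sorting the eigenvalues in descending order, and projecting onto the leading eigenvector.

import numpy as np

# Made-up 2-D data (5 samples); the numbers are purely illustrative.
data = np.array([[2.5, 2.4],
                 [0.5, 0.7],
                 [2.2, 2.9],
                 [1.9, 2.2],
                 [3.1, 3.0]])

# Step 1: center the data (subtract the column means).
centered = data - data.mean(axis=0)

# Step 2: covariance matrix (features in columns).
cov = np.cov(centered, rowvar=False)

# Step 3: eigenvalues and eigenvectors of the covariance matrix.
vals, vecs = np.linalg.eig(cov)

# Step 4: "diagonalization" -- order the eigenvalues (and their
# eigenvectors) from largest to smallest variance.
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# Step 5: dimensionality reduction -- project onto the top eigenvector.
reduced = centered.dot(vecs[:, :1])
print(vals)
print(reduced)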
Main text:
Requirement: take five-dimensional measurements of different machines, reduce the dimensionality, and then cluster, so that outlier machines can be identified.
import numpy as np
import pandas as pd

df = pd.read_csv('data2.csv')
df['label'] = 'machine'   # add a constant label column, used for plotting later
df.head()

# X takes the first 5 feature columns, y takes the label column
# (df.ix is deprecated; use iloc / column selection instead)
X = df.iloc[:, 0:5].values
y = df['label'].values
from matplotlib import pyplot as plt
import math

# Standardize the data (zero mean, unit variance per feature)
from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(X)
print(X_std)

# Construct the covariance matrix by hand
mean_vec = np.mean(X_std, axis=0)
cov_mat = (X_std - mean_vec).T.dot((X_std - mean_vec)) / (X_std.shape[0]-1)
print(cov_mat)

# Compute eigenvalues and eigenvectors (np.cov gives the same covariance matrix)
cov_mat = np.cov(X_std.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)
The eigenvectors form the following 5x5 matrix:

The eigenvalues are:

# Pair each eigenvalue with its eigenvector, so we can rank them by explained variance
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:, i]) for i in range(len(eig_vals))]
print(eig_pairs)
print('----------')

# Sort the pairs by eigenvalue in descending order (this does not touch the original data)
eig_pairs.sort(key=lambda x: x[0], reverse=True)
print('Eigenvalues in descending order:')
for i in eig_pairs:
    print(i)

# Compute each eigenvalue's share of the total variance (explained variance ratio)
# and its cumulative sum
tot = sum(eig_vals)
var_exp = [(i / tot) * 100 for i in sorted(eig_vals, reverse=True)]
print(var_exp)
cum_var_exp = np.cumsum(var_exp)
cum_var_exp

# The first two eigenvalues clearly dominate, so reducing to 2 dimensions is reasonable.
# Keeping 3 dimensions would retain a bit more information, but also adds complexity.
# Visualize the explained variance of each component
plt.figure(figsize=(6, 4))
plt.bar(range(5), var_exp, alpha=0.5, align='center',
        label='individual explained variance')
plt.step(range(5), cum_var_exp, where='mid',
         label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.legend(loc='best')
plt.tight_layout()
plt.show()

# Build a 5 x 2 projection matrix from the top two eigenvectors
matrix_w = np.hstack((eig_pairs[0][1].reshape(5, 1),
                      eig_pairs[1][1].reshape(5, 1)))
print('Matrix W:\n', matrix_w)

# Dimensionality reduction: multiply the standardized data by the 5x2 projection matrix
Y = X_std.dot(matrix_w)
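As a sanity check (not part of the original walkthrough), the same projection can be reproduced with scikit-learn's PCA on the standardized data; individual components may come out with flipped signs, since the sign of an eigenvector is arbitrary.

from sklearn.decomposition import PCA

# Fit PCA with 2 components on the standardized data.
pca = PCA(n_components=2)
Y_sklearn = pca.fit_transform(X_std)

# Explained variance ratios should match the manual var_exp values above.
print(pca.explained_variance_ratio_ * 100)

# The projected coordinates should equal Y up to a sign flip per component.
print(Y[:5])
print(Y_sklearn[:5])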

# Scatter plot of the un-reduced data (first two raw features)
plt.figure(figsize=(6, 4))
for lab, col in zip(('machine',),
                    ('blue',)):
    plt.scatter(X[y == lab, 0],
                X[y == lab, 1],
                label=lab,
                c=col)
plt.xlabel('feature 1')
plt.ylabel('feature 2')
plt.legend(loc='best')
plt.tight_layout()
plt.show()

# Scatter plot of the data after dimensionality reduction
plt.figure(figsize=(6, 4))
for lab, col in zip(('machine',),
                    ('blue',)):
    plt.scatter(Y[y == lab, 0],
                Y[y == lab, 1],
                label=lab,
                c=col)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(loc='lower center')
plt.tight_layout()
plt.show()

--------------------------------------------------------------------------------
Apply DBSCAN clustering to the dimensionality-reduced data
Basic concepts:
Core point: a point whose density reaches the threshold set by the algorithm (i.e. the number of points within its r-neighborhood is at least minPts).
Directly density-reachable: if a point p lies within the r-neighborhood of a point q, and q is a core point, then p is directly density-reachable from q.
Density-reachable: if there is a sequence of points q0, q1, ..., qk such that each qi is directly density-reachable from qi-1, then qk is density-reachable from q0; this is essentially the "propagation" of direct density-reachability.
ϵ-neighborhood distance threshold: the chosen radius r.
Border point: a non-core point that belongs to some cluster; it cannot extend the cluster any further.
Noise point: a point that belongs to no cluster and is not density-reachable from any core point.
Workflow:
Parameter D: the input data set
Parameter ϵ: the chosen radius
MinPts: the density threshold
Parameter selection:
Radius ϵ: can be chosen from the k-distance curve by looking for the abrupt change (the "knee"); see the sketch after this list.
K-distance: given a data set P = {p(i); i = 0, 1, ..., n}, for each point p(i) compute its distances to all other points and sort them in ascending order; the k-th smallest distance d(k) is called the k-distance.
MinPts: the k used in the k-distance; usually a small value, tuned over several attempts.
Advantages:
No need to specify the number of clusters
Good at finding outliers (useful for anomaly-detection tasks)
Can discover clusters of arbitrary shape
Only two parameters are needed
Disadvantages:
Struggles with high-dimensional data (dimensionality reduction can help)
The sklearn implementation can be slow on large data sets (consider data-reduction strategies)
The parameters are hard to choose (and have a very large impact on the result)
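Here is a minimal sketch of the k-distance heuristic mentioned in the parameter-selection list above, assuming the PCA-projected data Y from the previous section and a hypothetical k = 2 (matching min_samples in the DBSCAN call below); the "knee" of the sorted curve suggests a reasonable eps.

import numpy as np
from matplotlib import pyplot as plt
from sklearn.neighbors import NearestNeighbors

k = 2  # hypothetical choice; keep it equal to min_samples used for DBSCAN

# For each point, find the distances to its k nearest neighbours
# (n_neighbors=k+1 because the closest "neighbour" is the point itself).
nbrs = NearestNeighbors(n_neighbors=k + 1).fit(Y)
distances, _ = nbrs.kneighbors(Y)

# Sort each point's k-th nearest-neighbour distance in ascending order;
# the sharp bend ("knee") of this curve is a candidate value for eps.
k_distances = np.sort(distances[:, k])
plt.plot(k_distances)
plt.xlabel('points sorted by k-distance')
plt.ylabel('{}-distance'.format(k))
plt.show()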
Main text:
import numpy as np              # data structures
import sklearn.cluster as skc   # density-based clustering
from sklearn import metrics     # model evaluation

# DBSCAN clustering; other parameters exist, e.g. metric to choose the distance function
db = skc.DBSCAN(eps=2.5, min_samples=2).fit(Y)
# labels has the same length as Y; labels[i] is the index of the cluster that
# sample i belongs to, and -1 marks a noise point
labels = db.labels_
print('Cluster label of each sample:')
print(labels)
ratio = len(labels[labels[:] == -1]) / len(labels)  # fraction of noise points
print('Noise ratio:', format(ratio, '.2%'))
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)  # number of clusters
print('Number of clusters: %d' % n_clusters_)
print("Silhouette coefficient: %0.3f" % metrics.silhouette_score(Y, labels))  # clustering quality
for i in range(n_clusters_):
    print('Samples in cluster %d:' % i)
    one_cluster = Y[labels == i]
    print(one_cluster)
    plt.plot(one_cluster[:, 0], one_cluster[:, 1], 'o')
plt.plot(Y[labels == -1][:, 0], Y[labels == -1][:, 1], 'p')
plt.show()

Cluster label of each sample:
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 -1 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Noise ratio: 0.54%
Number of clusters: 2
Silhouette coefficient: 0.736
Samples in cluster 0:
[[ 9.96854482e-01 6.19896578e-01]
[ 1.08138246e+00 4.71724901e-01]
[ 1.17571829e+00 7.00808757e-01]
[ 1.39995504e+00 7.96744932e-01]
[ 6.73080835e-01 3.19912177e-01]
[ 1.02590239e+00 6.68824085e-01]
[ 1.06530721e+00 7.07484773e-01]
[ 1.11189776e+00 8.39291101e-01]
[ 3.14722562e+00 3.27367426e-01]
[ 3.21549700e+00 3.24569658e-01]
[ 3.16082539e+00 4.44035187e-01]
[ 3.35548775e+00 3.73044767e-01]
[ 2.98229019e+00 1.72184325e+00]
[ 3.08645455e+00 3.42078565e-01]
[ 2.37094213e-02 8.72146114e-01]
[ 6.84628335e-01 6.30018499e-01]
[ 8.93515804e-01 3.86466713e-01]
[ 6.92034898e-01 1.05251158e+00]
[ 7.67477265e-01 6.83223943e-01]
[ 8.81635814e-01 5.87326858e-01]
[ 5.51545794e-01 9.57492502e-01]
[ 6.65530470e-01 7.47280594e-01]
[ 6.59982303e-01 7.30787717e-01]
[ 7.97399766e-01 6.88481483e-01]
[ 7.60785976e-01 -6.20768986e-02]
[ 1.15068696e+00 -1.28522051e+00]
[-1.01029051e-01 2.17385293e-02]
[ 4.55246274e-01 2.41142183e-01]
[-3.28463613e-01 -7.63226503e-01]
[ 1.08255376e-01 -1.26750744e+00]
[ 1.88234094e-01 -7.66753674e-01]
[ 7.81723172e-01 -7.03755924e-01]
[ 4.88350439e-01 -1.03386408e+00]
[-1.17478514e+00 4.32911156e-01]
[-6.00831989e-01 -2.49343240e-01]
[-6.76806838e-01 -3.18765479e-01]
[-4.11073705e-01 -9.41105284e-01]
[-8.62394297e-01 -2.85851329e-01]
[-7.93770724e-01 -2.41953487e-01]
[-6.24775559e-01 -2.78679115e-01]
[-5.81710593e-01 7.60832837e-01]
[-9.11707243e-01 6.16547904e-01]
[-8.65850835e-01 7.61228564e-01]
[-8.95841256e-01 7.25747391e-01]
[-2.97435552e-01 -6.63055497e-01]
[-6.69001053e-01 -3.02789798e-01]
[-7.10883112e-01 -2.37620685e-01]
[-5.96463645e-01 -3.73686091e-01]
[-1.04557102e+00 3.73146260e-01]
[-1.19140463e+00 5.71983742e-01]
[-2.37932735e+00 2.20144066e+00]
[-5.88125083e-01 -3.53114799e-01]
[-6.65952275e-01 -8.21979420e-01]
[-1.47101107e-01 -3.98421552e-01]
[-3.28115960e-01 -3.28822138e-01]
[ 1.71932280e-01 -1.84936809e-01]
[-2.11203864e-01 -1.30817031e+00]
[-2.13478421e-01 -1.74130561e+00]
[-4.65828329e-01 -1.35592284e+00]
[-4.70845360e-01 -1.22311301e+00]
[-6.16090086e-01 -4.31738814e-01]
[-8.40594427e-01 -3.54071331e-01]
[-9.97703877e-01 -4.26931248e-01]
[ 8.30405171e-01 1.40241740e+00]
[-1.56380804e+00 4.06810192e-01]
[-1.50387840e+00 4.51771017e-01]
[-1.44418161e+00 2.51771913e-01]
[-1.52901957e+00 4.22767721e-01]
[-1.46765387e+00 4.69984230e-01]
[-1.21729599e+00 -5.60195905e-01]
[-1.22599765e+00 -4.69268926e-01]
[-1.29160008e+00 -4.68647997e-01]
[-1.16770311e+00 -5.06891360e-01]
[-1.32733179e+00 -5.43739520e-01]
[-1.28797606e+00 -5.87222210e-01]
[-1.31184457e+00 -6.00027185e-01]
[-1.16723893e+00 -5.00196629e-01]
[ 5.25889754e+00 1.38681160e+00]
[ 4.83020633e+00 1.60052457e+00]
[ 2.88540988e+00 3.77353302e-01]
[ 3.16501658e+00 6.11262973e-01]
[ 3.20073856e+00 5.29818489e-01]
[ 3.13011049e+00 1.31522108e+00]
[-1.90442840e-01 -6.25108065e-01]
[-4.80266056e-01 -2.62477854e-01]
[-3.13268912e-01 -3.81018964e-01]
[-4.78508983e-01 -1.89609826e-01]
[-1.71396770e-01 -2.74308409e-01]
[ 2.01161735e-01 -9.36331840e-02]
[-2.73763626e-01 -2.42025105e-01]
[-3.39595861e-01 -1.95241273e-01]
[-4.30473172e-01 8.83471269e-01]
[-7.86812641e-01 8.54301623e-01]
[-6.23150134e-01 8.16906329e-01]
[-7.37928790e-01 7.77305066e-01]
[-3.85207757e-01 -3.47488032e-01]
[-2.76369410e-01 -5.47225197e-02]
[-2.62866587e-01 -2.47706939e-01]
[-4.05693574e-01 -1.33211603e-01]
[-1.06883370e+00 4.64087658e-01]
[-1.09822647e+00 6.46430738e-01]
[-2.37367370e+00 2.18477030e+00]
[-2.88973398e-01 -1.56221787e-01]
[-3.10411358e-01 -1.28697344e-01]
[-8.03671265e-01 -4.38452971e-01]
[-9.01509608e-01 -4.84502970e-01]
[-8.26746712e-01 -4.64530516e-01]
[-7.89727574e-01 -5.22936335e-01]
[-7.30479915e-01 -3.87994232e-01]
[-2.80043634e+00 2.23689253e+00]
[-5.68881745e-01 -6.51584309e-01]
[-7.82237337e-01 -4.51296126e-01]
[-7.36060324e-01 7.15270569e-02]
[-1.19768310e+00 -1.16800394e-01]
[-9.73380913e-01 6.59565280e-02]
[-1.12321350e+00 -9.67460101e-02]
[-1.06259743e+00 -8.27447208e-02]
[ 2.48644946e+00 2.15868772e+00]
[-1.45314882e+00 2.43476167e-01]
[-1.52641231e+00 3.11302328e-01]
[-1.33779156e+00 3.72097353e-01]
[-1.37873757e+00 4.89699573e-01]
[-1.37639570e+00 4.18616151e-01]
[-8.09774354e-01 -5.08058713e-01]
[-9.31131620e-01 -1.42751196e-01]
[-4.53649994e-01 -7.72097038e-02]
[-6.65236344e-01 -2.20065071e-01]
[-6.44250028e-01 -3.73374971e-01]
[-8.42391041e-01 -6.19906940e-01]
[ 1.32781239e-01 -5.22812674e-02]
[-8.10169381e-01 -6.26064995e-01]
[-5.72634711e-01 4.18058813e-01]
[-1.16497509e+00 7.24752982e-01]
[-1.07845985e+00 7.36963143e-01]
[-9.26271043e-01 7.81711660e-01]
[-5.90426126e-01 -6.05501838e-01]
[-8.26077963e-01 -1.53064233e-01]
[-7.92101529e-01 -1.04694572e-01]
[-7.06320482e-01 -2.54049986e-01]
[-1.46436457e+00 6.52207754e-01]
[-1.47064051e+00 5.13823003e-01]
[-2.08998281e+00 1.07339874e+00]
[-9.24745013e-01 -1.05378350e-01]
[-8.95026540e-01 -8.91345628e-02]
[-8.07888497e-01 -6.65654322e-02]
[-6.96084433e-01 -3.98815263e-01]
[-7.27744516e-01 -6.39194870e-01]
[-6.29664044e-01 -7.41390880e-01]
[-8.23277654e-01 -4.96035188e-01]
[-6.52040434e-01 -9.71113524e-01]
[ 4.63635478e+00 1.40947821e+00]
[ 4.55603278e+00 1.22052016e+00]
[ 3.39684550e+00 1.47199654e+00]
[ 3.43902228e+00 1.47032924e+00]
[ 3.44439952e+00 1.33884706e+00]
[ 3.43749952e-01 4.92195283e-01]
[ 8.93159103e-01 5.54338159e-01]
[ 7.86428912e-01 5.42368005e-01]
[ 7.67590916e-01 5.68011165e-01]
[ 7.83647530e-01 5.16254564e-01]
[ 8.11704348e-01 5.02071745e-01]
[ 7.77892641e-01 4.64009450e-01]
[ 8.75448669e-01 4.66897705e-01]
[ 7.38859628e-01 4.21602580e-01]
[-1.24757204e+00 -6.99340120e-02]
[-8.72636812e-01 -4.42551962e-01]
[-1.13063387e+00 -3.39702802e-02]
[-1.30316761e+00 -3.36274459e-02]
[-1.12978717e+00 9.41996543e-02]
[-1.31048184e+00 -1.88393972e-02]
[-1.05546107e+00 -4.26031697e-03]
[-1.23538597e+00 5.55172607e-02]
[-1.19462472e+00 1.43943069e-02]
[-1.05422060e+00 7.16215092e-02]
[-1.18828910e+00 1.54679147e-02]
[-1.24832171e+00 -3.30616363e-02]
[-1.20834955e+00 -2.89264455e-02]
[-1.12933915e+00 2.71400462e-02]
[-1.11865886e+00 3.30991112e-02]
[-1.20204568e+00 -9.82951013e-03]]
Samples in cluster 1:
[[ 3.58349685 -6.91918766]
[ 3.62286913 -7.76192187]
[ 2.95477472 -6.43354311]
[ 2.02711371 -4.4162005 ]]