PCA: Dimensionality Reduction and Choosing the Number of Dimensions


Overview

PCA (principal component analysis) applies the idea of dimensionality reduction to convert a large number of indicators into a few composite indicators.

Principal component analysis is a technique for simplifying a data set. It is a linear transformation that maps the data into a new coordinate system such that the largest variance of any projection of the data lies along the first coordinate (called the first principal component), the second largest variance along the second coordinate (the second principal component), and so on. PCA is often used to reduce the dimensionality of a data set while retaining the features that contribute most to its variance. This is achieved by keeping the low-order principal components and discarding the high-order ones; the low-order components usually preserve the most important aspects of the data.

The principle of PCA is to project the original samples into a new space: a coordinate-transformation matrix maps the original sample space into new coordinates, whose axes are the eigenvectors associated with the largest eigenvalues of the covariance matrix computed across the dimensions of the original data. The eigenvectors with small eigenvalues are dropped as non-principal components, so the retained principal components represent the original data while reducing its complexity.
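As a quick sanity check of this principle, the following NumPy sketch (synthetic 2-D data; Python is used here purely for illustration, the article's own experiments are in MATLAB) projects correlated samples onto the eigenvectors of their covariance matrix and confirms that the variances along the new axes equal the eigenvalues, in decreasing order:

```python
import numpy as np

# Minimal 2-D illustration: the eigenvectors of the covariance matrix
# give the directions of largest (then second-largest) variance.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
X = np.column_stack([x, 0.8 * x + 0.2 * rng.normal(size=500)])  # correlated columns

Xc = X - X.mean(axis=0)           # zero-mean each column
C = Xc.T @ Xc / len(Xc)           # covariance matrix
vals, vecs = np.linalg.eigh(C)    # eigh returns ascending eigenvalues
order = np.argsort(vals)[::-1]    # sort descending
vals, vecs = vals[order], vecs[:, order]

Y = Xc @ vecs                     # coordinates in the new basis
print(Y.var(axis=0))              # variances along the new axes = eigenvalues
```

The variance of the first new coordinate is the largest, matching the definition of the first principal component.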

Algorithm Steps

  1. Organize the \(n\) samples of \(m\)-dimensional data into a matrix \(X\in R^{n\times m}\), of the form:

    \[X=\begin{pmatrix}x_{11}&x_{12}&\cdots&x_{1m}\\x_{21}&x_{22}&\cdots&x_{2m}\\\vdots&\vdots&\ddots&\vdots\\x_{n1}&x_{n2}&\cdots&x_{nm}\end{pmatrix} \]

  2. Zero-mean every column of the sample matrix \(X\) to obtain the new matrix \(X^{\prime}\); writing \(\boldsymbol{x}_{j}\) for the \(j\)-th column,

    \[\boldsymbol{x}_{j} \leftarrow \boldsymbol{x}_{j}-\frac{1}{n} \sum_{i=1}^{n} x_{ij}\ ,\quad j=1,\dots,m \]

  3. Measure the correlation between the data dimensions; here we use the covariance matrix \(C\):

    \[C=\frac{1}{n}{X^\prime}^{T}X^\prime\ ,\ C\in R^{m\times m} \]

  4. Compute the eigenvalues of the covariance matrix \(C\) and their corresponding eigenvectors, with the eigenvalues sorted in descending order; column \(\boldsymbol{P}_{i}\) is the eigenvector belonging to \(\lambda_i\):

    \[\left(\boldsymbol{P}_{1}, \boldsymbol{P}_{2}, \ldots, \boldsymbol{P}_{m}\right)=\begin{pmatrix}p_{11}&p_{12}&\cdots&p_{1m}\\p_{21}&p_{22}&\cdots&p_{2m}\\\vdots&\vdots&\ddots&\vdots\\p_{m1}&p_{m2}&\cdots&p_{mm}\end{pmatrix}\ ,\ \lambda_1\geq\lambda_2\geq\cdots\geq\lambda_m \]

  5. According to the dimensionality-reduction requirement — say, reducing to \(k\) dimensions — take the first \(k\) eigenvectors to form the projection matrix \(P\), as follows:

    \[P=\left(\boldsymbol{P}_{1}, \boldsymbol{P}_{2}, \ldots, \boldsymbol{P}_{k}\right) ,\ P\in R^{m\times k} \]

  6. Transform the centered data \(X^{\prime}\) with the matrix \(P\), achieving dimensionality reduction and principal-component extraction:

    \[Y=X^{\prime}P\ ,\ Y\in R^{n\times k} \]
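The six steps above can be sketched in NumPy (a toy example on synthetic data; the sizes `n`, `m`, `k` are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k = 200, 5, 2                    # n samples, m dimensions, target dimension k
# Step 1: sample matrix X (mixed so the columns are correlated)
X = rng.normal(size=(n, m)) @ rng.normal(size=(m, m))

# Step 2: zero-mean every column
Xp = X - X.mean(axis=0)
# Step 3: covariance matrix C (m x m)
C = Xp.T @ Xp / n
# Step 4: eigenvalues/eigenvectors, sorted in descending order
vals, vecs = np.linalg.eigh(C)
idx = np.argsort(vals)[::-1]
vals, vecs = vals[idx], vecs[:, idx]
# Step 5: projection matrix P from the first k eigenvectors (m x k)
P = vecs[:, :k]
# Step 6: coordinate transform; Y holds the k principal components (n x k)
Y = Xp @ P
print(Y.shape)
```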

Computing the Reconstruction Error

After the projection, the data can be reconstructed from the retained components; the reconstruction error then measures how much information the dimensionality reduction has discarded. It is generally computed with the following formulas.

\[{error}_1=\frac{1}{m}\sum_{i=1}^{m}\left\|x^{\left(i\right)}-x_{approx}^{\left(i\right)}\right\|^2 \]

\[{error}_2=\frac{1}{m}\sum_{i=1}^{m}\left\|x^{\left(i\right)}\right\|^2 \]

where:

  • the \(m\) samples are \((x^{\left(1\right)},x^{\left(2\right)},\cdots,x^{\left(m\right)})\)
  • their reconstructions after projection are \((x_{approx}^{\left(1\right)},x_{approx}^{\left(2\right)},\cdots,x_{approx}^{\left(m\right)})\)

Their ratio

\[\eta=\frac{{error}_1}{{error}_2} \]

measures the fraction of information lost by the dimensionality reduction.
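These formulas can be illustrated in NumPy (synthetic data; `k = 2` is an arbitrary choice): project onto the top-k components, reconstruct in the original space, then form \(error_1\), \(error_2\), and \(\eta\):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))  # correlated samples
mu = X.mean(axis=0)
Xp = X - mu                                       # centered data

vals, vecs = np.linalg.eigh(Xp.T @ Xp / len(X))   # covariance eigendecomposition
P = vecs[:, np.argsort(vals)[::-1]][:, :2]        # keep k = 2 components

X_approx = Xp @ P @ P.T + mu                      # reconstruction in the original space
error1 = np.mean(np.sum((X - X_approx) ** 2, axis=1))
error2 = np.mean(np.sum(X ** 2, axis=1))
eta = error1 / error2
print(eta)                                        # small eta = little information lost
```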

Algorithm Description

Summarizing, the algorithm is described as follows:

Input: sample set \(D=\left\{\boldsymbol{x}_{1}, \boldsymbol{x}_{2}, \ldots, \boldsymbol{x}_{m}\right\}\); target dimensionality \(k\)

Procedure:

  1. Zero-mean all samples: \(\boldsymbol{x}_{i} \leftarrow \boldsymbol{x}_{i}-\frac{1}{m} \sum_{i=1}^{m} \boldsymbol{x}_{i}\)
  2. Compute the sample covariance matrix \(\mathbf{X X}^{\mathrm{T}}\) (here the samples are the columns of \(\mathbf{X}\));
  3. Perform an eigendecomposition of \(\mathbf{X X}^{\mathrm{T}}\);
  4. Take the eigenvectors of the \(k\) largest eigenvalues, \(\left(\boldsymbol{P}_{1}, \boldsymbol{P}_{2}, \ldots, \boldsymbol{P}_{k}\right)\);
  5. Transform the data: \(Y=P^{T}X\ ,\ Y\in R^{k\times m}\)

Output: the transformed matrix \(Y=P^{T}X\ ,\ Y\in R^{k\times m}\)

Implementation

Dataset

The data used are the EMG1, EMG2, …, EMG8 channels under Imported Analog EMG – Voltage.

Code

fileName = 'c:\Users\Administrator\Desktop\機器學習作業\PCA\pcaData1.csv';
X = csvread(fileName);          % rows are samples, columns are the 8 EMG channels
[n, m] = size(X);
% Zero-mean the data: subtract the column means from every column
mu = mean(X, 1);
A = X - repmat(mu, n, 1);
% Covariance matrix of the channels
C = A' * A / n;
% Eigenvalues and eigenvectors of the covariance matrix
% (for a symmetric PSD matrix, svd returns them sorted in descending order)
[U, S, V] = svd(C);
% Number of dimensions k to keep; evaluated for k = 1 to m-1
k = 8;
% Projection matrix: the first k eigenvectors
P = U(:, 1:k);
% Project the centered samples
Y = A * P;
% Reconstruct the samples from the projection
% (P has orthonormal columns, so pinv(P) = P')
XR = Y * P' + repmat(mu, n, 1);
% Reconstruction error and its ratio to the data magnitude
err1 = sum(sum((X - XR).^2)) / n;
err2 = sum(sum(X.^2)) / n;
eta = err1 / err2
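For readers without MATLAB, here is a rough NumPy equivalent of the script above, wrapped as a function. Since the CSV path is local to the original author's machine, synthetic 8-channel data stands in for the EMG recordings:

```python
import numpy as np

def pca_eta(X, k):
    """Return the reconstruction-error ratio eta for a rank-k PCA of X."""
    n = len(X)
    mu = X.mean(axis=0)
    A = X - mu                                   # zero-mean each column
    vals, vecs = np.linalg.eigh(A.T @ A / n)     # covariance eigendecomposition
    P = vecs[:, np.argsort(vals)[::-1]][:, :k]   # top-k eigenvectors
    XR = A @ P @ P.T + mu                        # reconstruction
    err1 = np.mean(np.sum((X - XR) ** 2, axis=1))
    err2 = np.mean(np.sum(X ** 2, axis=1))
    return err1 / err2

# Stand-in for the 8-channel EMG data (the CSV above is not available here)
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 8)) @ rng.normal(size=(8, 8))
etas = [pca_eta(X, k) for k in range(1, 9)]
print(etas)   # eta shrinks as k grows; at k = 8 the reconstruction is exact
```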

Results and Analysis

The computed eigenvalues and their corresponding projection directions are:

\(\lambda_1=1.8493\), projection direction \((-0.0164,0.0300,-0.2376,0.4247,-0.6717,0.2356,-0.2196,0.4551)\)

\(\lambda_2=1.3836\), projection direction \((0.0910,0.1724,-0.0097,-0.8267,-0.1464,0.3599,0.0025,0.3570)\)

\(\lambda_3=0.5480\), projection direction \((-0.1396,-0.4457,-0.1668,0.0870,0.2812,0.7696,-0.1742,-0.2115)\)

\(\lambda_4=0.4135\), projection direction \((0.0622,0.1782,0.3136,-0.0080,-0.5387,0.2841,0.3300,-0.6214)\)

\(\lambda_5=0.3218\), projection direction \((0.2126,-0.7813,0.3136,-0.0080,-0.5387,0.2841,0.3300,-0.6214)\)

\(\lambda_6=0.1322\), projection direction \((-0.0959,0.0340,-0.6943,0.0068,0.0269,0.0042,0.7119,0.0064)\)

\(\lambda_7=0.0620\), projection direction \((0.8881,-0.0497,-0.3407,-0.0198,-0.0103,-0.0424,-0.2075,-0.2176)\)

\(\lambda_8=9.5959\times 10^{-17}\), projection direction \((0.3536,0.3536,0.3536,0.3536,0.3536,0.3536,0.3536,0.3536)\)

The error ratio for different values of \(k\) is:

k   reconstruction error ratio η
1   0.8265
2   0.7105
3   0.6499
4   0.5940
5   0.5521
6   0.5294
7   0.5162
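The table suggests choosing the smallest k whose η is acceptable. For centered data, \(error_1\) of a rank-k projection equals the sum of the discarded eigenvalues, so k can be chosen from the spectrum alone; a sketch under that assumption (synthetic data; the 10% tolerance is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 8)) @ rng.normal(size=(8, 8))
A = X - X.mean(axis=0)                      # work with centered data

n = len(A)
vals = np.sort(np.linalg.eigvalsh(A.T @ A / n))[::-1]   # descending spectrum

# For centered data, error_1 of a rank-k projection is the sum of the
# discarded eigenvalues, so eta(k) = sum(vals[k:]) / sum(vals).
eta = np.array([vals[k:].sum() / vals.sum() for k in range(1, len(vals) + 1)])

tol = 0.10                                  # tolerate at most 10% reconstruction error
k_best = int(np.argmax(eta <= tol)) + 1     # smallest k meeting the tolerance
print(k_best, eta)
```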


