1. t-SNE
- t-SNE: t-distributed Stochastic Neighbor Embedding
- Although it targets nonlinear dimensionality reduction of high-dimensional data, it is rarely used for general-purpose reduction, because
- it is better suited to visualization and to checking how well a model works
- it keeps the low-dimensional layout of the data highly similar to the distribution in the original feature space
so it is especially useful for visually inspecting a classifier's results.
1.1 Reproducing the demo
# Import TSNE and matplotlib (needed for the plot below)
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
# samples and variety_numbers are preloaded by the exercise environment
# Create a TSNE instance: model
model = TSNE(learning_rate=200)
# Apply fit_transform to samples: tsne_features
tsne_features = model.fit_transform(samples)
# Select the 0th feature: xs
xs = tsne_features[:,0]
# Select the 1st feature: ys
ys = tsne_features[:,1]
# Scatter plot, coloring by variety_numbers
plt.scatter(xs, ys, c=variety_numbers)
plt.show()
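The demo above relies on `samples` and `variety_numbers` being preloaded. A minimal self-contained variant (my own sketch, using the iris dataset as a stand-in):
# Self-contained t-SNE sketch; iris stands in for the preloaded samples
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

X, y = load_iris(return_X_y=True)

# TSNE has no separate transform(); it must be refit for every dataset,
# which is why only fit_transform() is used
tsne = TSNE(n_components=2, learning_rate=200, random_state=42)
embedding = tsne.fit_transform(X)

plt.scatter(embedding[:, 0], embedding[:, 1], c=y)
plt.show()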
2. PCA
Principal component analysis (PCA) performs feature extraction: it builds new features on top of the original ones, each new feature being a linear combination of the originals, and thereby reduces the dimensionality. PCA is not the only dimensionality-reduction technique, though.
- When there are many feature variables, multicollinearity often exists among them.
- PCA reduces the dimensionality of high-dimensional data by extracting its principal components.
- PCA "kills two birds with one stone":
- it selects representative features,
- and the resulting features are pairwise linearly uncorrelated
- In short: it finds the best linear combinations of the original feature space.
There is a very easy-to-understand example of this on Zhihu.
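As a quick numeric illustration of "linear combinations that are uncorrelated" (a toy sketch of my own, not the Zhihu example):
# Toy sketch: components_ holds the linear-combination weights, and its
# rows are orthonormal directions in the original feature space
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=100)  # inject correlation

pca = PCA()
pca.fit(X)

print(pca.components_)                                   # weights of each new feature
print(np.round(pca.components_ @ pca.components_.T, 6))  # ~ identity matrix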
2.1 Mathematical derivation
See the post 【機器學習】降維——PCA(非常詳細) ("[Machine Learning] Dimensionality reduction: PCA, very detailed").
Making sense of principal component analysis, eigenvectors & eigenvalues
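In brief (my own summary of the linked posts, not a full proof): center the data, form the covariance matrix, and pick the unit direction that maximizes the projected variance.

$$
C = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})(x_i-\bar{x})^{\top},
\qquad
w_1 = \arg\max_{\|w\|=1} w^{\top} C w
$$

The maximizer is the eigenvector of $C$ with the largest eigenvalue, and that eigenvalue is exactly the variance captured by the first principal component; later components are the remaining eigenvectors, taken in decreasing order of eigenvalue.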
2.2 Example
sklearn ships a ready-made implementation that can be used directly:
from sklearn.decomposition import PCA
# Perform the necessary imports
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
# grains is preloaded by the exercise: column 0 is width, column 1 is length
# Assign the 0th column of grains: width
width = grains[:,0]
# Assign the 1st column of grains: length
length = grains[:,1]
# Scatter plot width vs length
plt.scatter(width, length)
plt.axis('equal')
plt.show()
# Calculate the Pearson correlation
correlation, pvalue = pearsonr(width, length)
# Display the correlation
print(correlation)
# Import PCA
from sklearn.decomposition import PCA
# Create PCA instance: model
model = PCA()
# Apply the fit_transform method of model to grains: pca_features
pca_features = model.fit_transform(grains)
# Assign 0th column of pca_features: xs
xs = pca_features[:,0]
# Assign 1st column of pca_features: ys
ys = pca_features[:,1]
# Scatter plot xs vs ys
plt.scatter(xs, ys)
plt.axis('equal')
plt.show()
# Calculate the Pearson correlation of xs and ys
correlation, pvalue = pearsonr(xs, ys)
# Display the correlation
print(correlation)
<script.py> output:
2.5478751053409354e-17
The correlation of the two PCA features is on the order of 1e-17, i.e. zero up to floating-point noise: PCA has decorrelated the measurements.
2.3 Intrinsic dimension
The intrinsic dimension is the number of features actually needed to approximate the dataset. In PCA terms, you extract the principal components (the best linear combinations) and count how many of them carry significant variance.
2.3.1 Extracting principal components
.n_components_
A common rule of thumb is to keep components that together explain at least 80% of the total variance, though the right cutoff depends on the problem; a selection sketch follows the code below.
# Perform the necessary imports
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import matplotlib.pyplot as plt
# Create scaler: scaler
scaler = StandardScaler()
# Create a PCA instance: pca
pca = PCA()
# Create pipeline: pipeline
pipeline = make_pipeline(scaler,pca)
# Fit the pipeline to 'samples'
pipeline.fit(samples)
# Plot the explained variances
features = range(pca.n_components_)
plt.bar(features, pca.explained_variance_)
plt.xlabel('PCA feature')
plt.ylabel('variance')
plt.xticks(features)
plt.show()
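For the 80% rule of thumb mentioned above, `n_components` also accepts a float: with `svd_solver='full'`, sklearn keeps just enough components to reach that fraction of variance (a sketch; `samples` and `scaler` are from the pipeline above):
# Keep the smallest number of components explaining at least 80% of variance
pca80 = PCA(n_components=0.80, svd_solver='full')
pca80.fit(scaler.fit_transform(samples))

print(pca80.n_components_)                       # number of components kept
print(pca80.explained_variance_ratio_.cumsum())  # cumulative variance explained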
2.3.2 Dimension reduction with PCA
PCA-based dimensionality reduction keeps only the leading principal components.
Here is a small example of text feature extraction.
(I do not fully understand this part yet; I will fill it in after studying it.)
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# Create a TfidfVectorizer: tfidf
tfidf = TfidfVectorizer()
# Apply fit_transform to document: csr_mat
csr_mat = tfidf.fit_transform(documents)
# Print result of toarray() method
print(csr_mat.toarray())
# Get the words: words (newer sklearn versions: use get_feature_names_out())
words = tfidf.get_feature_names()
# Print words
print(words)
The input documents were: documents = ['cats say meow', 'dogs say woof', 'dogs chase cats']
<script.py> output:
[[0.51785612 0. 0. 0.68091856 0.51785612 0. ]
[0. 0. 0.51785612 0. 0.51785612 0.68091856]
[0.51785612 0.68091856 0.51785612 0. 0. 0. ]]
['cats', 'chase', 'dogs', 'meow', 'say', 'woof']
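One practical note: `csr_mat` is a sparse matrix, and sklearn's `PCA` generally does not accept sparse input, so the usual choice for TF-IDF features is `TruncatedSVD` (a sketch under that assumption):
# Reduce the sparse TF-IDF matrix with TruncatedSVD instead of PCA,
# which would require densifying the matrix first
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=2)
reduced = svd.fit_transform(csr_mat)  # csr_mat comes from the TF-IDF step above
print(reduced.shape)                  # (3, 2): 3 documents, 2 components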