Python_sklearn Machine Learning Library Study Notes (1): Feature Extraction and Preprocessing

This article is a repost. 2016-12-29 10:24

# Extracting features from categorical variables

# One-hot encoding of a categorical variable
from sklearn.feature_extraction import DictVectorizer
onehot_encoder = DictVectorizer()
instance = [{'city': 'New York'}, {'city': 'San Francisco'},
            {'city': 'Chapel Hill'}]
print(onehot_encoder.fit_transform(instance).toarray())
Output:

[[ 0.  1.  0.]
 [ 0.  0.  1.]
 [ 1.  0.  0.]]

# Extracting features from text

## The bag-of-words representation

# Bag-of-words model.
# By default, CountVectorizer lowercases each document and uses a regular expression to extract tokens made of two or more word characters. The scikit-learn code is as follows:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game'
]
vectorizer = CountVectorizer()
# todense() converts the sparse matrix into a dense feature matrix
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)

Output:

[[1 1 0 1 0 1 0 1]
 [1 1 1 0 1 0 1 0]]
{'unc': 7, 'played': 5, 'duke': 1, 'in': 3, 'basketball': 0, 'lost': 4, 'the': 6, 'game': 2}
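
To make the mapping from text to counts concrete, here is a minimal sketch that rebuilds the same matrix by hand. It assumes only CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters, vocabulary indexed in sorted order); the helper name `tokenize` is hypothetical and not part of scikit-learn.

```python
import re
from collections import Counter

corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game'
]

# Two or more word characters, mirroring CountVectorizer's default token pattern
TOKEN_PATTERN = re.compile(r'\b\w\w+\b')

def tokenize(document):
    # Hypothetical helper: lowercase, then extract tokens
    return TOKEN_PATTERN.findall(document.lower())

tokenized = [tokenize(document) for document in corpus]

# Assign column indices to terms in sorted order
all_terms = sorted({term for doc in tokenized for term in doc})
vocabulary = {term: index for index, term in enumerate(all_terms)}

# One row per document, one column per vocabulary term
counts = []
for doc in tokenized:
    term_counts = Counter(doc)
    counts.append([term_counts[term] for term in all_terms])

print(vocabulary)
print(counts)
```

The rows of `counts` should match the matrix printed by CountVectorizer above.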

corpus = [
'UNC played Duke in basketball',
'Duke lost the basketball game',
'I ate a sandwich'
]
vectorizer = CountVectorizer()
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)

Output:

[[0 1 1 0 1 0 1 0 0 1]
 [0 1 1 1 0 1 0 0 1 0]
 [1 0 0 0 0 0 0 1 0 0]]
{'unc': 9, 'played': 6, 'duke': 2, 'in': 4, 'basketball': 1, 'lost': 5, 'the': 8, 'game': 3, 'ate': 0, 'sandwich': 7}

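scikit-learn's euclidean_distances function computes the distance between two or more vectors. The documents that are semantically most similar should also be the ones whose vectors lie closest together in space.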

from sklearn.metrics.pairwise import euclidean_distances
count = [[0, 1, 1, 0, 0, 1, 0, 1],
         [0, 1, 1, 1, 1, 0, 0, 0],
         [1, 0, 0, 0, 0, 0, 1, 0]]
# euclidean_distances expects 2-D inputs, so each vector is wrapped in a list
print('Distance between 1st and 2nd documents:', euclidean_distances([count[0]], [count[1]]))

Output: Distance between 1st and 2nd documents: [[2.]]

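# Pairwise distances, formatted with str.format
count = [[0, 1, 1, 0, 0, 1, 0, 1],
         [0, 1, 1, 1, 1, 0, 0, 0],
         [1, 0, 0, 0, 0, 0, 1, 0]]
for x, y in [[0, 1], [0, 2], [1, 2]]:
    # euclidean_distances expects 2-D inputs, so wrap each vector in a list
    dist = euclidean_distances([count[x]], [count[y]])
    print('Distance between documents {} and {}: {}'.format(x, y, dist))

Output:

Distance between documents 0 and 1: [[2.]]
Distance between documents 0 and 2: [[2.44948974]]
Distance between documents 1 and 2: [[2.44948974]]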
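
The Euclidean distance is the square root of the sum of squared element-wise differences. As a sanity check, the following sketch recomputes the three distances with plain NumPy; the helper `euclidean` is a hypothetical name used only for illustration.

```python
import numpy as np

count = np.array([[0, 1, 1, 0, 0, 1, 0, 1],
                  [0, 1, 1, 1, 1, 0, 0, 0],
                  [1, 0, 0, 0, 0, 0, 1, 0]])

def euclidean(a, b):
    # Square root of the sum of squared element-wise differences
    return np.sqrt(np.sum((a - b) ** 2))

d01 = euclidean(count[0], count[1])
d02 = euclidean(count[0], count[2])
d12 = euclidean(count[1], count[2])
print(d01, d02, d12)
```

The first two documents differ in four positions, giving a distance of 2; each of them differs from the third in six positions, giving the square root of 6.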

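## Stop-word filtering
CountVectorizer can filter stop words through its stop_words parameter. The default is None (no filtering); passing 'english' uses scikit-learn's built-in list of common English stop words.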

from sklearn.feature_extraction.text import CountVectorizer
corpus = [
'UNC played Duke in basketball',
'Duke lost the basketball game',
'I ate a sandwich'
]
vectorizer=CountVectorizer(stop_words='english')
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)

Output:

[[0 1 1 0 0 1 0 1]
 [0 1 1 1 1 0 0 0]
 [1 0 0 0 0 0 1 0]]
{'unc': 7, 'played': 5, 'duke': 2, 'basketball': 1, 'lost': 4, 'game': 3, 'ate': 0, 'sandwich': 6}
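
The built-in English list can also be inspected directly. A small sketch, assuming the `ENGLISH_STOP_WORDS` set exported by `sklearn.feature_extraction.text`, that filters the tokens of the first document by hand:

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Lowercased tokens, with 'the' added for illustration
tokens = ['unc', 'played', 'duke', 'in', 'basketball', 'the']

# Keep only the tokens that are not in the built-in stop-word list
kept = [token for token in tokens if token not in ENGLISH_STOP_WORDS]
print(kept)
```

Here 'in' and 'the' are dropped, which is why the vocabulary shrinks from ten terms to eight once stop_words='english' is set.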

## Stemming and lemmatization

from sklearn.feature_extraction.text import CountVectorizer
corpus = ['He ate the sandwiches',
          'Every sandwich was eaten by him']
vectorizer=CountVectorizer(binary=True,stop_words='english')
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)

Output:

[[1 0 0 1]
 [0 1 1 0]]
{'ate': 0, 'sandwiches': 3, 'sandwich': 2, 'eaten': 1}

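### Lemmatizing the word "gathering"

"gathering" is a verb in the first sentence and a noun in the second, so its lemma depends on the part of speech:

corpus = [
    'I am gathering ingredients for the sandwich.',
    'There were many wizards at the gathering.'
]
import nltk
# Download the tokenizer, tagger and WordNet data once. Resource names depend on the
# NLTK version; newer releases may ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng'.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('gathering', 'v'))  # lemma when treated as a verb
print(lemmatizer.lemmatize('gathering', 'n'))  # lemma when treated as a noun

# Compare stemming and lemmatization on the earlier corpus
from nltk import word_tokenize, pos_tag
from nltk.stem import PorterStemmer
corpus = [
    'He ate the sandwiches',
    'Every sandwich was eaten by him'
]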
stemmer = PorterStemmer()
print('Stemmed:', [[stemmer.stem(token) for token in word_tokenize(document)] for document in corpus])

Output:

Stemmed: [['He', 'ate', 'the', 'sandwich'], ['Everi', 'sandwich', 'wa', 'eaten', 'by', 'him']]

Recent NLTK releases lowercase tokens while stemming, so 'He' and 'Everi' may be printed as 'he' and 'everi'.

def lemmatize(token, tag):
    if tag[0].lower() in ['n', 'v']:
        return lemmatizer.lemmatize(token, tag[0].lower())
    return token
lemmatizer = WordNetLemmatizer()
tagged_corpus = [pos_tag(word_tokenize(document)) for document in corpus]
print('Lemmatized:', [[lemmatize(token, tag) for token, tag in document] for document in tagged_corpus])

Output:

Lemmatized: [['He', 'eat', 'the', 'sandwich'], ['Every', 'sandwich', 'be', 'eat', 'by', 'him']]

## Extending bag-of-words with TF-IDF weights

from sklearn.feature_extraction.text import CountVectorizer
corpus=['The dog ate a sandwich, the wizard transfigured a sandwich, and I ate a sandwich']
vectorizer=CountVectorizer(stop_words='english')
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)

Output:

[[2 1 3 1 1]]
{'dog': 1, 'ate': 0, 'sandwich': 2, 'wizard': 4, 'transfigured': 3}

# TF-IDF: term counts re-weighted by inverse document frequency
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ['The dog ate a sandwich and I ate a sandwich', 'The wizard transfigured a sandwich']
vectorizer = TfidfVectorizer(stop_words='english')
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)

Output:

[[0.75458397 0.37729199 0.53689271 0.         0.        ]
 [0.         0.         0.44943642 0.6316672  0.6316672 ]]
{'dog': 1, 'ate': 0, 'sandwich': 2, 'wizard': 4, 'transfigured': 3}
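
The weights above can be reproduced by hand. The sketch below assumes TfidfVectorizer's default settings (raw term counts, smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, and L2 normalisation of each row); the count matrix is typed in manually from the two documents.

```python
import numpy as np

# Term counts per document; columns: ate, dog, sandwich, transfigured, wizard
counts = np.array([[2, 1, 2, 0, 0],
                   [0, 0, 1, 1, 1]], dtype=float)

n_documents = counts.shape[0]
# Number of documents in which each term appears
document_frequency = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = np.log((1 + n_documents) / (1 + document_frequency)) + 1

tfidf = counts * idf
# L2-normalise each row so every document vector has unit length
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)
print(tfidf)
```

'sandwich' appears in both documents, so its IDF is the minimum value of 1, and it is weighted lower than 'ate' even though both occur twice in the first document.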

## Building feature vectors with the hashing trick

from sklearn.feature_extraction.text import HashingVectorizer
corpus = ['the', 'ate', 'bacon', 'cat']
vectorizer = HashingVectorizer(n_features=6)
print(vectorizer.transform(corpus).todense())

Output:

[[-1.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  1.  0.  0.]
 [ 0.  0.  0.  0. -1.  0.]
 [ 0.  1.  0.  0.  0.  0.]]
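n_features is set to 6 here only for demonstration. Also note that some of the values are negative. Because hash collisions can occur, HashingVectorizer uses a signed hash function: the value of a feature takes the same sign as its token's hash. If "cats" occurs twice and hashes to -3, the fourth element of the document's feature vector is decremented by 2. If "dogs" occurs twice and hashes to 3, the fourth element is incremented by 2.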
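
The idea can be sketched in a few lines of plain Python. This is an illustration only: `zlib.crc32` stands in for the MurmurHash3 function that scikit-learn uses, and the function name `signed_hash_vector` and the choice of the top bit as the sign are assumptions made for this example, so the indices and signs will not match HashingVectorizer's output.

```python
import zlib

def signed_hash_vector(tokens, n_features=6):
    # Hypothetical helper illustrating the signed hashing trick
    vector = [0] * n_features
    for token in tokens:
        h = zlib.crc32(token.encode('utf-8'))  # unsigned 32-bit stand-in hash
        index = h % n_features                  # which element to update
        sign = -1 if (h >> 31) & 1 else 1       # top bit decides the sign
        vector[index] += sign
    return vector

vector = signed_hash_vector(['cats', 'cats'])
print(vector)
```

A token that occurs twice moves one element by 2 in a single direction, while two colliding tokens with opposite signs would partly cancel each other rather than inflate the count.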

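## Extracting features from images
### Extracting features from pixel intensities
scikit-learn's digits dataset contains 1,797 images of the handwritten digits 0-9. Each image is 8x8 pixels, and each pixel is an intensity value from 0 to 16, where 0 is white and 16 is black. The code below prints and plots the first image: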

%matplotlib inline
from sklearn import datasets
import matplotlib.pyplot as plt
digits = datasets.load_digits()
print('Digit:', digits.target[0])
print(digits.images[0])
plt.imshow(digits.images[0], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()

Output:

Digit: 0
[[ 0.  0.  5. 13.  9.  1.  0.  0.]
 [ 0.  0. 13. 15. 10. 15.  5.  0.]
 [ 0.  3. 15.  2.  0. 11.  8.  0.]
 [ 0.  4. 12.  0.  0.  8.  8.  0.]
 [ 0.  5.  8.  0.  0.  9.  8.  0.]
 [ 0.  4. 11.  0.  1. 12.  7.  0.]
 [ 0.  2. 14.  5. 10. 12.  0.  0.]
 [ 0.  0.  6. 13. 10.  0.  0.  0.]]

digits=datasets.load_digits()
print('Feature vector:\n',digits.images[0].reshape(-1,64))

Output:

Feature vector:
 [[ 0.  0.  5. 13.  9.  1.  0.  0.  0.  0. 13. 15. 10. 15.  5.  0.  0.  3.
  15.  2.  0. 11.  8.  0.  0.  4. 12.  0.  0.  8.  8.  0.  0.  5.  8.  0.
   0.  9.  8.  0.  0.  4. 11.  0.  1. 12.  7.  0.  0.  2. 14.  5. 10. 12.
   0.  0.  0.  0.  6. 13. 10.  0.  0.  0.]]
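
The dataset already ships with this flattened representation. A short sketch showing that `digits.data` holds the same pixels as `digits.images`, one 64-element row per image:

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# images: one 8x8 array per sample; data: the same pixels flattened to 64 features
print(digits.images.shape, digits.data.shape)

flattened = digits.images[0].reshape(-1)
same = np.array_equal(flattened, digits.data[0])
print(same)
```

In practice `digits.data` can be passed straight to an estimator, so the manual reshape is only needed for images loaded from elsewhere.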

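# Detect corners (points of interest) with the Harris detector
%matplotlib inline
import numpy as np
from skimage.feature import corner_harris, corner_peaks
from skimage.color import rgb2gray
import matplotlib.pyplot as plt
import skimage.io as io
from skimage.exposure import equalize_hist

def show_corners(corners, image):
    # Plot the image and mark each detected corner with a red circle
    fig = plt.figure()
    plt.gray()
    plt.imshow(image)
    # corners is a sequence of (row, column) pairs
    y_corner, x_corner = zip(*corners)
    plt.plot(x_corner, y_corner, 'or')
    plt.xlim(0, image.shape[1])
    plt.ylim(image.shape[0], 0)
    fig.set_size_inches(np.array(fig.get_size_inches()) * 1.5)
    plt.show()

# Convert to greyscale and equalize the histogram before detecting corners
mandrill = io.imread('1.jpg')
mandrill = equalize_hist(rgb2gray(mandrill))
# Compute the Harris response, then keep local peaks at least 2 pixels apart
corners = corner_peaks(corner_harris(mandrill), min_distance=2)
show_corners(corners, mandrill)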
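
Since the listing above depends on a local image file, here is a self-contained sketch of the same two calls on a synthetic image. The 10x10 array with a white square is an assumption made for illustration, and min_distance=1 is used because the image is so small.

```python
import numpy as np
from skimage.feature import corner_harris, corner_peaks

# Synthetic 10x10 image: a white 6x6 square on a black background
square = np.zeros((10, 10))
square[2:8, 2:8] = 1

# Harris response followed by peak picking; each result row is (row, column)
corners = corner_peaks(corner_harris(square), min_distance=1)
print(corners)
```

The detected peaks should coincide with the four corners of the square, which makes it easy to check that the detector behaves as expected before running it on a photograph.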

### SIFT and SURF

import mahotas as mh
from mahotas.features import surf
# Load the image as greyscale
image = mh.imread('2.jpg', as_grey=True)
# Compute the descriptors once and reuse them
descriptors = surf.surf(image)
print('The first SURF descriptor:\n{}\n'.format(descriptors[0]))
print('Extracted %s SURF descriptors' % len(descriptors))

Output:

The first SURF descriptor:
[  4.40526550e+02   2.82058666e+02   1.80770206e+00   2.56869094e+02
   1.00000000e+00   1.91360320e+00  -6.59236825e-04  -2.96877983e-04
   1.09769833e-03   3.67625424e-04  -1.90927908e-03  -9.72986820e-04
   2.86457301e-03   9.74479580e-04  -2.15057079e-04  -1.42831161e-04
   2.23010810e-04   1.42831161e-04   3.37184432e-06   1.74527115e-06
   3.37184454e-06   1.74527136e-06   3.90064757e-02   3.58161210e-03
   3.90511371e-02   4.40730516e-03   4.41527246e-01   2.71798365e-02
   4.41527246e-01   8.70393902e-02   4.56954581e-01  -2.29019329e-02
   4.56954581e-01   9.63314021e-02   6.29652613e-02   1.77485267e-02
   6.29652613e-02   2.13300792e-02   2.23341915e-03  -7.45940061e-04
   6.30745845e-03   5.05762292e-03  -1.57216338e-02   7.64635174e-02
   1.43149320e-01   3.04822002e-01  -2.48229831e-02  -1.02886168e-01
   8.65904522e-02   1.43815811e-01  -6.32987455e-03  -5.59536669e-03
   2.03817407e-02   1.31338762e-02   6.68332753e-04   4.10704922e-05
   1.25106500e-03   1.20076608e-03   5.65924789e-03  -9.40465975e-03
   2.08687062e-02   4.03695676e-02   3.18301424e-03  -1.22350925e-02
   1.59209535e-02   1.88643296e-02   1.13586147e-03   4.11031770e-04
   1.96554689e-03   1.16562736e-03]

Extracted 826 SURF descriptors

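## Data standardization

# scikit-learn's scale function standardizes each explanatory variable (column):
# it subtracts the column mean and divides by the column standard deviation,
# so every column ends up with zero mean and unit variance.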
from sklearn import preprocessing
import numpy as np
X=np.array([[0., 0., 5., 13., 9., 1.],
            [0., 0., 13., 15., 10., 15.],
            [0., 3., 15., 2., 0., 11.]])
print(preprocessing.scale(X))

Output:

[[ 0.         -0.70710678 -1.38873015  0.52489066  0.59299945 -1.35873244]
 [ 0.         -0.70710678  0.46291005  0.87481777  0.81537425  1.01904933]
 [ 0.          1.41421356  0.9258201  -1.39970842 -1.4083737   0.33968311]]
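
The same result can be obtained directly from the definition. A minimal sketch with NumPy, assuming the population standard deviation (ddof=0) and leaving constant columns at zero, which matches how `preprocessing.scale` treats the first column here:

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[0., 0., 5., 13., 9., 1.],
              [0., 0., 13., 15., 10., 15.],
              [0., 3., 15., 2., 0., 11.]])

mean = X.mean(axis=0)
std = X.std(axis=0)    # population standard deviation of each column
std[std == 0] = 1.0    # constant columns: avoid dividing by zero
X_manual = (X - mean) / std

matches = np.allclose(X_manual, preprocessing.scale(X))
print(matches)
```

When the same scaling has to be applied to new data later, fitting a StandardScaler on the training set is usually more convenient than calling scale on each array separately.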
