Outliers

Reposted article. Posted 2019-02-28 21:23, category: Machine Learning

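Introduction

In data mining we often run into data points that fall well outside the predicted trend. These points are usually called outliers.

Such points are usually attributed to error. Errors can arise in many ways, and each cause calls for different handling:

Sensor malfunction　　->　　ignore

Data entry error　　->　　ignore

Anomalous event　　->　　pay attention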

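Outlier detection/removal algorithm

1. Train on the data

2. Detect outliers: find the points in the training set with the largest residual errors and remove them (typically about 10% of the data)

3. Train again

Steps 2 and 3 may need to be repeated several times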
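
The train, remove, retrain loop above can be sketched as follows. This is only an illustration under stated assumptions: the helper names `fit_line` and `clean_and_refit`, the sample data, and the choice of two rounds are all hypothetical, and the line is fitted with closed-form least squares for a single feature rather than with scikit-learn so that the sketch has no dependencies.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def clean_and_refit(xs, ys, fraction=0.1, rounds=2):
    """Repeat: fit, drop the `fraction` of points with the largest residuals."""
    for _ in range(rounds):
        slope, intercept = fit_line(xs, ys)
        residuals = [abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)]
        # Indices ordered by residual, smallest first
        order = sorted(range(len(xs)), key=lambda i: residuals[i])
        n_drop = math.ceil(len(order) * fraction)
        keep = order[:len(order) - n_drop]
        xs = [xs[i] for i in keep]
        ys = [ys[i] for i in keep]
    # Final fit on the points that survived every round
    return fit_line(xs, ys), xs, ys

# Hypothetical data: y = 2x + 1 with two corrupted points
xs = list(range(20))
ys = [2 * x + 1 for x in xs]
ys[3] = 60
ys[15] = -20

(slope, intercept), kept_xs, kept_ys = clean_and_refit(xs, ys)
print(slope, intercept)  # close to 2 and 1, the uncorrupted line
```

Note that each round always removes the same fraction, so once the real outliers are gone, further rounds start discarding good points. How many rounds to run is a judgment call.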

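Example: the fit after running regression on the data for the first time:

The outliers pull the fitted line away from the trend. After removing them, the regression is run once more:

With the outliers removed, the regression line fits the data well.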

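Code implementation

Environment: macOS Mojave 10.14.3

Python 3.7.0

Library: scikit-learn 0.19.2

Original dataset:

One regression on the original data:

One regression after removing 10% of the points as outliers:

outlier_removal_regression.py (main program)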

#!/usr/bin/python

import random
import numpy
import matplotlib.pyplot as plt
import pickle

from outlier_cleaner import outlierCleaner

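# pickle needs a binary stream; this wrapper encodes the strings read from a
# file opened in text mode ("r") into bytes so that pickle.load() accepts them
class StrToBytes: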
    def __init__(self, fileobj):
        self.fileobj = fileobj
    def read(self, size):
        return self.fileobj.read(size).encode()
    def readline(self, size=-1):
        return self.fileobj.readline(size).encode()


### load up some practice data with outliers in it
ages = pickle.load(StrToBytes(open("practice_outliers_ages.pkl", "r") ) )
net_worths = pickle.load(StrToBytes(open("practice_outliers_net_worths.pkl", "r") ) )



### ages and net_worths need to be reshaped into 2D numpy arrays
### second argument of reshape command is a tuple of integers: (n_rows, n_columns)
### by convention, n_rows is the number of data points
### and n_columns is the number of features
ages       = numpy.reshape( numpy.array(ages), (len(ages), 1))
net_worths = numpy.reshape( numpy.array(net_worths), (len(net_worths), 1))
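# sklearn.cross_validation is deprecated (removed in scikit-learn 0.20);
# train_test_split lives in sklearn.model_selection since 0.18
from sklearn.model_selection import train_test_split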
ages_train, ages_test, net_worths_train, net_worths_test = train_test_split(ages, net_worths, test_size=0.1, random_state=42)

### fill in a regression here!  Name the regression object reg so that
### the plotting code below works, and you can see what your regression looks like


from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(ages_train,net_worths_train)
print (reg.coef_)
print (reg.intercept_)
print (reg.score(ages_test,net_worths_test) )



try:
    plt.plot(ages, reg.predict(ages), color="blue")
except NameError:
    pass
plt.scatter(ages, net_worths)
plt.show()


### identify and remove the most outlier-y points
cleaned_data = []
try:
    predictions = reg.predict(ages_train)
    cleaned_data = outlierCleaner( predictions, ages_train, net_worths_train )

except NameError:
    print ("your regression object doesn't exist, or isn't named reg")
    print ("can't make predictions to use in identifying outliers")







### only run this code if cleaned_data is returning data
if len(cleaned_data) > 0:
    ages, net_worths, errors = zip(*cleaned_data)
    ages       = numpy.reshape( numpy.array(ages), (len(ages), 1))
    net_worths = numpy.reshape( numpy.array(net_worths), (len(net_worths), 1))

    ### refit your cleaned data!
    try:
        reg.fit(ages, net_worths)
        plt.plot(ages, reg.predict(ages), color="blue")
        print (reg.coef_)
        print (reg.intercept_)
        print (reg.score(ages_test,net_worths_test) )
    except NameError:
        print ("you don't seem to have regression imported/created,")
        print ("   or else your regression object isn't named reg")
        print ("   either way, only draw the scatter plot of the cleaned data")
    plt.scatter(ages, net_worths)
    plt.xlabel("ages")
    plt.ylabel("net worths")
    plt.show()


else:
    print ("outlierCleaner() is returning an empty list, no refitting to be done")

outlier_cleaner.py (removes the 10% of points with the largest errors)

import numpy as np
import math
 
def outlierCleaner(predictions, ages, net_worths):
    """
        Clean away the 10% of points that have the largest
        residual errors (difference between the prediction
        and the actual net worth).
        Return a list of tuples named cleaned_data where 
        each tuple is of the form (age, net_worth, error).
    """
    
    cleaned_data = []
 
 
    ages = ages.reshape((1,len(ages)))[0]
    net_worths = net_worths.reshape((1,len(ages)))[0]
    predictions = predictions.reshape((1,len(ages)))[0]
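    # zip() pairs up the corresponding elements of its arguments as tuples;
    # in Python 3 it returns an iterator, which sorted() turns into a list below
    cleaned_data = zip(ages,net_worths,abs(net_worths-predictions))
    # sort by error, smallest first
    cleaned_data = sorted(cleaned_data , key=lambda x: (x[2]))
    # math.ceil() rounds up; this is the negated number of points to drop
    cleaned_num = int(-1 * math.ceil(len(cleaned_data)* 0.1))
    # slice off the points with the largest errors
    cleaned_data = cleaned_data[:cleaned_num]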
    
    return cleaned_data
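
The sort-and-slice step can be tried on its own with a handful of points. The following is a small sketch, not the code above: the function name `drop_largest_errors`, its `fraction` parameter, and the sample tuples are hypothetical, and it works on plain Python lists instead of numpy arrays.

```python
import math

def drop_largest_errors(points, fraction=0.1):
    """Return the (x, y, error) tuples sorted by error, minus the largest ones."""
    ordered = sorted(points, key=lambda p: p[2])
    n_drop = math.ceil(len(ordered) * fraction)
    # Use an explicit end index: a bare ordered[:-n_drop] would become
    # ordered[:0] (an empty list) whenever n_drop is 0
    return ordered[:len(ordered) - n_drop]

# Hypothetical (age, net_worth, error) tuples; the last one is far off the trend
points = [(20 + i, 100.0 + 6 * i, 0.5 + 0.1 * i) for i in range(9)]
points.append((40, 900.0, 700.0))

kept = drop_largest_errors(points)
print(len(kept))  # 9: one of the ten points is dropped
```

Because the count is rounded up, at least one point is removed from any non-empty input, even when 10% of the data is less than one point.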

The two regressions give the following goodness-of-fit scores (R², computed on the test set):

First: 0.8782624703664675

Second: 0.983189455395532
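
For reference, the value printed by `reg.score(...)` for a `LinearRegression` is the coefficient of determination. In the notation below (the symbols are introduced here for illustration), the y_i are the observed net worths, the \hat{y}_i are the model's predictions, and \bar{y} is the mean of the observed values:

```latex
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
```

A score of 1 means the predictions match the observations exactly, so a higher score on the test set indicates a better fit.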

Removing the outliers clearly improves how well the model predicts the test data.

