Data Competition in Practice (2): Shape Recognition, Square or Circle?


Preface

1. Background

  The competition provides four thousand images as the training set. Each image contains exactly one shape, either a circle or a square. Your task is to train a binary classification model on these 4,000 images and use it to predict the shape in each image of the test set.

2. Task type

  Binary classification, image recognition

3. Data files

train.csv            training set          19.7 MB

test.csv             test set              17.5 MB

sample_submit.csv    submission example    25 KB

4. Data description

 

  The training set contains 4,000 grayscale images and the test set contains 3,550. Every image carries a large amount of noise. Each image is 40×40 pixels, i.e. a 40×40 matrix, and each matrix is stored as a row vector in train.csv or test.csv. Each row of train.csv and test.csv therefore represents one image and holds 1,600 pixel features.

  The left image below corresponds to the first sample row of train.csv (a square); the right image corresponds to the second row (a circle).

   Contestants must submit a label (not a probability) for every image in the test set, written as 0 or 1, in the same format as sample_submit.csv.

  train.csv and test.csv are comma-separated files and can be read in Python as follows:

import pandas as pd

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
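
  A quick sanity check after loading; the expected shapes follow from the data description above (an id column, 1,600 pixel columns, and a label column y in the training set):

print(train.shape)                # expected (4000, 1602): id + 1600 pixels + y
print(test.shape)                 # expected (3550, 1601): id + 1600 pixels
print(train['y'].value_counts())  # class balance between squares and circles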

  

5. Evaluation

  The submission is a predicted label, 0 or 1, for each row. The evaluation metric is the F1 score.

  The F1 score ranges from 0 to 1, and the closer it is to 1, the better the model's predictions. It is computed as

  F1 = 2 * precision * recall / (precision + recall)

  where precision and recall can both be computed from the confusion matrix.

6. The confusion matrix

   For background, see the blog post "Machine Learning Notes: Common Evaluation Metrics".

   A confusion matrix summarizes a classifier's results. For K-class classification it is a K × K table that records how the classifier's predictions line up with the true classes.

  For the most common case, binary classification, the confusion matrix is 2 × 2:

                   Predicted 1    Predicted 0
     Actual 1          TP             FN
     Actual 0          FP             TN

    TP = True Positive; FP = False Positive

    FN = False Negative; TN = True Negative

 Here is an example.

  Suppose a model makes predictions for 15 samples with the following results:

Predicted: 1    1    1    1    1    0    0    0    0    0    1    1    1    0    1

Actual:    0    1    1    0    1    1    0    0    1    0    1    0    1    0    0

  Tallying these up gives TP = 5, FP = 4, FN = 2, TN = 4; these four counts form the confusion matrix for this example, and they are frequently used to define other metrics.

Accuracy = (TP+TN) / (TP+TN+FP+FN)

  For the example above, accuracy = (5+4) / 15 = 0.6.

Precision (also PPV, positive predictive value) = TP / (TP+FP)

  For the example above, precision = 5 / (5+4) = 0.556.

Recall (also sensitivity, or TPR, true positive rate) = TP / (TP+FN)

  For the example above, recall = 5 / (5+2) = 0.714.

Specificity (also TNR, true negative rate) = TN / (TN+FP)

  For the example above, specificity = 4 / (4+2) = 0.667.

F1 score = 2*TP / (2*TP+FP+FN)

  For the example above, F1 = 2*5 / (2*5+4+2) = 0.625.
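
  These numbers are easy to verify with scikit-learn; the snippet below recomputes the worked example above:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1]
y_true = [0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0]

print(accuracy_score(y_true, y_pred))   # 0.6
print(precision_score(y_true, y_pred))  # 0.555...
print(recall_score(y_true, y_pred))     # 0.714...
print(f1_score(y_true, y_pred))         # 0.625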

 

7. For the complete code, see my GitHub

  Link: click here

Data Preprocessing

1. Preparation

PIL documentation

  The most important class in the Python Imaging Library is the Image class, defined in the module with the same name. You can create instances of this class in several ways; either by loading images from files, processing other images, or creating images from scratch.

Before using it, install the pillow package:

pip install pillow
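
  As a quick reference, the three ways of creating an Image instance mentioned in the quote above look like this (the file name is just a placeholder):

from PIL import Image
import numpy as np

img = Image.open('test1.jpg')                # load an image from a file
gray = img.convert('L')                      # process another image (convert to grayscale)
blank = Image.new('L', (40, 40), 0)          # create an image from scratch
arr_img = Image.fromarray(np.zeros((40, 40), dtype=np.uint8))  # create from a matrix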

  

2. Converting between matrices and images in Python

  The code below reconstructs the first training sample (a square, left) and the second (a circle, right); the results match the images shown earlier.

import numpy as np
import pandas as pd
from scipy.misc import imsave   # removed in newer SciPy releases; see the PIL alternative below


def loadData():
    # read the data; columns are id, 1600 pixel values, label y
    train_data = pd.read_csv('data/train.csv')
    data0 = train_data.iloc[0, 1:-1]   # first sample (a square)
    data1 = train_data.iloc[1, 1:-1]   # second sample (a circle)
    data0 = np.matrix(data0)
    data1 = np.matrix(data1)
    data0 = np.reshape(data0, (40, 40))
    data1 = np.reshape(data1, (40, 40))
    imsave('test1.jpg', data0)
    imsave('test2.jpg', data1)   # fixed: the original wrote both images to test1.jpg


if __name__ == '__main__':
    loadData()

  Result:
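
  Note that scipy.misc.imsave has been removed from recent SciPy releases. With the pillow package installed earlier, an equivalent save looks like this sketch (it assumes the pixel values fit into 0-255):

from PIL import Image
import numpy as np

def save_row_as_image(row, filename):
    # row: one 1600-element pixel vector taken from a line of train.csv
    img = np.asarray(row, dtype=np.uint8).reshape(40, 40)   # assumes values in 0-255
    Image.fromarray(img).save(filename)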

3. Denoising with binarization

  Grayscale conversion: in the RGB model, when R = G = B the color is a shade of gray, and the shared value is called the gray value (also intensity or brightness). A grayscale image therefore needs only one byte per pixel, with gray values ranging from 0 to 255. The gray value of each pixel is usually obtained as a weighted average of the color channels.

  Binarization: setting the gray value of every pixel to either 0 or 255, so that the whole image shows a clear black-and-white contrast.
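
  In code, the weighted-average grayscale conversion (using the common 0.299/0.587/0.114 weights) and binarization look roughly like this; the competition images are already grayscale, so only the second step applies here:

import numpy as np

def to_gray(rgb):
    # weighted average of the R, G, B channels
    return 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]

def binarize(gray, threshold=127):
    # every pixel becomes either 0 or 255
    return np.where(gray > threshold, 255, 0)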

  Following a suggestion from the discussion forum, I added a median filter to the CNN benchmark to remove background noise and turned the images into binary images; this speeds up convergence, and accuracy can reach 1.

import numpy as np
from scipy.ndimage import median_filter

def data_modify_suitable_train(data_set=None, type=True):
    if data_set is None:
        return None
    if type is True:
        np.random.shuffle(data_set)
        data = data_set[:, 0: data_set.shape[1] - 1]   # drop the label column
    else:
        data = data_set
    data = np.array([np.reshape(i, (40, 40)) for i in data])
    data = np.array([median_filter(i, size=(3, 3)) for i in data])   # median filter to denoise
    data = np.array([(i > 10) * 100 for i in data])                  # binarize with (10, 100)
    data = np.array([np.reshape(i, (i.shape[0], i.shape[1], 1)) for i in data])
    return data

  Here is the test code I used to check this preprocessing:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.ndimage import median_filter

def loadData(trainFile, show_fig=False):
    train = pd.read_csv(trainFile)
    data = np.array(train.iloc[1, 1:-1])
    origin_data = np.reshape(data, (40, 40))

    # median filter
    median_filter_data = np.array(median_filter(origin_data, size=(3, 3)))

    # binarization
    # binary_data = np.array((origin_data > 127) * 256)
    binary_data = np.array((origin_data > 1) * 256)
    if show_fig:
        plt.subplot(1, 3, 1)
        plt.gray()
        plt.imshow(origin_data)
        plt.axis('off')
        plt.title('origin photo')

        plt.subplot(1, 3, 2)
        plt.imshow(median_filter_data)
        plt.axis('off')
        plt.title('median filter photo')

        plt.subplot(1, 3, 3)
        plt.imshow(binary_data)
        plt.axis('off')
        plt.title('binary photo')
        plt.show()   # added so the figure actually appears when run as a script


if __name__ == '__main__':
    trainFile = 'datain/train.csv'
    loadData(trainFile, True)

  The resulting figure:

 

   We can see that binarization makes the shape much more distinct. The discussion at http://sofasofa.io/forum_main_post.php?postid=1002062 covers this; the takeaway is summarized below.

   In short, the 10 and 100 were values the benchmark author picked casually. So I ran a number of tests of my own and, judging from the images, the pair (1, 256) looked good; I therefore redid the binarization with (1, 256). The result:

 

   The effect is about the same as with (10, 100); maybe I was just unlucky...
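
  For anyone who wants to eyeball other (threshold, value) pairs, here is a small sketch in the spirit of the test code above (the path, row index, and the pairs themselves are just my choices):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = np.array(pd.read_csv('data/train.csv').iloc[1, 1:-1]).reshape(40, 40)
pairs = [(1, 256), (10, 100), (50, 255), (127, 255)]   # (threshold, value) pairs to compare
for k, (t, v) in enumerate(pairs):
    plt.subplot(1, len(pairs), k + 1)
    plt.gray()
    plt.imshow((data > t) * v)
    plt.axis('off')
    plt.title('t=%d, v=%d' % (t, v))
plt.show()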

Model Training and Results

1. Logistic regression (benchmark code)

  Logistic regression is one of the most commonly used classic classification methods in machine learning. Although it is called a regression model, it solves classification problems: at its core it is a linear model followed by a sigmoid mapping, which maps the continuous linear output onto discrete classes. It is typically used for binary classification; its multi-class generalization is softmax.
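
  The "linear model plus sigmoid" idea fits in a few lines; this is just an illustration of the mechanism, not how sklearn implements it:

import numpy as np

def predict(x, w, b):
    z = np.dot(w, x) + b            # linear model
    p = 1.0 / (1.0 + np.exp(-z))    # sigmoid maps z to a probability in (0, 1)
    return int(p > 0.5)             # threshold at 0.5 gives the binary label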

  This model's predictions score an F1 of 0.97318.

# -*- coding: utf-8 -*-
from sklearn.linear_model import LogisticRegression
import pandas as pd
import numpy as np

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# the id column carries no information; the label column y is split off
train.drop('id', axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)

train_labels = train.pop('y')

clf = LogisticRegression()
clf.fit(train, train_labels)

submit = pd.read_csv('sample_submit.csv')
submit['y'] = clf.predict(test)
submit.to_csv('my_LR_prediction.csv', index=False)

  

My modified code:

# _*_ coding: utf-8 _*_
import pandas as pd
from numpy import *
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# load data
def loadDataSet(trainname, testname):
    '''
    In the training set, each row holds the features X1...Xn followed by the class label.
    :param trainname: path of the training csv
    :param testname: path of the test csv
    :return: training features, training labels, test features
    '''
    datafile = pd.read_csv(trainname)
    testfile = pd.read_csv(testname)
    dataMat, labelMat = datafile.iloc[:, 1:-1], datafile.iloc[:, -1]
    testMat = testfile.iloc[:, 1:]
    return dataMat, labelMat, testMat

def sklearn_logistic(dataMat, labelMat, testMat, submitfile):
    trainSet, testSet, trainLabels, testLabels = train_test_split(dataMat, labelMat,
                                                                  test_size=0.3, random_state=400)
    classifier = LogisticRegression(solver='sag', max_iter=5000)
    classifier.fit(trainSet, trainLabels)
    test_accuracy = classifier.score(testSet, testLabels) * 100
    print("Accuracy: %s%%" % test_accuracy)
    submit = pd.read_csv(submitfile)
    submit['y'] = classifier.predict(testMat)
    submit.to_csv('my_LR_prediction.csv', index=False)

    return test_accuracy

if __name__ == '__main__':
    TrainFile = 'data/train.csv'
    TestFile = 'data/test.csv'
    SubmitFile = 'data/sample_submit.csv'
    dataMat, labelMat, testMat = loadDataSet(TrainFile, TestFile)
    sklearn_logistic(dataMat, labelMat, testMat, SubmitFile)

  Result:

 

2. Convolutional neural network (benchmark code)

The benchmark's F1 score is 0.98632.

  My submission scored 0.99635.

# -*- coding: utf-8 -*-
from keras.callbacks import TensorBoard
from keras.layers import Dense, Dropout, MaxPooling2D, Flatten, Convolution2D
from keras.models import Sequential
from keras import backend as K
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

def load_train_test_data(train, test):
    np.random.shuffle(train)
    labels = train[:, -1]   # a view of the label column: it stays aligned with the
                            # rows even after the second in-place shuffle inside
                            # data_modify_suitable_train
    data_test = np.array(test)
    data, data_test = data_modify_suitable_train(train, True), data_modify_suitable_train(test, False)
    train_x, test_x, train_y, test_y = train_test_split(data, labels, test_size=0.7)
    return train_x, train_y, test_x, test_y, data_test


def data_modify_suitable_train(data_set=None, type=True):
    if data_set is None:
        return None
    if type is True:
        np.random.shuffle(data_set)
        data = data_set[:, 0: data_set.shape[1] - 1]   # drop the label column
    else:
        data = data_set
    data = np.array([np.reshape(i, (40, 40)) for i in data])
    data = np.array([np.reshape(i, (i.shape[0], i.shape[1], 1)) for i in data])
    return data


def f1(y_true, y_pred):
    def recall(y_true,y_pred):
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
        recall = true_positives / (possible_positives + K.epsilon())
        return recall

    def precision(y_true,y_pred):
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
        precision = true_positives / (predicted_positives + K.epsilon())
        return precision
    
    precision = precision(y_true, y_pred)
    recall = recall(y_true, y_pred)
    return 2 * ((precision * recall) / (precision + recall))


def built_model(train, test):
    model = Sequential()
    model.add(Convolution2D(filters=8,
                            kernel_size=(5, 5),
                            input_shape=(40, 40, 1),
                            activation='relu'))
    model.add(Convolution2D(filters=16,
                            kernel_size=(3, 3),
                            activation='relu'))
    model.add(MaxPooling2D(pool_size=(4, 4)))
    model.add(Convolution2D(filters=16,
                            kernel_size=(3, 3),
                            activation='relu'))
    model.add(MaxPooling2D(pool_size=(4, 4)))
    model.add(Flatten())
    model.add(Dense(units=128,
                    activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(units=1,
                    activation='sigmoid'))
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy', f1])
    model.summary()
    return model


def train_model(train, test, batch_size=64, epochs=20, model=None):
    train_x, train_y, test_x, test_y, t = load_train_test_data(train, test)
    if model is None:
        model = built_model(train, test)
        history = model.fit(train_x, train_y,
                            batch_size=batch_size,
                            epochs=epochs,
                            verbose=2,
                            validation_split=0.2)
    print("刻畫損失函數在訓練與驗證集的變化")
    plt.plot(history.history['loss'], label='train')
    plt.plot(history.history['val_loss'], label='valid')
    plt.legend()
    plt.show()
    pred_prob = model.predict(t, batch_size=batch_size, verbose=1)
    pred = np.array(pred_prob > 0.5).astype(int)
    score = model.evaluate(test_x, test_y, batch_size=batch_size)
    print(score)
    print("刻畫預測結果與測試集結果")
    return pred


if __name__ == '__main__':
    train, test = pd.read_csv('train.csv'), pd.read_csv('test.csv') 
    train = np.array(train.drop('id', axis=1))
    test = np.array(test.drop('id', axis=1))
    
    pred = train_model(train, test)
    submit = pd.read_csv('sample_submit.csv')
    submit['y'] = pred
    submit.to_csv('my_CNN_prediction.csv', index=False)

  My modified CNN code:

# _*_ coding: utf-8 _*_
from keras.callbacks import  TensorBoard
from keras.layers import Dense, Dropout, MaxPooling2D, Flatten, Convolution2D
from keras.models import Sequential
from keras import backend as K
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from scipy.ndimage import median_filter

def load_train_test_data(train, test):
    np.random.shuffle(train)
    labels = train[:, -1]
    data_test = np.array(test)  
    data, data_test = data_modify_suitable_train(train, True),data_modify_suitable_train(test, False)
    train_x, test_x, train_y, test_y = train_test_split(data, labels, test_size=0.7)
    return train_x, train_y, test_x, test_y, data_test


def data_modify_suitable_train(data_set=None, type=True):
    if data_set is None:
        return None
    if type is True:
        np.random.shuffle(data_set)
        data = data_set[:, 0: data_set.shape[1] - 1]   # drop the label column
    else:
        data = data_set
    data = np.array([np.reshape(i, (40, 40)) for i in data])
    data = np.array([median_filter(i, size=(3, 3)) for i in data])   # median filter to denoise
    data = np.array([(i > 10) * 100 for i in data])                  # binarize with (10, 100)
    data = np.array([np.reshape(i, (i.shape[0], i.shape[1], 1)) for i in data])
    return data


def f1(y_true, y_pred):
    def recall(y_true, y_pred):
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
        recall = true_positives / (possible_positives + K.epsilon())
        return recall

    def precision(y_true, y_pred):
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0 ,1)))
        predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
        precision = true_positives / (predicted_positives + K.epsilon())
        return precision

    precision = precision(y_true, y_pred)
    recall = recall(y_true, y_pred)
    return 2 * ((precision * recall) / (precision + recall))

def built_model(train, test):
    model = Sequential()
    model.add(Convolution2D(filters= 8,
                            kernel_size=(5, 5),
                            input_shape=(40, 40, 1),
                            activation='relu'))
    model.add(Convolution2D(filters=16,
                            kernel_size=(3, 3),
                            activation='relu'))
    model.add(MaxPooling2D(pool_size=(4, 4)))
    model.add(Convolution2D(filters=16,
                            kernel_size=(3,3),
                            activation='relu'))
    model.add(MaxPooling2D(pool_size=(4,4)))
    model.add(Flatten())
    model.add(Dense(units=128,
                    activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(units=1,
                    activation='sigmoid'))
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy',f1])
    model.summary()
    return model

def train_models(train, test, batch_size=64, epochs=20, model=None):
    train_x, train_y, test_x, test_y, t = load_train_test_data(train, test)
    if model is None:
        model = built_model(train, test)
        history = model.fit(train_x, train_y,
                            batch_size=batch_size,
                            epochs=epochs,
                            verbose=2,
                            validation_split=0.2)
        print("Plotting the loss on the training and validation sets")
        plt.plot(history.history['loss'], label='train')
        plt.plot(history.history['val_loss'], label='valid')
        plt.legend()
        plt.show()
        pred_prob = model.predict(t, batch_size=batch_size, verbose=1)
        pred = np.array(pred_prob > 0.5).astype(int)
        score = model.evaluate(test_x, test_y, batch_size=batch_size)
        print("score is %s" % score)
        print("Plotting predictions against the test set")
        return pred

if __name__ == '__main__':
    train = pd.read_csv('data/train.csv')
    test  = pd.read_csv('data/test.csv')
    # train = train.iloc[:,1:]
    # test = test.iloc[:,1:]
    print(type(train))
    train = np.array(train.drop('id', axis=1))
    test = np.array(test.drop('id', axis=1))
    print(type(train))

    pred = train_models(train, test)
    submit = pd.read_csv('data/sample_submit.csv')
    submit['y'] = pred
    submit.to_csv('my_CNN_prediction.csv', index=False)

  The results:

   If only the median filter is applied, the result is not as good as denoising plus converting the image to a binary image.

 

   I then experimented with making the CNN a little deeper to see the effect. The model code:

def bulit_model_new(train, test):
    n_filter = 32
    model = Sequential()
    model.add(Convolution2D(filters=n_filter,
                            kernel_size=(5, 5),
                            input_shape=(40, 40, 1),
                            activation='relu'))
    model.add(Convolution2D(filters=n_filter,
                            kernel_size=(5, 5),
                            activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Convolution2D(filters=n_filter,
                            kernel_size=(5, 5),
                            activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Convolution2D(filters=n_filter,
                            kernel_size=(3, 3),
                            activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.5))
    model.add(Flatten())
    model.add(Dense(units=128,
                    activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy', f1])
    model.summary()
    return model

    The score:

 

   Oh well. I gave up on pushing the accuracy all the way to 1.

 3. XGBoost classification

Code:

# _*_ coding: utf-8 _*_
import pandas as pd
from numpy import *
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score
import xgboost as xgb

# load data
def loadDataSet(trainname, testname):
    '''
    In the training set, each row holds the features X1...Xn followed by the class label.
    :param trainname: path of the training csv
    :param testname: path of the test csv
    :return: training features, training labels, test features
    '''
    datafile = pd.read_csv(trainname)
    testfile = pd.read_csv(testname)
    dataMat, labelMat = datafile.iloc[:, 1:-1], datafile.iloc[:, -1]
    testMat = testfile.iloc[:, 1:]
    # fit the scaler on the training data only, then apply it to both sets
    # (the original fit a separate scaler on the test set)
    scaler = MinMaxScaler().fit(dataMat)
    dataMat = scaler.transform(dataMat)
    testMat = scaler.transform(testMat)
    return dataMat, labelMat, testMat

def xgbFunc(dataMat, labelMat, testMat, submitfile):
    trainSet, testSet, trainLabels, testLabels = train_test_split(dataMat, labelMat,
                                                                  test_size=0.3, random_state=400)
    xgbModel = xgb.XGBClassifier(
        learning_rate=0.1,
        n_estimators=1000,
        max_depth=5,
        gamma=0,
        subsample=0.8,
        colsample_bytree=0.8,
        objective='binary:logistic',
        nthread=4,
        seed=27
    )

    xgbModel.fit(trainSet, trainLabels)
    # predict on the held-out split
    test_pred = xgbModel.predict(testSet)
    print(test_pred)
    test_accuracy = accuracy_score(testLabels, test_pred)
    print("Accuracy: %s" % test_accuracy)
    submit = pd.read_csv(submitfile)
    submit['y'] = xgbModel.predict(testMat)
    submit.to_csv('my_XGB_prediction.csv', index=False)

if __name__ == '__main__':
    TrainFile = 'data/train.csv'
    TestFile = 'data/test.csv'
    SubmitFile = 'data/sample_submit.csv'
    dataMat, labelMat, testMat = loadDataSet(TrainFile, TestFile)
    xgbFunc(dataMat, labelMat, testMat, SubmitFile)

  

  Results on the held-out split:

[1 0 1 ... 1 1 0]

Accuracy: 0.9975

   The submitted prediction scored 0.99635.
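
  The tuning code itself isn't shown here; a typical grid search over a couple of XGBoost parameters would look like the sketch below (the parameter grid is my own choice, not necessarily the one behind the tuned score):

from sklearn.model_selection import GridSearchCV
import xgboost as xgb

param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.05, 0.1, 0.2],
}
search = GridSearchCV(
    xgb.XGBClassifier(n_estimators=500, objective='binary:logistic', seed=27),
    param_grid, scoring='f1', cv=3)
search.fit(trainSet, trainLabels)   # trainSet/trainLabels from the split above
print(search.best_params_, search.best_score_)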

XGBoost results after parameter tuning:

 

 4. KNN

  In the discussion forum people said KNN can reach 99% accuracy without any tuning, so I gave it a try.

Code:

# _*_ coding: utf-8 _*_
import pandas as pd
from numpy import *
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# load data
def loadDataSet(trainname, testname):
    datafile = pd.read_csv(trainname)
    testfile = pd.read_csv(testname)
    dataMat, labelMat = datafile.iloc[:, 1:-1], datafile.iloc[:, -1]
    testMat = testfile.iloc[:, 1:]
    return dataMat, labelMat, testMat

def sklearn_knn(dataMat, labelMat, testMat, submitfile):
    trainSet, testSet, trainLabels, testLabels = train_test_split(dataMat, labelMat,
                                                                  test_size=0.2, random_state=400)
    classifier = KNeighborsClassifier()
    classifier.fit(trainSet, trainLabels)
    test_accuracy = classifier.score(testSet, testLabels) * 100
    print("Accuracy: %.2f%%" % test_accuracy)
    submit = pd.read_csv(submitfile)
    submit['y'] = classifier.predict(testMat)
    submit.to_csv('my_KNN_prediction.csv', index=False)

    return test_accuracy

if __name__ == '__main__':
    TrainFile = 'data/train.csv'
    TestFile = 'data/test.csv'
    SubmitFile = 'data/sample_submit.csv'
    dataMat, labelMat, testMat = loadDataSet(TrainFile, TestFile)
    sklearn_knn(dataMat, labelMat, testMat, SubmitFile)

  

The submission here is the Nearest-Neighbors result, which scored 0.99062:
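
  KNeighborsClassifier defaults to n_neighbors=5; if you do want to tune it, a quick cross-validated sweep looks like this (a sketch that reuses trainSet and trainLabels from the code above):

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

for k in [1, 3, 5, 7, 9]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             trainSet, trainLabels, cv=5)
    print('k=%d  mean accuracy=%.4f' % (k, scores.mean()))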

 

5. Trying eight algorithms, random forest included

  In an earlier post on random forests (link: Machine Learning Notes: Random Forest) I compared the advantages of random forest against other algorithms. Since this problem looked simple, I tried a batch of classifiers here and indeed found several that handle this binary classification well. Without further ado, the results.

Code:

# _*_ coding: utf-8 _*_
import pandas as pd
from numpy import *
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# load data
def loadDataSet(trainname, testname):
    '''
    In the training set, each row holds the features X1...Xn followed by the class label.
    '''
    datafile = pd.read_csv(trainname)
    testfile = pd.read_csv(testname)
    dataMat, labelMat = datafile.iloc[:, 1:-1], datafile.iloc[:, -1]
    testMat = testfile.iloc[:, 1:]
    # fit the scaler on the training data only, then apply it to both sets
    scaler = MinMaxScaler().fit(dataMat)
    dataMat = scaler.transform(dataMat)
    testMat = scaler.transform(testMat)
    return dataMat, labelMat, testMat

def AlgorithmFunc(dataMat, labelMat, testMat, submitfile):
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, ExtraTreesClassifier
    from sklearn.naive_bayes import GaussianNB

    trainSet, testSet, trainLabels, testLabels = train_test_split(dataMat, labelMat,
                                                                  test_size=0.3, random_state=400)
    names = ['Nearest-Neighbors', 'Linear-SVM', 'RBF-SVM', 'Decision-Tree',
             'Random-Forest', 'AdaBoost', 'Naive-Bayes', 'ExtraTrees']

    classifiers = [
        KNeighborsClassifier(),
        SVC(kernel='linear', C=50, max_iter=5000),   # fixed: the original passed kernel='rbf' here,
                                                     # which contradicted the 'Linear-SVM' label
        SVC(gamma=2, C=1),                           # default kernel is rbf
        DecisionTreeClassifier(),
        RandomForestClassifier(),
        AdaBoostClassifier(),
        GaussianNB(),
        ExtraTreesClassifier(),
    ]

    for name, classifier in zip(names, classifiers):
        classifier.fit(trainSet, trainLabels)
        score = classifier.score(testSet, testLabels)
        print('%s   is     %s' % (name, score))
        submit = pd.read_csv(submitfile)
        submit['y'] = classifier.predict(testMat)
        submit.to_csv('my_' + str(name) + '_prediction.csv', index=False)

if __name__ == '__main__':
    TrainFile = 'data/train.csv'
    TestFile = 'data/test.csv'
    SubmitFile = 'data/sample_submit.csv'
    dataMat, labelMat, testMat = loadDataSet(TrainFile, TestFile)
    AlgorithmFunc(dataMat, labelMat, testMat, SubmitFile)

  Test results:

Nearest-Neighbors   is     0.9925

Linear-SVM   is     0.9758333333333333

RBF-SVM   is     0.5233333333333333

Random-Forest   is     0.9975

Decision-Tree   is     0.9825

AdaBoost   is     0.9808333333333333

Naive-Bayes   is     0.8858333333333334

ExtraTrees   is     0.9891666666666666

  

  I plan to submit the random forest result and see the accuracy on the test set:

 

