Machine Learning | Algorithm Notes: Logistic Regression

This article is a repost. 2019-03-10 17:55. Tags: supervised learning / machine learning / logistic regression / sklearn

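Preface

This series summarizes machine learning algorithms. It aims to explain the principles behind each algorithm clearly and includes hands-on code examples to aid understanding.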

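1. Algorithm Overview

1.1 Definition

Despite the word "regression" in its name, logistic regression (LR) was originally designed to solve binary classification problems.

A linear regression model fits data with the simplest possible linear equation, but it only performs regression and cannot classify. LR is a classification model built on top of linear regression.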

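The simplest way to turn a linear model into a classifier, for example for a binary task, is to pass its output $z$ through a unit-step function that splits the outputs at a threshold: outputs above the threshold are assigned to class 1 and outputs below it to class 0, as shown in the left panel of the figure.

Such a piecewise function has poor mathematical properties, however: it is neither continuous nor differentiable. The logistic function, shown in the right panel of the same figure and usually called the Sigmoid function, was therefore proposed as a replacement: $\sigma(z) = 1 / (1 + e^{-z})$.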

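The function has good mathematical properties: its output lies between 0 and 1, so it can be used to predict the class, and it is differentiable to any order, so it can be used to solve for the optimum. Substituting the linear model $z = w^T x + b$ into it gives the LR model

$$\hat{y} = \frac{1}{1 + e^{-(w^T x + b)}}$$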
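The contrast between the two functions can be sketched in a few lines of NumPy. The function names and the choice of 0 as the threshold (with 0.5 returned exactly at the threshold) are assumptions of this sketch, not part of the original article. The last part checks numerically that the sigmoid's derivative equals $\sigma(z)(1 - \sigma(z))$, which is what makes it convenient for gradient-based optimization.

```python
import numpy as np


def unit_step(z):
    """Unit-step function: 1 where z > 0, 0 where z < 0, 0.5 at z == 0."""
    return np.where(z > 0, 1.0, np.where(z < 0, 0.0, 0.5))


def sigmoid(z):
    """Logistic (sigmoid) function, a smooth surrogate for the step."""
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
step_values = unit_step(z)
sigmoid_values = sigmoid(z)

# Compare the analytic derivative with a central finite difference.
eps = 1e-6
numeric_grad = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
max_grad_error = np.max(np.abs(numeric_grad - sigmoid_grad(z)))

print(step_values)
print(sigmoid_values)
print(max_grad_error)
```

The step function jumps at the threshold, while the sigmoid moves smoothly from 0 to 1 and passes through 0.5 at $z = 0$.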

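In effect, the LR model fits the line $z = w^T x + b$ so that the decision boundary $w^T x + b = 0$ separates the two classes in the original data as accurately as possible.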
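A minimal sketch of this idea for two features follows. The values of `w` and `b` are hypothetical, chosen only for illustration. Points on the line $w^T x + b = 0$ receive probability 0.5, and points on either side fall above or below 0.5.

```python
import numpy as np

# Hypothetical parameters of a trained two-feature model.
w = np.array([2.0, -1.0])
b = -1.0


def predict_proba(x):
    """P(y = 1 | x) under the logistic regression model."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))


def boundary_x2(x1):
    """Solve w[0] * x1 + w[1] * x2 + b = 0 for x2."""
    return (-b - w[0] * x1) / w[1]


x1 = np.array([0.0, 1.0, 2.0])
on_boundary = np.column_stack([x1, boundary_x2(x1)])
proba_on_boundary = predict_proba(on_boundary)

# Moving in the direction of w increases z, so the probability rises above 0.5;
# moving against w lowers it below 0.5.
proba_positive_side = predict_proba(on_boundary + 0.5 * w)
proba_negative_side = predict_proba(on_boundary - 0.5 * w)
print(proba_on_boundary, proba_positive_side, proba_negative_side)
```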

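1.2 Loss Function

Regression problems usually use the mean squared error (MSE) as the loss function. For binary classification with LR, the loss for a single sample takes the form

$$L(y, \hat{y}) = -\left[\, y \ln \hat{y} + (1 - y) \ln (1 - \hat{y}) \,\right]$$

This function is usually called the log loss (logloss); the logarithm is the natural logarithm, base e. The true value $y$ is either 0 or 1, while the predicted value $\hat{y}$, produced by the logistic function, is a continuous probability between 0 and 1. The loss can therefore be rewritten as a piecewise function:

$$L(y, \hat{y}) = \begin{cases} -\ln \hat{y}, & y = 1 \\ -\ln (1 - \hat{y}), & y = 0 \end{cases}$$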
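The two ways of writing the log loss can be compared numerically. The sketch below uses hypothetical labels and probabilities; the clipping constant `eps` is an assumption added here to avoid taking the logarithm of zero.

```python
import numpy as np


def log_loss(y_true, y_prob, eps=1e-12):
    """Mean binary log loss with the natural logarithm."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    return np.mean(-(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob)))


def log_loss_piecewise(y_true, y_prob, eps=1e-12):
    """Same loss in piecewise form: -log(p) if y == 1, else -log(1 - p)."""
    y_true = np.asarray(y_true)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    per_sample = np.where(y_true == 1, -np.log(y_prob), -np.log(1.0 - y_prob))
    return np.mean(per_sample)


y_true = np.array([1, 0, 1, 0])
y_prob = np.array([0.9, 0.1, 0.6, 0.8])  # hypothetical predicted probabilities

loss_a = log_loss(y_true, y_prob)
loss_b = log_loss_piecewise(y_true, y_prob)
confident_wrong = log_loss([1], [0.01])  # true class 1, predicted almost 0
confident_right = log_loss([1], [0.99])  # true class 1, predicted almost 1
print(loss_a, loss_b, confident_wrong, confident_right)
```

Both forms give the same value, and a confident wrong prediction is penalized far more heavily than a confident correct one.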

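1.3 Optimization

Once the loss function is fixed, the model is optimized iteratively. The learning task of LR can be written as the optimization problem

$$(w^*, b^*) = \arg\min_{w, b} J(w, b), \qquad J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[\, y_i \ln \hat{y}_i + (1 - y_i) \ln (1 - \hat{y}_i) \,\right]$$

where $J$ is a function of $w$ and $b$, $m$ is the number of samples, and $\hat{y}_i = 1 / (1 + e^{-(w^T x_i + b)})$. Likewise, it is solved by gradient descent with learning rate $\eta$, and computing the gradient requires the chain rule:

$$w \leftarrow w - \eta \frac{\partial J}{\partial w}, \qquad b \leftarrow b - \eta \frac{\partial J}{\partial b}$$

The derivation is omitted here.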
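For readers who want the skipped steps, the chain-rule derivation can be sketched as follows. The notation is assumed for this sketch: $m$ is the number of samples, $x_i$ and $y_i$ are the features and label of sample $i$, $\hat{y}_i$ is the predicted probability, and $J$ is the mean log loss.

```latex
\begin{aligned}
z_i &= w^T x_i + b, \qquad \hat{y}_i = \sigma(z_i) = \frac{1}{1 + e^{-z_i}} \\
J(w, b) &= -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \ln \hat{y}_i + (1 - y_i) \ln (1 - \hat{y}_i) \right] \\
\frac{\partial J}{\partial \hat{y}_i} &= -\frac{1}{m} \left( \frac{y_i}{\hat{y}_i} - \frac{1 - y_i}{1 - \hat{y}_i} \right)
  = \frac{1}{m} \cdot \frac{\hat{y}_i - y_i}{\hat{y}_i (1 - \hat{y}_i)} \\
\frac{\partial \hat{y}_i}{\partial z_i} &= \hat{y}_i (1 - \hat{y}_i), \qquad
\frac{\partial z_i}{\partial w} = x_i, \qquad
\frac{\partial z_i}{\partial b} = 1 \\
\frac{\partial J}{\partial w} &= \sum_{i=1}^{m} \frac{\partial J}{\partial \hat{y}_i}
  \frac{\partial \hat{y}_i}{\partial z_i} \frac{\partial z_i}{\partial w}
  = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)\, x_i \\
\frac{\partial J}{\partial b} &= \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)
\end{aligned}
```

The factor $\hat{y}_i (1 - \hat{y}_i)$ cancels, which is why the gradient has such a simple form. These expressions correspond to `d_w` and `d_b` in the `_calc_gradient` method of Section 2.1, with $\hat{y}_i$ being the predicted probability.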

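Other optimization algorithms include:

Newton's method
Conjugate gradient method
Quasi-Newton methods
BFGS method
L-BFGS (limited-memory BFGS)

Among these, BFGS and L-BFGS are both quasi-Newton methods. Compared with gradient descent they have two advantages: first, the step size does not have to be chosen by hand; second, they usually converge faster. Their drawback is that they are more complex, which makes them less convenient to implement by hand than gradient descent.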
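The "no hand-picked step size" point can be illustrated with plain Newton's method, the simplest entry in the list above (this sketch does not implement BFGS or L-BFGS). The data set, the function names, the iteration counts, and the small ridge term added to the Hessian are all assumptions made for this illustration. The bias is folded into the parameter vector by appending a column of ones.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def mean_log_loss(X, y, theta):
    p = np.clip(sigmoid(X @ theta), 1e-12, 1.0 - 1e-12)
    return np.mean(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


def fit_newton(X, y, n_iter=10, ridge=1e-6):
    """Newton's method for logistic regression; X must include a bias column."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ (p - y) / len(y)
        # Hessian of the mean log loss: X^T diag(p * (1 - p)) X / m.
        hessian = (X * (p * (1.0 - p))[:, None]).T @ X / len(y)
        hessian += ridge * np.eye(X.shape[1])  # small ridge keeps it invertible
        theta = theta - np.linalg.solve(hessian, grad)  # no learning rate needed
    return theta


def fit_gradient_descent(X, y, n_iter=10, lr=0.1):
    """Plain gradient descent with a hand-picked learning rate."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ theta) - y) / len(y)
        theta = theta - lr * grad
    return theta


# Hypothetical overlapping two-class data, so the optimum is finite.
rng = np.random.default_rng(0)
x0 = rng.normal(loc=-1.0, scale=1.5, size=(100, 2))
x1 = rng.normal(loc=1.0, scale=1.5, size=(100, 2))
X = np.hstack([np.vstack([x0, x1]), np.ones((200, 1))])  # last column = bias
y = np.concatenate([np.zeros(100), np.ones(100)])

loss_newton = mean_log_loss(X, y, fit_newton(X, y))
loss_gd = mean_log_loss(X, y, fit_gradient_descent(X, y))
print(loss_newton, loss_gd)
```

With the same number of iterations, Newton's method reaches a lower loss on this data, at the cost of forming and solving a linear system with the Hessian in every step.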

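2. Examples

2.1 Implementing It from Scratch

First, create the file logistic_regression.py and define a class for the LR model that implements its core optimization routine.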

# -*- coding: utf-8 -*-

import numpy as np


class LogisticRegression(object):

    def __init__(self, learning_rate=0.1, max_iter=100, seed=None):
        self.seed = seed
        self.lr = learning_rate
        self.max_iter = max_iter

    def fit(self, x, y):
        np.random.seed(self.seed)
        self.w = np.random.normal(loc=0.0, scale=1.0, size=x.shape[1])
        self.b = np.random.normal(loc=0.0, scale=1.0)
        self.x = x
        self.y = y
        for i in range(self.max_iter):
            self._update_step()
            # print('loss: \t{}'.format(self.loss()))
            # print('score: \t{}'.format(self.score()))
            # print('w: \t{}'.format(self.w))
            # print('b: \t{}'.format(self.b))

    def _sigmoid(self, z):
        return 1.0 / (1.0 + np.exp(-z))

    def _f(self, x, w, b):
        z = x.dot(w) + b
        return self._sigmoid(z)

    def predict_proba(self, x=None):
        if x is None:
            x = self.x
        y_pred = self._f(x, self.w, self.b)
        return y_pred

    def predict(self, x=None):
        if x is None:
            x = self.x
        y_pred_proba = self._f(x, self.w, self.b)
        y_pred = np.where(y_pred_proba < 0.5, 0, 1)
        return y_pred

    def score(self, y_true=None, y_pred=None):
        if y_true is None or y_pred is None:
            y_true = self.y
            y_pred = self.predict()
        acc = np.mean(np.asarray(y_true) == np.asarray(y_pred))
        return acc

    def loss(self, y_true=None, y_pred_proba=None):
        if y_true is None or y_pred_proba is None:
            y_true = self.y
            y_pred_proba = self.predict_proba()
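        # Clip probabilities so that log(0) cannot occur.
        y_pred_proba = np.clip(y_pred_proba, 1e-12, 1.0 - 1e-12)
        return np.mean(-1.0 * (y_true * np.log(y_pred_proba) + (1.0 - y_true) * np.log(1.0 - y_pred_proba)))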

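    def _calc_gradient(self):
        # The gradient of the log loss uses predicted probabilities, not hard 0/1 labels.
        y_pred_proba = self.predict_proba()
        d_w = (y_pred_proba - self.y).dot(self.x) / len(self.y)
        d_b = np.mean(y_pred_proba - self.y)
        return d_w, d_b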

    def _update_step(self):
        d_w, d_b = self._calc_gradient()
        self.w = self.w - self.lr * d_w
        self.b = self.b - self.lr * d_b
        return self.w, self.b
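
A quick way to gain confidence in a hand-written gradient is to compare it against finite differences of the loss. The sketch below is standalone and does not import the class above; the data, parameter values, and function names are hypothetical. It assumes the gradient is computed from predicted probabilities, as the log loss requires.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def loss(x, y, w, b):
    """Mean log loss for parameters w and b."""
    p = sigmoid(x.dot(w) + b)
    return np.mean(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


def analytic_gradient(x, y, w, b):
    """Gradient of the mean log loss, computed from probabilities."""
    p = sigmoid(x.dot(w) + b)
    d_w = (p - y).dot(x) / len(y)
    d_b = np.mean(p - y)
    return d_w, d_b


def numeric_gradient(x, y, w, b, eps=1e-6):
    """Central finite differences of the loss with respect to w and b."""
    d_w = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = eps
        d_w[j] = (loss(x, y, w + step, b) - loss(x, y, w - step, b)) / (2 * eps)
    d_b = (loss(x, y, w, b + eps) - loss(x, y, w, b - eps)) / (2 * eps)
    return d_w, d_b


# Hypothetical small data set and parameters for the check.
rng = np.random.default_rng(272)
x = rng.normal(size=(50, 2))
y = (rng.random(50) < 0.5).astype(float)
w = rng.normal(size=2)
b = 0.3

d_w, d_b = analytic_gradient(x, y, w, b)
n_w, n_b = numeric_gradient(x, y, w, b)
max_diff = max(np.max(np.abs(d_w - n_w)), abs(d_b - n_b))
print(max_diff)
```

If hard 0/1 predictions were used in place of `p`, the two gradients would no longer agree, because the update would stop being the gradient of the log loss.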

Next, create a separate file, data_helper.py, that generates the simulated data and also implements the train/test split.

# -*- coding: utf-8 -*-

import numpy as np


def generate_data(seed):
    np.random.seed(seed)
    data_size_1 = 300
    x1_1 = np.random.normal(loc=5.0, scale=1.0, size=data_size_1)
    x2_1 = np.random.normal(loc=4.0, scale=1.0, size=data_size_1)
    y_1 = [0 for _ in range(data_size_1)]
    data_size_2 = 400
    x1_2 = np.random.normal(loc=10.0, scale=2.0, size=data_size_2)
    x2_2 = np.random.normal(loc=8.0, scale=2.0, size=data_size_2)
    y_2 = [1 for _ in range(data_size_2)]
    x1 = np.concatenate((x1_1, x1_2), axis=0)
    x2 = np.concatenate((x2_1, x2_2), axis=0)
    x = np.hstack((x1.reshape(-1,1), x2.reshape(-1,1)))
    y = np.concatenate((y_1, y_2), axis=0)
    data_size_all = data_size_1+data_size_2
    shuffled_index = np.random.permutation(data_size_all)
    x = x[shuffled_index]
    y = y[shuffled_index]
    return x, y

def train_test_split(x, y):
    split_index = int(len(y)*0.7)
    x_train = x[:split_index]
    y_train = y[:split_index]
    x_test = x[split_index:]
    y_test = y[split_index:]
    return x_train, y_train, x_test, y_test

Finally, create train.py, which uses the LR class written above to perform the classification task and reports the classification accuracy.

# -*- coding: utf-8 -*-

import numpy as np
import matplotlib.pyplot as plt
import data_helper
from logistic_regression import LogisticRegression


# data generation
x, y = data_helper.generate_data(seed=272)
x_train, y_train, x_test, y_test = data_helper.train_test_split(x, y)

# visualize data
# plt.scatter(x_train[:,0], x_train[:,1], c=y_train, marker='.')
# plt.show()
# plt.scatter(x_test[:,0], x_test[:,1], c=y_test, marker='.')
# plt.show()

# data normalization (min-max scaling with statistics from the training set only)
x_min = np.min(x_train, axis=0)
x_max = np.max(x_train, axis=0)
x_train = (x_train - x_min) / (x_max - x_min)
x_test = (x_test - x_min) / (x_max - x_min)

# Logistic regression classifier
clf = LogisticRegression(learning_rate=0.1, max_iter=500, seed=272)
clf.fit(x_train, y_train)

# plot the result
split_boundary_func = lambda x: (-clf.b - clf.w[0] * x) / clf.w[1]
xx = np.arange(0.1, 0.6, 0.1)
plt.scatter(x_train[:,0], x_train[:,1], c=y_train, marker='.')
plt.plot(xx, split_boundary_func(xx), c='red')
plt.show()

# loss on test set
y_test_pred = clf.predict(x_test)
y_test_pred_proba = clf.predict_proba(x_test)
print(clf.score(y_test, y_test_pred))
print(clf.loss(y_test, y_test_pred_proba))
# print(y_test_pred_proba)

2.2 sklearn

The sklearn.linear_model module provides many models, such as logistic regression, Lasso regression, and Bayesian ridge regression, so there is plenty more to learn. This article uses LogisticRegression.

The LogisticRegression class takes 14 constructor parameters (the exact number depends on the scikit-learn version); for details see https://scikit-learn.org/dev/modules/generated/sklearn.linear_model.LogisticRegression.html
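The parameters of the installed version can be listed directly. This is a small sketch that relies on `get_params()`, which scikit-learn estimators provide; the set of names printed depends on the installed release.

```python
from sklearn.linear_model import LogisticRegression

# get_params() returns the constructor parameters and their current values.
params = LogisticRegression().get_params()
for name in sorted(params):
    print(name, '=', params[name])
print('number of parameters:', len(params))
```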

In addition, LogisticRegression provides several methods, such as fit and score, which the following code uses:

# -*- coding: utf-8 -*-
from sklearn.linear_model import LogisticRegression


def load_dataset(filename):
    """Read a tab-separated file whose last column is the label."""
    features = []
    labels = []
    with open(filename) as fr:
        for line in fr:
            curr_line = line.strip().split('\t')
            features.append([float(value) for value in curr_line[:-1]])
            labels.append(float(curr_line[-1]))
    return features, labels


def colicSklearn():
    """Build a logistic regression classifier with sklearn and print its test accuracy."""
    training_set, training_labels = load_dataset('horseColicTraining.txt')  # training set
    test_set, test_labels = load_dataset('horseColicTest.txt')              # test set
    classifier = LogisticRegression(solver='liblinear', max_iter=10).fit(training_set, training_labels)
    test_accuracy = classifier.score(test_set, test_labels) * 100
    print('Accuracy: %f%%' % test_accuracy)


if __name__ == '__main__':
    colicSklearn()
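
For readers without the horse-colic files at hand, the same methods can be tried on synthetic data. The data set below is hypothetical and stands in for the files used above; `fit`, `predict`, `predict_proba`, and `score` are methods of scikit-learn's LogisticRegression, and `coef_` and `intercept_` hold the learned $w$ and $b$.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data standing in for the horse-colic files.
rng = np.random.default_rng(0)
x_neg = rng.normal(loc=-2.0, scale=1.0, size=(100, 2))
x_pos = rng.normal(loc=2.0, scale=1.0, size=(100, 2))
X = np.vstack([x_neg, x_pos])
y = np.concatenate([np.zeros(100), np.ones(100)])

clf = LogisticRegression(solver='liblinear')
clf.fit(X, y)

labels = clf.predict(X[:3])        # hard class labels
proba = clf.predict_proba(X[:3])   # one column per class, each row sums to 1
accuracy = clf.score(X, y)         # mean accuracy on the given data
print(labels, proba, accuracy)
print(clf.coef_, clf.intercept_)   # learned w and b
```

Note that the accuracy here is measured on the training data itself, so it only shows how the methods are called, not how well the model generalizes.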
