Python 3 Binary Logistic Regression Analysis (LogisticRegression)


 

Outline

The boss asked us to add analysis methods to the project platform:

t-test (independent-samples t-test), linear regression, binary logistic regression, factor analysis, reliability analysis

I didn't understand any of it at first and was completely lost. The analytics department clearly has the real experts; I was just staring at the list blankly.

 

First, what is binary logistic regression analysis?

Binary logistic regression can be used for classification, while ordinary regression is mostly used for prediction.

 

Official introduction:

Link: https://pythonfordatascience.org/logistic-regression-python/

 

 

Logistic regression models are used to analyze the relationship between a dependent variable (DV) and independent variable(s) (IV) when the DV is dichotomous. The DV is the outcome variable, a.k.a. the predicted variable, and the IV(s) are the variables that are believed to have an influence on the outcome, a.k.a. predictor variables. If the model contains 1 IV, then it is a simple logistic regression model, and if the model contains 2+ IVs, then it is a multiple logistic regression model.

Assumptions for logistic regression models:

The DV is categorical (binary)
If there are more than 2 categories in terms of types of outcome, a multinomial logistic regression should be used
Independence of observations
Cannot be a repeated measures design, i.e. collecting outcomes at two different time points.
Independent variables are linearly related to the log odds
Absence of multicollinearity
Lack of outliers
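A note on the "linearly related to the log odds" assumption above: that is really just the definition of the model. In LaTeX notation, for a single predictor x,

\log\frac{p}{1-p} = \beta_0 + \beta_1 x
\qquad\Longleftrightarrow\qquad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}

where p is the probability that the outcome equals 1, and \beta_0 is the intercept term that turns out to matter in the code below.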

 

Once I understood what "binary" means here, I started looking for libraries.

Packages needed:

One thing worth pointing out: on the first night I used logit, but the results were wrong; then I tried the machine-learning approach and the results were still wrong when compared against the values from SPSS.

Strange. In the end I had to ask for help. The confusion was the difference between Logit and Logistic; I asked in a group chat and an expert cleared it up.

The articles referenced below also helped clear things up.

 

1. The Logit model from statsmodels (unpenalized maximum likelihood, with a full summary of p-values and confidence intervals, like SPSS)

2. The LogisticRegression class from sklearn.linear_model (a machine-learning classifier)

 

 

Let's start with the first method.

Reference article: https://blog.csdn.net/zj360202/article/details/78688070?utm_source=blogxgwz0

It explains things fairly clearly, but one thing you absolutely must pay attention to is the intercept term. That is exactly where I went wrong: I thought it was not important and left it out.

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np
from pandas import DataFrame, Series
# older tutorials import this from sklearn.cross_validation, which has been removed
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from collections import OrderedDict
# (the sklearn imports are only needed for the second method further down)

data = {
    'y': [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1],
    'x': [i for i in range(1, 21)],
}

df = DataFrame(OrderedDict(data))


df["intercept"] = 1.0  # intercept term, very important, this is exactly where I went wrong


print(df)
print("==================")
print(len(df))
print(df.columns.values)

print(df[df.columns[1:]])

# fit y against x plus the intercept column
logit = sm.Logit(df['y'], df[df.columns[1:]])
result = logit.fit()
res = result.summary2()

print(res)
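As an aside, instead of adding the intercept column by hand, statsmodels has add_constant() for exactly this. A minimal sketch with the same toy data:

import statsmodels.api as sm
import pandas as pd

X = pd.DataFrame({'x': range(1, 21)})
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1]

X = sm.add_constant(X)  # adds a 'const' column of 1.0, same as the manual intercept above
result = sm.Logit(y, X).fit()
print(result.summary())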

 

 

 

 

 

I think the version below is better, because running the code above a second time always threw this error:

statsmodels.tools.sm_exceptions.PerfectSeparationError: Perfect separation detected, results not available

 

After I changed it so that x and y each live in their own DataFrame, it inexplicably started working. (PerfectSeparationError means statsmodels detected that the predictor perfectly separates the 0s from the 1s, so the maximum-likelihood estimates do not converge.)

 

 

    obj = TwoDimensionalLogisticRegressionModel()
    data_x = obj.SelectVariableSql(UserID, ProjID, QuesID, xVariable, DatabaseName, TableName, CasesCondition)
    data_y = obj.SelectVariableSql(UserID, ProjID, QuesID, yVariable, DatabaseName, TableName, CasesCondition)
    if len(data_x) != len(data_y):
        raise MyCustomError(retcode=4011)
    obj.close()

    df_X = DataFrame(OrderedDict(data_x))
    df_Y = DataFrame(OrderedDict(data_y))

    df_X["intercept"] = 1.0  # intercept term, very important, this is exactly where I went wrong
    logit = sm.Logit(df_Y, df_X)
    result = logit.fit()
    res = result.summary()

    # take the third-to-last line of the printed summary (a coefficient row) and split it into fields
    data = [j for j in [i for i in str(res).split('\n')][-3].split(' ') if j != ''][1:]

    return data

 

Modified to allow a dichotomous numeric dummy dependent variable; if its two values are not already 0/1, they get recoded:

obj = TwoDimensionalLogisticRegressionModel()
data_x = obj.SelectVariableSql(UserID, ProjID, QuesID, xVariable, DatabaseName, TableName, CasesCondition)
data_y = obj.SelectVariableSql(UserID, ProjID, QuesID, yVariable, DatabaseName, TableName, CasesCondition)
if len(data_x) != len(data_y):
    raise MyCustomError(retcode=4011)
obj.close()

df_X = DataFrame(data_x)
df_Y = DataFrame(data_y)  # dependent variable, should end up as 0/1

df_X["intercept"] = 1.0  # intercept term, very important, this is exactly where I went wrong

YColumnList = list(df_Y[yVariable].values)
setYColumnList = sorted(set(YColumnList))  # sorted so the recoding below is deterministic (plain set order is arbitrary)
if len(setYColumnList) != 2:
    raise MyCustomError(retcode=4015)
else:
    if [0, 1] != [int(i) for i in setYColumnList]:
        # recode the two distinct values to 0/1
        newYcolumnsList = []
        for i in YColumnList:
            if i == setYColumnList[0]:
                newYcolumnsList.append(0)
            else:
                newYcolumnsList.append(1)
        df_Y = DataFrame({yVariable: newYcolumnsList})
logit = sm.Logit(df_Y, df_X)
result = logit.fit()
res = result.summary()

# take the third-to-last line of the printed summary (a coefficient row) and split it into fields
data = [j for j in [i for i in str(res).split('\n')][-3].split(' ') if j != '']

return data[1:]
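For what it's worth, the recoding loop above can be written more compactly with pandas. A minimal sketch, reusing df_Y, yVariable, and MyCustomError from the snippet above:

# recode an arbitrary two-level dependent variable to 0/1 with pandas
levels = sorted(df_Y[yVariable].unique())
if len(levels) != 2:
    raise MyCustomError(retcode=4015)
df_Y[yVariable] = df_Y[yVariable].map({levels[0]: 0, levels[1]: 1})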

 

 

Updated once more:

def TwoDimensionalLogisticRegressionDetail(UserID, ProjID, QuesID, xVariableID, yVariableID, CasesCondition):
    # relies on module-level imports: math, DataFrame from pandas, statsmodels.api as sm
    two_obj = TwoDimensionalLogisticModel()
    sql_data, xVarName, yVarName = two_obj.showdatas(UserID, ProjID, QuesID, xVariableID, yVariableID, CasesCondition)

    two_obj.close()

    df_dropna = DataFrame(sql_data).dropna()
    df_X = DataFrame()
    df_Y = DataFrame()  # dependent variable, should end up as 0/1

    df_X[xVarName] = df_dropna[xVarName]
    df_Y[yVarName] = df_dropna[yVarName]

    df_X["intercept"] = 1.0  # intercept term, very important, this is exactly where I went wrong

    YColumnList = list(df_Y[yVarName].values)
    setYColumnList = sorted(set(YColumnList))  # sorted so the recoding below is deterministic

    if len(setYColumnList) != 2:
        raise MyCustomError(retcode=4015)
    if [0, 1] != [int(i) for i in setYColumnList]:
        # recode the two distinct values to 0/1
        newYcolumnsList = []
        for i in YColumnList:
            if i == setYColumnList[0]:
                newYcolumnsList.append(0)
            else:
                newYcolumnsList.append(1)
        df_Y = DataFrame({yVarName: newYcolumnsList})
    logit = sm.Logit(df_Y, df_X)
    res = logit.fit()
    res_all = res.summary()
    # I couldn't find attributes for these values, so I parse them out of the printed summary text
    LogLikelihood = [i.strip() for i in str(res_all).split("\n")[6].split("   ") if i][3]
    index_var = [i.strip() for i in str(res_all).split("\n")[12].split("   ") if i]
    intercept = [i.strip() for i in str(res_all).split("\n")[13].split("   ") if i]
    std_err = [index_var[2], intercept[2]]
    z = [index_var[3], intercept[3]]
    P_z = [index_var[4], intercept[4]]  # significance
    interval_25 = [index_var[5], intercept[5]]
    interval_975 = [index_var[6], intercept[6]]
    Odds_Ratio = [math.e ** i for i in list(res.params)]  # exp(coef)
    return {
        "No_Observations": res.nobs,      # No. Observations
        "Pseudo_R": res.prsquared,        # Pseudo R^2
        "Log_Likelihood": LogLikelihood,  # log-likelihood
        "LLNull": res.llnull,             # log-likelihood of the null model
        "llr_pvalue": res.llr_pvalue,     # LLR p-value (significance of the model)
        "coef": list(res.params),         # coefficients
        "std_err": std_err,
        "Odds_Ratio": Odds_Ratio,
        "z": z,
        "P": P_z,                         # significance of each coefficient
        "interval_25": interval_25,       # lower 0.025 confidence bound
        "interval_975": interval_975      # upper 0.975 confidence bound
    }
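A note on the string parsing in the function above: the fitted Logit result actually exposes most of those numbers as attributes, which is less fragile than splitting the printed summary. A quick reference, where res is the object returned by logit.fit():

import numpy as np

res.llf             # log-likelihood of the fitted model
res.llnull          # log-likelihood of the null model
res.bse             # standard errors of the coefficients
res.tvalues         # z statistics
res.pvalues         # p-values (significance)
res.conf_int()      # 0.025 / 0.975 confidence bounds
np.exp(res.params)  # odds ratios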

 

 

The second method: machine learning with scikit-learn

Reference: https://zhuanlan.zhihu.com/p/34217858

#!/usr/bin/env python
# -*- coding:utf-8 -*-

from collections import OrderedDict
import pandas as pd


examDict = {
    '學習時間': [i for i in range(1, 20)],  # study time (hours)
    '通過考試': [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1]  # passed the exam (0/1)
}

examOrderDict = OrderedDict(examDict)
examDF = pd.DataFrame(examOrderDict)
# print(examDF.head())

exam_X = examDF.loc[:, "學習時間"]
exam_Y = examDF.loc[:, "通過考試"]

print(exam_X)
# print(exam_Y)

# older tutorials import this from sklearn.cross_validation, which has been removed
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(exam_X, exam_Y, train_size=0.8)

# print(X_train.values)
print(len(X_train.values))
X_train = X_train.values.reshape(-1, 1)  # sklearn expects a 2-D feature array
print(len(X_train))
print(X_train)
X_test = X_test.values.reshape(-1, 1)


from sklearn.linear_model import LogisticRegression

module_1 = LogisticRegression()
module_1.fit(X_train, y_train)

front = module_1.score(X_test, y_test)  # accuracy on the test split
print(front)

print("coef:", module_1.coef_)
print("intercept_:", module_1.intercept_)

# prediction (newer sklearn versions require a 2-D array here, hence [[3]] instead of 3)
pred1 = module_1.predict_proba([[3]])
print("predicted probabilities [N/Y]:", pred1)

pred2 = module_1.predict([[5]])
print(pred2)

 

However, this machine-learning version has a problem: with train_size=0.8 it only trains on 15 of the 19 rows.
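Another likely reason the scikit-learn numbers did not match SPSS/statsmodels is that LogisticRegression applies L2 regularization by default (C=1.0). A minimal sketch, fitting on all 19 rows with the penalty made negligible so the estimates should be close to the unpenalized statsmodels fit (examDF is the DataFrame from the script above):

from sklearn.linear_model import LogisticRegression

X_all = examDF[['學習時間']].values  # all 19 rows, as a 2-D feature array
y_all = examDF['通過考試'].values

# a very large C makes the L2 penalty negligible, approximating plain maximum likelihood
model = LogisticRegression(C=1e9)
model.fit(X_all, y_all)
print("coef:", model.coef_, "intercept:", model.intercept_)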

 

statsmodels library link:

Statsmodels:http://www.statsmodels.org/stable/index.html

 

