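Getting Started with Machine Learning
(Note: you can get started quickly with no background, but improving accuracy takes a lot more work. The terms used here are not explained in detail.)
Tools: Python with the pandas and sklearn packages. Working in a Jupyter environment is recommended.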
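Steps
1. Use pandas to load the table that the model will be trained on.
Following machine learning convention, the features are called X and the target is called y. The last column of the table is usually the target column.
2. Map the data columns to integers (sklearn's decision trees need integer or real-valued input).
3. Split the data into a training set and a test set.
4. Use sklearn to create and train the model, then measure its accuracy and other metrics on the test set.
5. Export the prediction results.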
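Steps 1 to 3 can be sketched on a small in-memory table. This is only an illustration: the toy DataFrame, its column names, and the `encoders` dictionary are made up for the example and stand in for a real CSV file loaded with `pd.read_csv`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Hypothetical toy table standing in for the CSV loaded in step 1
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue", "green", "red", "blue"],
    "size": ["S", "M", "L", "S", "M", "L", "M", "S"],
    "target": [1, 0, 1, 0, 0, 1, 1, 0],
})

# Step 1: features X are every column except the last; target y is the last column
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# Step 2: fit one LabelEncoder per column and map the strings to integer codes
encoders = {col: LabelEncoder().fit(X[col]) for col in X.columns}
X_encoded = pd.DataFrame(
    {col: encoders[col].transform(X[col]) for col in X.columns},
    index=X.index,
)

# Step 3: hold out 25% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.25, random_state=7
)

print(X_encoded.head())
print(X_train.shape, X_test.shape)
```

Keeping the fitted encoders in a dictionary makes it possible to apply the same string-to-integer mapping to new data later.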
Algorithms
1. PCA
2. LDA
3. Linear regression
4. Logistic regression
5. Naive Bayes
6. Decision tree
7. SVM
8. Neural network
9. KNN
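Several of the algorithms listed above are available in sklearn as classifiers that share the same `fit` / `score` interface, so trying a different one is mostly a matter of swapping the estimator. The sketch below is an illustration only: it uses the iris dataset bundled with sklearn rather than the table from this article, and the hyperparameters are arbitrary choices. PCA (a dimensionality-reduction method) and linear regression (for continuous targets) are left out because they are not classifiers, and the neural network is omitted for brevity.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Bundled example dataset (an assumption for this sketch, not the article's data)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=7
)

# Each estimator exposes the same fit / predict / score methods
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "naive Bayes": GaussianNB(),
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=7),
    "SVM": SVC(kernel="linear"),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # score() returns the mean accuracy on the held-out test set
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: {scores[name]:.3f}")
```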
import pandas as pd
from collections import defaultdict
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.metrics import accuracy_score

# Load the training table; 'target' is the label column
df = pd.read_csv('x_train.csv')
X = df.drop('target', axis=1)
y = df.target
#print(X.shape, X.head(10), y.shape, y.head(10))

# Convert every column to integer codes (there is room for optimization here)
d = defaultdict(LabelEncoder)
X_encoded = X.apply(lambda x: d[x.name].fit_transform(x))
#X_encoded.tail(10)

# Split into training and test sets
# (train_test_split lives in sklearn.model_selection; sklearn.cross_validation was removed)
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.25, random_state=7)
#print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

# Decision tree
clf = tree.DecisionTreeClassifier(max_depth=3)
clf = clf.fit(X_train, y_train)

# Accuracy
print(accuracy_score(y_test, clf.predict(X_test)))

# F1 score
#from sklearn import metrics
#predict_labels = clf.predict(X_test)
#F1_scores = metrics.f1_score(y_test, predict_labels, pos_label=0)
#print(F1_scores)

# Prediction
X_pred = pd.read_csv('x_test')
# Reuse the encoders fitted on the training data so the integer codes match.
# The file must have the same feature columns as X; transform() raises an
# error for a value that never appeared in the training data.
X_pred = X_pred.apply(lambda x: d[x.name].transform(x))
pred_list = clf.predict(X_pred)
pred_proba_list = clf.predict_proba(X_pred)
print(pred_list)
print(pred_proba_list)
print(type(pred_list), type(pred_proba_list))

# Column 1 of predict_proba is the probability of the second class in clf.classes_
X_pred["Proba"] = pred_proba_list[:, 1]
X_pred["Tag"] = pred_list
X_pred.head(10)
X_pred.to_csv('./predict.csv', index=False, encoding='utf-8')

# SVM alternative
#import joblib
#from sklearn.svm import SVC
## Train the model
#clf = SVC(kernel='linear')
#clf.fit(X_train, y_train)
## Save the model
#joblib.dump(clf, './model/svm_mode.pkl')
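One possible way to use the "room for optimization" in the encoding step is sklearn's `OrdinalEncoder`, which encodes all columns at once and can map categories that were not seen during training to a fixed placeholder instead of raising an error. This is a sketch under assumptions, not the method used in this article: it needs scikit-learn 0.24 or later for `handle_unknown="use_encoded_value"`, and the two small DataFrames are made up for the example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training and prediction tables with one categorical column
train = pd.DataFrame({"color": ["red", "blue", "green"]})
new = pd.DataFrame({"color": ["blue", "purple"]})  # "purple" never appears in train

# Unknown categories are encoded as -1 instead of raising an error
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit(train)

# Categories are sorted, so blue=0, green=1, red=2; output is a float array
codes = enc.transform(new)
print(codes)
```

Fitting the encoder once on the training data and only calling `transform` on new data keeps the integer codes consistent between training and prediction.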