# IMPORT
>>> import numpy
>>> from numpy import allclose
>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.ml.feature import StringIndexer
>>> from pyspark.ml.classification import RandomForestClassifier
# PREPARE DATA
>>> df = spark.createDataFrame([
...     (1.0, Vectors.dense(1.0)),
...     (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
>>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
>>> si_model = stringIndexer.fit(df)
>>> td = si_model.transform(df)
# BUILD THE MODEL
>>> rf = RandomForestClassifier(numTrees=3, maxDepth=2, labelCol="indexed", seed=42)
>>> model = rf.fit(td)
# FEATURE IMPORTANCES
>>> model.featureImportances
SparseVector(1, {0: 1.0})
重要性:
model.featureImportances
pyspark 模型簡單實例:
https://blog.csdn.net/Katherine_hsr/article/details/80988994
概率:
predictions.select("probability", "label").show(1000)
probability--->即為輸出概率
pandas 打亂樣本:
import pandas as pd
df = pd.read_excel("window regulator01 _0914新增樣本.xlsx")
df = df.sample(frac = 1) #打亂樣本
pyspark train、test 隨機劃分
train, test = labeled_v.randomSplit([0.75, 0.25])