數據挖掘_Python-Spark-Flink機器學習開發工具對比

本文轉載自查看原文 2020-10-21 20:04 742 數據分析與機器學習

不同的工具

在機器學習的常用工具中，一般的數據挖掘和數據統計分析的工具，是R語言和Python，大量的數據時候，使用的是Flink和Spark。
了解和熟悉工具的使用，對於一些數據進行探索和實現。
 本文主要是基於Python的數據挖掘和機器學習的流程，來對比Spark和Flink的機器學習包，進而通過使用其中的一種情況而熟悉其他，達到觸類旁通的效果

Python

 一般流程： 獲取數據 -> 數據預處理 -> 訓練建模 -> 模型評估 -> 預測，分類
scikit-learn ：  NumPy  SciPy  matplotlib
  管道機制實現了對全部步驟的流式化封裝和管理（streaming workflows with pipelines）
      許多算法模型串聯起來，比如將特征提取、歸一化、分類組織在一起形成一個典型的機器學習問題工作流 編程技巧的創新，而非算法的創新
     Transformer 轉換器  Estimator 估計器  Pipeline 管道
  具體
     01.Transformer 轉換器 (StandardScaler，MinMaxScaler)
     02.Estimator 估計器（LinearRegression、LogisticRegression、LASSO、Ridge），
        所有的機器學習算法模型，都被稱為估計器
     03.Pipeline 管道 將Transformer、Estimator 組合起來成為一個大模型
    	 pipeline
        使用PipeLine對數據進行預處理組成新的模型
        直接調用fit和predict方法來對pipeline中的所有算法模型進行訓練和預測
    	可以結合grid search對參數進行選擇
 示例
     eg： from sklearn.pipeline import Pipeline
     過程：
      數據歸一化(Data Normalization)  from sklearn import preprocessing
      特征選擇(Feature Selection)     from sklearn.ensemble import ExtraTreesClassifier
      算法的使用                      from sklearn.linear_model import LogisticRegression
      優化算法參數                    from sklearn.grid_search import GridSearchCV 
     one-hot編碼
	 數據集拆分
	 模型：
	  # 擬合模型
      model.fit(X_train, y_train)
     # 模型預測
      model.predict(X_test)    
     # 獲得這個模型的參數
      model.get_params()
	 模型保存和載入
	  from sklearn.externals import joblib
	# 保存模型
	  joblib.dump(model, 'model.pickle')
	#載入模型
	  model = joblib.load('model.pickle')

Spark

1.基本概念

org.apache.spark.ml 
PipelineStage
A stage in a pipeline, either an [[Estimator]] or a [[Transformer]].
Transformer
transform one dataset into another.
Estimator
estimators that fit models to data.
Model
A fitted model, i.e., a [[Transformer]] produced by an [[Estimator]].
Pipeline
A Pipeline consists of a sequence of stages, each of which is either an [[Estimator]] or a [[Transformer]]

PipelineModel
 object PipelineModel extends MLReadable[PipelineModel]
Parameter 
 被用來設置 Transformer 或者 Estimator 的參數
VectorAssembler
   CrossValidatorModel
        Params for [[CrossValidator]] and [[CrossValidatorModel]].
		Spark提供在org.apache.spark.ml.tuning包下提供了模型選擇器，可以替換參數然后比較模型輸出

2.Spark 的 Dataset

randomSplit
Randomly splits this Dataset with the provided weights.

 randomSplitAsList
 Returns a Java list that contains randomly split Dataset with the provided weights.
輸入： weights: Array[Double]
       weights: List[Double]
返回： Array[Dataset]or List
示例：
 正樣本和負樣本截取（樣本數據過多的情況）
                       double[] weights = {pos_rate,1.0-pos_rate};
                       Dataset<Row>[] arr = posSet.randomSplit(weights);
                       posSet = arr[0];
  正樣本和負樣本均衡
//合並正負樣本數據
                   Dataset<Row> dataUse = dataPos_sample.union(dataNeg_sample);   
// 定義 Pipeline 中的各個 PipelineStage ，如指標提取和轉換模型訓練等。
  有了這些處理特定問題的 Transformer 和 Estimator，
 我們就可以按照具體的處理邏輯來有序的組織 PipelineStages 並創建一個 Pipeline
 每個stage要么是一個Transformer，要么是一個Estimator。
 這些stage是按照順序執行的，輸入的dataframe當被傳入每個stage的時候會被轉換
 Pipeline pipeline = new Pipeline().setStages(Array(stage1,stage2,stage3,…))
 然后就可以把 訓練數據集 作為入參並調用 Pipeline 實例的 fit 方法來開始以流的方式來處理源訓練數據

//構建完成一個 stage piple
    Pipeline pipeline = new Pipeline().setStages(pipeArr);
	PipelineModel model = pipeline.fit(train_data);

    加載模型： PipelineModel model2 = PipelineModel.load(path);
 方式 獲得 CrossValidator 的最佳模型參數 -- 通過交叉驗證進行模型選擇
  CrossValidator rf_cv = new CrossValidator().setEstimator(pipeline)
  CrossValidatorModel rf_model = rf_cv.fit(train_data);
    加載模型： CrossValidatorModel rf_model2 = CrossValidatorModel.load(path);
	  
 eg： // Chain indexers and tree in a Pipeline.
 Pipeline pipeline = new Pipeline()
  .setStages(new PipelineStage[]{labelIndexer, featureIndexer, dt, labelConverter});

Flink

1.Flink ML

PipelineStage 
    Base class for a stage in a pipeline，and does not have any actual functionality
    Its subclasses must be either Estimator or Transformer    
Transformer
       * A transformer is a {@link PipelineStage} that transforms an input {@link Table} to a result {@link Table}.   
Estimator
        Estimators are {@link PipelineStage}s responsible for training and generating machine learning models.
Model
       A model is an ordinary {@link Transformer} except how it is created.   
 Pipeline
       A pipeline is a linear workflow which chains {@link Estimator}s and {@link Transformer}s to execute an algorithm.
     can also be used as a {@link PipelineStage} in another pipeline
   
 Params WithParams  ParamInfoFactory  ParamInfo

2.Alink

com.alibaba.alink.pipeline
 Pipeline
     A pipeline is a linear workflow which chains {@link EstimatorBase}s and {@link TransformerBase}s to
  * execute an algorithm.
     public class Pipeline extends EstimatorBase<Pipeline, PipelineModel> 
 PipelineModel
      public class PipelineModel extends ModelBase<PipelineModel> implements LocalPredictable {
 PipelineStageBase
      The base class for a stage in a pipeline, either an [[EstimatorBase]] or a [[TransformerBase]].
 EstimatorBase
    public abstract class EstimatorBase<E extends EstimatorBase<E, M>, M extends ModelBase<M>> extends PipelineStageBase<E> implements Estimator<E, M>
 TransformerBase 
     public abstract class TransformerBase<T extends TransformerBase<T>>  extends PipelineStageBase<T> implements Transformer<T>
 VectorAssembler
     VectorAssembler is a transformer that combines a given list of columns

參考

源碼

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 python數據可視化、數據挖掘、機器學習、深度學習常用庫、IDE等數據挖掘、機器學習書籍推薦！！ .NET數據挖掘與機器學習開源框架數據挖掘優秀工具對比初識TPOT：一個基於Python的自動化機器學習開發工具【原】數據分析/數據挖掘/機器學習---- 必讀書目機器學習&數據挖掘筆記_21（PGM練習五：圖模型的近似推理）機器學習與數據挖掘期末考試復習重點整理機器學習&數據挖掘筆記_10（高斯過程簡單理解）【目錄】數據挖掘與機器學習相關算法文章總目錄