Pipelines in Spark MLlib


A Spark MLlib pipeline chains several machine learning algorithms into a single workflow and runs them one after another.

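Each algorithm in a Pipeline is called a "PipelineStage". There are two kinds of PipelineStage, Estimator and Transformer:
  • A Transformer converts data into another form (for example, by changing its format) so that a later Estimator can use it; its common conversion method is transform.
  • An Estimator produces a Model from data (a Model is itself a subclass of Transformer); its common entry point is fit.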
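The relationship between the two stage types can be sketched in plain Scala without any Spark dependency. The traits and objects below (Stage, Doubler, MeanCenterer, ToyPipeline) are hypothetical stand-ins invented for illustration, not the Spark classes; they only mimic the fit/transform contract on a Seq[Double] instead of a dataset.

```scala
// Toy stand-ins for the Spark ML abstractions (hypothetical, not the Spark API).
sealed trait Stage

trait Transformer extends Stage {
  def transform(data: Seq[Double]): Seq[Double]
}

trait Estimator extends Stage {
  // Fitting yields a model, and the model is itself a Transformer.
  def fit(data: Seq[Double]): Transformer
}

// A Transformer: converts the data without learning anything from it.
object Doubler extends Transformer {
  def transform(data: Seq[Double]): Seq[Double] = data.map(_ * 2)
}

// An Estimator: learns the mean, then returns a model that subtracts it.
object MeanCenterer extends Estimator {
  def fit(data: Seq[Double]): Transformer = {
    val mean = if (data.isEmpty) 0.0 else data.sum / data.size
    new Transformer {
      def transform(d: Seq[Double]): Seq[Double] = d.map(_ - mean)
    }
  }
}

object ToyPipeline {
  // Runs the stages in order: Estimators are fitted on the data seen so far,
  // and each resulting Transformer feeds its output to the next stage.
  def fit(stages: Seq[Stage], data: Seq[Double]): Seq[Transformer] = {
    val (fitted, _) = stages.foldLeft((Vector.empty[Transformer], data)) {
      case ((acc, current), stage) =>
        val model = stage match {
          case e: Estimator   => e.fit(current)
          case t: Transformer => t
        }
        (acc :+ model, model.transform(current))
    }
    fitted
  }

  // Applies the fitted transformers in order to new data.
  def transform(fitted: Seq[Transformer], data: Seq[Double]): Seq[Double] =
    fitted.foldLeft(data)((d, t) => t.transform(d))
}
```

For example, `ToyPipeline.fit(Seq(Doubler, MeanCenterer), Seq(1.0, 2.0, 3.0))` returns two transformers: Doubler itself and the model fitted by MeanCenterer on the doubled data.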

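A "composite" algorithm can then be packaged as a pipeline. The benefit is that algorithms become easy to swap. For example, an application often only asks for a general capability such as "classification" or "fitting", without knowing in advance which classification or fitting algorithm performs best. With the pipeline mechanism, it is easy to swap in different classification and fitting algorithms and keep whichever one gives the best result.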
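A minimal sketch of swapping the final algorithm, assuming Spark 2.x or later with spark-mllib on the classpath (SparkSession does not exist in the older release the source excerpt comes from). The object name, column names, toy data, and the choice of LogisticRegression and NaiveBayes as candidates are all assumptions made for illustration.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.{LogisticRegression, NaiveBayes}
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object SwapClassifierSketch {
  // Feature stages stay the same; only the last stage (the classifier) changes.
  def buildPipeline(classifier: PipelineStage): Pipeline = {
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    new Pipeline().setStages(Array[PipelineStage](tokenizer, hashingTF, classifier))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pipeline-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical toy training data: (text, label).
    val training = Seq(
      ("spark pipeline stages", 1.0),
      ("hadoop map reduce", 0.0),
      ("spark mllib estimator", 1.0),
      ("plain old batch job", 0.0)
    ).toDF("text", "label")

    // Candidate classifiers to try in the same pipeline slot.
    val candidates: Seq[(String, PipelineStage)] = Seq(
      "logistic regression" -> new LogisticRegression().setMaxIter(10),
      "naive Bayes" -> new NaiveBayes()
    )

    for ((name, classifier) <- candidates) {
      // fit returns a PipelineModel, which is a Transformer.
      val model = buildPipeline(classifier).fit(training)
      println(name)
      model.transform(training).select("text", "prediction").show(false)
    }

    spark.stop()
  }
}
```

In a real comparison the candidates would be scored on held-out data rather than on the training set; that step is left out here to keep the sketch short.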

/**
* :: Experimental ::
* A simple pipeline, which acts as an estimator. A Pipeline consists of a sequence of stages, each
* of which is either an [[Estimator]] or a [[Transformer]]. When [[Pipeline#fit]] is called, the
* stages are executed in order. If a stage is an [[Estimator]], its [[Estimator#fit]] method will
* be called on the input dataset to fit a model. Then the model, which is a transformer, will be
* used to transform the dataset as the input to the next stage. If a stage is a [[Transformer]],
* its [[Transformer#transform]] method will be called to produce the dataset for the next stage.
* The fitted model from a [[Pipeline]] is an [[PipelineModel]], which consists of fitted models and
* transformers, corresponding to the pipeline stages. If there are no stages, the pipeline acts as
* an identity transformer.
*/
@Experimental
class Pipeline(override val uid: String) extends Estimator[PipelineModel] {





