1. VectorAssembler
package com.home.spark.ml

import org.apache.spark.SparkConf
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

/**
  * VectorAssembler is a transformer that combines a given list of columns into a single vector column.
  * It is useful for combining raw features and features generated by different feature transformers
  * into a single feature vector, in order to train ML models such as logistic regression and decision trees.
  *
  * VectorAssembler accepts the following input column types: all numeric types, boolean type, and vector type.
  * In each row, the values of the input columns are concatenated into a vector in the specified order.
  **/
object Ex_VectorAssembler {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf(true).setMaster("local[2]").setAppName("spark ml")
    val spark = SparkSession.builder().config(conf).getOrCreate()

    // A single example row: id, hour, mobile, userFeatures (a vector), clicked
    val dataset = spark.createDataFrame(
      Seq((0, 18, 1.0, Vectors.dense(0.0, 10.0, 0.5), 1.0))
    ).toDF("id", "hour", "mobile", "userFeatures", "clicked")

    // Combine three input columns into one vector column named "features"
    val assembler = new VectorAssembler()
      .setInputCols(Array("hour", "mobile", "userFeatures"))
      .setOutputCol("features")

    val output = assembler.transform(dataset)
    println("Assembled columns 'hour', 'mobile', 'userFeatures' to vector column 'features'")
    output.select("*").show(false)

    spark.stop()
  }
}
Assembled columns 'hour', 'mobile', 'userFeatures' to vector column 'features'
+---+----+------+--------------+-------+-----------------------+
|id |hour|mobile|userFeatures  |clicked|features               |
+---+----+------+--------------+-------+-----------------------+
|0  |18  |1.0   |[0.0,10.0,0.5]|1.0    |[18.0,1.0,0.0,10.0,0.5]|
+---+----+------+--------------+-------+-----------------------+
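The concatenation that produces the features column above can be sketched in plain Java, without Spark. This is only an illustration of the "concatenate in the specified order" rule, not Spark's implementation; the class name AssembleSketch and the method assemble are hypothetical, and each scalar column is modeled as a one-element array.

```java
import java.util.Arrays;

/** Plain-Java sketch of what VectorAssembler does to one row (not Spark code). */
final class AssembleSketch {

    /** Concatenates the given per-column values, in order, into one flat vector. */
    static double[] assemble(double[]... columns) {
        int size = 0;
        for (double[] column : columns) {
            size += column.length;
        }
        double[] features = new double[size];
        int offset = 0;
        for (double[] column : columns) {
            System.arraycopy(column, 0, features, offset, column.length);
            offset += column.length;
        }
        return features;
    }

    public static void main(String[] args) {
        double[] hour = {18};                      // integer column, widened to double
        double[] mobile = {1.0};                   // double column
        double[] userFeatures = {0.0, 10.0, 0.5};  // vector column
        System.out.println(Arrays.toString(assemble(hour, mobile, userFeatures)));
        // prints [18.0, 1.0, 0.0, 10.0, 0.5]
    }
}
```

Swapping the argument order changes the layout of the resulting vector, which is why the order passed to setInputCols matters.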
2. VectorIndexer
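Main purpose: to improve the classification performance of ML methods such as decision trees and random forests.
VectorIndexer indexes the categorical (discrete-valued) features in the feature vectors of a dataset.
It automatically decides which features are categorical and indexes them. This is controlled by the maxCategories parameter: if the number of distinct values of a feature is less than or equal to maxCategories, its values are re-indexed to 0~K (K<=maxCategories-1). If the number of distinct values is greater than maxCategories, the feature is treated as continuous and is not re-indexed (it is left unchanged).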
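// Define the input and output columns and set the maximum number of categories to 5;
// a feature (i.e. one position of the vector) with more than 5 distinct values is treated as continuous.
// `data` is assumed here to be a Dataset<Row> that has a "features" vector column;
// fit() is what turns the VectorIndexer estimator into a VectorIndexerModel.
VectorIndexerModel featureIndexerModel = new VectorIndexer()
    .setInputCol("features")
    .setMaxCategories(5)
    .setOutputCol("indexedFeatures")
    .fit(data);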
+-------------------------+-------------------------+
|features                 |indexedFeatures          |
+-------------------------+-------------------------+
|(3,[0,1,2],[2.0,5.0,7.0])|(3,[0,1,2],[2.0,1.0,1.0])|
|(3,[0,1,2],[3.0,5.0,9.0])|(3,[0,1,2],[3.0,1.0,2.0])|
|(3,[0,1,2],[4.0,7.0,9.0])|(3,[0,1,2],[4.0,3.0,2.0])|
|(3,[0,1,2],[2.0,4.0,9.0])|(3,[0,1,2],[2.0,0.0,2.0])|
|(3,[0,1,2],[9.0,5.0,7.0])|(3,[0,1,2],[9.0,1.0,1.0])|
|(3,[0,1,2],[2.0,5.0,9.0])|(3,[0,1,2],[2.0,1.0,2.0])|
|(3,[0,1,2],[3.0,4.0,9.0])|(3,[0,1,2],[3.0,0.0,2.0])|
|(3,[0,1,2],[8.0,4.0,9.0])|(3,[0,1,2],[8.0,0.0,2.0])|
|(3,[0,1,2],[3.0,6.0,2.0])|(3,[0,1,2],[3.0,2.0,0.0])|
|(3,[0,1,2],[5.0,9.0,2.0])|(3,[0,1,2],[5.0,4.0,0.0])|
+-------------------------+-------------------------+
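Analysis of the results: each feature vector contains 3 features: feature 0, feature 1, and feature 2. In row 1, for example, the feature values 2.0, 5.0, 7.0 are transformed into 2.0, 1.0, 1.0.
Only feature 1 and feature 2 were transformed; feature 0 was not. This is because feature 0 takes 6 distinct values (2,3,4,5,8,9), which is more than the setMaxCategories(5) setting above, so it is treated as continuous and left unchanged.
For feature 1, (4,5,6,7,9)-->(0,1,2,3,4)
For feature 2, (2,7,9)-->(0,1,2)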
In one sentence: number of distinct values less than or equal to maxCategories => indexed; greater than maxCategories => unchanged.
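The rule can be checked against the table above with a small plain-Java sketch that does not use Spark. It is not Spark's implementation: the class IndexerSketch and its methods are hypothetical, and it assumes indices are assigned in ascending order of the distinct values, which is consistent with the mappings shown in the table.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

/** Plain-Java sketch of the maxCategories rule (not Spark's implementation). */
final class IndexerSketch {

    /**
     * Returns a value-to-index map for one feature when it has at most
     * maxCategories distinct values, or null when it should stay continuous.
     */
    static Map<Double, Integer> categoryMap(double[] column, int maxCategories) {
        TreeSet<Double> distinct = new TreeSet<>();
        for (double v : column) {
            distinct.add(v);
        }
        if (distinct.size() > maxCategories) {
            return null; // too many distinct values: treat as continuous
        }
        Map<Double, Integer> map = new HashMap<>();
        int index = 0;
        for (double v : distinct) { // ascending order (assumed), smallest value gets 0
            map.put(v, index++);
        }
        return map;
    }

    /** Applies the rule to the values of a single feature across all rows. */
    static double[] index(double[] column, int maxCategories) {
        Map<Double, Integer> map = categoryMap(column, maxCategories);
        if (map == null) {
            return column.clone(); // continuous feature: values are left unchanged
        }
        double[] out = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            out[i] = map.get(column[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        // The three features from the table above, read column by column.
        double[] feature0 = {2, 3, 4, 2, 9, 2, 3, 8, 3, 5}; // 6 distinct values
        double[] feature1 = {5, 5, 7, 4, 5, 5, 4, 4, 6, 9}; // 5 distinct values
        double[] feature2 = {7, 9, 9, 9, 7, 9, 9, 9, 2, 2}; // 3 distinct values
        System.out.println(Arrays.toString(index(feature0, 5)));
        System.out.println(Arrays.toString(index(feature1, 5)));
        System.out.println(Arrays.toString(index(feature2, 5)));
    }
}
```

With maxCategories set to 5, feature 0 comes back unchanged, while features 1 and 2 reproduce the indexedFeatures values in the table row by row.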