1. One-Hot Encoding


- Gender: ["male","female"]
- Region: ["Europe","US","Asia"]
- Browser: ["Firefox","Chrome","Safari","Internet Explorer"]
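
To make the idea concrete, here is a minimal plain-Scala sketch (no Spark required) of how one sample drawn from the three features above could be one-hot encoded. The object name `OneHotSketch` and the helpers `oneHot` and `encode` are hypothetical names introduced only for this illustration, and the index of each category is assumed to be its position in the list.

```scala
object OneHotSketch {
  // Category lists copied from the text above; the position in each list is the index.
  val gender  = Seq("male", "female")
  val region  = Seq("Europe", "US", "Asia")
  val browser = Seq("Firefox", "Chrome", "Safari", "Internet Explorer")

  // Hypothetical helper: all zeros except a single 1 at the position of `value`.
  def oneHot(categories: Seq[String], value: String): Seq[Int] = {
    val idx = categories.indexOf(value)
    require(idx >= 0, s"unknown category: $value")
    categories.indices.map(i => if (i == idx) 1 else 0)
  }

  // Encode one sample by concatenating the per-feature vectors (2 + 3 + 4 = 9 entries).
  def encode(g: String, r: String, b: String): Seq[Int] =
    oneHot(gender, g) ++ oneHot(region, r) ++ oneHot(browser, b)

  def main(args: Array[String]): Unit = {
    println(encode("male", "US", "Internet Explorer").mkString("[", ", ", "]"))
    // prints [1, 0, 0, 1, 0, 0, 0, 0, 1]
  }
}
```

Each feature contributes exactly one 1, so every encoded sample sums to the number of features (3 here).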
2. How One-Hot Encoding Is Applied
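The purpose of One-Hot Encoding is to turn a categorical feature into a numeric feature vector: each category is mapped to a vector with a single 1 at that category's position and 0 everywhere else. In Spark MLlib this is done with StringIndexer followed by OneHotEncoder: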
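package Spark_MLlib

import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.SparkSession

object 特征變換_OneHotEncoder {
  val spark = SparkSession.builder().master("local[2]").appName("IndexToString").getOrCreate()
  import spark.implicits._

  def main(args: Array[String]): Unit = {
    // Training data: 4 distinct labels (log, text, soyo, hadoop)
    val df = spark.createDataFrame(Seq(
      (0, "log"), (1, "text"), (2, "text"), (3, "soyo"), (4, "text"),
      (5, "log"), (6, "log"), (7, "log"), (8, "hadoop")
    )).toDF("id", "label")

    // Second data set that contains only some of the training labels
    val df2 = spark.createDataFrame(Seq(
      (0, "log"), (1, "soyo"), (2, "soyo")
    )).toDF("id", "label")

    // StringIndexer maps each label to a numeric index (the most frequent label gets 0.0)
    val indexer = new StringIndexer().setInputCol("label").setOutputCol("label_index")
    val model = indexer.fit(df)

    val indexed1 = model.transform(df) // here the test data is df itself
    indexed1.show()
    val indexed = model.transform(df2) // test data switched to df2

    // setDropLast(false): the label that would otherwise be encoded as the
    // all-zero vector also gets its own binary feature.
    // Note: calling transform directly on OneHotEncoder is the Spark 2.x API;
    // from Spark 3.0 on, OneHotEncoder is an Estimator and must be fit first.
    val encoder = new OneHotEncoder()
      .setInputCol("label_index")
      .setOutputCol("lable_vector")
      .setDropLast(false)

    val encodered1 = encoder.transform(indexed1)
    encodered1.show()

    // e.g. (4,[2],[1.0]): the 4 means the training data has 4 distinct labels
    val encodered = encoder.transform(indexed)
    encodered.show()
  }
}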
Output:
+---+------+-----------+
| id| label|label_index|
+---+------+-----------+
|  0|   log|        0.0|
|  1|  text|        1.0|
|  2|  text|        1.0|
|  3|  soyo|        2.0|
|  4|  text|        1.0|
|  5|   log|        0.0|
|  6|   log|        0.0|
|  7|   log|        0.0|
|  8|hadoop|        3.0|
+---+------+-----------+

+---+------+-----------+-------------+
| id| label|label_index| lable_vector|
+---+------+-----------+-------------+
|  0|   log|        0.0|(4,[0],[1.0])|
|  1|  text|        1.0|(4,[1],[1.0])|
|  2|  text|        1.0|(4,[1],[1.0])|
|  3|  soyo|        2.0|(4,[2],[1.0])|
|  4|  text|        1.0|(4,[1],[1.0])|
|  5|   log|        0.0|(4,[0],[1.0])|
|  6|   log|        0.0|(4,[0],[1.0])|
|  7|   log|        0.0|(4,[0],[1.0])|
|  8|hadoop|        3.0|(4,[3],[1.0])|
+---+------+-----------+-------------+

+---+-----+-----------+-------------+
| id|label|label_index| lable_vector|
+---+-----+-----------+-------------+
|  0|  log|        0.0|(4,[0],[1.0])|
|  1| soyo|        2.0|(4,[2],[1.0])|
|  2| soyo|        2.0|(4,[2],[1.0])|
+---+-----+-----------+-------------+
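
The values in the lable_vector column are sparse vectors printed as (size, [indices], [values]), so (4,[2],[1.0]) stands for the dense vector [0.0, 0.0, 1.0, 0.0]. The plain-Scala sketch below expands that notation and illustrates what the dropLast setting changes. It does not use Spark's own vector classes: `SparseVectorSketch`, `Sparse`, `toDense` and `oneHot` are hypothetical names introduced only for this illustration.

```scala
object SparseVectorSketch {
  // Hypothetical stand-in for the (size, indices, values) triple printed by show().
  final case class Sparse(size: Int, indices: Seq[Int], values: Seq[Double]) {
    // Expand to a dense vector: every position not listed in `indices` is 0.0.
    def toDense: Seq[Double] = {
      val lookup = indices.zip(values).toMap
      (0 until size).map(i => lookup.getOrElse(i, 0.0))
    }
  }

  // One-hot encode a category index. With dropLast = true the last category
  // gets no slot of its own and is represented by the all-zero vector.
  def oneHot(index: Int, numCategories: Int, dropLast: Boolean): Sparse = {
    require(index >= 0 && index < numCategories, s"index out of range: $index")
    val size = if (dropLast) numCategories - 1 else numCategories
    if (index < size) Sparse(size, Seq(index), Seq(1.0))
    else Sparse(size, Seq.empty, Seq.empty)
  }

  def main(args: Array[String]): Unit = {
    // "soyo" has label_index 2 among 4 labels, i.e. (4,[2],[1.0])
    println(oneHot(2, 4, dropLast = false).toDense.mkString("[", ", ", "]"))
    // prints [0.0, 0.0, 1.0, 0.0]

    // "hadoop" has label_index 3; with dropLast = true it becomes all zeros
    println(oneHot(3, 4, dropLast = true).toDense.mkString("[", ", ", "]"))
    // prints [0.0, 0.0, 0.0]
  }
}
```

With dropLast = false, as in the code above, every label keeps its own position, which is why "hadoop" appears as (4,[3],[1.0]) rather than as an all-zero vector of size 3.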