Spark Machine Learning: Word2Vec


package Spark_MLlib

import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.SparkSession

object 特征抽取_Word2Vec {
  // Local Spark session for this demo
  val spark = SparkSession.builder().master("local").appName("Word2Vec").getOrCreate()
  import spark.implicits._

  def main(args: Array[String]): Unit = {
    // Each row is one document, already split into an array of words
    val documentDF = spark.createDataFrame(Seq(
      "soyo like spark and hadoop".split(" "),
      "scala is good tool to study".split(" "),
      "but java i want to study and spark".split(" "),
      "soyo like spark and hadoop ".split(" ")
    ).map(Tuple1.apply)).toDF("text")

    val word2Vec = new Word2Vec()
      .setInputCol("text")
      .setOutputCol("result")
      .setVectorSize(5) // set the feature vector dimension to 5
      .setMinCount(0)   // keep every word, even if it appears only once

    val word2Vec_model = word2Vec.fit(documentDF)     // train the model
    val result = word2Vec_model.transform(documentDF) // convert each document into a feature vector
    result.show(false)
  }
}
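Result: identical documents get identical feature vectors, and the more similar two documents are, the closer their vectors tend to lie in feature space. (The model's transform averages the vectors of the words in each document.)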
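+-------------------------------------------+-------------------------------------------------------------------------------------------------------------+
|text                                       |result                                                                                                       |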
+-------------------------------------------+-------------------------------------------------------------------------------------------------------------+
|[soyo, like, spark, and, hadoop]           |[0.010919421538710596,-0.013777335733175279,0.02715198565274477,-0.010085364431142808,0.019428260042332113]  |
|[scala, is, good, tool, to, study]         |[-0.048216115372876324,-0.00931493720660607,0.0237591746263206,0.04614267808695634,0.018560086687405903]     |
|[but, java, i, want, to, study, and, spark]|[0.025922087021172047,-0.027650322022964247,0.029493116540834308,-0.029830976389348507,-0.025802675168961287]|
|[soyo, like, spark, and, hadoop]           |[0.010919421538710596,-0.013777335733175279,0.02715198565274477,-0.010085364431142808,0.019428260042332113]  |
+-------------------------------------------+-------------------------------------------------------------------------------------------------------------+

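The first and fourth rows (highlighted in red in the original post) hold the same document, so their result vectors are identical.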
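To make "closer in feature space" concrete, one common measure is cosine similarity. The sketch below is plain Scala with no Spark dependency; it copies the four result vectors from the table above and compares them. The object name `VectorSimilarity` and the helper `cosine` are hypothetical names chosen for illustration, not part of the original post or of the Spark API, and cosine similarity is only one possible choice of distance.

```scala
// Hypothetical helper for illustration; not part of the original post.
object VectorSimilarity {
  // Result vectors copied from the output table (rows 1 to 4).
  val doc1: Array[Double] = Array(0.010919421538710596, -0.013777335733175279,
    0.02715198565274477, -0.010085364431142808, 0.019428260042332113)
  val doc2: Array[Double] = Array(-0.048216115372876324, -0.00931493720660607,
    0.0237591746263206, 0.04614267808695634, 0.018560086687405903)
  val doc3: Array[Double] = Array(0.025922087021172047, -0.027650322022964247,
    0.029493116540834308, -0.029830976389348507, -0.025802675168961287)
  val doc4: Array[Double] = doc1.clone() // row 4 is identical to row 1 in the table

  // Cosine similarity: dot(a, b) / (|a| * |b|).
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    dot / (normA * normB)
  }

  def main(args: Array[String]): Unit = {
    println(f"doc1 vs doc4: ${cosine(doc1, doc4)}%.4f") // identical documents
    println(f"doc1 vs doc3: ${cosine(doc1, doc3)}%.4f") // share "spark" and "and"
    println(f"doc1 vs doc2: ${cosine(doc1, doc2)}%.4f") // no words in common
  }
}
```

With a corpus this small and a vector size of 5, the absolute values should not be over-interpreted; the point is only that the identical pair scores highest.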

 

