Spark ML Cluster Analysis: k-means||

Reposted article, dated 2016-08-16 15:02 (Distributed)

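Today I updated the Spark environment on my machine, because the last time I ran the new pipeline, some of the packages turned out not to be supported in 1.6.1.

Only the user environment variables on the system need to be changed.

Then I created a new PyDev project in Eclipse with Python 3 as the interpreter, replaced the three old libraries it was linked against, and finally updated the environment variables in Eclipse.

After that I started reading the new documentation.

Address: http://spark.apache.org/docs/latest/ml-clustering.html

This time the topic is clustering.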

1. K-means

MLlib implements a parallelized variant of the k-means++ method, called kmeans||.
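To make the seeding idea concrete, here is a minimal pure-Python sketch of sequential k-means++ seeding. It is not MLlib's code: the function names are made up for illustration, and kmeans|| differs in that it samples many candidate centers per round in parallel instead of picking one at a time.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_seed(points, k, rng):
    """Pick k initial centers with k-means++ seeding.

    The first center is chosen uniformly at random. Each later center is
    drawn with probability proportional to its squared distance from the
    nearest center chosen so far. Assumes at least k distinct points.
    """
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


# The six points from sample_kmeans_data.txt.
points = [
    (0.0, 0.0, 0.0), (0.1, 0.1, 0.1), (0.2, 0.2, 0.2),
    (9.0, 9.0, 9.0), (9.1, 9.1, 9.1), (9.2, 9.2, 9.2),
]
seeds = kmeans_pp_seed(points, k=2, rng=random.Random(1))
print(seeds)
```

Because an already chosen point has weight zero, it is never drawn again, and far-away points are strongly favored as the next center.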

In Spark ML, KMeans is an Estimator.

Input: featuresCol

Output: predictionCol
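The Estimator contract can be pictured with a toy analogue in plain Python: fit() learns centers from the features column and returns a model, and the model's transform() adds the prediction column. ToyKMeans and ToyKMeansModel are hypothetical stand-ins written for this note, not pyspark classes, and the naive "first k points" initialization is only an assumption to keep the sketch short.

```python
def _sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _nearest(point, centers):
    """Index of the center closest to point (first one wins on ties)."""
    return min(range(len(centers)), key=lambda i: _sq_dist(point, centers[i]))


class ToyKMeansModel:
    """Hypothetical fitted model: holds centers, adds a prediction column."""

    def __init__(self, centers, features_col, prediction_col):
        self.centers = centers
        self.features_col = features_col
        self.prediction_col = prediction_col

    def transform(self, rows):
        # Return copies of the rows with the predicted cluster index added.
        return [
            dict(row, **{self.prediction_col: _nearest(row[self.features_col], self.centers)})
            for row in rows
        ]


class ToyKMeans:
    """Hypothetical estimator: fit() returns a ToyKMeansModel."""

    def __init__(self, k, features_col="features", prediction_col="prediction", max_iter=20):
        self.k = k
        self.features_col = features_col
        self.prediction_col = prediction_col
        self.max_iter = max_iter

    def fit(self, rows):
        points = [row[self.features_col] for row in rows]
        centers = [list(p) for p in points[: self.k]]  # naive initialization
        for _ in range(self.max_iter):
            groups = [[] for _ in centers]
            for p in points:
                groups[_nearest(p, centers)].append(p)
            # Move each center to the mean of its group; keep it if the group is empty.
            updated = [
                [sum(col) / len(g) for col in zip(*g)] if g else c
                for g, c in zip(groups, centers)
            ]
            if updated == centers:
                break
            centers = updated
        return ToyKMeansModel(centers, self.features_col, self.prediction_col)


rows = [{"features": [0.0]}, {"features": [0.2]}, {"features": [9.0]}, {"features": [9.2]}]
model = ToyKMeans(k=2).fit(rows)
predicted = model.transform(rows)
print([row["prediction"] for row in predicted])
```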

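When I ran the example code, I hit an error:

Relative path in absolute URI

In other words, a relative path appeared inside an absolute URI (Uniform Resource Identifier).

Following the reference below:

http://stackoverflow.com/questions/38669206/spark-2-0-relative-path-in-absolute-uri-spark-warehouse

the fix is to pass one extra path setting, spark.sql.warehouse.dir, when building the SparkSession.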

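The reason is this exception:

pyspark.sql.utils.IllegalArgumentException: 'java.net.URISyntaxException: Relative path in absolute URI: file:D:/software/spark-2.0.0-bin-hadoop2.7/examples/src/main/python/ml/spark-warehouse'

What actually happens is that spark.sql.warehouse.dir is resolved against the current working directory.

Setting it explicitly presumably turns it into an absolute path directly.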
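The difference between the failing URI and a well-formed one can be inspected from Python. As far as I understand, the Java side rejects a URI that has a scheme but whose path does not start with "/". The sketch below only illustrates that with urllib; to_file_uri is a hypothetical helper written for this note, not a Spark or Hadoop function.

```python
from urllib.parse import quote, urlparse


def to_file_uri(windows_path):
    """Turn a Windows path such as D:\\dir or D:/dir into a file:/// URI.

    Hypothetical helper: normalizes backslashes and prefixes file:///
    so that the path component of the URI starts with a slash.
    """
    normalized = windows_path.replace("\\", "/")
    return "file:///" + quote(normalized, safe="/:")


# The URI from the exception message: scheme present, path without a leading slash.
failing = ("file:D:/software/spark-2.0.0-bin-hadoop2.7/"
           "examples/src/main/python/ml/spark-warehouse")
failing_path = urlparse(failing).path

# A URI of the form used for spark.sql.warehouse.dir in the example.
warehouse_uri = to_file_uri("D:/software/spark-2.0.0-bin-hadoop2.7")
warehouse_path = urlparse(warehouse_uri).path

print(failing_path.startswith("/"), warehouse_path.startswith("/"))
```

The first path begins with the drive letter, so it counts as relative; the second begins with a slash, so it is absolute.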

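Next, the whole data folder also has to be copied into the current ml folder.

That way the relative paths in the example program do not need to be changed,

because I found that ../ could not get from the current execution path to the data path in the Spark directory.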
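The copy step can also be scripted. This is a small sketch under the assumption that the Spark install has a data folder at its top level and that the script runs from the ml folder; copy_data_dir is a hypothetical helper name.

```python
import shutil
from pathlib import Path


def copy_data_dir(spark_home, work_dir):
    """Copy <spark_home>/data to <work_dir>/data unless it is already there.

    Returns the destination path so the caller can check what was created.
    """
    src = Path(spark_home) / "data"
    dst = Path(work_dir) / "data"
    if not dst.exists():
        shutil.copytree(src, dst)
    return dst
```

A call such as copy_data_dir("D:/software/spark-2.0.0-bin-hadoop2.7", ".") from inside the ml folder would then leave data/mllib/sample_kmeans_data.txt reachable through the relative path used in the example.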

    
    
    
            
     
     
     
    from __future__ import print_function

    # $example on$
    from pyspark.ml.clustering import KMeans
    # $example off$

    from pyspark.sql import SparkSession
    from pyspark.tests import SPARK_HOME  # not used below

    """
    An example demonstrating k-means clustering.
    Run with:
      bin/spark-submit examples/src/main/python/ml/kmeans_example.py

    This example requires NumPy (http://www.numpy.org/).
    """


    if __name__ == "__main__":

        # Set spark.sql.warehouse.dir explicitly to avoid the
        # "Relative path in absolute URI" error described above.
        spark = SparkSession\
            .builder\
            .appName("PythonKMeansExample")\
            .config('spark.sql.warehouse.dir', 'file:///D:/software/spark-2.0.0-bin-hadoop2.7')\
            .getOrCreate()

        # $example on$
        # Loads data.
        # The data folder must be copied into the current execution path, i.e. the ml folder.
        dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

        # Trains a k-means model.
        kmeans = KMeans().setK(2).setSeed(1)
        model = kmeans.fit(dataset)

        # Evaluate clustering by computing Within Set Sum of Squared Errors.
        wssse = model.computeCost(dataset)
        print("Within Set Sum of Squared Errors = " + str(wssse))

        # Shows the result.
        centers = model.clusterCenters()
        print("Cluster Centers: ")
        for center in centers:
            print(center)
        # $example off$

        spark.stop()

    '''
    sample_kmeans_data.txt
    0 1:0.0 2:0.0 3:0.0
    1 1:0.1 2:0.1 3:0.1
    2 1:0.2 2:0.2 3:0.2
    3 1:9.0 2:9.0 3:9.0
    4 1:9.1 2:9.1 3:9.1
    5 1:9.2 2:9.2 3:9.2
    '''

    '''
    Within Set Sum of Squared Errors = 0.11999999999994547
    Cluster Centers: 
    [ 0.1 0.1 0.1]
    [ 9.1 9.1 9.1]
    '''
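The reported cost can be cross-checked without Spark. The sketch below parses the six libsvm lines, takes the two printed centers as given, and sums each point's squared distance to its nearest center. The helper names are my own, and the parser only handles the simple "label index:value" layout seen in this file.

```python
def parse_libsvm_line(line):
    """Parse 'label idx:val idx:val ...' into (label, dense feature list).

    libsvm feature indices are 1-based, so index i goes to position i - 1.
    """
    parts = line.split()
    label = float(parts[0])
    pairs = [item.split(":") for item in parts[1:]]
    size = max((int(i) for i, _ in pairs), default=0)
    features = [0.0] * size
    for i, v in pairs:
        features[int(i) - 1] = float(v)
    return label, features


def within_set_sse(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    return sum(
        min(sum((x - c) ** 2 for x, c in zip(p, center)) for center in centers)
        for p in points
    )


sample = """
0 1:0.0 2:0.0 3:0.0
1 1:0.1 2:0.1 3:0.1
2 1:0.2 2:0.2 3:0.2
3 1:9.0 2:9.0 3:9.0
4 1:9.1 2:9.1 3:9.1
5 1:9.2 2:9.2 3:9.2
"""
points = [parse_libsvm_line(line)[1] for line in sample.strip().splitlines()]
centers = [[0.1, 0.1, 0.1], [9.1, 9.1, 9.1]]  # centers printed by the Spark run
cost = within_set_sse(points, centers)
print(cost)
```

By hand: in each cluster the two outer points are 0.1 away from the center in all three dimensions, so each contributes 3 × 0.01 = 0.03 and the middle point contributes 0. Two clusters give 4 × 0.03 = 0.12, which agrees with the Spark output up to floating-point rounding.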
