Spark ---- Word Frequency Count (Part 1)


 

Counting word frequencies with Spark installed on a Linux system:

 

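1. Choose a directory and create a subdirectory in it for the text files. Save the text to be processed there so it is easy to find later:

① cd /usr/local ② mkdir mycode ③ cd mycode ④ list the contents of the new directory: ll

⑤ Create the text file: vim wordcount.txt (paste in any passage of English text)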

[root@node01 mycode]# vim  wordcount.txt
  uded among the most successful influencers in Open Source, The Apache Software Foundation's commitment to collaborative development has long served as a model for producing consistently high quality software that advances the future of open development. https://s.apache.org/PIRA
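
The directory and file setup above can also be scripted. The following is a minimal sketch, not part of the original walkthrough: the helper name `create_text_file` and the sample sentence are made up for illustration, and a temporary directory stands in for /usr/local so that no root access is needed. It also prints the `file://` URI form of the article's path, which is how a local file is usually addressed when handing it to Spark.

```python
import tempfile
from pathlib import Path, PurePosixPath

# Illustrative sample text (any English passage works).
SAMPLE_TEXT = "The Apache Software Foundation's commitment to collaborative development"


def create_text_file(base_dir, text):
    """Create <base_dir>/mycode/wordcount.txt and return its path."""
    target_dir = Path(base_dir) / "mycode"
    target_dir.mkdir(parents=True, exist_ok=True)  # same effect as: mkdir mycode
    target = target_dir / "wordcount.txt"
    target.write_text(text + "\n", encoding="utf-8")  # same effect as saving in vim
    return target


# A temporary directory stands in for /usr/local (assumption for this sketch).
with tempfile.TemporaryDirectory() as tmp:
    path = create_text_file(tmp, SAMPLE_TEXT)
    print(path.read_text(encoding="utf-8"))

# URI form of the path used in this article.
print(PurePosixPath("/usr/local/mycode/wordcount.txt").as_uri())
```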
 

 

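2. To make it easier to inspect the text file and run other commands, you can duplicate the current session so that it serves as a second terminal window on the same node:

For example: open node01, duplicate the node01 session, and then use the duplicated session to look up the directory and text file created earlier.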

  
 cd /usr/local/mycode/
 ll

 

 

3. Start Spark. On this machine Spark is installed in /home/mysoft/spark-1.6; substitute your own installation path!

① Change to the installation directory

  
 cd /home/mysoft/spark-1.6  

 

② Launch command (alternatively, cd bin and then run ./pyspark):

  
./bin/pyspark

 

------- If Spark's usual startup banner appears, the launch succeeded!

 Welcome to
        ____              __
       / __/__  ___ _____/ /__
      _\ \/ _ \/ _ `/ __/  '_/
     /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
        /_/
  ​
  Using Python version 3.5.0 (default, Jul 12 2018 03:34:21)
  SparkContext available as sc, HiveContext available as sqlContext.
  >>> 

 

 

4. Load the file:

 >>> textFile = sc.textFile("file:///usr/local/mycode/wordcount.txt")
 >>> textFile.first()

 



The first line of the text file created earlier is then printed on screen!

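Note: first() is an action. It triggers the actual computation: the data is read from the file behind textFile and the first line of text is returned. Because Spark evaluates lazily, pyspark does not necessarily report a mistake in a transformation (such as a wrong file path) when the statement is entered. The real computation starts only when an action is executed, and that is when errors in the earlier transformation statements show up, for example:

Connection refused!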
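
The deferred-error behaviour described in the note can be imitated in plain Python with generators. This is only an analogy and does not use the Spark API: the functions `read_lines` and `build_pipeline` and the path `/no/such/dir/wordcount.txt` are hypothetical names chosen for this sketch. Building the pipeline plays the role of a transformation, and asking for the first element plays the role of an action such as first().

```python
def read_lines(path):
    """Generator: the file is not opened until the first item is requested."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


def build_pipeline(path):
    """'Transformation' step: only describes the work, nothing runs yet."""
    lines = read_lines(path)
    return (line.upper() for line in lines)


# The path is wrong (hypothetical), yet no error is raised at this point.
pipeline = build_pipeline("/no/such/dir/wordcount.txt")

# 'Action' step: requesting the first element finally runs the pipeline,
# and only now does the mistake surface.
try:
    next(pipeline)
except FileNotFoundError as exc:
    print("error surfaced at action time:", type(exc).__name__)
```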

 

5. Count the word frequencies (continuing in the same pyspark session):

 >>> Count = textFile.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
 >>> Count.collect()
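
To see what the three chained calls compute, here is a rough plain-Python equivalent that runs without Spark. It is a sketch for understanding only: the function name `word_count` and the two sample lines are made up for illustration, and everything runs in a single process rather than on distributed data.

```python
from itertools import chain


def word_count(lines):
    """Single-process imitation of flatMap -> map -> reduceByKey."""
    # flatMap: split every line on single spaces and flatten into one word list
    words = list(chain.from_iterable(line.split(" ") for line in lines))
    # map: pair each word with the count 1
    pairs = [(word, 1) for word in words]
    # reduceByKey: add up the counts of pairs that share the same word
    counts = {}
    for word, one in pairs:
        counts[word] = counts.get(word, 0) + one
    return counts


# Illustrative input lines (assumption for this sketch).
sample_lines = ["advances the future of open development", "among the most successful"]
print(word_count(sample_lines))
```

Because the split is on a single space, two consecutive spaces in the input produce an empty-string "word" that is counted like any other.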

  

 

6. Printed result:

 [('development', 1), ('producing', 1), ('among', 1), ('Source,', 1), ('for', 1), ('quality', 1), ('to', 1), ('influencers', 1), ('advances', 1), ('collaborative', 1), ('model', 1), ('in', 1), ('the', 2), ('of', 1), ('has', 1), ('successful', 1), ('Software', 1), ("Foundation's", 1), ('most', 1), ('long', 1), ('that', 1), ('uded', 1), ('as', 1), ('Open', 1), ('The', 1), ('commitment', 1), ('software', 1), ('consistently', 1), ('a', 1), ('development.', 1), ('high', 1), ('future', 1), ('Apache', 1), ('served', 1), ('open', 1), ('https://s.apache.org/PIRA', 1)]
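
In the result above, 'Source,' and 'development.' keep their punctuation, and 'The' and 'the' are counted separately, because the text was split on spaces only. As an optional follow-up that is not part of the original steps, the collected pairs could be tidied in plain Python. The function name `normalized_counts` and the short sample list are assumptions made for this sketch.

```python
import string
from collections import Counter


def normalized_counts(pairs):
    """Merge (word, count) pairs after lowercasing and stripping edge punctuation."""
    totals = Counter()
    for word, count in pairs:
        key = word.strip(string.punctuation).lower()
        if key:  # skip tokens that consisted of punctuation only
            totals[key] += count
    # Highest count first; ties are broken alphabetically for a stable order
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


# A few pairs in the shape returned by collect() (illustrative subset).
sample = [('the', 2), ('The', 1), ('development', 1), ('development.', 1), ('Source,', 1)]
print(normalized_counts(sample))
# prints [('the', 3), ('development', 2), ('source', 1)]
```

Only punctuation at the edges of a token is stripped, so "Foundation's" and the URL keep their inner characters.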

 

 

 

