1. Write the program code as follows:
Wordcount.scala
package Wordcount

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

/**
 * @author hadoop
 * Counts the number of occurrences of each word.
 */
object Wordcount {
  def main(args: Array[String]): Unit = {
    if (args.length < 1) {
      System.err.println("Usage: <file>")
      System.exit(1)
    }
    val conf = new SparkConf()
    // SparkContext is the channel through which the code is submitted to the cluster or run locally;
    // every Spark program, whether it runs locally or on a cluster, must have a SparkContext instance.
    val sc = new SparkContext(conf)
    // The file contents are held in line, which is actually an RDD (a MappedRDD);
    // all Spark operations are based on RDDs.
    val line = sc.textFile(args(0))
    line.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect.foreach(println)
    sc.stop()
  }
}
2. Package the program into wordcount.jar.
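The original does not say which build tool produced wordcount.jar. The following is a minimal sketch assuming the project is built with sbt; the project directory and the generated jar name are hypothetical, and the jar is copied to /usr/local/myjar/, the path referenced by wordcount.sh below.

# Minimal packaging sketch (assumes sbt; project directory and jar name are hypothetical)
cd ~/wordcount-project
sbt package
# Copy the built jar to the path that wordcount.sh expects
cp target/scala-2.11/wordcount_2.11-1.0.jar /usr/local/myjar/wordcount.jar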
3. Write the wordcount.sh script:
#!/bin/bash
cd $SPARK_HOME/bin
spark-submit \
  --master spark://192.168.1.154:7077 \
  --class Wordcount.Wordcount \
  --name wordcount \
  --executor-memory 400M \
  --driver-memory 512M \
  /usr/local/myjar/wordcount.jar \
  hdfs://192.168.1.154:9000/user/hadoop/wordcount.txt
Here wordcount.txt is the text file whose words are to be counted.
4. Upload wordcount.txt to the corresponding directory in HDFS, and start the Spark cluster, for example as sketched below.
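A minimal sketch of this step, assuming the local copy of wordcount.txt is in the current directory; the HDFS target path matches the one used by wordcount.sh, and start-all.sh starts the standalone master and its workers.

# Create the target directory in HDFS and upload the input file
hdfs dfs -mkdir -p /user/hadoop
hdfs dfs -put wordcount.txt hdfs://192.168.1.154:9000/user/hadoop/wordcount.txt
# Start the standalone Spark cluster (master and workers)
$SPARK_HOME/sbin/start-all.sh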
5. Execute the script.
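For example, assuming wordcount.sh was saved in the current directory:

chmod +x wordcount.sh
./wordcount.sh
# The (word, count) pairs gathered by collect are printed on the driver's console.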