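To my surprise, installing the much-talked-about Spark on my small Hadoop 2.2.0 cluster went very smoothly, probably because the process is so similar to setting up Hadoop. The real learning curve is likely to be Scala programming and the RDD mechanism.
Anyway, it is a good start.
The existing cluster: everything installed from source, including hadoop2.2.0 hive0.13.0 hbase-0.96.2-hadoop2 sqoop-1.4.5.bin__hadoop-2.0.4-alpha pig-0.12.1
Hive and HBase have fairly strict version requirements before they can call each other, so although Hadoop could be upgraded to 2.6.0, I am playing it safe and not upgrading it on its own.
Pseudo-distributed installation of Spark
1. Download a suitable version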
http://spark.apache.org/downloads.html
The package downloaded here is spark-1.0.2-bin-hadoop2
http://www.scala-lang.org/download/2.11.0.html
2. Extract to /usr/local/hadoop
tar -zxvf ...
Create symbolic links:
ln -s spark-1.0.2-bin-hadoop2 spark
ln -s scala-2.11.0 scala
3. Configure the paths
Go to the SPARK_HOME/conf directory, make a copy of spark-env.sh.template, and rename the copy to spark-env.sh
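The copy step above can be sketched as a small shell function. The name init_spark_env is hypothetical (it is not part of Spark), and the sketch assumes the conf directory is passed in explicitly; it deliberately leaves an existing spark-env.sh alone rather than overwriting it.

```shell
# Copy spark-env.sh.template to spark-env.sh inside a conf directory.
# init_spark_env is a hypothetical helper name, not something Spark ships.
init_spark_env() {
    conf_dir="$1"
    template="$conf_dir/spark-env.sh.template"
    target="$conf_dir/spark-env.sh"

    # Fail if the template is missing (wrong directory, incomplete unpack).
    if [ ! -f "$template" ]; then
        echo "template not found: $template" >&2
        return 1
    fi
    # Never clobber a spark-env.sh that has already been edited.
    if [ -e "$target" ]; then
        echo "already exists, leaving untouched: $target" >&2
        return 0
    fi
    cp "$template" "$target"
}

# Usage (assumes SPARK_HOME is already set):
#   init_spark_env "$SPARK_HOME/conf"
```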
vim /etc/profile
export JAVA_HOME=/usr/java/jdk1.8.0_25
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/usr/local/hadoop-2.2.0
export HBASE_HOME=/usr/local/hbase
export HIVE_HOME=/usr/local/hive
export SQOOP_HOME=/usr/local/sqoop
export PIG_HOME=/usr/local/pig
export PIG_CLASSPATH=$HADOOP_HOME/etc/hadoop
export MAVEN_HOME=/opt/apache-maven-3.2.3
export ANT_HOME=/opt/apache-ant-1.9.4
export PATH=$PATH:$HADOOP_HOME/:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HBASE_HOME/bin:$HIVE_HOME/bin:$MAVEN_HOME/bin:$ANT_HOME/bin:$SQOOP_HOME/bin:$PIG_HOME/bin
export SCALA_HOME=/usr/local/scala
export SPARK_MASTER=localhost
export SPARK_LOCAL_IP=localhost
export SPARK_HOME=/usr/local/spark
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_LIBRARY_PATH=.:$JAVA_HOME/lib:$JAVA_HOME/jre/lib:$HADOOP_HOME/lib/native
export PATH=$PATH:$SCALA_HOME/bin:$SPARK_HOME/bin
... With this many things installed, every one of them needs configuring.
Apply the configuration:
source /etc/profile
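With this many variables it is easy to mistype one, so a quick sanity check after sourcing the profile can help. The sketch below is only an illustration: check_homes is a hypothetical helper name, and it verifies nothing beyond "the variable is set and points at an existing directory".

```shell
# Report any variable that is unset or does not point at a directory.
# check_homes is a hypothetical helper name; pass it the variable NAMES.
check_homes() {
    status=0
    for name in "$@"; do
        # Indirectly read the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set"
            status=1
        elif [ ! -d "$value" ]; then
            echo "$name points to a missing directory: $value"
            status=1
        else
            echo "$name OK: $value"
        fi
    done
    return $status
}

# Usage:
#   check_homes JAVA_HOME HADOOP_HOME SCALA_HOME SPARK_HOME
```

The function returns non-zero if any variable fails the check, so it can also be used in a script before starting the daemons.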
4. Check the Scala version
[root@centos local]# scala -version
Scala code runner version 2.11.0 -- Copyright 2002-2013, LAMP/EPFL
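5. Start Spark
Go to SPARK_HOME/sbin and run the following (the ./ prefix matters, because $HADOOP_HOME/sbin is on the PATH and also contains a start-all.sh):
./start-all.sh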
[root@centos local]# jps
7953 DataNode
8354 NodeManager
8248 ResourceManager
8104 SecondaryNameNode
10396 Jps
7836 NameNode
7613 Worker
7485 Master
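A Master process and a Worker process are both present, which means Spark started successfully.
The status of the Spark cluster can be viewed at http://localhost:8080/
6. Running the Spark example programs in two modes
1. Spark-shell
This mode is for interactive programming. It is used as follows (first change into the bin folder):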
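The "look for Master and Worker in the jps output" check can also be scripted. This is a minimal sketch, with has_spark_daemons as a hypothetical helper name. It compares the process-name column exactly, so that HBase's HMaster (also installed on this machine) is not mistaken for the Spark Master.

```shell
# Succeed only if the jps listing on stdin has both a Master and a Worker.
# has_spark_daemons is a hypothetical helper name.
has_spark_daemons() {
    # jps prints "<pid> <name>"; match the name column exactly so that
    # names like HMaster do not count as the Spark Master.
    awk '$2 == "Master" { m = 1 }
         $2 == "Worker" { w = 1 }
         END { exit !(m && w) }'
}

# Usage:
#   jps | has_spark_daemons && echo "Spark standalone daemons are up"
```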
./spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.0.2
      /_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_25)
Type in expressions to have them evaluated.
Type :help for more information.
15/03/17 19:15:18 INFO spark.SecurityManager: Changing view acls to: root
scala> val days = List("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday")
days: List[String] = List(Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday)
scala> val daysRDD = sc.parallelize(days)
daysRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:14
scala> daysRDD.count()
The following is displayed:
res0: Long = 7
2. Running a script
Run SparkPi from the examples that ship with Spark.
Note that both of the following forms are problematic:
./bin/run-example org.apache.spark.examples.SparkPi spark://localhost:7077
./bin/run-example org.apache.spark.examples.SparkPi local[3]
local means run locally, and [3] means run with 3 threads
This form works:
./bin/run-example org.apache.spark.examples.SparkPi 2 spark://192.168.0.120:7077
15/03/17 19:23:56 INFO scheduler.DAGScheduler: Completed ResultTask(0, 0)
15/03/17 19:23:56 INFO scheduler.DAGScheduler: Stage 0 (reduce at SparkPi.scala:35) finished in 0.416 s
15/03/17 19:23:56 INFO spark.SparkContext: Job finished: reduce at SparkPi.scala:35, took 0.501835986 s
Pi is roughly 3.14086
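One possible explanation for why the first two forms fail and the last one works: as far as I can tell from the Spark 1.0.x sources, SparkPi treats its first argument as the number of slices, and bin/run-example takes the master URL from the MASTER environment variable rather than from the argument list. If that reading is right (it is an assumption, worth checking against your own copy of bin/run-example), the trailing spark://... argument in the working command is simply ignored, and the master can be chosen as sketched below. sparkpi_cmd is a hypothetical helper that only prints the command line so it can be inspected before running.

```shell
# Print (do not run) a run-example command line for SparkPi.
# Assumption: run-example reads the master URL from the MASTER variable and
# SparkPi's first argument is the slice count; verify both for your version.
sparkpi_cmd() {
    master="$1"
    slices="${2:-2}"    # default to 2 slices when none is given
    case "$slices" in
        ''|*[!0-9]*)
            # A master URL in this position is the mistake described above.
            echo "slices must be a non-negative integer: $slices" >&2
            return 1
            ;;
    esac
    echo "MASTER=$master ./bin/run-example org.apache.spark.examples.SparkPi $slices"
}

# Usage:
#   sparkpi_cmd 'spark://192.168.0.120:7077' 2
#   sparkpi_cmd 'local[3]'
```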
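7. Features of Scala
There are a few main reasons why MapReduce (MR) is less than ideal:
1. Work is submitted in the form of jobs
2. Its jobs are relatively heavyweight: the jar has to be shipped to every node, the job has to iterate over the data, and so on, so even the simplest job takes on the order of seconds
A few features of Scala that may make you want to learn this new language:
1. It ultimately compiles to Java VM bytecode, so it can look like a shell around Java. As a Java developer, you can at least breathe a sigh of relief
2. It can use Java packages and classes, which is reassuring too: there is no need to worry that the packages you have written must be rewritten in another language
3. More concise syntax and faster development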