Why do I recommend using the Ultimate edition here? See these related posts:
IDEA Community edition revisited — reluctantly installing the Ultimate edition after all
Switching IntelliJ IDEA between dark and light backgrounds (works in both the Ultimate and Community editions)
Importing the latest Spark source code into IntelliJ IDEA and compiling the Spark source
Several ways to build a jar in IDEA and then upload it to the cluster
IntelliJ IDEA (Community edition): download, installation, and a first WordCount (local mode and cluster mode)
IntelliJ IDEA (Ultimate edition): download, installation, and a first WordCount (local mode and cluster mode)
Setting up a Spark development environment with IntelliJ IDEA — reference documentation
Reference: http://spark.apache.org/docs/latest/programming-guide.html
Steps:
a) Create a Maven project
b) Add the dependencies (the Spark dependency, packaging plugins, and so on)
Setting up a Spark development environment with IntelliJ IDEA — Maven vs. sbt
a) Use whichever one you are more familiar with
b) Maven can also build Scala projects
Setting up a Spark development environment with IntelliJ IDEA — building a Scala project with Maven
Reference: http://docs.scala-lang.org/tutorials/scala-with-maven.html
Steps:
a) Build a Scala project with Maven (based on the net.alchim31.maven:scala-archetype-simple archetype)
GroupId:zhouls.bigdata
ArtifactId:mySpark
Version:1.0-SNAPSHOT
mySpark
E:\Code\IntelliJIDEAUltimateVersionCode\mySpark
Because my local Scala version is 2.10.5.
Just select it and press Delete.
This is actually just the Windows cmd terminal; IDEA simply embeds it inside the IDE.
mvn clean package
This is just a quick test.
b) Add the dependencies in pom.xml (the Spark dependency, packaging plugins, and so on)
Note: pay attention to the compatibility between the Scala and Java versions.
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>zhouls.bigdata</groupId>
  <artifactId>mySpark</artifactId>
  <version>1.0-SNAPSHOT</version>
  <name>mySpark</name>
  <inceptionYear>2008</inceptionYear>

  <properties>
    <scala.version>2.10.5</scala.version>
    <spark.version>1.6.1</spark.version>
  </properties>

  <repositories>
    <repository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </repository>
  </repositories>

  <pluginRepositories>
    <pluginRepository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </pluginRepository>
  </pluginRepositories>

  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.4</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.specs</groupId>
      <artifactId>specs</artifactId>
      <version>1.2.5</version>
      <scope>test</scope>
    </dependency>
    <!-- spark -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>${spark.version}</version>
      <scope>provided</scope>
    </dependency>
  </dependencies>

  <build>
    <!--
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    -->
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
          <args>
            <arg>-target:jvm-1.5</arg>
          </args>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-eclipse-plugin</artifactId>
        <configuration>
          <downloadSources>true</downloadSources>
          <buildcommands>
            <buildcommand>ch.epfl.lamp.sdt.core.scalabuilder</buildcommand>
          </buildcommands>
          <additionalProjectnatures>
            <projectnature>ch.epfl.lamp.sdt.core.scalanature</projectnature>
          </additionalProjectnatures>
          <classpathContainers>
            <classpathContainer>org.eclipse.jdt.launching.JRE_CONTAINER</classpathContainer>
            <classpathContainer>ch.epfl.lamp.sdt.launching.SCALA_CONTAINER</classpathContainer>
          </classpathContainers>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.4.1</version>
        <executions>
          <!-- Run shade goal on package phase -->
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <transformers>
                <!-- add Main-Class to manifest file -->
                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                  <!--<mainClass>com.dajiang.MyDriver</mainClass>-->
                </transformer>
              </transformers>
              <createDependencyReducedPom>false</createDependencyReducedPom>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>

  <reporting>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
        </configuration>
      </plugin>
    </plugins>
  </reporting>
</project>
To build good development habits, follow the standard project layout.
By default, a newly created source directory does not take effect; you have to configure it as follows (mark it as a Sources Root in IDEA) before it takes effect.
The same applies to the unit test directory below.
By default, it does not take effect either.
You must do the same kind of thing (mark it as a Test Sources Root) before it takes effect.
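Once the test sources root is active, a tiny test class is enough to confirm that the wiring works. The following is a minimal sketch only, assuming the junit dependency already declared in the pom above; the class name WordSplitTest and its location under src/test/scala are hypothetical.

package zhouls.bigdata

import org.junit.Assert.assertEquals
import org.junit.Test

// Hypothetical smoke test: verifies that the test sources root is recognized
// and that the JUnit dependency from the pom resolves correctly.
class WordSplitTest {

  @Test
  def splitsLineOnSpaces(): Unit = {
    // Splitting a line on spaces should yield one entry per word.
    val words = "hadoop spark storm".split(" ")
    assertEquals(3, words.length)
  }
}

Running mvn test should then compile and execute this class along with any other tests.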
Developing the first Spark program
See also: Scala primer 01 — installing the Scala plugin in IDEA
a) The first Spark program, Scala version
package zhouls.bigdata

import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by zhouls on 2016-6-19.
 */
object MyScalaWordCount {
  def main(args: Array[String]): Unit = {
    // Check the arguments
    if (args.length < 2) {
      System.err.println("Usage: MyScalaWordCount <input> <output>")
      System.exit(1)
    }

    // Read the arguments
    val input = args(0)
    val output = args(1)

    // Create the Scala version of the SparkContext
    val conf = new SparkConf().setAppName("MyScalaWordCount")
    val sc = new SparkContext(conf)

    // Read the data
    val lines = sc.textFile(input)

    // Do the actual computation
    val resultRdd = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    // Save the result
    resultRdd.saveAsTextFile(output)

    sc.stop()
  }
}
b) The first Spark program, Java version
package zhouls.bigdata;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

import java.util.Arrays;

/**
 * Created by zhouls on 2016-6-19.
 */
public class MyJavaWordCount {
    public static void main(String[] args) {
        // Check the arguments
        if (args.length < 2) {
            System.err.println("Usage: MyJavaWordCount <input> <output>");
            System.exit(1);
        }

        // Read the arguments
        String input = args[0];
        String output = args[1];

        // Create the Java version of the SparkContext
        SparkConf conf = new SparkConf().setAppName("MyJavaWordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read the data
        JavaRDD<String> inputRdd = sc.textFile(input);

        // Do the actual computation
        JavaRDD<String> words = inputRdd.flatMap(new FlatMapFunction<String, String>() {
            public Iterable<String> call(String line) throws Exception {
                return Arrays.asList(line.split(" "));
            }
        });
        JavaPairRDD<String, Integer> result = words.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String word) throws Exception {
                return new Tuple2<String, Integer>(word, 1);
            }
        }).reduceByKey(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer x, Integer y) throws Exception {
                return x + y;
            }
        });

        // Save the result
        result.saveAsTextFile(output);

        // Stop the context
        sc.stop();
    }
}
Or:
package zhouls.bigdata;

/**
 * Created by zhouls on 2016-6-19.
 */

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public final class MyJavaWordCount {
    private static final Pattern SPACE = Pattern.compile(" ");

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("Usage: MyJavaWordCount <file>");
            System.exit(1);
        }

        SparkConf sparkConf = new SparkConf().setAppName("MyJavaWordCount");
        JavaSparkContext ctx = new JavaSparkContext(sparkConf);
        JavaRDD<String> lines = ctx.textFile(args[0], 1);

        JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            public Iterable<String> call(String s) {
                return Arrays.asList(SPACE.split(s));
            }
        });

        JavaPairRDD<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String s) {
                return new Tuple2<String, Integer>(s, 1);
            }
        });

        JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer i1, Integer i2) {
                return i1 + i2;
            }
        });

        List<Tuple2<String, Integer>> output = counts.collect();
        for (Tuple2<?, ?> tuple : output) {
            System.out.println(tuple._1() + ": " + tuple._2());
        }
        ctx.stop();
    }
}
Running the first Spark program you developed
Packaging the Spark Maven project
See also: Several ways to build a jar in IDEA and then upload it to the cluster
The following approach is recommended:
1. First, switch to the project directory.
By default this puts you at E:\Code\IntelliJIDEAUltimateVersionCode\mySpark>
mvn clean package
mvn package
To learn a bit more, you can copy the jar to the desktop and inspect it to check whether everything was really packaged in, because here it needs to include both MyJavaWordCount.java and MyScalaWordCount.scala.
Prepare the data
[spark@sparksinglenode wordcount]$ pwd
/home/spark/testspark/inputData/wordcount
[spark@sparksinglenode wordcount]$ ll
total 4
-rw-rw-r-- 1 spark spark 92 Mar 24 18:45 wc.txt
[spark@sparksinglenode wordcount]$ cat wc.txt
hadoop spark
storm zookeeper
scala java
hive hbase
mapreduce hive
hadoop hbase
spark hadoop
[spark@sparksinglenode wordcount]$
Upload the jar you just built.
Submitting to the Spark cluster to run
a) Submit the Scala version of WordCount
Go to the $SPARK_HOME installation directory and run the following commands.
[spark@sparksinglenode spark-1.6.1-bin-hadoop2.6]$ $HADOOP_HOME/bin/hadoop fs -mkdir -p hdfs://sparksinglenode:9000/testspark/inputData/wordcount
[spark@sparksinglenode spark-1.6.1-bin-hadoop2.6]$ $HADOOP_HOME/bin/hadoop fs -copyFromLocal /home/spark/testspark/inputData/wordcount/wc.txt hdfs://sparksinglenode:9000/testspark/inputData/wordcount/
[spark@sparksinglenode spark-1.6.1-bin-hadoop2.6]$ bin/spark-submit --class zhouls.bigdata.MyScalaWordCount /home/spark/testspark/mySpark-1.0-SNAPSHOT.jar hdfs://sparksinglenode:9000/testspark/inputData/wordcount/wc.txt hdfs://sparksinglenode:9000/testspark/outData/MyScalaWordCount
Note: in the commands above, both the input path and the output path have to be on the cluster (i.e., in HDFS), because the program as packaged here assumes cluster (HDFS) paths, so this is the only way it can be run.
Success!
[spark@sparksinglenode spark-1.6.1-bin-hadoop2.6]$ $HADOOP_HOME/bin/hadoop fs -cat hdfs://sparksinglenode:9000/testspark/outData/MyScalaWordCount/part-*
17/03/27 20:12:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
(storm zookeeper,1)
(hadoop spark,1)
(spark hadoop,1)
(mapreduce hive,1)
(scala java,1)
(hive hbase,1)
(hadoop hbase,1)
[spark@sparksinglenode spark-1.6.1-bin-hadoop2.6]$
Note: if you want to run it locally (on Windows or Linux), you only need to set the master to local in the program code, then repackage and run it again. It is very simple, so I will not go into detail.
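For reference, here is a minimal sketch of what such a local-mode variant might look like (the object name MyScalaWordCountLocal is hypothetical; only the setMaster call differs from the cluster version above).

package zhouls.bigdata

import org.apache.spark.{SparkConf, SparkContext}

// Sketch of a local-mode WordCount: identical to MyScalaWordCount except that
// the master is set explicitly to local[*] so it runs on a single machine.
object MyScalaWordCountLocal {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("MyScalaWordCountLocal")
      .setMaster("local[*]") // use all local cores instead of a cluster

    val sc = new SparkContext(conf)

    // Same computation as the cluster version: read, split, count, save.
    val lines = sc.textFile(args(0))
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.saveAsTextFile(args(1))

    sc.stop()
  }
}

After repackaging, it can then be submitted with local filesystem paths, as in the command below.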
[spark@sparksinglenode spark-1.6.1-bin-hadoop2.6]$ bin/spark-submit --class zhouls.bigdata.MyScalaWordCount /home/spark/testspark/mySpark-1.0-SNAPSHOT.jar /home/spark/testspark/inputData/wordcount/wc.txt /home/spark/testspark/outData/MyScalaWordCount
b) Submit the Java version of WordCount
[spark@sparksinglenode spark-1.6.1-bin-hadoop2.6]$ bin/spark-submit --class zhouls.bigdata.MyJavaWordCount /home/spark/testspark/mySpark-1.0-SNAPSHOT.jar hdfs://sparksinglenode:9000/testspark/inputData/wordcount/wc.txt hdfs://sparksinglenode:9000/testspark/outData/MyJavaWordCount
storm zookeeper: 1
hadoop spark: 1
spark hadoop: 1
mapreduce hive: 1
scala java: 1
hive hbase: 1
hadoop hbase: 1
Note: as with the Scala version, if you want to run it locally (on Windows or Linux), you only need to set the master to local in the program code, then repackage and run it again. It is very simple, so I will not go into detail.
bin/spark-submit --class zhouls.bigdata.MyJavaWordCount /home/spark/testspark/mySpark-1.0-SNAPSHOT.jar /home/spark/testspark/inputData/wordcount/wc.txt /home/spark/testspark/outData/MyJavaWordCount
Success!
For a deeper look at the pom.xml, see:
The pom.xml configuration file for a Maven-created Spark project (illustrated walkthrough)
Recommended posts:
Using Maven to create Scala and Java project environments in Scala IDEA for Eclipse (illustrated walkthrough)
Using Maven to create Scala and Java project environments (illustrated walkthrough) (applies to IntelliJ IDEA (Ultimate edition), IntelliJ IDEA (Community edition), and Scala IDEA for Eclipse) (recommended)