Running a Spark Application in Local Development
1. Install Spark on Windows, set the environment variable SPARK_HOME=D:\spark-3.0.1, and append %SPARK_HOME%\bin;%SPARK_HOME%\sbin; to the PATH environment variable.
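A quick way to sanity-check this step is to open a new command prompt after the variables are set and run (this assumes PATH was updated correctly):

spark-submit --version

Note that on Windows Spark often also expects HADOOP_HOME to point at a directory whose bin folder contains winutils.exe; a missing winutils.exe is the usual cause of startup errors at this point.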
2. Create a new Gradle project named spark in IDEA and add the following to build.gradle:
dependencies {
implementation("org.apache.spark:spark-sql_2.12:3.0.1")
}
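For reference, a fuller build.gradle for this project might look like the sketch below; the plugin set and repository are assumptions, not part of the original setup:

plugins {
    id 'java'
}

repositories {
    mavenCentral()
}

dependencies {
    implementation("org.apache.spark:spark-sql_2.12:3.0.1")
}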
Add the following code to the main method:
// Requires imports: org.apache.spark.SparkConf, org.apache.spark.sql.SparkSession,
// org.apache.spark.sql.Dataset, org.apache.spark.api.java.function.FilterFunction
public static void main(String[] args) {
    // To run against a standalone cluster instead, set the master URL and ship the application jar:
    // SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("spark://master1:7070")
    //         .setJars(new String[]{"E:\\spark\\out\\artifacts\\spark_main_jar\\spark.main.jar"});
    // local[*] runs Spark inside this JVM with one worker thread per logical core
    SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]");
    System.setProperty("user.name", "root");
    String logFile = "file:///e:/README.md"; // Should be some file on your system
    SparkSession spark = SparkSession.builder().appName("Simple Application").config(conf).getOrCreate();
    Dataset<String> logData = spark.read().textFile(logFile).cache();
    FilterFunction<String> ffa = s -> s.contains("a");
    FilterFunction<String> ffb = s -> s.contains("b");
    long numAs = logData.filter(ffa).count();
    long numBs = logData.filter(ffb).count();
    System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
    spark.stop();
}
3. Configure an IDEA Run/Debug configuration for the main class and run it; see the sketch below.
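As a rough sketch, an Application-type run configuration is all that is needed (the class name here is an assumption matching the packaging step later on):

Run > Edit Configurations... > + > Application
    Main class: com.test.SimpleApp
    Use classpath of module: <the project's main module>

Because the master is local[*], clicking Run or Debug starts the whole Spark application inside the IDE's JVM, so no cluster is required for this step.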
Submitting to a Spark Standalone Cluster or YARN
1. Modify the main method:
// Requires imports: org.apache.spark.sql.SparkSession, org.apache.spark.sql.Dataset,
// org.apache.spark.api.java.function.FilterFunction
public class SimpleApp {
    public static void main(String[] args) {
        // No master is hard-coded here; spark-submit supplies it via --master.
        // The file must exist on the node where the driver runs.
        String logFile = "file:///usr/local/spark/README.md";
        SparkSession spark = SparkSession.builder().appName("Simple Application").getOrCreate();
        Dataset<String> logData = spark.read().textFile(logFile).cache();
        FilterFunction<String> ffa = s -> s.contains("a");
        FilterFunction<String> ffb = s -> s.contains("b");
        long numAs = logData.filter(ffa).count();
        long numBs = logData.filter(ffb).count();
        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
        spark.stop();
    }
}
2. Package the project as a runnable jar (via Gradle or IDEA Artifacts).
# Gradle task that builds the runnable jar
jar {
    manifest {
        // 'Main-Class' is what makes the jar directly runnable; spark-submit only needs --class.
        // (mainClassName is a property of the 'application' plugin, not of the jar task, so it is not set here.)
        attributes 'Manifest-Version': 1.0,
                   'Main-Class': 'com.test.SimpleApp'
    }
}
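With that task in place, the jar is built with a normal Gradle invocation; the Spark dependency itself does not need to be bundled, since spark-submit puts the cluster's Spark jars on the classpath at runtime (the output path below assumes Gradle defaults):

gradle clean jar
# the jar is written under build/libs/, e.g. build/libs/spark.jar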
3. Upload the jar to a machine in the Spark cluster.
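For example, with scp (the user, host, and target directory here are placeholders):

scp build/libs/spark.jar root@master1:/root/spark.jar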
4. Submit it to the Spark standalone cluster:
/usr/local/spark/bin/spark-submit --class com.test.SimpleApp --master spark://master1:7070 spark.jar
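Resource flags can be appended to the same command when needed; the values below are purely illustrative:

/usr/local/spark/bin/spark-submit --class com.test.SimpleApp --master spark://master1:7070 --executor-memory 1g --total-executor-cores 2 spark.jar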
5. Submit it to YARN:
/usr/local/spark/bin/spark-submit --class com.test.SimpleApp --master yarn --deploy-mode cluster spark.jar
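In cluster deploy mode the driver runs on a YARN node, so the System.out.println output goes to the container logs rather than to the submitting console; if log aggregation is enabled it can be fetched afterwards with:

yarn logs -applicationId <application id>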
Remote Debugging
1. When submitting the job, add --driver-java-options "-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8888".
2. Submit to the Spark cluster so it runs in debug mode:
/usr/local/spark/bin/spark-submit --class com.test.SimpleApp --master spark://master1:7070 --driver-java-options "-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8888" spark.jar
3. Add a remote debug configuration in IDEA and attach it.
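As a sketch, the Remote configuration only needs the driver's host and the JDWP port from step 1 (the host here assumes the driver is started on master1):

Run > Edit Configurations... > + > Remote
    Host: master1
    Port: 8888

Because suspend=y, the driver blocks at startup until the debugger attaches, so breakpoints in the driver code are hit from the very first line.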
4. Spark jobs on YARN can of course be debugged the same way; just add --driver-java-options "-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8888" when submitting.
The only difference is that you do not know in advance which host to attach to. The YARN web UI shows which node the ACCEPTED application is running on; update the Host in the IDEA remote configuration to that machine's IP.
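If the web UI is not at hand, the ApplicationMaster host can usually also be read from the command line (the status report of recent Hadoop versions includes an "AM Host" field, though the exact output varies by version):

yarn application -list
yarn application -status <application id>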