Developing, Remotely Running, and Debugging Spark Applications in IDEA


Developing and Running a Spark Application Locally

1. Install Spark on Windows: set the environment variable SPARK_HOME=D:\spark-3.0.1 and append %SPARK_HOME%\bin;%SPARK_HOME%\sbin to PATH.
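If you want to confirm that the variable is visible to the JVMs launched from IDEA, a minimal sketch (the class CheckSparkHome is just a hypothetical helper, not part of the project):

public class CheckSparkHome {
    public static void main(String[] args) {
        // Prints the SPARK_HOME value seen by this JVM; null usually means IDEA was
        // started before the environment variable was set and needs a restart.
        System.out.println("SPARK_HOME = " + System.getenv("SPARK_HOME"));
    }
}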

2. Create a new Gradle project named spark in IDEA and add the following to build.gradle:

dependencies {
    implementation("org.apache.spark:spark-sql_2.12:3.0.1")
}

Then create the application class with the following main method (imports shown for completeness):
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

public class SimpleApp {
    public static void main(String[] args) {
        // To run directly against the standalone cluster instead of locally,
        // use the commented-out configuration below:
        //SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("spark://master1:7070")
        //        .setJars(new String[]{"E:\\spark\\out\\artifacts\\spark_main_jar\\spark.main.jar"});
        SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]");
        System.setProperty("user.name", "root");
        String logFile = "file:///e:/README.md"; // Should be some file on your system
        SparkSession spark = SparkSession.builder().appName("Simple Application").config(conf).getOrCreate();
        Dataset<String> logData = spark.read().textFile(logFile).cache();

        // Count lines containing "a" and lines containing "b"
        FilterFunction<String> ffa = s -> s.contains("a");
        FilterFunction<String> ffb = s -> s.contains("b");

        long numAs = logData.filter(ffa).count();
        long numBs = logData.filter(ffb).count();

        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);

        spark.stop();
    }
}
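As the commented-out lines above suggest, the same job can also be pointed at the standalone cluster directly from IDEA. In that case setJars must reference a jar containing your compiled classes so the executors on the cluster can load them. A minimal sketch, reusing the master URL and jar path from the comments (the class name SimpleAppStandalone is just for illustration):

import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

public class SimpleAppStandalone {
    public static void main(String[] args) {
        // Same job as above; only the SparkConf construction changes.
        SparkConf conf = new SparkConf()
                .setAppName("Simple Application")
                .setMaster("spark://master1:7070")   // standalone master from the comments above
                .setJars(new String[]{"E:\\spark\\out\\artifacts\\spark_main_jar\\spark.main.jar"}); // jar containing your classes
        SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
        // ... same filter/count logic as in the local example ...
        spark.stop();
    }
}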

3. Configure an IDEA Run/Debug configuration for the main class and run it; with the master set to local[*] the application runs entirely inside the IDE.

Submitting to a Spark Standalone Cluster or to YARN

1. Modify the main method so it no longer hard-codes the master; spark-submit will supply the master (and deploy mode):

package com.test;

import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

public class SimpleApp {
    public static void main(String[] args) {
        String logFile = "file:///usr/local/spark/README.md"; // Should be some file on your system
        SparkSession spark = SparkSession.builder().appName("Simple Application").getOrCreate();
        Dataset<String> logData = spark.read().textFile(logFile).cache();

        FilterFunction<String> ffa = s -> s.contains("a");
        FilterFunction<String> ffb = s -> s.contains("b");

        long numAs = logData.filter(ffa).count();
        long numBs = logData.filter(ffb).count();

        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);

        spark.stop();
    }
}

2. Package the project as an executable jar (built with Gradle or via IDEA's Artifacts):

// Gradle task: build an executable jar with the main class recorded in the manifest
jar {
    manifest {
        attributes 'Manifest-Version': 1.0,
                   'Main-Class': 'com.test.SimpleApp'
    }
}

3. Upload the jar to the cluster where Spark is installed.

4. Submit the job to the Spark Standalone cluster:

/usr/local/spark/bin/spark-submit --class com.test.SimpleApp --master spark://master1:7070 spark.jar

5. Submit the job to YARN:

/usr/local/spark/bin/spark-submit --class com.test.SimpleApp --master yarn --deploy-mode cluster spark.jar

Remote Debugging

1. When submitting the job, add --driver-java-options "-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8888"; suspend=y makes the driver JVM wait until a debugger attaches before running the job.

2. Submit to the Spark cluster and run in debug mode:

/usr/local/spark/bin/spark-submit --class com.test.SimpleApp --master spark://master1:7070 --driver-java-options "-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8888" spark.jar

3. In IDEA, add a Remote (Remote JVM Debug) run configuration under Run > Edit Configurations, with Host set to the machine running the driver and Port set to 8888, then start it to attach the debugger.

4. Spark jobs running on YARN can be debugged the same way; just add --driver-java-options "-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8888" when submitting the job.

    The only difference is that you do not know in advance which host to attach to: the YARN web UI shows which machine the ACCEPTED application is running on, so change the Host in the IDEA remote debug configuration to that machine's IP.

 

