Developing Spark Applications in IDEA, with Remote Execution and Debugging


Developing and Running Spark Applications Locally

1. Install Spark on Windows: set the environment variable SPARK_HOME=D:\spark-3.0.1 and append %SPARK_HOME%\bin;%SPARK_HOME%\sbin to the PATH environment variable. (Running Spark locally on Windows usually also needs winutils.exe with HADOOP_HOME set; without it you may see Hadoop-related errors.)
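A minimal sketch for setting these from a command prompt (setx writes permanent user-level variables and takes effect in new console sessions; editing them in the System Properties dialog works just as well):

    :: Persist SPARK_HOME and extend PATH (paths match the step above)
    setx SPARK_HOME "D:\spark-3.0.1"
    setx PATH "%PATH%;D:\spark-3.0.1\bin;D:\spark-3.0.1\sbin"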

2. Create a new Gradle project named spark in IDEA, and add the following to build.gradle (the _2.12 suffix is the Scala version of the Spark build):

dependencies {
    implementation("org.apache.spark:spark-sql_2.12:3.0.1")
}

Add the following code to the main method:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

    public static void main(String[] args) {
        // To submit straight to a standalone cluster from IDEA instead of running
        // locally, set the master URL and ship the application jar:
        //SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("spark://master1:7077")
        //        .setJars(new String[]{"E:\\spark\\out\\artifacts\\spark_main_jar\\spark.main.jar"});
        SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]");
        System.setProperty("user.name", "root");
        String logFile = "file:///e:/README.md"; // Should be some file on your system
        SparkSession spark = SparkSession.builder().appName("Simple Application").config(conf).getOrCreate();
        Dataset<String> logData = spark.read().textFile(logFile).cache();

        FilterFunction<String> ffa = s -> s.contains("a");
        FilterFunction<String> ffb = s -> s.contains("b");

        long numAs = logData.filter(ffa).count();
        long numBs = logData.filter(ffb).count();

        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);

        spark.stop();
    }

3. Configure an IDEA Run/Debug configuration for the main class and run it locally (Spark 3.0.x runs on Java 8 or 11; newer JDKs may fail).

Submitting to a Spark Standalone Cluster or YARN

1. Modify the main method. setMaster and setJars are dropped, because the master URL and the application jar are supplied by spark-submit:

package com.test;

import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

public class SimpleApp {
    public static void main(String[] args) {
        String logFile = "file:///usr/local/spark/README.md"; // Should be some file on your system
        SparkSession spark = SparkSession.builder().appName("Simple Application").getOrCreate();
        Dataset<String> logData = spark.read().textFile(logFile).cache();

        FilterFunction<String> ffa = s -> s.contains("a");
        FilterFunction<String> ffb = s -> s.contains("b");

        long numAs = logData.filter(ffa).count();
        long numBs = logData.filter(ffb).count();

        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);

        spark.stop();
    }
}

2. Package the project as an executable jar (via Gradle or IDEA Artifacts):

// Gradle configuration to build an executable jar
apply plugin: 'application'
mainClassName = 'com.test.SimpleApp'

jar {
    manifest {
        attributes "Manifest-Version": 1.0, 'Main-Class': 'com.test.SimpleApp'
    }
}

3. Upload the jar to a machine in the Spark cluster.

4. Submit to the Spark standalone cluster:

/usr/local/spark/bin/spark-submit --class com.test.SimpleApp --master spark://master1:7077 spark.jar

5. Submit to YARN:

/usr/local/spark/bin/spark-submit --class com.test.SimpleApp --master yarn --deploy-mode cluster spark.jar
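Resource options are normally added as well; a sketch with common flags (the values below are placeholders, not from the original article):

/usr/local/spark/bin/spark-submit --class com.test.SimpleApp --master yarn --deploy-mode cluster \
    --driver-memory 1g --executor-memory 1g --num-executors 2 spark.jar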

Remote Debugging

1. When submitting the job, add --driver-java-options "-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8888".
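-Xdebug/-Xrunjdwp is the legacy JDWP syntax; on any modern JVM the equivalent agent form also works:

--driver-java-options "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8888"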

2. Submit to the Spark cluster in debug mode. With suspend=y the driver JVM waits at startup until a debugger attaches:

/usr/local/spark/bin/spark-submit --class com.test.SimpleApp --master spark://master1:7077 --driver-java-options "-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8888" spark.jar

3. In IDEA, add a Remote JVM Debug run configuration: set Host to the machine where spark-submit was run (the driver) and Port to 8888, then start the debug session with breakpoints set.

4. Spark jobs on YARN can be debugged the same way: just add --driver-java-options "-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8888" when submitting.

The only difference: you may not know in advance which host to attach to. The YARN web UI shows which node the ACCEPTED application is running on; update the Host IP in the IDEA debug configuration accordingly.
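Note that --driver-java-options attaches the debugger to the driver only. To debug executor JVMs instead, a similar sketch (an assumption beyond the original article; suspend=n keeps executors from blocking at startup, and a fixed port only works with one executor per node):

/usr/local/spark/bin/spark-submit --class com.test.SimpleApp --master spark://master1:7077 \
    --conf "spark.executor.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=8888" \
    spark.jar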

 

