Running a Spark Application from Local Development
1. Install Spark on Windows: set the environment variable SPARK_HOME=D:\spark-3.0.1 and append %SPARK_HOME%\bin;%SPARK_HOME%\sbin; to PATH.
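To verify the setup, open a new terminal and run spark-submit --version; it should print the Spark 3.0.1 version banner.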
2. Create a Gradle project named spark in IDEA and add the following to build.gradle:
dependencies {
    implementation("org.apache.spark:spark-sql_2.12:3.0.1")
}
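One dependency is enough here: spark-sql_2.12 pulls in spark-core transitively.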
Add the following code (shown as a complete class so the required imports are visible):
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

public class SimpleApp {
    public static void main(String[] args) {
        // Alternative: run against a standalone cluster and ship the built jar:
        // SparkConf conf = new SparkConf().setAppName("Simple Application")
        //         .setMaster("spark://master1:7070")
        //         .setJars(new String[]{"E:\\spark\\out\\artifacts\\spark_main_jar\\spark.main.jar"});
        SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]");
        System.setProperty("user.name", "root");
        String logFile = "file:///e:/README.md"; // Should be some file on your system
        SparkSession spark = SparkSession.builder().appName("Simple Application").config(conf).getOrCreate();
        Dataset<String> logData = spark.read().textFile(logFile).cache();
        // Declare the predicates as FilterFunction<String> so the filter(...) overload is unambiguous
        FilterFunction<String> ffa = s -> s.contains("a");
        FilterFunction<String> ffb = s -> s.contains("b");
        long numAs = logData.filter(ffa).count();
        long numBs = logData.filter(ffb).count();
        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
        spark.stop();
    }
}
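The explicit FilterFunction<String> declarations are deliberate: with Spark built against Scala 2.12, passing a bare lambda to Dataset.filter can be ambiguous between the filter(FilterFunction<T>) and filter(scala.Function1<T, Object>) overloads, so javac needs the target type spelled out.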
3. Configure an IDEA Run/Debug configuration for the class and run it directly.
Submitting to a Spark Standalone Cluster or YARN
1. Modify the main method (no setMaster here; the master is supplied by spark-submit):
// Same imports as the local example above
public class SimpleApp {
    public static void main(String[] args) {
        String logFile = "file:///usr/local/spark/README.md"; // Should be some file on your system
        // The master URL comes from the spark-submit command line, not the code
        SparkSession spark = SparkSession.builder().appName("Simple Application").getOrCreate();
        Dataset<String> logData = spark.read().textFile(logFile).cache();
        FilterFunction<String> ffa = s -> s.contains("a");
        FilterFunction<String> ffb = s -> s.contains("b");
        long numAs = logData.filter(ffa).count();
        long numBs = logData.filter(ffb).count();
        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
        spark.stop();
    }
}
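Optionally, make the input path configurable. This is a sketch under one assumption not in the original: the path is passed as the first program argument (anything placed after the jar on the spark-submit command line lands in args):

// Hypothetical variant: take the input path from args[0],
// falling back to the original hard-coded default
String logFile = args.length > 0
        ? args[0]
        : "file:///usr/local/spark/README.md";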
2. Package the project as a runnable jar (via Gradle or IDEA Artifacts).
# Gradle task configuration that builds a runnable jar
// mainClassName belongs to the application plugin, not the jar task
// mainClassName = "com.test.SimpleApp"
jar {
    manifest {
        attributes "Manifest-Version": 1.0,
                   'Main-Class': 'com.test.SimpleApp'
    }
}
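Two notes. The Main-Class manifest attribute is optional when submitting with spark-submit --class, since the entry point is given on the command line; it only matters if you also want java -jar spark.jar to work. Also, the plain jar task packages only your own classes; that is fine here because the cluster provides the Spark classes at runtime, but any extra third-party libraries would need to be bundled into a fat jar or passed with --jars.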
3. Upload the jar to a machine in the Spark cluster.
4. Submit it to the Spark cluster:
/usr/local/spark/bin/spark-submit --class com.test.SimpleApp --master spark://master1:7070 spark.jar
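Without --deploy-mode, the submission defaults to client mode, meaning the driver runs on the machine you submit from; add --deploy-mode cluster to have a worker host the driver instead.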
5. Or submit it to YARN:
/usr/local/spark/bin/spark-submit --class com.test.SimpleApp --master yarn --deploy-mode cluster spark.jar
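In cluster mode the driver's stdout (including the "Lines with a: ..." output) goes to the YARN container logs rather than your terminal; retrieve it afterwards with yarn logs -applicationId <appId>.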
Remote Debugging
1. When submitting the job, add --driver-java-options "-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8888".
2. Submit to the Spark cluster so it runs in debug mode:
/usr/local/spark/bin/spark-submit --class com.test.SimpleApp --master spark://master1:7070 --driver-java-options "-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8888" spark.jar
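Two things to keep in mind: suspend=y makes the driver JVM block at startup until a debugger attaches, so the job will look stuck until you connect from IDEA; and -Xdebug -Xrunjdwp is the legacy syntax, with -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8888 being the modern equivalent (on Java 9+ use address=*:8888, since the agent otherwise binds to localhost only).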
3. Add a Remote JVM Debug configuration in IDEA pointing at the driver host on port 8888, and start it.
4. Spark jobs on YARN can of course be debugged the same way; just add --driver-java-options "-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8888" when submitting.
The only difference is that you don't know in advance which host to attach to: open the YARN web UI, find which node the ACCEPTED application is running on, then set that machine's IP as the Host in the IDEA debug configuration.
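Note that --driver-java-options attaches the debugger to the driver only; to step through executor code you would instead pass the same JDWP options via --conf spark.executor.extraJavaOptions, though in practice you would limit the job to a single executor so the debug port isn't contended.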