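Scenario:
There are two tables, which may come from text or JSON data. Once structured, they are Table1(A, B, C) and Table2(C, D, E), related through column C. The task is to compute the sum D+E and return the three columns (A, B, D+E).
Solution:
Approach: Spark SQL can create tables by reading JSON, and the tables it creates can be queried together, so joins and statistical analysis can be written much like traditional SQL statements.
Code: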
package study

import org.apache.spark.sql.SparkSession

object TestDataFrame2 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local[*]")
      .appName("Spark SQL basic example")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()

    // Register each JSON file as a table backed by the JSON data source.
    spark.sql("""create table table1 using org.apache.spark.sql.json options(path "F://0002_BigData//Soft//comoceanspark//src//resources//Table1.json")""")
    spark.sql("""create table table2 using org.apache.spark.sql.json options(path "F://0002_BigData//Soft//comoceanspark//src//resources//Table2.json")""")

    // List the tables that were just created.
    spark.sql("show tables").show()

    // Join on C, compute D+E as DE, and show the five rows with the largest sums.
    spark.sql("select A,B,(D+E) as DE from table1 inner join table2 on table1.C = table2.C order by DE desc limit 5").show()

    spark.stop()
  }
}
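The same join can also be written with the DataFrame API instead of SQL strings. The following is a minimal sketch, not the original author's code: the object name TestDataFrameJoin and the relative file paths are assumptions for illustration, so point them at the real Table1.json and Table2.json.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical object name, used only for this sketch.
object TestDataFrameJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local[*]")
      .appName("DataFrame join sketch")
      .getOrCreate()

    // Assumed paths: replace with the actual locations of the JSON files.
    val table1 = spark.read.json("Table1.json")
    val table2 = spark.read.json("Table2.json")

    // Joining on Seq("C") performs an inner equi-join and keeps a single C column.
    val result = table1
      .join(table2, Seq("C"))
      .select(col("A"), col("B"), (col("D") + col("E")).as("DE"))
      .orderBy(col("DE").desc)
      .limit(5)

    result.show()
    spark.stop()
  }
}
```

Reading the files with spark.read.json avoids creating catalog tables at all, which may be convenient when the data is only needed for one query.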
Table1.json:
{"A":"A1", "B":30, "C":1}
{"A":"A2", "B":31, "C":2}
{"A":"A3", "B":32, "C":3}
{"A":"A4", "B":33, "C":4}
{"A":"A5", "B":34, "C":5}
{"A":"A6", "B":35, "C":6}
{"A":"A7", "B":36, "C":7}
{"A":"A8", "B":37, "C":8}
{"A":"A9", "B":38, "C":9}
Table2.json:
{"C":1, "D":1, "E":1}
{"C":2, "D":2, "E":2}
{"C":3, "D":3, "E":3}
{"C":4, "D":4, "E":4}
{"C":5, "D":5, "E":5}
{"C":6, "D":6, "E":6}
{"C":7, "D":7, "E":7}
{"C":8, "D":8, "E":8}
{"C":9, "D":9, "E":9}
Result:
Table listing:

Computed result:
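For the sample data above, the rows the query should return can be cross-checked without Spark by reproducing the join with plain Scala collections. This is only a sketch under stated assumptions: the names JoinCheck, Row1, and Row2 are hypothetical, and the lookup map assumes C is unique in Table2, which holds for this sample.

```scala
object JoinCheck {
  // Hypothetical row types mirroring the JSON fields.
  final case class Row1(a: String, b: Int, c: Int)
  final case class Row2(c: Int, d: Int, e: Int)

  // Rebuild the sample data: A1..A9 with B = 30..38, and D = E = C.
  val table1: Seq[Row1] = (1 to 9).map(i => Row1(s"A$i", 29 + i, i))
  val table2: Seq[Row2] = (1 to 9).map(i => Row2(i, i, i))

  // Inner join on C, project (A, B, D+E), then keep the five largest sums.
  val top5: Seq[(String, Int, Int)] = {
    // Assumes C is unique in table2, as it is in the sample data.
    val byC = table2.map(r => r.c -> r).toMap
    table1
      .flatMap(r1 => byC.get(r1.c).map(r2 => (r1.a, r1.b, r2.d + r2.e)))
      .sortBy { case (_, _, de) => -de }
      .take(5)
  }

  def main(args: Array[String]): Unit =
    top5.foreach { case (a, b, de) => println(s"$a\t$b\t$de") }
}
```

Running it prints the rows (A9, 38, 18), (A8, 37, 16), (A7, 36, 14), (A6, 35, 12), and (A5, 34, 10), which is what the SQL query's "order by DE desc limit 5" should yield for this data.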

