
The Spark integration with Kudu supports:
- DDL operations (creating/deleting tables)
- Native Kudu RDDs
- A native Kudu data source for DataFrame integration
- Reading data from Kudu
- Performing insert/update/upsert/delete operations against Kudu
- Predicate pushdown
- Schema mapping between Kudu and Spark SQL
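As a quick illustration of the DataFrame integration and predicate pushdown listed above, here is a minimal sketch of reading a Kudu table through the kudu-spark data source. It assumes a SQLContext named sqlContext already exists; the master address and table name are placeholders, not values from a real cluster:

```scala
// Read a Kudu table into a DataFrame via the kudu-spark data source.
// "kudu.master" and "kudu.table" below are illustrative placeholder values.
val df = sqlContext.read
  .options(Map(
    "kudu.master" -> "kudu-master.example.com:7051",
    "kudu.table"  -> "test_table"))
  .format("org.apache.kudu.spark.kudu")
  .load()

// Supported filters on the DataFrame are pushed down to Kudu,
// so only matching rows are scanned and returned.
df.filter(df("id") > 100).select("id", "name").show()
```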
So far we have seen several contexts, such as SparkContext, SQLContext, HiveContext, and SparkSession. Kudu introduces one more: the KuduContext. This is the main serializable object that can be broadcast within a Spark application; it represents the interaction with the Kudu Java client on the Spark executors.
The KuduContext provides the methods needed to perform DDL operations, interface with native Kudu RDDs, perform update/insert/delete operations on data, convert data types from Kudu to Spark, and so on.
Some common operations:
// Create a Spark and SQL context
val sc = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sc)

// Comma-separated list of Kudu masters with port numbers
val master1 = "ip-10-13-4-249.ec2.internal:7051"
val master2 = "ip-10-13-5-150.ec2.internal:7051"
val master3 = "ip-10-13-5-56.ec2.internal:7051"
val kuduMasters = Seq(master1, master2, master3).mkString(",")

// Create an instance of a KuduContext
val kuduContext = new KuduContext(kuduMasters)
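With a KuduContext in hand, the DDL and DML operations mentioned above follow the same pattern. The sketch below is illustrative: the table name, schema, and partitioning values are assumptions, and df stands for any DataFrame whose schema matches the table:

```scala
import org.apache.kudu.client.CreateTableOptions
import org.apache.spark.sql.types._
import scala.collection.JavaConverters._

// Define a Spark SQL schema; the KuduContext maps it to Kudu column types.
// Primary key columns must be non-nullable.
val schema = StructType(List(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)))

// DDL: create a table with "id" as the primary key,
// hash-partitioned into 3 buckets (table name is a placeholder)
if (!kuduContext.tableExists("test_table")) {
  kuduContext.createTable("test_table", schema, Seq("id"),
    new CreateTableOptions()
      .setNumReplicas(3)
      .addHashPartitions(List("id").asJava, 3))
}

// DML: write a DataFrame's rows to the table
kuduContext.insertRows(df, "test_table")   // fails on duplicate keys
kuduContext.upsertRows(df, "test_table")   // insert or overwrite by key
kuduContext.deleteRows(df.select("id"), "test_table")

// DDL: drop the table
kuduContext.deleteTable("test_table")
```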
<repositories>
<repository>
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
</repositories>
<dependencies>
<!-- https://mvnrepository.com/artifact/org.apache.kudu/kudu-client -->
<dependency>
<groupId>org.apache.kudu</groupId>
<artifactId>kudu-client</artifactId>
<version>1.6.0-cdh5.14.0</version>
<scope>test</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.kudu/kudu-client-tools -->
<dependency>
<groupId>org.apache.kudu</groupId>
<artifactId>kudu-client-tools</artifactId>
<version>1.6.0-cdh5.14.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.kudu/kudu-spark2 -->
<dependency>
<groupId>org.apache.kudu</groupId>
<artifactId>kudu-spark2_2.11</artifactId>
<version>1.6.0-cdh5.14.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.1.0</version>
</dependency>
</dependencies>
