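Reposted from: http://blog.csdn.net/jiangpeng59/article/details/53318761
foreachRDD is typically used to save the results computed by Spark Streaming to an external system such as HDFS, MySQL, or Redis. Understanding the points below helps avoid several common pitfalls.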
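Pitfall 1: instantiating the external connection object in the wrong place, as in the code below. The connection is created on the driver but used on the workers, so it would have to be serialized and shipped to them, and connection objects can rarely be transferred between machines.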
- dstream.foreachRDD { rdd =>
- val connection = createNewConnection() // executed at the driver
- rdd.foreach { record =>
- connection.send(record) // executed at the worker
- }
- }
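To see why this first pattern fails, the sketch below mimics the step where a closure has to be serialized before it can leave the driver. `FakeConnection`, `canSerialize`, and `ClosureSerializationDemo` are hypothetical names, and plain Java serialization stands in for Spark's closure handling, so treat it as an illustration under those assumptions rather than Spark's actual code path.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for a real connection; deliberately NOT Serializable,
// like most socket- or JDBC-backed connection objects.
class FakeConnection {
  def send(record: String): Unit = println(s"sent $record")
}

object ClosureSerializationDemo {
  // Tries to Java-serialize a value, roughly what must happen before a
  // closure can be shipped to a worker; returns true on success.
  def canSerialize(value: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try {
      out.writeObject(value)
      true
    } catch {
      case _: NotSerializableException => false
    } finally {
      out.close()
    }
  }

  // Closure that captures a connection created outside of it (pitfall 1).
  def capturingClosure(): String => Unit = {
    val connection = new FakeConnection
    record => connection.send(record)
  }

  // Closure that creates its connection inside its own body.
  def selfContainedClosure(): String => Unit =
    record => new FakeConnection().send(record)

  def main(args: Array[String]): Unit = {
    println(s"capturing closure serializable:      ${canSerialize(capturingClosure())}")
    println(s"self-contained closure serializable: ${canSerialize(selfContainedClosure())}")
  }
}
```

The closure that captures the connection drags the non-serializable object along with it, while the closure that builds the connection in its own body carries nothing that needs to cross the network.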
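Pitfall 2: creating a connection object for every record. Opening and closing a connection costs time and resources, so doing it per record adds significant overhead and lowers throughput.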
- dstream.foreachRDD { rdd =>
- rdd.foreach { record =>
- val connection = createNewConnection()
- connection.send(record)
- connection.close()
- }
- }
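A better approach is to create one connection object per partition, so that the cost of opening the connection is spread over all records in that partition. The code looks like this: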
- dstream.foreachRDD { rdd =>
- rdd.foreachPartition { partitionOfRecords =>
- val connection = createNewConnection()
- partitionOfRecords.foreach(record => connection.send(record))
- connection.close()
- }
- }
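The difference between the per-record and per-partition patterns can be made concrete by counting how many connections each one opens. This is a minimal sketch in plain Scala with no Spark involved: `CountingConnection` and `ConnectionCostDemo` are hypothetical, and nested `Seq`s stand in for the partitions of an RDD.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical connection that does nothing except count how often it is opened.
class CountingConnection {
  CountingConnection.opened.incrementAndGet()
  def send(record: String): Unit = ()
  def close(): Unit = ()
}

object CountingConnection {
  val opened = new AtomicInteger(0)
}

object ConnectionCostDemo {
  // Pitfall 2: a new connection for every single record.
  def perRecord(partitions: Seq[Seq[String]]): Int = {
    CountingConnection.opened.set(0)
    partitions.foreach { partition =>
      partition.foreach { record =>
        val connection = new CountingConnection
        connection.send(record)
        connection.close()
      }
    }
    CountingConnection.opened.get()
  }

  // One connection per partition, reused for every record in it.
  def perPartition(partitions: Seq[Seq[String]]): Int = {
    CountingConnection.opened.set(0)
    partitions.foreach { partition =>
      val connection = new CountingConnection
      partition.foreach(record => connection.send(record))
      connection.close()
    }
    CountingConnection.opened.get()
  }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(Seq("a", "b", "c"), Seq("d", "e"), Seq("f"))
    println(s"per record:    ${perRecord(partitions)} connections")
    println(s"per partition: ${perPartition(partitions)} connections")
  }
}
```

With six records spread over three partitions, the per-record variant opens one connection per record, whereas the per-partition variant opens one per partition regardless of how many records each partition holds.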
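Going one step further, connection objects can be reused across partitions and batches by keeping them in a static, lazily initialized connection pool:
- dstream.foreachRDD { rdd =>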
- rdd.foreachPartition { partitionOfRecords =>
- // ConnectionPool is a static, lazily initialized pool of connections
- val connection = ConnectionPool.getConnection()
- partitionOfRecords.foreach(record => connection.send(record))
- ConnectionPool.returnConnection(connection) // return to the pool for future reuse
- }
- }
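The snippet above uses `ConnectionPool` without defining it. One possible shape for such a pool is sketched below. It is an assumption for illustration only: `PooledConnection` and `createdCount` are hypothetical, and a production pool would also need size limits, health checks, and proper shutdown.

```scala
import java.util.concurrent.ConcurrentLinkedQueue
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical connection type, used only for illustration.
class PooledConnection(val id: Int) {
  def send(record: String): Unit = println(s"[conn $id] $record")
}

// A Scala `object` is initialized lazily, once per JVM, on first access,
// so every executor process ends up with its own pool.
object ConnectionPool {
  private val idle = new ConcurrentLinkedQueue[PooledConnection]()
  private val created = new AtomicInteger(0)

  // Reuse an idle connection if one exists, otherwise open a new one.
  def getConnection(): PooledConnection = {
    val pooled = idle.poll()
    if (pooled != null) pooled
    else new PooledConnection(created.incrementAndGet())
  }

  // Hand the connection back so later partitions or batches can reuse it.
  def returnConnection(connection: PooledConnection): Unit = {
    idle.offer(connection)
  }

  // Number of connections opened so far (for inspection only).
  def createdCount: Int = created.get()
}
```

Because the pool lives in a singleton object rather than in the closure, nothing connection-related has to be serialized: each worker JVM builds its own pool the first time `getConnection()` is called there.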
Below is a code sample found online that saves Spark Streaming results to MySQL:
- package spark.examples.streaming
- import java.sql.{PreparedStatement, Connection, DriverManager}
- import java.util.concurrent.atomic.AtomicInteger
- import org.apache.spark.SparkConf
- import org.apache.spark.streaming.{Seconds, StreamingContext}
- import org.apache.spark.streaming._
- import org.apache.spark.streaming.StreamingContext._
- object SparkStreamingForPartition {
- def main(args: Array[String]) {
- val conf = new SparkConf().setAppName("NetCatWordCount")
- conf.setMaster("local[3]")
- val ssc = new StreamingContext(conf, Seconds(5))
- // A DStream is a sequence of RDDs, which is why foreachRDD is a natural fit here
- val dstream = ssc.socketTextStream("192.168.26.140", 9999)
- dstream.foreachRDD(rdd => {
- //embedded function
- def func(records: Iterator[String]) {
- var conn: Connection = null
- var stmt: PreparedStatement = null
- try {
- val url = "jdbc:mysql://192.168.26.140:3306/person";
- val user = "root";
- val password = ""
- conn = DriverManager.getConnection(url, user, password)
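- // Prepare the statement once per partition; preparing it per word leaks every statement except the last
- val sql = "insert into TBL_WORDS(word) values (?)"
- stmt = conn.prepareStatement(sql)
- records.flatMap(_.split(" ")).foreach(word => {
- stmt.setString(1, word)
- stmt.executeUpdate()
- })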
- } catch {
- case e: Exception => e.printStackTrace()
- } finally {
- if (stmt != null) {
- stmt.close()
- }
- if (conn != null) {
- conn.close()
- }
- }
- }
- val repartitionedRDD = rdd.repartition(3)
- repartitionedRDD.foreachPartition(func)
- })
- ssc.start()
- ssc.awaitTermination()
- }
- }
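Details to note:
Like RDDs, DStreams are evaluated lazily: nothing is computed until an action is encountered. The RDDs inside a DStream output operation must therefore invoke an action for the received data to be processed. Even if the code contains foreachRDD, when no RDD action is called inside it, Spark Streaming simply receives the data without processing it.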
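The same behaviour can be observed with plain Scala iterators, which are also lazy. The sketch below is only an analogy and involves no Spark: `LazyEvaluationDemo` and `processedCount` are hypothetical names, the `map` plays the role of a transformation, and the `foreach` plays the role of an RDD action.

```scala
object LazyEvaluationDemo {
  // Applies a lazy map to the input and reports how many elements were
  // actually processed. `runAction` decides whether a terminal operation
  // (the analogue of an RDD action) is invoked at all.
  def processedCount(records: Seq[String], runAction: Boolean): Int = {
    var processed = 0
    val transformed = records.iterator.map { record =>
      processed += 1
      record.toUpperCase
    }
    // Without this call the map above is never evaluated.
    if (runAction) transformed.foreach(_ => ())
    processed
  }

  def main(args: Array[String]): Unit = {
    val records = Seq("a", "b", "c")
    println(s"without action: ${processedCount(records, runAction = false)} records processed")
    println(s"with action:    ${processedCount(records, runAction = true)} records processed")
  }
}
```

Declaring the transformation alone touches no data; only the terminal call forces each element through it, which mirrors why a foreachRDD body needs an action such as `foreach` or `foreachPartition`.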