Spark Streaming + Flume for data collection (Flume pushes, or Spark Streaming pulls)

Reposted article · 2018-04-24 10:53 · Spark+Storm / Flume

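1. For open-source projects like these, it is most direct and quickest to learn to read the official English documentation. Briefly, here is how to approach it.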

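2. Go to Flume's conf directory and create a file named flume-spark-push.sh (despite the .sh extension, it is a Flume properties file, not a shell script):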

[hadoop@slaver1 conf]$ vim flume-spark-push.sh

Fill in the file as follows. Flume forwards the events through an Avro sink.

# example.conf: A single-node Flume configuration

# Name the components on this agent
# Name the agent's three components (sources, sinks, channels); the names are only logical labels.
# a1 is the name of the agent.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Describe/configure the source r1
# type selects the concrete source implementation. exec runs a command and turns each line of its output into an event.
# (Other source types include netcat, which listens on a network port, and spooldir, which watches a directory.)
# command is the command to run; here it tails the log file.
# shell is the shell used to invoke that command.

a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /home/hadoop/data_hadoop/spark-flume/wctotal.log
a1.sources.r1.shell = /bin/bash -c

# Describe the sink k1
# A logger sink would print the events to the console. It is left commented out because the Avro sink defined further down is used instead.
#a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
# type is memory.
# Events are delivered to the sink in batches. Channel parameters:
# capacity: the maximum number of events the channel can hold; 1000 means 1000 events.
# transactionCapacity: the maximum number of events taken from the source or delivered to the sink per transaction.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100


# define sink
a1.sinks.k1.type= avro
a1.sinks.k1.hostname = slaver1
a1.sinks.k1.port = 9999


# Bind the source and sink to the channel
# This wires source -> channel -> sink together.
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
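
With the configuration saved, the agent still has to be started. The following is a minimal sketch, assuming the command is run from the Flume installation directory and that the file sits in conf/ as created in step 2. The value passed to --name must match the a1 prefix used in the file. The -Dflume.root.logger=INFO,console option only sends the agent's log to the terminal and can be left out. The guard around the call is an illustrative addition, not part of Flume.

```shell
# Assumption: the current directory is the Flume installation directory,
# so that bin/flume-ng and conf/ resolve.
AGENT_NAME="a1"                        # must match the "a1." prefix in the config
CONF_FILE="conf/flume-spark-push.sh"   # the properties file created in step 2

if [ -x bin/flume-ng ]; then
  bin/flume-ng agent \
    --conf conf \
    --conf-file "$CONF_FILE" \
    --name "$AGENT_NAME" \
    -Dflume.root.logger=INFO,console
else
  echo "bin/flume-ng not found; cd to the Flume installation directory first"
fi
```

In the push setup the Avro sink can only deliver events once something is listening on slaver1:9999, so it is probably simplest to start the Spark Streaming receiver before the agent.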

3. Next, look at the demo code in the Spark repository on GitHub: https://github.com/apache/spark

A concrete example: https://github.com/apache/spark/blob/v1.5.1/examples/src/main/scala/org/apache/spark/examples/streaming/FlumeEventCount.scala

The code, adapted for spark-shell, looks like this:

import org.apache.spark._
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming._
import org.apache.spark.streaming.flume._

// sc is the SparkContext provided by spark-shell; the batch interval is 5 seconds
val ssc = new StreamingContext(sc, Seconds(5))

// Push-based receiver: listens on slaver1:9999, the address the Flume Avro sink sends to
val stream = FlumeUtils.createStream(ssc, "slaver1", 9999, StorageLevel.MEMORY_ONLY_SER_2)

// Count the events received in each batch and print the result
stream.count().map(cnt => "Received " + cnt + " flume events." ).print()

ssc.start()             // Start the computation
ssc.awaitTermination()  // Wait for the computation to terminate

Importing the Flume package fails because the package cannot be found: import org.apache.spark.streaming.flume._

scala> import org.apache.spark.streaming.flume._
<console>:28: error: object flume is not a member of package org.apache.spark.streaming
       import org.apache.spark.streaming.flume._

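Because I have not set up a Maven project, the jar has to be supplied on the command line. I am leaving this here for now and will continue these notes later.

To be continued...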
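
One way to supply the jar from the command line is the --packages option of spark-shell, which resolves an artifact and its dependencies from a Maven repository. The sketch below assumes Spark 1.5.1 built against Scala 2.10, matching the v1.5.1 example linked above, and assumes the machine can reach a Maven repository. Both version values are assumptions to adjust to the actual installation. On a machine without network access, passing locally downloaded jars with --jars is the usual alternative.

```shell
# Assumptions: Spark 1.5.1 with Scala 2.10; change both to match your installation.
SPARK_VERSION="1.5.1"
SCALA_BINARY="2.10"
FLUME_PKG="org.apache.spark:spark-streaming-flume_${SCALA_BINARY}:${SPARK_VERSION}"
echo "Using package: $FLUME_PKG"

if command -v spark-shell >/dev/null 2>&1; then
  # Starts the shell with the Flume integration classes on the classpath
  spark-shell --packages "$FLUME_PKG"
else
  echo "spark-shell not found on PATH"
fi
```

Once the shell starts this way, the import of org.apache.spark.streaming.flume._ should resolve.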
