Our business requires reading data from Kafka, transforming it, and finally sinking it to the business message queue. To guarantee data reliability, we also persist the sink's result data. We chose to sink the stream to HDFS, for which Flink provides an HDFS connector out of the box. This post shows how to write streaming data to HDFS and then load that data into a Hive table.
1. pom.xml configuration
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-filesystem_2.11</artifactId>
    <version>1.8.0</version>
</dependency>
2. Flink DataStream code
代碼是將最后的結果數據,拼接成CSV格式,最后寫入HDFS中。下面的邏輯在真實地業務中刪除許多。僅保留有用對大家的代碼。
public class RMQAndBucketFileConnectSink {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        Properties p = new Properties();
        p.setProperty("bootstrap.servers", "localhost:9092");

        SingleOutputStreamOperator<String> ds = env
                .addSource(new FlinkKafkaConsumer010<String>("user", new SimpleStringSchema(), p))
                .map(new MapFunction<String, User>() {
                    @Override
                    public User map(String value) throws Exception {
                        return new Gson().fromJson(value, User.class);
                    }
                })
                .assignTimestampsAndWatermarks(new AscendingTimestampExtractor<User>() {
                    @Override
                    public long extractAscendingTimestamp(User element) {
                        return element.createTime;
                    }
                })
                .map(new MapFunction<User, String>() {
                    @Override
                    public String map(User value) throws Exception {
                        return value.userId + "," + value.name + "," + value.age + ","
                                + value.sex + "," + value.createTime + "," + value.updateTime;
                    }
                });

        // Write to RabbitMQ
        RMQConnectionConfig rmqConnectionConfig = new RMQConnectionConfig.Builder()
                .setHost("localhost")
                .setVirtualHost("/")
                .setPort(5672)
                .setUserName("admin")
                .setPassword("admin")
                .build();
        // If the queue is durable, you need to override RMQSink's setupQueue method
        RMQSink<String> rmqSink = new RMQSink<>(rmqConnectionConfig, "queue_name", new SimpleStringSchema());
        ds.addSink(rmqSink);

        // Write to HDFS
        BucketingSink<String> bucketingSink = new BucketingSink<>("/apps/hive/warehouse/users");
        // Split bucket directories by yyyyMMdd, similar to Hive date partitions
        bucketingSink.setBucketer(new DateTimeBucketer<>("yyyyMMdd", ZoneId.of("Asia/Shanghai")));
        // Once a part file reaches 128 MB, close it and open the next one
        bucketingSink.setBatchSize(1024 * 1024 * 128L);
        // Roll over to a new file once an hour
        bucketingSink.setBatchRolloverInterval(60 * 60 * 1000L);
        // Prefix for pending files (default is "_")
        bucketingSink.setPendingPrefix("");
        // Suffix for pending files (default is ".pending")
        bucketingSink.setPendingSuffix("");
        // Prefix for in-progress files (default is "_")
        bucketingSink.setInProgressPrefix(".");
        ds.addSink(bucketingSink);

        env.execute("RMQAndBucketFileConnectSink");
    }
}
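The job above deserializes each Kafka message into a `User` object with Gson. The source does not show that class, but its shape can be inferred from the fields the job reads; a minimal sketch (field names taken from the code above, everything else is an assumption):

```java
import java.io.Serializable;

// Hypothetical User POJO matching the JSON messages consumed from Kafka.
// Fields are public so Gson can populate them directly.
public class User implements Serializable {
    public String userId;
    public String name;
    public int age;
    public int sex;
    public long createTime;  // event time, epoch milliseconds (used for watermarks)
    public long updateTime;

    // Builds the CSV line in the same column order the job emits.
    public String toCsv() {
        return userId + "," + name + "," + age + "," + sex + ","
                + createTime + "," + updateTime;
    }
}
```

Note that the CSV column order must match the Hive table definition in the next section, since Hive splits each line on commas positionally.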
The resulting HDFS directory layout looks like this:
/apps/hive/warehouse/users/20190708
/apps/hive/warehouse/users/20190709
/apps/hive/warehouse/users/20190710
...
3. Creating the Hive table and loading data
Create the Hive table:
create external table default.users(
    `userId` string,
    `name` string,
    `age` int,
    `sex` int,
    `ctime` string,
    `utime` string
)
partitioned by (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Create a scheduled task that imports each day's HDFS files into Hive shortly after midnight. The import script, load_hive.sh, is as follows:
#!/usr/bin/env bash
d=`date -d "-1 day" +%Y%m%d`
# Register yesterday's HDFS directory as a Hive partition
/usr/hdp/2.6.3.0-235/hive/bin/hive -e "alter table default.users add partition (dt='${d}') location '/apps/hive/warehouse/users/${d}'"
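One caveat: the plain `ADD PARTITION` fails if the script is re-run for a day whose partition was already added. A sketch of an idempotent variant using `IF NOT EXISTS` (same hive path as above; the fallback `echo` branch is only there so the logic can run on machines without hive installed):

```shell
#!/usr/bin/env bash
# Idempotent variant of load_hive.sh: "add if not exists" makes it safe
# to re-run for a day whose partition already exists.
HIVE_BIN=${HIVE_BIN:-/usr/hdp/2.6.3.0-235/hive/bin/hive}
d=$(date -d "-1 day" +%Y%m%d)
sql="alter table default.users add if not exists partition (dt='${d}') location '/apps/hive/warehouse/users/${d}'"
if [ -x "$HIVE_BIN" ]; then
    "$HIVE_BIN" -e "$sql"
else
    # hive binary not present on this machine; just show the statement
    echo "$sql"
fi
```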
Schedule it with crontab to run daily after midnight.
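For example, a crontab entry like the following (the script location /opt/scripts/load_hive.sh and log path are assumptions; adjust to your environment):

```
# Run the load script at 01:00 every day, appending output to a log file
0 1 * * * /bin/bash /opt/scripts/load_hive.sh >> /var/log/load_hive.log 2>&1
```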