1. Introduction
Airflow is a workflow management platform open-sourced by Airbnb and written in Python. The previous article showed how to manage data pipelines with crontab, but its drawbacks are obvious. Addressing those drawbacks, the flexible and extensible Airflow offers:
- visualization of workflow dependencies;
- log tracking;
- easy extension (via Python scripts).
Compared with the Java-based Oozie, Airflow follows a "configuration as code" philosophy: workflows, trigger conditions, and so on are all described in Python, so writing a workflow feels like writing a script; workflows can be debugged (with the test and backfill commands) to catch errors earlier; and new functionality can be rolled out to production more quickly. Airflow takes full advantage of Python's lightness and flexibility, whereas Oozie feels clumsy and heavyweight by comparison (no offense to Java intended). "What makes Airflow great?" lists more of Airflow's strengths; other documents on installation and introduction can be found here and here.
The table below compares Airflow (as of version 1.7) with Oozie (as of version 4.0):
| Feature | Airflow | Oozie |
|---|---|---|
| Workflow definition | Python | XML |
| Data trigger | Sensor | datasets, input-events |
| Workflow node | operator | action |
| Complete workflow | DAG | workflow |
| Periodic scheduling | DAG `schedule_interval` | coordinator `frequency` |
| Task dependency | `>>`, `<<` | `<ok to>` |
| Built-in functions and variables | template macros | EL functions, EL constants |
I mentioned previously that Oozie cannot express complex DAGs, because Oozie can only declare downstream dependencies, not upstream ones. Airflow, by contrast, can express complex DAGs. Airflow also does not separate workflow from coordinator the way Oozie does; instead, trigger conditions and workflow nodes alike are operators, and operators compose into a DAG.
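Airflow expresses dependencies by overloading the bitshift operators on operators: `a >> b` makes `b` downstream of `a`, and `a << b` the reverse. A toy sketch of the idea (not Airflow's actual implementation):

```python
class Task:
    """Toy stand-in for an Airflow operator, to illustrate `>>` chaining."""
    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # a >> b: run b after a
        self.downstream.append(other)
        return other  # returning `other` enables a >> b >> c chains

    def __lshift__(self, other):
        # a << b: run a after b
        other.downstream.append(self)
        return other


a, b, c = Task('a'), Task('b'), Task('c')
a >> b >> c  # a -> b -> c
print([t.task_id for t in a.downstream])  # → ['b']
```

Because `__rshift__` returns its right-hand operand, chains like `a >> b >> c` read left to right, which is what makes complex DAGs easy to write down.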
2. In Practice
Common Airflow commands:
- initdb: initialize the metadata DB, which stores DAG definitions, run history, and so on;
- resetdb: wipe the metadata DB;
- list_dags: list all DAGs;
- list_tasks: list all tasks of a given DAG;
- test: run a single task for a given date, for testing;
- backfill: run a DAG over a given date range, for testing;
- webserver: start the webserver;
- scheduler: monitor and trigger DAGs.
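A typical session with these commands might look like this (`tag_cover` and `ad_sensor` are the DAG and task built later in this article; flags shown are for the 1.7-era CLI):

```shell
# one-time setup: create the metadata DB
airflow initdb

# inspect what the scheduler sees
airflow list_dags
airflow list_tasks tag_cover

# dry-run a single task for a given execution date
airflow test tag_cover ad_sensor 2016-12-06

# run the whole DAG over a date range
airflow backfill tag_cover -s 2016-12-06 -e 2016-12-13

# bring up the UI and the scheduler
airflow webserver -p 8080
airflow scheduler
```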
The rest of this section walks through a data pipeline built with Airflow.
First, some background: every week, check whether a partition of a Hive table has been generated; if so, trigger a Hive job that writes to Elasticsearch; once the Hive job finishes, run a Python script that queries Elasticsearch and sends a report. However, Airflow's Python 3 support was problematic at the time (some dependencies were written for Python 2), so I had to write my own HivePartitionSensor:
```python
# -*- coding: utf-8 -*-
# @Time   : 2016/11/29
# @Author : rain
import logging

from airflow.operators import BaseSensorOperator
from airflow.utils.decorators import apply_defaults
from impala.dbapi import connect


class HivePartitionSensor(BaseSensorOperator):
    """
    Waits for a partition to show up in Hive.

    :param conn_host, conn_port: the host and port of hiveserver2
    :param table: The name of the table to wait for, supports the dot
        notation (my_database.my_table)
    :type table: string
    :param partition: The partition clause to wait for. This is passed as
        is to the metastore Thrift client, and apparently supports SQL like
        notation as in ``ds='2016-12-01'``.
    :type partition: string
    """
    template_fields = ('table', 'partition',)
    ui_color = '#2b2d42'

    @apply_defaults
    def __init__(
            self,
            conn_host, conn_port,
            table, partition="ds='{{ ds }}'",
            poke_interval=60 * 3,
            *args, **kwargs):
        super(HivePartitionSensor, self).__init__(
            poke_interval=poke_interval, *args, **kwargs)
        if not partition:
            partition = "ds='{{ ds }}'"
        self.table = table
        self.partition = partition
        self.conn_host = conn_host
        self.conn_port = conn_port
        self.conn = connect(host=self.conn_host, port=self.conn_port,
                            auth_mechanism='PLAIN')

    def poke(self, context):
        logging.info(
            'Poking for table {self.table}, '
            'partition {self.partition}'.format(**locals()))
        cursor = self.conn.cursor()
        cursor.execute("show partitions {}".format(self.table))
        partitions = [row[0] for row in cursor.fetchall()]
        return self.partition in partitions
```
Under Python 3, the connection to HiveServer2 is made with the impyla module; HivePartitionSensor checks whether a given partition of a Hive table exists. Writing a custom operator is a bit like writing a Hive or Pig UDF; the finished operator goes under ~/airflow/dags so the DAG can import it. The complete workflow DAG is as follows:
```python
# tag cover analysis, based on Airflow v1.7.1.3
from datetime import datetime, timedelta

from airflow.models import DAG
from airflow.operators import BashOperator
from impala.dbapi import connect

from operatorUD.HivePartitionSensor import HivePartitionSensor

conn = connect(host='192.168.72.18', port=10000, auth_mechanism='PLAIN')


def latest_hive_partition(table):
    cursor = conn.cursor()
    cursor.execute("show partitions {}".format(table))
    partitions = [row[0] for row in cursor.fetchall()]
    return partitions[-1].split("=")[1]


log_partition_value = """{{ macros.ds_add(ds, -2) }}"""
tag_partition_value = latest_hive_partition('tag.dmp')

args = {
    'owner': 'jyzheng',
    'depends_on_past': False,
    'start_date': datetime.strptime('2016-12-06', '%Y-%m-%d')
}

# execute every Tuesday
dag = DAG(
    dag_id='tag_cover', default_args=args,
    schedule_interval='@weekly',
    dagrun_timeout=timedelta(minutes=10))

ad_sensor = HivePartitionSensor(
    task_id='ad_sensor',
    conn_host='192.168.72.18',
    conn_port=10000,
    table='ad.ad_log',
    partition="day_time={}".format(log_partition_value),
    dag=dag
)

ad_hive_task = BashOperator(
    task_id='ad_hive_task',
    bash_command='hive -f /path/to/cron/cover/ad_tag.hql --hivevar LOG_PARTITION={} '
                 '--hivevar TAG_PARTITION={}'.format(log_partition_value, tag_partition_value),
    dag=dag
)

ad2_hive_task = BashOperator(
    task_id='ad2_hive_task',
    bash_command='hive -f /path/to/cron/cover/ad2_tag.hql --hivevar LOG_PARTITION={} '
                 '--hivevar TAG_PARTITION={}'.format(log_partition_value, tag_partition_value),
    dag=dag
)

report_task = BashOperator(
    task_id='report_task',
    bash_command='sleep 5m; python3 /path/to/cron/report/tag_cover.py {}'.format(log_partition_value),
    dag=dag
)

ad_sensor >> ad_hive_task >> report_task
ad_sensor >> ad2_hive_task >> report_task
```
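The templated value `{{ macros.ds_add(ds, -2) }}` resolves at run time to the execution date (`ds`, a `YYYY-MM-DD` string) shifted back two days. Its effect can be reproduced with plain datetime arithmetic; the sketch below shows what the macro computes, not Airflow's implementation:

```python
from datetime import datetime, timedelta

def ds_add(ds, days):
    """Shift a YYYY-MM-DD date string by `days` days, as macros.ds_add does."""
    d = datetime.strptime(ds, '%Y-%m-%d') + timedelta(days=days)
    return d.strftime('%Y-%m-%d')

print(ds_add('2016-12-06', -2))  # → 2016-12-04
```

So for the DAG run dated 2016-12-06, `log_partition_value` becomes `2016-12-04`, and the sensor waits for the partition `day_time=2016-12-04`.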
