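Installing MySQL
For installing MySQL itself, see my earlier blog post: "Installing MySQL 5.7 on Linux and a summary of the problems encountered".
Once MySQL is installed, create the airflow database and user, and grant the required privileges: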
CREATE DATABASE airflow CHARACTER SET utf8;
CREATE USER 'airflow'@'%' IDENTIFIED BY 'yourpassword';
GRANT ALL PRIVILEGES ON *.* TO 'airflow'@'%' IDENTIFIED BY 'yourpassword' WITH GRANT OPTION;
SET GLOBAL explicit_defaults_for_timestamp = 1;
FLUSH PRIVILEGES;
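Installing Python 3.7.5 (important)
This part must be carried out on every node where Airflow will be installed.
The official Airflow documentation installs Airflow on Python 3, but CentOS 7 machines default to Python 2, and installing Airflow under Python 2 runs into all kinds of problems.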
Install the build tools
yum -y groupinstall "Development tools"
yum -y install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel
yum install libffi-devel -y
Download and compile Python 3.7
wget https://www.python.org/ftp/python/3.7.5/Python-3.7.5.tar.xz
tar -xvJf Python-3.7.5.tar.xz
mkdir /usr/python3.7
cd Python-3.7.5
./configure --prefix=/usr/python3.7
make && make install
Create the symlinks
ln -s /usr/python3.7/bin/python3 /usr/bin/python3.7
ln -s /usr/python3.7/bin/pip3 /usr/bin/pip3.7
Verify the installation
python3.7 -V
pip3.7 -V
If both commands print a version number, the setup succeeded.
Because yum needs Python 2 to run, we also have to adjust yum's scripts:
vim /usr/bin/yum
# change "#! /usr/bin/python" to "#! /usr/bin/python2"
vim /usr/libexec/urlgrabber-ext-down
# likewise change "#! /usr/bin/python" to "#! /usr/bin/python2"
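The two manual edits above can also be scripted. The following is only a minimal sketch: the helper name `pin_shebang` is hypothetical, and it assumes the interpreter path appears on the first line of each script. Back up the files before trying it.

```python
def pin_shebang(path, old="/usr/bin/python", new="/usr/bin/python2"):
    """Rewrite the first-line interpreter from `old` to `new`.

    Returns True if the file was changed, False if it was left alone.
    """
    with open(path) as fh:
        lines = fh.readlines()
    if not lines or not lines[0].startswith("#!"):
        return False  # no shebang line, nothing to do
    # Split "#! /usr/bin/python -x" into the interpreter and any arguments.
    parts = lines[0][2:].strip().split()
    if not parts or parts[0] != old:
        return False  # already pinned, or a different interpreter
    parts[0] = new
    lines[0] = "#!" + " ".join(parts) + "\n"
    with open(path, "w") as fh:
        fh.writelines(lines)
    return True
```

Run as root, it would be called once per file, for example `pin_shebang("/usr/bin/yum")` and `pin_shebang("/usr/libexec/urlgrabber-ext-down")`. A second call on the same file returns False, so the helper is safe to repeat.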
Make sure the required packages are installed (important)
# An outdated pip will cause the Airflow installation to fail
pip3.7 install --upgrade pip
sudo pip3.7 install pymysql
sudo pip3.7 install celery
sudo pip3.7 install flower
sudo pip3.7 install psycopg2-binary
2. Installing Airflow (important)
Note: sections 2.1, 2.2, and 2.3 must be carried out on every node where Airflow is installed.
2.1 Configure sudo privileges for the airflow user
This guide runs everything as the airflow user, so grant that user sudo privileges:
# Run the following as root
useradd airflow
vi /etc/sudoers
## Allow root to run any commands anywhere
root    ALL=(ALL)       ALL
airflow ALL=(ALL)       NOPASSWD: ALL   # add this line
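2.2 Set the Airflow environment variables
After installation, the Airflow executables are placed in /home/airflow/.local/bin by default.
# Run as root
vi /etc/profile
# add the following line to /etc/profile
export PATH=$PATH:/usr/python3.7/bin:/home/airflow/.local/bin
source /etc/profile
Here /home/airflow/.local/bin is the airflow user's ~/.local/bin; adjust PATH=$PATH:~/.local/bin to match your actual setup.
# Set this environment variable as the airflow user (optional; the default is ~/airflow)
export AIRFLOW_HOME=~/airflow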
2.3 Install Airflow
su airflow   # run as root to switch to the airflow user
# Run the following as the airflow user
AIRFLOW_VERSION=2.1.1
PYTHON_VERSION="$(python3.7 --version | cut -d " " -f 2 | cut -d "." -f 1-2)"
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-no-providers-${PYTHON_VERSION}.txt"
# sudo is required here; without it some components are silently left out, with no error reported.
# Also include the mysql, celery, and cncf.kubernetes extras, otherwise Airflow fails at startup later.
sudo pip3.7 install "apache-airflow[mysql,celery,cncf.kubernetes]==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}" -i https://pypi.rasa.com/simple --use-deprecated=legacy-resolver
If Airflow installed correctly, the airflow command is now available, and the Airflow directory contains the following files:
airflow.cfg
webserver_config.py
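The constraint URL used by the install command in section 2.3 is assembled from two values: the Airflow version and the major.minor version of the Python interpreter. As a small illustration of that string assembly (the function name `constraint_url` is hypothetical, and the URL pattern is simply the one from the shell snippet), the same value can be computed in Python:

```python
import sys


def constraint_url(airflow_version, python_version=None):
    """Build the constraints-file URL that is passed to pip's --constraint option."""
    if python_version is None:
        # Mirror the shell pipeline: keep only "major.minor" of this interpreter.
        python_version = "{}.{}".format(sys.version_info.major, sys.version_info.minor)
    return (
        "https://raw.githubusercontent.com/apache/airflow/"
        "constraints-{}/constraints-no-providers-{}.txt"
    ).format(airflow_version, python_version)


print(constraint_url("2.1.1", "3.7"))
```

Printing the URL before installing makes it easy to open it in a browser and confirm that a constraints file really exists for the chosen version pair.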
2.4 Configure Airflow
The Airflow high-availability architecture is as follows:
Edit the {AIRFLOW_HOME}/airflow.cfg file:
# Add or modify the following settings in {AIRFLOW_HOME}/airflow.cfg
# 1. Change the executor
# executor = LocalExecutor
executor = CeleryExecutor
# 2. Change the metadata database (metastore) settings
# sql_alchemy_conn = sqlite:home/apps/airflow/airflow.db
sql_alchemy_conn = mysql+pymysql://airflow:yourpassword@hostname:3306/airflow
# 3. Set the message queue broker; RabbitMQ is used here
# broker_url = redis://redis:6379/0
broker_url = amqp://admin:yourpassword@hostname:5672/
# 4. Set the result backend
# result_backend = db+postgresql://postgres:airflow@postgres/airflow
result_backend = db+mysql://airflow:yourpassword@hostname:3306/airflow
# 5. Change the time zone
# default_timezone = utc
default_timezone = Asia/Shanghai
default_ui_timezone = Asia/Shanghai
# 6. Set the web port (default 8080; changed to 8081 because Ambari already uses 8080)
endpoint_url = http://localhost:8081
base_url = http://localhost:8081
web_server_port = 8081
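The sql_alchemy_conn value follows the pattern driver://user:password@host:port/database. A password containing characters such as @ or / would be misread as part of the URL structure, so it is usually percent-encoded first. The helper below is a sketch under that assumption; the name `mysql_conn_url` and its defaults are illustrative only.

```python
from urllib.parse import quote_plus


def mysql_conn_url(user, password, host, port=3306, database="airflow", driver="pymysql"):
    """Return a SQLAlchemy-style URL such as mysql+pymysql://user:pw@host:3306/db."""
    return "mysql+{}://{}:{}@{}:{}/{}".format(
        driver,
        quote_plus(user),      # encode reserved characters in the user name
        quote_plus(password),  # encode reserved characters in the password
        host,
        port,
        database,
    )


print(mysql_conn_url("airflow", "yourpassword", "hostname"))
print(mysql_conn_url("airflow", "p@ss/word", "hostname"))
```

The first call reproduces the value used in the configuration above; the second shows the encoded form of a password with reserved characters. Whether a literal % then needs further escaping inside airflow.cfg depends on how the file is parsed, so verify that against your Airflow version before relying on it.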
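The modified {AIRFLOW_HOME}/airflow.cfg must be synced to every server where Airflow is installed.
Also create the directories named by the dags_folder and base_log_folder settings, so that DAG runs do not fail later.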
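Creating those directories can be automated by reading the two settings back out of airflow.cfg. This is a rough sketch rather than an official Airflow facility: the function name `ensure_airflow_dirs` is hypothetical, and it assumes the file stores plain absolute (or ~-prefixed) paths for both options.

```python
import configparser
import os


def ensure_airflow_dirs(cfg_path):
    """Create dags_folder and base_log_folder if airflow.cfg defines them.

    Returns the list of directories that now exist.
    """
    # interpolation=None so that "%" characters in other options are left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(cfg_path)
    ensured = []
    for option in ("dags_folder", "base_log_folder"):
        # Search every section, since the section that holds each option
        # may differ between Airflow versions.
        for section in parser.sections():
            if parser.has_option(section, option):
                path = os.path.expanduser(parser.get(section, option))
                os.makedirs(path, exist_ok=True)
                ensured.append(path)
                break
    return ensured
```

On each node it could be called as `ensure_airflow_dirs(os.path.expanduser("~/airflow/airflow.cfg"))`, assuming AIRFLOW_HOME was left at its default of ~/airflow.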
2.5 Start the Airflow cluster
Initialize the database
airflow db init
If the Airflow tables now appear in the MySQL airflow database, initialization succeeded.
Create a user:
airflow users create \
    --username admin \
    --firstname Lixiaolong \
    --lastname Bigdata \
    --role Admin \
    --email spiderman@superhero.org
Set the password when the console prompts for it.
Here the password is set to: yourpassword
Start the webserver:
airflow webserver -D
Start the scheduler
nohup airflow scheduler &
Start the workers
# Start flower first, on each server that will run a worker
airflow celery flower -D
# Start the worker, on each server that will run a worker
airflow celery worker -D
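2.6 Log in to the web UI
Web UI: http://master1:8081/
Username: admin
Password: the password set in section 2.5
The interface looks like the figure below:
Worker information can be viewed at http://hostip:5555, as shown in the figure below: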
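2.7 Configure jobs with Airflow
Airflow ships with 32 example DAGs ready to use. Select a DAG in the web UI and click its toggle to switch it to the Active state.
Next, taking the commonly used Hive Operator as an example, we show how to write and run a custom DAG.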
Installing dependencies
The Hive Operator requires the Hive dependencies to be installed first.
If you run into an error like the following:
ModuleNotFoundError: No module named 'airflow.providers.apache'
you need to install the Hive dependencies manually with the following commands:
su airflow
pip3.7 install "apache-airflow[apache.hive]"
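Writing the DAG
DAG directory: see the dags_folder setting in airflow.cfg.
Place the finished Python file in that directory. For example:
The following example queries a Hive table once every minute; the DAG is named test_hive2.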
from datetime import timedelta

from airflow import DAG
from airflow.providers.apache.hive.operators.hive import HiveOperator
from airflow.utils.dates import days_ago

# Defaults applied to every task in the DAG
default_args = {
    'owner': 'airflow',
    'depends_on_past': True,
    'start_date': days_ago(1),
    'retries': 10,
    'retry_delay': timedelta(seconds=5),
}

# Run every minute; catchup=False skips runs for past intervals
dag = DAG('test_hive2',
          default_args=default_args,
          schedule_interval='*/1 * * * *',
          catchup=False)

# Single task: run a Hive query
t1 = HiveOperator(
    task_id='hive_task',
    hql='select * from test.data_demo',
    dag=dag)
If the DAG file is well-formed, the newly added DAG appears in the web UI after a refresh.
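A DAG file that never shows up in the web UI often just fails to parse. Before copying a file into dags_folder, a quick local check can catch that. The sketch below is not part of Airflow: `check_dag_source` is a hypothetical helper that only checks Python syntax, plus a rough keyword heuristic modelled loosely on the filter that Airflow's DAG discovery is documented to apply. Treat that second check as an assumption.

```python
import ast


def check_dag_source(source, filename="<dag>"):
    """Return a list of problems found in DAG source text (empty list = none found)."""
    problems = []
    try:
        # Parse only; nothing in the file is executed or imported.
        ast.parse(source, filename=filename)
    except SyntaxError as exc:
        problems.append("syntax error on line {}: {}".format(exc.lineno, exc.msg))
    lowered = source.lower()
    for keyword in ("airflow", "dag"):
        # Heuristic (assumption): files that mention neither word are
        # unlikely to be picked up as DAG definitions.
        if keyword not in lowered:
            problems.append("file never mentions '{}'".format(keyword))
    return problems


print(check_dag_source("from airflow import DAG\ndag = DAG('test_hive2')\n"))
```

Because the source is parsed rather than imported, the check runs on a machine without Airflow installed. It cannot detect missing provider packages or a wrong connection id, and those still surface only in the scheduler or the web UI.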
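Configuring the Connection
As shown in the figure below, click Admin -> Connections in the UI to configure the connection.
Hive uses the hive_cli_default connection by default.
Pay attention to the settings highlighted in the figure below:
Conn Type: choose Hive Client Wrapper (this is the default if the Hive dependencies are installed)
Host: set to the node where Hive is installed
Login: set to a user that has permission to run Hive jobs
Once configured, save the connection.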
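Scheduling tasks
The figure below shows how to enable scheduling and how to trigger a run manually.
After a task has been triggered, click the middle part of the task bar to see the details of the run.
For example, opening a task shows its run details and its run log.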
3. Problems encountered
3.1 Error while downloading Python modules
Collecting flask-appbuilder<2.0.0,>=1.12.2; python_version < "3.6"
  Using cached Flask-AppBuilder-1.13.1.tar.gz (1.5 MB)
    ERROR: Command errored out with exit status 1:
     command: /bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-EFxJZq/flask-appbuilder/setup.py'"'"'; __file__='"'"'/tmp/pip-install-EFxJZq/flask-appbuilder/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-StYjJL
         cwd: /tmp/pip-install-EFxJZq/flask-appbuilder/
    Complete output (3 lines):
    /usr/lib64/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'long_description_content_type'
      warnings.warn(msg)
    error in Flask-AppBuilder setup command: 'install_requires' must be a string or list of strings containing valid project/version requirement specifiers
Solution:
Upgrading setuptools to the latest version fixes this:
pip install setuptools -U
3.2 Airflow commands fail with error: sqlite C library version too old (< {min_sqlite_version}).
The full error is:
Traceback (most recent call last):
  File "/usr/python3.7/bin/airflow", line 5, in <module>
    from airflow.__main__ import main
  File "/usr/python3.7/lib/python3.7/site-packages/airflow/__init__.py", line 34, in <module>
    from airflow import settings
  File "/usr/python3.7/lib/python3.7/site-packages/airflow/settings.py", line 35, in <module>
    from airflow.configuration import AIRFLOW_HOME, WEBSERVER_CONFIG, conf  # NOQA F401
  File "/usr/python3.7/lib/python3.7/site-packages/airflow/configuration.py", line 1114, in <module>
    conf.validate()
  File "/usr/python3.7/lib/python3.7/site-packages/airflow/configuration.py", line 202, in validate
    self._validate_config_dependencies()
  File "/usr/python3.7/lib/python3.7/site-packages/airflow/configuration.py", line 243, in _validate_config_dependencies
    f"error: sqlite C library version too old (< {min_sqlite_version}). "
airflow.exceptions.AirflowConfigException: error: sqlite C library version too old (< 3.15.0). See https://airflow.apache.org/docs/apache-airflow/2.1.1/howto/set-up-database.rst#setting-up-a-sqlite-database
Cause: Airflow uses SQLite as its metastore by default, but we are using MySQL and do not actually need SQLite.
Solution: edit {AIRFLOW_HOME}/airflow.cfg and change the metadata database setting sql_alchemy_conn to:
sql_alchemy_conn = mysql+pymysql://airflow:yourpassword@hostname:3306/airflow
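To see which SQLite library the Python interpreter is actually linked against, the standard sqlite3 module exposes the version string. The snippet below is a small diagnostic sketch; the 3.15.0 threshold is copied from the error message above, and the function names are illustrative.

```python
import sqlite3

MIN_SQLITE_VERSION = (3, 15, 0)  # threshold quoted in the error message


def parse_version(text):
    """Turn a dotted version string such as '3.7.17' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def sqlite_is_new_enough(version=None, minimum=MIN_SQLITE_VERSION):
    """Compare a SQLite version (default: the linked library) with the minimum."""
    if version is None:
        version = sqlite3.sqlite_version  # version of the C library, not the module
    return parse_version(version) >= minimum


print(sqlite3.sqlite_version, sqlite_is_new_enough())
```

Run it with python3.7 on the affected node. A result of False confirms that the system SQLite is older than the threshold, which only matters while sql_alchemy_conn still points at SQLite.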
3.3 airflow db init fails: Global variable explicit_defaults_for_timestamp needs to be on (1) for mysql
  File "/usr/python3.7/lib/python3.7/site-packages/airflow/migrations/versions/0e2a74e0fc9f_add_time_zone_awareness.py", line 44, in upgrade
    raise Exception("Global variable explicit_defaults_for_timestamp needs to be on (1) for mysql")
Exception: Global variable explicit_defaults_for_timestamp needs to be on (1) for mysql
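Solution:
Connect to the MySQL airflow database and set the global variable explicit_defaults_for_timestamp:
SHOW GLOBAL VARIABLES LIKE '%timestamp%';
SET GLOBAL explicit_defaults_for_timestamp = 1;
Before the change, the query reports explicit_defaults_for_timestamp as OFF; after the change it reports ON.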