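Introduction
1. How it works:
Data sources are loaded from the Hive metastore.
User-specified data quality rules are translated into Spark programs, and Spark's computing power is used to run the checks and analyse data quality.
2. Modules
measure:
Computation layer. Uses Spark to evaluate the user-defined data quality rules; written in Scala.
service:
Service layer. Provides the back-end API for the UI, schedules jobs, and submits Spark programs to Livy.
ui:
Presentation layer, built with Angular 2.
Installation
I. Cluster prerequisites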
1.JDK (1.8 or later versions)
2.PostgreSQL(version 10.4) or MySQL(version 8.0.11)
3.Hadoop (2.6.0 or later)
4.Hive (version 2.x), installation guide: https://www.cnblogs.com/caoxb/p/11333741.html
5.Spark (version 2.2.1), installation guide: https://blog.csdn.net/k393393/article/details/92440892
6.Livy, installation guide: https://www.cnblogs.com/students/p/11400940.html
7.Elasticsearch (5.0 or later versions), installation guide: https://blog.csdn.net/fiery_heart/article/details/85265585
8.Scala
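Before continuing, it can help to confirm that the prerequisite tools are actually on the PATH of the machine you are working on. The following is a minimal sketch; the command names passed to the hypothetical helper `missing_tools` are assumptions about a typical installation and may differ on your cluster.

```shell
# Print every command from the argument list that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Assumed command names; adjust them to match your installation.
missing_tools java hadoop hive spark-submit livy-server scala
```

An empty output means every listed command was found; otherwise each missing command is printed on its own line.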
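II. Installing Griffin
1. MySQL:
1) Create a database named quartz in MySQL.
2) Then run the Init_quartz_mysql_innodb.sql script to initialise the tables (mysql prompts for the password; a password written after "-p" with a space would be read as the database name):
mysql -u <username> -p quartz < Init_quartz_mysql_innodb.sql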
2. Hadoop and Hive:
Copy the Hadoop configuration files from the Hadoop server to the Livy server; here they are assumed to be placed under /usr/data/conf.
On the Hadoop server, create the /home/spark_conf directory in HDFS and upload Hive's configuration file hive-site.xml into it:
# create the /home/spark_conf directory
hadoop fs -mkdir -p /home/spark_conf
# upload hive-site.xml
hadoop fs -put hive-site.xml /home/spark_conf/
3. Set environment variables:
#!/bin/bash
export JAVA_HOME=/data/jdk1.8.0_192
# Spark directory
export SPARK_HOME=/usr/data/spark-2.1.1-bin-2.6.3
# Livy command directory
export LIVY_HOME=/usr/data/livy/bin
# Hadoop configuration directory
export HADOOP_CONF_DIR=/usr/data/conf
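A quick way to catch a mistyped path is to check that each variable is set and points to an existing directory. This is only a sketch; `check_dirs` is a hypothetical helper, not part of Griffin or Livy.

```shell
# Print the name of every listed variable that is unset
# or does not point to an existing directory.
check_dirs() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    [ -d "$value" ] || echo "$name"
  done
}

check_dirs JAVA_HOME SPARK_HOME LIVY_HOME HADOOP_CONF_DIR
```

No output means all four variables resolve to directories in the current shell.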
4. Livy configuration:
Update the livy.conf file under livy/conf:
livy.server.host = 127.0.0.1
livy.spark.master = yarn
livy.spark.deployMode = cluster
livy.repl.enable-hive-context = true
Start Livy:
livy-server start
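To confirm that Livy came up, you can query its REST endpoint for batch sessions. The sketch below assumes Livy listens on its default port 8998 on the host set in livy.conf; `livy_batches_url` is a hypothetical helper that only builds the URL.

```shell
# Build the batches endpoint URL for a Livy host (port defaults to 8998).
livy_batches_url() {
  printf 'http://%s:%s/batches\n' "$1" "${2:-8998}"
}

# A JSON response listing batch sessions indicates the server is reachable.
if ! curl -s --max-time 5 "$(livy_batches_url 127.0.0.1)"; then
  echo "Livy is not reachable" >&2
fi
```

The same URL form is what the service configuration expects in its livy.uri setting.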
5. Elasticsearch configuration:
Create the griffin index in ES (the request body is passed with -d):
curl -H "Content-Type: application/json" -XPUT "http://es:9200/griffin?include_type_name=true" -d '
{
  "aliases": {},
  "mappings": {
    "accuracy": {
      "properties": {
        "name": {
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          },
          "type": "text"
        },
        "tmst": {
          "type": "date"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_replicas": "2",
      "number_of_shards": "5"
    }
  }
}'
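Building and deploying from source
Here I deploy Griffin by compiling and packaging it from source. The Griffin source is at https://github.com/apache/griffin.git, and the tag I use is griffin-0.4.0.
The source layout is clear. It has four main modules: griffin-doc, measure, service and ui. griffin-doc holds the Griffin documentation; measure interacts with Spark and runs the measurement jobs; service is a Spring Boot service that provides the RESTful API the ui module needs, stores measurement jobs, and serves the results.
After importing and building the source, the following configuration files need to be modified:
1. service/src/main/resources/application.properties:
# Apache Griffin application name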
spring.application.name=griffin_service
# MySQL database settings (the database name in the URL must be the one initialised with the quartz script)
spring.datasource.url=jdbc:mysql://10.xxx.xx.xxx:3306/griffin_quartz?useSSL=false
spring.datasource.username=xxxxx
spring.datasource.password=xxxxx
spring.jpa.generate-ddl=true
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
spring.jpa.show-sql=true
# Hive metastore settings
hive.metastore.uris=thrift://namenode.test01.xxx:9083
hive.metastore.dbname=default
hive.hmshandler.retry.attempts=15
hive.hmshandler.retry.interval=2000ms
# Hive cache time
cache.evict.hive.fixedRate.in.milliseconds=900000
# Kafka schema registry, configure as needed
kafka.schema.registry.url=http://namenode.test01.xxx:8081
# Update job instance state at regular intervals
jobInstance.fixedDelay.in.milliseconds=60000
# Expired time of job instance which is 7 days that is 604800000 milliseconds.Time unit only supports milliseconds
jobInstance.expired.milliseconds=604800000
# schedule predicate job every 5 minutes and repeat 12 times at most
# interval time unit s:second m:minute h:hour d:day, only support these four units
predicate.job.interval=5m
predicate.job.repeat.count=12
# external properties directory location
external.config.location=
# external BATCH or STREAMING env
external.env.location=
# login strategy ("default" or "ldap")
login.strategy=default
# ldap, configure when the login strategy is ldap
ldap.url=ldap://hostname:port
ldap.email=@example.com
ldap.searchBase=DC=org,DC=example
ldap.searchPattern=(sAMAccountName={0})
# hdfs default name
fs.defaultFS=
# elasticsearch settings
elasticsearch.host=griffindq02-test1-rgtj1-tj1
elasticsearch.port=9200
elasticsearch.scheme=http
# elasticsearch.user = user
# elasticsearch.password = password
# livy settings
livy.uri=http://10.104.xxx.xxx:8998/batches
# yarn url settings
yarn.uri=http://10.104.xxx.xxx:8088
# griffin event listener
internal.event.listeners=GriffinJobEventHook
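Several keys in this listing are left empty (for example external.config.location, external.env.location and fs.defaultFS). An empty value is not necessarily an error, but listing them makes it easy to review which settings were left unset. The sketch below uses a hypothetical helper `empty_keys`; the file path in the usage is the one named in this step.

```shell
# Print the keys in a .properties file whose value is empty, ignoring comments.
empty_keys() {
  grep -E '^[^#[:space:]][^=]*=[[:space:]]*$' "$1" | cut -d= -f1
}

props=service/src/main/resources/application.properties
if [ -f "$props" ]; then
  empty_keys "$props"
fi
```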
2. service/src/main/resources/quartz.properties:
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#
org.quartz.scheduler.instanceName=spring-boot-quartz
org.quartz.scheduler.instanceId=AUTO
org.quartz.threadPool.threadCount=5
org.quartz.jobStore.class=org.quartz.impl.jdbcjobstore.JobStoreTX
# If you use postgresql as your database, set this property value to org.quartz.impl.jdbcjobstore.PostgreSQLDelegate
# If you use mysql as your database, set this property value to org.quartz.impl.jdbcjobstore.StdJDBCDelegate
# If you use h2 as your database, it's ok to set this property value to StdJDBCDelegate, PostgreSQLDelegate or others
org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.StdJDBCDelegate
org.quartz.jobStore.useProperties=true
org.quartz.jobStore.misfireThreshold=60000
org.quartz.jobStore.tablePrefix=QRTZ_
org.quartz.jobStore.isClustered=true
org.quartz.jobStore.clusterCheckinInterval=20000
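The comments in this file tie the driverDelegateClass value to the database in use. As a small sketch of that rule, the hypothetical helper `quartz_delegate` below maps a database name to the delegate class named in those comments; the database labels ("mysql", "postgresql", "h2") are my own shorthand.

```shell
# Map a database name to the Quartz delegate class named in the file's comments.
quartz_delegate() {
  case "$1" in
    postgresql) echo "org.quartz.impl.jdbcjobstore.PostgreSQLDelegate" ;;
    mysql|h2)   echo "org.quartz.impl.jdbcjobstore.StdJDBCDelegate" ;;
    *)          echo "unsupported database: $1" >&2; return 1 ;;
  esac
}

echo "org.quartz.jobStore.driverDelegateClass=$(quartz_delegate mysql)"
```

Since this guide uses MySQL, the listing keeps StdJDBCDelegate.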
3. service/src/main/resources/sparkProperties.json:
{
"file": "hdfs:///griffin/griffin-measure.jar",
"className": "org.apache.griffin.measure.Application",
"name": "griffin",
"queue": "default",
"numExecutors": 2,
"executorCores": 1,
"driverMemory": "1g",
"executorMemory": "1g",
"conf": {
"spark.yarn.dist.files": "hdfs:///home/spark_conf/hive-site.xml"
},
"files": [
]
}
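The "file" field must point at a jar that really exists in HDFS when jobs are submitted, otherwise the measure application class cannot be loaded. A rough way to check this is sketched below; `spark_jar_path` and `jar_in_hdfs` are hypothetical helpers, the sed-based extraction assumes the field sits on one line as in the listing above, and `jar_in_hdfs` needs the hadoop CLI on PATH.

```shell
# Extract the value of the "file" field from a sparkProperties.json-style file.
spark_jar_path() {
  sed -n 's/.*"file"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Succeeds only if the jar referenced by the file exists in HDFS.
jar_in_hdfs() {
  hadoop fs -test -e "$(spark_jar_path "$1")"
}
```

For example, `jar_in_hdfs service/src/main/resources/sparkProperties.json && echo "jar found"` can be run once the jar has been uploaded.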
4. service/src/main/resources/env/env_batch.json:
{
"spark": {
"log.level": "INFO"
},
"sinks": [
{
"type": "CONSOLE",
"config": {
"max.log.lines": 10
}
},
{
"type": "HDFS",
"config": {
"path": "hdfs://namenodetest01.xx.xxxx.com:9001/griffin/persist",
"max.persist.lines": 10000,
"max.lines.per.file": 10000
}
},
{
"type": "ELASTICSEARCH",
"config": {
"method": "post",
"api": "http://10.xxx.xxx.xxx:9200/griffin/accuracy",
"connection.timeout": "1m",
"retry": 10
}
}
],
"griffin.checkpoint": []
}
Once the configuration files are modified, run the following Maven command in the IDEA terminal to compile and package:
mvn -Dmaven.test.skip=true clean install
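When the command finishes, service-0.4.0.jar and measure-0.4.0.jar appear under the target directories of the service and measure modules respectively. Copy both jars to the server. They are used as follows:
1. Rename the jars and upload the measure jar to the /griffin directory in HDFS:
# rename the jars
mv measure-0.4.0.jar griffin-measure.jar
mv service-0.4.0.jar griffin-service.jar
# upload griffin-measure.jar to HDFS
hadoop fs -put griffin-measure.jar /griffin/
This is needed because when Spark runs the job on the YARN cluster it loads griffin-measure.jar from the /griffin directory in HDFS; uploading it there avoids a class-not-found error for org.apache.griffin.measure.Application.
2. Run griffin-service.jar (the renamed service-0.4.0.jar) to start the Griffin admin back end:
nohup java -jar griffin-service.jar > service.out 2>&1 &
After a few seconds, the default Apache Griffin UI is reachable (Spring Boot listens on port 8080 by default):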
http://IP:8080
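Because the service takes a few seconds to start, a small polling loop can tell you when the UI is ready instead of refreshing the browser by hand. This is a sketch only; `wait_for_http` is a hypothetical helper and IP remains a placeholder for your server address.

```shell
# Poll a URL until it answers with an HTTP success status
# or the attempts run out. Arguments: url [attempts] [delay_seconds]
wait_for_http() {
  url=$1
  attempts=${2:-10}
  delay=${3:-3}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    curl -sf --max-time 5 -o /dev/null "$url" && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage (replace IP with the server address):
# wait_for_http "http://IP:8080" 20 3 && echo "Griffin UI is up"
```

If the loop gives up, service.out is the first place to look for startup errors.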
Computing on Kafka source data with Apache Griffin
http://griffin.apache.org/docs/usecases.html
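Real-time (streaming) data checks currently cannot be configured in the UI; streaming data monitoring can be submitted through the API instead.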