When we talk about data governance and metadata management, what are we actually talking about?
Any discussion of data governance inevitably involves metadata. Metadata, defined in one sentence, is data that describes data. Metadata links data sources, data warehouses, and data applications, recording the entire journey of data from production to consumption. Metadata management is therefore the core of data governance.
The real value of data lies in data-driven decision making and using data to guide operations. Data-driven methods help us spot trends, uncover problems, and then drive innovation or new solutions. As enterprise data grows explosively and its volume becomes ever harder to gauge, it becomes difficult to say exactly what data we own, where it came from, where it goes, how it has changed, and how it should be used. Metadata management (data governance) has therefore become an indispensable part of the enterprise data lake.
Unfortunately, for a long time there was no mature data governance solution on the market. In 2015 Hortonworks finally decided to act and rallied a group of partner companies to build one together. The result was Atlas, which brought data classification, a centralized policy engine, data lineage, security, and lifecycle management.
Atlas is a scalable and extensible set of core foundational governance services that enables enterprises to effectively and efficiently meet their compliance requirements within Hadoop, and it allows integration with the whole enterprise data ecosystem.
Apache Atlas provides organizations with open metadata management and governance capabilities to build a catalog of their data assets, classify and govern those assets, and offer collaboration capabilities around them for data scientists, data analysts, and the data governance team.
Core Concepts
Type
A Type is the definition of a kind of metadata: a table, a column, a view, a materialized view, and so on. Types can be more specific, such as a Hive table (hive_table) or an HBase table (hbase_table), and can even describe a data operation, for example a scheduled job that syncs one table into another. Atlas ships with many built-in types, and you can define your own through the REST API.
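As a minimal sketch (the type name my_etl_job and its schedule attribute are made up for illustration, and admin/admin are the default credentials of the Docker image described below), registering a custom type through the REST API could look like this:

curl -u admin:admin -H 'Content-Type: application/json' \
  -X POST http://localhost:21000/api/atlas/v2/types/typedefs \
  -d '{
        "entityDefs": [{
          "name": "my_etl_job",
          "superTypes": ["Process"],
          "attributeDefs": [
            {"name": "schedule", "typeName": "string", "isOptional": true,
             "cardinality": "SINGLE", "isUnique": false, "isIndexable": false}
          ]
        }]
      }'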
Classification
A classification is, informally, a tag attached to metadata. Classifications can propagate: if the view user_view is derived from the table user, and user is tagged HR, then user_view automatically receives the HR tag as well, which makes data much easier to trace.
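A rough sketch of this over the REST API (the HR tag matches the example above; <entity-guid> is a placeholder for a real entity GUID):

# define a classification (tag) named HR
curl -u admin:admin -H 'Content-Type: application/json' \
  -X POST http://localhost:21000/api/atlas/v2/types/typedefs \
  -d '{"classificationDefs": [{"name": "HR", "attributeDefs": []}]}'

# attach it to an entity, with propagation enabled
curl -u admin:admin -H 'Content-Type: application/json' \
  -X POST http://localhost:21000/api/atlas/v2/entity/guid/<entity-guid>/classifications \
  -d '[{"typeName": "HR", "propagate": true}]'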
Glossary
The glossary involves two concepts, Category and Term. A Category is a collection of Terms, and a Term gives metadata a business-friendly alias so that users can understand the data more easily. For example, a table about pigs might have a column named "pig kidney", while most people are used to calling it "pork kidney"; attaching a Term with that name to the column makes it both easier to understand and easier to find through search.
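A sketch with the REST API (the glossary and term names are invented, and <glossary-guid> stands for the GUID returned when the glossary is created):

# create a glossary
curl -u admin:admin -H 'Content-Type: application/json' \
  -X POST http://localhost:21000/api/atlas/v2/glossary \
  -d '{"name": "Butchery", "shortDescription": "Business terms for meat products"}'

# create a term in that glossary
curl -u admin:admin -H 'Content-Type: application/json' \
  -X POST http://localhost:21000/api/atlas/v2/glossary/term \
  -d '{"name": "pork kidney", "anchor": {"glossaryGuid": "<glossary-guid>"}}'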
Entity
An Entity is a concrete piece of metadata, that is, an instance of some Type. The objects Atlas manages are Entities of the various Types.
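For example, registering an hdfs_path entity could look roughly like this (the path and qualifiedName values are invented for illustration):

curl -u admin:admin -H 'Content-Type: application/json' \
  -X POST http://localhost:21000/api/atlas/v2/entity \
  -d '{
        "entity": {
          "typeName": "hdfs_path",
          "attributes": {
            "qualifiedName": "hdfs://namenode:8020/data/users@cluster1",
            "name": "users",
            "path": "hdfs://namenode:8020/data/users"
          }
        }
      }'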
Lineage
Lineage describes how data flows between entities. Through lineage we can see clearly where a piece of data came from, where it ends up, and which operations it passed through along the way.
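Lineage can be browsed in the Web UI or fetched over REST; as a sketch (with <entity-guid> as a placeholder), the following returns up to three hops of upstream and downstream lineage for an entity:

curl -u admin:admin \
  'http://localhost:21000/api/atlas/v2/lineage/<entity-guid>?direction=BOTH&depth=3'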
Basic Usage
This Apache Atlas image is built from the 2.1.0-release source tarball and patched to run in a Docker container.
Atlas is built with embedded HBase + Solr, and it is pre-initialized, so you can use it right after downloading the image without additional steps.
If you want to use external Atlas backends, set them up according to the documentation.
For the Chinese-localized reference documentation, see: Apache Atlas v1.1.
- Pull the latest release image:
docker pull sburn/apache-atlas
- Start Apache Atlas in a container exposing Web-UI port 21000:
docker run -d -p 21000:21000 --name atlas sburn/apache-atlas /opt/apache-atlas-2.1.0/bin/atlas_start.py
Please take into account that, depending on host machine performance, the first startup of Atlas may take a few minutes before the web interface becomes available at http://localhost:21000/
Web-UI default credentials: admin / admin
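To verify that the server is actually answering (not just that the container is running), a quick REST call with the default credentials should return the Atlas version:

curl -u admin:admin http://localhost:21000/api/atlas/admin/version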
Usage options
Gracefully stop Atlas:
docker exec -ti atlas /opt/apache-atlas-2.1.0/bin/atlas_stop.py
Check Atlas startup script output:
docker logs atlas
Interactively follow the Atlas application.log (useful on the first run and for debugging under workload):
docker exec -ti atlas tail -f /opt/apache-atlas-2.1.0/logs/application.log
Run the example (this will add sample types and instances along with traits):
docker exec -ti atlas /opt/apache-atlas-2.1.0/bin/quick_start.py
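Assuming the sample entities created by quick_start.py (such as the sales_fact table) are in place, you can confirm they landed with a basic search, for example:

curl -u admin:admin 'http://localhost:21000/api/atlas/v2/search/basic?query=sales_fact'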
Start Atlas overriding settings via environment variables (for example, to support a large number of metadata objects):
docker run --detach \
-e "ATLAS_SERVER_OPTS=-server -XX:SoftRefLRUPolicyMSPerMB=0 \
-XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC \
-XX:+CMSParallelRemarkEnabled -XX:+PrintTenuringDistribution \
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=dumps/atlas_server.hprof \
-Xloggc:logs/gc-worker.log -verbose:gc -XX:+UseGCLogFileRotation \
-XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1m -XX:+PrintGCDetails \
-XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps" \
-p 21000:21000 \
--name atlas \
sburn/apache-atlas \
/opt/apache-atlas-2.1.0/bin/atlas_start.py
Start Atlas exposing the logs directory on the host so you can view the logs directly:
docker run --detach \
-v ${PWD}/atlas-logs:/opt/apache-atlas-2.1.0/logs \
-p 21000:21000 \
--name atlas \
sburn/apache-atlas \
/opt/apache-atlas-2.1.0/bin/atlas_start.py
Start Atlas exposing the conf directory on the host so you can place and edit configuration files directly:
docker run --detach \
-v ${PWD}/pre-conf:/opt/apache-atlas-2.1.0/conf \
-p 21000:21000 \
--name atlas \
sburn/apache-atlas \
/opt/apache-atlas-2.1.0/bin/atlas_start.py
Start Atlas with the data directory mounted on the host to make the data persistent:
docker run --detach \
-v ${PWD}/data:/opt/apache-atlas-2.1.0/data \
-p 21000:21000 \
--name atlas \
sburn/apache-atlas \
/opt/apache-atlas-2.1.0/bin/atlas_start.py
Tinkerpop Gremlin support
The image contains built-in extras for those who want to explore the JanusGraph and Atlas artifacts using the Apache TinkerPop Gremlin Console (gremlin CLI).
- You need the Atlas container up and running as shown above.
- Install gremlin-server and gremlin-console into the container by running the included automation script:
docker exec -ti atlas /opt/gremlin/install-gremlin.sh
- Start gremlin-server in the same container:
docker exec -d atlas /opt/gremlin/start-gremlin-server.sh
- Finally, run gremlin-console interactively:
docker exec -ti atlas /opt/gremlin/run-gremlin-console.sh
Gremlin-console usage example:
\,,,/
(o o)
-----oOOo-(3)-oOOo-----
gremlin> :remote connect tinkerpop.server conf/remote.yaml session
==>Configured localhost/127.0.0.1:8182-[d1b2d9de-da1f-471f-be14-34d8ea769ae8]
gremlin> :remote console
==>All scripts will now be sent to Gremlin Server - [localhost/127.0.0.1:8182]-[d1b2d9de-da1f-471f-be14-34d8ea769ae8] - type ':remote console' to return to local mode
gremlin> g = graph.traversal()
==>graphtraversalsource[standardjanusgraph[hbase:[localhost]], standard]
gremlin> g.V().has('__typeName','hdfs_path').count()
Environment Variables
The following environment variables are available for configuration:
Name | Default | Description |
---|---|---|
JAVA_HOME | /usr/lib/jvm/java-8-openjdk-amd64 | The Java implementation to use. If JAVA_HOME is not found, java and jar are expected to be on the PATH. |
ATLAS_OPTS | | Any additional Java opts you want to set. These apply to both client and server operations. |
ATLAS_CLIENT_OPTS | | Any additional Java opts you want to set for the client only. |
ATLAS_CLIENT_HEAP | | Java heap size for the client. Default is 1024MB. |
ATLAS_SERVER_OPTS | | Any additional opts you want to set for the Atlas server. |
ATLAS_SERVER_HEAP | | Java heap size for the Atlas server. Default is 1024MB. |
ATLAS_HOME_DIR | | What is considered the Atlas home dir. Default is the base location of the installed software. |
ATLAS_LOG_DIR | | Where log files are stored. Default is the logs directory under the base install location. |
ATLAS_PID_DIR | | Where pid files are stored. Default is the logs directory under the base install location. |
ATLAS_EXPANDED_WEBAPP_DIR | | Where to expand the war file. By default it is the /server/webapp dir under the base install dir. |
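For example, to give the server a larger heap than the 1024MB default, you can pass ATLAS_SERVER_HEAP to docker run (the sizes below are only illustrative values):

docker run --detach \
-e "ATLAS_SERVER_HEAP=-Xms2048m -Xmx4096m" \
-p 21000:21000 \
--name atlas \
sburn/apache-atlas \
/opt/apache-atlas-2.1.0/bin/atlas_start.py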
Bug Tracker
Bugs are tracked on GitHub Issues.
In case of trouble, please check there to see if your issue has already been reported.
If you spotted it first, help us squash it by providing detailed feedback.