Apache Atlas Basic Usage


When we talk about data governance and metadata management, what are we actually discussing?

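  Data governance is inseparable from metadata. Metadata, defined in one sentence, is data that describes data. It connects data sources, data warehouses, and data applications, and records the whole path of data from production to consumption. For that reason, metadata management is the core of data governance.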

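  The real value of data lies in data-driven decisions, with data guiding operations. Data-driven methods help us judge trends and discover problems, which in turn drives innovation or new solutions. As enterprise data grows explosively and its volume becomes ever harder to estimate, it is difficult to say what data we actually hold, where it comes from, where it goes, how it has changed, and how it should be used. Metadata management (data governance) has therefore become an indispensable part of an enterprise data lake.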

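  Unfortunately, for a long time there was no mature data governance solution on the market. It was not until 2015 that Hortonworks finally ran out of patience and rallied a group of partner companies around a proposal: let's build a data governance solution. Atlas was the result, covering data classification, a centralized policy engine, data lineage, security, and lifecycle management.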

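  Atlas is a scalable and extensible set of core foundational governance services. It enables enterprises to meet their compliance requirements within Hadoop effectively and efficiently, and allows integration with the whole enterprise data ecosystem.

  Apache Atlas provides open metadata management and governance capabilities for organizations to build a catalog of their data assets, classify and govern these assets, and provide collaboration capabilities around these data assets for data scientists, data analysts, and the data governance team.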

Basic architecture
Key concepts
Type
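A metadata type definition. A type can be a table, a column, a view, a materialized view, and so on. It can be narrowed further into a Hive table (hive_table), an HBase table (hbase_table), etc. It can even describe a data operation, for example a scheduled job that syncs one table into another. Atlas ships with many types, and custom types can be defined by calling the API.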
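
As a rough illustration of defining a custom type through the API, the sketch below only builds the JSON body; it does not send it. It assumes the Atlas v2 REST layout, where type definitions are posted to /api/atlas/v2/types/typedefs. The type name table_sync_job and its schedule attribute are hypothetical and not part of Atlas.

```python
import json


def build_entity_typedef(name, attribute_names, super_type="DataSet"):
    """Build a typedefs payload that declares one custom entity type.

    Every attribute is declared as an optional single-valued string,
    which keeps the sketch small; real types usually need richer definitions.
    """
    attribute_defs = [
        {
            "name": attr,
            "typeName": "string",
            "isOptional": True,
            "cardinality": "SINGLE",
            "isUnique": False,
            "isIndexable": False,
        }
        for attr in attribute_names
    ]
    return {
        "enumDefs": [],
        "structDefs": [],
        "classificationDefs": [],
        "entityDefs": [
            {
                "name": name,
                "superTypes": [super_type],
                "attributeDefs": attribute_defs,
            }
        ],
    }


# Hypothetical type: a scheduled table-to-table sync modelled as a Process.
payload = build_entity_typedef("table_sync_job", ["schedule"], super_type="Process")
print(json.dumps(payload, indent=2))
```

Deriving the type from Process rather than DataSet matches the idea that a sync job is an operation between datasets, not a dataset itself.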

Classification
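A classification is, put simply, a tag attached to metadata. Classifications can propagate: if the view user_view is generated from the table user, and user is tagged HR, then user_view is automatically tagged HR as well. This makes data easier to trace.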
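
The propagation behaviour can be pictured with a toy model. This is not Atlas code, only a breadth-first walk over a hand-written dependency map. The names user and user_view come from the example above; user_report is a hypothetical extra dataset added to show that the tag travels more than one hop.

```python
from collections import deque


def propagate_classification(downstream_of, tagged):
    """Return every dataset that ends up carrying the tag.

    downstream_of maps a dataset name to the datasets derived from it.
    tagged is the set of datasets the tag was attached to directly.
    """
    result = set(tagged)
    queue = deque(tagged)
    while queue:
        current = queue.popleft()
        for child in downstream_of.get(current, []):
            if child not in result:
                result.add(child)
                queue.append(child)
    return result


downstream_of = {"user": ["user_view"], "user_view": ["user_report"]}
hr_tagged = propagate_classification(downstream_of, {"user"})
print(sorted(hr_tagged))  # ['user', 'user_report', 'user_view']
```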

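Glossary
A glossary involves two concepts, Category and Term. A Category is a collection of Terms. A Term gives metadata an alias so that users understand the data better. For example, suppose a pig table has a column that stores pig kidneys under their formal name, while many people are used to a colloquial name for the same thing. Attaching the colloquial name to that column as a Term makes the column both easier to understand and easier to find in search.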
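
A toy lookup shows why an attached term helps search. It is a plain-Python model and not how Atlas indexes glossaries; the column names and the term text are hypothetical stand-ins for the example above.

```python
def search_columns(columns, terms, query):
    """Return the columns whose own name or any attached term contains query."""
    needle = query.lower()
    hits = []
    for column in columns:
        candidates = [column] + terms.get(column, [])
        if any(needle in candidate.lower() for candidate in candidates):
            hits.append(column)
    return hits


columns = ["pig.kidney", "pig.weight"]
# The colloquial alias is stored as a term on the formally named column.
terms = {"pig.kidney": ["pork kidneys"]}

print(search_columns(columns, terms, "pork"))    # found through the term
print(search_columns(columns, terms, "kidney"))  # found through the name
```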

Entity
An entity is a concrete piece of metadata. The objects Atlas manages are entities of the various types.
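
To make the type/entity distinction concrete, the sketch below builds the JSON body for one entity of the built-in hdfs_path type. It assumes the Atlas v2 REST layout, where a single entity is posted to /api/atlas/v2/entity. The path@cluster form of qualifiedName is a common convention and is treated here as an assumption; the cluster name and the HDFS URL are hypothetical.

```python
import json


def build_hdfs_path_entity(path, cluster="primary"):
    """Build the request body describing one hdfs_path entity."""
    # The last path segment serves as the display name.
    name = path.rstrip("/").rsplit("/", 1)[-1]
    return {
        "entity": {
            "typeName": "hdfs_path",
            "attributes": {
                # Assumed convention: qualifiedName is unique per type.
                "qualifiedName": f"{path}@{cluster}",
                "name": name,
                "path": path,
            },
        }
    }


entity = build_hdfs_path_entity("hdfs://namenode:8020/data/users")
print(json.dumps(entity, indent=2))
```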

Lineage
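Data lineage describes how data is passed from one dataset to another. With lineage we can see clearly where data comes from, where it flows to, and which operations it went through along the way.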
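
The sketch below walks a lineage graph backwards to answer "where did this data come from". It assumes the response shape of the Atlas v2 lineage endpoint (/api/atlas/v2/lineage/{guid}), in which relations is a list of edges with fromEntityId and toEntityId. The sample response and its GUIDs are hypothetical and heavily trimmed.

```python
def upstream_guids(lineage, guid):
    """Return the GUIDs of everything that feeds into guid, nearest first."""
    # Index the edges by their target so parents can be looked up directly.
    parents = {}
    for relation in lineage["relations"]:
        parents.setdefault(relation["toEntityId"], []).append(relation["fromEntityId"])

    seen = []
    stack = [guid]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.append(parent)
                stack.append(parent)
    return seen


# Hypothetical lineage: table t1 -> process p1 -> table t2
sample_lineage = {
    "baseEntityGuid": "t2",
    "relations": [
        {"fromEntityId": "t1", "toEntityId": "p1"},
        {"fromEntityId": "p1", "toEntityId": "t2"},
    ],
}

print(upstream_guids(sample_lineage, "t2"))  # ['p1', 't1']
```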

Basic usage

This Apache Atlas image is built from the 2.1.0 release source tarball and patched to run in a Docker container.

Atlas is built with embedded HBase + Solr and it is pre-initialized, so you can use it right after image download without additional steps.

If you want to use external Atlas backends, set them up according to the documentation.
For a Chinese-language reference, see the Apache Atlas v1.1 documentation.

  1. Pull the latest release image:
docker pull sburn/apache-atlas
  2. Start Apache Atlas in a container exposing Web-UI port 21000:
docker run -d -p 21000:21000 --name atlas sburn/apache-atlas /opt/apache-atlas-2.1.0/bin/atlas_start.py

Please note that the first startup of Atlas may take a few minutes, depending on host machine performance, before the web interface becomes available at http://localhost:21000/
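
Because the first startup is slow, a script that depends on Atlas may want to wait for the web interface before continuing. The helper below is a minimal sketch that polls the URL above using only the Python standard library. The timeout and interval defaults are arbitrary assumptions, and the probe and sleep parameters exist only so the loop can be exercised without a running container.

```python
import time
import urllib.error
import urllib.request


def wait_for_atlas(url="http://localhost:21000/", timeout_s=600, interval_s=10,
                   probe=None, sleep=time.sleep):
    """Poll url until it answers or timeout_s elapses; return True once it is up."""

    def default_probe(target):
        # Any successful HTTP response counts as "the web interface is there".
        try:
            with urllib.request.urlopen(target, timeout=5) as response:
                return response.status < 500
        except (urllib.error.URLError, OSError):
            return False

    probe = probe or default_probe
    waited = 0
    while waited <= timeout_s:
        if probe(url):
            return True
        sleep(interval_s)
        waited += interval_s
    return False


# Typical use after `docker run`:
#     if not wait_for_atlas():
#         raise SystemExit("Atlas did not come up in time")
```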

Web-UI default credentials: admin / admin

Usage options

Gracefully stop Atlas:

docker exec -ti atlas /opt/apache-atlas-2.1.0/bin/atlas_stop.py

Check Atlas startup script output:

docker logs atlas

Follow the Atlas application.log interactively (useful at the first run and for debugging under load):

docker exec -ti atlas tail -f /opt/apache-atlas-2.1.0/logs/application.log

Run the example (this will add sample types and instances along with traits):

docker exec -ti atlas /opt/apache-atlas-2.1.0/bin/quick_start.py

Start Atlas with settings overridden through environment variables
(for example, to support a large number of metadata objects):

docker run --detach \
    -e "ATLAS_SERVER_OPTS=-server -XX:SoftRefLRUPolicyMSPerMB=0 \
    -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC \
    -XX:+CMSParallelRemarkEnabled -XX:+PrintTenuringDistribution \
    -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=dumps/atlas_server.hprof \
    -Xloggc:logs/gc-worker.log -verbose:gc -XX:+UseGCLogFileRotation \
    -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1m -XX:+PrintGCDetails \
    -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps" \
    -p 21000:21000 \
    --name atlas \
    sburn/apache-atlas \
    /opt/apache-atlas-2.1.0/bin/atlas_start.py

Start Atlas exposing logs directory on the host to view them directly:

docker run --detach \
    -v ${PWD}/atlas-logs:/opt/apache-atlas-2.1.0/logs \
    -p 21000:21000 \
    --name atlas \
    sburn/apache-atlas \
    /opt/apache-atlas-2.1.0/bin/atlas_start.py

Start Atlas exposing conf directory on the host to place and edit configuration files directly:

docker run --detach \
    -v ${PWD}/pre-conf:/opt/apache-atlas-2.1.0/conf \
    -p 21000:21000 \
    --name atlas \
    sburn/apache-atlas \
    /opt/apache-atlas-2.1.0/bin/atlas_start.py

Start Atlas with the data directory mounted on the host so that its data persists:

docker run --detach \
    -v ${PWD}/data:/opt/apache-atlas-2.1.0/data \
    -p 21000:21000 \
    --name atlas \
    sburn/apache-atlas \
    /opt/apache-atlas-2.1.0/bin/atlas_start.py
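
The variants above differ only in their -e and -v options, so they can be combined in a single run. As a sketch, the helper below assembles such a command line from the image name and paths used in this document. The host directories under /srv/atlas and the heap value are hypothetical, and the function only builds the argument list; running it is left to the caller.

```python
import shlex

IMAGE = "sburn/apache-atlas"
ATLAS_HOME = "/opt/apache-atlas-2.1.0"


def build_run_command(host_dirs, env=None, name="atlas", port=21000):
    """Return the argv list for `docker run` with the requested mounts.

    host_dirs maps a directory under ATLAS_HOME ("logs", "conf", "data")
    to an absolute path on the host.
    """
    argv = ["docker", "run", "--detach"]
    for key, value in (env or {}).items():
        argv += ["-e", f"{key}={value}"]
    for subdir, host_path in host_dirs.items():
        argv += ["-v", f"{host_path}:{ATLAS_HOME}/{subdir}"]
    argv += ["-p", f"{port}:{port}", "--name", name, IMAGE,
             f"{ATLAS_HOME}/bin/atlas_start.py"]
    return argv


command = build_run_command(
    {"logs": "/srv/atlas/logs", "data": "/srv/atlas/data"},  # hypothetical host paths
    env={"ATLAS_SERVER_HEAP": "-Xms2g -Xmx2g"},              # hypothetical heap size
)
print(shlex.join(command))
```

Keeping the command as a list rather than a string avoids quoting problems when a value such as the heap setting contains spaces; shlex.join is used only for display.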

Tinkerpop Gremlin support

The image contains built-in extras for those who want to play with JanusGraph and Atlas artifacts using the Apache TinkerPop Gremlin Console (gremlin CLI).

  1. You need the Atlas container up and running as shown above.

  2. Install gremlin-server and gremlin-console into the container by running the included automation script:

docker exec -ti atlas /opt/gremlin/install-gremlin.sh
  3. Start gremlin-server in the same container:
docker exec -d atlas /opt/gremlin/start-gremlin-server.sh
  4. Finally, run gremlin-console interactively:
docker exec -ti atlas /opt/gremlin/run-gremlin-console.sh

Gremlin-console usage example:

         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----

gremlin> :remote connect tinkerpop.server conf/remote.yaml session
==>Configured localhost/127.0.0.1:8182-[d1b2d9de-da1f-471f-be14-34d8ea769ae8]
gremlin> :remote console
==>All scripts will now be sent to Gremlin Server - [localhost/127.0.0.1:8182]-[d1b2d9de-da1f-471f-be14-34d8ea769ae8] - type ':remote console' to return to local mode
gremlin> g = graph.traversal()
==>graphtraversalsource[standardjanusgraph[hbase:[localhost]], standard]
gremlin> g.V().has('__typeName','hdfs_path').count()

Environment Variables

The following environment variables are available for configuration:

| Name | Default | Description |
| --- | --- | --- |
| JAVA_HOME | /usr/lib/jvm/java-8-openjdk-amd64 | The Java implementation to use. If JAVA_HOME is not found, java and jar are expected to be in the path. |
| ATLAS_OPTS | | Any additional Java opts you want to set. These apply to both client and server operations. |
| ATLAS_CLIENT_OPTS | | Any additional Java opts that you want to set for the client only. |
| ATLAS_CLIENT_HEAP | | Java heap size to set for the client. Default is 1024MB. |
| ATLAS_SERVER_OPTS | | Any additional opts you want to set for the Atlas service. |
| ATLAS_SERVER_HEAP | | Java heap size to set for the Atlas server. Default is 1024MB. |
| ATLAS_HOME_DIR | | What is considered the Atlas home dir. Default is the base location of the installed software. |
| ATLAS_LOG_DIR | | Where log files are stored. Default is the logs directory under the base install location. |
| ATLAS_PID_DIR | | Where pid files are stored. Default is the logs directory under the base install location. |
| ATLAS_EXPANDED_WEBAPP_DIR | | Where the war file is expanded. By default it is the /server/webapp dir under the base install dir. |

Bug Tracker

Bugs are tracked on GitHub Issues.
In case of trouble, please check there to see if your issue has already been reported.
If you spotted it first, help us smash it by providing detailed and welcomed feedback.
