An In-Depth Look at Apache Atlas


Atlas is Apache's metadata management and data governance platform for big data. It is an open-source project born in the Hadoop community to solve the metadata governance problem of the Hadoop ecosystem. It provides Hadoop clusters with core metadata governance capabilities including data classification, a centralized policy engine, data lineage, security, and lifecycle management. It supports metadata management for Hive, Storm, Kafka, HBase, Sqoop, and more, and displays data lineage relationships as a graph.

    • Pre-defined types for various Hadoop and non-Hadoop metadata

    • Ability to define new types for the metadata to be managed

    • Types can have primitive attributes, complex attributes, and object references, and can inherit from other types

    • Instances of types, called entities, capture metadata object details and their relationships

    • REST APIs to work with types and instances allow for easier integration

    • Ability to dynamically create classifications, such as PII, EXPIRES_ON, DATA_QUALITY, SENSITIVE

    • Classifications can include attributes, like the expiry_date attribute in the EXPIRES_ON classification

    • Entities can be associated with multiple classifications, enabling easier discovery and security enforcement

    • Propagation of classifications via lineage, automatically ensuring that classifications follow the data as it goes through various processing

    • Intuitive UI to view the lineage of data as it moves through various processes

    • REST APIs to access and update lineage

    • Intuitive UI to search entities by type, classification, attribute value, or free text

    • Rich REST APIs to search by complex criteria

    • SQL-like query language to search entities, a Domain Specific Language (DSL); a small search sketch follows this list

    • Fine-grained security for metadata access, enabling control over access to entity instances and operations such as adding/updating/removing classifications

    • Integration with Apache Ranger enables authorization / data masking on data access based on classifications associated with entities in Apache Atlas. For example:

    • who can access data classified as PII or SENSITIVE

    • customer-service users can only see the last 4 digits of columns classified as NATIONAL_ID
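As mentioned in the list above, entities can be queried with the DSL. A minimal sketch, assuming an Atlas server on localhost:21000 with the default admin/admin credentials and a Hive table named customers (all placeholders), might look like this:

# Sketch: DSL search over the v2 REST API (host, credentials and table name are placeholders)
curl -u admin:admin \
  'http://localhost:21000/api/atlas/v2/search/dsl?query=hive_table+where+name%3D%22customers%22&limit=10'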

I. Architecture

The overall architecture is shown in the figure below:


Type System: Atlas allows users to define a model for the metadata objects they want to manage. The model is composed of definitions called ‘types’. Instances of ‘types’ called ‘entities’ represent the actual metadata objects that are managed. The Type System is a component that allows users to define and manage the types and entities. All metadata objects managed by Atlas out of the box (like Hive tables, for e.g.) are modelled using types and represented as entities. To store new types of metadata in Atlas, one needs to understand the concepts of the type system component.

One key point to note is that the generic nature of the modelling in Atlas allows data stewards and integrators to define both technical metadata and business metadata. It is also possible to define rich relationships between the two using features of Atlas.
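As a rough sketch of how the type system is exercised in practice, a new entity type can be registered through the v2 typedefs REST endpoint. The type name, attribute, host, and credentials below are made-up placeholders, not part of the Atlas distribution:

# Sketch: register a custom entity type (names and credentials are placeholders)
curl -u admin:admin -H 'Content-Type: application/json' -X POST \
  http://localhost:21000/api/atlas/v2/types/typedefs -d '{
    "entityDefs": [{
      "name": "my_dataset",
      "superTypes": ["DataSet"],
      "attributeDefs": [
        { "name": "retentionDays", "typeName": "int", "isOptional": true,
          "cardinality": "SINGLE", "isUnique": false, "isIndexable": false }
      ]
    }]
  }'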

Graph Engine: Internally, Atlas persists metadata objects it manages using a Graph model. This approach provides great flexibility and enables efficient handling of rich relationships between the metadata objects. Graph engine component is responsible for translating between types and entities of the Atlas type system, and the underlying graph persistence model. In addition to managing the graph objects, the graph engine also creates the appropriate indices for the metadata objects so that they can be searched efficiently. Atlas uses the JanusGraph to store the metadata objects.

Atlas uses the distributed graph database JanusGraph as its data store (see https://docs.janusgraph.org/ for details), the goal being to store and query data lineage relationships flexibly as a directed graph. Atlas defines an atlas-graphdb-api abstraction that allows different graph database engines to implement the API, making it easy to switch the underlying store. Reading and writing data in Atlas can therefore be viewed as the process of mapping graph database objects to Java classes, with the basic flow shown in the figure below.

JanusGraph's underlying storage supports HBase, Cassandra, embedded Cassandra, BerkeleyJE, in-memory (stored directly in memory), and others.
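For a quick single-node trial, for example, the graph store can be pointed at a local BerkeleyDB JE directory. The two properties below are a sketch (the directory value is illustrative); the production HBase settings are shown in the installation section later:

# Sketch: local BerkeleyDB JE storage for a single-node trial (directory value is illustrative)
atlas.graph.storage.backend=berkeleyje
atlas.graph.storage.directory=${sys:atlas.home}/data/berkeley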

Ingest / Export: The Ingest component allows metadata to be added to Atlas. Similarly, the Export component exposes metadata changes detected by Atlas to be raised as events. Consumers can consume these change events to react to metadata changes in real time.

Atlas's search/index engine supports Solr and Elasticsearch.

Applications:

  - Atlas Admin UI: This component is a web based application that allows data stewards and scientists to discover and annotate metadata. Of primary importance here is a search interface and SQL like query language that can be used to query the metadata types and objects managed by Atlas. The Admin UI uses the REST API of Atlas for building its functionality.

  - Tag Based Policies: Apache Ranger is an advanced security management solution for the Hadoop ecosystem having wide integration with a variety of Hadoop components. By integrating with Atlas, Ranger allows security administrators to define metadata driven security policies for effective governance. Ranger is a consumer to the metadata change events notified by Atlas.

  - Business Taxonomy: The metadata objects ingested into Atlas from metadata sources are primarily technical metadata. To enhance discoverability and governance, Atlas provides a business taxonomy interface that allows users to first define a set of business terms representing their business domain and then associate them with the metadata entities managed by Atlas. Business Taxonomy is a web application that is currently part of the Atlas Admin UI and integrates with Atlas using the REST APIs.

    - In HDP 2.5, Business Taxonomy ships as a Technical Preview; enable it by adding atlas.feature.taxonomy.enable=true under Atlas > Configs > Advanced > Custom application-properties and restarting the Atlas service.

Integration

Users can manage metadata in Atlas using two methods:

API: All functionality of Atlas is exposed to end users via a REST API that allows types and entities to be created, updated and deleted. It is also the primary mechanism to query and discover the types and entities managed by Atlas.

Messaging: In addition to the API, users can choose to integrate with Atlas using a messaging interface that is based on Kafka. This is useful both for communicating metadata objects to Atlas, and also to consume metadata change events from Atlas using which applications can be built. The messaging interface is particularly useful if one wishes to use a more loosely coupled integration with Atlas that could allow for better scalability, reliability etc. Atlas uses Apache Kafka as a notification server for communication between hooks and downstream consumers of metadata notification events. Events are written by the hooks and Atlas to different Kafka topics.
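For example, assuming the broker address below, the entity change events that Atlas publishes can be inspected with the standard Kafka console consumer; ATLAS_HOOK (written by the hooks) and ATLAS_ENTITIES (written by Atlas for downstream consumers) are the default topic names:

# Sketch: watch entity change events published by Atlas (broker address is a placeholder)
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic ATLAS_ENTITIES --from-beginning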

Metadata source

Atlas supports integration with many metadata sources, with more integrations to be added in the future. Currently, Atlas supports ingesting and managing metadata from the following sources:

  - Hive: through the Hive bridge, Atlas can ingest Hive metadata, including hive_db / hive_table / hive_column / hive_process

  - Sqoop: through the Sqoop bridge, Atlas can ingest metadata of relational databases, including sqoop_operation_type / sqoop_dbstore_usage / sqoop_process / sqoop_dbdatastore

  - Falcon: through the Falcon bridge, Atlas can ingest Falcon metadata, including falcon_cluster / falcon_feed / falcon_feed_creation / falcon_feed_replication / falcon_process

  - Storm: through the Storm bridge, Atlas can ingest stream-processing metadata, including storm_topology / storm_spout / storm_bolt

  To integrate a big-data component as a metadata source, Atlas requires two things to be implemented:

  - First, a metadata model that can express the component's metadata objects must be defined on top of the Atlas type system (for example, the Hive metadata model is implemented in org.apache.atlas.hive.model.HiveDataModelGenerator);

  - Second, a hook component is needed to extract metadata objects from the component's metadata source, listen for metadata changes in real time, and feed them back to Atlas;

The overall flow of metadata processing is shown in the figure below:

  • Querying a metadata object in Atlas often requires traversing multiple vertices and edges in the graph database, which is considerably more complex than directly querying a single row in a relational database. Of course, using a graph database as the underlying store also has its advantages, such as support for complex data types and better support for reading and writing lineage data.

II. Installation and Configuration

1. Atlas provides only source code, not pre-built binary packages. Source download page: http://atlas.apache.org/#/Downloads

2. After downloading the source, build and package it as follows:

tar xvfz apache-atlas-1.0.0-sources.tar.gz

cd apache-atlas-sources-1.0.0/
export MAVEN_OPTS="-Xms2g -Xmx2g"

Install: mvn clean -DskipTests install
Package: mvn clean -DskipTests package -Pdist
Package with embedded HBase and Solr included: mvn clean -DskipTests package -Pdist,embedded-hbase-solr
Package with embedded Cassandra and Solr included: mvn clean package -Pdist,embedded-cassandra-solr

3. Configuration and Startup

tar -xzvf apache-atlas-{project.version}-server.tar.gz

cd atlas-{project.version}/conf, then edit the atlas-application.properties configuration file.

Graph Persistence Engine - HBase configuration:

atlas.graph.storage.backend=hbase

atlas.graph.storage.hostname=<ZooKeeper Quorum>

atlas.graph.storage.hbase.table=atlas

Graph Index Search Engine configuration:

Graph Search Index - Solr:

atlas.graph.index.search.backend=solr5
atlas.graph.index.search.solr.mode=cloud
atlas.graph.index.search.solr.wait-searcher=true
# ZK quorum setup for solr as comma separated value. Example: 10.1.6.4:2181,10.1.6.5:2181
atlas.graph.index.search.solr.zookeeper-url=
# SolrCloud Zookeeper Connection Timeout. Default value is 60000 ms
atlas.graph.index.search.solr.zookeeper-connect-timeout=60000
# SolrCloud Zookeeper Session Timeout. Default value is 60000 ms
atlas.graph.index.search.solr.zookeeper-session-timeout=60000
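Before enabling the Solr backend, the index collections that Atlas expects have to be created in SolrCloud. A sketch along the lines of the Atlas installation documentation, with the config directory, shard and replication counts as illustrative values:

# Sketch: create the index collections Atlas expects (SOLR_CONF path and counts are illustrative)
bin/solr create -c vertex_index   -d $SOLR_CONF -shards 2 -replicationFactor 2
bin/solr create -c edge_index     -d $SOLR_CONF -shards 2 -replicationFactor 2
bin/solr create -c fulltext_index -d $SOLR_CONF -shards 2 -replicationFactor 2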

Graph Search Index - Elasticsearch (Tech Preview):

atlas.graph.index.search.backend=elasticsearch
atlas.graph.index.search.hostname=<hostname(s) of the Elasticsearch master nodes comma separated>
atlas.graph.index.search.elasticsearch.client-only=true

Notification Configs:

atlas.kafka.auto.commit.enable=false
#Kafka servers. Example: localhost:6667
atlas.kafka.bootstrap.servers=
atlas.kafka.hook.group.id=atlas
#Zookeeper connect URL for Kafka. Example: localhost:2181
atlas.kafka.zookeeper.connect=
atlas.kafka.zookeeper.connection.timeout.ms=30000
atlas.kafka.zookeeper.session.timeout.ms=60000
atlas.kafka.zookeeper.sync.time.ms=20
#Setup the following configurations only in test deployments where Kafka is started within Atlas in embedded mode
#atlas.notification.embedded=true
#atlas.kafka.data={sys:atlas.home}/data/kafka
#Setup the following two properties if Kafka is running in Kerberized mode.
#atlas.notification.kafka.service.principal=kafka/_HOST@EXAMPLE.COM
#atlas.notification.kafka.keytab.location=/etc/security/keytabs/kafka.service.keytab

 Client Configs:

atlas.client.readTimeoutMSecs=60000
atlas.client.connectTimeoutMSecs=60000
# URL to access Atlas server. For example: http://localhost:21000
atlas.rest.address=

SSL config:

atlas.enableTLS=false

High Availability Properties:

# Set the following property to true, to enable High Availability. Default = false.
atlas.server.ha.enabled=true
# Specify the list of Atlas instances
atlas.server.ids=id1,id2
# For each instance defined above, define the host and port on which Atlas server listens.
atlas.server.address.id1=host1.company.com:21000
atlas.server.address.id2=host2.company.com:31000
# Specify Zookeeper properties needed for HA.
# Specify the list of services running Zookeeper servers as a comma separated list.
atlas.server.ha.zookeeper.connect=zk1.company.com:2181,zk2.company.com:2181,zk3.company.com:2181
# Specify how many times should connection try to be established with a Zookeeper cluster, in case of any connection issues.
atlas.server.ha.zookeeper.num.retries=3
# Specify how much time should the server wait before attempting connections to Zookeeper, in case of any connection issues.
atlas.server.ha.zookeeper.retry.sleeptime.ms=1000
# Specify how long a session to Zookeeper should last without activity before being deemed unreachable.
atlas.server.ha.zookeeper.session.timeout.ms=20000
# Specify the scheme and the identity to be used for setting up ACLs on nodes created in Zookeeper for HA.
# The format of these options is <scheme:identity>.
# For more information refer to 
http://zookeeper.apache.org/doc/r3.2.2/zookeeperProgrammers.html#sc_ZooKeeperAccessControl
# The 'acl' option allows to specify a scheme, identity pair to setup an ACL for.
atlas.server.ha.zookeeper.acl=sasl:client@company.com
# The 'auth' option specifies the authentication that should be used for connecting to Zookeeper.
atlas.server.ha.zookeeper.auth=sasl:client@company.com
# Since Zookeeper is a shared service that is typically used by many components,
# it is preferable for each component to set its znodes under a namespace.
# Specify the namespace under which the znodes should be written. Default = /apache_atlas
atlas.server.ha.zookeeper.zkroot=/apache_atlas
# Specify number of times a client should retry with an instance before selecting another active instance, or failing an operation.
atlas.client.ha.retries=4
# Specify interval between retries for a client.
atlas.client.ha.sleep.interval.ms=5000
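With HA enabled, each instance reports whether it is currently ACTIVE or PASSIVE through the admin status endpoint. A quick check might look like the following (host and credentials are placeholders):

# Sketch: check which HA role an instance currently holds (host/credentials are placeholders)
curl -u admin:admin http://host1.company.com:21000/api/atlas/admin/status
# a small JSON response such as {"Status":"ACTIVE"} is expected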

cd atlas-{project.version}

bin/atlas_start.py


After startup, the default port is 21000, and the UI can be accessed at http://ip:21000.
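A quick way to confirm the server is responding is to call the admin version endpoint; the credentials below are the out-of-the-box defaults of a fresh install and should be adjusted to your authentication setup:

# Sketch: confirm the Atlas server is up (admin/admin are the defaults of a fresh install)
curl -u admin:admin http://localhost:21000/api/atlas/admin/version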

III. Setting up the Hive Hook

Supported Hive model:

Hive model includes the following types:

  • Entity types:
    • hive_db
      • super-types: Asset
      • attributes: qualifiedName, name, description, owner, clusterName, location, parameters, ownerName
    • hive_table
      • super-types: DataSet
      • attributes: qualifiedName, name, description, owner, db, createTime, lastAccessTime, comment, retention, sd, partitionKeys, columns, aliases, parameters, viewOriginalText, viewExpandedText, tableType, temporary
    • hive_column
      • super-types: DataSet
      • attributes: qualifiedName, name, description, owner, type, comment, table
    • hive_storagedesc
      • super-types: Referenceable
      • attributes: qualifiedName, table, location, inputFormat, outputFormat, compressed, numBuckets, serdeInfo, bucketCols, sortCols, parameters, storedAsSubDirectories
    • hive_process
      • super-types: Process
      • attributes: qualifiedName, name, description, owner, inputs, outputs, startTime, endTime, userName, operationType, queryText, queryPlan, queryId, clusterName
    • hive_column_lineage
      • super-types: Process
      • attributes: qualifiedName, name, description, owner, inputs, outputs, query, depenendencyType, expression
  • Enum types:
    • hive_principal_type
      • values: USER, ROLE, GROUP
  • Struct types:
    • hive_order
      • attributes: col, order
    • hive_serde
      • attributes: name, serializationLib, parameters


 Add the following configuration to Hive's hive-site.xml configuration file:

<property>
    <name>hive.exec.post.hooks</name>
    <value>org.apache.atlas.hive.hook.HiveHook</value>
</property>
  • untar apache-atlas-${project.version}-hive-hook.tar.gz

cd apache-atlas-hive-hook-${project.version}

Copy entire contents of folder apache-atlas-hive-hook-${project.version}/hook/hive to <atlas package>/hook/hive

Add 'export HIVE_AUX_JARS_PATH=<atlas package>/hook/hive' in hive-env.sh of your hive configuration

Copy <atlas-conf>/atlas-application.properties to the hive conf directory.

An example of the atlas-application.properties configuration is shown below:

atlas.hook.hive.synchronous=false # whether to run the hook synchronously. false recommended to avoid delays in Hive query completion. Default: false
atlas.hook.hive.numRetries=3      # number of retries for notification failure. Default: 3
atlas.hook.hive.queueSize=10000   # queue size for the threadpool. Default: 10000
atlas.cluster.name=primary # clusterName to use in qualifiedName of entities. Default: primary
atlas.kafka.zookeeper.connect=                    # Zookeeper connect URL for Kafka. Example: localhost:2181
atlas.kafka.zookeeper.connection.timeout.ms=30000 # Zookeeper connection timeout. Default: 30000
atlas.kafka.zookeeper.session.timeout.ms=60000    # Zookeeper session timeout. Default: 60000
atlas.kafka.zookeeper.sync.time.ms=20             # Zookeeper sync time. Default: 20

Importing Hive Metadata

Usage 1: <atlas package>/hook-bin/import-hive.sh
Usage 2: <atlas package>/hook-bin/import-hive.sh [-d <database regex> OR --database <database regex>] [-t <table regex> OR --table <table regex>]
Usage 3: <atlas package>/hook-bin/import-hive.sh [-f <filename>]
           File Format:
             database1:tbl1
             database1:tbl2
             database2:tbl1
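For example, to import only the tables of a single hypothetical database named sales (the regex values are placeholders):

# Sketch: import metadata for all tables of one database (names are placeholders)
<atlas package>/hook-bin/import-hive.sh -d 'sales' -t '.*'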

 To be continued; the remaining sections will be completed soon.

