1. Overview
Through the previous articles in this series, you should now have a basic understanding of NiFi. What is meant by a "data processing and distribution system"? How does data move through a NiFi system? What are the important NiFi Processors? In what way do Processors cooperate with one another? If you have read and practiced the earlier chapters, none of these questions should be hard to answer.
Deeper questions, however, are more likely to spark your interest: how does each Processor actually run? How does ExecuteScript initialize its scripts? How does the system as a whole store, distribute, and process data? This article takes a brief look at the architecture to help you build a further understanding of NiFi.
2. Core Concepts
http://nifi.apache.org/docs/nifi-docs/html/overview.html#the-core-concepts-of-nifi
NiFi’s fundamental design concepts closely relate to the main ideas of Flow Based Programming [fbp]. Here are some of the main NiFi concepts and how they map to FBP:
| NiFi Term | FBP Term | Description |
|-----------|----------|-------------|
| FlowFile | Information Packet | A FlowFile represents each object moving through the system and for each one, NiFi keeps track of a map of key/value pair attribute strings and its associated content of zero or more bytes. |
| FlowFile Processor | Black Box | Processors actually perform the work. In [eip] terms a processor is doing some combination of data routing, transformation, or mediation between systems. Processors have access to attributes of a given FlowFile and its content stream. Processors can operate on zero or more FlowFiles in a given unit of work and either commit that work or rollback. |
| Connection | Bounded Buffer | Connections provide the actual linkage between processors. These act as queues and allow various processes to interact at differing rates. These queues can be prioritized dynamically and can have upper bounds on load, which enable back pressure. |
| Flow Controller | Scheduler | The Flow Controller maintains the knowledge of how processes connect and manages the threads and allocations thereof which all processes use. The Flow Controller acts as the broker facilitating the exchange of FlowFiles between processors. |
| Process Group | subnet | A Process Group is a specific set of processes and their connections, which can receive data via input ports and send data out via output ports. In this manner, process groups allow creation of entirely new components simply by composition of other components. |
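To make the FBP mapping above concrete, here is a minimal sketch in Python. This is a hypothetical toy model, not NiFi's real API: a `FlowFile` is an attribute map plus content bytes, a `Connection` is a bounded queue whose "full" state signals back pressure, and a processor is a black box that consumes from one connection and writes to another.

```python
from dataclasses import dataclass, field
from collections import deque

# Toy model of the FBP concepts -- NOT NiFi's actual API,
# just an illustration of how the terms map onto each other.

@dataclass
class FlowFile:
    """An information packet: key/value attributes plus content bytes."""
    attributes: dict = field(default_factory=dict)
    content: bytes = b""

class Connection:
    """A bounded buffer between processors; a full queue
    signals back pressure to the upstream side."""
    def __init__(self, max_size: int):
        self.max_size = max_size
        self.queue = deque()

    def offer(self, flowfile: FlowFile) -> bool:
        if len(self.queue) >= self.max_size:
            return False          # back pressure: upstream must wait
        self.queue.append(flowfile)
        return True

    def poll(self):
        return self.queue.popleft() if self.queue else None

# A tiny "black box" processor: uppercases content, adds an attribute.
def uppercase_processor(inbound: Connection, outbound: Connection):
    ff = inbound.poll()
    if ff is not None:
        out = FlowFile(attributes={**ff.attributes, "transformed": "true"},
                       content=ff.content.upper())
        outbound.offer(out)

inbound, outbound = Connection(max_size=2), Connection(max_size=2)
inbound.offer(FlowFile(attributes={"filename": "a.txt"}, content=b"hello"))
uppercase_processor(inbound, outbound)
print(outbound.poll().content)  # b'HELLO'
```

Because the connection, not the processor, owns the queue bound, the same mechanism gives you the prioritization and back-pressure behavior described in the table without either processor knowing about the other.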
This design model, also similar to [seda], provides many beneficial consequences that help NiFi to be a very effective platform for building powerful and scalable dataflows. A few of these benefits include:
- Lends well to visual creation and management of directed graphs of processors
- Is inherently asynchronous which allows for very high throughput and natural buffering even as processing and flow rates fluctuate
- Provides a highly concurrent model without a developer having to worry about the typical complexities of concurrency
- Promotes the development of cohesive and loosely coupled components which can then be reused in other contexts and promotes testable units
- The resource constrained connections make critical functions such as back-pressure and pressure release very natural and intuitive
- Error handling becomes as natural as the happy-path rather than a coarse grained catch-all
- The points at which data enters and exits the system as well as how it flows through are well understood and easily tracked
3. NiFi Architecture
Image source: http://nifi.apache.org/docs.html
As the architecture diagram above shows, NiFi has the following main components. (The descriptions are quoted directly from the official documentation, which already explains each component in detail; a translation would only make them harder to follow.)
- Web Server: The purpose of the web server is to host NiFi’s HTTP-based command and control API.
- Flow Controller: The flow controller is the brains of the operation. It provides threads for extensions to run on, and manages the schedule of when extensions receive resources to execute.
- Extensions: There are various types of NiFi extensions which are described in other documents. The key point here is that extensions operate and execute within the JVM.
- FlowFile Repository: The FlowFile Repository is where NiFi keeps track of the state of what it knows about a given FlowFile that is presently active in the flow. The implementation of the repository is pluggable. The default approach is a persistent Write-Ahead Log located on a specified disk partition.
- Content Repository: The Content Repository is where the actual content bytes of a given FlowFile live. The implementation of the repository is pluggable. The default approach is a fairly simple mechanism, which stores blocks of data in the file system. More than one file system storage location can be specified so as to get different physical partitions engaged to reduce contention on any single volume.
- Provenance Repository: The Provenance Repository is where all provenance event data is stored. The repository construct is pluggable with the default implementation being to use one or more physical disk volumes. Within each location event data is indexed and searchable.
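The key design point in the repository list above is the split between FlowFile *state* (kept in a write-ahead log) and FlowFile *content* (kept as blocks on disk, referenced by a claim). The following is a deliberately simplified sketch of that idea in Python; it is not NiFi's real storage format, and all class and field names are invented for illustration.

```python
import json
import os
import tempfile
import uuid

# Toy illustration of the repository split -- NOT NiFi's real storage
# format. FlowFile *state* (attributes plus a pointer to content) goes
# into an append-only log; the content *bytes* live in a separate
# content store and are referenced by a claim id.

class ContentRepo:
    """Stores raw content bytes on the file system, one file per claim."""
    def __init__(self, directory: str):
        self.directory = directory

    def write(self, data: bytes) -> str:
        claim_id = uuid.uuid4().hex
        with open(os.path.join(self.directory, claim_id), "wb") as f:
            f.write(data)
        return claim_id

    def read(self, claim_id: str) -> bytes:
        with open(os.path.join(self.directory, claim_id), "rb") as f:
            return f.read()

class FlowFileRepo:
    """Append-only log of FlowFile state changes (a crude write-ahead log)."""
    def __init__(self, path: str):
        self.path = path

    def log(self, flowfile_id: str, attributes: dict, claim_id: str):
        with open(self.path, "a") as f:
            f.write(json.dumps({"id": flowfile_id,
                                "attributes": attributes,
                                "claim": claim_id}) + "\n")

    def replay(self) -> dict:
        # Recover the latest state of every FlowFile after a restart:
        # later log entries for the same id overwrite earlier ones.
        state = {}
        with open(self.path) as f:
            for line in f:
                record = json.loads(line)
                state[record["id"]] = record
        return state

with tempfile.TemporaryDirectory() as d:
    content_repo = ContentRepo(d)
    ff_repo = FlowFileRepo(os.path.join(d, "wal.log"))
    claim = content_repo.write(b"payload bytes")
    ff_repo.log("ff-1", {"filename": "a.txt"}, claim)
    recovered = ff_repo.replay()["ff-1"]
    print(content_repo.read(recovered["claim"]))  # b'payload bytes'
```

Because state updates only ever append a small JSON record while the (possibly large) content is written once and referenced by id, attribute changes stay cheap and recovery after a crash is a simple replay of the log.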
Starting with the NiFi 1.0 release, a Zero-Master Clustering paradigm is employed. Each node in a NiFi cluster performs the same tasks on the data, but each operates on a different set of data. Apache ZooKeeper elects a single node as the Cluster Coordinator, and failover is handled automatically by ZooKeeper. All cluster nodes report heartbeat and status information to the Cluster Coordinator. The Cluster Coordinator is responsible for disconnecting and connecting nodes. Additionally, every cluster has one Primary Node, also elected by ZooKeeper. As a DataFlow manager, you can interact with the NiFi cluster through the user interface (UI) of any node. Any change you make is replicated to all nodes in the cluster, allowing for multiple entry points.
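For reference, zero-master clustering is configured per node in `nifi.properties`. The property names below come from the NiFi administration guide; the host names and ports are placeholders you would replace with your own.

```properties
# Mark this instance as a cluster node (placeholder hosts/ports below)
nifi.cluster.is.node=true
nifi.cluster.node.address=nifi-node1.example.com
nifi.cluster.node.protocol.port=11443

# ZooKeeper ensemble used for Cluster Coordinator / Primary Node election
nifi.zookeeper.connect.string=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181

# How long to wait for the cluster to elect the authoritative flow on startup
nifi.cluster.flow.election.max.wait.time=5 mins
```

Each node carries the same flow definition; ZooKeeper handles the coordinator and primary-node elections described above, so no node needs to be singled out as a master in the configuration.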
This article is also published on my GitHub Pages, under the title 『NiFi學習之路』把握——架構及主要部件.