The Splunk Indexing Process


Terminology

Event: Events are records of activity in log files, stored in Splunk indexes. Put simply, each line of a processed log or call detail record is one event;
Source type: identifies the format of the data. In short, a log in a particular format can be defined as a source type. Splunk ships with more than 500 predefined source types for data in known formats, including Apache logs, logs from common operating systems, and logs from Cisco and other network devices;
Index: The index is the repository for Splunk Enterprise data. Splunk transforms incoming data into events, which it stores in indexes. The term carries two meanings: it names the physical storage of the data, and it names a processing action, as in "Splunk indexes your data". The indexing process produces two types of data:
The raw data in compressed form (rawdata)
Indexes that point to the raw data, plus some metadata files (index files)
Indexer: An indexer is a Splunk Enterprise instance that indexes data. This is the usual notion of indexing, and also the name of the specific "Indexer" module in Splunk; an indexer is one kind of Splunk Enterprise instance;
Bucket: the two types of data an index stores are organized by age into separate directories, called buckets (a toy model follows this list);
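
To picture buckets, here is a toy Python model (a hypothetical layout; the real hot/warm/cold/frozen directory scheme is managed by Splunk itself) in which each bucket is an age-bounded container holding both kinds of stored data:

```python
from datetime import datetime, timedelta

class Bucket:
    """Toy age-bounded container for the two kinds of data an index stores."""
    def __init__(self, earliest, latest):
        self.earliest, self.latest = earliest, latest
        self.rawdata, self.index_files = [], {}

def bucket_for(event_time, buckets, span=timedelta(hours=1)):
    """Return the bucket whose time span covers the event, creating one if needed."""
    for b in buckets:
        if b.earliest <= event_time < b.latest:
            return b
    start = event_time.replace(minute=0, second=0, microsecond=0)
    b = Bucket(start, start + span)
    buckets.append(b)
    return b

buckets = []
b = bucket_for(datetime(2024, 1, 15, 10, 30), buckets)
print(b.earliest, b.latest)  # 2024-01-15 10:00:00  2024-01-15 11:00:00
```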

Responsibilities (see the diagram later in this article for details)

Search Head: front-end search;
Deployment Server: in effect a configuration management center that centrally manages the other nodes;

Forwarder: collects, pre-processes, and forwards data to indexers (consumes data and forwards it on to indexers), together forming a mechanism similar to Flume's agent-and-collector design. Its capabilities include the following (a conceptual sketch follows the list):
· Tagging of metadata (source, sourcetype, and host)
· Configurable buffering
· Data compression
· SSL security
· Use of any available network ports
· Running scripted inputs locally
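
To make the list above concrete, the following minimal Python sketch shows what a forwarder conceptually does: tag default metadata, batch and compress events, and ship them over SSL. All function names and addresses here are hypothetical; a real forwarder is configured through inputs.conf and outputs.conf rather than hand-written (9997 is merely the conventional Splunk receiving port).

```python
import gzip
import json
import socket
import ssl

# Hypothetical sketch of a forwarder's send path, not Splunk's implementation.

def tag_event(raw_line: str, host: str, source: str, sourcetype: str) -> dict:
    """Attach the default metadata fields (host, source, sourcetype)."""
    return {"host": host, "source": source, "sourcetype": sourcetype, "raw": raw_line}

def forward(events: list, indexer_addr: tuple) -> None:
    """Buffer a batch of events, compress it, and send it over an SSL socket."""
    payload = gzip.compress(json.dumps(events).encode("utf-8"))  # data compression
    context = ssl.create_default_context()
    with socket.create_connection(indexer_addr) as sock:
        with context.wrap_socket(sock, server_hostname=indexer_addr[0]) as tls:
            tls.sendall(payload)  # SSL security, on whatever port is configured

events = [tag_event(line, "web01", "/var/log/access.log", "access_combined")
          for line in open("/var/log/access.log")]
forward(events, ("indexer.example.com", 9997))
```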

Note: Forwarders can transmit three kinds of data: raw, unparsed, and parsed. The kinds of data a forwarder can send depend on the forwarder type and how it is configured. Universal forwarders and light forwarders can send raw or unparsed data; heavy forwarders can send raw or parsed data.

Indexer: performs the "indexing" of data, known as the indexing process or event processing. It includes the steps below (a conceptual sketch follows the list):
· Separating the datastream into individual, searchable events. (line breaking)
· Creating or identifying timestamps. (timestamp identification)
· Extracting fields such as host, source, and sourcetype. (extraction of common default fields)
· Performing user-defined actions on the incoming data, such as identifying custom fields, masking sensitive data, writing new or modified keys, applying breaking rules for multi-line events, filtering unwanted events, and routing events to specified indexes or servers.
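
As a conceptual illustration of the first two steps (hypothetical code, not Splunk's actual event-breaking logic), the sketch below starts a new event wherever a line begins with a timestamp, and parses that timestamp for the event:

```python
import re
from datetime import datetime

# Toy model of event breaking: a recognized timestamp marks an event boundary.
TS = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

def break_into_events(stream: str) -> list:
    events, current = [], []
    for line in stream.splitlines():
        if TS.match(line) and current:          # boundary: flush the previous event
            events.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        events.append("\n".join(current))
    return events

def extract_timestamp(event: str) -> datetime:
    m = TS.match(event)
    # Create a timestamp if none exists, as the indexer does.
    return datetime.strptime(m.group(), "%Y-%m-%d %H:%M:%S") if m else datetime.now()

stream = ("2024-01-15 10:00:01 ERROR disk full\n"
          "  at /dev/sda1\n"                    # continuation line, same event
          "2024-01-15 10:00:02 INFO retrying\n")
for e in break_into_events(stream):
    print(extract_timestamp(e), repr(e))
```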

 

Parts of an indexer cluster (distributed deployment)

An indexer cluster is a group of Splunk Enterprise instances, or nodes, that, working in concert, provide a redundant indexing and searching capability. Each cluster has three types of nodes:

  •  A single master node to manage the cluster.
  •  Several to many peer nodes to index and maintain multiple copies of the data and to search the data.
  •  One or more search heads to coordinate searches across the set of peer nodes.

The master node manages the cluster. It coordinates the replicating activities of the peer nodes and tells the search head where to find data. It also helps manage the configuration of peer nodes and orchestrates remedial activities if a peer goes down.

The peer nodes receive and index incoming data, just like non-clustered, stand-alone indexers. Unlike stand-alone indexers, however, peer nodes also replicate data from other nodes in the cluster. A peer node can index its own incoming data while simultaneously storing copies of data from other nodes. You must have at least as many peer nodes as the replication factor. That is, to support a replication factor of 3, you need a minimum of three peer nodes.
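
The replication-factor constraint can be illustrated with a small hypothetical sketch (this is not Splunk's actual placement algorithm): each block needs replication_factor distinct peers, one of which is the peer that received the data, so the cluster must contain at least that many peers.

```python
# Hypothetical placement sketch; Splunk's real replication logic differs.
def place_copies(block_id: str, peers: list, replication_factor: int) -> list:
    """Choose which peer nodes hold copies of one block of data."""
    if len(peers) < replication_factor:
        raise ValueError("need at least as many peer nodes as the replication factor")
    receiver = peers[hash(block_id) % len(peers)]          # peer that got the data
    targets = [p for p in peers if p != receiver][:replication_factor - 1]
    return [receiver] + targets                            # one copy per peer

print(place_copies("bucket-0042", ["peer1", "peer2", "peer3"], replication_factor=3))
```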

The search head runs searches across the set of peer nodes. You must use a search head to manage searches across indexer clusters. (That is, the search head distributes search requests to the indexer nodes and then merges the results.)

For most purposes, it is recommended that you use forwarders to get data into the cluster.

Here is a diagram of a basic, single-site indexer cluster, containing three peer nodes and supporting a replication factor of 3:

[Figure: basic single-site indexer cluster]

This diagram shows a simple deployment, similar to a small-scale non-clustered deployment, with some forwarders sending load-balanced data to a group of indexers (peer nodes), and the indexers sending search results to a search head. There are two additions that you don't find in a non-clustered deployment:

  •  The indexers are streaming copies of their data to other indexers.
  •  The master node, while it doesn't participate in any data streaming, coordinates a range of activities involving the search peers and the search head.

How indexing works

Splunk Enterprise can index any type of time-series data (data with timestamps). When Splunk Enterprise indexes data, it breaks it into events, based on the timestamps.

Event processing

Event processing occurs in two stages, parsing and indexing. All data that comes into Splunk Enterprise enters through the parsing pipeline as large (10,000 bytes) chunks. During parsing, Splunk Enterprise breaks these chunks into events which it hands off to the indexing pipeline, where final processing occurs.

While parsing, Splunk Enterprise performs a number of actions, including:

  •  Extracting a set of default fields for each event, including host, source, and sourcetype.
  •  Configuring character set encoding.
  •  Identifying line termination using linebreaking rules. While many events are short and only take up a line or two, others can be long.
  •  Identifying timestamps or creating them if they don't exist. At the same time that it processes timestamps, Splunk identifies event boundaries.
  •  Splunk can be set up to mask sensitive event data (such as credit card or social security numbers) at this stage; a short masking sketch follows this list. It can also be configured to apply custom metadata to incoming events.
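
The masking step can be pictured with a short hypothetical sketch; in a real deployment this kind of rewrite is what props.conf SEDCMD settings are used for, and the regexes below are illustrative only.

```python
import re

# Illustrative parse-stage masking of card numbers and social security numbers.
CARD = re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def mask_sensitive(event: str) -> str:
    """Rewrite sensitive substrings before the event reaches the index."""
    event = CARD.sub("XXXX-XXXX-XXXX-XXXX", event)
    return SSN.sub("XXX-XX-XXXX", event)

print(mask_sensitive("payment card=4111-1111-1111-1111 ssn=123-45-6789"))
```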

In the indexing pipeline, Splunk Enterprise performs additional processing, including the following (a toy sketch follows the list):

  •  Breaking all events into segments that can then be searched upon. You can determine the level of segmentation, which affects indexing and searching speed, search capability, and efficiency of disk compression.
  •  Building the index data structures.
  •  Writing the raw data and index files to disk, where post-indexing compression occurs.
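
A toy model of segmentation and index building (a deliberate simplification: here a "segment" is just a lowercased token and the on-disk structures are plain dictionaries, whereas Splunk's real index files are far more elaborate):

```python
import re
from collections import defaultdict

def segment(event: str) -> list:
    """Break an event into searchable segments; the level is configurable in Splunk."""
    return re.findall(r"\w+", event.lower())

raw_events = ["ERROR disk full on web01", "INFO retry succeeded on web01"]

index = defaultdict(set)                        # segment -> event ids ("index files")
for event_id, event in enumerate(raw_events):   # raw_events stands in for the rawdata
    for token in segment(event):
        index[token].add(event_id)

# A search becomes an index lookup plus a fetch of the matching raw events.
print([raw_events[i] for i in index["web01"] & index["error"]])
```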

The breakdown between parsing and indexing pipelines is of relevance mainly when deploying forwarders. Heavy forwarders can parse data and then forward the parsed data on to indexers for final indexing. Some source types - those that reference structured data - require configuration on the forwarder prior to indexing. See "Extract data from files with headers".

For more information about events and what happens to them during the indexing process, see the chapter "Configure event processing" in the Getting Data In Manual.

Note: Indexing is an I/O-intensive process.

This diagram shows the main processes inherent in indexing:

[Figure: the data pipeline]

Note: This diagram represents a simplified view of the indexing architecture. It provides a functional view of the architecture and does not fully describe Splunk Enterprise internals. In particular, the parsing pipeline actually consists of three pipelines: parsing, merging, and typing, which together handle the parsing function. The distinction can matter during troubleshooting, but does not generally affect how you configure or deploy Splunk Enterprise.

How indexer acknowledgment works

In brief, indexer acknowledgment works like this: The forwarder sends data continuously to the receiving peer, in blocks of approximately 64kB. The forwarder maintains a copy of each block in memory until it gets an acknowledgment from the peer. While waiting, it continues to send more data blocks.

If all goes well, the receiving peer:

1. receives the block of data, parses and indexes it, and writes the data (raw data and index data) to the file system.

2. streams copies of the raw data to each of its target peers.

3. sends an acknowledgment back to the forwarder.

The acknowledgment assures the forwarder that the data was successfully written to the cluster. Upon receiving the acknowledgment, the forwarder releases the block from memory.

If the forwarder does not receive the acknowledgment, that means there was a failure along the way. Either the receiving peer went down or that peer was unable to contact its set of target peers. The forwarder then automatically resends the block of data. If the forwarder is using load-balancing, it sends the block to another receiving node in the load-balanced group. If the forwarder is not set up for load-balancing, it attempts to resend data to the same node as before.
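
The forwarder side of this protocol can be summarized in a hypothetical sketch: blocks stay in memory keyed by id until acknowledged, and a timeout triggers a resend, going to the next peer when a load-balanced group is configured. (In a real deployment the behavior itself is switched on with useACK = true in the forwarder's outputs.conf; the class below is illustrative only.)

```python
import itertools

# Hypothetical model of the forwarder side of indexer acknowledgment.
class AckingForwarder:
    def __init__(self, peers):
        self.pending = {}                         # block_id -> block held in memory
        self.peer_cycle = itertools.cycle(peers)  # load-balanced receiving group

    def send(self, block_id, block, transport):
        self.pending[block_id] = block            # retain a copy until acknowledged
        transport(next(self.peer_cycle), block_id, block)

    def on_ack(self, block_id):
        self.pending.pop(block_id, None)          # data was written and replicated

    def on_timeout(self, block_id, transport):
        # No ack: the peer went down or could not reach its target peers; resend.
        transport(next(self.peer_cycle), block_id, self.pending[block_id])

sent = []
fwd = AckingForwarder(["peerA", "peerB"])
fwd.send("blk1", b"x" * 65536, lambda p, i, b: sent.append((p, i)))
fwd.on_timeout("blk1", lambda p, i, b: sent.append((p, i)))   # resend to next peer
fwd.on_ack("blk1")
print(sent, fwd.pending)  # [('peerA', 'blk1'), ('peerB', 'blk1')] {}
```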

Important: To ensure end-to-end data fidelity, you must explicitly enable indexer acknowledgment for each forwarder that's sending data to the cluster, as described earlier in this topic. If end-to-end data fidelity is not a requirement for your deployment, you can skip this step.

For more information on how indexer acknowledgment works, read "Protect against loss of in-flight data" in the Forwarding Data manual.

 

