StreamSets Tutorial Series 4: Origins Overview

Reposted article · 2020-01-15 18:15 · Category: Big Data

1.1. Origins

An origin stage represents the source for the pipeline. You can use a single origin stage in a pipeline.

You can use different origins based on the execution mode of the pipeline.

In standalone pipelines, you can use the following origins:

Amazon S3 - Reads objects from Amazon S3.
Amazon SQS Consumer - Reads data from queues in Amazon Simple Queue Service (SQS).
Azure IoT/Event Hub Consumer - Reads data from Microsoft Azure Event Hub. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
CoAP Server - Listens on a CoAP endpoint and processes the contents of all authorized CoAP requests. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
Directory - Reads fully-written files from a directory. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
Elasticsearch - Reads data from an Elasticsearch cluster. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
File Tail - Reads lines of data from an active file after reading related archived files in the directory.
Google BigQuery - Executes a query job and reads the result from Google BigQuery.
Google Cloud Storage - Reads fully written objects from Google Cloud Storage.
Google Pub/Sub Subscriber - Consumes messages from a Google Pub/Sub subscription. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
Hadoop FS Standalone - Reads fully-written files from HDFS. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
HTTP Client - Reads data from a streaming HTTP resource URL.
HTTP Server - Listens on an HTTP endpoint and processes the contents of all authorized HTTP POST and PUT requests. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
HTTP to Kafka - Listens on an HTTP endpoint and writes the contents of all authorized HTTP POST requests directly to Kafka.
JDBC Multitable Consumer - Reads database data from multiple tables through a JDBC connection. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
JDBC Query Consumer - Reads database data using a user-defined SQL query through a JDBC connection.
JMS Consumer - Reads messages from JMS.
Kafka Consumer - Reads messages from a single Kafka topic.
Kafka Multitopic Consumer - Reads messages from multiple Kafka topics. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
Kinesis Consumer - Reads data from Kinesis Streams. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
MapR DB CDC - Reads changed MapR DB data that has been written to MapR Streams. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
MapR DB JSON - Reads JSON documents from MapR DB JSON tables.
MapR FS - Reads files from MapR FS.
MapR FS Standalone - Reads fully-written files from MapR FS. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
MapR Multitopic Streams Consumer - Reads messages from multiple MapR Streams topics. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
MapR Streams Consumer - Reads messages from MapR Streams.
MongoDB - Reads documents from MongoDB.
MongoDB Oplog - Reads entries from a MongoDB Oplog.
MQTT Subscriber - Subscribes to a topic on an MQTT broker to read messages from the broker.
MySQL Binary Log - Reads MySQL binary logs to generate change data capture records.
Omniture - Reads web usage reports from the Omniture reporting API.
OPC UA Client - Reads data from an OPC UA server.
Oracle CDC Client - Reads LogMiner redo logs to generate change data capture records.
RabbitMQ Consumer - Reads messages from RabbitMQ.
Redis Consumer - Reads messages from Redis.
Salesforce - Reads data from Salesforce.
SDC RPC - Reads data from an SDC RPC destination in an SDC RPC pipeline.
SDC RPC to Kafka - Reads data from an SDC RPC destination in an SDC RPC pipeline and writes it to Kafka.
SFTP/FTP Client - Reads files from an SFTP or FTP server.
SQL Server CDC Client - Reads data from Microsoft SQL Server CDC tables. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
SQL Server Change Tracking - Reads data from Microsoft SQL Server change tracking tables and generates the latest version of each record. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
TCP Server - Listens at the specified ports and processes incoming data over TCP/IP connections. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
UDP Multithreaded Source - Reads messages from one or more UDP ports. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
UDP Source - Reads messages from one or more UDP ports.
UDP to Kafka - Reads messages from one or more UDP ports and writes the data to Kafka.
WebSocket Client - Reads data from a WebSocket server endpoint.
WebSocket Server - Listens on a WebSocket endpoint and processes the contents of all authorized WebSocket client requests. Creates multiple threads to enable parallel processing in a multithreaded pipeline.

In cluster pipelines, you can use the following origins:

Hadoop FS - Reads data from the Hadoop Distributed File System (HDFS). Can read from other file systems using the Hadoop FileSystem interface.
Kafka Consumer - Reads messages from Kafka. Use the cluster version of the origin.
MapR FS - Reads data from MapR FS.
MapR Streams Consumer - Reads messages from MapR Streams.

In edge pipelines, you can use the following origins:

Directory - Reads fully-written files from a directory.
File Tail - Reads lines of data from an active file after reading related archived files in the directory.
HTTP Server - Listens on an HTTP endpoint and processes the contents of all authorized HTTP POST and PUT requests.
MQTT Subscriber - Subscribes to a topic on an MQTT broker to read messages from the broker.
Windows Event Log - Reads data from a Microsoft Windows event log located on a Windows machine.

To help create or test pipelines, you can use the following development origins:

Dev Data Generator
Dev Random Source
Dev Raw Data Source
Dev SDC RPC with Buffering
Dev Snapshot Replaying
Sensor Reader

For more information, see Development Stages.

1.1.1. Comparing HTTP Origins

Data Collector provides several HTTP origins, so make sure to use the best one for your needs. Here's a quick breakdown of some key differences:

Origin	Description
HTTP Client	Initiates HTTP requests for an external system. Processes data synchronously. Processes JSON, text, and XML data. Can process a range of HTTP requests. Can be used in a pipeline with processors.
HTTP Server	Listens for incoming HTTP requests and processes them while the sender waits for confirmation. Processes data synchronously. Creates multithreaded pipelines, thus suitable for high throughput of incoming data. Processes virtually all data formats. Processes HTTP POST and PUT requests. Can be used in a pipeline with processors.
HTTP to Kafka	Listens for incoming HTTP requests and writes them immediately to Kafka with no additional processing. Processes data asynchronously. Suitable for very high throughput of incoming data. Writes all data to Kafka, regardless of the data format. Processes HTTP POST requests only. Cannot be used in a pipeline with processors. For more flexibility, use the HTTP Server origin.

1.1.2. Comparing MapR Origins

Data Collector provides several MapR origins, so make sure to use the best one for your needs. Here's a quick breakdown of some key differences:

Origin	Description
MapR DB CDC	Reads change data capture MapR DB data using MapR Streams. Includes CDC information in record header attributes. Use in standalone execution mode pipelines.
MapR DB JSON	Reads JSON documents from MapR DB. Converts each JSON document to a record. Use in standalone execution mode pipelines.
MapR FS	Reads files from MapR FS. Can be used with Kerberos Authentication. Use in cluster execution mode pipelines.
MapR FS Standalone	Reads files from MapR FS. Can use multiple threads to enable the parallel processing of files. Can be used with Kerberos Authentication. Use in standalone execution mode pipelines.
MapR Multitopic Streams	Streams data from MapR Streams. Can use multiple threads to read from multiple topics. Use in standalone execution mode pipelines.
MapR Streams	Streams data from MapR Streams. Reads from a single topic using a single thread. Use in standalone execution mode pipelines.

1.1.3. Comparing UDP Source Origins

The UDP Source and UDP Multithreaded Source origins are very similar. The main differentiator is that the UDP Multithreaded Source can use multiple threads to process data within the pipeline.

The UDP Multithreaded Source has a processing queue that aids multithreaded processing. But use of this queue can slow processing under certain circumstances.
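
The hand-off described above can be pictured as a receiver placing datagrams on a queue that several worker threads drain. The following is only a minimal sketch of that general pattern, not Data Collector's implementation: the function name `run_with_queue`, the in-memory list standing in for UDP datagrams, and the queue size are all assumptions made for illustration.

```python
import queue
import threading


def run_with_queue(datagrams, worker_count, process):
    """Pass datagrams through a bounded queue to worker threads.

    Hypothetical helper: 'datagrams' is an in-memory stand-in for
    messages a receiver thread would read from UDP ports.
    """
    work = queue.Queue(maxsize=100)  # queue size chosen arbitrarily
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            item = work.get()
            if item is None:  # sentinel: no more data for this worker
                break
            output = process(item)
            with results_lock:
                results.append(output)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()

    # Receiver side: every datagram makes one extra hop through the queue.
    for datagram in datagrams:
        work.put(datagram)
    for _ in threads:
        work.put(None)
    for thread in threads:
        thread.join()
    return results


processed = run_with_queue([b"a", b"b", b"c"], worker_count=2, process=bytes.upper)
```

Because the workers run concurrently, the order of `processed` is not guaranteed; sort it if order matters.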

The following table describes some cases when you might want to use each origin:

Origin	Ideally Used When
UDP Multithreaded Source	Either: epoll support enables the use of multiple receiver threads to pass data to the pipeline, and a complex pipeline requires longer processing time. Or: lack of epoll support allows only a single receiver thread to pass data to the pipeline, and data volumes are high.
UDP Source	Epoll support enables the use of multiple receiver threads to pass data to the pipeline, and a relatively simple pipeline enables speedy Data Collector processing.

Data Collector also provides a UDP to Kafka origin for reading large volumes of data from multiple UDP ports and writing the data immediately to Kafka, without additional processing.

1.1.4. Comparing WebSocket Origins

Data Collector provides two WebSocket origins, so make sure to use the best one for your needs. Here's a quick breakdown of some key differences:

Origin	Description
WebSocket Client	Initiates a connection to a WebSocket server endpoint and then waits for the WebSocket server to push data.
WebSocket Server	Listens for incoming WebSocket requests and processes them while the sender waits for confirmation. Creates multithreaded pipelines, thus suitable for high throughput of incoming data.

1.1.5. Batch Size and Wait Time

For origin stages, the batch size determines the maximum number of records sent through the pipeline at one time. The batch wait time determines the time that the origin waits for data before sending a batch. At the end of the wait time, it sends the batch regardless of how many records the batch contains.

For example, a File Tail origin is configured for a batch size of 20 records and a batch wait time of 240 seconds. When data arrives quickly, File Tail fills a batch with 20 records and sends it through the pipeline immediately, creating a new batch and sending it again as soon as it is full. As incoming data slows, a remaining batch contains a few records, gaining an extra record periodically. 240 seconds after creating the batch, File Tail sends the partially-full batch through the pipeline. It immediately creates a new batch and starts a new countdown.
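
The interaction between the two settings can be sketched as a small simulation. This is an illustrative model of the behavior described above, not Data Collector code: the function `simulate_batches` and the use of explicit arrival timestamps in seconds are assumptions. The example uses a batch size of 3 and a wait time of 10 seconds instead of 20 and 240 to keep it short.

```python
def simulate_batches(arrivals, batch_size, wait_seconds):
    """Return (send_time, records) pairs for time-ordered arrivals.

    arrivals: list of (arrival_time_seconds, record), sorted by time.
    The first batch is assumed to be created at time 0.
    """
    sent = []
    batch = []
    deadline = wait_seconds
    for arrival_time, record in arrivals:
        # Send every batch whose wait time ran out before this record
        # arrived, even if it is partial or empty.
        while arrival_time > deadline:
            sent.append((deadline, batch))
            batch = []
            deadline += wait_seconds
        batch.append(record)
        if len(batch) == batch_size:
            # A full batch is sent immediately and a new countdown starts.
            sent.append((arrival_time, batch))
            batch = []
            deadline = arrival_time + wait_seconds
    if batch:
        # The last partial batch goes out when its wait time ends.
        sent.append((deadline, batch))
    return sent


arrivals = [(1, "a"), (2, "b"), (3, "c"), (5, "d"), (20, "e")]
timeline = simulate_batches(arrivals, batch_size=3, wait_seconds=10)
```

Here the first three records fill a batch at second 3, record `d` waits alone until its countdown ends at second 13, and record `e` is sent at second 23.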

Configure the batch wait time based on your processing needs. You might reduce the batch wait time to ensure all data is processed within a specified time frame or to make regular contact with pipeline destinations. Use the default or increase the wait time if you prefer not to process partial or empty batches.

1.1.6. Maximum Record Size

Most data formats have a property that limits the maximum size of the record that an origin can parse. For example, the delimited data format has a Max Record Length property, the JSON data format has Max Object Length, and the text data format has Max Line Length.

When the origin processes data that is larger than the specified length, the behavior differs based on the origin and the data format. For example, with some data formats, oversized records are handled based on the record error handling configured for the origin. With other data formats, the origin might truncate the data. For details on how an origin handles size overruns for each data format, see the "Data Formats" section of the origin documentation.

When available, the maximum record size properties are limited by the Data Collector parser buffer size, which is 1048576 bytes by default. So, when raising the maximum record size property in the origin does not change the origin's behavior, you might need to increase the Data Collector parser buffer size by configuring the parser.limit property in the Data Collector configuration file. For more information, see Configuring Data Collector in the Data Collector documentation.

Note that most of the maximum record size properties are specified in characters, while the Data Collector limit is defined in bytes.
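
The difference between characters and bytes matters for multibyte text. The sketch below only illustrates the arithmetic and is not how Data Collector performs the check: the function `check_record` is hypothetical, and UTF-8 encoding is assumed.

```python
PARSER_LIMIT_BYTES = 1048576  # default parser buffer size stated above


def check_record(record, max_chars, buffer_bytes=PARSER_LIMIT_BYTES):
    """Compare a record against a character limit and a byte limit.

    Hypothetical helper; assumes the record is encoded as UTF-8.
    """
    char_count = len(record)
    byte_count = len(record.encode("utf-8"))
    return {
        "chars": char_count,
        "bytes": byte_count,
        "within_char_limit": char_count <= max_chars,
        "within_byte_limit": byte_count <= buffer_bytes,
    }


# 400,000 CJK characters take 3 bytes each in UTF-8.
report = check_record("数" * 400_000, max_chars=500_000)
```

In this example the record is under a 500,000-character limit but occupies 1,200,000 bytes, which is more than the default buffer size.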

1.1.7. File Compression Formats

Origins that read files can read uncompressed files, compressed files, archives, and compressed archives.

Hadoop FS reads compressed files automatically. For all other file-based origins, you indicate the compression format in the origin.

The following table lists the supported file types by extension:

Compression Format	Description
Uncompressed	Processes uncompressed files of the configured data format.
Compressed	Processes files compressed by the following compression formats: gzip, bzip2, xz, lzma, Pack200, DEFLATE, Z
Archive	Processes files archived by the following archive formats: 7z, ar, arj, cpio, dump, tar, zip
Compressed Archive	Processes files in compressed archives created by supported compression and archive formats.
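
As a rough analogy for indicating the compression format in the origin, the sketch below reads text lines with an explicitly named format. It uses only Python's standard library and covers a few of the formats listed above. The function `read_lines` and the format names passed to it are assumptions for illustration, not Data Collector property values.

```python
import bz2
import gzip
import lzma
import tempfile
from pathlib import Path

# Hypothetical format names mapped to standard-library openers.
OPENERS = {
    "none": open,
    "gzip": gzip.open,
    "bzip2": bz2.open,
    "xz": lzma.open,
}


def read_lines(path, compression="none"):
    """Read text lines from a file stored in the named compression format."""
    opener = OPENERS[compression]
    with opener(path, "rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


# Demo: write the same content in each format, then read it back.
results = {}
with tempfile.TemporaryDirectory() as tmp:
    for compression, opener in OPENERS.items():
        path = Path(tmp) / f"sample.{compression}"
        with opener(path, "wt", encoding="utf-8") as handle:
            handle.write("id,name\n1,alpha\n")
        results[compression] = read_lines(path, compression)
```

Each entry in `results` holds the same two lines regardless of how the file was stored.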

1.1.8. Previewing Raw Source Data

Some origins allow you to preview raw source data. Preview raw source data when reviewing the data might help with origin configuration.

When you preview file data, you can use the real directory and actual source file. Or when appropriate, you might use a different file that is similar to the source.

When you preview Kafka data, you enter the connection information for the Kafka cluster.

The data used for the raw source preview in an origin stage is not used when previewing data for the pipeline.

To preview raw source data:
In the Properties panel for the origin stage, click the Raw Preview tab.
For a Directory or File Tail origin, enter a directory and file name.
For a Kafka Consumer or Kafka Multitopic Consumer, enter the following information:

Kafka Raw Preview Property	Description
Topic	Kafka topic to read.
Partition	Partition to read.
Broker Host	Broker host name. Use any broker associated with the partition.
Broker Port	Broker port number.
Max Wait Time (secs)	Maximum amount of time the preview waits to receive data from Kafka.

Click Preview.

The Raw Source Preview area displays the preview.
