StreamSets Tutorial Series, Part 3: Data Formats


1.   Data Formats

1.1. Data Formats Overview

Data formats - such as Avro, JSON, and log - are methods to encode data that adhere to generally accepted specifications.

The way that stages process data can be similar based on the stage type and the type of data being processed. For example, file-based origins such as Directory and SFTP/FTP Client will typically process data formats the same way. Similarly, message-based destinations such as Kafka Producer and JMS Producer generally process data formats the same way.

The documentation for each stage has a "Data Formats" section of documentation that contains processing details.

This chapter includes the details that are too complex to summarize in the Data Formats sections. If you have browsed and discovered this chapter and want to read on, feel free. But for stage-related details, see the stage documentation.

For information on the different data formats supported by each origin and destination, see Data Format Support.

1.2. Delimited Data Root Field Type

When reading delimited data, Data Collector can create records that use the list or list-map root field type.

When Data Collector creates records for delimited data, it creates a single root field of the specified type and writes the delimited data within the root field.

Use the default list-map root field type to easily process delimited data.

List-Map

Provides easy use of field names or column positions in expressions. Recommended for all new pipelines.

A list-map root field type results in a structure that preserves the order of data, as follows:

/<first header>:<value>
/<second header>:<value>
/<third header>:<value>
...

For example, with the list-map root field type, the following delimited rows:

TransactionID,Type,UserID
0003420303,04,362
0003420304,08,1008

are converted to records as follows:

/TransactionID: 0003420303
/Type: 04
/UserID: 362
 
/TransactionID: 0003420304
/Type: 08
/UserID: 1008

If data does not include a header or if you choose to ignore a header, list-map records use the column position as a header as follows:

0: <value>
1: <value>
2: <value>

For example, when you ignore the header for the same data, you get the following records:

0: 0003420303
1: 04
2: 362
 
0: 0003420304
1: 08
2: 1008

In an expression, you can use the field name or the column position with a standard record function to call a field. For example, you can use either of the following record:value() expressions to return data in the TransactionID field:

${record:value('/TransactionID')}
${record:value('[0]')}

Note: When writing scripts for scripting processors, such as the Jython Evaluator or JavaScript Evaluator, you should treat list-map records as maps.
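For example, here is a minimal Jython Evaluator sketch that reads a field from a list-map record by header name; the TransactionIDLength output field is just an illustrative name:

# Jython Evaluator script sketch: a list-map record behaves like a map (dict) in the script
for record in records:
  try:
    # Read a field by header name rather than by column position
    txn_id = record.value['TransactionID']
    # Add an illustrative derived field, then pass the record downstream
    record.value['TransactionIDLength'] = len(txn_id)
    output.write(record)
  except Exception as e:
    error.write(record, str(e))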

For more information about standard record functions, see Record Functions.

List

Provides continued support for pipelines created before version 1.1.0. Not recommended for new pipelines.

A list root field type results in a list with an index for each header position and a map containing each header and its associated value, as follows:

0
   /header = <first header>
   /value = <value for first header>
1
   /header = <second header>
   /value = <value for second header>
2
   /header = <third header>
   /value = <value for third header>
...

For example, the same delimited rows described above are converted to records as follows:

0
   /header = TransactionID
   /value = 0003420303
1
   /header = Type
   /value = 04
2
   /header = UserID
   /value = 362
 
0
   /header = TransactionID
   /value = 0003420304
1
   /header = Type
   /value = 08
2
   /header = UserID
   /value = 1008

If the data does not include a header or if you choose to ignore a header, the list records omit the header from the map as follows:

0
   /value = <value>
1
   /value = <value>
2
   /value = <value>
...

For example, when you ignore the header for the same sample data, you get the following records:

0
   /value = 0003420303
1
   /value = 04
2
   /value = 362
 
0
   /value = 0003420304
1
   /value = 08
2
   /value = 1008

For data in the list records, you should either use the delimited data functions or include the full field path in standard record functions. For example, you can use the record:dValue() delimited data function to return the value associated with the specified header.

Tip: You can use the record:dToMap() function to convert a list record to a map, and then use standard functions for record processing.
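For example, assuming the sample delimited data above, the following expressions are a sketch of how you might use these functions: the first returns the value associated with the TransactionID header, and the second converts the list record to a map:

${record:dValue('TransactionID')}
${record:dToMap()}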

For more information about record:dToMap and a full list of delimited data record functions and their syntax, see Delimited Data Record Functions.

1.3. Log Data Format

When you use an origin to read log data, you define the format of the log files to be read.

You can read log files that use the following log formats:

Common Log Format

A standardized text format used by web servers to generate log files. Also known as the NCSA (National Center for Supercomputing Applications) Common Log format.

Combined Log Format

A standardized text format based on the common log format that includes additional information. Also known as the Apache/NCSA Combined Log Format.

Apache Error Log Format

The standardized error log format generated by the Apache HTTP Server 2.2.

Apache Access Log Custom Format

A customizable access log generated by the Apache HTTP Server 2.2. Use the Apache HTTP Server version 2.2 syntax to define the format of the log file.

Regular Expression

Use a regular expression to define the structure of log data, and then assign the field or fields represented by each group.

Use any valid regular expression.

Grok Pattern

Use a grok pattern to define the structure of log data. You can use the grok patterns supported by Data Collector. You can also define a custom grok pattern and then use it as part of the log format.
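For example, the following is a sketch of two grok patterns, assuming the standard built-in pattern names are available: the first uses a single predefined pattern for an NCSA common log line, and the second composes smaller patterns into named fields:

%{COMMONAPACHELOG}
%{IP:client} %{WORD:method} %{URIPATHPARAM:request}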

For more information about supported grok patterns, see Defining Grok Patterns.

log4j

A customizable format generated by the Apache Log4j 1.2 logging utility. You can use the default format or specify a custom format. Use the Apache Log4j version 1.2 syntax to define the format of the log file.

You can also specify the action to take when the origin encounters an error while parsing a line. You can skip the line and optionally log an error. If you know that the unparsable information is part of a stack trace, you can have the origin add the unparsable information as a stack trace to the record generated from the previous parsable line.

Common Event Format (CEF)

A customizable event format used by security devices to generate log events. CEF is the native format for HP ArcSight.

Log Event Extended Format (LEEF)

A customizable event format used by security devices to generate log events. LEEF is the native format for IBM Security QRadar.

1.4. NetFlow Data Processing

You can use Data Collector to process NetFlow 5 and NetFlow 9 data.

When processing NetFlow 5 data, Data Collector processes flow records based on information in the packet header. Data Collector expects multiple packets with header and flow records sent on the same connection, with no bytes in between. As a result, when processing NetFlow 5 messages, you have no data-related properties to configure.

When processing template-based NetFlow 9 messages, Data Collector generates records based on cached templates, information in the packet header, and NetFlow 9 configuration properties in the stage. The NetFlow 9 properties display in different locations depending on the type of stage that you use:

  • For origins that process messages directly from the network, such as the UDP Source origin, you configure the NetFlow 9 properties on a NetFlow 9 tab.
  • For most origins and processors that process other types of data, such as JSON or protobuf, you configure NetFlow 9 properties on a Data Formats tab after you select Datagram or NetFlow as the data format.
  • For the TCP Server, you specify the NetFlow TCP mode, and then configure NetFlow 9 properties on a NetFlow 9 tab.

When processing NetFlow 5 messages, the stage ignores any configured NetFlow 9 properties.

Caching NetFlow 9 Templates

Processing NetFlow 9 data requires caching the templates used to process the messages. When you configure NetFlow 9 properties, you can specify the maximum number of templates to cache and how long to allow an unused template to remain in the cache. You can also configure the stage to allow an unlimited number of templates in the cache for an unlimited amount of time.

When you configure caching limitations, templates can be ejected from the cache under the following conditions:

  • When the cache is full and a new template appears.
  • When a template exceeds the specified idle period.

Configure NetFlow 9 caching properties to allow the stage to retain templates for processing in a logical way. When a record requires the use of a template that is not in the cache, the record is passed to the stage for error handling.

For example, say you use the UDP Source origin to process NetFlow 9 data from five servers. Each server sends data using a different template, so to process data from these five servers, you can set the cache size to five templates. But to allow for additional servers that might be added later, you might set the template cache to a higher number.

Most servers resend templates periodically, so you might take this refresh interval into account when you configure the cache timeout.

For example, say your server resends templates every three minutes. If you set the cache timeout for two minutes, then a template that hasn't been used in two minutes gets evicted. If the server sends a packet that requires the evicted template, the stage generates an error record because the template is not available. If you set the cache timeout for four minutes and an unlimited cache size, then the templates from all servers remain in the cache until replaced by a new version of the template.

Note: Data Collector keeps the cached templates in memory. If you need to cache large numbers of templates, you might want to increase the Data Collector heap size accordingly. For more information, see Java Heap Size in the Data Collector documentation.
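For example, here is a sketch of increasing the heap size through the SDC_JAVA_OPTS environment variable; the exact file and values depend on your installation, such as sdc-env.sh for a tarball installation:

export SDC_JAVA_OPTS="-Xmx4096m -Xms4096m ${SDC_JAVA_OPTS}"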

1.4.1.             NetFlow 5 Generated Records

When processing NetFlow 5 messages, Data Collector ignores any configured NetFlow 9 properties.

The generated NetFlow 5 records include processed data as fields in the record, with no additional metadata, as follows:

 {
      "tcp_flags" : 27,
      "last" : 1503089880145,
      "length" : 360,
      "raw_first" : 87028333,
      "flowseq" : 0,
      "count" : 7,
      "proto" : 6,
      "dstaddr" : 1539135649,
      "seconds" : 1503002821,
      "id" : "27a647b5-9e3a-11e7-8db3-874a63bd401c",
      "engineid" : 0,
      "srcaddr_s" : "172.17.0.4",
      "sender" : "/0:0:0:0:0:0:0:1",
      "srcas" : 0,
      "readerId" : "/0:0:0:0:0:0:0:0:9999",
      "src_mask" : 0,
      "nexthop" : 0,
      "snmpinput" : 0,
      "dPkts" : 11214,
      "raw_sampling" : 0,
      "timestamp" : 1503002821000,
      "enginetype" : 0,
      "samplingint" : 0,
      "dstaddr_s" : "91.189.88.161",
      "samplingmode" : 0,
      "srcaddr" : -1408172028,
      "first" : 1503089849333,
      "raw_last" : 87059145,
      "dstport" : 80,
      "nexthop_s" : "0.0.0.0",
      "version" : 5,
      "uptime" : 0,
      "dOctets" : 452409,
      "nanos" : 0,
      "dst_mask" : 0,
      "packetid" : "b58f5750-7ccd-1000-8080-808080808080",
      "srcport" : 51156,
      "snmponput" : 0,
      "tos" : 0,
      "dstas" : 0
   }

1.4.2.             NetFlow 9 Generated Records

NetFlow 9 records are generated based on the Record Generation Mode that you select for the NetFlow 9 stage properties. You can include "interpreted" or processed values, raw data, or both in NetFlow 9 records.

NetFlow 9 records can include the following fields:

flowKind

Indicates the type of flow to be processed:

  • FLOWSET for data from a flowset.
  • OPTIONS for data from an options flow.

Included in all NetFlow 9 records.

values

A map field with field names and values as processed by the stage based on the template specified in the packet header.

Included in NetFlow 9 records when you configure the Record Generation Mode property to include interpreted data in the record.

packetHeader

A map field containing information about the packet. Typically includes information such as the source ID and the number of records in the packet.

Included in all NetFlow 9 records.

rawValues

A map field with the fields defined by the associated template and the raw, unprocessed bytes for those fields.

Included in NetFlow 9 records when you configure the Record Generation Mode property to include raw data in the record.

Sample Raw and Interpreted Record

When you set the Record Generation Mode property to Raw and Interpreted Data, the resulting record includes all of the possible NetFlow 9 fields, as follows:

{
      "flowKind" : "FLOWSET",
      "values" : {
         "ICMP_TYPE" : 0,
         "L4_DST_PORT" : 9995,
         "TCP_FLAGS" : 0,
         "L4_SRC_PORT" : 52767,
         "INPUT_SNMP" : 0,
         "FIRST_SWITCHED" : 86400042,
         "PROTOCOL" : 17,
         "IN_BYTES" : 34964,
         "OUTPUT_SNMP" : 0,
         "LAST_SWITCHED" : 86940154,
         "IPV4_SRC_ADDR" : "127.0.0.1",
         "SRC_AS" : 0,
         "IN_PKTS" : 29,
         "IPV4_DST_ADDR" : "127.0.0.1",
         "DST_AS" : 0,
         "SRC_TOS" : 0,
         "FORWARDING_STATUS" : 0
      },
      "packetHeader" : {
         "flowRecordCount" : 8,
         "sourceIdRaw" : "AAAAAQ==",
         "version" : 9,
         "sequenceNumber" : 0,
         "unixSeconds" : 1503002821,
         "sourceId" : 1,
         "sysUptimeMs" : 0
      },
      "rawValues" : {
         "OUTPUT_SNMP" : "AAA=",
         "IN_BYTES" : "AACIlA==",
         "LAST_SWITCHED" : "BS6Z+g==",
         "IPV4_SRC_ADDR" : "fwAAAQ==",
         "SRC_AS" : "AAA=",
         "IPV4_DST_ADDR" : "fwAAAQ==",
         "IN_PKTS" : "AAAAHQ==",
         "DST_AS" : "AAA=",
         "FORWARDING_STATUS" : "AA==",
         "SRC_TOS" : "AA==",
         "ICMP_TYPE" : "AAA=",
         "TCP_FLAGS" : "AA==",
         "L4_DST_PORT" : "Jws=",
         "L4_SRC_PORT" : "zh8=",
         "INPUT_SNMP" : "AAA=",
         "FIRST_SWITCHED" : "BSZcKg==",
         "PROTOCOL" : "EQ=="
      }
   }

Sample Interpreted Record

When you set the Record Generation Mode property to Interpreted Only, the resulting record omits the rawValues field from the record, as follows:

{
      "flowKind" : "FLOWSET",
      "values" : {
         "ICMP_TYPE" : 0,
         "L4_DST_PORT" : 9995,
         "TCP_FLAGS" : 0,
         "L4_SRC_PORT" : 52767,
         "INPUT_SNMP" : 0,
         "FIRST_SWITCHED" : 86400042,
         "PROTOCOL" : 17,
         "IN_BYTES" : 34964,
         "OUTPUT_SNMP" : 0,
         "LAST_SWITCHED" : 86940154,
         "IPV4_SRC_ADDR" : "127.0.0.1",
         "SRC_AS" : 0,
         "IN_PKTS" : 29,
         "IPV4_DST_ADDR" : "127.0.0.1",
         "DST_AS" : 0,
         "SRC_TOS" : 0,
         "FORWARDING_STATUS" : 0
      },
      "packetHeader" : {
         "flowRecordCount" : 8,
         "sourceIdRaw" : "AAAAAQ==",
         "version" : 9,
         "sequenceNumber" : 0,
         "unixSeconds" : 1503002821,
         "sourceId" : 1,
         "sysUptimeMs" : 0
      }
   }

Sample Raw Record

When you set the Record Generation Mode property to Raw Only, the resulting record omits the values field that contains processed data, as follows:

{
      "flowKind" : "FLOWSET",
       "packetHeader" : {
         "flowRecordCount" : 8,
         "sourceIdRaw" : "AAAAAQ==",
         "version" : 9,
         "sequenceNumber" : 0,
         "unixSeconds" : 1503002821,
         "sourceId" : 1,
         "sysUptimeMs" : 0
      },
      "rawValues" : {
         "OUTPUT_SNMP" : "AAA=",
         "IN_BYTES" : "AACIlA==",
         "LAST_SWITCHED" : "BS6Z+g==",
         "IPV4_SRC_ADDR" : "fwAAAQ==",
         "SRC_AS" : "AAA=",
         "IPV4_DST_ADDR" : "fwAAAQ==",
         "IN_PKTS" : "AAAAHQ==",
         "DST_AS" : "AAA=",
         "FORWARDING_STATUS" : "AA==",
         "SRC_TOS" : "AA==",
         "ICMP_TYPE" : "AAA=",
         "TCP_FLAGS" : "AA==",
         "L4_DST_PORT" : "Jws=",
         "L4_SRC_PORT" : "zh8=",
         "INPUT_SNMP" : "AAA=",
         "FIRST_SWITCHED" : "BSZcKg==",
         "PROTOCOL" : "EQ=="
      }
   }

1.5. Protobuf Data Format Prerequisites

Perform the following prerequisites before reading or writing protobuf data.

Data Collector processes data based on a protobuf descriptor file. The descriptor file (.desc) describes one or more message types. When you configure a stage to process data, you specify the message type to use.

Before processing protobuf data, perform the following tasks:

Generate the protobuf descriptor file

When you generate the descriptor file, you need the .proto files that define the message type and any related dependencies.

To generate the descriptor file, use the protobuf protoc command with the descriptor_set_out flag and the .proto files to use, as follows:

protoc --include_imports --descriptor_set_out=<filename>.desc <filename>.proto <filename>.proto <filename>.proto...

For example, the following command creates an Engineer descriptor file that describes the Engineer message type based on information from the Engineer, Person, and Extension proto files:

protoc --include_imports --descriptor_set_out=Engineer.desc Engineer.proto Person.proto Extensions.proto
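As a rough sketch, an Engineer.proto that imports Person.proto might look like the following; these are hypothetical message definitions, not the actual schema:

// Engineer.proto (hypothetical sketch)
syntax = "proto2";

import "Person.proto";

message Engineer {
  required Person person = 1;
  optional string employee_id = 2;
}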

For more information about protobuf and the protoc command, see the protobuf documentation.

Store the descriptor file in the Data Collector resources directory

Save the generated descriptor file in the $SDC_RESOURCES directory. For more information about environment variables, see Data Collector Environment Configuration in the Data Collector documentation.

For a list of origins and destinations that process protobuf data, see Data Format Support.

1.6. SDC Record Data Format

SDC Record is a proprietary data format that Data Collector uses to generate error records. Data Collector can also use the data format to read and write data.

Use the SDC Record data format to process data from another pipeline or to generate data to be processed by another pipeline.

For a list of origins and destinations that process SDC Record data, see Data Format Support.

1.7. Text Data Format with Custom Delimiters

By default, the text data format creates records based on line breaks, creating a record for each line of text. You can configure origins to create records based on custom delimiters.

Use custom delimiters when the origin system uses delimiters to separate logical sections of data that you want to use as records. A custom delimiter might be as simple as a semicolon or might be a set of characters. You can even use an XML tag as a custom delimiter to read XML data.

Note: When using a custom delimiter, the origin uses the delimiter characters to create records, ignoring new lines.

For most origins, you can include the custom delimiters in records or you can remove them. For the Hadoop FS and MapR FS origins, you cannot include the custom delimiters in records.

For example, say you configure the Directory origin to process a file with the following text, using a semicolon as a delimiter, and discarding the delimiter:

8/12/2016 6:01:00 unspecified error message;8/12/2016 
6:01:04 another error message;8/12/2016 6:01:09 just a warning message;

The origin generates the following records, with the data in a single text field:

Record 1: 8/12/2016 6:01:00 unspecified error message

Record 2: 8/12/2016 
6:01:04 another error message

Record 3: 8/12/2016 6:01:09 just a warning message

Note that the origin retains the line break, but does not use it to create a separate record.

1.7.1.             Processing XML Data with Custom Delimiters

You can use custom delimiters with the text data format to process XML data. You might use the text data format to process XML data with no root element, which cannot be processed with the XML data format.

When using the text data format in the origin to read XML data, you can use the XML Parser processor downstream to parse the XML data.

For example, the following XML document is valid and is best processed using the XML data format:

<?xml version="1.0" encoding="UTF-8"?>
<root>
 <msg>
    <time>8/12/2016 6:01:00</time>
    <request>GET /index.html 200</request>
 </msg>
 <msg>
    <time>8/12/2016 6:03:43</time>
    <request>GET /images/sponsored.gif 304</request>
 </msg>
</root>

However, the following XML document does not include an XML prolog or root element, so it is invalid:

<msg>
    <time>8/12/2016 6:01:00</time>
    <request>GET /index.html 200</request>
</msg>
<msg>
    <time>8/12/2016 6:03:43</time>
    <request>GET /images/sponsored.gif 304</request>
</msg>

You can use the text data format with a custom delimiter to process the invalid XML document. To do so, use </msg> as the custom delimiter to separate data into records, and make sure to include the delimiter in the record.

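As a sketch, the relevant text data format properties in the origin would be set roughly as follows; exact property names can vary between Data Collector versions:

Use Custom Delimiter: enabled
Custom Delimiter: </msg>
Include Custom Delimiter: enabled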
When origins process text data, they write record data into a single text field named "text". When the Directory origin processes the invalid XML document, it creates two records:

Record 1, text field: <msg> <time>8/12/2016 6:01:00</time> <request>GET /index.html 200</request> </msg>

Record 2, text field: <msg> <time>8/12/2016 6:03:43</time> <request>GET /images/sponsored.gif 304</request> </msg>

You can then configure the XML Parser processor to parse the XML data in the text field.

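As a sketch, the XML Parser might be configured roughly as follows; each record's text field already holds a single msg document, so no delimiter element is needed, and the property names here are assumptions:

Field to Parse: /text
Target Field: /text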
The XML Parser converts the time and request elements to list fields within the text map field. The resulting records are as follows, with data types shown in angle brackets ( < > ):

Record 1, text <map>:
   time <list>
      0 <map>
         value <string>: 8/12/2016 6:01:00
   request <list>
      0 <map>
         value <string>: GET /index.html 200

Record 2, text <map>:
   time <list>
      0 <map>
         value <string>: 8/12/2016 6:03:43
   request <list>
      0 <map>
         value <string>: GET /images/sponsored.gif 304

1.8. Whole File Data Format

You can use the whole file data format to move entire files from an origin system to a destination system. With the whole file data format, you can transfer any type of file. You cannot perform additional processing on whole file data in the pipeline.

When moving whole files, Data Collector streams file data from the origin system and writes the data to the destination system. Data Collector writes the files based on the directory and file name defined in the destination.

You can limit the resources used to transfer data by specifying a transfer rate. By default, whole file transfers use available resources as needed.

When the origin system provides checksum metadata, you can configure the origin to verify the checksum. Destinations that generate events can include checksum information in event records.

Most destinations allow you to define access permissions for written files. By default, written files use the default permissions of the destination system.

Note: During data preview, a whole file pipeline displays a single record instead of the number of records configured for the preview batch size.

You can use the following origins to read whole files:

  • Amazon S3
  • Directory
  • Google Cloud Storage
  • Hadoop FS Standalone
  • MapR FS Standalone
  • SFTP/FTP Client

You can use the following destinations to write whole files:

  • Amazon S3
  • Azure Data Lake Store
  • Google Cloud Storage
  • Hadoop FS
  • Local FS
  • MapR FS

1.8.1.             Basic Pipeline

A pipeline that processes whole files includes an origin and one or more destinations that support the whole file data format.

You can include certain processors to read the file data or to modify file information included in the record, such as the file name or size. You cannot use processors to transform data in the file.

A basic pipeline that moves whole files from Amazon S3 to HDFS connects an Amazon S3 origin directly to a Hadoop FS destination.

1.8.2.             Whole File Records

When reading whole files, the origin produces a record with two fields:

  • fileref - Contains a reference that enables streaming the file from the origin system to the destination system. You can use scripting processors to read the fileref field. You cannot modify information in the fileref field.
  • fileInfo - A map of file properties, such as file path, file name, and file owner. The details differ based on the origin system. You can use processors to modify information in the fileInfo field as needed.

Tip: You can use data preview to determine the information and field names that are included in the fileInfo field.

1.8.3.             Additional Processors

You can use certain processors in a whole file pipeline to read the file data in the fileref field or to modify file information in the fileInfo field, such as the file owner or permissions.

Tip: You can use data preview to determine the information and field names that are included in the fileInfo field. The information and field names can differ based on the origin system.

You can use the following additional processors in a whole file pipeline:

Expression Evaluator

Use to update fileInfo fields.

For example, you might use the Expression Evaluator to update the owner of the file.

Groovy Evaluator, JavaScript Evaluator, and Jython Evaluator

Use to access the fileref field by creating an input stream in the code using the getInputStream() API.

For example, you might use the Groovy Evaluator to use Groovy code that reads the file data in the fileref field and then creates new records with the data.

Note: Be sure to close the input stream in the code when the processor has finished reading the stream.
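For example, here is a minimal Jython Evaluator sketch that reads from the whole file stream, assuming the file reference field is accessible in the script as fileRef:

# Jython Evaluator script sketch for whole file records
for record in records:
  try:
    # Obtain an input stream for the file referenced by the record
    input_stream = record.value['fileRef'].getInputStream()
    try:
      data = input_stream.read()   # demonstration only; real scripts typically read in a loop or into a buffer
    finally:
      input_stream.close()         # always close the stream when done
    output.write(record)
  except Exception as e:
    error.write(record, str(e))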

For information about processing whole files with the Groovy Evaluator, see Accessing Whole File Format Records.

For information about processing whole files with the JavaScript Evaluator, see Accessing Whole File Format Records.

For information about processing whole files with the Jython Evaluator, see Accessing Whole File Format Records.

Stream Selector

Use to route files based on information in the fileInfo fields.

For example, you might use the Stream Selector to route records to different destinations based on the size of the file: route files under 2 MB to a local file system, and use the default stream to route all larger files to HDFS.

1.8.4.             Defining the Transfer Rate

By default, the pipeline uses all available resources to transfer whole file data. Define a transfer rate to limit the resources that a whole file pipeline uses. For example, you might specify a transfer rate to enable running multiple whole file pipelines simultaneously or to reserve resources for other processing.

Specify a transfer rate by configuring the Rate per Second property in the origin, in the whole file data format properties.

The Rate per Second property is not used by default, allowing the pipeline to use all available resources. If you specify a transfer rate, the unit of measure is bytes per second by default. To use a different unit of measure, include the unit in an expression.

For example, if you enter 1000, then the pipeline uses a transfer rate of 1000 bytes/second. To specify a rate of 10 MB/second, you can use the following expression: ${10 * MB}.

1.8.5.             Writing Whole Files

When writing whole files, you configure a File Name Expression property in the destination. The expression defines the name for the output file.

Each whole file origin includes file information in the fileInfo fields. So you can easily base the output file names on the original file names from the source system.

The following list shows the field that holds the input file name for each origin, along with a basic expression that names the output file based on the input file name:

Directory
   File name field path: /fileInfo/filename
   Base expression: ${record:value('/fileInfo/filename')}

SFTP/FTP Client
   File name field path: /fileInfo/filename
   Base expression: ${record:value('/fileInfo/filename')}

Amazon S3
   File name field path: /fileInfo/objectKey
   Base expression: ${record:value('/fileInfo/objectKey')}

Note that the objectKey field can include a path as well as a file name. Use the Amazon S3 expression only when the object key is just a file name.

Example

You want a pipeline to pass whole files from a local directory to Amazon S3. For the output file name, you want to append the .json file extension to the original file name.

The Directory origin stores the original file name in the /fileInfo/filename field, so you can use the following expression for the Amazon S3 File Name Expression property:

${str:concat(record:value('/fileInfo/filename'), ".json")}

Or, more simply...

${record:value('/fileInfo/filename')}.json

Access Permissions

By default, when using the whole file data format, output files use the default access permissions defined in the destination system. Most destinations allow you to specify access permissions for output files. Amazon S3 does not allow it.

You can enter an expression to define access permissions. Expressions should evaluate to a symbolic or numeric/octal representation of the permissions you want to use. For example, to make files read-only for all users, the symbolic representation is -r--r--r--. The numeric or octal representation is 0444.

To use the original source file permissions for each output file, you can use the following expression:

${record:value('/fileInfo/permissions')}

This ensures, for example, that a source file with execute permission for only the file owner is written to the destination system with the exact same set of permissions.

Including Checksums in Events

Destinations that generate events can include a checksum for each file.

When you enable checksum use, the destination includes the checksum and the checksum algorithm in the whole file event record. Whole file event records are generated each time the destination completes writing a whole file.

You can use the following algorithms to generate checksums:

  • MD5
  • SHA1
  • SHA256
  • SHA512
  • MURMUR3_32
  • MURMUR3_128

For details about event generation and event records for a specific destination, see the destination documentation. For general information about the event framework, see Dataflow Triggers Overview.

1.9. Reading and Processing XML Data

You can parse XML documents from an origin system with an origin enabled for the XML data format. You can also parse XML documents in a field in a Data Collector record with the XML Parser processor.

You can use the XML data format and the XML Parser to process well-formed XML documents. If you want to process invalid XML documents, you can try using the text data format with custom delimiters. For more information, see Processing XML Data with Custom Delimiters.

Data Collector uses a user-defined delimiter element to determine how it generates records. When processing XML data, you can generate a single record or multiple records from an XML document, as follows:

Generate a single record

To generate a single record from an XML document, do not specify a delimiter element.

When you generate a single record from an XML document, the entire document is written to the record as a map.

Generate multiple records using an XML element

You can generate multiple records from an XML document by specifying an XML element as the delimiter element.

You can use an XML element when the element resides directly under the root element.

Generate multiple records using a simplified XPath expression

You can generate multiple records from an XML document by specifying a simplified XPath expression as the delimiter element.

Use a simplified XPath expression to access data below the first level of elements in the XML document, to access namespaced elements, or to access elements deeper in complex XML documents.

1.9.1.             Creating Multiple Records with an XML Element

You can generate records by specifying an XML element as a delimiter.

When the data you want to use is in an XML element directly under the root element, you can use the element as a delimiter. For example, in the following valid XML document, you can use the msg element as a delimiter element:

<?xml version="1.0" encoding="UTF-8"?>
<root>
 <msg>
    <time>8/12/2016 6:01:00</time>
    <request>GET /index.html 200</request>
 </msg>
 <msg>
    <time>8/12/2016 6:03:43</time>
    <request>GET /images/sponsored.gif 304</request>
 </msg>
</root>

Processing the document with the msg delimiter element results in two records.

Note: When configuring the Delimiter Element property, enter just the XML element name, without the surrounding angle brackets. That is, use msg instead of <msg>.

Using XML Elements with Namespaces

When you use an XML element as a delimiter, Data Collector uses the exact element name that you specify to generate records.

If you include a namespace prefix in the XML element, you must also define the namespace in the stage. Then, Data Collector can process the specified XML element with the prefix.

For example, say you use the a:msg element as the delimiter element and define the Company A namespace in the stage. Then, Data Collector processes only the a:msg elements in the Company A namespace. It generates one record for the following document, ignoring the data in the c:msg elements:

<?xml version="1.0" encoding="UTF-8"?>
<root>
        <a:msg xmlns:a="http://www.companyA.com">
               <time>8/12/2016 6:01:00</time>
               <request>GET /index.html 200</request>
        </a:msg>
        <c:msg xmlns:c="http://www.companyC.com">
               <item>Shoes</item>
               <item>Magic wand</item>
               <item>Tires</item>
        </c:msg>
        <c:msg xmlns:c="http://www.companyC.com">
               <time>8/12/2016 6:03:43</time>
               <request>GET /images/sponsored.gif 304</request>
        </c:msg>
</root>

In the stage, you define the Namespace property using the prefix "a" and the namespace URI: http://www.companyA.com.

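As a sketch, a Directory origin configured to process this data would use roughly the following settings:

Data Format: XML
Delimiter Element: a:msg
Namespace: prefix a, URI http://www.companyA.com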
1.9.2.             Creating Multiple Records with an XPath Expression

You can generate records from an XML document using a simplified XPath expression as the delimiter element.

Use a simplified XPath expression to access data below the first level of elements in the XML document. You can also use an XPath expression to access namespaced elements or elements deeper in complex XML documents.

For example, say an XML document has record data in a second level msg element, as follows:

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <data>
        <msg>
            <time>8/12/2016 6:01:00</time>
            <request>GET /index.html 200</request>
        </msg>
    </data>
    <data>
        <msg>
            <time>8/12/2016 6:03:43</time>
            <request>GET /images/sponsored.gif 304</request>
        </msg>
    </data>
</root>

Since the msg element is not directly under the root element, you cannot use it as a delimiter element. But you can use the following simplified XPath expression to access the data:

/root/data/msg

Or, if the first level data element can sometimes be info, you can use the following XPath expression to access any data in the msg element that is two levels deep:

/root/*/msg

Using XPath Expressions with Namespaces

When using an XPath expression to process XML documents, you can process data within a namespace. To access data in a namespace, define the XPath expression, then use the Namespace property to define the prefix and definition of the namespace.

For example, the following XML document includes two namespaces, one for Company A and one for Company C:

<?xml version="1.0" encoding="UTF-8"?>
<root>
        <a:data xmlns:a="http://www.companyA.com">
               <msg>
                       <time>8/12/2016 6:01:00</time>
                       <request>GET /index.html 200</request>
               </msg>
        </a:data>
        <c:data xmlns:c="http://www.companyC.com">
               <sale>
                       <item>Shoes</item>
                       <item>Magic wand</item>
                       <item>Tires</item>
               </sale>
        </c:data>
        <a:data xmlns:a="http://www.companyA.com">
               <msg>
                       <time>8/12/2016 6:03:43</time>
                       <request>GET /images/sponsored.gif 304</request>
               </msg>
        </a:data>
</root>

To create records from data in the msg element in the Company A namespace, you can use either of the following XPath expressions:

/root/a:data/msg
/root/*/msg

Then define the Namespace property using the prefix "a" and the namespace URI: http://www.companyA.com.

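As a sketch, a Directory origin configured to process this data would use roughly the following settings:

Data Format: XML
Delimiter Element: /root/a:data/msg
Namespace: prefix a, URI http://www.companyA.com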

Simplified XPath Syntax

When using an XPath expression to generate records from an XML document, use a simplified version of the abbreviated XPath syntax.

Use the abbreviated XPath syntax with the following restrictions:

Operators and XPath functions

Do not use operators or XPath functions in the XPath expression.

Axis selectors

Use only the single slash ( / ) child selector. The descendant-or-self double slash selector ( // ) is not supported.

Node tests

Only node name tests are supported. Note the following details:

  • You can use namespaces with node names, defined with an XPath namespace prefix. For more information, see Using XPath Expressions with Namespaces.
  • Do not use namespaces for attributes.
  • Elements can include predicates.

Predicates

  • You can use the position predicate or attribute value predicate with elements, not both.
  • Use the following syntax to specify a position predicate:

    /<element>[<position number>]

  • Use the following syntax to specify an attribute value predicate:

    /<element>[@<attribute name>='<attribute value>']

  • You can use the asterisk wildcard as the attribute value. Surround the value in single quotation marks.

For more information, see Predicates in XPath Expressions.

Wildcard character

You can use the asterisk ( * ) to represent a single element, as follows:

/root/*/msg

You can also use the asterisk to represent any attribute value. Use the asterisk to represent the entire value, as follows:

/root/info[@attribute='*']/msg

Sample XPath Expressions

Here are some examples of valid and invalid XPath expressions:

Valid expressions

The following expression selects every element beneath the first top-level element.

/*[1]/*

The following expression selects every value element in the xyz namespace under an allvalues element with a source attribute set to "XYZ". The allvalues element is below a top-level element named root; both root and allvalues are in the abc namespace:

/abc:root/abc:allvalues[@source='XYZ']/xyz:value

Invalid expressions

The following expressions are not valid:

  • /root//value - Invalid because the descendant-or-self axis ( // ) is not supported.
  • /root/collections[last()]/value - Invalid because XPath functions, such as last(), are not supported.
  • /root/collections[@source='XYZ'][@sequence='2'] - Invalid because multiple predicates for an element are not supported.
  • /root/collections[@source="ABC"] - Invalid because the attribute value should be in single quotation marks.
  • /root/collections[@source] - Invalid because the expression uses an attribute without defining the attribute value.

Predicates in XPath Expressions

You can use predicates in XPath expressions to process a subset of element instances. You can use a position predicate or attribute values predicate with an element, but not both. You can also use a wildcard to define the attribute value.

Position predicate

The position predicate indicates the instance of the element to use in the file. Use a position predicate when the element appears multiple times in a file, and you want to use a particular instance based on the position of the instances in the file, e.g. the first, second, or third time the element appears in the file.

Use the following syntax to specify a position predicate:

/<element>[<position number>]

For example, say the contact element appears multiple times in the file, but you only care about the address data in the first instance in the file. Then you can use a predicate for the element as follows:

/root/contact[1]/address

Attribute value predicate

The attribute value predicate limits the data to elements with the specified attribute value. Use the attribute value predicate when you want to specify an element with a particular attribute value or an element that simply has an attribute value defined.

Use the following syntax to specify an attribute value predicate:

/<element>[@<attribute name>='<attribute value>']

You can use the asterisk wildcard as the attribute value. Surround the value in single quotation marks.

For example, if you only wanted server data with a region attribute set to "west", you can add the region attribute as follows:

/*/server[@region='west']

Predicate Examples

To process all data in a collections element under the apps second-level element, you would use the following simplified XPath expression:

/*/apps/collections

If you only wanted the data under the first instance of apps in the XML document, you would add a position predicate as follows:

/*/apps[1]/collections

To only process data from all app elements in the document where the collections element has a version attribute set to 3, add the version attribute and value as follows:

/*/apps/collections[@version='3']

If you don't care what the value of the attribute is, you can use a wildcard for the attribute value, as follows:

/root/apps/collections[@version='*']

1.9.3.             Including Field XPaths and Namespaces

You can include field XPath expressions and namespaces in the record by enabling the Include Field XPaths property.

When enabled, the record includes the XPath expression for each field as a field attribute and includes each namespace in an xmlns record header attribute. By default, this information is not included in the record.

For example, say you have the following XML document:

<?xml version="1.0" encoding="UTF-8"?>
<bookstore xmlns:prc="http://books.com/price">
   <b:book xmlns:b="http://books.com/book">
      <title lang="en">Harry Potter</title>
      <prc:price>29.99</prc:price>
   </b:book>
   <b:book xmlns:b="http://books.com/book">
      <title lang="en_us">Learning XML</title>
      <prc:price>39.95</prc:price>
   </b:book>
</bookstore>

When you use /*[1]/* as the delimiter element and enable the Include Field XPaths property, Data Collector generates records that include the XPath expression for each field as a field attribute and each namespace as an xmlns record header attribute.

Note: Field attributes and record header attributes are written to destination systems automatically only when you use the SDC RPC data format in destinations. For more information about working with field attributes and record header attributes, and how to include them in records, see Field Attributes and Record Header Attributes.

1.9.4.             XML Attributes and Namespace Declarations

Parsed XML includes XML attributes and namespace declarations in the record as individual fields by default. You can use the Output Field Attributes property to place the information in field attributes instead.

Place the information in field attributes to avoid adding unnecessary information in the record fields.

Note: Field attributes are automatically included in records written to destination systems only when you use the SDC RPC data format in the destination. For more information about working with field attributes, see Field Attributes.

1.9.5.             Parsed XML

When parsing XML documents with the XML data format or the XML Parser processor, Data Collector generates a field that is a map of fields based on nested elements, text nodes, and attributes. Comment elements are ignored.

For example, say you have the following XML document:

<?xml version="1.0" encoding="UTF-8"?>
<root>
  <a:info xmlns:a="http://www.companyA.com">
    <sale>
      <item>Apples</item>
      <item>Bananas</item>
    </sale>
  </a:info>
  <c:info xmlns:c="http://www.companyC.com">
    <sale>
      <item>Shoes</item>
      <item>Magic wand</item>
      <item>Tires</item>
    </sale>
  </c:info>
</root>

To create records for the data in the sale element in both namespaces, you can use a wildcard to represent the second level info element, as follows:

/root/*/sale

Then, you define both namespaces in the origin.

When processing the XML document using the default XML properties, Data Collector produces two records.
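As a rough sketch, assuming the default parsed-XML structure described earlier where each child element becomes a list of maps with a value field, the two records look like this:

Record 1 (sale element in the Company A namespace):
   item <list>
      0 <map>
         value <string>: Apples
      1 <map>
         value <string>: Bananas

Record 2 (sale element in the Company C namespace):
   item <list>
      0 <map>
         value <string>: Shoes
      1 <map>
         value <string>: Magic wand
      2 <map>
         value <string>: Tires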

1.10.   Writing XML Data

When writing XML data, destinations create a valid XML document for each record. The destination requires the record to have a single root field that contains the rest of the record data.

When writing XML data, you can configure the destination to perform the following tasks:

  • Produce a "pretty" output - The destination can add indentation to make the XML data human-readable. This adds additional bytes to the record size.
  • Validate the schema - The destination can validate that the generated XML conforms to the specified schema definition. Records with invalid schemas are handled based on the error handling configured for the destination.

1.10.1.           Record Structure Requirement

When writing XML data, the destination expects all record data under a single root field. When necessary, merge record data into a root field earlier in the pipeline. You can use the Expression Evaluator and Field Remover processors to perform this task.

For example, you can use an Expression Evaluator with the expression ${record:value('/')} to create a root field and copy the entire record under the root field.

Then, you can use a Field Remover to keep only the root field, which removes all other fields.
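As a sketch, the two processors might be configured roughly as follows; /rootField is just an illustrative field name:

Expression Evaluator:
   Output Field: /rootField
   Field Expression: ${record:value('/')}

Field Remover:
   Action: Keep Listed Fields
   Fields: /rootField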

