Amazon S3

Most recent version: v1.0.0

See the changelog of this Data sink type here.

Overview

Onum supports integration with Amazon S3.

Amazon S3 is an object storage service that stores and protects any amount of data for a wide range of use cases, including data lakes, websites, cloud-native applications, backups, archives, machine learning, and analytics.

Select Amazon S3 from the list of Data sink types and click Configuration to start.

Required AWS permissions

Before starting to configure the Amazon S3 Data sink, note that the following Amazon S3 IAM permissions are required:

s3:ListBucket
s3:PutObject
s3:AbortMultipartUpload
s3:PutObjectAcl (This is only required if you set the Canned ACL option in the Data sink configuration)
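As an illustration, the permissions above could be grouped into an IAM policy document like the one generated below. This is a minimal sketch, not an official Onum policy: the function name build_sink_policy and the bucket name my-example-bucket are hypothetical. It assumes the usual IAM convention that s3:ListBucket is granted on the bucket ARN and the object-level actions on the keys inside it.

```python
import json


def build_sink_policy(bucket: str, use_canned_acl: bool = False) -> dict:
    """Build an IAM policy document covering the permissions listed above."""
    object_actions = ["s3:PutObject", "s3:AbortMultipartUpload"]
    if use_canned_acl:
        # Only needed when the Canned ACL option is set in the Data sink.
        object_actions.append("s3:PutObjectAcl")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level action: the resource is the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Object-level actions: the resource is every key in the bucket.
                "Effect": "Allow",
                "Action": object_actions,
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


# "my-example-bucket" is a placeholder; replace it with your own bucket name.
policy = build_sink_policy("my-example-bucket", use_canned_acl=True)
print(json.dumps(policy, indent=2))
```

The printed JSON can be reviewed and adapted before attaching it to the IAM user or role whose credentials the Data sink uses.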

Data sink configuration

Now you need to specify how and where to send the data, and how to establish a connection with Amazon S3.

Metadata

Enter the basic information for the new Data sink.

Parameter

Description

Name*

Enter a name for the new Data sink.

Description

Optionally, enter a description for the Data sink.

Tags

Add tags to easily identify your Data sink. Press Enter after defining each tag.

Metrics display

Choose whether to include this Data sink in the metrics and graphs of the Home area.

Configuration

Now, add the configuration to establish the connection.

AWS

Enter the specific configuration for AWS. You'll find this data in the General purpose buckets area of your Amazon S3 account.

Parameter

Description

Bucket*

The S3 bucket that your data is sent to. Use the bucket Name shown in your General purpose buckets area.

Region*

Choose the AWS Region of the bucket. It is shown next to the bucket name in your General purpose buckets area.

S3 object

S3 objects are files or data sets that are stored in a bucket. Each object is identified by a key that uses prefixes to simulate a folder structure. Click the bucket name to view its Objects and properties. Click an object to open it and see the following parameters.

Parameter

Description

Storage class

The desired S3 storage class. You can see the storage class of existing objects in the main objects table, or by clicking an object and going to Storage Class.

Canned ACL

Choose the canned S3 Access Control List (ACL). Setting this option requires the s3:PutObjectAcl permission.

Global prefix

Add a static prefix for all the object keys.

Auth

Fill in these parameters only if your bucket requires authorization.

Parameter

Description

Access key ID*

Add the access key from your Secrets or create one. The Access Key ID is found in the IAM Dashboard of the AWS Management Console.

In the left panel, click on Users.
Select your IAM user.
Under the Security Credentials tab, scroll to Access Keys and you will find existing Access Key IDs (but not the secret access key).

Secret access key*

Add the secret access key from your Secrets or create one.

Under Access keys, you can see your Access Key IDs, but AWS does not show the Secret Access Key after the key has been created. If you did not save the secret key at that time, you need to create a new access key.

Advanced options

Parameter

Description

Raw format - Max object size / Parquet format - Input size

Enter the maximum size of each object (in MB) that is sent to the S3 bucket.

Use Raw format - Max object size if you select Raw as the format in the output configuration.
Use Parquet format - Input size if you select Parquet as the format in the output configuration.

Instead of partitioning objects by time, you can partition them by size. Enter the maximum object size (in MB) here. If you do not select a Partition by value in the Pipeline configuration, a new object is created each time this limit is reached.

For both options, the minimum value is 1, and the maximum value is 5243000. The default value is 100.

Custom endpoint

If you have one, enter your custom endpoint.

If your edge services are deployed on-premises, check your available disk space before setting the Input size. An Input size greater than the available disk space may cause issues with workers or infrastructure.
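The size limits and the disk space check described above can be sketched as a small pre-flight check. This is only an illustration under stated assumptions: the function check_object_size is hypothetical and not part of Onum, and 1 MB is taken to be 1024 * 1024 bytes, which the documentation does not specify.

```python
import shutil
from typing import Optional

MIN_SIZE_MB = 1          # minimum value stated above
MAX_SIZE_MB = 5_243_000  # maximum value stated above
DEFAULT_SIZE_MB = 100    # default value stated above


def check_object_size(size_mb: int = DEFAULT_SIZE_MB,
                      free_disk_mb: Optional[int] = None) -> list[str]:
    """Return a list of problems found with the chosen size (empty if none)."""
    problems = []
    if not MIN_SIZE_MB <= size_mb <= MAX_SIZE_MB:
        problems.append(
            f"size must be between {MIN_SIZE_MB} and {MAX_SIZE_MB} MB"
        )
    # Only relevant for on-premises deployments, where local disk is limited.
    if free_disk_mb is not None and size_mb > free_disk_mb:
        problems.append("size exceeds the available disk space")
    return problems


# ASSUMPTION: 1 MB = 1024 * 1024 bytes; "/" is the disk used by the worker.
free_mb = shutil.disk_usage("/").free // (1024 * 1024)
print(check_object_size(size_mb=100, free_disk_mb=free_mb))
```

An empty list means the value is within the documented range and fits on the disk that was checked.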

Click Finish when complete. Your new Data sink will appear in the Data sinks area list.

Pipeline configuration

To use this Data sink in a Pipeline, you must configure the following output parameters. Click the Data sink on the canvas and select Configuration.

Output configuration

Format

Choose whether the event Format is Raw or Parquet. Depending on the format selected, you'll be prompted to fill in the corresponding parameters:

Parameter

Description

Event field*

This is the name of the input event field.

Framing method*

This parameter defines how events are separated within an S3 object (further defined in the S3 object section of the Data sink). Choose one of the following options:

Newline - Uses a newline character ('\n') to separate individual records in the output.
Length - The S3 framing method length is 10 bytes.
No framing - All events are contained in one line, leading to a long line until the maximum size is reached, with only one region.

Compress data?

Select true to enable compression or false to disable it.
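To make the three framing methods more concrete, the sketch below shows one way events could be laid out inside an object. It is not Onum's implementation. In particular, the exact layout of the Length method is an assumption: the documentation only states a length of 10 bytes, and the sketch interprets this as a 10-byte, zero-padded ASCII decimal header placed before each event. The function name frame_events and the method labels are hypothetical.

```python
def frame_events(events: list[bytes], method: str = "newline") -> bytes:
    """Concatenate events into one object body using a framing method."""
    if method == "newline":
        # Each record is terminated by a newline character.
        return b"".join(event + b"\n" for event in events)
    if method == "length":
        # ASSUMPTION: a 10-byte zero-padded decimal length precedes each event.
        return b"".join(
            str(len(event)).zfill(10).encode("ascii") + event
            for event in events
        )
    if method == "none":
        # Events are written back to back with no separator.
        return b"".join(events)
    raise ValueError(f"unknown framing method: {method}")


events = [b"first event", b"second event"]
for method in ("newline", "length", "none"):
    print(method, frame_events(events, method))
```

A consumer reading the object needs to know which framing method was used, because events written with no framing cannot be split apart again without extra knowledge of their structure.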

Key format

Choose the format for the name of the objects:

Parameter

Description

Prefix

The prefix used to organize your S3 data.

Partition by

This indicates how often a new S3 object is generated, e.g. every year, month, day, hour, or minute. If left blank, a new object is created when the Max object size / Input size entered in the Data sink configuration is reached.
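The following sketch illustrates how a Global prefix, a Prefix, and a time partition could combine into an object key that simulates a folder structure. The exact key layout that Onum generates is not documented here, so the path layout, the function build_object_key, and the object name events.raw are all assumptions made for illustration.

```python
from datetime import datetime, timezone
from typing import Optional

# ASSUMPTION: each partition level maps to one "folder" level in the key.
PARTITION_FORMATS = {
    "year": "%Y",
    "month": "%Y/%m",
    "day": "%Y/%m/%d",
    "hour": "%Y/%m/%d/%H",
    "minute": "%Y/%m/%d/%H/%M",
}


def build_object_key(global_prefix: str, prefix: str,
                     partition_by: Optional[str],
                     when: datetime, name: str) -> str:
    """Join the prefixes, the optional time partition, and the object name."""
    parts = [global_prefix, prefix]
    if partition_by:
        parts.append(when.strftime(PARTITION_FORMATS[partition_by]))
    parts.append(name)
    # Skip empty parts and trim stray slashes so the key has no "//".
    return "/".join(part.strip("/") for part in parts if part)


now = datetime.now(timezone.utc)
print(build_object_key("onum", "logs", "hour", now, "events.raw"))
print(build_object_key("onum", "logs", None, now, "events.raw"))
```

With no Partition by value, the key contains no time component, which matches the case where new objects are created only when the size limit is reached.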

Click Save to save your configuration.

