Send data to Amazon S3
Most recent version: v1.0.1
Overview
The following article outlines a basic data flow from Onum to Amazon Simple Storage Service (S3).
Amazon S3 is an object storage service that stores and protects any amount of data for a wide range of use cases, including data lakes, websites, cloud-native applications, backups, archives, machine learning, and analytics.
Prerequisites
Before configuring and starting to send data with the Amazon S3 Data sink, take the following requirements into consideration:
Your Amazon user needs at least permission to use the GetObject operation (S3) and the ReceiveMessage and DeleteMessageBatch operations (SQS) for the notification flow described below to work.
Cross-Region Configurations: Ensure that your S3 bucket and SQS queue are in the same AWS Region, as S3 event notifications do not support cross-Region targets.
Permissions: Confirm that the AWS Identity and Access Management (IAM) roles associated with your S3 bucket and SQS queue have the necessary permissions.
Object Key Name Filtering: If you use special characters in your prefix or suffix filters for event notifications, ensure they are URL-encoded.
Required AWS permissions
Before starting to configure the Amazon S3 Data sink, note that the following Amazon S3 IAM permissions are required:
s3:ListBucket
s3:PutObject
s3:AbortMultipartUpload
s3:PutObjectAcl (only required if you set the Canned ACL option in the Data sink configuration)
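As an illustration, a custom policy granting only these permissions could look like the sketch below (the bucket name my-output-bucket is a placeholder; substitute your own):

```python
import json

# Minimal IAM policy covering the permissions listed above.
# "my-output-bucket" is a placeholder bucket name.
sink_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:PutObject",
                "s3:AbortMultipartUpload",
                "s3:PutObjectAcl",  # only needed if you set a Canned ACL
            ],
            "Resource": [
                "arn:aws:s3:::my-output-bucket",    # bucket-level (ListBucket)
                "arn:aws:s3:::my-output-bucket/*",  # object-level actions
            ],
        }
    ],
}

print(json.dumps(sink_policy, indent=2))
```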
Amazon S3 Setup
You need to configure your Amazon S3 bucket to receive notifications to an Amazon Simple Queue Service (SQS) queue when new files are added.
Set the Bucket permissions
Go to IAM (Identity and Access Management) to manage users, groups, roles and permissions.
Under Permissions Policies, make sure the policy AmazonS3FullAccess is assigned to grant full access to S3 resources. Alternatively, if you need custom permissions, go to Policies > Create Policy and paste your custom JSON in the JSON tab, for example:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:PutObjectAcl"],
      "Resource": "arn:aws:s3:::my-data-receiver-bucket/*"
    }
  ]
}
Save the changes. This policy grants permission to write objects to the specified S3 bucket.
Download the Access Key ID and Secret Access Key — you’ll need these later.
Log in to your Onum tenant and click Data Sinks > New Data sink.
Double-click the Amazon S3 Sink.


Enter a Name for the new Data Sink. Optionally, add a Description and some Tags to identify the Sink.


Decide whether or not to include this Data sink info in the metrics and graphs of the Home area.


In the AWS authentication section, enter the Bucket* your data will be stored in. This is the bucket Name found in your General purpose buckets area.

In the AWS authentication section, enter the Region* of your AWS console, also found in your General purpose buckets area, next to the name.


S3 object
S3 objects are files or data sets stored in a bucket. Each object is identified by a key that uses prefixes to simulate a folder structure. In the AWS console, click the bucket name to view its Objects and properties, and click an object to open it. In the Data sink, configure the following parameters:

Storage class - The desired S3 storage class. You can see this in the main objects table, or by clicking the object and going to Storage Class.
Canned ACL - Choose the S3 Access Control List.
Global prefix - Add a static prefix for all the object keys.


Select the Access Key ID from your Secrets or click New secret to generate a new one.
The Access Key ID is found in the IAM Dashboard of the AWS Management Console.
In the left panel, click on Users.
Select your IAM user.
Under the Security Credentials tab, scroll to Access Keys, and you will find existing Access Key IDs (but not the secret access key).
Select the Secret Access Key from your Secrets or click New secret to generate a new one.
Under Access keys, you can see your Access Key IDs, but AWS only shows the Secret Access Key once, when the key is created, so you must have saved it previously. If you don't have the secret key saved, you need to create a new one.


Click New secret to create a new one:
Give the secret a Name.
Turn off the Expiration date option.
Click Add new value and paste your Secret Access Key.
Click Save.


Learn more about secrets in Onum in this article.
Enter the maximum size of each object (in MB) that is sent to the S3 bucket.
Max object size - used if you select Raw as the format in the output configuration.
Input size - used if you select Parquet as the format in the output configuration.
Instead of partitioning by time, you can partition by the size of the message. Assign here the object's maximum size (in MB). If you do not select a Partition by value, a new object is created upon reaching this limit.
For both options, the minimum value is 1 and the maximum value is 5243000. The default value is 100.
If your edge services are deployed on-premises, make sure to check your available disk space. This is because setting an Input size greater than the disk space available may lead to technical issues with workers or infrastructure.
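Conceptually, this size-based rollover behaves like the following sketch (function and constant names are illustrative, not Onum's internals):

```python
# Illustrative sketch of size-based object rollover (not Onum's actual code).
MAX_OBJECT_MB = 100  # default Max object size / Input size

def roll_objects(events, max_mb=MAX_OBJECT_MB):
    """Group raw events into object bodies, starting a new object
    once the accumulated size would exceed the limit."""
    limit = max_mb * 1024 * 1024
    objects, current, size = [], [], 0
    for event in events:
        if current and size + len(event) > limit:
            objects.append(b"".join(current))
            current, size = [], 0
        current.append(event)
        size += len(event)
    if current:
        objects.append(b"".join(current))
    return objects

# With a 1 MB limit, three ~0.4 MB events roll into two objects:
# the first holds two events, the third starts a new object.
chunks = roll_objects([b"x" * 400_000] * 3, max_mb=1)
```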
If you have a non-default URL that directs API requests to a specific Amazon S3 service endpoint, enter it here in the Custom endpoint field.
Click Create data sink when complete.
Your new Data sink will appear in the Data sinks area list.
Pipeline configuration
When it comes to using this Data sink in a Pipeline, you must configure the following output parameters. To do it, simply click the Data sink on the canvas and select Configuration.
Output configuration
Format
Choose whether the event Format is Raw or Parquet. Depending on the format selected, you'll be prompted to fill in the corresponding parameters:
Event field*
This is the name of the input event field.
Framing method*
This parameter defines how events are separated within an S3 object (further defined in the S3 object section of the Data sink). Choose one of the following options:
Newline - Uses a newline character ('\n') to separate individual records in the output.
Length - Precedes each event with its length, expressed as a fixed 10-byte field.
No framing - All events are concatenated into one long record that grows until the maximum object size is reached.
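The three framing methods can be illustrated with a short sketch. The length encoding shown (zero-padded 10-byte ASCII) is an assumption for illustration; the exact byte layout used by the product is not specified here:

```python
def frame_events(events, method="newline"):
    """Join raw events into one S3 object body using a framing method.
    Sketch only; the product's exact byte layout may differ."""
    if method == "newline":
        # One event per line, separated by '\n'.
        return b"\n".join(events)
    if method == "length":
        # Assumed encoding: each event preceded by its length
        # as a zero-padded 10-byte ASCII field.
        return b"".join(b"%010d" % len(e) + e for e in events)
    if method == "none":
        # No separator: one long unbroken record.
        return b"".join(events)
    raise ValueError(f"unknown framing method: {method}")

body = frame_events([b"abc", b"de"], method="length")
# b"0000000003abc0000000002de"
```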
Compress data?
Choose between true/false to enable/disable compression.
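When compression is enabled, the object body is compressed before upload. Assuming a gzip codec (the actual codec used by the Data sink is not specified here), a round trip looks like:

```python
import gzip

# A newline-framed payload of two events.
payload = b"\n".join([b'{"event": 1}', b'{"event": 2}'])

# Compress before upload; the stored object decompresses
# back to the original framed payload.
compressed = gzip.compress(payload)
restored = gzip.decompress(compressed)
```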


Event fields*
Separate the raw event into fields. Give the field a name and add as many fields as required by clicking Add element.


Key format
Choose the format for the name of the objects:
Prefix
The prefix used to organize your S3 data.
Partition by
This indicates the frequency with which to generate a new S3 object, e.g. every year, month, day, hour, or minute. If left blank, the value used will be the Max object size / Input size entered in the Data sink configuration.
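A time-based partition effectively becomes part of the object key. The sketch below shows one plausible key layout built from a prefix and a partition granularity; the exact key format Onum uses is not documented here:

```python
from datetime import datetime, timezone

def object_key(prefix, partition_by, when, name):
    """Compose an S3 object key from a prefix and a time partition.
    The key layout is illustrative, not Onum's actual format."""
    patterns = {
        "year": "%Y",
        "month": "%Y/%m",
        "day": "%Y/%m/%d",
        "hour": "%Y/%m/%d/%H",
        "minute": "%Y/%m/%d/%H/%M",
    }
    return f"{prefix}/{when.strftime(patterns[partition_by])}/{name}"

ts = datetime(2025, 3, 14, 9, 30, tzinfo=timezone.utc)
key = object_key("logs", "hour", ts, "events.raw")
# "logs/2025/03/14/09/events.raw"
```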


Click Save to save your configuration.