AWS S3 connector

Amazon Web Services (AWS) S3 is an object storage service that is built to store any amount of data from anywhere and can handle massive amounts of batch data.

StarTree customers can use the AWS S3 connector to ingest data from AWS S3 into Apache Pinot.

To begin, from the StarTree Data Manager overview page, click Create a dataset. Then, on the Select Connection Type page, click Amazon S3.

Connect StarTree to Amazon S3

Use one of the following methods to create a connection to Amazon S3:

  • AWS access and secret keys
  • AWS cross-account Identity and Access Management (IAM) roles

Configure AWS IAM policy

  1. Open the AWS IAM dashboard, click Policies, and click Create policy.

  2. Click to open the JSON tab and create the following policy, replacing my-bucket-name with the bucket name that you will use to establish the connection.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:List*",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::my-bucket-name",
        "arn:aws:s3:::my-bucket-name/*"
      ]
    }
  ]
}
  3. Provide the Policy name, such as s3-my-bucket-policy, and click Create policy.
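
If you manage several buckets, it can help to generate the policy document instead of editing the bucket name by hand. The following is a minimal sketch that reproduces the policy above for a given bucket name; the function name build_read_only_policy is hypothetical and is not part of any StarTree or AWS tooling.

```python
import json


def build_read_only_policy(bucket_name: str) -> dict:
    """Return the read-only S3 policy shown above for the given bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:List*", "s3:GetObject"],
                # The bucket ARN covers listing; the /* ARN covers the objects.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
            }
        ],
    }


if __name__ == "__main__":
    # Paste the printed JSON into the JSON tab of the Create policy page.
    print(json.dumps(build_read_only_policy("my-bucket-name"), indent=2))
```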

Select one of the connection creation options to continue.

Option 1: Create a connection using access and secret keys

You will need the following information to create a connection with Amazon S3 when using your AWS Access Key and Secret Key pair.

  • Region: AWS region where the S3 bucket resides. You can find this information in the S3 dashboard.
  • Bucket: The name of the bucket from which the data will be ingested. Optional: To avoid giving access to the entire bucket, create a folder within the bucket; you will need to use this folder as the input path.
  • Access & Secret Key: You can generate an access key and secret key pair in the AWS IAM console.
  • Bucket Prefix (Optional): A prefix string for input data.
    • Input: s3://my-bucket-name/aaa/*.avro, Prefix: aaa/
    • Input: s3://my-bucket-name/aaa/bbb/*.avro, Prefix: aaa/bbb/
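
The relationship between an input path and its bucket prefix in the examples above can be sketched in a few lines of Python. This is only an illustration: split_input_path is a hypothetical helper, and it assumes that the prefix is everything between the bucket name and the file name pattern. Note that a ? character in the input would be read as a URL query separator by urlparse, so this sketch only handles patterns like the ones shown.

```python
import posixpath
from typing import Tuple
from urllib.parse import urlparse


def split_input_path(input_uri: str) -> Tuple[str, str]:
    """Split an s3:// input path into (bucket, prefix)."""
    parsed = urlparse(input_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URI: {input_uri!r}")
    key_pattern = parsed.path.lstrip("/")        # for example "aaa/*.avro"
    directory = posixpath.dirname(key_pattern)   # for example "aaa"
    prefix = f"{directory}/" if directory else ""
    return parsed.netloc, prefix


if __name__ == "__main__":
    print(split_input_path("s3://my-bucket-name/aaa/*.avro"))
    print(split_input_path("s3://my-bucket-name/aaa/bbb/*.avro"))
```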

Create an AWS access and secret key pair

If you do not have an access key and secret key pair, follow these steps to generate one.

  1. Open the AWS IAM console, click Users, and click Add users.

  2. Enter a User name, such as startree-my-stream-user, and click Next.

  3. Under Permission options, select Attach policies directly.

  4. Use search to find the policy that you created earlier (s3-my-bucket-policy in our example) and select it. Click Next, then click Create user.

  5. Return to the AWS IAM Dashboard, click Users, then select the user name that you just created (startree-my-stream-user in our example).

  6. Select the Security credentials tab and click Create access key.

  7. Click Application running on an AWS compute service, click Next, and click Create access key.

  8. Copy the Access key and the Secret access key from here and enter them in the Data Manager.

     (Figure: S3 connection settings for access and secret key auth in the Data Manager UI)
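
Before entering the keys in the Data Manager, you may want to confirm that they can list the bucket. The sketch below is one possible check using the boto3 library (a third-party package that must be installed separately); it is not part of the StarTree workflow. The helper name prefix_has_objects and the region us-east-1 are assumptions for illustration. The check only exercises the list permission, not s3:GetObject, and an access failure surfaces as an exception raised by the client.

```python
def prefix_has_objects(s3_client, bucket: str, prefix: str = "") -> bool:
    """Return True if listing the prefix succeeds and finds at least one object."""
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    return response.get("KeyCount", 0) > 0


if __name__ == "__main__":
    import boto3  # third-party; install with "pip install boto3"

    # boto3 reads AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from the
    # environment, so the keys do not need to appear in the script.
    # The region here is a placeholder; use the region of your bucket.
    client = boto3.client("s3", region_name="us-east-1")
    print(prefix_has_objects(client, "my-bucket-name", "aaa/"))
```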

Continue in the Create dataset section.

Option 2: Create a connection using AWS IAM roles

You will need the following information to create a connection with Amazon S3 when using AWS IAM Roles:

  • Region: The AWS region where the S3 bucket resides. You can find this information in the Amazon S3 dashboard.
  • Bucket: The name of the bucket from which the data will be ingested.
  • IAM Role ARN: The Customer-provided IAM Role ARN that will be assumed by the StarTree Data Plane.
  • Bucket Prefix (Optional): A prefix string for input data.
    • Input: s3://my-bucket-name/aaa/*.avro, Prefix: aaa/
    • Input: s3://my-bucket-name/aaa/bbb/*.avro, Prefix: aaa/bbb/

Create an AWS IAM role

If you do not have an AWS IAM role, follow these steps to generate one.

  1. Find the following information. You will need it in later steps.

    • StarTree Account ID and External ID from the Data Manager UI.
    • StarTree environment name from the StarTree Cloud Portal.

      (Figure: StarTree environment name in the StarTree cloud portal)
  2. Check your Security Token Service (STS) configuration:

    • Open the AWS IAM console and click Account settings.
    • If the Security Token Service (STS) is not in Active status for the region where your S3 bucket resides, set it to Active.
  3. In the AWS IAM console, click Roles in the left menu bar, and then click Create role.

  4. In the AWS IAM console, click AWS account for Trusted entity type.

    • If your S3 bucket and StarTree Data Plane run on the same account you are currently using to access the AWS IAM console, click This Account.
    • If your S3 bucket and StarTree Data Plane use a different account, from the StarTree Data Manager copy the StarTree AWS Account information and paste it into the Another AWS Account field. Then copy the External ID and paste it into the External ID field.
  5. Use search to find the policy that you created earlier (s3-my-bucket-policy in our example), select it, and then click Next.

  6. If the External ID referred to previously has the env-* prefix, skip to step 7. Otherwise, enter the following:

    • Role name: Enter a role name, such as startree-my-stream-role.
    • Tags: Enter a key-value pair.
  7. Click Create role. Navigate to the Summary page. Copy the ARN field and enter it into the IAM Role ARN field in the Data Manager UI. Click Create Connection.

     (Figure: S3 connection settings for IAM auth in the Data Manager UI)
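
For reference, when you select Another AWS Account and supply an External ID, AWS attaches a trust policy to the role that lets the other account assume it only when that External ID is presented. The sketch below builds a document of that general shape. The function name build_trust_policy and the sample account ID and External ID are placeholders, and the exact trust policy for your role is not specified in this guide, so compare the output against the Trust relationships tab of your role rather than treating it as authoritative.

```python
import json


def build_trust_policy(account_id: str, external_id: str) -> dict:
    """Return a cross-account trust policy restricted by an External ID."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The account that is allowed to assume the role.
                "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
                "Action": "sts:AssumeRole",
                # The caller must present this External ID when assuming the role.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values; use the Account ID and External ID from the Data Manager UI.
    print(json.dumps(build_trust_policy("123456789012", "example-external-id"), indent=2))
```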

Continue in the Create dataset section.

Create dataset

  1. Select a Topic Name and Data Format to map the topic to the correct table. Everything else is optional.

  2. (Optional) File Include Pattern lets you specify a pattern; only files that match it are included from inputDirUri. The following glob pattern examples assume that the input path is abc/ and the bucket name is my_bucket (that is, s3://my_bucket/abc/):

    • glob:**/abc/* - matches all files within abc/.
    • glob:**/abc/*.avro - matches all files ending with .avro in the abc/ directory.
    • glob:**/abc/**.avro - matches all files ending with .avro that are located in abc/ or in any subdirectory under abc/.
    • glob:**/abc/?.avro - matches all files in abc/ whose name is a single character followed by .avro (for example, a.avro or b.avro).
  3. (Optional) File Exclude Pattern lets you specify a pattern; files that match it are excluded from inputDirUri. Glob patterns are written the same way as for File Include Pattern.

  4. (Optional) Record reader config lets advanced users provide additional details about the source schema.

  5. (Optional) Improve query performance by adding indexes to the appropriate columns and choosing an encoding type for each column.

  6. (Optional) Configure unique details such as tenants, scheduling, data retention, and a primary key for upsert.

  7. Click Next.

  8. Check the details and preview data. When ready, click Create Dataset.
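
To preview which files a File Include Pattern or File Exclude Pattern would select, you can approximate the glob rules from the examples in step 2 with a small Python script. This is a sketch under stated assumptions, not the matcher that Data Manager uses: it assumes that ** matches across directory separators, that * and ? do not, and that the pattern is matched against the full s3:// path. The names glob_to_regex and select_files are hypothetical.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a glob: pattern into a compiled regular expression."""
    if pattern.startswith("glob:"):
        pattern = pattern[len("glob:"):]
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")        # any characters, including "/"
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")     # any characters within one path segment
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")      # exactly one character, not "/"
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def select_files(paths, include=None, exclude=None):
    """Keep paths that match include (if given) and do not match exclude (if given)."""
    selected = []
    for path in paths:
        if include and not glob_to_regex(include).fullmatch(path):
            continue
        if exclude and glob_to_regex(exclude).fullmatch(path):
            continue
        selected.append(path)
    return selected


if __name__ == "__main__":
    files = [
        "s3://my_bucket/abc/a.avro",
        "s3://my_bucket/abc/data.avro",
        "s3://my_bucket/abc/sub/b.avro",
        "s3://my_bucket/abc/notes.txt",
    ]
    for pattern in ("glob:**/abc/*", "glob:**/abc/*.avro",
                    "glob:**/abc/**.avro", "glob:**/abc/?.avro"):
        print(pattern, "->", select_files(files, include=pattern))
```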