Kinesis
In this guide we're going to learn how to configure Kinesis as a data source in the StarTree Data Manager. You should already have created a StarTree environment, and you'll also need access to an AWS account with Kinesis enabled.
The StarTree Data Manager can ingest messages from an AWS Kinesis Stream within your AWS account.
Create dataset
Enter a name for your dataset and select Provide my own data.
Select data source
Click the NEXT button.
Data Source
You'll see the following screen where you should select Streaming and then Kinesis:
Select data source
Provide a new connection name, then enter the credentials for the Basic authentication type: the Access Key, Secret Key, and Region for your account.
Click TEST CONNECTION to check that StarTree Cloud can access the Kinesis account.
Test Connection
You will see a success message if the account has been configured correctly. Click NEXT to go to the next screen.
Data Modeling
Next we need to select the Kinesis stream that we'd like to connect to Pinot and its format, as shown in the screenshot below:
Select Topic
We'll select the github-events stream and JSON as its data format.
The messages that we publish to the github-events stream will have the following structure:
{
  "requestedReviewers": ["saneDG"],
  "requestedTeams": [],
  "repo": "ratatiedot-extranet",
  "numAuthors": 1,
  "assignees": [],
  "numCommenters": 0,
  "title": "Chore: RTENU-8 Added missing configs, TODO formatting, updated package-lock.jsn",
  "elapsedTimeMillis": 2089000,
  "numLinesDeleted": 165,
  "committers": ["mehtis"],
  "numFilesChanged": 4,
  "authorAssociation": "NONE",
  "numReviewers": 0,
  "numCommitters": 1,
  "numCommits": 3,
  "count": null,
  "numReviewComments": 0,
  "mergedTimeMillis": 1663654063000,
  "userId": "mehtis",
  "reviewers": [],
  "labels": [],
  "numComments": 0,
  "createdTimeMillis": 1663651974000,
  "numLinesAdded": 158,
  "organization": "finnishtransportagency",
  "userType": "User",
  "mergedBy": "NinaDang97",
  "authors": ["mehtis"],
  "commenters": []
}
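As a sketch of how messages like this might end up on the stream, the snippet below builds the Data/PartitionKey payload that the Kinesis PutRecord API expects. Only the stream name github-events comes from this guide; the field subset, the choice of userId as the partition key, and the commented boto3 call are assumptions for illustration.

```python
import json

# A sample event matching the structure above (abridged to a few fields).
event = {
    "userId": "mehtis",
    "repo": "ratatiedot-extranet",
    "numCommits": 3,
    "createdTimeMillis": 1663651974000,
}

def build_record(event):
    """Serialize an event into the shape that Kinesis PutRecord
    expects: Data must be bytes, and PartitionKey is a string that
    determines which shard receives the record."""
    return {
        "StreamName": "github-events",
        "Data": json.dumps(event).encode("utf-8"),
        "PartitionKey": event["userId"],  # hypothetical partitioning choice
    }

record = build_record(event)
# With boto3 installed and credentials configured, you could then call:
# boto3.client("kinesis", region_name="us-east-1").put_record(**record)
print(record["PartitionKey"])
```

Partitioning by userId is just one option; any reasonably high-cardinality key will spread records across shards.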
The Dataset Manager will then make an educated guess at the field and data types for each of the fields in the messages on this stream.
Columns and field/data types
Time Column
We'll change the createdTimeMillis field type to DATETIME. This will be the primary time column for this Pinot table, which Pinot uses to maintain the time boundary between offline and real-time data in a hybrid table, as well as for retention management. (Read more here).
Time column
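Under the hood, marking createdTimeMillis as the time column corresponds to a dateTimeFieldSpec entry in the Pinot schema. A sketch of what that entry typically looks like for an epoch-milliseconds column (the exact schema Data Manager generates may differ):

```json
"dateTimeFieldSpecs": [
  {
    "name": "createdTimeMillis",
    "dataType": "LONG",
    "format": "1:MILLISECONDS:EPOCH",
    "granularity": "1:MILLISECONDS"
  }
]
```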
(Optional) Enable Upserts
We can add a primary key to enable upserts on this real-time table. We are not going to select a primary key for upserts in this demo. Read more about upserts here.
Primary key
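For reference, if you did choose a primary key, the generated Pinot schema and real-time table config would include upsert settings along these lines. This is a hedged sketch: userId as the key is only an example, not something we configure in this demo.

In the schema:

```json
"primaryKeyColumns": ["userId"]
```

In the real-time table config:

```json
"upsertConfig": {
  "mode": "FULL"
}
```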
Once you're happy with the data transformations, scroll down, and click on the NEXT button.
Additional Configuration
On this screen you'll be able to configure indexes, tenants, ingestion scheduling, and data retention on this data source.
Configure indexes, tenants, ingestion scheduling, and data retention
For more information on the different types of indexes and when to use them, see the Apache Pinot Indexing Documentation.
Once you're happy with the configuration, scroll down, and click on the NEXT button.
Review
You'll now see the review and submit screen, where you can review everything that we've configured in the previous steps.
Review Data Source
Click on the toggle next to Preview Data to see how the data will look once it's imported. If anything doesn't look right, click on the BACK button to go back to the previous screen.
Once you're ready to create the data source, click on the SUBMIT button. You'll then see the following screen:
Data Source Created
Query Data Source
To have a look at the data that we've imported, click on the Query Console link, which will open the Pinot Data Explorer. Click on the kinesis_github_data table and then click RUN QUERY to run a basic query against the data source.
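As an example, a simple aggregation over the imported table might look like the query below. The column names come from the sample message earlier in this guide; the query itself is illustrative rather than the one pre-filled by the console.

```sql
SELECT repo,
       SUM(numCommits) AS totalCommits,
       COUNT(*) AS pullRequests
FROM kinesis_github_data
GROUP BY repo
ORDER BY totalCommits DESC
LIMIT 10
```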