
Kafka

info

In this guide we're going to learn how to configure Kafka as a data source in the StarTree Data Manager. Before starting, you should have created an environment, and you'll also need access to a Kafka cluster.

The StarTree Data Manager can ingest messages from your own Kafka cluster or hosted Kafka services like Confluent Cloud or Amazon MSK.

Create dataset

Enter a name for your dataset and select Provide my own data.

Select data source

Click the NEXT button.

Data Source

You'll see the following screen where you should select Streaming and then Kafka:

Select data source

Enter the broker URL of your Kafka cluster. Select the appropriate authentication type and provide your username and password. You can also optionally specify a schema registry URL.
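Under the hood, these settings correspond to standard Kafka client and Pinot stream properties. A minimal sketch of the equivalent configuration for a SASL-authenticated cluster (the broker address, credentials, and schema registry URL below are placeholders, not values generated by Data Manager):

```json
{
  "stream.kafka.broker.list": "broker-1.example.com:9092",
  "security.protocol": "SASL_SSL",
  "sasl.mechanism": "PLAIN",
  "sasl.jaas.config": "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"<username>\" password=\"<password>\";",
  "stream.kafka.decoder.prop.schema.registry.rest.url": "https://schema-registry.example.com"
}
```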

Authentication Type

Click TEST CONNECTION to check that StarTree Cloud can access the Kafka cluster.

Test Connection

You will see a success message if the connection has been configured correctly. Click NEXT to go to the next screen.

Data Modeling

Next, we need to select the topic that we'd like to connect to Pinot, along with its format, as shown in the screenshot below:

Select Topic

We'll select the events topic.

tip

The messages that we publish to the events topic have the following structure:

{
  "eventId": "f5551a6f-df87-46ab-8a4b-b9d2de0fb943",
  "userId": "343",
  "eventType": "Comment",
  "ts": "1635149671000"
}

The Data Manager will then make an educated guess at the name and data type of each field in the messages on the topic.

Columns and field/data types
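For the example message above, the inferred Pinot schema might look something like the following sketch (the exact field specs depend on what Data Manager infers from your messages):

```json
{
  "schemaName": "events",
  "dimensionFieldSpecs": [
    { "name": "eventId", "dataType": "STRING" },
    { "name": "userId", "dataType": "STRING" },
    { "name": "eventType", "dataType": "STRING" },
    { "name": "ts", "dataType": "LONG" }
  ]
}
```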

Time Column

We'll change the ts field type to be DATETIME. This will be the primary time column for this Pinot table, which Pinot uses to maintain the time boundary between offline and real-time data in a hybrid table, as well as for retention management. (Read more here).
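In Pinot schema terms, marking ts as DATETIME corresponds to a dateTimeFieldSpec along these lines (a sketch, assuming epoch-millisecond timestamps as in the example message above):

```json
{
  "dateTimeFieldSpecs": [
    {
      "name": "ts",
      "dataType": "LONG",
      "format": "1:MILLISECONDS:EPOCH",
      "granularity": "1:MILLISECONDS"
    }
  ]
}
```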

Time column

(Optional) Enable Upserts

We can add a primary key to enable upserts on this real-time table. Read more about upserts here.
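Under the hood, this roughly corresponds to enabling upsert mode in the real-time table config (a sketch, assuming eventId uniquely identifies a row in our events example):

```json
{
  "upsertConfig": {
    "mode": "FULL"
  }
}
```

with `"primaryKeyColumns": ["eventId"]` declared in the table's schema.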

Primary key

Once you're happy with the data transformations, scroll down, and click on the NEXT button.

Additional Configuration

On this screen you'll be able to configure indexes, tenants, ingestion scheduling, and data retention for this data source.

Configure indexes, tenants, ingestion scheduling, and data retention

For more information on the different types of indexes and when to use them, see the Apache Pinot Indexing Documentation.
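As an illustration, adding an inverted index on a frequently filtered column such as eventType would correspond to a table config fragment like this (a sketch, not the exact config Data Manager generates):

```json
{
  "tableIndexConfig": {
    "invertedIndexColumns": ["eventType"]
  }
}
```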

Once you're happy with the configuration, scroll down, and click on the NEXT button.

Review

You'll now see the review and submit screen, where you can review everything that we've configured in the previous steps.

Review Data Source

Click on the toggle next to Preview Data to see how the data will look once it's imported. If anything doesn't look right, click on the BACK button to go back to the previous screen.

Once you're ready to create the data source, click on the SUBMIT button. You'll then see the following screen:

Data Source Created

Query Data Source

To have a look at the data that we've imported, click on the Query Console link, which will open the Pinot Data Explorer. Click on the events table and then click RUN QUERY to run a basic query against the data source:

Query events Data Source
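A basic query against the events table might look like the following (column names assume our example messages; adjust them to match your own topic):

```sql
SELECT eventType, COUNT(*) AS numEvents
FROM events
GROUP BY eventType
ORDER BY numEvents DESC
LIMIT 10
```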