StarTree Cloud Coding Competition With Grand Prizes

Configure Indexes

In this guide you will learn how to apply indexes to a Pinot table. This guide assumes that you have followed the Ingesting data from a stream data source guide.

Pinot supports a series of different indexes that can be used to optimise query performance. In this guide we'll learn how to add indexes to the events table that we set up in the Ingesting data from a stream data source guide.

Why do we need indexes?

If no indexes are applied to the columns in a Pinot segment, the query engine needs to scan through every document, checking whether that document meets the filter criteria provided in a query. This can be a slow process if there are a lot of documents to scan.

When indexes are applied, the query engine can more quickly work out which documents satisfy the filter criteria, reducing the time it takes to execute the query.

What indexes does Pinot support?

By default, Pinot creates a forward index for every column. The forward index generally stores documents in insertion order.

However, before flushing the segment, Pinot does a single pass over every column to see whether the data is sorted. If it is then Pinot instead creates a sorted (forward) index for that column.

💡

For real-time tables you can also explicitly tell Pinot that one of the columns should be sorted. For more details, see the Sorted Index Documentation (opens in a new tab).

For filtering documents within a segment, Pinot supports the following indexing techniques:

Inverted index - Used for exact lookups
Range index - Used for range queries.
Text index - Used for phrase, term, boolean, prefix, or regex queries.
Geospatial index - Based on H3, a hexagon-based hierarchical gridding. Used for finding points that exist within a certain distance from another point.
JSON index - Used for querying columns in JSON documents.
Star-Tree index - Pre-aggregates results across multiple columns.

The events table

Let's see how we can apply these indexing techniques to our data. To recap, the events table has the following fields:

Date Time Fields	Dimensions Fields	Metric Fields
`ts`	`uuid`	`count`

We might want to write queries that filter on the ts and uuid columns, so these are the columns on which we would want to configure indexes.

Since the data we're ingesting into the Kafka topic is all implicitly ordered by timestamp, this means that the ts column already has a sorted index. This means that any queries that filter on this column are already optimised.

So that leaves us with the uuid column.

Adding an inverted index

We're going to add an inverted index to the uuid column so that queries that filter on that column will return quicker. We need to add the following line:

"invertedIndexColumns": ["uuid"]

To the tableIndexConfig section.

Copy the following to the clipboard:

/tmp/pinot/table-config-stream.json

{
  "tableName": "events",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "ts",
    "schemaName": "events",
    "replicasPerPartition": "1"
  },
  "tenants": {},
  "tableIndexConfig": {
    "invertedIndexColumns": ["uuid"],
    "loadMode": "MMAP",
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.consumer.type": "lowlevel",
      "stream.kafka.topic.name": "events",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.broker.list": "kafka:9092",
      "realtime.segment.flush.threshold.rows": "0",
      "realtime.segment.flush.threshold.time": "24h",
      "realtime.segment.flush.threshold.segment.size": "50M",
      "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
    }
  },
  "metadata": {
    "customConfigs": {}
  }
}

Navigate to localhost:9000/#/tenants/table/events_REALTIME (opens in a new tab), click on Edit Table, paste the next table config, and then click Save.

Once you've done that, you'll need to click Reload All Segments and then Yes to apply the indexing change to all segments.

Checking the index has been applied

We can check that the index has been applied to all our segments by querying Pinot's REST API. You can find Swagger documentation at localhost:9000/help (opens in a new tab).

The following query will return the indexes defined on the uuid column:

curl -X GET "http://localhost:9000/segments/events/metadata?columns=uuid" \
  -H "accept: application/json" 2>/dev/null | 
  jq '.[] | [.segmentName, .indexes]'

Output

[
  "events__0__1__20220214T1106Z",
  {
    "uuid": {
      "bloom-filter": "NO",
      "dictionary": "YES",
      "forward-index": "YES",
      "inverted-index": "YES",
      "null-value-vector-reader": "NO",
      "range-index": "NO",
      "json-index": "NO"
    }
  }
]
[
  "events__0__0__20220214T1053Z",
  {
    "uuid": {
      "bloom-filter": "NO",
      "dictionary": "YES",
      "forward-index": "YES",
      "inverted-index": "YES",
      "null-value-vector-reader": "NO",
      "range-index": "NO",
      "json-index": "NO"
    }
  }
]

We can see from looking at the inverted-index property that the index has been applied.

💡

We're using the jq command line JSON processor (opens in a new tab) to extract the fields that we're interested in.

Querying

You can now run some queries that filter on the uuid column., as shown below:

SELECT * 
FROM events 
WHERE uuid = 'f4a4f'
LIMIT 10

You'll need to change the actual uuid value to a value that exists in your database, since the UUIDs are generated randomly by our script.

Next Steps

You've now learnt how to add an index to a real-time table and refresh all the segments so that they all recognise that index.

It's probably now time to try ingesting your own dataset. There are recipes to help you solve common problems and you can always join the community to get any questions answered.

Ingest streaming data Recipes