In this guide you will learn how to apply indexes to a Pinot table. This guide assumes that you have followed the Ingesting data from a stream data source guide.
Pinot supports a series of different indexes that can be used to optimise query performance.
In this guide we'll learn how to add indexes to the
events table that we setup in the Ingesting data from a stream data source guide.
Why do we need indexes?
If no indexes are applied to the columns in a Pinot segment, the query engine needs to scan through every document, checking whether that document meets the filter criteria provided in a query. This can be a slow process if there are a lot of documents to scan.
When indexes are applied, the query engine can more quickly work out which documents satisfy the filter criteria, reducing the time it takes to execute the query.
What indexes does Pinot support?
By default, Pinot creates a forward index for every column. The forward index generally stores documents in insertion order.
However, before flushing the segment, Pinot does a single pass over every column to see whether the data is sorted. If it is then Pinot instead creates a sorted (forward) index for that column.
For real-time tables you can also explicitly tell Pinot that one of the columns should be sorted. For more details, see the Sorted Index Documentation.
For filtering documents within a segment, Pinot supports the following indexing techniques:
- Inverted index - Used for exact lookups
- Range index - Used for range queries.
- Text index - Used for phrase, term, boolean, prefix, or regex queries.
- Geospatial index - Based on H3, a hexagon-based hierarchical gridding. Used for finding points that exist within a certain distance from another point.
- JSON index - Used for querying columns in JSON documents.
- Star-Tree index - Pre-aggregates results across multiple columns.
The events table
Let's see how we can apply these indexing techniques to our data.
To recap, the
events table has the following fields:
|Date Time Fields||Dimensions Fields||Metric Fields|
We might want to write queries that filter on the
uuid columns, so these are the columns on which we would want to configure indexes.
Since the data we're ingesting into the Kafka topic is all implicitly ordered by timestamp, this means that the
ts column already has a sorted index.
This means that any queries that filter on this column are already optimised.
So that leaves us with the
Adding an inverted index
We're going to add an inverted index to the
uuid column so that queries that filter on that column will return quicker.
We need to add the following line:
Copy the following to the clipboard:
Navigate to localhost:9000/#/tenants/table/events_REALTIME, click on Edit Table, paste the next table config, and then click Save.
Once you've done that, you'll need to click Reload All Segments and then Yes to apply the indexing change to all segments.
Checking the index has been applied
We can check that the index has been applied to all our segments by querying Pinot's REST API. You can find Swagger documentation at localhost:9000/help.
The following query will return the indexes defined on the
curl -X GET "http://localhost:9000/segments/events/metadata?columns=uuid" \
-H "accept: application/json" 2>/dev/null |
jq '. | [.segmentName, .indexes]'
We can see from looking at the
inverted-index property that the index has been applied.
We're using the jq command line JSON processor to extract the fields that we're interested in.
You can now run some queries that filter on the
uuid column., as shown below:
WHERE uuid = 'f4a4f'
You'll need to change the actual
uuid value to a value that exists in your database, since the UUIDs are generated randomly by our script.
You've now learnt how to add an index to a real-time table and refresh all the segments so that they all recognise that index.