How to use Google Cloud Storage as a Deep Store

In this recipe we'll learn how to use Google Cloud Storage as a Deep Store for Apache Pinot segments. The deep store (or deep storage) is the permanent store for segment files and is used for backup and restore operations.


You will need to install Docker and the Google Cloud CLI locally to follow the code examples in this guide.

You will also need to create a GCP project and a user or service account that has permission to list and create buckets. Once you've done that, create a bucket, e.g. pinot-events, which is the bucket name used in the example output later in this guide.
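If you would rather create the bucket from the terminal, a minimal sketch along the following lines might help. It assumes the gsutil tool that ships with the Google Cloud CLI is installed and authenticated. The function names are hypothetical, and the name check covers only a simplified subset of the Cloud Storage bucket naming rules (3 to 63 characters; lowercase letters, digits, dots, dashes, and underscores; starting and ending with a letter or digit).

```shell
# Simplified bucket-name check (hypothetical helper, not the full GCS rule set).
# The body runs in a subshell so the locale change does not leak out.
is_valid_bucket_name() (
  LC_ALL=C                               # make a-z match ASCII lowercase only
  name=$1
  [ "${#name}" -ge 3 ] && [ "${#name}" -le 63 ] || exit 1
  case $name in
    *[!a-z0-9._-]*) exit 1 ;;            # contains a disallowed character
    [!a-z0-9]*|*[!a-z0-9]) exit 1 ;;     # must start and end with a letter or digit
  esac
  exit 0
)

# Create the bucket in the given project (hypothetical wrapper around `gsutil mb`).
create_bucket() {
  project=$1
  bucket=$2
  is_valid_bucket_name "$bucket" || { echo "invalid bucket name: $bucket" >&2; return 1; }
  gsutil mb -p "$project" "gs://$bucket"
}
```

You could then call create_bucket with your own project ID and bucket name as its two arguments.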

Download Recipe

First, clone the GitHub repository to your local machine and navigate to this recipe:

git clone
cd pinot-recipes/recipes/google-cloud-storage

If you don't have a Git client, you can also download a zip file that contains the code and then navigate to the recipe.

Launch Pinot Cluster

You can spin up a Pinot Cluster by running the following command:

docker-compose up

This command will run a single instance of the Pinot Controller, Pinot Server, Pinot Broker, Kafka, and Zookeeper. You can find the docker-compose.yml file on GitHub.
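The containers can take a little while to become ready, so it may be useful to wait before running the later commands. The sketch below defines a generic retry helper (a hypothetical name, not part of the recipe). The commented example assumes the Pinot Controller answers on localhost:9000, the address used later in this guide, and that it exposes a /health endpoint.

```shell
# Run a command until it succeeds, trying at most `attempts` times and
# sleeping `delay` seconds between tries (hypothetical helper).
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while ! "$@"; do
    # Give up once the allowed number of attempts has been used.
    [ "$n" -ge "$attempts" ] && return 1
    n=$((n + 1))
    sleep "$delay"
  done
  return 0
}

# Example (not run here): wait up to about a minute for the controller.
# retry 30 2 curl -sf http://localhost:9000/health
```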

Controller configuration

We need to provide configuration parameters to the Pinot Controller to configure Google Cloud Storage as the Deep Store. This is done in the following section of the Docker Compose file:

image: apachepinot/pinot:0.10.0
command: "StartController -zkAddress zookeeper-gcs:2181 -config /config/controller-conf.conf"

The configuration is specified in /config/controller-conf.conf, the contents of which are shown below:



Let's go through some of these properties:

  • contains the name of our bucket.
  • contains the name of our GCP project.
  • contains the path to our GCP JSON key file.

You'll need to update the lines that contain the <bucket-name> and <project-id> placeholders:
  • Replace <bucket-name> with the name of your bucket.
  • Replace <project-id> with the ID of your GCP project.

You should also paste the contents of your GCP JSON key file into config/service-account.json.
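Before starting the cluster, a quick sanity check on the key file can catch copy-and-paste mistakes. The following is a rough sketch with a hypothetical function name. It assumes a standard service account key, which contains a "type" of "service_account" plus "client_email" and "private_key" fields, and it only looks for those fields rather than validating the JSON.

```shell
# Return success if the file looks like a service account key (hypothetical helper).
looks_like_service_account_key() {
  file=$1
  [ -s "$file" ] || return 1    # must exist and be non-empty
  grep -q '"type"[[:space:]]*:[[:space:]]*"service_account"' "$file" &&
    grep -q '"client_email"' "$file" &&
    grep -q '"private_key"' "$file"
}

# Example (not run here):
# looks_like_service_account_key config/service-account.json || echo "check the key file"
```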

Pinot Schema and Tables

Now let's create a Pinot Schema and real-time table.

Schema

Our schema is going to capture some simple events, and looks like this:

{
  "schemaName": "events",
  "dimensionFieldSpecs": [
    {
      "name": "uuid",
      "dataType": "STRING"
    }
  ],
  "metricFieldSpecs": [
    {
      "name": "count",
      "dataType": "INT"
    }
  ],
  "dateTimeFieldSpecs": [{
    "name": "ts",
    "dataType": "TIMESTAMP",
    "format" : "1:MILLISECONDS:EPOCH",
    "granularity": "1:MILLISECONDS"
  }]
}

You can create the schema by running the following command:

docker exec -it pinot-controller-gcs bin/ AddSchema   \
-schemaFile /config/schema.json \

Real-Time Table

And the real-time table is defined below:

{
  "tableName": "events",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "ts",
    "schemaName": "events",
    "replication": "1",
    "replicasPerPartition": "1",
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "1"
  },
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "streamConfigs": {
      "streamType": "kafka",
      "": "events",
      "": "kafka-gcs:9093",
      "stream.kafka.consumer.type": "lowlevel",
      "": "smallest",
      "": "",
      "": "",
      "realtime.segment.flush.threshold.rows": "10000",
      "realtime.segment.flush.threshold.time": "1h",
      "realtime.segment.flush.threshold.segment.size": "5M"
    }
  },
  "tenants": {},
  "metadata": {},
  "task": {
    "taskTypeConfigsMap": {
    }
  }
}

The realtime.segment.flush.threshold.rows config is intentionally set to an extremely small value so that the segment will be committed after 10,000 records have been ingested. In a production system this value should be set much higher, as described in the configuring segment threshold guide.
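To get a feel for how the row threshold translates into committed segments, here is a back-of-the-envelope sketch. It assumes a single Kafka partition and that only the row threshold triggers a commit, i.e. the one-hour time threshold is never reached. The function name is hypothetical.

```shell
# Number of segments completed by the row threshold alone (integer division).
completed_segments() {
  rows_ingested=$1
  threshold_rows=$2
  echo $(( rows_ingested / threshold_rows ))
}

# 40,000 ingested rows with a 10,000-row threshold.
completed_segments 40000 10000   # prints 4
```

Under those assumptions, 40,000 ingested records correspond to four completed segments, which lines up with the four segment files in the bucket listing at the end of this guide.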

You can create the table by running the following command:

docker exec -it pinot-controller-gcs bin/ AddTable   \
-tableConfigFile /config/table-realtime.json \

Ingesting Data

Let's ingest data into the events Kafka topic, by running the following:

while true; do
  # Current time in epoch milliseconds (requires GNU date for %N)
  ts=$(date +%s%N | cut -b1-13)
  # Random UUID with the dashes removed (Linux-only source)
  uuid=$(sed 's/-//g' /proc/sys/kernel/random/uuid)
  # Random integer between 0 and 999
  count=$(( RANDOM % 1000 ))
  echo "{\"ts\": \"${ts}\", \"uuid\": \"${uuid}\", \"count\": $count}"
done |
docker exec -i kafka-gcs /opt/kafka/bin/ \
  --bootstrap-server localhost:9092 \
  --topic events
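The loop above depends on Linux-specific sources for the timestamp and UUID. If you want to plug in your own generators, one option is to isolate the message format in a small function, as in the sketch below. The function name is hypothetical, and the values passed in are assumed to contain no characters that need JSON escaping.

```shell
# Print one event as a single JSON line with the ts, uuid, and count fields.
make_event() {
  ts=$1      # epoch milliseconds
  uuid=$2    # identifier without dashes
  count=$3   # integer metric
  printf '{"ts": "%s", "uuid": "%s", "count": %s}\n' "$ts" "$uuid" "$count"
}

# Example with fixed values:
make_event 1651758188443 0f8fad5bd9cb469fa16570867728950e 42
```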

Data will make its way into the real-time table. We can see how many records have been ingested by running the following query:

SELECT count(*)
FROM events

Exploring Deep Store

Now we're going to check what segments we have and where they're stored.

You can get a list of all segments by running the following:

curl -X GET \
"http://localhost:9000/segments/events" \
-H "accept: application/json" 2>/dev/null |
jq '.[] | .REALTIME[]'

The output is shown below:


Let's pick one of these segments, events__0__3__20220505T1343Z, and get its metadata. With the tableName shell variable set to events and segmentName set to the segment's name, run the following:

curl -X GET \
"http://localhost:9000/segments/${tableName}/${segmentName}/metadata" \
-H "accept: application/json" 2>/dev/null |
jq '.'

The output is shown below:

{
  "segment.crc": "532660340",
  "segment.creation.time": "1651758198369",
  "": "gs://pinot-events/events/events__0__3__20220505T1343Z",
  "segment.end.time": "1651758238283",
  "segment.flush.threshold.size": "10000",
  "segment.index.version": "v3",
  "segment.realtime.endOffset": "40000",
  "segment.realtime.numReplicas": "1",
  "segment.realtime.startOffset": "30000",
  "segment.realtime.status": "DONE",
  "segment.start.time": "1651758188443",
  "segment.time.unit": "MILLISECONDS",
  "": "10000"
}

We can see from the entry with the gs:// value that this segment is persisted at gs://pinot-events/events/events__0__3__20220505T1343Z. Let's go back to the terminal and return a list of all the segments in the bucket:

gsutil ls -l gs://${bucketName}/events/

The output is shown below:

    256712  2022-05-05T13:42:07Z  gs://pinot-events/events/events__0__0__20220505T1339Z
    256817  2022-05-05T13:42:32Z  gs://pinot-events/events/events__0__1__20220505T1342Z
    257174  2022-05-05T13:43:15Z  gs://pinot-events/events/events__0__2__20220505T1342Z
    257224  2022-05-05T13:44:05Z  gs://pinot-events/events/events__0__3__20220505T1343Z
TOTAL: 4 objects, 1027927 bytes (1003.83 KiB)
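The object names in the listing appear to follow a table__partition__sequence__creationTime pattern. Assuming that convention holds, the sketch below shows one way to pull a segment name out of a gs:// URL and split it into its parts. Both function names are hypothetical.

```shell
# Extract the object name (last path component) from a gs:// URL.
segment_from_url() {
  printf '%s\n' "${1##*/}"
}

# Print one "__"-separated field of a segment name.
# $1 = segment name, $2 = field number
# (1 = table, 2 = partition, 3 = sequence number, 4 = creation time)
segment_field() {
  printf '%s\n' "$1" | awk -F'__' -v i="$2" '{ print $i }'
}

# Example: sequence number of the segment inspected earlier.
name=$(segment_from_url gs://pinot-events/events/events__0__3__20220505T1343Z)
segment_field "$name" 3
```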