
Setup guide

Instructions as of October 2022

Steps to enable Tiered Storage on an existing table

Step 1

Decide the segment age boundary for splitting across local and remote tier

For example, if you have 30 days of data, if you pick segment age boundary as 7 days, segments less than 7 days old will be on the local tier, and segments older than 7 days will be moved to the remote tier.
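The boundary decision amounts to comparing each segment's age against the cutoff. A minimal sketch (this is illustrative logic, not Pinot code; the function and times are hypothetical):

```python
from datetime import datetime, timedelta

def pick_tier(segment_end_time: datetime, now: datetime, boundary_days: int = 7) -> str:
    """Segments younger than the boundary stay on the local tier; older ones move to remote."""
    return "local" if now - segment_end_time < timedelta(days=boundary_days) else "remote"

now = datetime(2022, 10, 1)
print(pick_tier(now - timedelta(days=3), now))   # recent segment -> local
print(pick_tier(now - timedelta(days=20), now))  # old segment -> remote
```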

Step 2

Choose your tenant setup

  1. Use same tenant as table

You can use the same tenant as the one set in tableConfig->tenants->server, in which case this step is a no-op for you.


Note that while this is a great way to get started quickly, it is not a recommended setup for production scenarios. Processing segments from the cloud object store causes higher-than-usual memory/CPU utilization on the nodes, which might affect the performance of local data queries. If your local data queries serve latency-sensitive, user-facing traffic, we recommend using setup 2.2.

  2. Use a dedicated tenant (Recommended)

You can create a new tenant to serve as compute-only nodes. This tenant will be dedicated to serving the segments on S3, and hence does not need as much local storage as your table's default local tenant.

  • Add servers (you need at least as many as the table's replication factor).
  • Tag these as RemoteTenant_OFFLINE (or any other appropriate tag) using the tag update API.
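Assuming a controller at localhost:9000 and Pinot's instance tag-update endpoint, the tagging call could be built like this sketch (the server instance name is hypothetical):

```python
import urllib.parse
import urllib.request

def build_update_tags_request(controller: str, instance: str, tags: str) -> urllib.request.Request:
    # Pinot controller endpoint: PUT /instances/{instanceName}/updateTags?tags=<tags>
    query = urllib.parse.urlencode({"tags": tags})
    return urllib.request.Request(f"{controller}/instances/{instance}/updateTags?{query}", method="PUT")

req = build_update_tags_request(
    "http://localhost:9000",              # assumed controller address
    "Server_pinot-server-remote-0_8098",  # hypothetical instance name
    "RemoteTenant_OFFLINE")
# urllib.request.urlopen(req)  # uncomment to actually apply the tag
print(req.get_full_url())
```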

Step 3

Add cluster configs and restart servers

Using the POST /cluster/configs API from Swagger, add the following cluster-level tier storage configurations.


In future releases, many of these properties will be made default, and will not have to be set.

"pinot.server.instance.tier.enabled": "true",
"": "24G",
"pinot.server.instance.buffer.reference.manager.ondemand.reserved.size": "4G",
"pinot.helix.instance.state.maxStateTransitions": "20",
"pinot.server.instance.use.local.tier.migrator" : "true",
"pinot.server.instance.local.tier.migrator.schedule.interval.sec" : "3600"


| Property | Description | Default value |
| --- | --- | --- |
| tier.enabled | Config to enable tier storage for the cluster | false |
| ondemand.total.size | Memory available for the OnDemand cache. Ensure this is smaller than the direct memory available on the server (else you'll get OutOfDirectMemory exceptions) | 2G |
| ondemand.reserved.size | A portion of the above total.size. Ensure this is larger than the size of the largest uncompressed segment | 500M |
| pinot.helix.instance.state.maxStateTransitions | A good idea to cap this if you will be doing frequent restarts, re-indexing, or capacity changes | unlimited |
| use.local.tier.migrator | Needed if using the same tenant for local and remote (setup 2.1). This enables the periodic task that moves segments to S3 | false |
| ondemand.use.async.client | Whether to use an async client lib to fetch data, e.g. AsyncS3Client. Usually the sync client works just fine. If there is abundant ondemand space to allow hundreds of concurrent I/O requests, the async client avoids creating too many I/O threads, as controlled by the next config | false |
| ondemand.fetch.worker.threads | Size of the I/O thread pool. With the sync client, threads do the I/O themselves; with the async client, threads process I/O completion callbacks | 5 * numCore |

After adding these configs, restart any existing servers for changes to take effect on them.

Step 4

Update table config

  1. Ensure you have set the timeColumn in the tableConfig.
  2. Ensure you don't have any star-tree index or text index, as these are not yet supported (you can still enable tiered storage for such tables, but these two indexes will be ignored).
  3. Add tierConfigs to the tableConfig. The segmentAge will be calculated using the primary timeColumnName set in the tableConfig:
"tierConfigs": [
"name": "s3Tier",
"segmentSelectorType": "time",
"segmentAge": "7d",
"storageType": "pinot_server",
"serverTag": "RemoteTenant_OFFLINE",
"tierBackend": "s3",
"tierBackendProperties": {
"region": "<s3-region>",
"bucket": "<s3-bucket>"

On adding this configuration, a periodic task will move segments to the right tier as they cross the segmentAge boundary. For setup 2.1, the periodic task is called LocalTierMigrator and runs on the servers (configs to enable it and change its frequency are in Step 3). For setup 2.2, the periodic task is called SegmentRelocator and runs on the controller. Both run at an hourly cadence by default.

| Property | Description |
| --- | --- |
| name | Set this to anything you want; it is used to refer to this tier uniquely. |
| segmentSelectorType | Can be time for moving segments by age, or fixed to move specific segments by name |
| segmentAge | Applicable if segmentSelectorType is time. Set this to the segment age boundary you picked in Step 1. This takes any period string, e.g. 0s, 12h, 30m, 60d |
| segmentList | Applicable if segmentSelectorType is fixed. Set this to a comma-separated list of segments you wish to move to S3. This is a good way to do a quick test. |
| storageType | Has to be pinot_server |
| serverTag | If using setup 2.1, set this to the table's server tag. If using setup 2.2, set this to the tag of the remote tier created in Step 2.2 |
| tierBackend | Has to be s3 |
| tierBackendProperties | Set these as per your S3 details. Supported properties are region, accessKey, secretKey, bucket (can be the same as the deep store bucket, but use pathPrefix to avoid collisions), and pathPrefix (a path prefix to use in the bucket) |

This segmentAge setting is independent of the table's retention. Segments will continue to be removed from Pinot once the table's retention is reached, regardless of which tier the segments are on.
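The segmentAge period strings (e.g. 0s, 30m, 12h, 60d) are plain durations. A minimal parser sketch, for reasoning about boundaries (this is not Pinot's actual implementation):

```python
def parse_period_seconds(period: str) -> int:
    """Parse a period string like '0s', '30m', '12h', or '7d' into seconds."""
    units = {"s": 1, "m": 60, "h": 3600, "d": 86400}
    return int(period[:-1]) * units[period[-1]]

print(parse_period_seconds("7d"))   # 604800
print(parse_period_seconds("12h"))  # 43200
```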

Step 5

Upload new segments + migrate existing segments

New segments - When you upload segments, they first land on the DefaultTenant_OFFLINE servers, and are only periodically moved to the new tenant and S3 by the periodic tasks.

Existing segments - When the periodic task runs, it will also move the existing segments over to the right tier.

Step 6

Validate and Query

After uploading segments, for those that moved to the remote tenant, verify the following:

  1. Segments are not present on the pinot-server’s local disk (in dataDir and indexDir)
  2. Segments are present in the S3 bucket used, in uncompressed form

When querying, increase the timeout of your query, as fetching segment data from S3 adds latency.
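The timeout can be raised per query via Pinot's timeoutMs query option. A sketch of the broker request payload (the broker address and table name are assumptions):

```python
import json

def build_query_payload(sql: str, timeout_ms: int) -> str:
    # The timeoutMs query option overrides the default query timeout for this query
    return json.dumps({"sql": sql, "queryOptions": f"timeoutMs={timeout_ms}"})

# POST this payload to the broker's /query/sql endpoint
payload = build_query_payload("SELECT COUNT(*) FROM myTable", 60000)  # hypothetical table
print(payload)
```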

Operational considerations

  1. Restart/reload of your server will take much longer than it did before enabling this feature, depending on the number of segments on S3, because loading a segment on the server requires a call to S3.
  2. Make sure to set pinot.helix.instance.state.maxStateTransitions if doing any schema evolution. A schema change will result in the servers downloading the segments for reprocessing.
  3. As of now, we do not have the ability to automatically delete segments from S3 after retention/deletion. This is different from deep store segments, which continue to follow the segment lifecycle as usual.

Fine tunings

S3 client configs

Sometimes you may need to fine-tune the S3 client, for example to tolerate longer request latency or to use fewer threads for I/O. Below is a list of current configs to customize S3 clients when fetching data from remote segments. All of these configs go inside tierBackendProperties. Most of them require a table reload to take effect. Restarting servers also works; overall, we would like to make configs apply more dynamically at runtime.

We support both sync and async S3 clients. By default, the sync S3 client is used, and a thread pool is created to fetch data via the sync S3 clients in parallel without blocking the query processing threads. The configs for the sync S3 client are prefixed with s3client.http. The async S3 client can greatly reduce the size of the aforementioned thread pool: it uses async I/O under the hood to fetch data in parallel in a non-blocking manner, and uses the thread pool only to process I/O completion callbacks. The configs for the async S3 client are prefixed with s3client.asynchttp. Configs prefixed with s3client.general apply to both kinds of clients.

In general, we try to expose the configs supported by S3Client or AsyncS3Client as directly and completely as possible, and leave their default values and behaviors as is. Depending on the use case, such as how many segments may be accessed during a query execution, maxConnections or maxConcurrency might be increased, together with the size of the I/O thread pool controlled by ondemand.fetch.worker.threads explained earlier, in order to reduce query latency. The request timeout configs are usually left at their default values, tolerating delays to some extent; they can be decreased to fail queries faster and be more responsive, at the cost of more query failures.
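Putting this together, a tuned tierBackendProperties block might look like the following sketch. Every value here (region, bucket, connection count) is an illustrative assumption, not a recommendation:

```python
# Illustrative tierBackendProperties with S3 client tuning for a workload that
# fetches many segments concurrently; values are examples, not recommendations.
tier_backend_properties = {
    "region": "us-east-1",                  # example region
    "bucket": "my-tiered-storage-bucket",   # example bucket
    "s3client.http.maxConnections": "200",  # raise to allow more parallel fetches
    "s3client.general.numRetries": "3",     # retries with exponential backoff and jitter
    "s3client.general.readBufferSize": "8KB",
}
print(sorted(tier_backend_properties))
```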

| Config | Description | Default value |
| --- | --- | --- |
| s3client.general.apiCallTimeout | Timeout for the e2e API call, which may involve a few http requests | no limit |
| s3client.general.apiCallAttemptTimeout | Timeout for a single http request | no limit |
| s3client.general.numRetries | How many times to retry the http request, with exponential backoff and jitter | 3 |
| s3client.general.readBufferSize | Size of the buffer used to stream data into buffers for query processing | 8KB |
| Config | Description | Default value |
| --- | --- | --- |
| s3client.http.maxConnections | The max number of connections to pool for reuse | 50 |
| s3client.http.socketTimeout | Timeout to read/write data | 30s |
| s3client.http.connectionTimeout | Timeout to establish a connection | 2s |
| s3client.http.connectionTimeToLive | Timeout to reclaim an idle connection | 0s |
| s3client.http.connectionAcquireTimeout | Timeout to get a connection from the pool | 10s |
| s3client.http.connectionMaxIdleTimeout | Timeout to mark a connection as idle, making it a candidate for reclamation | 60s |
| s3client.http.reapIdleConnections | Whether to reclaim idle connections in the pool | true |
| Config | Description | Default value |
| --- | --- | --- |
| s3client.asynchttp.maxConcurrency | The max number of concurrent requests allowed. We use HTTP/1.1, so this controls the max connections. | 50 |
| s3client.asynchttp.readTimeout | Timeout to read data | 30s |
| s3client.asynchttp.writeTimeout | Timeout to write data | 30s |
| s3client.asynchttp.connectionTimeout | Timeout to establish a connection | 2s |
| s3client.asynchttp.connectionTimeToLive | Timeout to reclaim an idle connection | 0s |
| s3client.asynchttp.connectionAcquireTimeout | Timeout to get a connection from the pool | 10s |
| s3client.asynchttp.connectionMaxIdleTimeout | Timeout to mark a connection as idle, making it a candidate for reclamation | 60s |
| s3client.asynchttp.reapIdleConnections | Whether to reclaim idle connections in the pool | true |



Metrics

Below are the metrics we emit to help understand how tiered storage works. The metrics mainly cover: 1) how segments get uploaded to the tier backend, e.g. data volume, operation duration, rates, and failures; 2) how queries fetch data from remote segments, e.g. data volume, query duration breakdown over query operators, query rates, and failures; 3) how segment data is kept on servers temporarily, e.g. data volume from segments temporarily held in memory or on disk.

| Metric | Description |
| --- | --- |
| pinot_server_s3Segments_UploadedCount | How many segments were uploaded to the S3 tier backend when reloading segments. |
| pinot_server_s3Segments_UploadedSizeBytes | How many bytes in total were uploaded to the S3 tier backend when reloading segments. |
| pinot_server_s3SegmentUploadTimeMs | How long segment uploading takes when reloading segments. |
| pinot_server_s3Segments_DownloadedCount | How many segments were downloaded to the server when reloading segments. |
| pinot_server_s3Segments_DownloadedSizeBytes | How many bytes in total were downloaded to the server when reloading segments. |
| pinot_server_s3SegmentDownloadTimeMs | How long segment downloading takes when reloading segments (this is not on the query path). |
| pinot_server_totalPrefetch_99thPercentile/OneMinuteRate | Rate of, and time spent, kicking off data prefetching; should be low, as prefetching is done async. Also broken down per cache layer: mmapCacheLayer and ondemandCacheLayer. |
| pinot_server_totalAcquire_99thPercentile/OneMinuteRate | Rate of, and time spent, waiting for segment data to become available; may vary based on the effectiveness of prefetching. Also broken down per cache layer: mmapCacheLayer and ondemandCacheLayer. |
| pinot_server_totalRelease_99thPercentile/OneMinuteRate | Rate of, and time spent, releasing segment data; should be low. Also broken down per cache layer: mmapCacheLayer and ondemandCacheLayer. |
| pinot_server_bufferPrefetchExceptions_OneMinuteRate | Error rate when prefetching segment data. Also broken down per cache layer: mmapCacheLayer and ondemandCacheLayer. |
| pinot_server_bufferAcquireExceptions_OneMinuteRate | Error rate when waiting to acquire segment data. Also broken down per cache layer: mmapCacheLayer and ondemandCacheLayer. |
| pinot_server_bufferReleaseExceptions_OneMinuteRate | Error rate when releasing prefetched segment data. Also broken down per cache layer: mmapCacheLayer and ondemandCacheLayer. |
| pinot_server_ondemandCacheLayerReserve_99thPercentile | Time prefetching threads wait for ondemand memory space to become available to fetch data. If this is high, configure more ondemand memory or adjust the query to access fewer segments, e.g. by making query predicates more selective. |
| pinot_server_ondemandCacheLayerBufferDownloadBytes_OneMinuteRate | Byte rate of data prefetching. |
| pinot_server_ondemandCacheLayerSingleBufferFetchersCount_OneMinuteRate | Number of column index pairs being prefetched. |

Other than the metrics emitted from the tiered storage modules, CPU utilization and network bandwidth utilization are also very important for fine-tuning the system. The set of metrics is updated frequently as we enhance the system. You can explore the tiered storage metrics via Grafana.

Coming soon

  1. GCS support
  2. Support for cleanup of segments in S3 past the table retention
  3. More optimizations for query latency
  4. Star-tree index support