Release version 0.10.1: February 2025
Key Product Features
Pauseless Ingestion (Beta)
This feature makes real-time ingestion in Pinot truly pauseless, further improving data freshness. In the original design, there was a small pause in real-time ingestion (at the partition level) while the consuming segment was being committed.
Pauseless consumption removes this constraint, allowing consumption of the next segment to begin while the current consuming segment is still being committed.
Current caveats: Dedup and partial upserts are not yet supported (support is planned for a future release).
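A minimal sketch of how this could look, assuming pauseless consumption is toggled per table under the stream ingestion section of the table config; the flag name pauselessConsumptionEnabled shown below is an assumption, so confirm the exact key in the pauseless ingestion documentation for your release:
{
  "ingestionConfig": {
    "streamIngestionConfig": {
      "pauselessConsumptionEnabled": true
    }
  }
}
Given the caveat above, tables that use dedup or partial upserts should keep this disabled for now.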
Alter Table Task (ATT / SRT) (Beta)
StarTreeAlterTableTask is an improved alternative to SegmentRefreshTask, designed for more efficient, batch-wise segment refreshes.
In SegmentRefreshTask, the map phase and segment generation are local, which leads to inefficiencies from redundant segment creation and merging. StarTreeAlterTableTask optimizes this by writing all mapper output files to a single location, so segments can be generated directly from that output. This eliminates the extra merge step, reducing overhead and improving performance.
The task follows a structured Map → Reduce → Upload workflow, ensuring better resource utilization, fault tolerance, and deep store integration. With built-in retries that dynamically adjust parameters, it provides a more scalable and reliable approach for large table refreshes.
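As an illustrative sketch rather than the definitive configuration, the task would typically be scheduled through the table's task config like other Pinot minion tasks; the cron schedule below is an arbitrary example and any task-specific parameters are omitted:
{
  "task": {
    "taskTypeConfigsMap": {
      "StarTreeAlterTableTask": {
        "schedule": "0 0 */6 * * ?"
      }
    }
  }
}
Once scheduled, the controller periodically generates the task and minion workers execute the Map → Reduce → Upload phases described above.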
CTE Reuse (Alpha/Experimental)
In the Multi-Stage Query Engine (MSQE), there are several cases in which the same table is scanned twice or the same join is executed more than once. Some of these parts can be quite expensive (eg: a table scan), which can effectively double the cost of the query.
This feature allows parts of the query subtree to be reused, resulting in optimized query execution. More details in this PR.
The feature needs to be enabled with a hint, described in this PR. In the example below, userAttributes is joined twice; with spools enabled, its subtree can be computed once and reused instead of being executed twice.
Example:
SET useSpools = true;
SELECT *
FROM userAttributes AS a1
JOIN userGroups AS a2
  ON a1.userUUID = a2.userUUID
JOIN userAttributes AS a3
  ON a1.userUUID = a3.userUUID
LIMIT 10
Snapshot support for offheap Dedup in real-time tables
Improved the offheap deduplication feature in Pinot real-time tables by adding support for snapshots. This refers to the metadata snapshot that can be preloaded into the server, dramatically improving restart time. These snapshots are handled by a minion task called DedupSnapshotCreationTask. For more details, see the documentation.
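A minimal sketch of a real-time table config fragment with dedup enabled and the snapshot task scheduled; dedupConfig and taskTypeConfigsMap are standard table config sections, the cron schedule is an arbitrary example, and the offheap/snapshot-specific dedup options are omitted here, so refer to the linked documentation for the exact settings:
{
  "dedupConfig": {
    "dedupEnabled": true,
    "hashFunction": "NONE"
  },
  "task": {
    "taskTypeConfigsMap": {
      "DedupSnapshotCreationTask": {
        "schedule": "0 0 * * * ?"
      }
    }
  }
}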
New Stability Features
- Group by trimming in MSQE: Parity with the v1 engine for group by trimming support (needs to be explicitly configured in the query).
- Rebalance Hardening:
  - Added a dry-run mode with a summary that provides important stats such as the total segments to be moved, the number of servers involved, and so on (see this PR for details).
  - Added several pre-checks (eg: minimizeDataMovement, needsReload).
- Correctly handle the case where the number of partitions in the input stream (eg: Kafka) has been reduced since the real-time table was created.
- Trigger an alert if the streaming consumer is not updating the ingestion delay metric correctly. This will help detect data freshness issues proactively.
- Bug fixes:
  - Ensure segments in the deep store are cleaned up properly (post retention).
  - Prevent creating tables with a replication factor greater than the number of available servers. This was allowed earlier, which resulted in ingestion pause issues.
Default Configuration Flags for Pinot
Configuration | Old Value | New Value | Description | Impact |
---|---|---|---|---|
Server starter Freshness checker | Disabled | Enabled | Enable the freshness checker by default. The server will wait up to the specified timeout to catch up with the corresponding real-time stream so that freshness is below a certain threshold. | Prevents servers from starting to serve queries before they are fully caught up, avoiding stale results. One caveat is that this will lead to a longer server restart time. |
Server storage capacity (to pause ingestion) | < None > | 95% of server storage capacity | Block ingestion for all relevant tables when a given server in the tenant reaches 95% storage capacity. | Prevents servers from completely running out of disk space due to ingestion. |
V4 Forward Index | rawIndexWriterVersion = 2 | rawIndexWriterVersion = 4 | Enable the v4 forward index by default, which is more scalable. The earlier version (v2) had a hard limit on the number of documents per segment. | Prevents segment build failures (eg: out of bounds exceptions), especially in cases where there are a lot of values per segment. In addition, v4 is also more memory efficient (direct memory). |
Index rebuild throttle (Server level) | < None > | Index throttle: 25% of CPU cores; StarTree index throttle: 1 | Controls how many indexes can be built in parallel on the server (eg: during startup). A separate limit is imposed for the StarTree index compared to other indexes. Note: All the configs related to segment level throttling can be changed dynamically by updating the cluster config (see the documentation). | These limits help avoid excessive load on the Pinot server during segment rebuilds. One caveat is that this will lead to a longer server restart time. |
Max parallel reloads (Server level) | 1 | Max refresh threads: max(10, 25% of CPU cores) | Controls how many parallel segment reloads can be done on a given server. | This limit has been raised, which will speed up the total segment reload time. However, limits have also been added on how fast indexes can be built during reload. In general, with all these settings, server restart time will be longer. |
Segment download throttle (Server level) | < None > | 25% of CPU cores | Controls how many segments can be downloaded in parallel on a given server. Works in conjunction with the existing table level throttle config (pinot.server.instance.table.level.max.parallel.segment.downloads). Note: All the configs related to segment level throttling can be changed dynamically by updating the cluster config (see the documentation). | This helps prevent an excessive number of segments being downloaded on a given server, thus preventing failures. One caveat is that this will lead to a longer server restart time. |
Zookeeper Configuration Changes
Configuration | Old Value | New Value | Description | Impact |
---|---|---|---|---|
Limit state transitions per instance ZK | 100000 | 1000 | Whenever a Helix state transition is sent to a server, writes are performed on ZooKeeper (current state Znode update and new state transition Znode). This config limits the number of state transitions sent to any instance at once, which helps limit the number of state transition Znodes and current state updates at any given time. | Helps throttle the writes to ZooKeeper during operations on tables with a large number of segments. This reduces the load on ZooKeeper and hence reduces session timeouts. |
Compression threshold | 1 MB | 30 KB | The size threshold beyond which Znodes will be compressed before being written to ZK. | Reduces the bytes written to the ZK log, especially useful in clusters where a large number of segments exist. This in turn, reduces the load on Zookeeper. |
ZK Session timeout (ZK Client) | 30 seconds | 300 seconds | The period after which a client will time out when establishing a session with ZK. | ZK client session timeouts can be very expensive. This increase in the timeout will help avoid them, especially when ZooKeeper is under heavy load. |
PV Size increase | 100 GB | 200 GB | The size of the disk attached to every ZooKeeper instance. | A larger disk allows more room for ZooKeeper logs to grow and prevents outages caused by the disk filling up. |
jute.maxbuffer | 10 MB | 60 MB | The maximum amount of data the ZK client can read or write for a single Znode. Increasing this helps in cases where there are a large number of segments and prevents ZK client exceptions. | No negative impact - it will help in use cases with a large number of segments, for example, in the case of tiered storage. |
Default Configuration Flags for Cluster
- publishNotReadyAddresses is set to True by default. This will help address DNS resolution issues during restarts/upgrades.