Release version 0.8.0: February 2024

Apache Pinot updates since last StarTree release

For details on Pinot changes, see Releases (opens in a new tab).

Permit defining NULL handling at the table level or at each individual column level. Link (opens in a new tab)
Add lastUsed option in resumeConsumption endpoint in the broker API to improve UX. Link (opens in a new tab)
Improve ingestion validation with TimeValidationTransformer to mark a record as invalid if the primary time column is out of range (1971 inclusive to 2071 exclusive). Link (opens in a new tab)
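The range check described above can be sketched as follows. This is a minimal illustration, not Pinot's implementation: it assumes the time value is expressed in epoch milliseconds and that both bounds fall on January 1 in UTC. The constant and function names are hypothetical.

```python
from datetime import datetime, timezone

# Assumed bounds: 1971 inclusive to 2071 exclusive, interpreted as
# January 1 00:00 UTC of each year and converted to epoch milliseconds.
MIN_VALID_MILLIS = int(datetime(1971, 1, 1, tzinfo=timezone.utc).timestamp() * 1000)
MAX_VALID_MILLIS = int(datetime(2071, 1, 1, tzinfo=timezone.utc).timestamp() * 1000)


def is_valid_time_millis(ts_millis: int) -> bool:
    """Return True when the timestamp falls in [1971, 2071)."""
    return MIN_VALID_MILLIS <= ts_millis < MAX_VALID_MILLIS


if __name__ == "__main__":
    # Epoch 0 (1970) is outside the range, so such a record would be marked invalid.
    print(is_valid_time_millis(0))
    print(is_valid_time_millis(MIN_VALID_MILLIS))
```

A check like this catches common ingestion mistakes, such as a time column populated with seconds when milliseconds are expected.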
Update the tables API endpoint to list only the dimension tables by specifying dimension as the type, for example:
```
curl -X 'GET' \
    'http://localhost:9000/tables?type=dimension' \
    -H 'accept: application/json'
```
Link (opens in a new tab)
Enhance consuming segment handling to avoid an under-counting error with upsert tables. Link (opens in a new tab)
Improve the logic for taking snapshots by making them more atomic and in an order that permits correct table preloading. Link (opens in a new tab)
Add the following new metrics:
- pinot_server_tableRebalanceInProgress_Value{table=${tableName},tableType=${tableType}} indicates whether a table is being rebalanced. A value of 1 indicates that rebalancing is in progress, and 0 indicates that it is not. Link (opens in a new tab)
- pinot_server_tableDisabled_Value{table=${tableName},tableType=${tableType}} indicates whether a table is disabled. A value of 1 indicates that the table is disabled, and 0 indicates that it is not. Link (opens in a new tab)
- pinot_server_tableConsumptionPaused_Value{table=${tableName},tableType=${tableType}} indicates whether table consumption is paused. A value of 1 indicates that consumption is paused, and 0 indicates that it is not. Link (opens in a new tab)
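As a rough illustration of how these 0/1 gauges might be read, the sketch below parses Prometheus text-format lines and maps each table to a boolean. The sample lines, table names, and the `table_flags` helper are hypothetical; only the metric names and the `table` and `tableType` labels come from the list above.

```python
import re

# Matches lines of the form: metric_name{label="value",...} 1.0
_LINE = re.compile(r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def table_flags(exposition_text: str, metric_name: str) -> dict:
    """Map table name -> True when the gauge for that table equals 1."""
    flags = {}
    for line in exposition_text.splitlines():
        match = _LINE.match(line.strip())
        if not match or match.group("name") != metric_name:
            continue
        labels = dict(_LABEL.findall(match.group("labels")))
        flags[labels.get("table")] = float(match.group("value")) == 1.0
    return flags


# Made-up sample output, for illustration only.
sample = """
pinot_server_tableConsumptionPaused_Value{table="orders",tableType="REALTIME"} 1.0
pinot_server_tableConsumptionPaused_Value{table="clicks",tableType="REALTIME"} 0.0
pinot_server_tableDisabled_Value{table="orders",tableType="REALTIME"} 0.0
"""

if __name__ == "__main__":
    print(table_flags(sample, "pinot_server_tableConsumptionPaused_Value"))
```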
Add a set of catch-all regexes for JMX -> Prometheus Exporter for when a regex used does not match a metric. Link (opens in a new tab)
Add compression configuration for aggregation in a star-tree index. Link (opens in a new tab)
Add a new flag to indicate whether the query result is partial or full. Link (opens in a new tab)
Add DATETIMECONVERTWINDOWHOP transformation function. Link (opens in a new tab)
Enable tracking of out-of-order events in an upsert-enabled table using a new configuration, outOfOrderRecordColumn. Link (opens in a new tab)
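To make the idea of an out-of-order event concrete, here is a small sketch that flags records arriving with an older comparison value than one already seen for the same primary key. It is a simplified model, not Pinot's implementation: the rule used (strictly lower comparison value means out of order), the `flag_out_of_order` helper, and the field names are all assumptions made for illustration.

```python
def flag_out_of_order(records, key_field, comparison_field):
    """Return one boolean per record: True when the record arrived out of order."""
    latest = {}  # primary key -> highest comparison value seen so far
    flags = []
    for record in records:
        key = record[key_field]
        value = record[comparison_field]
        out_of_order = key in latest and value < latest[key]
        if not out_of_order:
            latest[key] = value
        flags.append(out_of_order)
    return flags


if __name__ == "__main__":
    events = [
        {"id": "a", "ts": 10},
        {"id": "a", "ts": 5},   # older than the previous event for "a"
        {"id": "b", "ts": 1},
        {"id": "a", "ts": 12},
    ]
    print(flag_out_of_order(events, "id", "ts"))
```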
Enable support for leveraging a star-tree index in conjunction with filtered aggregations, including filtered group-by aggregations. Link (opens in a new tab)
Add a new MV dictionary-encoded forward index format that only stores the unique MV entries, reducing storage footprint for indexes. Link (opens in a new tab)
Introduce low disk mode to table rebalance, which is set to false by default. When set to true, the server will first offload segments before loading the new segments during rebalance. Link (opens in a new tab)
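A rebalance request with low disk mode turned on might be assembled as in the sketch below. The `lowDiskMode` and `type` query parameter names, the `/tables/{tableName}/rebalance` path, and the controller address are assumptions not stated in this release note, so verify them against your controller's API documentation. The `rebalance_url` helper is hypothetical.

```python
from urllib.parse import quote, urlencode


def rebalance_url(controller: str, table: str, table_type: str, low_disk_mode: bool = False) -> str:
    """Build the URL for a table rebalance request (assumed endpoint and parameter names)."""
    params = {
        "type": table_type,
        # Defaults to false, matching the default described above.
        "lowDiskMode": str(low_disk_mode).lower(),
    }
    return f"{controller}/tables/{quote(table, safe='')}/rebalance?{urlencode(params)}"


if __name__ == "__main__":
    print(rebalance_url("http://localhost:9000", "myTable", "OFFLINE", low_disk_mode=True))
```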
Introduce a new configuration, controller.realtime.segment.deepStoreUploadRetry.parallelism (default: 1), to increase the size of the thread pool used for retrying segment uploads. The upload retry is also now an asynchronous operation. Link (opens in a new tab)
Enable SegmentGenerationAndPushTask to push segments to a realtime table, which supports bootstrapping an upsert-enabled table. Link (opens in a new tab)
Add the ability to specify a custom Lucene analyzer used by text index for indexing and search on an individual column basis. Link (opens in a new tab)
Add murmur3 support as a partition function.
Enhance DistinctCountThetaSketch aggregation function by adding new parameters to give the end-user more control over how sketches are aggregated at query time. Link (opens in a new tab)
Add new configuration in upsert, deletedKeysTTL, which when set will remove deleted keys and mark the validDocID as invalid after the deletedKeysTTL threshold period, improving memory utilization. Link (opens in a new tab)
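The following sketch shows roughly where such a setting could sit in a table configuration. Only the `deletedKeysTTL` key comes from the note above. Its placement under `upsertConfig`, the `mode` and `deleteRecordColumn` keys, the column name, and the value are assumptions for illustration, and the unit of the TTL is not stated here.

```python
import json

# Hypothetical upsert section of a table config.
upsert_config = {
    "mode": "FULL",
    "deleteRecordColumn": "deleted",  # hypothetical column marking deleted records
    "deletedKeysTTL": 86400,          # illustrative value; unit not stated in this note
}

table_config_fragment = {"upsertConfig": upsert_config}

if __name__ == "__main__":
    print(json.dumps(table_config_fragment, indent=2))
```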
Add support for vector index using Hierarchical Navigable Small World (HNSW). Link (opens in a new tab)
Add the ability to initialize broker tags from configuration and automatically update the broker resource when a broker joins the cluster for the first time. Link (opens in a new tab)
Enable partition level force-commit functionality, expanding the endpoint to accept a comma-separated list of partitions or consuming segment names. Link (opens in a new tab)
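A partition-level force-commit call might be built as sketched here. The `/tables/{tableName}/forceCommit` path and the `partitions` and `segments` query parameter names are assumptions not given in this note, and `force_commit_url` is a hypothetical helper. The comma-separated format follows the description above.

```python
from urllib.parse import quote, urlencode


def force_commit_url(controller: str, table: str, partitions=None, segments=None) -> str:
    """Build a force-commit URL, optionally limited to partitions or consuming segments."""
    params = {}
    if partitions:
        params["partitions"] = ",".join(str(p) for p in partitions)
    if segments:
        params["segments"] = ",".join(segments)
    url = f"{controller}/tables/{quote(table, safe='')}/forceCommit"
    # Keep commas unescaped so the list stays readable in the query string.
    return f"{url}?{urlencode(params, safe=',')}" if params else url


if __name__ == "__main__":
    print(force_commit_url("http://localhost:9000", "events", partitions=[0, 3]))
```

Omitting both arguments yields a URL with no query string, which corresponds to a table-wide force commit.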
The following updates are specific to the multi-stage query engine:
- Optimize partition-based query performance when using the multi-stage query engine. The engine is now able to determine table partitioning and apply the best data shuffle mechanism automatically. Link (opens in a new tab)
- Enable the multi-stage query engine to run multiple operator chains, provided there is no requirement for distributed data shuffling. Link (opens in a new tab)
- Enable multiple SEMI-JOINs in the multi-stage query engine to use index lookup within the same node for a left-table scan. Link (opens in a new tab)
- Add support in the multi-stage query engine for early termination and for returning errors, warnings, or stats directly. Link (opens in a new tab)
Bug fix: Segments created in realtime tables are guided by the parameter realtime.segment.flush.threshold.segment.size if it is set. Link (opens in a new tab)

StarTree Cloud

StarTree Extensions for Apache Pinot

Enable bootstrapping of upsert-enabled tables by supporting batch ingestion using fileingestiontask into a realtime table.
StarTree Upsert is now on by default for all StarTree deployments, providing enhanced scalability and stability when using upsert.
- Improved server restart times when using StarTree upserts.
- Ability to take snapshots for improved recoverability of upsert tables.
Provide visibility into the health of various components (Server, Broker, Controller, Tables, etc.) using the Cluster Health Dashboard in Pinot Control Panel. The dashboard is updated every 20 minutes, and an update can be triggered on demand by using the /periodictask/run API call.
Gate access to Pinot tables using a new Role-Based Access Control (RBAC) system. Roles can be assigned to individual users, IdP groups, or Pinot API tokens. Access can be controlled at table-level granularity, and specific APIs can be allowed or denied on Pinot clusters and tables. This feature is an alpha release.

Data Manager

Add Custom Connector option that lets you create a dataset using a JSON connection configuration to a Google Cloud Storage (GCS) data source.
Enhance interface to select a directory or multiple directories in an AWS S3 bucket.
Add SSL certificate support for Kafka. Now, you can enter details to connect with Kafka under SSL Authentication Type in Kafka Source.
Enhance Delta Lake connector to support IAM role access in AWS.

ThirdEye

Link anomaly alerts to PagerDuty for instant notification and efficient incident management. Link (opens in a new tab)
Enable customizable bounds for precise anomaly detection, enhancing decision-making accuracy. Link (opens in a new tab)
Protect sensitive information with automated data masking during the automated anomaly detection alert creation process. Link (opens in a new tab)
UI/UX improvements
- View related tasks for each alert, including success, failure, and access to logs for troubleshooting.
- View a list of notifications sent per subscription group for better insight into alert distribution.
- Easily identify which subscription groups are receiving specific alerts.
