Release Version 0.4.0: April~May 2022
Allow pushing segments to real-time table
Users can now push segments to a real-time table, thus simplifying onboarding when ingesting from a hybrid source (real-time and offline). This makes it very easy to bootstrap or backfill a real-time table. For more details, see the GitHub issue.
Deduplication support in real-time Pinot table
Added the ability to remove duplicates in the streaming data sources based on a primary key. For more information, see the Stream Ingestion with Dedup documentation.
Server Failure Detector
Added a new Failure Detector module in the Pinot Broker that can take failed servers out of rotation in order to prevent further query failures. For more details, see the GitHub issue.
Minion observability enhancements
Added health endpoint for minions for proactively identifying ingestion issues (offline ingestion). For more details, see github.com/apache/pinot/pull/8669.
New ingestion minion tasks/endpoints (metadata) to enable ease of debugging for users. For more details, see github.com/apache/pinot/pull/8551.
Added smart functions to automatically switch to approximate data structure when cardinality is high for DISTINCT_COUNT and PERCENTILE.
Add the broker config pinot.broker.use.approximate.function to turn the feature on (it is off by default). Add the query config useApproximateFunction to override the broker-level config. For more details, see github.com/apache/pinot/pull/8189.
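As a rough illustration of the query-level override, the sketch below builds a request body that carries the useApproximateFunction option. The "sql" and "queryOptions" field names, the table name events, and the column userId are assumptions made for the example; check the linked pull request for the exact request format.

```python
import json


def build_query_payload(sql: str, use_approximate: bool) -> dict:
    """Build a JSON body for a broker SQL query (field names are assumed)."""
    payload = {"sql": sql}
    if use_approximate:
        # Query-level option named in the release note; it overrides the
        # broker-level pinot.broker.use.approximate.function setting.
        payload["queryOptions"] = "useApproximateFunction=true"
    return payload


# Hypothetical table and column, used only for illustration.
payload = build_query_payload("SELECT DISTINCT_COUNT(userId) FROM events", True)
print(json.dumps(payload, indent=2))
```

Leaving use_approximate set to False produces a body without the option, so the broker-level default applies.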
Added support to configure a new TIMESTAMP index on columns of type TIMESTAMP. This will automatically pre-aggregate column values based on the specified time granularities. For more details, see docs.pinot.apache.org/basics/indexing/timestamp-index
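A minimal sketch of what such a table config fragment might look like is shown below, built as a Python dictionary and printed as JSON. The key names (fieldConfigList, indexTypes, timestampConfig, granularities) reflect our reading of the linked documentation and should be verified there; the column name eventTime is hypothetical.

```python
import json

# Hypothetical TIMESTAMP column with three pre-aggregation granularities.
timestamp_field_config = {
    "name": "eventTime",
    "encodingType": "DICTIONARY",
    "indexTypes": ["TIMESTAMP"],
    # Assumed key names; confirm against the timestamp-index documentation.
    "timestampConfig": {"granularities": ["DAY", "WEEK", "MONTH"]},
}

table_config_fragment = {"fieldConfigList": [timestamp_field_config]}
print(json.dumps(table_config_fragment, indent=2))
```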
Spark 3.x support
Added support for running offline Pinot ingestion jobs in Spark 3.x. For more details, see github.com/apache/pinot/pull/8560
Real-Time text search support
Added support for Mutable FST Index that enables text search use cases on real-time data. The older Lucene indexes were created on segment flush and hence not available for the most recent data hosted in consuming segments. For more details, see github.com/apache/pinot/pull/8861
Distinct on Multi-value columns
Added support for using the DISTINCT query operator on multi-value columns. For more details, see github.com/apache/pinot/issues/8850.
Enhanced aggregation support during ingestion
Ingestion Pre-Aggregation is now supported for MIN, MAX, and COUNT, in addition to SUM.
To enable the feature, add an aggregationConfig to the ingestionConfigs of a real-time table config.
The destColumn must be in the schema and the srcColumn must not be in the schema. Additionally, all destColumns must be noDictionaryColumns. For more details, see github.com/apache/pinot/pull/8611
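The sketch below checks the three rules stated above and then renders the aggregations as a config fragment. The JSON key names (aggregationConfigs, columnName, aggregationFunction) are assumptions based on our reading of the linked pull request, and all column names are hypothetical.

```python
import json


def validate_aggregations(aggregations, schema_columns, no_dictionary_columns):
    """Return a list of violations of the rules stated in the release note."""
    errors = []
    for dest, _function, src in aggregations:
        if dest not in schema_columns:
            errors.append(f"{dest}: destColumn must be in the schema")
        if src in schema_columns:
            errors.append(f"{src}: srcColumn must not be in the schema")
        if dest not in no_dictionary_columns:
            errors.append(f"{dest}: destColumn must be a noDictionaryColumn")
    return errors


def to_aggregation_configs(aggregations):
    # Assumed key names; verify against the pull request before use.
    return [
        {"columnName": dest, "aggregationFunction": f"{function}({src})"}
        for dest, function, src in aggregations
    ]


# (destColumn, function, srcColumn) triples with hypothetical names.
aggregations = [("totalAmount", "SUM", "amount"), ("maxAmount", "MAX", "amount")]
schema_columns = {"customerId", "totalAmount", "maxAmount"}
no_dictionary_columns = {"totalAmount", "maxAmount"}

errors = validate_aggregations(aggregations, schema_columns, no_dictionary_columns)
configs = to_aggregation_configs(aggregations)
print(errors)
print(json.dumps({"aggregationConfigs": configs}, indent=2))
```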
Added a new function to enable users to fill gaps in timeseries data using previous or default values. For more details, see docs.pinot.apache.org/users/user-guide-query/gap-fill-functions
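To make the idea concrete, here is a conceptual Python sketch of gap filling over fixed time buckets. It illustrates the two fill strategies only and is not the Pinot function or its SQL syntax; the function name fill_gaps and its parameters are hypothetical.

```python
def fill_gaps(points, start, end, step, mode="previous", default=0):
    """Return (timestamp, value) pairs for every bucket in [start, end).

    points maps a bucket timestamp to its observed value. Missing buckets
    take the last observed value when mode is "previous"; otherwise, or
    when nothing has been observed yet, they take the default value.
    """
    filled = []
    last = None
    for ts in range(start, end, step):
        if ts in points:
            last = points[ts]
            filled.append((ts, last))
        elif mode == "previous" and last is not None:
            filled.append((ts, last))
        else:
            filled.append((ts, default))
    return filled


observed = {0: 5, 2: 7}  # buckets 1 and 3 are missing
print(fill_gaps(observed, 0, 4, 1, mode="previous"))
print(fill_gaps(observed, 0, 4, 1, mode="default"))
```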
Support for building all indexes in batch ingestion job
Ability to create all indexes during segment generation, reducing the processing during segment load on the server. For more details, see github.com/apache/pinot/issues/8334
StarTree Extensions for Apache Pinot
Available only in StarTree Cloud
Added a new endpoint that lets external services such as Presto connect securely to internal Pinot servers (not exposed outside the Kubernetes cluster). For more details, refer to this doc.
Offline Ingestion: Auto Infer source partition column on sub directory
Added the ability to derive columns in the Pinot schema from the source file path. This is very useful when the source directory is partitioned on a dimension (e.g., time, with day as the smallest bucket). The partition column present in the file path is then automatically treated as one of the Pinot columns. For more details, refer to this doc.
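The following sketch shows the general idea of reading a partition value out of a file path. It assumes a key=value directory naming convention, which is an assumption made for illustration and may differ from the layout the feature actually expects; the function name and sample path are hypothetical.

```python
from pathlib import PurePosixPath


def partition_columns(path: str) -> dict:
    """Extract key=value directory segments from a file path."""
    columns = {}
    # Only inspect directory segments, not the file name itself.
    for part in PurePosixPath(path).parent.parts:
        key, sep, value = part.partition("=")
        if sep and key and value:
            columns[key] = value
    return columns


derived = partition_columns("events/day=2022-05-17/part-0.parquet")
print(derived)
```

Each derived key would then correspond to a column in the Pinot schema, with the same value applied to every row read from that file.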
Offline Ingestion: Auto partition source data on sub directory
Added the ability to repartition source data on a particular sub-directory, identified by its level in the path. This feature is useful for grouping data from different files into the same segment or set of segments. For more details, refer to this doc.
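As a conceptual sketch only, the grouping step can be pictured as bucketing files by the directory name found at a given depth of the path. The function name, the zero-based level convention, and the sample paths are all hypothetical and do not describe the product's actual configuration.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_level(paths, level):
    """Group file paths by the directory name at the given zero-based level."""
    groups = defaultdict(list)
    for path in paths:
        parts = PurePosixPath(path).parts
        # The last part is the file name, so it cannot serve as a directory level.
        if level >= len(parts) - 1:
            raise ValueError(f"{path} has no directory at level {level}")
        groups[parts[level]].append(path)
    return dict(groups)


files = [
    "events/2022-05-16/a.csv",
    "events/2022-05-16/b.csv",
    "events/2022-05-17/c.csv",
]
groups = group_by_level(files, level=1)
print(groups)
```

Files that land in the same group would be candidates for the same segment or set of segments.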
Google PubSub connector for Pinot improvements
Improved the PubSub connector with a retry mechanism and a configurable timeout, fixed bugs in stream recreation, and improved the reliability of snapshots on checkpoint.
StarTree Cloud - includes BYOC (Bring Your Own Cloud) and SaaS
SOC 2 Type 1 Certification
Achieved SOC 2 Type 1 certification. For more information, see the blog post.
Authentication on Pinot APIs
Announcing alpha availability of authenticated Pinot APIs. Customers can use a generated token to get secure access to Pinot APIs.
OIDC Security provider
StarTree admins can now configure any OIDC-compliant identity provider (e.g., Okta) to provide authenticated access to their data plane.
Data Manager: Self-Service Ingestion tool
Enhanced user experience
Launched a simplified ingestion flow for Data Manager with a guided experience. Users can now upload large files and configure the schema with more customization options.
Confluent Schema Registry support
Added support for using Confluent Schema Registry with basic auth during real-time Kafka-based ingestion.
Confluent Schema Registry JSON data format support
Added support for using the Confluent Schema Registry JSON data format during Kafka-based real-time ingestion.
File upload size limit increased
Previous versions supported file uploads of up to 1 MB only. This limit is now 30 MB.
Offline Ingestion Improvements
The offline ingestion job is now triggered immediately after a new dataset is created. Previously, there was a considerable delay before the async ingestion job started.
Kafka upsert support self serve
Added support for configuring upserts in your Kafka-based real-time dataset through the UI. For more details, see this doc.
IAM role based S3 ingestion
Added support for ingesting data from S3 using IAM role-based access. Previously, users had to enter the access key and secret key, which was not ideal.
For more details, see this doc.
ThirdEye: Anomaly Detection and Root Cause Analysis Tool
Users can now configure the timezone during alert creation. For more details, see dev.startree.ai/docs/startree-enterprise-edition/startree-thirdeye/concepts/alert-configuration#timezone
Anomaly Summary and Investigate
Users can now self-serve root-cause analysis, give feedback, add comments, and save the investigation associated with a given anomaly. For more details, see dev.startree.ai/docs/startree-enterprise-edition/startree-thirdeye/how-tos/perform-root-cause-analysis#find-anomalies
In-app help and support
Users can now access helpful tips and documentation within the ThirdEye application for quicker task completion and onboarding.
HTTP Detector (API)
Users can now plug detection algorithms into the ThirdEye platform to detect anomalies in near real-time. For example, Prophet is now supported for anomaly detection through the HTTP Detector (API). For more details, see https://dev.startree.ai/docs/startree-enterprise-edition/startree-thirdeye/reference/operators/anomaly-detector/http
Timezone and timeformat support for ThirdEye Anomaly Reports
ThirdEye anomaly reports are now sent in the local timezone. For more details, see dev.startree.ai/docs/startree-enterprise-edition/startree-thirdeye/troubleshooting/faq_tips
Users can now save comments and update the status of each anomaly to indicate whether it is “unexpected” or not.
Pre-configured anomaly detection techniques (low code)
Users can now use pre-configured alert templates (low code), built on the anomaly detection techniques already supported by ThirdEye, to detect anomalies in metrics data. For more details, see dev.startree.ai/docs/startree-enterprise-edition/startree-thirdeye/concepts/anomaly-detection-algorithms.
ThirdEye now integrates with multiple notification channels (email, Slack, and webhook). For more details, see dev.startree.ai/docs/startree-enterprise-edition/startree-thirdeye/how-tos/notification/