Skip to main content

Release Version 0.4.0: April~May 2022

Apache Pinot

Allow pushing segments to RealTime table

Users can now push segments to a real-time table, thus simplifying onboarding when ingesting from a hybrid source (realtime+offline). This makes it very easy to bootstrap or backfill a real-time table. For more details, see the GitHub issue.

Deduplication support in real-time Pinot table

Added the ability to remove duplicates in the streaming data sources based on a primary key. For more information, see the Stream Ingestion with Dedup documentation.

Server Failure Detector

Added a new Failure Detector module in the Pinot Broker that can take failed servers out of rotation in order to prevent further query failures. More details in the Github issue. For more details in the Github issue

Minion observability enhancements.

Added health endpoint for minions for proactively identifying ingestion issues (offline ingestion). For more details, see github.com/apache/pinot/pull/8669.

New ingestion minion tasks/endpoints (metadata) to enable ease of debugging for users. For more details, see github.com/apache/pinot/pull/8551.

Smart Approximations

Added smart functions to automatically switch to approximate data structure when cardinality is high for DISTINCT_COUNT and PERCENTILE.

Add broker config pinot.broker.use.approximate.function to turn the feature on (off by default) Add query config useApproximateFunction to override the broker level config. For more details, see github.com/apache/pinot/pull/8189.

Timestamp index

Added support to configure a new TIMESTAMP index on columns of type TIMESTAMP. This will automatically pre-aggregate column values based on the specified time granularities. For more details, see docs.pinot.apache.org/basics/indexing/timestamp-index

Spark 3.x support

Added support for running offline Pinot ingestion jobs in Spark 3.x. For more details, see github.com/apache/pinot/pull/8560

Real-Time text search support

Added support for Mutable FST Index that enables text search use cases on real-time data. The older Lucene indexes were created on segment flush and hence not available for the most recent data hosted in consuming segments. For more details, see github.com/apache/pinot/pull/8861

Distinct on Multi-value columns

Added support to use DISTINCT query operator on multi-value columns. More details in this github.com/apache/pinot/issues/8850

Enhanced aggregation support during ingestion

Ingestion Pre-Aggregation is now supported for MIN, MAX, and COUNT, in addition to SUM.

To enable the feature, add an aggregationConfig to the ingestionConfigs of a realtime table config. The format of the config is (with example)

  "aggregationConfigs": [
{
"columnName": "destColumn",
"aggregationFunction": "MIN(srcColumn)"
}
],

The destColumn must be in the schema and the srcColumn must not be in the schema. Additionally, all destColumns must be noDictionaryColumns. For more details, see github.com/apache/pinot/pull/8611

GapFill function

Added a new function to enable users to fill gaps in timeseries data using previous or default values. For more details, see docs.pinot.apache.org/users/user-guide-query/gap-fill-functions

Support for building all indexes in batch ingestion job

Ability to create all indexes during segment generation, reducing the processing during segment load on the server. For more details, see github.com/apache/pinot/issues/8334

StarTree Extensions for Apache Pinot

info

Only available in StarTree Cloud service

Pinot Proxy

Added a new endpoint to enable external services like Presto to be able to connect to internal Pinot servers (not exposed outside the k8 cluster) in a secure manner. For more details, please refer to this doc

Offline Ingestion: Auto Infer source partition column on sub directory

Added ability to derive columns in Pinot schema from the source file path. This is very useful when the source directory is partitioned on a dimension (eg: time with day as the smallest bucket). This partition column present in the file path is then automatically treated as one of the Pinot columns. For more details please refer to this doc.

Offline Ingestion: Auto partition source data on sub directory

Added ability to repartition source data on a particular sub-directory defined by its level in the path. This feature is useful to group data from different files into the same segment or set of segments. For more details please refer to this doc.

Google PubSub connector for Pinot improvements

Added improvements to the PubSub connector such as retry mechanism, configurable timeout, fixed bugs in stream recreation, improved reliability of snapshots on checkpoint.

StarTree Cloud - includes BYOC (Bring Your Own Cloud) and SaaS

Soc2 Type1 Certification

Achieved Soc2 Type1 certification. For more information, see the blog post.

Authentication on Pinot APIs

Announcing Alpha availability of authenticated Pinot APIs. Customers can use a generated token to get secure access to pinot apis.

OIDC Security provider

StarTree admins can now configure any OIDC compliant IDP e.g. Okta to provide authenticated access to their data plane.

Data Manager: Self Service Ingestion tool

Enhanced user experience

Launched simplified ingestion flow for Data Manager with guided experience. Now users can upload large files and can configure schema with more customizations at ease.

Confluent Schema Registry support

Added support for using Confluent Schema registry with basic auth during real-time Kafka based ingestion.

Confluent Schema Registry Json data format support

Added support for Confluent Schema registry Json data format to be used during Kafka based real-time ingestion

File upload size limit increased

Previous version only supported uploading 1 MB of files. Increased this limit to 30 MB.

Offline Ingestion Improvements

Trigger offline ingestion job immediately after new dataset creation. Previously there was a considerable delay for the async ingestion job to start.

Kafka upsert support self serve

Added support for configuring upserts in your Kafka based real-time dataset through the UI. For more details, see this doc

IAM role based S3 ingestion

Added support to ingest data from S3 using IAM role based access. Previously, users had to enter the access key and secret key which was not ideal.
For more details, see this doc

ThirdEye: Anomaly Detection and Root Cause Analysis Tool

Timezone support

May 2022

Now Users can configure the timezone during alert creation. For more details, see dev.startree.ai/docs/startree-enterprise-edition/startree-thirdeye/concepts/alert-configuration#timezone

Anomaly Summary and Investigate

May 2022

Now users can self-serve root-cause analysis,give feedback, add comments and save the investigation associated with a given anomaly. For more details, see dev.startree.ai/docs/startree-enterprise-edition/startree-thirdeye/how-tos/perform-root-cause-analysis#find-anomalies

In-app help and support

Now users can access helpful tips and documentation within the ThirdEye application for quicker task completion or onboarding to ThirdEye.

HTTP Detector (API)

Now users can plug detection algorithms into ThirdEye platform to detect anomalies in near real-time. (Example: Prophet is now supported to detect anomalies using HTTP Detector (API). For more details, see dev.startree.ai/docs/startree-enterprise-edition/startree-thirdeye/concepts/anomaly-detection-algorithms#remote-http

Timezone and timeformat support for ThirdEye Anomaly Reports

ThirdEye anomaly reports are now sent on the local timezone. For more details, see dev.startree.ai/docs/startree-enterprise-edition/startree-thirdeye/troubleshooting/faq_tips

Anomaly status

Users can now save comments and update the status for each anomaly saying it is “unexpected” or not.

Pre-configured anomaly detection techniques (low code)

User can now use pre-configured alert templates (low code) created using the existing anomaly detection techniques supported by ThirdEye to detect anomalies in the metrics data. For more details, see dev.startree.ai/docs/startree-enterprise-edition/startree-thirdeye/concepts/anomaly-detection-algorithms.

Slack notifications

Now supports integrations with different channels of notifications to users (Email, Slack and Webhook). For more details, see dev.startree.ai/docs/startree-enterprise-edition/startree-thirdeye/how-tos/notification/