Release version 0.9.0: June 2024
Apache Pinot updates since last StarTree release
For details on Pinot changes, see Releases.
- Pinot's database support leaps forward with a new way to include database context within API requests.
- Support for the ListAgg WITHIN GROUP clause:
  - Ordering Within Groups: Pinot now enables in-group sorting. Users can define the GROUP BY clause as usual and sort the data within each group based on specific columns.
  - ListAgg: Aggregates multiple values into a single, comma-separated list for each group defined by the GROUP BY clause.
  - Distinct Control for Lists: Want to ensure unique values in your aggregated lists? Pinot's ListAgg now supports a DISTINCT flag that controls whether duplicate values are included when building lists.
- StarTree indexes can now be leveraged for queries with NOT following a predicate, improving query performance.
- A new option replaces the default DynamicMessage decoder with code generated by protoc. This optimization can lead to 3-4x faster deserialization, enhancing overall query performance when dealing with Protobuf data.
- Minion nodes can now download segments directly from servers, enhancing data availability. This functionality is controlled by a new task-level configuration: allowDownloadFromServer (defaults to false).
- Pinot now offers value window functions (LEAD, LAG, FIRST_VALUE, LAST_VALUE) for enhanced analytics.
- Add support for the PostgreSQL date_bin function.
- Advanced phrase search capabilities for Lucene-indexed tables:
  - Wildcard Matching: Find variations of phrases using wildcards (e.g., "*pache pino*" to match "Apache Pinot" or "apache pinot").
  - Prefix Matching: Search for terms starting with a specific prefix (e.g., "pino" to find "Pinot" or "Pinot Noir").
  - This feature is configurable through a new option in Lucene text-indexed columns (disabled by default).
- Pinot now supports GZIP compression for raw forward indexes, allowing you to squeeze more data into less storage space in certain scenarios.
- Addressed inconsistencies in how exclusive predicates (e.g., !=, NOT IN) were handled in JSON filtering, leading to more intuitive and predictable filtering behavior.
  - Consistent Wildcard Matching: Previously, wildcard (*) behavior differed between inclusive and exclusive predicates. Now, both types of predicates use the same logic, ensuring consistent results.
  - No More False Positives: Documents lacking a specific key are no longer included in results for exclusive predicates, eliminating unexpected matches.
  - Nested Exclusive Predicates: Nested exclusive predicates are now supported within JSON paths, enabling more intricate filtering logic.
  - Together, these enhancements make JSON data filtering in Pinot more accurate and more expressive.
- Adds support for multi-value fields to jsonExtractIndex.
- Introducing a trio of powerful UDFs (user-defined functions) that generate derived columns containing prefixes, postfixes, and n-grams. These derived columns are persisted, allowing Pinot to leverage its efficient inverted indexes for faster filtering.
  - Prefix & Postfix UDFs: Find terms starting/ending with specific characters (e.g., "data*", "*data").
  - Ngram UDF: Extract character sequences (n-grams) for granular matching (e.g., 3-grams for "data*").
- Pinot adds minion tags to isolate tasks. This allows assigning specific tasks to designated minion groups for better resource management and control.
- The introduction of CLP compression for string columns simplifies data handling by hiding internal encoding details and providing a clean interface for string retrieval.
- Pinot now lets you define custom logic for merging data during partial updates. This goes beyond merging individual columns, allowing you to control how entire rows are combined for more complex update scenarios.
- Pinot now optimizes data distribution when adding new server pools. This ensures existing replica groups remain stable while leveraging the new pool for future deployments.
- Pinot empowers you to ingest data at high rates while maintaining a consistent view of your information. This ensures accurate queries by preventing issues like missing or duplicate primary keys.
- Pinot gains support for complex data structures during batch ingestion, allowing more flexible data processing for batch and minion tasks.
- Pinot swallows more data formats! This update allows Avro readers to process int96 data types commonly found in Parquet files, resolving compatibility issues.
- Pinot now simplifies secure connections by allowing keystore and truststore swaps on the fly. This eliminates the need to recreate entire security contexts, making certificate management smoother.
- Pinot now handles null values more intelligently when using mode (most frequent value), minmaxrange, first_with_time, last_with_time, and percentiles within groups. This update follows PostgreSQL logic, where nulls are ignored during calculation, ensuring more accurate aggregation results.
- Pinot introduces two new metrics (MULTI_STAGE_QUERIES_EXECUTED, MULTI_STAGE_QUERIES_BY_TABLE) to monitor multi-stage queries.
- Pinot's table consumption info endpoint gains a new field (serversFailingToRespond) that indicates servers that failed to respond during data gathering. This makes it clear when the returned information might be incomplete due to server issues.
- Pinot gets a serialization upgrade to Protobuf! This improves stability across versions and ensures compatibility for future changes.
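The ListAgg behavior listed above (ordering within groups, comma-separated output, and the DISTINCT flag) can be illustrated with a small Python sketch. This only emulates the semantics on in-memory rows; it is not Pinot's implementation, and the function name, parameters, and sample data are assumptions made for illustration.

```python
from collections import defaultdict


def listagg(rows, group_key, value_key, order_key=None, distinct=False, sep=","):
    """Emulate LISTAGG(...) WITHIN GROUP (ORDER BY ...) for each group (illustrative only)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    result = {}
    for key, members in groups.items():
        if order_key is not None:
            # WITHIN GROUP (ORDER BY ...): sort rows inside each group
            members = sorted(members, key=lambda r: r[order_key])
        values = [str(r[value_key]) for r in members]
        if distinct:
            # DISTINCT: drop duplicates while keeping the sorted order
            values = list(dict.fromkeys(values))
        result[key] = sep.join(values)
    return result


# Hypothetical sample data
rows = [
    {"team": "a", "player": "zoe"},
    {"team": "a", "player": "bob"},
    {"team": "a", "player": "bob"},
    {"team": "b", "player": "amy"},
]

print(listagg(rows, "team", "player", order_key="player"))
print(listagg(rows, "team", "player", order_key="player", distinct=True))
```

Without the DISTINCT flag the duplicate "bob" appears twice in team "a"; with it, each value appears once.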
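For readers new to the value window functions listed above (LEAD, LAG, FIRST_VALUE, LAST_VALUE), the following Python sketch shows what each one returns for a single, already ordered partition. It is a simplified model, not Pinot code: it assumes the window frame spans the whole partition, and the helper names and sample values are made up for illustration.

```python
def lag(values, offset=1, default=None):
    """Value from `offset` rows earlier in the partition, or `default` if none exists."""
    return [values[i - offset] if i - offset >= 0 else default
            for i in range(len(values))]


def lead(values, offset=1, default=None):
    """Value from `offset` rows later in the partition, or `default` if none exists."""
    n = len(values)
    return [values[i + offset] if i + offset < n else default for i in range(n)]


def first_value(values):
    """First value of the partition, repeated for every row (whole-partition frame assumed)."""
    return [values[0]] * len(values) if values else []


def last_value(values):
    """Last value of the partition, repeated for every row (whole-partition frame assumed)."""
    return [values[-1]] * len(values) if values else []


# Hypothetical ordered partition, e.g. daily prices for one product
prices = [10, 12, 11, 15]

print(lag(prices))
print(lead(prices))
print(first_value(prices))
print(last_value(prices))
```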
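The date_bin function mentioned above follows PostgreSQL semantics: it truncates a timestamp to the start of a fixed-width bin aligned to a chosen origin. A minimal Python sketch of that arithmetic, assuming timezone-naive timestamps and a positive stride, looks like this. The Python signature is an illustration and may differ from the argument types Pinot accepts.

```python
from datetime import datetime, timedelta


def date_bin(stride, source, origin):
    """Return the start of the stride-wide bin, aligned to origin, that contains source."""
    if stride <= timedelta(0):
        raise ValueError("stride must be positive")
    # Number of whole strides between origin and source (floor division)
    bins = (source - origin) // stride
    return origin + bins * stride


source = datetime(2020, 2, 11, 15, 44, 17)

# 15-minute bins aligned to midnight
print(date_bin(timedelta(minutes=15), source, datetime(2001, 1, 1)))

# Same stride, but bins shifted by 2 minutes 30 seconds
print(date_bin(timedelta(minutes=15), source, datetime(2001, 1, 1, 0, 2, 30)))
```

The two calls mirror the examples in the PostgreSQL documentation: the first bin starts at 15:30:00, and shifting the origin by 2 minutes 30 seconds moves the bin start to 15:32:30.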
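The exclusive-predicate fix described above (documents lacking a key no longer match != or NOT IN) can be sketched on plain Python dictionaries. This models only the matching rule stated in these notes, not Pinot's JSON index; the helper names and sample documents are hypothetical.

```python
def not_equal(doc, key, value):
    """Exclusive predicate: matches only if the key exists and its value differs."""
    return key in doc and doc[key] != value


def not_in(doc, key, values):
    """Exclusive predicate: matches only if the key exists and its value is not listed."""
    return key in doc and doc[key] not in values


# Hypothetical documents; the third one has no "color" key at all
docs = [{"color": "red"}, {"color": "blue"}, {"size": "L"}]

# The document without "color" is excluded rather than counted as a match
print([d for d in docs if not_equal(d, "color", "red")])
print([d for d in docs if not_in(d, "color", {"red", "green"})])
```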
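To make the prefix, postfix, and n-gram derived columns mentioned above more concrete, here is a rough Python sketch of the values such functions could produce for one string. The function names and parameters are hypothetical and are not taken from Pinot's API. The idea is that a wildcard filter such as "data*" becomes an exact lookup against the stored prefixes, which an inverted index can serve.

```python
def prefixes(text, max_length):
    """All prefixes of text up to max_length characters."""
    return [text[:i] for i in range(1, min(max_length, len(text)) + 1)]


def suffixes(text, max_length):
    """All suffixes (postfixes) of text up to max_length characters."""
    return [text[-i:] for i in range(1, min(max_length, len(text)) + 1)]


def ngrams(text, n):
    """Sorted unique character n-grams of text; empty if n is invalid or too large."""
    if n <= 0:
        return []
    return sorted({text[i:i + n] for i in range(len(text) - n + 1)})


print(prefixes("data", 3))
print(suffixes("data", 3))
print(ngrams("data", 3))

# A filter like "dat*" reduces to a membership test on the derived prefix values
print("dat" in prefixes("database", 4))
```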
StarTree Cloud
StarTree Extensions for Apache Pinot
- Improved Upsert Performance:
  - Upsert metadata cleanup is now faster, avoiding delays during critical table state transitions.
  - A new minion task UpsertSnapshotCreationTask accelerates server restarts and table rebalancing by building upsert metadata quickly.
- Streamlined Data Management: Re-streaming allows for table recreation from streaming sources without impacting existing workloads.
- Pagination support for large query results: Cursors empower clients to retrieve results in smaller portions and navigate forward/backward within the result set while querying data in Pinot.
- Enhanced Monitoring and Control:
  - New metrics provide deeper insights into various ingestion tasks, including task duration, data ingested, input file size, and import failures (for alerting).
  - The desiredSegmentSize configuration for SegmentImportTask offers more control over segment granularity.
  - Default configurations for SegmentRefreshTask and FileIngestionTask are optimized to improve performance.
  - Consistent push support for FileIngestionTask simplifies data ingestion workflows.
- Advanced Functionality:
  - Complex data structures (ComplexTypes) can now be handled within the SegmentProcessorFramework.
  - FileIngestionTask can now skip checkpointing and watermarks for specific use cases.
  - A configuration-driven SegmentToRowsRatio for checkpoints prevents unnecessary recalculations during restarts.
  - Segment refresh can be triggered based on StarTreeIndex configuration changes.
- Bug Fixes and Stability Improvements:
  - FileIngestionTask now gracefully fails when encountering permission issues.
  - Improved validation for various ingestion tasks ensures data integrity.
  - Adhoc tasks in SqlConnectorBatchPushTask no longer silently drop tasks due to resource limitations.
  - Semaphore handling in the checkpoint manager is fixed.
  - MaxNumRecordsPerTask for DeltaTableIngestionTask is now correctly enforced.
- Enhanced Health Checks:
  - New health checks monitor table-to-broker resource associations and instance pool configurations for large clusters, ensuring overall system health.
- Security:
  - Pinot certificates can now be automatically updated at runtime without downtime, keeping your deployment secure.
- Batch Restarts: Pinot receives a restart optimization for clusters with server pools where all tables use instance pools and replica groups. The Pinot operator now restarts servers in batches of 5, expediting the restart process and minimizing disruption during maintenance or upgrades.
Data Manager
- Data Manager gets Workspaces with Serverless: Pinot is integrated with a database management system. This update introduces the concept of "workspaces" for data access control. Each workspace in Data Manager maps one-to-one to a database in Pinot.
- Upsert Efficiency Boost: Off-heap upsert processing becomes the default for all streams, accelerating upsert operations and maximizing resource utilization. On-demand mode in Kinesis is not supported.
ThirdEye
- Never Miss a Resolution: Receive timely Slack notifications when anomalies are fixed.
- Personalized Tracking: Get alerts for every anomaly, keeping track of the issues that matter the most.
- Organized and Actionable: Use tags to organize Slack notifications and quickly see who resolves each anomaly.
- Custom Workspaces: Enjoy personalized workspaces in ThirdEye that combine alerts, notification groups, anomalies, data sources, and datasets for a streamlined workflow.