Segment Refresh Task

Segment refresh

The SegmentRefreshTask reads the latest table config and refreshes any segments that are not consistent with it. When it detects an inconsistency between a segment and the table config, it downloads the segment from the deep store, processes and regenerates it, and then pushes the result back to atomically replace the old segment.

Supported operations

The following operations can be applied to the segments to match the table config:

  • Time partitioning: Re-partition the segments to be time partitioned (all the records within a segment are in the same time bucket)
  • Value partitioning: Re-partition the segments per the partitioning config
  • Merge/Split: Merge small segments or split large segments (with rollup support) to ensure segments are properly sized
  • Other table config changes that cannot be applied on the server side with segment reload
    • Change time column
    • Change sorted column
    • Change column data type
    • Change column encoding
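To make the time partitioning and split operations concrete, here is a minimal sketch in Python. It is not the task's actual implementation: the function names, the record layout, and the assumption that the time column holds epoch milliseconds are all hypothetical and chosen only for illustration. It groups records by time bucket, then splits each bucket into chunks of at most a given number of records.

```python
from collections import defaultdict

# Hypothetical helper: only minute/hour/day suffixes are handled, e.g. "30m", "12h", "1d".
_UNIT_MS = {"m": 60_000, "h": 3_600_000, "d": 86_400_000}


def period_to_ms(period):
    """Convert a period string such as '1d' into milliseconds."""
    return int(period[:-1]) * _UNIT_MS[period[-1]]


def plan_segments(records, time_column, bucket_period, max_records_per_segment):
    """Return (bucket_start, rows) pairs; every pair holds rows from one time bucket only.

    Assumes the time column stores epoch milliseconds (an illustration-only assumption).
    """
    bucket_ms = period_to_ms(bucket_period)
    buckets = defaultdict(list)
    for record in records:
        # Floor the timestamp to the start of its bucket.
        bucket_start = (record[time_column] // bucket_ms) * bucket_ms
        buckets[bucket_start].append(record)

    segments = []
    for bucket_start in sorted(buckets):
        rows = buckets[bucket_start]
        # Split an oversized bucket into chunks of at most max_records_per_segment rows.
        for i in range(0, len(rows), max_records_per_segment):
            segments.append((bucket_start, rows[i:i + max_records_per_segment]))
    return segments


records = [{"ts": t} for t in (0, 1_000, 86_399_999, 86_400_000, 90_000_000)]
segments = plan_segments(records, "ts", "1d", max_records_per_segment=2)
```

In this example the first three records share a one-day bucket and are split into two chunks, while the last two records form a single chunk in the next bucket.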

How to configure SegmentRefreshTask

Configure the SegmentRefreshTask under task.taskTypeConfigsMap in the table config.

| Property Name | Required | Description |
| --- | --- | --- |
| bucketTimePeriod | Yes | Time bucket for segments (e.g. 1d). |
| maxNumRecordsPerSegment | No (default 5M) | Maximum (desired) number of records in each segment. The task tries to resize all segments to this size after applying the partitioning constraints. |
| skipSegmentIndexCheck | No (default false) | If set to true, the index check (see the Segment index check section) is skipped. This check requires pulling the metadata of all segments from the servers, which can be costly for large tables. |
| tableMaxNumTasks | No (default 10) | Maximum number of parallel tasks a table can run at each schedule. Tune this value based on the number of Minion instances in the cluster. Must be positive. |
| maxNumRecordsPerTask | No (default 50M) | Maximum number of records processed in a single task. Each task is executed by a single Minion instance, so this limit prevents the Minion from running out of resources. Must be positive. |
| maxDataSizePerTask | No (default 5 GB) | Maximum size of data provided to a single task. |
| desiredSegmentSize | No (default 500 MB) | User-specified size for a segment. |
| mergeType | No | Same definition as in the MergeRollupTask. |
| roundBucketTimePeriod | No | Same definition as in the MergeRollupTask. |
| *.aggregationType | No | Same definition as in the MergeRollupTask. |
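As a rough illustration of how the required flag, defaults, and positivity constraints listed above fit together, the following Python sketch resolves a task config map into effective values. The function name and structure are hypothetical and not part of any StarTree or Pinot API. It assumes that values arrive as strings (as in the example configuration) and that "5M" and "50M" mean 5,000,000 and 50,000,000. The size-based properties are left out because their value format is not specified here.

```python
# Defaults taken from the property list; "5M"/"50M" are assumed to mean millions.
DEFAULTS = {
    "maxNumRecordsPerSegment": 5_000_000,
    "tableMaxNumTasks": 10,
    "maxNumRecordsPerTask": 50_000_000,
}
# Properties the list above describes as having to be positive.
MUST_BE_POSITIVE = ("tableMaxNumTasks", "maxNumRecordsPerTask")


def resolve_refresh_config(task_config):
    """Hypothetical helper: apply defaults and basic validation to a task config map."""
    if "bucketTimePeriod" not in task_config:
        raise ValueError("bucketTimePeriod is required")

    resolved = {"bucketTimePeriod": task_config["bucketTimePeriod"]}
    for name, default in DEFAULTS.items():
        # Values are assumed to be strings in the table config, so parse them as integers.
        resolved[name] = int(task_config.get(name, default))

    flag = str(task_config.get("skipSegmentIndexCheck", "false"))
    resolved["skipSegmentIndexCheck"] = flag.lower() == "true"

    for name in MUST_BE_POSITIVE:
        if resolved[name] <= 0:
            raise ValueError(f"{name} must be positive, got {resolved[name]}")
    return resolved


resolved = resolve_refresh_config(
    {"bucketTimePeriod": "1d", "maxNumRecordsPerSegment": "2000000"}
)
```

A check like this can be run before submitting a table config so that a missing bucketTimePeriod or a non-positive task limit is caught early.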

Segment index check

When segment index check is enabled, the task generator will pull the segment metadata for each segment from the servers, and compare it with the table config. If the segment metadata is not consistent with the table config, the segment will be refreshed. The following properties are compared between the segment metadata and the table config:

  • Whether the time column is the same
  • Whether the partitioning info matches:
    • Same partition column
    • Same partition function
    • Same partition count
    • Segment belongs to a single partition
  • Sorted column in the table config is sorted in the segment
  • Checks for all columns:
    • Column is added (a server reload can also handle this, but the task supports column addition as well so that all changes can be applied in one shot via Minions)
    • Column is deleted
    • Column field type change
    • Column data type change
    • Column SV/MV change
    • Column encoding change
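The comparison described above can be sketched in Python as follows. This is only an illustration of the listed checks, not the task generator's real code: the dictionary layout used for the segment metadata and the table config, and every key name in it, are assumptions made up for this example.

```python
COLUMN_ATTRIBUTES = ("fieldType", "dataType", "singleValue", "encoding")
PARTITION_KEYS = ("partitionColumn", "partitionFunction", "numPartitions")


def needs_refresh(segment_meta, table_config):
    """Return the list of reasons a segment is inconsistent with the table config.

    An empty list means the segment does not need to be refreshed.
    All dictionary keys are hypothetical and exist only for this sketch.
    """
    reasons = []

    if segment_meta.get("timeColumn") != table_config.get("timeColumn"):
        reasons.append("time column differs")

    # Partitioning info: column, function, and partition count must match.
    for key in PARTITION_KEYS:
        if segment_meta.get(key) != table_config.get(key):
            reasons.append(f"{key} differs")
    # A partitioned table expects every segment to hold exactly one partition.
    if table_config.get("partitionColumn") and len(segment_meta.get("partitions", [])) != 1:
        reasons.append("segment does not belong to a single partition")

    sorted_column = table_config.get("sortedColumn")
    if sorted_column and sorted_column not in segment_meta.get("sortedColumns", []):
        reasons.append(f"sorted column {sorted_column} is not sorted in the segment")

    # Per-column checks: additions, deletions, and attribute changes.
    segment_columns = segment_meta.get("columns", {})
    config_columns = table_config.get("columns", {})
    for name in sorted(config_columns.keys() - segment_columns.keys()):
        reasons.append(f"column {name} added")
    for name in sorted(segment_columns.keys() - config_columns.keys()):
        reasons.append(f"column {name} deleted")
    for name in sorted(config_columns.keys() & segment_columns.keys()):
        for attribute in COLUMN_ATTRIBUTES:
            if segment_columns[name].get(attribute) != config_columns[name].get(attribute):
                reasons.append(f"column {name} {attribute} changed")
    return reasons


ts_column = {"fieldType": "DATE_TIME", "dataType": "LONG", "singleValue": True, "encoding": "RAW"}
user_column = {"fieldType": "DIMENSION", "dataType": "STRING", "singleValue": True, "encoding": "DICTIONARY"}

table_config = {
    "timeColumn": "ts",
    "sortedColumn": "userId",
    "columns": {"ts": ts_column, "userId": user_column},
}
stale_segment = {"timeColumn": "ts", "sortedColumns": [], "columns": {"ts": ts_column}}
fresh_segment = {
    "timeColumn": "ts",
    "sortedColumns": ["userId"],
    "columns": {"ts": ts_column, "userId": user_column},
}

stale_reasons = needs_refresh(stale_segment, table_config)
fresh_reasons = needs_refresh(fresh_segment, table_config)
```

Returning the list of reasons rather than a single boolean makes it easier to see why a particular segment was selected for refresh.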

Example Configuration

    "task": {
      "taskTypeConfigsMap": {
        "SegmentRefreshTask": {
          "bucketTimePeriod": "1d",
          "maxNumRecordsPerSegment": "2000000",
          "maxNumRecordsPerTask": "10000000",
          "schedule": "0 */30 * * * ?"
        }
      }
    }

Limitation

For real-time tables, the SegmentRefreshTask only considers segments that are in the COMPLETED state. Consuming segments are left untouched. In addition, this task currently does not work with real-time tables that have Upsert or Dedup enabled.