Manage missing data
Most of the time[^1], to create buckets of a given granularity, the data fetching step queries with GROUP BY timeBucket. Consider a granularity of 1 day: what happens if there is no data on a Sunday? There will be no row for that Sunday, and there is a great chance this will break things later. After all, you expect your detection pipeline to know when there is no data. This case may seem unlikely, but as soon as you filter on specific dimensions, the probability of it happening skyrockets.
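For intuition, here is a hypothetical daily result set for the pageviews dataset used below; the dates and values are made up for this illustration:

ts            met
2023-01-06    310000
2023-01-07    295000
2023-01-09    330000

The Sunday bucket (2023-01-08) is simply absent: no row carries a 0 or a null, so a downstream consumer that expects one row per day will silently skip it.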
To manage missing data, use a TimeIndexFiller node. A usual pipeline looks like this: a DataFetcher feeds a TimeIndexFiller, whose filled output feeds the AnomalyDetector.
Here is the corresponding configuration:
{
  "name": "simple-threshold-template",
  "description": "Sample threshold alert. Runs every hour. Change the template properties to run on your data",
  "cron": "0 * * * * ? *",
  "template": {
    "nodes": [
      {
        "name": "root",
        "type": "AnomalyDetector",
        "params": {
          "type": "THRESHOLD",
          "component.monitoringGranularity": "${monitoringGranularity}",
          "component.timestamp": "ts",
          "component.metric": "met",
          "component.max": "${max}",
          "component.min": "${min}",
          "anomaly.metric": "${aggregateFunction}(${metric})"
        },
        "inputs": [
          {
            "targetProperty": "current",
            "sourcePlanNode": "missingDataManager",
            "sourceProperty": "currentData"
          }
        ]
      },
      {
        "name": "missingDataManager",
        "type": "TimeIndexFiller",
        "params": {
          "component.timestamp": "ts"
        },
        "inputs": [
          {
            "sourcePlanNode": "currentDataFetcher",
            "sourceProperty": "currentData"
          }
        ],
        "outputs": [
          {
            "outputName": "currentData"
          }
        ]
      },
      {
        "name": "currentDataFetcher",
        "type": "DataFetcher",
        "params": {
          "component.dataSource": "${dataSource}",
          "component.query": "SELECT __timeGroup(\"${timeColumn}\", '${timeColumnFormat}', '${monitoringGranularity}') as ts, ${aggregateFunction}(${metric}) as met FROM ${dataset} WHERE __timeFilter(\"${timeColumn}\", '${timeColumnFormat}') GROUP BY ts ORDER BY ts LIMIT 1000"
        },
        "outputs": [
          {
            "outputKey": "pinot",
            "outputName": "currentData"
          }
        ]
      }
    ]
  },
  "templateProperties": {
    "dataSource": "pinotQuickStartAzure",
    "dataset": "pageviews",
    "aggregateFunction": "sum",
    "metric": "views",
    "monitoringGranularity": "P1D",
    "timeColumn": "date",
    "timeColumnFormat": "yyyyMMdd",
    "max": "850000",
    "min": "250000"
  }
}
threshold_with_missing_data_management.json
By default, missing buckets are created with a value of 0.
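Continuing the hypothetical result set above, the TimeIndexFiller output would contain the missing Sunday bucket with a zero value (the other values are still illustrative):

ts            met
2023-01-06    310000
2023-01-07    295000
2023-01-08    0
2023-01-09    330000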
Configuration and behavior
The TimeIndexFiller takes an input and returns it with the time index filled.
{
  "name": "missingDataManager",
  "type": "TimeIndexFiller",
  "params": {
    "component.timestamp": "ts"
  },
  "inputs": [
    {
      "sourcePlanNode": "currentDataFetcher",
      "sourceProperty": "currentData"
    }
  ],
  "outputs": [
    {
      "outputName": "currentData"
    }
  ]
}
Notice there is only one parameter, component.timestamp. Because the DataFetcher uses the ThirdEye macros __timeGroup(...) and __timeFilter(...), metadata about the granularity and the time predicate is directly given to the TimeIndexFiller, so it can infer the expected time index on its own.
Manual configuration
If the input does not use macros, the TimeIndexFiller requires the following parameters (see the configuration sketch after the strategy list below):
- component.monitoringGranularity: the granularity, in ISO 8601 format. E.g. P1D.
- component.metric: the name of the metric column.
- component.timestamp: the name of the time column.
- component.minTimeInference: the strategy to infer the minimum time.
- component.maxTimeInference: the strategy to infer the maximum time.
- component.lookback: used when the time inference strategy uses a lookback time. In ISO 8601 format.
The possible strategies are:
- FROM_DATA: the minimum (resp. maximum) time is the minimum (resp. maximum) time observed in the input. Does not work well if data is missing at the beginning or at the end.
- FROM_DETECTION_TIME: the minimum (resp. maximum) time is the minimum (resp. maximum) of the analysis timeframe.
- FROM_DETECTION_TIME_WITH_LOOKBACK: same as the previous one, with an offset of value component.lookback applied.
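Here is a minimal sketch of a manually configured TimeIndexFiller. The node wiring matches the pipeline above, but the parameter values (daily granularity, a 7-day lookback on the minimum time) are illustrative assumptions, not values taken from this document:

{
  "name": "missingDataManager",
  "type": "TimeIndexFiller",
  "params": {
    "component.timestamp": "ts",
    "component.metric": "met",
    "component.monitoringGranularity": "P1D",
    "component.minTimeInference": "FROM_DETECTION_TIME_WITH_LOOKBACK",
    "component.maxTimeInference": "FROM_DETECTION_TIME",
    "component.lookback": "P7D"
  },
  "inputs": [
    {
      "sourcePlanNode": "currentDataFetcher",
      "sourceProperty": "currentData"
    }
  ],
  "outputs": [
    {
      "outputName": "currentData"
    }
  ]
}

Whether these exact values suit your data depends on your detection timeframe; treat this as a starting point rather than a recommended configuration.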
[^1]: For classic SQL databases. Timeseries-focused databases often introduce group-by-time-bucket capabilities with empty buckets materialized. Still, these bucketings require an explicit range, or use the first and last observed values as the range, which is not correct in our case.