Skip to main content

Manage missing data

Most of the time1, to create buckets of a given granularity, the data fetching step will query with GROUP BY timeBucket.
Let's consider a granularity of 1 day. What happens if there is no data on a Sunday? There will be no line for this Sunday. There is a great chance this will make things break later. After all, you expect your detection pipeline to know when there is no data. This case can seem non-likely to happen, but as soon as you filter on specific dimensions, the probability of this happening skyrockets.

To manage missing data, use a TimeIndexFiller node. A usual pipeline will look like this:

Here is the corresponding configuration:

threshold_with_missing_data_management.json
{
"name": "simple-threshold-template",
"description": "Sample threshold alert. Runs every hour. Change the template properties to run on your data",
"cron": "0 * * * * ? *",
"template": {
"nodes": [
{
"name": "root",
"type": "AnomalyDetector",
"params": {
"type": "THRESHOLD",
"component.monitoringGranularity": "${monitoringGranularity}",
"component.timestamp": "ts",
"component.metric": "met",
"component.max": "${max}",
"component.min": "${min}",
"anomaly.metric": "${aggregateFunction}(${metric})"
},
"inputs": [
{
"targetProperty": "current",
"sourcePlanNode": "missingDataManager",
"sourceProperty": "currentData"
}
]
},
{
"name": "missingDataManager",
"type": "TimeIndexFiller",
"params": {
"component.timestamp": "ts"
},
"inputs": [
{
"sourcePlanNode": "currentDataFetcher",
"sourceProperty": "currentData"
}
],
"outputs": [
{
"outputName": "currentData"
}
]
},
{
"name": "currentDataFetcher",
"type": "DataFetcher",
"params": {
"component.dataSource": "${dataSource}",
"component.query": "SELECT __timeGroup(\"${timeColumn}\", '${timeColumnFormat}', '${monitoringGranularity}') as ts, ${aggregateFunction}(${metric}) as met FROM ${dataset} WHERE __timeFilter(\"${timeColumn}\", '${timeColumnFormat}') GROUP BY ts ORDER BY ts LIMIT 1000"
},
"outputs": [
{
"outputKey": "pinot",
"outputName": "currentData"
}
]
}
]
},
"templateProperties": {
"dataSource": "pinotQuickStartAzure",
"dataset": "pageviews",
"aggregateFunction": "sum",
"metric": "views",
"monitoringGranularity": "P1D",
"timeColumn": "date",
"timeColumnFormat": "yyyyMMdd",
"max": "850000",
"min": "250000"
}
}

By default, missing buckets are created with a value of 0.

Configuration and behavior

The TimeIndexFiller takes an input and returns it with the time index filled.

  {
"name": "missingDataManager",
"type": "TimeIndexFiller",
"params": {
"component.timestamp": "ts"
},
"inputs": [
{
"sourcePlanNode": "currentDataFetcher",
"sourceProperty": "currentData"
}
],
"outputs": [
{
"outputName": "currentData"
}
]
}

Notice there is only one parameter.
Because the DataFetcher uses the ThirdEye macros __timeGroup(...) and __timeFilter(...), metadata about the granularity and the time predicate is directly given to the TimeIndexFiller.

Manual configuration

If the input does not use macros, the TimeIndexFiller requires the following parameters:

  • component.monitoringGranularity: the granularity in ISO 8601 format. Eg: P1D.
  • component.metric: the name of the metric column.
  • component.timestamp: the name of the time column.
  • component.minTimeInference: the strategy to infer the minimum time.
  • component.maxTimeInference: the strategy to infer the maximum time.
  • component.lookback: Used when time inference uses a lookback time. In ISO 8601 format.

The possible strategy are:

  • FROM_DATA: the minimum (resp maximum) time is the minimum (resp maximum) time observed in the input. Does not work well if data is missing at the beginning or at the end.
  • FROM_DETECTION_TIME: the minimum (resp maximum) time is the the minimum (resp maximum) of the analysis timeframe
  • FROM_DETECTION_TIME_WITH_LOOKBACK: same as the previous one, with an offset applied of value component.lookback.

See reference documentation.


1 For classic SQL databases. Timeseries-focused database often introduce group-by-time-bucket capabilities, with empty buckets materialized. Still these bucketings require a range, or use the first and last value as range, which is not correct in our case.