Alert configuration and execution

The alert configuration is a JSON that contains all the logic to fetch and analyse data in order to detect anomalies.
Consider the alert configuration from the getting started tutorial:

quickstart_complexWebsite_alert.json
{
"name": "my-first-detection-configuration",
"description": "Detect anomalies in pageviews.",
"cron": "0 0 4 * * ? *",
"template": {
"nodes": [
{
"name": "root",
"type": "AnomalyDetector",
"params": {
"type": "THRESHOLD",
"component.monitoringGranularity": "P1D",
"component.timestamp": "ts",
"component.metric": "met",
"component.max": "${max}",
"component.min": "${min}",
"anomaly.metric": "${metric}"
},
"inputs": [
{
"targetProperty": "current",
"sourcePlanNode": "missingDataManager",
"sourceProperty": "currentOutput"
}
],
"outputs": []
},
{
"name": "missingDataManager",
"type": "TimeIndexFiller",
"params": {
"component.timestamp": "ts"
},
"inputs": [
{
"sourcePlanNode": "currentDataFetcher",
"sourceProperty": "currentOutput"
}
],
"outputs": [
{
"outputName": "currentOutput"
}
]
},
{
"name": "currentDataFetcher",
"type": "DataFetcher",
"params": {
"component.dataSource": "${dataSource}",
"component.query": "SELECT __timeGroup(hoursSinceEpoch, '${timeColumnFormat}', '${monitoringGranularity}') as ts, ${metric} as met FROM ${dataset} WHERE __timeFilter(hoursSinceEpoch, '${timeColumnFormat}') GROUP BY ts ORDER BY ts LIMIT 10000"
},
"inputs": [],
"outputs": [
{
"outputKey": "pinot",
"outputName": "currentOutput"
}
]
}
],
"metadata": {
"datasource": {
"name": "${dataSource}"
},
"dataset": {
"name": "${dataset}"
},
"metric": {
"name": "views"
}
}
},
"templateProperties": {
"dataSource": "pinotQuickStart",
"dataset": "complexWebsite",
"metric": "sum(views)",
"monitoringGranularity": "P1D",
"timeColumn": "date",
"timeColumnFormat": "EPOCH_HOURS",
"max": "15000",
"min": "5000"
}
}

Let’s step through this configuration key by key.

Root structure

Name

"name": "my-first-detection-configuration"

The name of the alert. Keep it short.

Description

"description": "Detect anomalies in pageviews."

The description of the alert.

Cron

"cron": "0 0 4 * * ? *"

The cron schedule. Defines how often and when the detection runs.
The format is the Quartz cron format, which has 6 or 7 fields. See Quartz documentation.
A cron expression builder is available here.
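
To make the field order concrete, here is a minimal Python sketch that labels each part of the example alert's cron expression with its Quartz field name. The helper name `label_quartz_fields` is hypothetical and is not part of ThirdEye; the sketch only splits the expression and does not validate or schedule anything.

```python
# Field order of a Quartz cron expression; the 7th field (year) is optional.
QUARTZ_FIELDS = [
    "seconds", "minutes", "hours",
    "day_of_month", "month", "day_of_week", "year",
]

def label_quartz_fields(expression: str) -> dict:
    """Pair each part of a Quartz cron expression with its field name."""
    parts = expression.split()
    if len(parts) not in (6, 7):
        raise ValueError(f"expected 6 or 7 fields, got {len(parts)}")
    # zip stops at the shorter sequence, so a 6-field expression has no "year".
    return dict(zip(QUARTZ_FIELDS, parts))

fields = label_quartz_fields("0 0 4 * * ? *")
for name, value in fields.items():
    print(f"{name:>12}: {value}")
```

Read this way, the expression of the example alert fires at second 0, minute 0, hour 4, every day: the detection runs daily at 4am.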

Template

"template": {
"nodes": [ .. .. .. ],
"metadata": { ... }
}

The analysis logic.

  • "nodes" contains the detection logic: a list of operators that fetch, clean and analyze data.
  • "metadata" contains metadata used for root cause analysis, search, and data delay management.

The field is called "template" because it is meant to be reused. Templates can be defined independently of alerts and reused by multiple alerts. See create templates.

Inside the template, you can use variables in the format ${myVariable}. Variables are defined in the templateProperties.

Template properties

"templateProperties": {
"key": "value"
}

Values available for the template. You can see them as environment variables.

For instance:

"template": {
"nodes": [ {
"type": "AnomalyDetector",
"params": {
"type": "THRESHOLD",
"component.max": "${max}"
}}]
},
"templateProperties": {
"max": 15000
}

Will be replaced at runtime by:

"template": {
"nodes": [ {
"type": "AnomalyDetector",
"params": {
"type": "THRESHOLD",
"component.max": 15000 // REPLACED HERE
}}]
}
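
The substitution can be illustrated with a short Python sketch. It is not ThirdEye's implementation: the function name `apply_properties` is hypothetical, and the sketch relies on the fact that Python's `string.Template` happens to use the same `${name}` syntax. Unlike the result shown above, this sketch always produces strings, so `15000` becomes `"15000"`.

```python
from string import Template

def apply_properties(value, properties):
    """Recursively replace ${name} placeholders in a JSON-like structure."""
    if isinstance(value, dict):
        return {key: apply_properties(item, properties) for key, item in value.items()}
    if isinstance(value, list):
        return [apply_properties(item, properties) for item in value]
    if isinstance(value, str):
        # substitute() raises KeyError when a placeholder has no matching property.
        return Template(value).substitute(properties)
    return value

template = {
    "nodes": [{
        "type": "AnomalyDetector",
        "params": {"type": "THRESHOLD", "component.max": "${max}"},
    }]
}
resolved = apply_properties(template, {"max": 15000})
print(resolved["nodes"][0]["params"])
```

Raising on a missing property is a design choice of this sketch: it surfaces a forgotten template property early instead of sending a literal `${max}` to a node.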

Nodes

Execution

Nodes are the tasks that form the detection plan. They are also called Plan Nodes. Nodes can have multiple inputs and outputs. The inputs of nodes are the outputs of other nodes.

[Figure: plan node]

The list of nodes, with their inputs and outputs, forms an execution DAG. Let’s have a look at our example:

"nodes": [
{
"name": "root",
"type": "AnomalyDetector",
"inputs": [
{
"targetProperty": "current",
"sourcePlanNode": "missingDataManager",
"sourceProperty": "currentOutput"
}
],
"outputs": []
},
{
"name": "missingDataManager",
"type": "TimeIndexFiller",
"inputs": [
{
"sourcePlanNode": "currentDataFetcher",
"sourceProperty": "currentOutput"
}
],
"outputs": [
{
"outputName": "currentOutput"
}
]
},
{
"name": "currentDataFetcher",
"type": "DataFetcher",
"inputs": [],
"outputs": [
{
"outputKey": "pinot",
"outputName": "currentOutput"
}
]
}
]

Graphically, the DAG looks like this:

[Figure: plan node]

An input is defined by:

  • a sourcePlanNode: the name of another node
  • a sourceProperty: the name of the output we want in the other node

An output is defined by:

  • an outputKey: the name of the output to get from the node. Used when a node can return multiple, distinct outputs.
  • an outputName: the name by which other nodes can call this output

To understand how node dependencies are resolved, consider the first two nodes. The root node requires one input. This input comes from the plan node missingDataManager and is named currentOutput. The missingDataManager node exists, and it returns one output called currentOutput, so the dependency is valid: root depends on missingDataManager, and missingDataManager will be executed before root.

The execution DAG is constructed with the following algorithm:

  1. Start from the root
  2. Resolve the node dependencies
  3. The node dependencies can in turn have dependencies. Repeat step 2 recursively.
  4. Stop when all node dependencies are resolved.
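
The steps above can be sketched in Python as a depth-first traversal from the root. This is an illustration under assumptions, not ThirdEye's code: the function name `execution_order` is hypothetical, and the error handling (unknown node, unknown output, cycle) is my own addition.

```python
def execution_order(nodes, root_name="root"):
    """Return node names in an order where every dependency runs first."""
    by_name = {node["name"]: node for node in nodes}
    order, visiting = [], set()

    def resolve(name):
        if name in order:
            return  # already scheduled
        if name in visiting:
            raise ValueError(f"cycle detected at node {name!r}")
        if name not in by_name:
            raise ValueError(f"unknown node {name!r}")
        visiting.add(name)
        for node_input in by_name[name].get("inputs", []):
            source_name = node_input["sourcePlanNode"]
            resolve(source_name)
            # The requested output must be exposed by the source node.
            outputs = {out["outputName"] for out in by_name[source_name].get("outputs", [])}
            if node_input["sourceProperty"] not in outputs:
                raise ValueError(f"{source_name!r} has no output {node_input['sourceProperty']!r}")
        visiting.discard(name)
        order.append(name)

    resolve(root_name)
    return order

nodes = [
    {"name": "root", "inputs": [{"targetProperty": "current",
        "sourcePlanNode": "missingDataManager", "sourceProperty": "currentOutput"}],
     "outputs": []},
    {"name": "missingDataManager", "inputs": [
        {"sourcePlanNode": "currentDataFetcher", "sourceProperty": "currentOutput"}],
     "outputs": [{"outputName": "currentOutput"}]},
    {"name": "currentDataFetcher", "inputs": [],
     "outputs": [{"outputKey": "pinot", "outputName": "currentOutput"}]},
]
order = execution_order(nodes)
print(order)
```

For the example plan, the data fetcher is scheduled first and the root detector last.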

In this example, nodes only have one input and one output. An example of a node with multiple inputs is the AnomalyDetector of type "PERCENTAGE_CHANGE". It compares 2 inputs: a baseline table and a current table. See multiple inputs alerts.

Nodes behaviour

Now that you know how the detection plan is built, let’s have a look at what’s happening.

DataFetcher

The plan starts with the DataFetcher. Take the time to read the SQL query.

{
"name": "currentDataFetcher",
"type": "DataFetcher",
"params": {
"component.dataSource": "${dataSource}",
"component.query": "SELECT __timeGroup(hoursSinceEpoch, '${timeColumnFormat}', '${monitoringGranularity}') as ts, ${metric} as met FROM ${dataset} WHERE __timeFilter(hoursSinceEpoch, '${timeColumnFormat}') GROUP BY ts ORDER BY ts LIMIT 10000"
},
"inputs": [],
"outputs": [
{
"outputKey": "pinot",
"outputName": "currentOutput"
}
]
}

The DataFetcher executes the given query and returns the result in the output named currentOutput. Notice the usage of 2 functions:

__timeGroup(hoursSinceEpoch, '${timeColumnFormat}', '${monitoringGranularity}') as ts
__timeFilter(hoursSinceEpoch, '${timeColumnFormat}')

These functions are ThirdEye macros: they are translated to correct SQL, depending on the SQL engine. They make it easier to build detection pipelines. They are similar to Grafana macros.

  • __timeGroup transforms a time column into bins of a given granularity, expressed in milliseconds. It makes it easy to group by a time column with a specific granularity. In our example, it transforms the column into bins of 1 day. If we ever want to monitor per hour, we can just change monitoringGranularity to P1H.

  • __timeFilter is a time range filter. Start and end dates are automatically generated depending on the detection runtime. Refer to the macro documentation.
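
To give an intuition for the binning done by __timeGroup, here is a small Python sketch of the arithmetic for an EPOCH_HOURS column. It is not the SQL that ThirdEye generates: the function `time_group` is hypothetical, and the sketch assumes UTC and a fixed-length granularity.

```python
MILLIS_PER_HOUR = 3_600_000
MILLIS_PER_DAY = 24 * MILLIS_PER_HOUR

def time_group(hours_since_epoch: int, granularity_millis: int) -> int:
    """Convert an epoch-hours value to the start of its bin, in epoch milliseconds."""
    millis = hours_since_epoch * MILLIS_PER_HOUR
    # Round down to the start of the bin that contains this timestamp.
    return millis - millis % granularity_millis

# Hours 25 and 47 both fall in the second day since the epoch.
bins = [time_group(hour, MILLIS_PER_DAY) for hour in (25, 47, 48)]
print(bins)
```

All rows that land in the same bin share one `ts` value, which is what makes `GROUP BY ts` return one row per day.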

TimeIndexFiller

After data is fetched, it is consumed by the TimeIndexFiller:

{
"name": "missingDataManager",
"type": "TimeIndexFiller",
"params": {
"component.timestamp": "ts"
},
"inputs": [
{
"sourcePlanNode": "currentDataFetcher",
"sourceProperty": "currentOutput"
}
],
"outputs": [
{
"outputName": "currentOutput"
}
]
}

Imagine you have no data on a Sunday. Because we group by a time column in the DataFetcher SQL query, the Sunday row will be missing from the result. This will cause problems in downstream nodes. TimeIndexFiller takes care of this kind of problem.

Notice that this node needs almost no parameters: only the timestamp column is set. This is because the macros used by the DataFetcher add metadata (granularity, time column, etc.) to the result. TimeIndexFiller uses this metadata to know how to fill missing points. If you don’t want to use macros, you can still configure the TimeIndexFiller manually.
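
The idea of filling a time index can be sketched in a few lines of Python. This is only an illustration: the function `fill_time_index` is hypothetical, and filling the missing point with `None` is an assumption of the sketch, not a statement about the value ThirdEye uses.

```python
from datetime import datetime, timedelta

def fill_time_index(rows, start, end, step):
    """Return one (timestamp, value) pair per step in [start, end).

    rows maps timestamps to values; timestamps without data get None.
    """
    filled = []
    current = start
    while current < end:
        filled.append((current, rows.get(current)))
        current += step
    return filled

# Friday, Saturday and Monday have data; Sunday 2023-01-08 is missing.
rows = {
    datetime(2023, 1, 6): 9200,
    datetime(2023, 1, 7): 8700,
    datetime(2023, 1, 9): 9900,
}
filled = fill_time_index(rows, datetime(2023, 1, 6), datetime(2023, 1, 10), timedelta(days=1))
for timestamp, value in filled:
    print(timestamp.date(), value)
```

After filling, downstream nodes see a regular daily series with an explicit gap on Sunday instead of a silently shorter table.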

AnomalyDetector

The data finally flows to the AnomalyDetector node.

{
"name": "root",
"type": "AnomalyDetector",
"params": {
"type": "THRESHOLD",
"component.monitoringGranularity": "P1D",
"component.timestamp": "ts",
"component.metric": "met",
"component.max": "${max}",
"component.min": "${min}",
"anomaly.metric": "${metric}"
},
"inputs": [
{
"targetProperty": "current",
"sourcePlanNode": "missingDataManager",
"sourceProperty": "currentOutput"
}
],
"outputs": []
}

The AnomalyDetector is of type "THRESHOLD". There are many other types of detectors, such as PERCENTAGE_CHANGE and HOLT_WINTERS.
Parameters prefixed by component. are specific to the detector execution. component.timestamp and component.metric tell the detector the names of the time column and the metric column. See Anomaly Detectors documentation to learn more.
Notice that no output is declared. Anomaly detectors do output anomaly data, but this data is consumed directly by ThirdEye.
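
As a rough mental model of a threshold rule, the Python sketch below flags every point whose metric falls outside the min and max bounds. The function `detect_threshold` is hypothetical, and treating the bounds as inclusive and skipping missing values are assumptions of this sketch; refer to the detector documentation for the actual behaviour.

```python
def detect_threshold(rows, min_value, max_value, timestamp="ts", metric="met"):
    """Return the timestamps of rows whose metric is outside [min_value, max_value]."""
    anomalies = []
    for row in rows:
        value = row[metric]
        if value is None:
            continue  # assumption: missing points are not flagged
        if value < min_value or value > max_value:
            anomalies.append(row[timestamp])
    return anomalies

rows = [
    {"ts": 1, "met": 7000},    # within bounds
    {"ts": 2, "met": 16000},   # above max
    {"ts": 3, "met": 4000},    # below min
    {"ts": 4, "met": None},    # missing point
]
anomalies = detect_threshold(rows, min_value=5000, max_value=15000)
print(anomalies)
```

The `timestamp` and `metric` arguments play the role of component.timestamp and component.metric: they name the columns to read.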

Metadata

This part contains metadata that helps ThirdEye make sense of your alert pipeline. Metadata is used for anomaly filtering in the UI, RCA, and pipeline execution. ThirdEye is designed to give you maximum control over your queries and transformations, but this also means it is not always possible for ThirdEye to infer important information from the pipeline.
The metadata fields allow you to explicitly set important metadata.

"metadata": {
"datasource": {
"name": "${dataSource}"
},
"dataset": {
"name": "${dataset}",
"completenessDelay": "PT2H"
},
"metric": {
"name": "views",
"where": "browser=chrome"
},
"eventContext": {
"types": ["HOLIDAY"],
"sqlFilter" : "'US' member of dimensionMap['countryCode']"
},
"granularity": "P1D",
"timezone": "Europe/Paris"
}

Most of the time, metadata values are already used in nodes, so it is good practice to define them as template properties, as in this example.

datasource

name

"name": "${dataSource}"

The name of the datasource to use for RCA. Also used for anomaly filtering.

dataset

name

"name": "${dataset}"

The name of the dataset to use for RCA. Also used for anomaly filtering.

completenessDelay

"completenessDelay": "PT2H"

The time for your data to be considered complete and ready for anomaly detection. In ISO-8601 format.
For instance: if you have a streaming system that can ingest data in 2 minutes, but you often have some data points up to 1 hour late, your completenessDelay is 1 hour (PT1H).
ThirdEye will only run detection on data that is complete.
The default is no delay, which is rarely realistic in production.

dimensions

"dimensions": ["country", "browser"]

List of the dimensions (columns in the dataset) to use in RCA drill-downs. If not set or empty, all dimensions of the table are used.

rcaExcludedDimensions

"rcaExcludedDimensions": ["os", "device"]

List of dimensions (columns in the dataset) to ignore in RCA drill-downs. If not set or empty, all dimensions of the table are used.
The dimensions and rcaExcludedDimensions fields cannot be used at the same time.
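
The interplay of the two fields can be summarized with a small Python sketch. The function `rca_dimensions` is hypothetical and only mirrors the rules stated above; it is not ThirdEye's implementation.

```python
def rca_dimensions(table_dimensions, dimensions=None, excluded=None):
    """Return the dimensions used for RCA drill-downs."""
    if dimensions and excluded:
        raise ValueError("dimensions and rcaExcludedDimensions cannot be used together")
    if dimensions:
        return list(dimensions)
    if excluded:
        return [name for name in table_dimensions if name not in excluded]
    # Neither field set (or both empty): use every dimension of the table.
    return list(table_dimensions)

table = ["country", "browser", "os", "device"]
selected = rca_dimensions(table, dimensions=["country", "browser"])
remaining = rca_dimensions(table, excluded=["os", "device"])
print(selected, remaining)
```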

metric

name

"name": "views"

The name of the metric to use for RCA.
Can be a column name col1 or a derived metric expression 2*col1 - col2.

aggregationFunction

"aggregationFunction": "count"

The aggregation function to use on the metric for RCA. If the metric name is known to ThirdEye, this parameter is optional. ThirdEye will use the default aggregation function of the metric.
If the metric is not known to ThirdEye, this parameter must be passed. For instance, you can perform RCA on a count of a dimension column with:

"name": "myDimension",
"aggregationFunction": "count"

where

"where": "browser=chrome"

A SQL filter clause that will be applied for RCA.
If you apply custom filtering in your alert DataFetcher query, it is usually relevant to apply the same filtering for RCA. Notice that this field also accepts a clause with a leading AND, such as AND browser=chrome, in place of browser=chrome. This makes it easy to share a custom filter clause between DataFetcher queries and this field through a template property.

eventContext

types

"types": ["HOLIDAY"]

A list of event types to filter on for RCA. Only events that match one of these types will be shown in the RCA related events tab.

sqlFilter

"sqlFilter": "'US' member of dimensionMap['countryCode']"

A SQL filter for RCA events. Only events that match the filter will be shown in the RCA related events tab.
See filter examples here.

granularity

"granularity": "P1D"

The granularity of the data when it is passed to the AnomalyDetector nodes. In ISO-8601 format.
Consider a detection pipeline with a cron that runs every day at 3am. By default, the endTime of the detection pipeline will be 3am.
This means the data from 0am to 3am will be fetched and put in a day bucket, even though the day is not finished and the data for that day is not complete. This can be avoided by passing a granularity of 1 day (P1D). ThirdEye will then ignore the 0am-3am data because it does not form a complete day.
With granularity, ThirdEye only runs detection for periods that are complete. This can be combined with the completenessDelay parameter.
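
The effect of granularity and completenessDelay on the end of the detection window can be approximated with the Python sketch below. It is a simplification under assumptions: the function `detection_end` is hypothetical, the delay is subtracted before rounding down, and everything is computed in UTC, ignoring the timezone field.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def detection_end(run_time, granularity, completeness_delay=timedelta(0)):
    """Return the end of the last complete period available at run_time."""
    # Data newer than run_time - completeness_delay is considered incomplete.
    available = run_time - completeness_delay
    # Round down to a multiple of the granularity, counted from the epoch.
    return available - (available - EPOCH) % granularity

run_time = datetime(2023, 1, 10, 3, 0, tzinfo=timezone.utc)  # cron fired at 3am
one_day = timedelta(days=1)                                   # P1D

end_no_delay = detection_end(run_time, one_day)
end_with_delay = detection_end(run_time, one_day, timedelta(hours=4))  # PT4H
print(end_no_delay, end_with_delay)
```

With no delay, the window ends at midnight of January 10, so the partial 0am-3am data is left out. With a 4 hour delay, the data of January 9 is not yet considered complete at 3am, so the window ends at midnight of January 9.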

timezone

"timezone": "Europe/Paris"

Timezone to use for the alert. In TZ database format. Defaults to UTC. The timezone is used:

mergeMaxGap

"mergeMaxGap": "PT2H"

Maximum gap between 2 anomalies for the anomalies to be merged. In ISO-8601 format. See anomaly merge.
To disable anomaly merging, set this value to P0D.

mergeMaxDuration

"mergeMaxDuration": "P7D"

Maximum duration of a merged anomaly. At merge time, if the merged anomaly would exceed this limit, the anomalies are not merged. See anomaly merge.
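
To show how mergeMaxGap and mergeMaxDuration interact, here is a minimal Python sketch of a merge pass over sorted anomalies. It is not ThirdEye's merger: the function `merge_anomalies` is hypothetical, and the strict comparison on the gap (so that a zero gap never merges separate anomalies) is an assumption of the sketch.

```python
from datetime import datetime, timedelta

def merge_anomalies(anomalies, max_gap, max_duration):
    """Merge (start, end) anomalies that are close enough and stay short enough."""
    merged = []
    for start, end in sorted(anomalies):
        if merged:
            last_start, last_end = merged[-1]
            new_end = max(end, last_end)
            close_enough = start - last_end < max_gap
            short_enough = new_end - last_start <= max_duration
            if close_enough and short_enough:
                merged[-1] = (last_start, new_end)
                continue
        merged.append((start, end))
    return merged

day = datetime(2023, 1, 1)
anomalies = [
    (day, day + timedelta(hours=1)),
    (day + timedelta(hours=2), day + timedelta(hours=3)),  # 1 hour after the first
    (day + timedelta(hours=8), day + timedelta(hours=9)),  # 5 hours after the second
]
merged = merge_anomalies(anomalies, max_gap=timedelta(hours=2), max_duration=timedelta(days=7))
capped = merge_anomalies(anomalies, max_gap=timedelta(hours=2), max_duration=timedelta(hours=2))
print(len(merged), len(capped))
```

With a 2 hour gap and a 7 day limit, the first two anomalies merge and the third stays separate. With a 2 hour duration limit, the merged anomaly would last 3 hours, so nothing is merged.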

tip

Metadata fields are not mandatory.
You can start without metadata and update the alert configuration later if you need RCA. This will make RCA directly available in your existing anomalies.

In a production context, it is good practice to have metadata right from the start.