
Anomaly detection algorithms

Principles

Before diving into algorithms, it is important to understand the principles of good alerting systems.
If not designed cautiously, alert pipelines can quickly get messy, generating alert spam and alert fatigue, or missing important anomalies.
When an alert triggers, it should be easy to understand why. Similarly, if an alert doesn't fire, it should be easy to check what happened. The more complicated an alert condition becomes, the harder it is to understand and debug.

This theory applies well to simple metric monitoring, such as CPU usage, but it is harder to apply to operations and business monitoring, the core use cases of ThirdEye. Business metrics are full of complex patterns, special events and seasonalities.

When designing alerts, you will have to find your place on the quadrant below.

(Figure: the more complex an alert is, the less actionable it becomes.)

In ThirdEye, alert rules tend to be more complex than in other systems, and RCA algorithms help you make sense of the anomalies.
This does not mean you should start with complex alerts.

Good practices to apply:

  • Do not try to monitor everything. Think about what is important and what is actionable.
  • Start simple and iterate to fine-tune your alert.
  • The Pareto principle applies: cover the easy 80% before going after the very complex 20%.

With this in mind, here is a review of the commonly used detector algorithms in ThirdEye.

Detector algorithms

Threshold

The simplest method. Detect an anomaly if a metric is above a maximum threshold or below a minimum threshold. Good for signals that are mostly flat, or that should stay within a known range.
Pros: easy to configure and understand.
Cons: does not manage seasonalities. You have to estimate noise yourself.
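As a rough sketch (not ThirdEye's actual implementation), a threshold detector can be expressed in a few lines of Python:

```python
def threshold_detector(values, minimum=None, maximum=None):
    """Flag a point as anomalous if it falls outside [minimum, maximum].

    Either bound may be omitted (None) to check only one side.
    """
    anomalies = []
    for v in values:
        too_low = minimum is not None and v < minimum
        too_high = maximum is not None and v > maximum
        anomalies.append(too_low or too_high)
    return anomalies

# Example: CPU usage that should stay below 90%
flags = threshold_detector([45, 60, 95, 30], maximum=90)
# -> [False, False, True, False]: only the 95 is flagged
```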

Mean Variance Rule

Estimate the mean and the standard deviation, assuming the standard deviation is caused by noise only. If the value is above mean + n*std or below mean - n*std, detect an anomaly.
Good if your signal is flat with a lot of noise.
Pros: no need to estimate the noise yourself. Adapts if the noise level changes.
Cons: Does not manage seasonalities.
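A minimal sketch of this rule, assuming a rolling window is used to estimate the mean and standard deviation (the window size and the exact estimation scheme are illustrative assumptions, not ThirdEye's implementation):

```python
import statistics

def mean_variance_detector(values, window=20, n_sigmas=3.0):
    """Flag points more than n_sigmas standard deviations away from the
    mean of the preceding window. The first `window` points are never
    flagged, since there is no history to estimate from yet."""
    anomalies = [False] * len(values)
    for i in range(window, len(values)):
        past = values[i - window:i]
        mean = statistics.fmean(past)
        std = statistics.stdev(past)
        anomalies[i] = abs(values[i] - mean) > n_sigmas * std
    return anomalies

# A noisy but flat signal with one spike: only the spike is flagged
noisy = [9.0, 11.0] * 10 + [50.0, 9.0]
flags = mean_variance_detector(noisy, window=20)
```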

Percentage rule

Compare the current time series to a baseline. If the percentage change is above a certain threshold, detect an anomaly. A simple way of managing seasonality; you define the baseline yourself.

A common usage is to compare the current value to the value at the same time last week. For instance, compare Thursday the 8th at 8pm to Thursday the 1st at 8pm. This way, you manage both daily and weekly seasonality.

Most of the time, the percentage rule is easy to understand, but because it uses division, it is sensitive to values close to 0. It is also sensitive to noise: if you define a percentage change limit of 20%, and the noise commonly moves the value ±10% around the mean, you will get false positives whenever a -10% point in the baseline is followed by a +10% point in the observed values.
Pros: the easiest way to manage seasonality. Easy to understand.
Cons: sensitive to noise. Sensitive to big trends.
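A hedged sketch of the rule with a week-over-week baseline (the division-by-zero handling and the sample data are illustrative assumptions):

```python
def percentage_rule(current, baseline, threshold=0.20):
    """Flag points whose relative change vs. the baseline exceeds the
    threshold. Points with a baseline of 0 are skipped, since the
    percentage change is undefined there."""
    anomalies = []
    for cur, base in zip(current, baseline):
        if base == 0:
            anomalies.append(False)  # undefined change; handle separately
            continue
        anomalies.append(abs(cur - base) / abs(base) > threshold)
    return anomalies

# Week-over-week baseline: compare each day to the same day last week
daily = [100, 102, 98, 101, 99, 103, 100,   # last week
         101, 99, 130, 100, 102, 98, 101]   # this week (spike on day 3)
current, baseline = daily[7:], daily[:7]
flags = percentage_rule(current, baseline)
# -> only day 3 is flagged (130 vs 98 is a ~33% change)
```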

Absolute change rule

Compares the current time series to a baseline. If the absolute change is above a certain threshold, detect an anomaly. An alternative to the percentage rule that works better for noisy and small values.
Pros: avoids the percentage rule pitfalls.
Cons: the absolute change value is harder to set and understand than a percentage.
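The same sketch with an absolute difference instead of a ratio, avoiding the division entirely (again an illustration, not ThirdEye's implementation):

```python
def absolute_change_rule(current, baseline, threshold):
    """Flag points whose absolute difference from the baseline exceeds
    the threshold. No division, so values near 0 are safe."""
    return [abs(cur - base) > threshold
            for cur, base in zip(current, baseline)]

# With a threshold of 50, only the jump from 100 to 200 is flagged
flags = absolute_change_rule([105, 90, 200], [100, 100, 100], threshold=50)
# -> [False, False, True]
```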

Holt-Winters Rule

The Holt-Winters method is a statistical forecasting algorithm commonly used for anomaly detection. The time series is decomposed into a sum of trend, seasonality and noise, and the algorithm estimates these components. It performs very well for daily and hourly data; for minutely data, fine-tuning the parameters can take some time. Holt-Winters is faster than most other ML methods while keeping very good short-term forecasting performance. The model detects an anomaly when the observed value is too far from the predicted value, and the sensitivity can be fine-tuned.

Pros: manages seasonality, trend and noise.
Cons: as a model that relies heavily on past observations, it is sensitive to concept drift, false trends, special events, bad data, etc.
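To make the decomposition concrete, here is a minimal additive Holt-Winters forecaster in plain Python. The smoothing parameters, initialization scheme and anomaly threshold are illustrative assumptions; it is a sketch of the method, not ThirdEye's implementation:

```python
def holt_winters_forecast(series, season_length, horizon,
                          alpha=0.3, beta=0.05, gamma=0.1):
    """Additive Holt-Winters: estimate level, trend and seasonal
    components from `series`, then forecast `horizon` steps ahead.
    Needs at least two full seasons of history for initialization."""
    m = season_length
    # Initialize level/trend/seasonals from the first two seasons
    level = sum(series[:m]) / m
    trend = (sum(series[m:2 * m]) - sum(series[:m])) / (m * m)
    seasonal = [series[i] - level for i in range(m)]

    for t in range(m, len(series)):
        last_level = level
        s = seasonal[t % m]
        level = alpha * (series[t] - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        seasonal[t % m] = gamma * (series[t] - level) + (1 - gamma) * s

    return [level + (h + 1) * trend + seasonal[(len(series) + h) % m]
            for h in range(horizon)]

# History with a clear period-3 seasonality
history = [10, 20, 30] * 8
predicted = holt_winters_forecast(history, season_length=3, horizon=1)[0]
observed = 55
is_anomaly = abs(observed - predicted) > 5  # sensitivity threshold
# predicted is ~10.0 (next point in the 10, 20, 30 cycle), so 55 is flagged
```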

Remote HTTP

The Remote HTTP detector allows the anomaly detection to be performed by a remote HTTP service. The user can configure the alert to point to a REST endpoint. This endpoint must be able to accept the ThirdEye detection payload and respond with a specific response format. Upon a successful exchange, the response is shared with downstream operators, completing the detection workflow.

Sample alert JSON

Note the AnomalyDetector node in the alert. The type should be set to REMOTE_HTTP. Here, component.url is used to pass the URL of the HTTP detector service.

{
  "name": "sample-alert-using-remote-http",
  "description": "Sample description payload for testing",
  "cron": "0 0 0 1/1 * ? *",
  "template": {
    "nodes": [
      {
        "name": "root",
        "type": "AnomalyDetector",
        "params": {
          "type": "REMOTE_HTTP",
          "component.timezone": "US/Pacific",
          "component.monitoringGranularity": "P1D",
          "component.timestamp": "ts",
          "component.metric": "met",
          "component.url": "http://localhost:5000/api/http-detector",
          "anomaly.metric": "${metric}"
        },
        "inputs": [
          {
            "targetProperty": "current",
            "sourcePlanNode": "currentDataFetcher",
            "sourceProperty": "currentOutput"
          }
        ],
        "outputs": []
      },
      {
        "name": "currentDataFetcher",
        "type": "DataFetcher",
        "params": {
          "component.dataSource": "${dataSource}",
          "component.query": "SELECT __timeGroup(\"${timeColumn}\", '${timeColumnFormat}', '${monitoringGranularity}') as ts, ${metric} as met FROM ${dataset} WHERE __timeFilter(\"${timeColumn}\", '${timeColumnFormat}') GROUP BY ts ORDER BY ts LIMIT 1000"
        },
        "inputs": [],
        "outputs": [
          {
            "outputKey": "pinot",
            "outputName": "currentOutput"
          }
        ]
      }
    ]
  },
  "templateProperties": {
    "dataSource": "pinotQuickStartLocal",
    "dataset": "pageviews",
    "metric": "sum(views)",
    "monitoringGranularity": "P1D",
    "timeColumn": "date",
    "timeColumnFormat": "yyyyMMdd"
  }
}

Request Payload

Here's a sample request payload that is sent to the remote http service.

{
  "startMillis": 1553468646555,
  "endMillis": 1653468646555,
  "spec": {
    "help": "This is the anomaly detector spec object which is sent as is to the remote service",
    ...
  },
  "dataframe": {
    "seriesMap": {
      "timestamp": [1, 2, 3],
      "current": ["v1", "v2", "v3"]
    }
  }
}

Response Payload

Here's a sample response payload. This must adhere to the interface defined here in order for the pipeline to execute successfully.

In this case, the expectation is to receive a dataframe in a format defined below with a predefined set of columns.

  • current: the observed values of the metric at each timestamp
  • timestamp: the timestamps associated with the observed values
  • value: the baseline/predicted values of the metric
  • lower_bound: the allowed lower bound of the metric
  • upper_bound: the allowed upper bound of the metric
  • anomaly: boolean indicating whether the point is an anomaly

{
  "dataframe": {
    "seriesMap": {
      "timestamp": [1, 2, 3],
      "current": ["v1", "v2", "v3"],
      "value": ["baseline1", "baseline2", "baseline3"],
      "lower_bound": ["lower_bound1", "lower_bound2", "lower_bound3"],
      "upper_bound": ["upper_bound1", "upper_bound2", "upper_bound3"],
      "anomaly": ["true", "false", "false"]
    }
  }
}
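To illustrate the exchange end to end, here is a minimal sketch of a remote detector service using only the Python standard library. The detection logic (a fixed threshold of 100), the baseline/bound values, and the host and port are illustrative assumptions; a real service would implement its own model and serve the path configured in component.url:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def detect(payload):
    """Toy detection: flag values above a fixed threshold of 100.
    Builds a response dataframe with the columns listed above."""
    series = payload["dataframe"]["seriesMap"]
    timestamps = series["timestamp"]
    current = [float(v) for v in series["current"]]
    n = len(current)
    return {
        "dataframe": {
            "seriesMap": {
                "timestamp": timestamps,
                "current": current,
                "value": [100.0] * n,        # baseline/predicted values
                "lower_bound": [0.0] * n,
                "upper_bound": [100.0] * n,
                "anomaly": [v > 100.0 for v in current],
            }
        }
    }

class DetectorHandler(BaseHTTPRequestHandler):
    """Accepts the detection payload via POST and answers with the
    response payload. Path routing is omitted for brevity."""
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        response = json.dumps(detect(json.loads(body))).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(response)

def make_server(host="localhost", port=5000):
    """Build the HTTP server; call .serve_forever() to start it."""
    return HTTPServer((host, port), DetectorHandler)
```

Running `make_server().serve_forever()` would serve at http://localhost:5000, matching the component.url in the sample alert above.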

Going further

If you want to integrate your own business-specific model, see the create a detector documentation.