Before diving into algorithms, it is important to understand the principles of good alerting systems.
Without care, alert pipelines quickly become messy: they generate alert spam, cause alert fatigue, or miss important anomalies.
When an alert triggers, it should be easy to understand why. Similarly, if an alert doesn't fire, it should be easy to check and see what happened. The more complicated an alert condition becomes, the harder it is to understand and debug [1].
This principle applies well to simple metric monitoring, such as CPU usage, but it is harder to apply to operations and business monitoring, the core use cases of ThirdEye. Business monitoring is full of complex patterns, special events, and seasonalities [2].
When designing alerts, you will have to find your place on the quadrant below.
In ThirdEye, alert rules tend to be more complex than in other systems, and RCA algorithms help you make sense of the anomalies.
This does not mean you should start with complex alerts.
Good practices to apply:
- Do not try to monitor everything. Think about what is important and what is actionable.
- Start simple and iterate [3] to fine-tune your alert.
- The Pareto principle applies. Aim for the 80% before going for the very complex 20% [4].
With this in mind, here is a review of the commonly used detector algorithms in ThirdEye.
Threshold rule
The simplest method. Detects an anomaly if a metric is above a maximum threshold or below a minimum threshold.
Good for signals that are mostly flat, or that should stay out of a certain range.
Pros: easy to configure and understand.
Cons: does not manage seasonalities. You have to estimate noise yourself.
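The rule above can be sketched in a few lines of Python. This is only an illustration of the logic, not ThirdEye's implementation or configuration format; the function name `threshold_anomaly`, its parameters, and the sample values are hypothetical.

```python
from typing import Optional


def threshold_anomaly(value: float,
                      min_value: Optional[float] = None,
                      max_value: Optional[float] = None) -> bool:
    """Return True if value is above max_value or below min_value.

    A bound left as None is not checked (hypothetical convention).
    """
    if max_value is not None and value > max_value:
        return True
    if min_value is not None and value < min_value:
        return True
    return False


# Hypothetical, mostly flat signal with one spike.
cpu_usage = [41.0, 43.5, 97.2, 40.8]
flags = [threshold_anomaly(v, min_value=5.0, max_value=90.0) for v in cpu_usage]
print(flags)  # [False, False, True, False]
```

Both bounds are fixed by hand here, which is why the noise level has to be estimated by whoever configures the rule.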
Mean Variance Rule
Estimates the mean and the standard deviation of the signal, and assumes the standard deviation is caused by noise only.
If the value is above mean + n*std or below mean - n*std, an anomaly is detected.
Good if your signal is flat with a lot of noise.
Pros: no need to estimate noise yourself. Adapts if the noise changes.
Cons: does not manage seasonalities.
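A minimal sketch of this rule, assuming the mean and standard deviation are estimated from a window of recent values. The function name, the `n` default, and the sample history are hypothetical and do not reflect ThirdEye's actual detector code.

```python
from statistics import mean, stdev
from typing import Sequence


def mean_variance_anomaly(history: Sequence[float], value: float, n: float = 3.0) -> bool:
    """Flag value if it falls outside mean +/- n * std of the history."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation, treated as noise only
    return value > mu + n * sigma or value < mu - n * sigma


# Hypothetical flat, noisy signal: mean 100.5, std about 2.45.
history = [100, 104, 97, 101, 99, 103, 98, 102]
print(mean_variance_anomaly(history, 105))  # False: inside the band
print(mean_variance_anomaly(history, 110))  # True: above mean + 3*std
print(mean_variance_anomaly(history, 90))   # True: below mean - 3*std
```

Because the band is recomputed from the history, it widens or narrows when the noise level changes, which is the adaptivity mentioned in the pros.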
Percentage rule
Compares the current time series to a baseline. If the percentage change is above a certain threshold, an anomaly is detected. This is a simple way of managing seasonalities. You define the baseline yourself.
A common usage is to compare the current value to the value from the previous week. For instance, compare Thursday the 8th at 8pm to Thursday the 1st at 8pm. This way, you manage hourly and weekly seasonalities.
The percentage rule is usually easy to understand, but because it relies on division, it can be sensitive when values are close to 0.
It can also be sensitive to noise: if you define a percentage change limit of 20%, and values commonly fluctuate ±10% around the mean, you will get false positives when a baseline at -10% is followed by an observed value at +10% (a change of about +22%).
Pros: easiest way to manage seasonality. Easy to understand.
Cons: sensitive to noise [5]. Sensitive to big trends [6].
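The two pitfalls described above (noise and values close to 0) can be reproduced with a short sketch. The function names and the numbers are hypothetical and only illustrate the arithmetic; they are not taken from ThirdEye.

```python
def percentage_change(current: float, baseline: float) -> float:
    """Relative change of current versus baseline, e.g. 0.2 means +20%."""
    if baseline == 0:
        # The division is undefined here; a real detector needs its own policy.
        raise ValueError("percentage change is undefined for a zero baseline")
    return (current - baseline) / baseline


def percentage_anomaly(current: float, baseline: float, threshold: float = 0.2) -> bool:
    """Flag the point if the absolute percentage change exceeds the threshold."""
    return abs(percentage_change(current, baseline)) > threshold


# Noise only: the mean is 1000 and values commonly move 10% around it.
baseline_value = 900.0   # same hour last week, at -10%
current_value = 1100.0   # now, at +10%
noise_flag = percentage_anomaly(current_value, baseline_value)  # 200 / 900, about +22%

# Values close to 0: a move from 2 to 5 is a +150% change.
small_flag = percentage_anomaly(5.0, 2.0)

print(noise_flag, small_flag)  # True True
```

Both points are flagged even though neither is necessarily a real anomaly, which is why the threshold has to leave room for the noise on both the baseline and the observed value.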
Absolute change rule
Compares the current time series to a baseline. If the absolute change is above a certain threshold, an anomaly is detected.
An alternative to the percentage rule that works better for noisy and small values.
Pros: alternative to avoid percentage rule pitfalls.
Cons: the absolute change value is harder to set and understand than a percentage.
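As a rough sketch, the absolute change rule replaces the division with a plain difference. The function name and the threshold values below are hypothetical; picking the threshold is the hard part in practice.

```python
def absolute_change_anomaly(current: float, baseline: float, threshold: float) -> bool:
    """Flag the point if |current - baseline| exceeds the threshold."""
    return abs(current - baseline) > threshold


# Small values: a move from 2 to 5 is +150% but only 3 units.
print(absolute_change_anomaly(5.0, 2.0, threshold=50.0))        # False

# Noisy values around 1000: a 200-unit swing stays under a 250-unit limit.
print(absolute_change_anomaly(1100.0, 900.0, threshold=250.0))  # False

# A larger jump is still caught.
print(absolute_change_anomaly(1400.0, 900.0, threshold=250.0))  # True
```

The threshold is expressed in the unit of the metric, so it has to be chosen per metric rather than reused across metrics of different scale.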
Holt-Winters
The Holt-Winters method [7] is a statistical forecasting algorithm commonly used for anomaly detection. Time series are modeled as a sum of trend, seasonality, and noise [8], and the algorithm estimates these components.
This algorithm performs very well for daily and hourly data. For minute-level data, fine-tuning the parameters can take some time. The Holt-Winters method is faster than most other ML methods while keeping very good short-term forecasting performance.
The model detects an anomaly when the observed value is too far from the predicted value. The sensitivity can be fine-tuned.
Pros: manages seasonality, trend and noise.
Cons: as a model that relies heavily on past observations, it is sensitive to concept drift, false trends [9], special events, bad data, etc.
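To make the idea concrete, here is a minimal sketch of the textbook additive Holt-Winters recursion used for one-step-ahead forecasts, with a point flagged when it is further than a fixed tolerance from its forecast. This is not ThirdEye's implementation: the function names, the initialization heuristic, the smoothing values, and the fixed `tolerance` are all assumptions made for illustration.

```python
from typing import List, Sequence


def holt_winters_forecasts(series: Sequence[float], period: int,
                           alpha: float = 0.25, beta: float = 0.125,
                           gamma: float = 0.25) -> List[float]:
    """One-step-ahead additive Holt-Winters forecasts for series[period:].

    forecasts[i] is the prediction for series[period + i].
    """
    if len(series) < 2 * period:
        raise ValueError("need at least two full seasons of data")
    first, second = series[:period], series[period:2 * period]
    # Simple initialization heuristic (an assumption, other schemes exist).
    level = sum(first) / period
    trend = (sum(second) - sum(first)) / (period * period)
    seasonal = [y - level for y in first]

    forecasts = []
    for t in range(period, len(series)):
        s_prev = seasonal[t % period]          # seasonal term from one period ago
        forecasts.append(level + trend + s_prev)
        y = series[t]
        new_level = alpha * (y - s_prev) + (1 - alpha) * (level + trend)
        new_trend = beta * (new_level - level) + (1 - beta) * trend
        seasonal[t % period] = gamma * (y - level - trend) + (1 - gamma) * s_prev
        level, trend = new_level, new_trend
    return forecasts


def holt_winters_anomalies(series: Sequence[float], period: int,
                           tolerance: float) -> List[int]:
    """Indices where the observed value is further than tolerance from the forecast."""
    forecasts = holt_winters_forecasts(series, period)
    return [period + i for i, y_hat in enumerate(forecasts)
            if abs(series[period + i] - y_hat) > tolerance]


# Hypothetical seasonal signal with period 4 and one injected spike at index 14.
series = [10.0, 20.0, 30.0, 20.0] * 5
series[14] = 60.0
print(holt_winters_anomalies(series, period=4, tolerance=20.0))  # [14]
```

In this toy run the spike also shifts the level, trend, and seasonal estimates, so the forecasts for the following points drift away from the observed values for a while. That is a small-scale version of the sensitivity to bad data listed in the cons.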
If you want to integrate your own business-specific model, see the "create a detector" documentation.
[1] Similar discussion here: https://netflix.github.io/atlas-docs/asl/alerting-philosophy/#keep-it-simple
[2] This is why RCA algorithms are made available.
[3] See percentage rule and noise.
[4] It is not a coincidence that the Pareto principle was developed in the context of quality control. See https://en.wikipedia.org/wiki/Pareto_principle
[5] See percentage rule and noise.
[6] See percentage rule and big trends.
[7] See https://otexts.com/fpp2/holt-winters.html
[8] See https://otexts.com/fpp2/tspatterns.html#tspatterns
[9] See a discussion about false trends.