Dimension Exploration
Dimension Exploration gives users the ability to create alerts for every value of a certain dimension or combination of dimensions.
Simply put, dimension exploration is similar to running a for loop over different sets of values with the same detection workflow.
Problem Description
Let’s take a look at the pageviews example. pageviews is a metric that computes views aggregated by sum over a given time period. Creating an alert that fires if pageviews drops by 5% can be achieved with a single alert.
However, what if the user wants to be alerted when pageviews drops by 5% in any given country? The user would have to create a separate alert for each country.
This is the problem that Dimension Exploration solves: it gives the user the capability to write a single alert that monitors a workflow across different values of a dimension. In this example, country = { ‘US’, ‘IN’, ‘FR’ }.
Approach
The approach here leverages the flexibility of the detection pipeline to model a ‘loop’ within a workflow.
Detection Pipeline DAG
The detection pipeline supports nodes that allow multiple named input and output streams.
This gives us the ability to introduce 3 new primitives/operators.
- Enumerator Node
- Fork Join Node
- Combiner Node
Enumerator Node
An enumerator node outputs a set of key-value pairs. These key-value pairs determine the template variables used for each iteration of the ‘loop’.
Output Schema
List<Map<String, Object>>
Each of the items in the list is a map of template variables to values. Therefore, an enumerator can emit multiple variables and values per iteration.
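For illustration, an enumerator output matching this schema might look like the following, mirroring the params used in the Default Enumerator example below; each map carries the template variables for one iteration of the loop:
[
  { "queryFilters": " AND country = 'US'", "max": 300000, "min": 100000 },
  { "queryFilters": "", "max": 900000, "min": 300000 }
]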
Template variables in ThirdEye are currently restricted to strings. This poses a design limitation since there is no way to feed structured data back into the template.
An enumerator itself can be of different types. TBD.
Default Enumerator
The Default Enumerator node emits the series of values it is configured with, as-is.
This allows the user to pass a set of template properties to be fed into the child pipeline. In this example, a different value of the queryFilters property is fed into the pipeline for each run.
{
  "name": "enumerator",
  "type": "Enumerator",
  "params": {
    "type": "default",
    "items": "${enumerationItems}"
  }
}
where "enumerationItems" is a template variable that can be fed in using a json as
{
  "name": "my alert",
  ... other properties,
  "templateProperties": {
    "dataSource": "pinot",
    ...
    "enumerationItems": [
      {
        "name": "US Only",
        "params": {
          "queryFilters": " AND country = 'US'",
          "max": 300000,
          "min": 100000
        }
      },
      {
        "name": "Overall",
        "params": {
          "queryFilters": "",
          "max": 900000,
          "min": 300000
        }
      }
    ]
  }
}
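Within the child pipeline, these values can then be referenced like any other template property. A minimal sketch, assuming a hypothetical threshold detector node that consumes the min and max values from each enumeration item:
{
  "name": "anomalyDetector",
  "type": "AnomalyDetector",
  "params": {
    "type": "THRESHOLD",
    "component.min": "${min}",
    "component.max": "${max}"
  }
}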
Combiner Node
The combiner node accepts the results of a fork join operation and combines them.
A combiner can have different types. The initial approach is a combiner that collects the anomalies from each enumeration and tags them accordingly.
Example: the expectation here is to consume the list of outputs from each sub DAG execution and emit a single consolidated output. Note that each node can have multiple outputs, so the combiner essentially flattens a list of lists into a list.
{
  "name": "combiner",
  "type": "Combiner"
}
In this case, we simply use the default combiner. This should work fine for most use cases.
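For intuition, a sketch of the flattening described above, with illustrative anomaly shapes. Given one output list per enumeration, such as
[
  [ { "startTime": 100, "endTime": 200 } ],
  [ { "startTime": 400, "endTime": 500 }, { "startTime": 700, "endTime": 800 } ]
]
the default combiner emits a single flattened list of three anomalies, each tagged with the enumeration item it came from (see Enumeration Item below).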
Fork Join Node
A fork join node follows a fork + join model: it takes a set of enumerations as input and forks the pipeline to execute a sub workflow DAG for each enumeration.
Operation
The Fork Join Node takes 3 parameters (a configuration sketch follows the list):
- Enumerator: the enumerator node used to get all the enumerations that need to be executed
- Root: the root node of the sub DAG that the fork join executes, feeding in each enumeration
- Combiner: the combiner node that consumes the list of outputs from each enumeration execution
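A sketch of what the fork join configuration might look like, assuming the enumerator and combiner nodes shown above and a hypothetical anomalyDetector node as the sub DAG root:
{
  "name": "root",
  "type": "ForkJoin",
  "params": {
    "enumerator": "enumerator",
    "root": "anomalyDetector",
    "combiner": "combiner"
  }
}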
The sub DAG executions are run in parallel using a threaded execution model.
Each execution of the fork join consumes an entry from the enumerator. This is what we call an enumeration item.
Enumeration Item
The enumerator node generates a list of enumeration items. These essentially contain the set of parameters that are fed into the sub pipeline executed by the fork join. Thus, in a way, they define the results of the pipeline: any anomalies calculated in the sub pipeline need to be attributed to the corresponding enumeration item.
For ease of access, filtering, and other operations, enumeration items are also persisted in the database as entities. Here is a sample enumeration item entity:
{
  "name": "US, 1.0",
  "description": "slice for US only results",
  "params": {
    "queryFilters": " AND country='US' AND version='1.0'"
  }
}
Anomalies generated by this pipeline are automatically tagged with the corresponding enumeration item:
{
  // anomaly object
  "id": 1234,
  ...
  "enumerationItem": {
    "id": 1234
  }
}