Type-safe model-based alerting
Alerting is something you must do well. You cannot fix what you do not know about, and understanding your system well enough to know what anomalous behavior looks like is a fundamental requirement for any software developer working in a sufficiently complex system. Beyond that prerequisite knowledge of the component, its code, and its behavior, there is an implicit requirement to also understand the statistics behind your model-based alerts. However you have implemented your alerting, you are using a model to predict or delineate divergent behavior; without at least a mental model of what is expected versus unexpected, there is no reasonable way to define a coherent set of alerts. A painfully common mistake that even experienced engineers make is failing to state the assumptions behind that model. More often than I care to admit, I have seen averages of quantiles, averages of averages, or standard deviations used to estimate the likelihood of long-tail events.
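To make that first pitfall concrete, here is a small Python sketch (synthetic data with illustrative numbers only, `numpy` assumed available) showing that the average of per-instance p99s can diverge badly from the true fleet-wide p99:

```python
# Illustration (made-up numbers) of why averaging per-instance
# quantiles does not give the fleet-wide quantile.
import numpy as np

rng = np.random.default_rng(42)

# Two hypothetical instances: one healthy, one with a heavy latency tail.
healthy = rng.lognormal(mean=3.0, sigma=0.3, size=10_000)   # ~20ms typical
degraded = rng.lognormal(mean=3.0, sigma=1.2, size=10_000)  # long tail

p99_healthy = np.percentile(healthy, 99)
p99_degraded = np.percentile(degraded, 99)

# The "average of quantiles" an engineer might compute per instance...
avg_of_p99s = (p99_healthy + p99_degraded) / 2
# ...versus the true 99th percentile over all requests combined.
true_p99 = np.percentile(np.concatenate([healthy, degraded]), 99)

print(f"average of per-instance p99s: {avg_of_p99s:.1f}")
print(f"true fleet-wide p99:          {true_p99:.1f}")
# The two disagree substantially; only the latter describes the tail
# a user actually experiences.
```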
Having identified this disconnect between modeling a behavior and detecting anomalies against that model, we can consider declaring the model we are assuming at the same time and place we define our alerts. Building on the type systems now ubiquitous across programming languages, here we propose the idea of model typing for alerts. It addresses a few main problems:
- Identifying when incorrect or misleading mathematics is used in defining alerts
- Providing a mechanism by which to update or disprove these behavioral models empirically over time
- Automatically generating alerts from those model definitions based on the appropriate mathematical signals
We will start with a basic implementation of our model type checking, building off the excellent work of Prometheus and OpenMetrics, which already have basic metric types built in. Toward the first goal, this mechanism provides a way to ensure that priors are rigorously declared. Furthermore, the type checking prevents applying the wrong mathematical operations to improper data types.
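As a rough illustration of what such a checker might look like, here is a minimal Python sketch. The class and function names are hypothetical inventions for this post; only the counter/gauge/histogram/summary types mirror what Prometheus and OpenMetrics already define:

```python
# Hypothetical sketch of a model type checker for alert expressions.
from dataclasses import dataclass
from enum import Enum, auto


class MetricType(Enum):
    COUNTER = auto()
    GAUGE = auto()
    HISTOGRAM = auto()
    SUMMARY = auto()   # pre-computed quantiles


@dataclass(frozen=True)
class Series:
    name: str
    metric_type: MetricType


# Which aggregations are statistically meaningful for which metric types.
# Notably, averaging is disallowed on SUMMARY series: an average of
# quantiles is not a quantile.
ALLOWED_AGGREGATIONS = {
    "avg": {MetricType.GAUGE},
    "rate": {MetricType.COUNTER},
    "histogram_quantile": {MetricType.HISTOGRAM},
    "max": {MetricType.GAUGE, MetricType.SUMMARY},
}


def check_aggregation(op: str, series: Series) -> None:
    """Raise TypeError if `op` is not valid for the series' metric type."""
    allowed = ALLOWED_AGGREGATIONS.get(op, set())
    if series.metric_type not in allowed:
        raise TypeError(
            f"{op}() over {series.name} ({series.metric_type.name}) "
            "is not statistically meaningful"
        )


latency_p99 = Series("http_request_duration_p99", MetricType.SUMMARY)
check_aggregation("max", latency_p99)   # fine: max of p99s bounds the tail
try:
    check_aggregation("avg", latency_p99)
except TypeError as err:
    print(err)
```

Rejecting `avg` over a summary is exactly the averages-of-quantiles mistake from earlier, now caught mechanically at alert-definition time.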
The second goal is perhaps even more interesting. When we create a model (mental or otherwise) of a system, we typically base it on a smallish sample of empirical data plus a priori knowledge of how the system should function. As we collect more data about the system, we will from time to time observe anomalies. Each anomaly indicates at least one of the following:
- We are actually witnessing behavior we did not expect (something worth bringing in a human to investigate!), or perhaps more interestingly
- We can now invalidate our model choice and reevaluate how we think about that system
One of those two must be true, and by formally defining our priors, like any good Bayesian, we grant ourselves a mechanism for that post hoc model negation.
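Sketching that idea in code, a model-typed alert might carry its prior explicitly. Everything below is a hypothetical API; the distribution choice, thresholds, and names are illustrative assumptions, with `scipy` assumed available:

```python
# Hypothetical sketch: an alert that declares its prior and tracks how
# often reality contradicts it.
from dataclasses import dataclass

from scipy.stats import lognorm


@dataclass
class ModelTypedAlert:
    name: str
    # Declared prior: latency assumed lognormal; sigma is the shape of the
    # underlying normal, median_ms its median (scipy's `scale`).
    sigma: float = 0.5
    median_ms: float = 20.0
    alert_p: float = 1e-4    # tail probability extreme enough to page a human
    surprises: int = 0       # observations the model deemed near-impossible
    observations: int = 0

    def observe(self, value_ms: float) -> bool:
        """Return True if this observation should page a human."""
        self.observations += 1
        # Survival function: P(X >= value) under the declared prior.
        tail_p = lognorm.sf(value_ms, s=self.sigma, scale=self.median_ms)
        if tail_p < self.alert_p:
            self.surprises += 1
            return True
        return False

    def model_discredited(self, max_surprise_rate: float = 1e-3,
                          min_observations: int = 1000) -> bool:
        # If "one in ten thousand" events show up orders of magnitude more
        # often than declared (over enough data), suspect the model rather
        # than the system.
        return (self.observations >= min_observations
                and self.surprises / self.observations > max_surprise_rate)


alert = ModelTypedAlert(name="checkout_latency_ms")
for sample in (18.2, 25.9, 400.0):   # 400ms is near-impossible under the prior
    if alert.observe(sample):
        print(f"page a human: {sample}ms")
# After enough traffic, alert.model_discredited() tells us whether to
# revisit the lognormal assumption instead of paging anyone.
```

When `model_discredited()` flips to true, the right response is not to silence the alert but to revisit the declared prior.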