About
If you see fraud and do not say fraud, you are a fraud. – N. Taleb
There are countless misconceptions in the world of distributed software systems. I am not a theoretician; I am a practitioner with an interest in solving problems the right way. With that caveat, let us dive in.
This blog is intended to drill down into how to manage risk in distributed software systems.
We can loosely define risk as a nonlinear combination of complexity and opacity:
complexity ^ opacity ~= systemic risk
By merely reducing both of those quantities you will go far in mitigating the systemic risk of collapse.
Complexity in a system increases the likelihood of total failure, as more nodes and more connections enter the critical path. Opacity acts as a multiplier on how long it takes you to understand the failure in all its glory and recover your system.
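To make the nonlinearity concrete, here is a toy sketch of the loose relation above. The function name, the units, and every number are assumptions for illustration only: complexity is treated as a hypothetical count of nodes and connections on the critical path, and opacity as a hypothetical dimensionless exponent. The score has no meaning beyond comparing one design against another.

```python
def systemic_risk(complexity: float, opacity: float) -> float:
    """Toy score for the loose relation: complexity ^ opacity ~= systemic risk.

    complexity: assumed count of nodes and connections on the critical path (>= 1).
    opacity:    assumed dimensionless exponent (>= 0); higher means harder to see inside.
    """
    if complexity < 1 or opacity < 0:
        raise ValueError("expected complexity >= 1 and opacity >= 0")
    return float(complexity) ** opacity


# Hypothetical numbers, chosen only to show the shape of the relation.
baseline = systemic_risk(complexity=100, opacity=2.0)
simpler = systemic_risk(complexity=50, opacity=2.0)   # halve the complexity
clearer = systemic_risk(complexity=100, opacity=1.5)  # improve observability
both = systemic_risk(complexity=50, opacity=1.5)      # do both

for name, score in [("baseline", baseline), ("simpler", simpler),
                    ("clearer", clearer), ("both", both)]:
    print(f"{name:>8}: {score:10.1f} ({score / baseline:.1%} of baseline)")
```

With these made-up inputs, halving complexity cuts the score to a quarter of the baseline, while a modest drop in opacity cuts it to a tenth. That is the sense in which reducing both quantities goes a long way.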
We are not so much worried about temporary collapse of the system; that is a given over a long enough time frame. What worries us is consistent and prolonged collapse, because that is what drives people away.
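The claim that temporary collapse is a given over a long enough time frame can be sketched with basic probability. This assumes, purely for illustration, that each period carries the same small and independent chance of collapse. Real failures are rarely independent, and the function name and the 0.1% daily figure are hypothetical.

```python
def p_at_least_one_collapse(p_per_period: float, periods: int) -> float:
    """Chance of one or more collapses across `periods` independent periods."""
    if not 0.0 <= p_per_period <= 1.0:
        raise ValueError("p_per_period must be between 0 and 1")
    if periods < 0:
        raise ValueError("periods must be non-negative")
    # Probability of surviving every period is (1 - p) ** periods.
    return 1.0 - (1.0 - p_per_period) ** periods


# Hypothetical: a 0.1% chance of collapse on any given day.
one_year = p_at_least_one_collapse(0.001, 365)
ten_years = p_at_least_one_collapse(0.001, 3650)
print(f"1 year:   {one_year:.2f}")
print(f"10 years: {ten_years:.2f}")
```

Under those assumptions the chance of at least one collapse is roughly 0.31 after a year and roughly 0.97 after ten years, which is why the focus belongs on how long a collapse lasts rather than on whether one ever happens.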
These ideas and principles may be generally applicable to other kinds of distributed systems, but I do not work with those regularly and therefore will not pontificate about them.
Our goal as those entrusted with the safety of our computing systems is essentially to turn non-ergodic systems into ergodic systems, as best we can. That means ridding ourselves and our systems of every absorbing barrier we can conceive or expose, since an absorbing barrier is a state of ruin from which the system cannot recover. We also aim to build antifragile systems, ones that respond positively to load and become safer over time (rather than riskier, as we add complexity to deal with our newfound scale).
Our job as SREs, software developers, systems administrators, and operations folks is ultimately one of risk management.