Supervision is a way to ensure that a computer service does not go down should a key component fail. The principle is rather simple: another computer or program continuously checks whether the critical process or program is running and, if it is not, finds the resources required to start it again.
This other process is called a supervisor, and it is an essential component of any cloud application.
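To make the principle concrete, here is a minimal sketch in Python of a supervisor that watches one child process and starts it again when it is found not running. The names `Supervisor`, `check` and `supervise` are hypothetical and chosen for illustration only; this is not how SStorage implements supervision.

```python
import subprocess
import sys
import time


class Supervisor:
    """Illustrative supervisor: keeps one child process running."""

    def __init__(self, command):
        self.command = command  # argv list of the critical program
        self.child = None       # handle of the running child, if any
        self.restarts = 0       # how many times the child had to be restarted

    def check(self):
        """Start the child if it is missing or has exited.

        Returns True when a start or restart happened, False otherwise.
        """
        # poll() returns None while the child is still running.
        if self.child is not None and self.child.poll() is None:
            return False
        if self.child is not None:
            self.restarts += 1
        self.child = subprocess.Popen(self.command)
        return True


def supervise(command, interval=1.0):
    """Check the child forever, once every `interval` seconds."""
    supervisor = Supervisor(command)
    while True:
        supervisor.check()
        time.sleep(interval)
```

Calling `supervise([sys.executable, "worker.py"])`, where `worker.py` stands for any critical program, would keep that program running for as long as the supervisor itself stays alive.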
A supervisor's task is to minimize the downtime of a critical service. To do so, it must also minimize its own downtime. Supervised software therefore rests on two premises:
- The supervisor software runs as far as possible from the software it is keeping track of
- The supervision system can restart the supervisor if it crashes
The first premise means that the supervisor runs in an independent process, on an independent computer, or even in a different datacenter than the application. The second means that the supervision system is designed so that either multiple supervisors oversee one another, or a supervised process can order the restart of the supervisor when the supervisor stops checking that the process is alive.
This kind of mechanism is useful in any form of cloud software, which is why SStorage, our encrypted cloud storage, uses it.
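The second premise can be sketched as a watchdog held by the supervised process: it records each liveness check it receives and orders a restart of the supervisor once the checks stop coming. The class names, the timeout value and the `restart_supervisor` callback below are assumptions made for the example, not part of any real system.

```python
import time


class Watchdog:
    """Remembers when a peer last checked in and flags it after a timeout."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_seen = clock()

    def ping(self):
        """Record that the peer has just been heard from."""
        self.last_seen = self.clock()

    def peer_is_alive(self):
        """True while the last check-in is no older than the timeout."""
        return self.clock() - self.last_seen <= self.timeout


class SupervisedProcess:
    """A process that watches its own supervisor (hypothetical design)."""

    def __init__(self, restart_supervisor, timeout=5.0, clock=time.monotonic):
        self.watchdog = Watchdog(timeout, clock)
        self.restart_supervisor = restart_supervisor  # assumed callback

    def on_liveness_check(self):
        """Called each time the supervisor checks on this process."""
        self.watchdog.ping()

    def tick(self):
        """Called periodically; restarts the supervisor if it went silent.

        Returns True when a restart was ordered.
        """
        if self.watchdog.peer_is_alive():
            return False
        self.restart_supervisor()
        self.watchdog.ping()  # give the new supervisor a full timeout
        return True
```

The clock is injectable so that the timeout logic can be exercised without waiting in real time.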
In Open Telecom Platform
Open Telecom Platform (OTP) is a toolkit built around 1995 by the Swedish telecommunications company Ericsson to run its telephone switches. The project owes much to Joe Armstrong, one of the fathers of modern distributed computing.
The principle is that any process can start another process on either a local or a distant node. This makes supervision easy to implement on this system: any node can ask another node to start a new service and properly redirect whatever depends on the process.
In SStorage, supervision is done by a set of supervisors that verify each other's existence and are themselves managed by an automatically elected leader. The leader supervisor is responsible for managing the non-leader supervisors. The latter have other duties as well, including keeping track of the structure of processes currently running on the cluster.
In SStorage, this set of supervisors is named d_system.
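The election rule used by d_system is not described here. As an illustration only, the sketch below assumes the simplest possible rule: among the supervisors currently alive, the one with the lowest identifier becomes leader, and a new election happens only when the current leader disappears. `SupervisorGroup` and `elect_leader` are hypothetical names.

```python
def elect_leader(alive_ids):
    """Return the lowest id among the alive supervisors, or None if empty.

    Assumed rule for this example; the real election may work differently.
    """
    return min(alive_ids) if alive_ids else None


class SupervisorGroup:
    """Tracks which supervisors are alive and which one leads them."""

    def __init__(self, ids):
        self.alive = set(ids)
        self.leader = elect_leader(self.alive)

    def mark_dead(self, sup_id):
        """Remove a supervisor; re-elect only if it was the leader."""
        self.alive.discard(sup_id)
        if sup_id == self.leader:
            self.leader = elect_leader(self.alive)

    def mark_alive(self, sup_id):
        """Add a supervisor; it becomes leader only if there is none."""
        self.alive.add(sup_id)
        if self.leader is None:
            self.leader = elect_leader(self.alive)
```

Keeping the current leader when a supervisor with a lower identifier rejoins is a deliberate choice in this sketch: it avoids changing leaders more often than necessary.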
Supervision systems are another component of reliable computer systems. They are a quite natural answer to a very common problem in computer science. They make our services highly available, and when there is downtime, they keep it as short as possible. This is part of what enables us to write a reliable end-to-end encrypted cloud storage today.