Failure Scenarios
Overview
This document outlines the different failure scenarios we have anticipated and what our mitigation is.
Failure Scenarios
Event | Consequences | Mitigations |
---|---|---|
Github Failure | Github is a dependency for Flux. Without Github changes to the manifests cannot be applied, including the deployment of new applications, infrastructure change. It doesn't affect the operational state of the cluster. If using Github as package (image) registry as well, new applications that have to run won't be able to do. | - |
Cloud Failure | In case of complete Cloud failure, our platform will be unaccessible (partially or fully). | Choose trustworthy provider. A costly mitigation is having a hot side in another provider. Given the scope of our operations, this is not recommended |
Storage Failure | Object storage within the cloud should be used mostly for non-operational tasks, such as backups. If used for stateful applications, then unavailability of this will cause the corresponding applications to fail. | - |
Public Network Failure | Public network failure will make our public applications unreachable from outside the private network (i.e., for users) | - |
Private Network Failure | Internal network failure can lead to cluster working in degraded state. Stateless applications will be able to move to other nodes in the cluster. Stateful applications bound to failed nodes will be unavailable. If the node(s) affected are the ones performing ingress in the cluster, the consequence is also the same as the public network failure. | Use distributed storage or networked storage with high SLA to tolerate node failures. The complete resiliency will depend by the node count and redundancy, Kubernetes cluster will keep working until N/2+1 control plane nodes are available. Same considerations for the Vault node/cluster. |
Node Failure | Node failure has the exact same effect as the above Private Network Failure, as from the Kubernetes perspective there is no difference between the node being off and the node being unreachable. | - |
Bastion Failure | Bastions unavailable means that Shiftcollective engineers won't be able to access the Kubernetes nodes via SSH. The cluster functionality is not affected. | Use multiple machines with floating IP. Perform ad-hoc changes to the firewall to allow SSH access directly to nodes (essentially, skip Bastion). |
WAF Failure | WAF failure will have the same consequences as the Public Network Failure | Switch DNS names to point to ingress IPs directly and allow public traffic from firewall (essentially, skip WAF). |
Vault Failure | Vault failure will impact Shiftcollective engineers by not allowing SSH access to other nodes, including bastions. Kubernetes applications that need to start and require secrets/API keys will also not be available. Already running applications will retain access to the previously mounted secrets. | Run Vault in cluster with 3-5 members. |
Table of Contents