GitLab Chaos Templates for Kubernetes-based CI/CD
Problem to solve
- Developers of Kubernetes applications (stateless and stateful) are adopting chaos engineering as a means of hardening application resilience. Typically, this has been the purview of SRE teams, with failures injected in pre-prod and sometimes prod environments. However, there is increasing interest in using chaos in the CI/CD pipelines themselves. It would be ideal to have stages (or whole pipelines, for example an extended e2e pipeline) in which chaos "tests" are performed by injecting specific failures into desired "components" while checking for application availability.
Intended users
- Developers and DevOps engineers of applications that are built on, or intended to run on, Kubernetes
Further details
- Some of the ways in which developers can achieve this today (without this proposal/feature):
  - Write custom scripts containing the failure-injection logic, keep them alongside the business logic in the application repo, and call them from .gitlab-ci.yml. These scripts may or may not use existing tools such as kube-monkey or pumba to cause pod failures and network delays.
Proposal
- Having readily usable GitLab "chaos templates" that perform standard Kubernetes failures is desirable, as it allows developers to spend more time on application business logic. The developer can define one or more chaos stages in the .gitlab-ci.yml spec, each of which performs a "remote include" of a chaos template. The template is packaged with its own runner image, an execution script, and a standard set of variables that describe the chaos experiment (these are typically overridden), as illustrated in the following example.
- In the application developer's .gitlab-ci.yml:
include:
  - remote: 'https://github.com/<repo>/raw/<branch>/chaos/pod-templates/pod-failure.yaml'
stages:
  - build
  - test
  - deploy
  - chaos
  - pre-prod
< build-job >
< test-job >
< deploy-job >
:
Inject Pod Failure Chaos:
  stage: chaos
  extends: .pod_failure_template
  variables:
    app_ns: default
    app_label: "app=nginx"
  dependencies:
    - <deploy-job>
  artifacts:
    paths:
      - <kubeconfig>
- In the GitLab chaos templates source folder (../chaos/pod-templates/pod-failure.yaml):
---
variables:
  app_ns: percona
  app_label: 'app=mysql'
  mode: pod-delete
.pod_failure_template:
  image: litmuschaos/ansible-runner:ci
  script:
    - /chaos/pod-templates/pod_failure --mode $mode --namespace $app_ns --label $app_label
- What does the chaos executor run?
Litmus is an open-source chaos engineering framework for Kubernetes. It runs chaos experiments in a Kubernetes-native way, i.e., as jobs, with the application and failure parameters passed as ENV variables. Internally, it uses its own LitmusLib (written as Ansible taskfiles) along with other tools such as chaoskube and pumba. In the above example, the respective job manifest (which is bundled inside the runner image) is preconditioned with the arguments passed and then executed to inject the failures. Typically, the application endpoints are monitored for a good status (Running), and the results are captured in a custom resource that the executor script queries to determine the success of the GitLab job.
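To make the "job with ENV variables" idea concrete, the following is a minimal sketch of what such a preconditioned Kubernetes Job might look like. It is not the actual Litmus manifest: the job name, the `chaos-runner` service account, the environment variable names (`APP_NAMESPACE`, `APP_LABEL`, `CHAOS_MODE`) and the command placeholder are all assumptions made for illustration. Only the runner image and the parameter values come from the example above.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pod-failure                      # hypothetical job name
  namespace: default
spec:
  backoffLimit: 0                        # a failed experiment is reported, not retried
  template:
    spec:
      serviceAccountName: chaos-runner   # hypothetical dedicated service account
      restartPolicy: Never
      containers:
        - name: chaos
          image: litmuschaos/ansible-runner:ci
          env:                           # hypothetical variable names
            - name: APP_NAMESPACE        # filled from --namespace / $app_ns
              value: default
            - name: APP_LABEL            # filled from --label / $app_label
              value: app=nginx
            - name: CHAOS_MODE           # filled from --mode / $mode
              value: pod-delete
          # Placeholder: the real entry point is bundled in the runner image
          command: ["<experiment-entrypoint>"]
```

Under this assumption, the executor script would substitute the three values, submit the Job, wait for it to finish, and then read the result resource to decide the exit code of the GitLab job.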
Some common failure-injection/chaos-experiments that are desired:
- Random pod deletion
- Random container kill
- Network (egress) delays on containers
- Packet loss to specific containers
- Simulated (via 'eviction taints') & actual node loss (reboot/resets)
- Daemon crash (kubelet, docker)
- CPU/Memory load
- Simulated & actual Disk Loss
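Each of these experiments could follow the same template pattern as the pod failure example. The fragment below is a sketch of how a pipeline might combine two of them. The second template file (`network-delay.yaml`), its hidden job (`.network_delay_template`) and the `delay_ms` variable are hypothetical and do not appear in the proposal. The fragment also assumes that a `chaos` stage is already declared under `stages`.

```yaml
include:
  - remote: 'https://github.com/<repo>/raw/<branch>/chaos/pod-templates/pod-failure.yaml'
  # Hypothetical second template, named here only for illustration
  - remote: 'https://github.com/<repo>/raw/<branch>/chaos/network-templates/network-delay.yaml'

Inject Pod Failure Chaos:
  stage: chaos
  extends: .pod_failure_template
  variables:
    app_ns: default
    app_label: "app=nginx"

Inject Network Delay Chaos:
  stage: chaos
  extends: .network_delay_template   # hypothetical hidden job
  variables:
    app_ns: default
    app_label: "app=nginx"
    delay_ms: "2000"                 # hypothetical experiment parameter
```

Jobs in the same stage can run in parallel when runners are available, so experiments that should not overlap on the same application would need to be placed in separate stages.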
Permissions and Security
- The chaos templates run Litmus jobs, which run with a dedicated service account and RBAC setup.
- Some of the chaos tests use privileged containers to run system-level experiments.
- Chaos tests executed on cloud providers such as AWS may need tokens passed to the template (as GitLab ENV variables).
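As an illustration of the dedicated service account and RBAC setup, the following sketch grants only what a pod-deletion experiment would plausibly need in a single namespace. The `chaos-runner` name, the `default` namespace and the exact verb list are assumptions. The real Litmus setup may require broader permissions, for example to write its result resource.

```yaml
# Hypothetical names throughout: "chaos-runner" in the "default" namespace
apiVersion: v1
kind: ServiceAccount
metadata:
  name: chaos-runner
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chaos-runner
  namespace: default
rules:
  - apiGroups: [""]                  # core API group
    resources: ["pods"]
    verbs: ["get", "list", "watch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chaos-runner
  namespace: default
subjects:
  - kind: ServiceAccount
    name: chaos-runner
    namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: chaos-runner
```

A namespaced Role is used here rather than a ClusterRole to keep the blast radius small. Node-level experiments such as taints or daemon crashes would need cluster-scoped permissions instead.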
Documentation
- Different chaos templates require specific arguments depending on the nature of the experiment, and these have to be documented.
- A general workflow of the experiment run should be documented.
Testing
- Remote includes of templates are a supported GitLab feature.
- The executor scripts, the chaos K8s jobs (called litmusbooks) and the ansible-utils need to undergo tests. Currently, Litmus has a limited CI that performs lint and validation functions on the artifacts.
What does success look like, and how can we measure that?
- Adoption of GitLab chaos templates to perform chaos in CI/CD
What is the type of buyer?
Links / references
- Details of Litmus are available at https://litmusdocs.openebs.io/
- OpenEBS CI uses Litmus in e2e: https://blog.openebs.io/a-primer-on-openebs-continuous-integration-b6162243cf86