GitLab Chaos Templates for Kubernetes-based CI/CD

Problem to solve

Developers of Kubernetes applications, both stateless and stateful, are adopting chaos engineering as a way to harden application resiliency. Traditionally, this has been the purview of SRE teams, who inject failures in pre-prod and sometimes prod environments. However, there is increasing interest in using chaos in the CI/CD pipelines themselves. It would be ideal to have stages (or dedicated pipelines, for example an extended e2e run) where chaos "tests" are performed by injecting specific failures into the desired "components" while checking that the application remains available.

Intended users

Developers and DevOps engineers of applications that are built on, or intended to run on, Kubernetes.

Further details

Some of the ways in which developers can achieve this today (without this proposal/feature):
- Write custom scripts that inject failures, keep them alongside the business logic in the application repo, and include them in .gitlab-ci.yml. These scripts may or may not use existing tools such as kube-monkey or pumba to cause pod failures and network delays.

Proposal

Readily usable GitLab "chaos templates" that perform standard Kubernetes failures are desirable, because they let developers spend more time on application business logic. The developer can define one or more chaos stages in a .gitlab-ci.yml spec that performs a "remote include" of the chaos template. The template is packaged with its own runner image, an execution script and a standard set of variables that describe the chaos experiment (these are typically overridden), as illustrated in the following example.
In the application developer's .gitlab-ci.yml:

include:
  - remote: 'https://github.com/<repo>/raw/<branch>/chaos/pod-templates/pod-failure.yaml'
     
stages:
  - build
  - test
  - deploy
  - chaos
  - pre-prod

# <build-job>
# <test-job>
# <deploy-job>
# ...
Inject Pod Failure Chaos:
  stage: chaos
  extends: .pod_failure_template
  variables:
    app_ns: default
    app_label: "app=nginx"
  dependencies:
    - <deploy-job>
  artifacts:
    paths:
      - <kubeconfig>

In the GitLab chaos templates source folder (../chaos/pod-templates/pod-failure.yaml):

---
variables:
  app_ns: percona
  app_label: 'app=mysql'
  mode: pod-delete

.pod_failure_template:
  image: litmuschaos/ansible-runner:ci
  script:
    - /chaos/pod-templates/pod_failure --mode $mode --namespace $app_ns --label $app_label

What does the chaos executor run?

Litmus is an open-source chaos engineering framework for Kubernetes. It runs chaos experiments in a Kubernetes-native way, that is, as jobs, with the application and failure parameters passed as ENV variables. Internally, it uses its own LitmusLib (written as Ansible task files) along with other tools such as chaoskube and pumba. In the above example, the respective job manifest (which is bundled inside the runner image) is preconditioned based on the arguments passed and then executed to inject the failures. Typically, the application endpoints are monitored for a good status (Running), and the results are captured in a custom resource, which the executor script queries to determine the success of the GitLab job.
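
As a rough illustration of the job-based model described above, a chaos job manifest bundled in the runner image might look something like the following sketch. This is not the actual Litmus manifest: the job name, namespace, service account name, environment variable names (APP_NAMESPACE, APP_LABEL, CHAOS_MODE) and playbook path are all assumptions made for illustration. Only the runner image and the idea of passing application and failure parameters as ENV variables come from this proposal.

```yaml
# Hypothetical chaos job manifest; names and paths are illustrative, not taken from Litmus.
apiVersion: batch/v1
kind: Job
metadata:
  generateName: pod-failure-        # needs `kubectl create`, not `kubectl apply`
  namespace: default                # assumed namespace for the chaos job
spec:
  backoffLimit: 0                   # do not silently retry a failed experiment
  template:
    metadata:
      labels:
        name: pod-failure
    spec:
      serviceAccountName: chaos-runner   # assumed dedicated service account
      restartPolicy: Never
      containers:
        - name: chaos-runner
          image: litmuschaos/ansible-runner:ci
          env:
            # The executor script would fill these in from --namespace, --label and --mode.
            - name: APP_NAMESPACE
              value: "default"
            - name: APP_LABEL
              value: "app=nginx"
            - name: CHAOS_MODE
              value: "pod-delete"
          command: ["ansible-playbook"]
          args: ["/chaos/pod-templates/pod_failure.yml"]   # hypothetical playbook path
```

Under this assumption, "preconditioning" the manifest amounts to rewriting the three env values before the executor script creates the job and waits for its result.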

Some common failure-injection/chaos-experiments that are desired:

- Random pod deletion
- Random container kill
- Network (egress) delays on containers
- Packet loss to specific containers
- Simulated (via 'eviction taints') & actual node loss (reboot/resets)
- Daemon crash (kubelet, docker)
- CPU/Memory load
- Simulated & actual disk loss
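
To show how several of these experiments could be combined in one pipeline, the sketch below includes two templates and runs each as its own job in the chaos stage. Only the pod failure template is described in this proposal. The second template file (network-delay.yaml), the hidden job .network_delay_template and the delay_ms variable are hypothetical. The sketch also assumes that the chaos stage is already declared in the pipeline's stages list.

```yaml
include:
  - remote: 'https://github.com/<repo>/raw/<branch>/chaos/pod-templates/pod-failure.yaml'
  # Hypothetical second template; this path is not part of the proposal.
  - remote: 'https://github.com/<repo>/raw/<branch>/chaos/network-templates/network-delay.yaml'

Inject Pod Failure Chaos:
  stage: chaos
  extends: .pod_failure_template
  variables:
    app_ns: default
    app_label: "app=nginx"

Inject Network Delay Chaos:
  stage: chaos
  extends: .network_delay_template   # hypothetical hidden job from the second template
  variables:
    app_ns: default
    app_label: "app=nginx"
    delay_ms: "2000"                 # hypothetical template variable
```

Jobs in the same stage run in parallel by default, so two experiments that target the same application may interfere with each other. Placing them in separate stages is one way to run them one after the other.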

Permissions and Security

The chaos templates run Litmus jobs, which run with a dedicated service account and RBAC setup.
Some of the chaos tests use privileged containers to run system-level experiments.
Chaos tests executed on cloud providers such as AWS may need tokens passed to the template (GitLab ENV).
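
A minimal sketch of what a dedicated service account and RBAC setup might look like for a pod-deletion experiment confined to one namespace. The resource names (chaos-runner, chaos-pod-failure), the namespace and the exact rule set are assumptions; the permissions Litmus actually requires may be broader, and node-level experiments would need cluster-scoped roles instead.

```yaml
# Hypothetical RBAC for a namespaced pod-failure experiment; names are illustrative.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: chaos-runner
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chaos-pod-failure
  namespace: default
rules:
  # Find the target pods by label and delete them.
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch", "delete"]
  # Create and clean up the chaos job itself.
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chaos-pod-failure
  namespace: default
subjects:
  - kind: ServiceAccount
    name: chaos-runner
    namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: chaos-pod-failure
```

Scoping the role to a single namespace limits the blast radius of an experiment to the application under test.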

Documentation

Different chaos templates require specific arguments depending on the nature of the experiment, and these have to be documented.
A general workflow of the experiment run should be documented.

Testing

Remote includes of GitLab templates are a supported GitLab feature.
The executor scripts, chaos K8s jobs (called litmusbooks) and Ansible utils need to be tested. Currently, Litmus has a limited CI that performs lint and validation functions on the artifacts.

What does success look like, and how can we measure that?

Adoption of GitLab chaos templates to perform chaos in CI/CD.

What is the type of buyer?

Links / references

Details of Litmus are available at https://litmusdocs.openebs.io/
OpenEBS CI uses Litmus in e2e: https://blog.openebs.io/a-primer-on-openebs-continuous-integration-b6162243cf86

Edited Aug 14, 2020 by 🤖 GitLab Bot 🤖