Build Chaos Testing Proof of Concept
Background
Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production. https://principlesofchaos.org/
The typical process involves:
- Defining a 'steady state', i.e. what the system does during normal operation
- Hypothesizing that the 'steady state' will continue to hold throughout an 'experiment'
- Introducing events and actions that simulate errors in the system, e.g. server crashes, disk failures, network errors, etc.
- Attempting to disprove the hypothesis by looking and testing for differences between the 'steady state system' and the 'experimental system'
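The steps above can be sketched as a small experiment runner. This is a minimal illustration only, not how gitlab-qa implements anything: `Experiment`, `run_experiment`, and `FakeService` are hypothetical names, and the "system" is an in-memory stand-in so the example is self-contained.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """Hypothetical container for one chaos experiment."""
    name: str
    steady_state: Callable[[], bool]  # probe: True while the system behaves normally
    inject: Callable[[], None]        # introduce the simulated failure
    rollback: Callable[[], None]      # undo the failure


def run_experiment(exp: Experiment) -> dict:
    # Check the steady state first; an already unhealthy system tells us nothing.
    if not exp.steady_state():
        return {"name": exp.name, "status": "aborted"}

    # Inject the failure, then try to disprove the hypothesis
    # that the steady state still holds.
    exp.inject()
    try:
        held = exp.steady_state()
    finally:
        # Always undo the failure, even if the probe raised.
        exp.rollback()

    return {
        "name": exp.name,
        "status": "passed" if held else "failed",
        "recovered": exp.steady_state(),
    }


class FakeService:
    """In-memory stand-in for a replicated service (assumption for the demo)."""

    def __init__(self, replicas: int) -> None:
        self.replicas = replicas
        self.down = 0

    def healthy(self) -> bool:
        # Healthy as long as at least one replica is still up.
        return self.replicas - self.down >= 1

    def kill(self, count: int) -> None:
        self.down = min(self.replicas, self.down + count)

    def restore(self) -> None:
        self.down = 0


service = FakeService(replicas=2)
result = run_experiment(
    Experiment(
        name="lose one replica",
        steady_state=service.healthy,
        inject=lambda: service.kill(1),
        rollback=service.restore,
    )
)
print(result)
```

In a real run, the `steady_state` probe would be a selection of E2E or performance tests rather than a boolean check on a fake object.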
While test environments can provide a level of confidence, the gold standard would be to continuously run these experiments against a production environment to truly understand and have confidence in the system's capabilities. We should, however, begin this process in test environments to better understand our assumptions, keeping in mind that we may want to use these experiments in production in the future.
Initial Proposal:
- Use existing E2E test orchestration as a basis for test environments
- Use a selection of gitlab-qa E2E tests and/or performance tests to define and test the 'steady state'
- Define a small subset of error conditions to build out a prototype of tests
  - Gitaly server crashes
  - CPU spike on GitLab instance
  - Postgres primary outage
- Use https://github.com/Shopify/toxiproxy as the tool to enable chaos tests focused on networking issues
  - allows us to run a Docker container which can easily link into our existing orchestrated test suite
  - allows us to reuse the existing QA E2E test framework as a basis for writing automated tests
  - allows us to build on top of our existing test pipelines, so there is no need for additional side projects
  - can be scheduled as a nightly job initially to determine effectiveness without impacting pipeline durations
  - if deemed stable and providing high value, can be integrated into more frequent pipelines
  - eases adoption by other team members, who will be familiar with the QA E2E test suite
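As a rough illustration of how a test could drive Toxiproxy, the sketch below builds requests for its HTTP API (`POST /proxies`, `POST /proxies/{name}/toxics`, `DELETE /proxies/{name}/toxics/{toxic}`), which listens on port 8474 by default. The host name, the proxy name `gitaly`, the listen and upstream addresses, and all helper function names are assumptions made for the example; the actual gitlab-qa component may be wired differently.

```python
import json
import urllib.request

# Toxiproxy's API listens on port 8474 by default; the host is an assumption.
TOXIPROXY_API = "http://localhost:8474"


def proxy_payload(name: str, listen: str, upstream: str) -> dict:
    """Body for POST /proxies: route traffic from `listen` to `upstream`."""
    return {"name": name, "listen": listen, "upstream": upstream, "enabled": True}


def latency_toxic_payload(name: str, latency_ms: int, jitter_ms: int = 0) -> dict:
    """Body for POST /proxies/{proxy}/toxics using the built-in latency toxic."""
    return {
        "name": name,
        "type": "latency",
        "stream": "downstream",
        "toxicity": 1.0,  # apply to every connection
        "attributes": {"latency": latency_ms, "jitter": jitter_ms},
    }


def build_request(method: str, path: str, payload=None,
                  base_url: str = TOXIPROXY_API) -> urllib.request.Request:
    """Prepare (but do not send) a JSON request to the Toxiproxy API."""
    data = json.dumps(payload).encode() if payload is not None else None
    return urllib.request.Request(
        base_url + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )


def send(request: urllib.request.Request):
    """Send a prepared request; needs a running Toxiproxy instance."""
    with urllib.request.urlopen(request, timeout=5) as response:
        body = response.read()
        return json.loads(body) if body else None  # DELETE returns no body


def slow_down_gitaly(latency_ms: int = 1000) -> None:
    """Hypothetical experiment step: add latency in front of Gitaly."""
    # Addresses are placeholders for whatever the orchestrated environment uses.
    send(build_request("POST", "/proxies",
                       proxy_payload("gitaly", "0.0.0.0:18075", "gitaly:8075")))
    send(build_request("POST", "/proxies/gitaly/toxics",
                       latency_toxic_payload("gitaly_latency", latency_ms)))


def restore_gitaly() -> None:
    """Remove the toxic so the steady state can be re-checked."""
    send(build_request("DELETE", "/proxies/gitaly/toxics/gitaly_latency"))
```

Nothing is sent when the module is loaded; a test would call `slow_down_gitaly()` before running its steady-state checks and `restore_gitaly()` afterwards.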
Future Work
With this proof-of-concept prototype in place, we should be able to keep adding experiments and build out the set from here.
Tasks to complete:
- Add chaos component to gitlab-org/gitlab-qa: gitlab-qa!1040 (merged)
- Add job to CI pipeline to run on-demand/schedule: gitlab-qa!1040 (merged)
- Add E2E tests to gitlab-org/gitlab: !98019