Build Chaos Testing Proof of Concept


Background

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production. https://principlesofchaos.org/

The typical process involves:

Define a 'steady state', i.e. what the system does during normal operation
Hypothesize that the 'steady state' will continue throughout an 'experiment'
Introduce events and actions that simulate errors in the system, e.g. server crashes, disk failures, network errors, etc.
Attempt to disprove the hypothesis by looking and testing for differences between the 'steady state system' and the 'experimental system'
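
The four steps above can be sketched as a small experiment loop. This is only an illustration under stated assumptions: `run_experiment`, `ExperimentResult`, and `FakeService` are hypothetical names, and the fake service stands in for a real dependency whose steady-state check would normally be an E2E or performance test.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    steady_during: bool  # did the steady state hold while the fault was active?
    steady_after: bool   # did the system recover once the fault was removed?


def run_experiment(
    steady_state: Callable[[], bool],
    inject_fault: Callable[[], None],
    remove_fault: Callable[[], None],
) -> ExperimentResult:
    # Step 1: the hypothesis is only meaningful if the system starts out healthy.
    if not steady_state():
        raise RuntimeError("system is not in its steady state; aborting experiment")
    # Step 3: introduce the simulated error.
    inject_fault()
    try:
        # Step 4: try to disprove the hypothesis while the fault is active.
        during = steady_state()
    finally:
        # Always remove the fault, even if the check itself raises.
        remove_fault()
    return ExperimentResult(steady_during=during, steady_after=steady_state())


class FakeService:
    """Hypothetical stand-in for a real dependency (not part of any test suite)."""

    def __init__(self) -> None:
        self.up = True

    def crash(self) -> None:
        self.up = False

    def restart(self) -> None:
        self.up = True

    def responds(self) -> bool:
        return self.up


service = FakeService()
result = run_experiment(service.responds, service.crash, service.restart)
print(result)
```

For this fake service the hypothesis is disproved: the steady state does not hold while the fault is active, but it does return after the fault is removed. In a real experiment, that difference is the finding to investigate.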

While test environments can provide a level of confidence, the gold standard would be to continuously run these experiments against a production environment to truly understand and have confidence in the system's capabilities. However, we should begin this process in test environments to better understand our assumptions, keeping in mind that we may want to run these experiments in production in the future.

Initial Proposal:

Use existing E2E test orchestration as a basis for test environments
Use a selection of gitlab-qa E2E tests and/or performance tests to define and test the 'steady state'
Define a small subset of error conditions to build out a prototype of tests
- gitaly server crashes
- CPU spike on gitlab instance
- postgres primary outage
Use https://github.com/Shopify/toxiproxy as the tool to enable chaos tests focused on networking issues.
- allows us to run a docker container which can easily link into our existing orchestrated test suite
- allows us to reuse existing QA E2E test framework as a basis for writing automated tests
- allows us to build on top of our existing test pipelines, so there is no need for additional side projects
- can be scheduled for a nightly job initially to determine effectiveness without impacting pipeline durations etc
  - if deemed stable and providing high value, can be integrated into more frequent pipelines
- ease of adoption by other team members who will be familiar with the QA E2E test suite
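
As a rough sketch of what driving Toxiproxy from a test could look like, the snippet below builds requests for Toxiproxy's HTTP API: one to create a proxy and one to add a `latency` toxic to it. The proxy name `gitaly`, the listen and upstream addresses, and the helper function names are all hypothetical, and the API address assumes Toxiproxy's default port of 8474. The requests are only constructed here, not sent, so the example runs without a Toxiproxy container.

```python
import json
import urllib.request

# Assumption: Toxiproxy's HTTP API is reachable on its default port.
TOXIPROXY_API = "http://localhost:8474"


def _post(url: str, payload: dict) -> urllib.request.Request:
    # Build (but do not send) a JSON POST request.
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_proxy_request(name: str, listen: str, upstream: str,
                         api: str = TOXIPROXY_API) -> urllib.request.Request:
    # A proxy sits between the client and the real service (the upstream).
    payload = {"name": name, "listen": listen, "upstream": upstream, "enabled": True}
    return _post(f"{api}/proxies", payload)


def add_latency_toxic_request(proxy: str, latency_ms: int, jitter_ms: int = 0,
                              api: str = TOXIPROXY_API) -> urllib.request.Request:
    # A latency toxic delays data flowing through the proxy.
    payload = {
        "name": f"{proxy}_latency",
        "type": "latency",
        "stream": "downstream",
        "toxicity": 1.0,
        "attributes": {"latency": latency_ms, "jitter": jitter_ms},
    }
    return _post(f"{api}/proxies/{proxy}/toxics", payload)


# Hypothetical addresses, for illustration only.
proxy_req = create_proxy_request("gitaly", "0.0.0.0:18075", "gitaly.test:8075")
toxic_req = add_latency_toxic_request("gitaly", latency_ms=2000)

for req in (proxy_req, toxic_req):
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))

# To actually apply these against a running Toxiproxy instance:
# urllib.request.urlopen(proxy_req)
# urllib.request.urlopen(toxic_req)
```

A test would send these requests before exercising the steady-state checks, then delete the toxic afterwards so later tests see a healthy network again.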

Future Work

With this proof-of-concept prototype in place, we should be able to keep adding experiments and build out the set from here.

Tasks to complete:

Add chaos component to gitlab-org/gitlab-qa gitlab-qa!1040 (merged)
Add job to CI pipeline to run on-demand/schedule gitlab-qa!1040 (merged)
Add E2E tests to gitlab-org/gitlab !98019

Edited Sep 28, 2025 by 🤖 GitLab Bot 🤖