Improve test environment reliability and reduce flaky/transient test failures

Summary

Flaky and transient test failures often have the test infrastructure or test environment as their root cause. These failures can be difficult to identify and debug. Unless we track and categorize them, we are less likely to get the needed fixes into our infrastructure. Tracking and categorization can also help us identify improvements to our existing framework, such as in the areas of observability and testability.

Goal

Investigate, track, and aggregate test environment and test infrastructure issues to improve test reliability. With sufficient data collected, we can then advocate for infrastructure, environment, or test framework improvements.

Tasks

- Identify high-level categorizations of test environment and test infrastructure failures
- Create tracking issues for each area
- Create an ongoing process for management of infrastructure and framework reliability issues
  - Track existing open issues in each area to collect data and facilitate fixes
  - Create infrastructure issues for related problems
  - Create test framework improvement issues for related problems

As categorizations are identified, add a comment describing each category, along with a link to a tracking issue where related problems can be aggregated. Use these tracking issues as a data source when advocating for infrastructure or test framework improvements.
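
The aggregation described above could be sketched roughly as follows. This is only an illustration, not an existing tool: the `FailureReport` type, the category labels, and the `TRACKING-n` references are all hypothetical placeholders, and the real categories are still to be identified as part of the tasks above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureReport:
    """One observed flaky/transient failure (hypothetical record shape)."""
    test_name: str
    category: str        # placeholder label, e.g. "environment"
    tracking_issue: str  # placeholder reference to the category's tracking issue


def summarize_by_category(reports):
    """Return (category, count) pairs, most frequent category first."""
    counts = Counter(report.category for report in reports)
    return counts.most_common()


# Made-up sample data, for illustration only.
sample_reports = [
    FailureReport("login_spec", "environment", "TRACKING-1"),
    FailureReport("merge_request_spec", "infrastructure", "TRACKING-2"),
    FailureReport("pipeline_spec", "environment", "TRACKING-1"),
]

for category, count in summarize_by_category(sample_reports):
    print(f"{category}: {count}")
```

A per-category count like this is one possible form for the data used to prioritize which infrastructure or framework improvements to advocate for first.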

Edited May 24, 2022 by Zeff Morgan