Identify root causes of infrastructure issues

As identified in the initial analysis in both Identify top smoke failures using Grafana Test ... (#3843 - closed) and Identify top smoke failures using delivery bloc... (#3842 - closed), a significant number of failures are due to 500(s) and 400(s) server responses.

In this issue, can we dive deeper and identify:

Is there a pattern to when they occur within a release cycle? How often do we observe a "mass failure" caused by these server responses in a release cycle?
Can we identify what causes them? For example: is a service unavailable? Which service, and why?
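
As a starting point for the first question, here is a minimal sketch of how failures could be bucketed by day to surface "mass failure" days. The record shape (timestamp plus status code), the sample data, and the threshold are all assumptions made for illustration. They do not come from the Grafana or delivery blocker data, and the real export would need to be mapped into this shape first.

```python
from collections import Counter
from datetime import datetime

# Hypothetical failure records: (ISO timestamp, HTTP status code).
# The real export format is not known here; this shape is assumed.
SAMPLE_FAILURES = [
    ("2025-04-02T10:15:00", 500),
    ("2025-04-02T10:16:00", 502),
    ("2025-04-02T10:17:00", 500),
    ("2025-04-03T09:00:00", 400),
    ("2025-04-05T14:30:00", 200),
]

# Assumed cut-off for calling a day a "mass failure"; tune against real data.
MASS_FAILURE_THRESHOLD = 3


def server_error_counts_by_day(failures):
    """Count 4xx/5xx responses per calendar day."""
    counts = Counter()
    for timestamp, status in failures:
        if 400 <= status <= 599:
            day = datetime.fromisoformat(timestamp).date().isoformat()
            counts[day] += 1
    return counts


def mass_failure_days(failures, threshold=MASS_FAILURE_THRESHOLD):
    """Return the sorted days whose error count meets or exceeds the threshold."""
    counts = server_error_counts_by_day(failures)
    return sorted(day for day, n in counts.items() if n >= threshold)
```

Comparing the flagged days against release cycle dates would then show whether the spikes cluster at a particular point in the cycle.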

The goal is to narrow down to the exact root cause(s) of these outages and come up with actionable items for the infrastructure team to resolve them. Per the findings in deployment-to-production-blocked-by-test-data, let's focus on the timeframe between April and July 2025, when we had a spike in deployment blockers.

Edited Sep 03, 2025 by Tiffany Rea