Investigate how to prevent frontend and backend changes involving GraphQL from getting merged and breaking canary
In the last week, we've had 2 incidents that broke gprd-cny because merge requests containing both frontend and backend changes touching GraphQL were merged:
- https://app.incident.io/gitlab/incidents/1265 caused by gitlab-org/gitlab!191475 (merged)
- https://app.incident.io/gitlab/incidents/1276 caused by gitlab-org/gitlab!185081 (merged)
These changes break because of how the canary environment works. API requests are routed to canary backends based on their request path. However, GraphQL requests from the canary frontend cannot be routed this way, since they all go to the same path (/api/graphql). In our rolling deployments, the canary frontend receives the new code with the GraphQL schema changes, but the main-stage backends don't receive it until the main stage deployment is complete. Since only 5% of these requests are randomly routed to canary, a GraphQL request from the canary frontend has roughly a 95% chance of landing on a main-stage backend with an incompatible schema and failing.
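As a rough illustration of the arithmetic above, the Python sketch below models routing for /api/graphql as an independent weighted coin flip per request. The 5% canary weight comes from the description; the function names and the coin-flip model itself are simplifying assumptions for illustration, not the actual load balancer configuration.

```python
import random

# Assumed canary share of /api/graphql traffic, taken from the
# "5% of requests" figure in the description.
CANARY_WEIGHT = 0.05


def route_graphql_request(rng: random.Random) -> str:
    """Pick a backend stage for a request to /api/graphql.

    Hypothetical model: because every GraphQL request uses the same path,
    it cannot be routed by path, so each request is sent to canary with
    probability CANARY_WEIGHT and to the main stage otherwise.
    """
    return "canary" if rng.random() < CANARY_WEIGHT else "main"


def mismatch_rate(num_requests: int, seed: int = 0) -> float:
    """Estimate the fraction of GraphQL requests from a canary frontend
    (already on the new schema) that land on a main-stage backend
    (still on the old schema until the main stage deploy completes)."""
    rng = random.Random(seed)
    mismatches = sum(
        1 for _ in range(num_requests) if route_graphql_request(rng) == "main"
    )
    return mismatches / num_requests


if __name__ == "__main__":
    # Analytically, the mismatch probability is 1 - CANARY_WEIGHT.
    print(f"expected:  {1 - CANARY_WEIGHT:.2f}")
    print(f"simulated: {mismatch_rate(100_000):.3f}")
```

Running it prints the analytical value (0.95) followed by a simulated estimate close to it.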
We currently have a Danger bot warning that detects such changes, e.g., https://gitlab.com/project_278964_bot_b66b169fda2a3223a645094be35d5515
However, this is clearly not effective enough at preventing these breakages, especially since the message is buried among many other warnings.
This issue is to investigate what other measures we can take to prevent these changes from getting merged.
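One possible measure is to turn the warning into a hard failure. The Python sketch below takes the changed file paths as command-line arguments and exits non-zero when a merge request touches both backend GraphQL code and frontend .graphql query files. The path patterns (app/graphql/, ee/app/graphql/, and .graphql files under app/assets/javascripts/) are assumptions about the repository layout, and the script name check_graphql_split.py is hypothetical; this is not the existing Danger rule.

```python
import sys
from fnmatch import fnmatch
from typing import Iterable, List

# Hypothetical path patterns (assumed repository layout). fnmatch does not
# treat "/" specially, so "*" also matches files in nested directories.
BACKEND_GRAPHQL_PATTERNS = ("app/graphql/*", "ee/app/graphql/*")
FRONTEND_GRAPHQL_PATTERNS = (
    "app/assets/javascripts/*.graphql",
    "ee/app/assets/javascripts/*.graphql",
)


def _matches_any(path: str, patterns: Iterable[str]) -> bool:
    return any(fnmatch(path, pattern) for pattern in patterns)


def touches_both_sides(changed_files: Iterable[str]) -> bool:
    """Return True if the change set includes both backend GraphQL code
    and frontend .graphql query files."""
    files = list(changed_files)
    backend = any(_matches_any(f, BACKEND_GRAPHQL_PATTERNS) for f in files)
    frontend = any(_matches_any(f, FRONTEND_GRAPHQL_PATTERNS) for f in files)
    return backend and frontend


def main(changed_files: List[str]) -> int:
    """Return 1 (and print an error) if the MR mixes both sides, else 0."""
    if touches_both_sides(changed_files):
        print(
            "error: this MR changes both backend GraphQL code and frontend "
            "GraphQL queries; split it so the backend change deploys first.",
            file=sys.stderr,
        )
        return 1
    return 0


if __name__ == "__main__":
    # Only exit explicitly on failure; a clean run falls through with status 0.
    if main(sys.argv[1:]) != 0:
        sys.exit(1)
```

In a merge request pipeline this could be invoked as `python check_graphql_split.py $(git diff --name-only "$CI_MERGE_REQUEST_DIFF_BASE_SHA" HEAD)`. A check this coarse would also block MRs where the two halves are unrelated or the schema change is harmless, so it would likely need some kind of override; the intent is that qualifying MRs get split so the backend change reaches every stage before the frontend change that depends on it.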
This ticket was created from INC-1276 and was automatically exported by incident.io
