Investigate how to prevent frontend and backend changes involving GraphQL from getting merged and breaking canary
In the last week, two incidents broke gprd-cny after merge requests containing both frontend and backend changes touching GraphQL were merged:
- https://app.incident.io/gitlab/incidents/1265 caused by !191475 (merged)
- https://app.incident.io/gitlab/incidents/1276 caused by !185081 (merged)
These types of updates break because of how the canary environment works: API requests are routed to canary backends based on a specific set of paths. However, GraphQL requests from the canary frontend cannot be routed this way, since they all go to the same path (/api/graphql). In our rolling deployments, the canary frontend receives a new version of the code with GraphQL schema changes, but the main backends won't receive it until the main stage deployment is completed. Since only 5% of requests are randomly routed to canary, there is roughly a 95% chance that a GraphQL request from the canary frontend is served by a main-stage backend that does not yet have the new schema, and so fails because the frontend and backend schemas are incompatible.
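To make the arithmetic above concrete, here is a minimal sketch. It assumes requests are routed independently and uniformly at random, and the function name is hypothetical rather than anything in the GitLab codebase.

```python
def mismatch_probability(canary_share: float) -> float:
    """Chance that a GraphQL request sent by a canary frontend is served
    by a main-stage backend that still has the old schema.

    Assumption: each request is routed to canary independently with
    probability `canary_share`, regardless of which frontend sent it.
    """
    if not 0.0 <= canary_share <= 1.0:
        raise ValueError("canary_share must be between 0 and 1")
    return 1.0 - canary_share


# With 5% of traffic going to canary, as described in this issue:
print(f"{mismatch_probability(0.05):.0%}")  # prints "95%"
```

The point of the sketch is that the failure rate scales with the main stage's share of traffic, so a small canary makes schema mismatches the common case rather than the exception.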
We currently have a dangerbot warning that detects such changes, e.g. https://gitlab.com/project_278964_bot_b66b169fda2a3223a645094be35d5515
However, this is clearly not effective enough at preventing these types of breakages. It doesn't help that the message is buried under a bunch of other warnings.
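For illustration, the kind of detection such a warning relies on could be sketched roughly as below. This is not the actual Danger rule: the path prefixes, the .graphql suffix convention, and all function names are assumptions made for the example.

```python
from typing import Iterable

# Assumed locations; the real rule may use different or additional paths.
BACKEND_GRAPHQL_PREFIXES = ("app/graphql/", "ee/app/graphql/")
FRONTEND_PREFIXES = ("app/assets/javascripts/", "ee/app/assets/javascripts/")
FRONTEND_GRAPHQL_SUFFIX = ".graphql"


def is_backend_graphql(path: str) -> bool:
    """True if the file lives under an assumed backend GraphQL directory."""
    return path.startswith(BACKEND_GRAPHQL_PREFIXES)


def is_frontend_graphql(path: str) -> bool:
    """True if the file is a GraphQL document under an assumed frontend directory."""
    return path.startswith(FRONTEND_PREFIXES) and path.endswith(FRONTEND_GRAPHQL_SUFFIX)


def is_risky_change(changed_files: Iterable[str]) -> bool:
    """Flag a change set that touches both frontend and backend GraphQL files."""
    files = list(changed_files)
    return any(is_backend_graphql(f) for f in files) and any(
        is_frontend_graphql(f) for f in files
    )
```

A check like this could run on the list of files changed in a merge request. Whether the result should produce a warning, a required approval, or a failing job is exactly the open question of this issue.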
This issue is to investigate what other measures we can take to prevent these changes from getting merged.
This ticket was created from INC-1276 and was automatically exported by incident.io
