Fix percentage of time rollouts for routing tables switch

Marius Bobin requested to merge 377534-switch-percentage-of-time into master

What does this MR do and why?

While building a query for the routing table switch, we check the feature flag several times. With a percentage of time rollout, each check can return a different value, so a single query can end up with mixed table names:

SELECT "ci_builds_metadata".* FROM "ci_builds_metadata" WHERE "p_ci_builds_metadata"."build_id" = $1 LIMIT $2

-- or

SELECT "p_ci_builds_metadata".* FROM "p_ci_builds_metadata" WHERE "ci_builds_metadata"."build_id" = $1 LIMIT $2

SELECT "ci_builds_metadata".* FROM "ci_builds_metadata" WHERE "p_ci_builds_metadata"."build_id" IN ($1, $2)

-- or

SELECT "p_ci_builds_metadata".* FROM "p_ci_builds_metadata" WHERE "ci_builds_metadata"."build_id" IN ($1, $2)
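
The mixed queries above can be reproduced with a minimal, self-contained Ruby sketch. It is not GitLab's code: `flag_enabled?`, `table_name`, and `build_query` are hypothetical names, and the percentage of time flag is modeled as an independent random draw on every check, which is an assumption about how such a rollout behaves. Under that assumption, two checks at a 10% rollout disagree with probability 2 × 0.10 × 0.90 = 18%.

```ruby
# Hypothetical stand-in for a percentage-of-time feature flag:
# every call is an independent draw, so two calls can disagree.
def flag_enabled?(rng, percentage)
  rng.rand(100) < percentage
end

# Picks the table name from a fresh flag check (this is the problematic part).
def table_name(rng, percentage)
  flag_enabled?(rng, percentage) ? 'p_ci_builds_metadata' : 'ci_builds_metadata'
end

# Builds one query with two separate flag checks:
# one for the FROM clause and one for the WHERE clause.
def build_query(rng, percentage)
  from_table = table_name(rng, percentage)
  where_table = table_name(rng, percentage)
  sql = %(SELECT "#{from_table}".* FROM "#{from_table}" ) +
        %(WHERE "#{where_table}"."build_id" = $1 LIMIT $2)
  { sql: sql, mixed: from_table != where_table }
end

rng = Random.new(42) # seeded so that runs are repeatable
queries = Array.new(1_000) { build_query(rng, 10) }
mixed = queries.count { |query| query[:mixed] }

puts "#{mixed} of #{queries.size} queries mix table names"
```
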

This fix caches the result of the flag check for the duration of the request, so every check returns the same value and the query uses a single table name.

Enabling the flag at 100% was not safe either: the L1 process cache could expire in the middle of a request, and the next check would return a different value.
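
The idea of caching the flag check per request can be sketched as follows. This is an illustration under stated assumptions, not the code in this MR: `PerRequestCache`, `FlippingFlag`, and `routing_table_name` are hypothetical names, and the cache is cleared by hand where a real application would reset it at request boundaries.

```ruby
# Hypothetical per-request cache; a real application would reset it
# at the start of every request.
class PerRequestCache
  def initialize
    @values = {}
  end

  # Returns the cached value for key, running the block only on first access.
  # `key?` is used instead of `||=` so that a cached `false` is not recomputed.
  def fetch(key)
    return @values[key] if @values.key?(key)

    @values[key] = yield
  end

  def clear
    @values.clear
  end
end

# Hypothetical flag whose answer flips on every check,
# the worst case for building a consistent query.
class FlippingFlag
  attr_reader :checks

  def initialize
    @checks = 0
  end

  def enabled?
    @checks += 1
    @checks.odd?
  end
end

# Resolves the table name from the cached flag value.
def routing_table_name(cache, flag)
  enabled = cache.fetch(:use_routing_table) { flag.enabled? }
  enabled ? 'p_ci_builds_metadata' : 'ci_builds_metadata'
end

cache = PerRequestCache.new
flag = FlippingFlag.new

first_request = Array.new(3) { routing_table_name(cache, flag) }
cache.clear # simulates the start of the next request
second_request = Array.new(3) { routing_table_name(cache, flag) }

puts first_request.uniq.inspect
puts second_request.uniq.inspect
```

Within each simulated request the flag is evaluated once, so all table names in that request agree even though the flag itself would answer differently on the next check.
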

How to set up and validate locally

  1. Enable the flag for 10% of the time:
Feature.enable_percentage_of_time :ci_partitioning_use_ci_builds_metadata_routing_table, 10
  2. Create a project whose pipelines each run many jobs, for example with this CI configuration:
test:
  image: busybox:latest
  variables:
    GIT_STRATEGY: none
  script:
    - echo "Do your test here"
  parallel: 25
  3. On master, some jobs fail with structural integrity errors when they are assigned to a runner, and the log page of the jobs that do run sometimes returns 500 errors.
  4. On this branch, everything works as expected.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #377534 (closed)

Edited by Marius Bobin
