2024-04-18: Record all running builds instead of only ones running in shared runners

Production Change

Change Summary

This change relates to the deployment of gitlab-org/gitlab#452166 (closed). It will cause all running builds to be logged to the ci_running_builds table. Currently, only builds running on instance (shared) runners are logged, so the number of records inserted is expected to grow by a factor of 4-5.

Looking at the production database, we're creating 1.1M records on a typical day. With this change, that will increase to almost 5M.
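The estimate above can be sanity-checked with a quick calculation. This is only a sketch of the arithmetic: the 1.1M baseline and the 4-5x factor come from this issue, while the function name is made up for illustration.

```python
def projected_daily_inserts(current_daily: int, growth_factor: int) -> int:
    """Scale the current daily insert count by the expected growth factor."""
    return current_daily * growth_factor


# Baseline from this issue: ~1.1M ci_running_builds records on a typical day.
current = 1_100_000

# Expected growth factor from this issue: 4x to 5x.
low = projected_daily_inserts(current, 4)
high = projected_daily_inserts(current, 5)

print(f"{low:,} to {high:,} records/day")
```

The resulting range (4.4M to 5.5M) brackets the "almost 5M" figure quoted above.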

Source - ClickHouse
SELECT toStartOfInterval(started_at, INTERVAL 1 DAY) AS day, runner_type, COUNT(*)
FROM "ci_finished_builds" FINAL
WHERE ci_finished_builds.started_at >= '2024-04-03' AND ci_finished_builds.started_at < '2024-04-10'
GROUP BY day, runner_type
ORDER BY day DESC, runner_type;
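One way to turn the rows returned by this query into a growth factor is to divide each day's total by the count for instance runners. The sketch below makes two assumptions: that runner_type 1 denotes instance (shared) runners, and that rows arrive as (day, runner_type, count) tuples. The sample row is invented for illustration and is not production data.

```python
from collections import defaultdict

# Assumption: runner_type 1 = instance (shared) runners.
INSTANCE_TYPE = 1


def growth_factor_by_day(rows):
    """Return {day: total builds / instance-runner builds} for each day."""
    totals = defaultdict(int)
    instance = defaultdict(int)
    for day, runner_type, count in rows:
        totals[day] += count
        if runner_type == INSTANCE_TYPE:
            instance[day] += count
    # Skip days with no instance-runner builds to avoid dividing by zero.
    return {day: totals[day] / instance[day] for day in totals if instance[day]}


# Invented sample rows in the shape (day, runner_type, count).
sample_rows = [
    ("2024-04-09", 1, 1_000),
    ("2024-04-09", 2, 1_500),
    ("2024-04-09", 3, 2_000),
]

factors = growth_factor_by_day(sample_rows)
print(factors)
```

Since finished builds are used here as a proxy for running builds, the ratio is an approximation of the expected increase in ci_running_builds inserts.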


Change Details

  1. Services Impacted - Postgres
  2. Change Technician - @pedropombeiro
  3. Change Reviewer - @mayra-cabrera
  4. Time tracking - unknown
  5. Downtime Component - none

Set Maintenance Mode in GitLab

If your change involves scheduled maintenance, add a step to set and unset maintenance mode per our runbooks. This will make sure SLA calculations adjust for the maintenance period.

Detailed steps for the change

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 360 minutes

  • Request EOC approval
  • Request DBRE approval
  • Request Infra manager approval
  • Notify the EOC
  • Notify release-managers
  • Set label change::in-progress: /label ~change::in-progress
  • Enable add_all_ci_running_builds/remove_all_ci_running_builds project-scoped FFs in staging.
    • Enable on gitlab-org/gitlab project: first /chatops run feature set remove_all_ci_running_builds true --project gitlab-org/gitlab --staging --staging-ref, then /chatops run feature set add_all_ci_running_builds true --project gitlab-org/gitlab --staging --staging-ref.
    • Enable remove_all_ci_running_builds globally (/chatops run feature set remove_all_ci_running_builds true --dev --pre --staging --staging-ref).
    • Enable on 10% of all projects /chatops run feature set add_all_ci_running_builds 10 --actors --dev --pre --staging --staging-ref.
    • Enable on 25% of all projects /chatops run feature set add_all_ci_running_builds 25 --actors --dev --pre --staging --staging-ref.
    • Enable on 50% of all projects /chatops run feature set add_all_ci_running_builds 50 --actors --dev --pre --staging --staging-ref.
    • Enable on 75% of all projects /chatops run feature set add_all_ci_running_builds 75 --actors --dev --pre --staging --staging-ref.
    • Enable globally /chatops run feature set add_all_ci_running_builds true --dev --pre --staging --staging-ref.
  • Wait at least 12 hours before starting production rollout
  • Enable add_all_ci_running_builds/remove_all_ci_running_builds FFs in production.
    • Enable on gitlab-org/gitlab project: first /chatops run feature set remove_all_ci_running_builds true --project gitlab-org/gitlab, then /chatops run feature set add_all_ci_running_builds true --project gitlab-org/gitlab.
    • Enable remove_all_ci_running_builds globally (/chatops run feature set remove_all_ci_running_builds true).
    • Enable on 10% of all projects /chatops run feature set add_all_ci_running_builds 10 --actors.
    • Enable on 25% of all projects /chatops run feature set add_all_ci_running_builds 25 --actors.
    • Enable on 50% of all projects /chatops run feature set add_all_ci_running_builds 50 --actors.
    • Enable on 75% of all projects /chatops run feature set add_all_ci_running_builds 75 --actors.
  • Enable add_all_ci_running_builds globally in production (remove_all_ci_running_builds is already enabled globally in the earlier step): /chatops run feature set add_all_ci_running_builds true.
  • Set label change::complete: /label ~change::complete
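The percentage ramp in the steps above follows a regular pattern, so the command strings can be generated instead of typed by hand. The following is a minimal sketch: the helper name and its parameters are hypothetical, and it only builds text to paste into Slack. It does not call ChatOps itself.

```python
# Environment flags used for the staging rollout in this issue.
STAGING_FLAGS = ("--dev", "--pre", "--staging", "--staging-ref")


def rollout_commands(flag, percentages=(10, 25, 50, 75), env_flags=()):
    """Build the staged ChatOps commands for a percentage-of-actors rollout.

    Hypothetical helper: returns one command per percentage step,
    followed by the final command that enables the flag globally.
    """
    suffix = "".join(f" {env_flag}" for env_flag in env_flags)
    commands = [
        f"/chatops run feature set {flag} {pct} --actors{suffix}"
        for pct in percentages
    ]
    commands.append(f"/chatops run feature set {flag} true{suffix}")
    return commands


staging = rollout_commands("add_all_ci_running_builds", env_flags=STAGING_FLAGS)
production = rollout_commands("add_all_ci_running_builds")

for command in staging + production:
    print(command)
```

Generating the strings from a single list of percentages keeps the staging and production sequences identical apart from the environment flags.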

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - (5 minutes to disable add_all_ci_running_builds FF, 240 minutes to disable remove_all_ci_running_builds FF)

Monitoring

Key metrics to observe

Change Reviewer checklist

C4 C3 C2 C1:

  • Check if the following applies:
    • The scheduled day and time of execution of the change is appropriate.
    • The change plan is technically accurate.
    • The change plan includes estimated timing values based on previous testing.
    • The change plan includes a viable rollback plan.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.

C2 C1:

  • Check if the following applies:
    • The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
    • The change plan includes success measures for all steps/milestones during the execution.
    • The change adequately minimizes risk within the environment/service.
    • The performance implications of executing the change are well-understood and documented.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.
      • If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
    • The change has a primary and secondary SRE with knowledge of the details available during the change window.
    • The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
    • The labels blocks deployments and/or blocks feature-flags are applied as necessary.

Change Technician checklist

  • Check if all items below are complete:
    • The change plan is technically accurate.
    • This Change Issue is linked to the appropriate Issue and/or Epic
    • Change has been tested in staging and results noted in a comment on this issue.
    • A dry-run has been conducted and results noted in a comment on this issue.
    • The change execution window respects the Production Change Lock periods.
    • For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
    • For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
    • For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
    • For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
    • Release managers have been informed prior to any C1, C2, or blocks deployments change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
    • There are currently no active incidents that are severity::1 or severity::2
    • If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.
Edited by Pedro Pombeiro