Pipeline Security FCL Plan for Incident 8619

Overview

The Pipeline Security team will participate in a Feature Change Lock (FCL) for incident 8619, during which the team will focus on reliability work for 5 business days. Beyond any corrective actions related to the incident, this is a good opportunity to strengthen the reliability of the product categories the team owns (for example, improved monitoring or added test coverage).

Here are a couple of videos that @sgoldstein recorded about a year ago on blameless culture and our FCL process, which may be useful for folks contributing to this FCL.

Timeline

Per the timeline guidance documented in the handbook (note that all days are business days):

Day 0 (today): Incident

Days 1-2: confirmation that an FCL is required for this incident and start planning.

Days 3-4: planning time

Days 5-9 (1 week): complete planned work

Days 10-11: closing ceremony, retrospective and report back to standup
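The business-day offsets above can be turned into calendar dates with a small helper. The following is a minimal sketch, not part of the handbook process: the function names are hypothetical, holidays are ignored, and the Day 0 date is an assumption chosen only so that Days 5-9 line up with the FCL period stated in this plan (the actual incident date is not given here).

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days (Mon-Fri) after `start`.

    Holidays are ignored in this sketch.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday-Friday
            remaining -= 1
    return current


# Phases from the timeline, as (label, first day offset, last day offset).
PHASES = [
    ("Incident", 0, 0),
    ("Confirm FCL is required and start planning", 1, 2),
    ("Planning time", 3, 4),
    ("Complete planned work", 5, 9),
    ("Closing ceremony, retrospective, report back", 10, 11),
]


def fcl_schedule(day_zero: date):
    """Map each phase to its (label, start date, end date)."""
    return [
        (label, add_business_days(day_zero, first), add_business_days(day_zero, last))
        for label, first, last in PHASES
    ]


if __name__ == "__main__":
    # Assumed Day 0 (hypothetical, see the note above).
    for label, start, end in fcl_schedule(date(2023, 3, 27)):
        print(f"{start} to {end}: {label}")
```

With the assumed Day 0, the "Complete planned work" phase spans 2023-04-03 to 2023-04-07. A real schedule would also need to account for regional holidays.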

Let's brainstorm some reliability issues we could focus on for the FCL. I will also create a separate incident review placeholder issue where we can discuss the incident and identify any corrective actions. From there, @morefice will work with the Pipeline Security engineers to come up with an FCL plan, which is a list of issues the team will work on.

Incident Review

gitlab-com/gl-infra/production#8621 (closed) - WIP

FCL time period: 2023-04-03 to 2023-04-07

FCL - Planned Work

| Category | Description | Issue | DRI | Progress |
| --- | --- | --- | --- | --- |
| Process | Automate feature flag review with Danger | #425 (closed), gitlab!116563 (merged) | @iamricecake | 100% - MR is merged |
| Process | Enforce code review by a domain expert in Verify for everything related to artifacts | #426 (closed) | @mfluharty | 100% (gitlab!116607 (merged) and gitlab!116877 (merged)) |
| Monitoring | Improve patroni runbook about replication lag | #427 (closed), gitlab-com/runbooks!5645 (merged) | @iamricecake | 100% - under maintainer review |
| Testing | Remove redundant E2E spec | gitlab-org/quality/quality-engineering/team-tasks#1707 (closed) | @mgandres | 100% |
| Testing | Remove instances of asserting with wrapper.vm in specs for ci_variables_list | gitlab#396827 (closed), gitlab!116589 (merged) | @mgandres | 100% |
| Performance | Drop unused index | gitlab#393913 (closed) | @alberts-gitlab | 100% - async index removal has been completed in production. Sync index removal has been merged |

Nice to have - if time permits

| Category | Description | Issue | DRI | Progress |
| --- | --- | --- | --- | --- |
| Performance | Query timeout in unlock ci_job_artifacts | gitlab#379089 (closed) | @alberts-gitlab | 10% - Spike MR opened |
| Monitoring | Track dead tuples for tables we own and alert if they increase too fast | #429 (closed) | @dbiryukov | closed |
| Testing | Investigate if we can improve testing (end to end?) to anticipate cascade usage on the DB | #430 | @dbiryukov | ? |
| Testing | Create simulated activity with positive and negative scenarios in staging | #428 (closed) | @alberts-gitlab | ? |
| Performance | Query timeout on Ci::PipelineArtifacts::ExpireArtifactsWorker | gitlab#404375 (closed) | ? | ? |
Edited by Max Orefice