Pipeline Security FCL Plan for Incident 8619
Overview
The Pipeline Security team will participate in a Feature Change Lock (FCL) for incident 8619, focusing on reliability work for 5 business days. Beyond any corrective actions related to the incident, this is a good opportunity to strengthen the reliability of any product categories the team owns (for example, improved monitoring or added test coverage).
Here are a couple of videos that @sgoldstein recorded about a year ago on blameless culture and our FCL process, which may be useful for folks contributing to this FCL:
- Blameless Culture & Reliability at GitLab 3min 45sec
- Feature Change Locks - What, Why, and How 4min 38sec
Timeline
Per the timeline guidance documented in the handbook (note that these are business days):
Day 0 (today): Incident
Days 1-2: confirmation that an FCL is required for this incident and start planning.
Days 3-4: planning time
Days 5-9 (1 week): complete planned work
Days 10-11: closing ceremony, retrospective and report back to standup
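To turn the day offsets above into calendar dates, here is a minimal sketch in Python. It assumes a Monday to Friday work week and ignores public holidays. The Day 0 date is hypothetical: it was chosen by counting back so that Days 5-9 line up with the FCL time period stated in this plan, and it is not taken from the incident record. The function name and phase labels are illustrative only.

```python
from datetime import date, timedelta


def add_business_days(start: date, n: int) -> date:
    """Return the date n business days (Mon-Fri) after start.

    Public holidays are ignored (assumption).
    """
    current = start
    remaining = n
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


# Hypothetical Day 0, picked so that Days 5-9 match 2023-04-03 to 2023-04-07.
DAY_ZERO = date(2023, 3, 27)

# (first day offset, last day offset) for each phase of the timeline.
PHASES = {
    "confirm FCL and start planning": (1, 2),
    "planning time": (3, 4),
    "complete planned work": (5, 9),
    "closing ceremony, retrospective, report back": (10, 11),
}

if __name__ == "__main__":
    for name, (first, last) in PHASES.items():
        begin = add_business_days(DAY_ZERO, first)
        end = add_business_days(DAY_ZERO, last)
        print(f"Days {first}-{last}: {begin} to {end} ({name})")
```

Because holidays are not modelled, the later phases may land on non-working days in some regions and should be adjusted by hand.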
Let's brainstorm some reliability issues we could focus on for the FCL. I will also create a separate incident review placeholder issue where we can discuss the incident and identify any corrective actions. From there, @morefice will work with the Pipeline Security engineers to come up with an FCL plan, which is a list of issues the team will work on.
Incident Review
gitlab-com/gl-infra/production#8621 (closed) - WIP
FCL time period (2023-04-03 | 2023-04-07)
FCL - Planned Work
Category | Description | Issue | DRI | Progress |
---|---|---|---|---|
Process | Automate feature flag review with Danger | #425 (closed) gitlab!116563 (merged) | @iamricecake | 100% - MR is merged |
Process | Enforce code review by a domain expert in Verify for everything related to artifacts | #426 (closed) | @mfluharty | 100% (gitlab!116607 (merged) and gitlab!116877 (merged)) |
Monitoring | Improve patroni runbook about replication lag. | #427 (closed) gitlab-com/runbooks!5645 (merged) | @iamricecake | 100% - under maintainer review |
Testing | Remove redundant E2E spec | gitlab-org/quality/quality-engineering/team-tasks#1707 (closed) | @mgandres | 100% |
Testing | Remove instances of asserting with wrapper.vm in specs for ci_variables_list | gitlab#396827 (closed) gitlab!116589 (merged) | @mgandres | 100% |
Performance | Drop unused index | gitlab#393913 (closed) | @alberts-gitlab | 100% - async index removal has been completed in production. Sync index removal has been merged |
Nice to have - if time permits
Category | Description | Issue | DRI | Progress |
---|---|---|---|---|
Performance | Query timeout in unlock ci_job_artifacts | gitlab#379089 (closed) | @alberts-gitlab | 10% Spike MR opened |
Monitoring | Track dead tuples for tables we own and alert if they increase too fast | #429 (closed) | @dbiryukov | closed |
Testing | Investigate if we can improve testing (end to end?) to anticipate cascade usage on the DB | #430 | @dbiryukov | ? |
Testing | Create simulated activity with positive and negative scenarios in staging | #428 | @alberts-gitlab | ? |
Performance | Query timeout on Ci::PipelineArtifacts::ExpireArtifactsWorker | gitlab#404375 | ? | ? |