Recover LFS and Repository objects stored in the local filesystem of the git fleet

PLANNING THE CHANGE

  • Context:

Consistency scans have shown that some LFS objects are missing from the LFS NFS drive.

In order to fix this, we will need to unmount the LFS partition on each of the git front-end servers, move the local LFS folder to a different location, remount the partition, and then resync the locally available objects back onto the LFS drive.

We can perform this operation one host at a time, removing each host from the LB so it gets no traffic (simply by stopping all the services on it).

We will need to perform this operation on all the git-related NFS drives on all the hosts, which means that we may need to script it.

  • Downtime: Will the change introduce downtime, and if so, how much?

No downtime is expected.

  • People:

    • Someone from the production engineering team.
    • Someone with LFS knowledge.
  • Pre-checks: What should we check before starting with the change? Consider dashboards, metrics, limits of current infrastructure, etc.

    • N/A
  • Change Procedure (a scripted per-host sketch follows this planning list):

    • On each ssh host
      • Stop all the services
      • Wait for connections to drain
      • Check that no new ssh connections are coming in
      • For each drive (LFS and nfs-git-data 1-8):
        • Unmount partition
        • Move the folder to a temporary location
        • Remount the partition
      • Sync files from the moved local folder back into the remounted partition (only when the local file is newer)
      • Package and back up the local folders
      • Restart all the services
      • Check that traffic is going through.
  • Preparatory Steps: What can be done ahead of time? How far ahead?

    • Create a disk snapshot of the LFS server so that, if things go wrong, we can recover from it.
  • Post-checks: What should we check after the change has been applied?

    • Run the consistency scan again; we should have 0 lost objects.
    • Should any alerts be modified as a consequence of this change?
      • No
  • Rollback procedure: In case things go wrong, what do we need to do to recover?

    • Recover the snapshot created in the preparatory step.
  • Create an invite using a 4 hr block of time on the "GitLab Production" calendar (link in handbook), inviting the ops-contact group. Include a link to the issue. (Many times you will not expect to need - or actually need - all 4 hrs, but past experience has shown that delays and unexpected events are more likely than having things go faster than expected.)

  • Ping the Production Lead in this issue to coordinate who should be present from the Production team, and to confirm scheduling.

  • When will this occur? leave blank until scheduled

  • Communication plan:

    • Tweet: default to tweeting when schedule is known, then again 12 hrs before, 1 hr before, when starting, during if there are delays, and after when complete.
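
A minimal sketch of what the per-host procedure above could look like once scripted. Everything named here is an assumption for illustration only: the service names, mount and stash paths, drive names, and the use of systemd, rsync and tar are placeholders for whatever the fleet actually runs, not the real configuration. It assumes the host has already been pulled from the LB and is being worked on alone.

```python
#!/usr/bin/env python3
"""Per-host recovery sketch: unmount each git NFS drive, set aside the local
data written underneath the mount point, remount, and resync it back.
All paths, service names and drive names are placeholders (assumptions)."""
import subprocess
import time
from pathlib import Path

# Placeholder names -- adjust to the real fleet layout.
SERVICES = ["gitlab-workhorse", "gitlab-shell-sshd", "nginx"]
DRIVES = ["lfs-objects"] + [f"nfs-git-data-{i}" for i in range(1, 9)]
MOUNT_ROOT = Path("/var/opt/gitlab-data")          # where the NFS drives are mounted
STASH_ROOT = Path("/var/opt/gitlab-local-stash")   # temporary home for the local copies
BACKUP_DIR = Path("/var/backups/git-local-data")

def run(*cmd: str) -> None:
    """Echo a command, run it, and fail loudly on error."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def drain(seconds: int = 120) -> None:
    """Crude drain: wait, then list TCP connections so the operator can
    confirm no ssh/git traffic remains before touching the drives."""
    time.sleep(seconds)
    subprocess.run(["ss", "-tn"])

def main() -> None:
    for svc in SERVICES:
        run("systemctl", "stop", svc)
    drain()

    STASH_ROOT.mkdir(parents=True, exist_ok=True)
    BACKUP_DIR.mkdir(parents=True, exist_ok=True)

    for drive in DRIVES:
        mount_point = MOUNT_ROOT / drive
        stash = STASH_ROOT / drive
        # 1. Unmount the NFS partition so the local data underneath is visible.
        run("umount", str(mount_point))
        # 2. Set the local copy aside and recreate an empty mount point.
        run("mv", str(mount_point), str(stash))
        mount_point.mkdir(parents=True)
        # 3. Remount the NFS partition (assumes an fstab entry for the path).
        run("mount", str(mount_point))
        # 4. Resync local objects back, only when the local file is newer.
        run("rsync", "-a", "--update", f"{stash}/", f"{mount_point}/")
        # 5. Package and back up the local copy before anything is discarded.
        run("tar", "-czf", str(BACKUP_DIR / f"{drive}.tar.gz"),
            "-C", str(STASH_ROOT), drive)

    for svc in SERVICES:
        run("systemctl", "start", svc)

if __name__ == "__main__":
    main()
```

The rsync --update flag is what implements the "only when the local file is newer" rule from the plan, and the tar step keeps a packaged copy of the local folders as a backup before the stash is cleaned up.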

DOING THE CHANGE

Preparatory steps

  • Copy/paste items here from the Preparatory Steps listed above.

Initial Tasks

  • Create a google doc to track the progress. This is because in the event of an outage, Google docs allow for real-time collaboration, and don't depend on GitLab.com being available.
    • Add a link to the originating issue, and copy and paste the content of the issue: the description and the steps to follow.
    • Title the steps section as "timeline". Use UTC times (no daylight saving) so that we are all working in the same timezone.
    • Link the document in the on-call log so it's easy to find later.
    • Right before starting the change, paste the link to the google doc in the #production chat channel and "pin" it.
  • Discuss the plan with the person who is introducing the change, and go through it together to fill any gaps in understanding before starting.
  • Final check of the rollback plan and communication plan.
  • Set a PagerDuty maintenance window before starting the change (a hedged API sketch follows this list).
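
The maintenance window can also be created via the PagerDuty REST API instead of the web UI. A sketch follows; the API token, requester email and service IDs are placeholders, and the 4-hour window simply mirrors the calendar block from the plan.

```python
import datetime
import requests

# Placeholders (assumptions): token, requester email and service IDs.
PAGERDUTY_TOKEN = "REDACTED"
SERVICE_IDS = ["PABC123"]          # PagerDuty services covered by the change
WINDOW_HOURS = 4                   # matches the 4 hr calendar block

start = datetime.datetime.now(datetime.timezone.utc)
end = start + datetime.timedelta(hours=WINDOW_HOURS)

resp = requests.post(
    "https://api.pagerduty.com/maintenance_windows",
    headers={
        "Authorization": f"Token token={PAGERDUTY_TOKEN}",
        "Accept": "application/vnd.pagerduty+json;version=2",
        "Content-Type": "application/json",
        "From": "oncall@example.com",   # placeholder requester address
    },
    json={
        "maintenance_window": {
            "type": "maintenance_window",
            "start_time": start.isoformat(),
            "end_time": end.isoformat(),
            "description": "LFS/repository local object recovery (see issue)",
            "services": [{"id": sid, "type": "service_reference"} for sid in SERVICE_IDS],
        }
    },
    timeout=30,
)
resp.raise_for_status()
print("Maintenance window created:", resp.json()["maintenance_window"]["id"])
```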

The Change

  • Start running the change. While this happens, one person makes the change and the other takes notes of when each step happens. Make it explicit who will do what.
  • When the change is finished, whether successful or not, copy the content of the document back into the issue and deprecate the doc (and close the issue if possible).
  • Retrospective: answer the following three questions:
    • What went well?
    • What should be improved?
    • Specific action items / recommendations.
  • If the change caused an outage or service degradation, label the issue as "outage".