
Define a strategy to deal with Fluentd log replays caused by boot disk rebuilds

During the Gitaly Fleet Upgrade we encountered an issue where logs were replayed because of the loss of the position file. From @igorwwwwwwwwwwwwwwwwwwww on gitlab-com/runbooks!4441 (comment 899216947):

  • Fluentd's position file is stored on the root disk. Since the root disk is deleted during a rebuild, Fluentd will replay all of its logs when the machine comes up for the first time.
  • Possible fix: Delete the log disk during rebuild. This will require a bit more work during bootstrapping, but avoids the log replay issue.
  • Alternatively, change the location of the position file to be on the log disk, so that it survives a replacement of the boot disk. Making that change to the existing fleet requires some care though, as we don't want to run into a fleet-wide log replay.
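
For the second option, one way to avoid a fleet-wide replay might be to copy each node's existing position file to the log disk before pointing Fluentd at the new location. The following is only a minimal sketch of that copy step, not an agreed procedure: the function name `migrate_pos_file` is hypothetical, any paths passed to it are placeholders, and it assumes Fluentd is stopped while the copy runs so the file is not updated mid-copy.

```python
import os
import shutil
import tempfile


def migrate_pos_file(old_path: str, new_path: str) -> bool:
    """Copy a position file to new_path unless one already exists there.

    Hypothetical helper for illustration. Returns True if a copy was made,
    False if nothing was done.
    """
    if os.path.exists(new_path):
        # A position file already lives at the new location; keep it.
        return False
    if not os.path.exists(old_path):
        # Nothing to migrate (for example, a freshly rebuilt boot disk).
        return False

    dest_dir = os.path.dirname(new_path) or "."
    os.makedirs(dest_dir, exist_ok=True)

    # Copy to a temporary file in the destination directory and then rename,
    # so an interrupted copy never leaves a truncated position file behind.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".pos-")
    os.close(fd)
    try:
        shutil.copy2(old_path, tmp_path)
        os.replace(tmp_path, new_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return True
```

The helper is idempotent, so it could run on every configuration pass: it does nothing once the new file exists, and it does nothing on a rebuilt node where the old file is already gone. The Fluentd configuration change itself (pointing the position file setting at the log disk) would still need to be rolled out separately and only after the copy has succeeded.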
Edited by Alejandro Rodríguez