Increase max_wal_size to 5GB for PostgreSQL

This setting is currently set to 2GB, which is rather low. @sfrost recommends setting it to the amount of data written in roughly 5 minutes. Since I'm not exactly sure what that value is (it might be much more), I suggest starting with 5 GB (2.5× the current value) and monitoring from there. The setting can be applied by simply sending a SIGHUP to PostgreSQL. For this to work cleanly, however, we need to:

  1. Edit postgresql.conf manually
  2. SIGHUP PostgreSQL
  3. Apply the changes to Chef

chef-client is already disabled, but doing things in this order ensures no hard restarts will take place.
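Concretely, the manual edit in step 1 amounts to a one-line change in postgresql.conf (a sketch; the file path and surrounding settings depend on the installation):

```ini
# postgresql.conf
# max_wal_size is a reload-level parameter: a SIGHUP (or SELECT pg_reload_conf();
# from psql) picks it up without restarting the server.
max_wal_size = 5GB    # was 2GB
```

After editing, `pg_ctl reload` or a SIGHUP to the postmaster applies the change, and `SHOW max_wal_size;` in psql confirms it took effect.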

cc @ernstvn @pcarranza: I'd like to try this out tomorrow, for example; no downtime whatsoever is expected.

Checklist

PLANNING THE CHANGE

For more background on when this template should be used, see the infrastructure handbook.

  • Context: see issue body above
  • Downtime: no downtime needed
  • People:
    • Yorick: this change is trivial enough that it doesn't require help from production engineers
  • Pre-checks: What should we check before starting with the change? Consider dashboards, metrics, limits of current infrastructure, etc.
    • Does the change alter how we use Azure or how many of Azure's resources we use?: No
    • The amount of WAL written per 5 minutes needs to be checked, see below.
    • Check that you have all the correct versions of the required software installed in the affected hosts.
    • Check that you have the right access level to the required resources.
  • Change Procedure:
    • List the steps that are needed for the change; be as granular as possible: see below
    • Did you do a dry run to test / measure performance and timings?: no dry run needed; it's a configuration change that takes effect on reload
  • Preparatory Steps: What can be done ahead of time? How far ahead?
    • Once it's determined what these are, add them to the "Preparatory Steps" section below under "Doing the Change".
  • Post-checks: What should we check after the change has been applied?
    • Monitoring of checkpoint intervals to see if any change is noticeable
    • See the "Doing the Change" section
  • Rollback procedure: In case things go wrong, what do we need to do to recover?
    • We undo the configuration change
  • Create an invite using a 4 hr block of time on the "GitLab Production" calendar (link in handbook), inviting the ops-contact group. Include a link to the issue. (Many times you will not expect to need - or actually need - all 4 hrs, but past experience has shown that delays and unexpected events are more likely than having things go faster than expected.)
  • Ping the Production Lead in this issue to coordinate who should be present from the Production team, and to confirm scheduling.
  • When will this occur?
    • 12:00 UTC
  • Communication plan:
    • Tweet before starting
    • Tweet when done

DOING THE CHANGE

  • Schedule in the production calendar starting at 14:00 Amsterdam time / 12:00 UTC
  • Figure out (if possible) how much data we roughly write per minute on the primary
    • 505 MB per 5 minutes, based on the output of pg_current_xlog_insert_location and pg_xlog_location_diff, yielding roughly 100 MB per minute
  • Make sure chef-client is stopped on db1.cluster.gitlab.com
  • Make sure chef-client is stopped on db3.cluster.gitlab.com
  • Make sure chef-client is stopped on db4.cluster.gitlab.com
  • Apply the changes by editing postgresql.conf manually
    • db1.cluster.gitlab.com
    • db3.cluster.gitlab.com
    • db4.cluster.gitlab.com
  • Apply the changes using chef-repo, submit an MR (can't push directly to master any more), self merge it: https://dev.gitlab.org/cookbooks/chef-repo/merge_requests/713
  • Monitor for the next hour or so
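The WAL rate figure above can be reproduced by diffing two xlog locations taken a known interval apart. The real numbers come from `pg_current_xlog_insert_location()` and `pg_xlog_location_diff()` on the primary; as a minimal sketch of the same arithmetic, using made-up LSN values:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert an xlog location like '6E2/4B012345' into an absolute byte
    position. The part before the slash holds the upper 32 bits, the part
    after it the lower 32 bits."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_mb_per_minute(start_lsn: str, end_lsn: str, minutes: float) -> float:
    """Bytes written between two locations, expressed as MB per minute.
    This mirrors what pg_xlog_location_diff() computes server-side."""
    diff_bytes = lsn_to_bytes(end_lsn) - lsn_to_bytes(start_lsn)
    return diff_bytes / (1024 * 1024) / minutes


# Hypothetical sample: 100 MB of WAL written over one minute.
print(wal_mb_per_minute("0/0", "0/6400000", 1))  # → 100.0
```

Server-side, the equivalent is `SELECT pg_xlog_location_diff(pg_current_xlog_insert_location(), '<earlier location>');`, which returns the difference in bytes directly.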

Initial Tasks

  • Create a google doc to track the progress. This is because in the event of an outage, Google docs allow for real-time collaboration, and don't depend on GitLab.com being available.
    • Add a link to the issue where it comes from, copy and paste the content of the issue, the description, and the steps to follow.
    • Title the steps as "timeline". Use UTC time without daylight saving, so we're all in the same timezone.
    • Link the document in the on-call log so it's easy to find later.
    • Right before starting the change, paste the link to the google doc in the #production chat channel and "pin" it.
  • Discuss with the person who is introducing the change, and go through the plan to fill the gaps of understanding before starting.
  • Final check of the rollback plan and communication plan.
  • Set PagerDuty maintenance window before starting the change.

The Change

  • Before starting the Change
    • Tweet to publicly notify that you are performing a change in production following the guidelines.
  • Start running the changes. When this happens, one person is making the change, the other person is taking notes of when the different steps are happening. Make it explicit who will do what.
  • When the change is finished, whether it succeeded or not
    • Tweet again to notify that the change is finished and point to the change issue.
    • Copy the content of the document back into the issue, redacting anything necessary to keep it blameless, then deprecate the doc. The document doesn't contain anything useful other than:
      • 12:00 UTC: work starts
      • 12:07 UTC: work done
    • Perform a quick post mortem following the Blameless Postmortem guideline in the infrastructure handbook in a new issue.
    • If the issue caused an outage, or service degradation, label the issue as "outage".