Perform various changes to PostgreSQL of which some require downtime
There is a whole list of configuration changes that we need to apply to PostgreSQL. Most of these don't require downtime, but the following do:
- https://gitlab.com/gitlab-com/infrastructure/issues/1561
- https://gitlab.com/gitlab-com/infrastructure/issues/1555

Postponed until we have failovers so we can change this back:

- https://gitlab.com/gitlab-com/infrastructure/issues/1158
The following changes I would like to apply don't require downtime:
- https://gitlab.com/gitlab-com/infrastructure/issues/1630
- https://gitlab.com/gitlab-com/infrastructure/issues/1559
- https://gitlab.com/gitlab-com/infrastructure/issues/1557
- https://gitlab.com/gitlab-com/infrastructure/issues/1553
- https://gitlab.com/gitlab-com/infrastructure/issues/1552
PLANNING THE CHANGE
For more background on when this template should be used, see the infrastructure handbook.
- Context: What is the background of the change? Relevant links?
- Downtime: Will the change introduce downtime, and if so, how much? Yes, 10 minutes at most.
  - What options were considered to avoid downtime? Failovers could be used, but the failover system we currently have is not reliable.
  - What is the downtime estimate based on? Can it be tested in some way? We just need to restart the DBs, which can take up to a minute per DB.
- People:
  - Who will be present? Who will handle communications (twitter, banner, google doc, etc.)? What other roles will be needed? Yorick will be present.
- Pre-checks: What should we check before starting with the change? Consider dashboards, metrics, limits of current infrastructure, etc. Not much, since we're adding settings that weren't used before, but we do need to monitor response timings in case we reduce the buffer sizes too much.
  - Does the change alter how we use Azure or how many of Azure's resources we use? No.
  - Check that you have all the correct versions of the required software installed on the affected hosts. We do.
  - Check that you have the right access level to the required resources. I do.
- Change Procedure:
  - List the steps that are needed for the change; be as granular as possible.
  - Did you do a dry run to test / measure performance and timings? Staging is currently being heavily used for other things, and it is such a different environment that it can't be used reliably for this.
- Preparatory Steps: What can be done ahead of time? How far ahead?
- Post-checks: What should we check after the change has been applied?
  - How do we know the change worked well? How do we know the change did not work well? What monitoring do we need to pay attention to? How do we verify data integrity?
    - Response timings need to be monitored in case we reduced the buffer sizes too much.
    - Memory usage of PostgreSQL should be monitored to see how it is affected by the buffer sizes.
    - See the spot-check sketch at the end of this planning section.
  - Should any alerts be modified as a consequence of this change? No.
- Rollback procedure: In case things go wrong, what do we need to do to recover? The configuration changes can simply be undone, but we can't downgrade PostgreSQL from 9.6.3 to 9.6.1.
- Create an invite using a 4 hr block of time on the "GitLab Production" calendar (link in handbook), inviting the ops-contact group. Include a link to the issue. (Many times you will not expect to need - or actually need - all 4 hrs, but past experience has shown that delays and unexpected events are more likely than having things go faster than expected.)
- Ping the Production Lead in this issue to coordinate who should be present from the Production team, and to confirm scheduling.
- When will this occur? Leave blank until scheduled.
- Communication plan:
  - Tweet: default to tweeting when the schedule is known, then again 12 hrs before, 1 hr before, when starting, during if there are delays, and after when complete.
  - Deploy banner: display a warning 1 hr before.
  - Other?
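For the post-checks above, a minimal spot-check sketch, assuming omnibus-GitLab's `gitlab-psql` wrapper is available on the database hosts; response timings themselves should still be watched on the monitoring dashboards:

```shell
# Connections in use versus the new max_connections limit.
sudo gitlab-psql -c "SELECT count(*) AS in_use,
                            current_setting('max_connections') AS max_conn
                     FROM pg_stat_activity;"

# Total resident memory of the PostgreSQL processes, to watch how it is
# affected by the buffer sizes.
ps -C postgres -o rss= | awk '{ total += $1 } END { print total / 1024 " MiB resident" }'
```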
DOING THE CHANGE
Preparatory steps
- Enable the deploy page
- Stop Sidekiq and Unicorn
- Stop db1, db3, and db4
- Update the package version in Chef
- Add the settings necessary to Chef
- Run `chef-client` / `gitlab-ctl reconfigure` on the servers
- Start the servers again, starting with db1
- Start Sidekiq
- Start Unicorn
- Disable deploy page
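Taken together, a minimal sketch of this sequence, assuming omnibus-GitLab's `gitlab-ctl` on all hosts, `db1` as the primary, and that the new package version and settings are delivered through Chef as described above:

```shell
# On the application servers: show the deploy page and stop the app.
sudo gitlab-ctl deploy-page up
sudo gitlab-ctl stop sidekiq
sudo gitlab-ctl stop unicorn

# On each database server (db1, db3, db4): stop PostgreSQL, then pull in the
# new package version and settings via Chef and apply them.
sudo gitlab-ctl stop postgresql
sudo chef-client
sudo gitlab-ctl reconfigure

# Start PostgreSQL again, beginning with db1, then db3 and db4.
sudo gitlab-ctl start postgresql

# Back on the application servers: restart the app and drop the deploy page.
sudo gitlab-ctl start sidekiq
sudo gitlab-ctl start unicorn
sudo gitlab-ctl deploy-page down
```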
PostgreSQL Changes To Apply
- Set `max_connections` to `300` ⚠ This must be done on the primary first ⚠
- Update the GitLab package on the database servers to upgrade PostgreSQL to 9.6.3, then restart PostgreSQL ⚠ This must be done on the primary first ⚠
- Set `random_page_cost` to `2`
- Set `max_locks_per_transaction` to `128`
- Set `log_temp_files` to `0`
- Set `log_checkpoints` to `on`
- Set `log_min_duration_statement` to `1000`
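A minimal sketch for verifying the applied values afterwards, assuming the `gitlab-psql` wrapper on each database host:

```shell
# Confirm each changed setting took effect after the reconfigure/restart.
for setting in max_connections random_page_cost max_locks_per_transaction \
               log_temp_files log_checkpoints log_min_duration_statement; do
  sudo gitlab-psql -c "SHOW ${setting};"
done

# The server version should report 9.6.3 after the package upgrade.
sudo gitlab-psql -c "SHOW server_version;"
```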
Changing `max_connections`
This requires a careful set of steps to make sure that all databases play nice:
- Stop all databases
- Adjust the settings on the primary
- Start the primary
- Start the secondaries
- Wait a minute
- Adjust it on the secondaries
- Restart the secondaries
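A minimal sketch of that ordering, assuming `db1` is the primary and `db3`/`db4` are the secondaries (hostnames from this issue), and that the new value is applied on each host by running the Chef reconfigure when that host's turn comes:

```shell
# 1. Stop all databases (host order within this step is an assumption).
for host in db3 db4 db1; do ssh "$host" 'sudo gitlab-ctl stop postgresql'; done

# 2-3. Adjust max_connections on the primary and start it.
ssh db1 'sudo gitlab-ctl reconfigure && sudo gitlab-ctl start postgresql'

# 4. Start the secondaries, still on the old value.
for host in db3 db4; do ssh "$host" 'sudo gitlab-ctl start postgresql'; done

# 5. Wait a minute for replication to settle.
sleep 60

# 6-7. Adjust the setting on the secondaries and restart them.
for host in db3 db4; do
  ssh "$host" 'sudo gitlab-ctl reconfigure && sudo gitlab-ctl restart postgresql'
done
```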
Initial Tasks
- Create a google doc to track the progress. This is because, in the event of an outage, Google docs allow for real-time collaboration and don't depend on GitLab.com being available: https://docs.google.com/document/d/164-kq8LdtuP-qNt9E8JP-AZVb4m3ekX4DpYKxGYJhTU/edit#
  - Add a link to the issue it comes from, and copy and paste the content of the issue, the description, and the steps to follow.
  - Title the steps as "timeline". Use UTC time without daylight saving, so that we are all in the same timezone.
  - Link the document in the on-call log so it's easy to find later.
  - Right before starting the change, paste the link to the google doc in the #production chat channel and "pin" it.
- Discuss with the person who is introducing the change, and go through the plan together to fill any gaps in understanding before starting.
- Final check of the rollback plan and communication plan.
- Set a PagerDuty maintenance window before starting the change.
The Change
- Before starting the change:
  - Tweet to publicly notify that you are performing a change in production, following the guidelines.
- Start running the changes. While this happens, one person makes the change and the other person takes notes of when the different steps happen. Make it explicit who will do what.
- When the change is done, whether it finished successfully or not:
  - Tweet again to notify that the change is finished, and point to the change issue.
  - Copy the content of the document back into the issue, redacting any data necessary to keep it blameless, and deprecate the doc.