Update the Reference Architecture setup docs for 10k and up. Instruct sysadmins to over-provision the site and increase the memory available to the Postgres nodes.
In the Geo Troubleshooting docs, add a section that includes the relevant error messages (to make it searchable) and likewise instructs sysadmins to over-provision the site and increase the memory available to the Postgres nodes.
Initial replication fails due to running out of memory; this happens even on a brand-new environment with no prepopulated data. The error below repeats in the logs. I think this is more likely a configuration error, but I'm not 100% sure. For now the issue can be avoided by over-provisioning the secondary site.
```
2022-11-04_15:47:45.00609 2022-11-04 15:47:45,004 INFO: bootstrapped clone from remote master postgresql://10.132.0.119:5432
2022-11-04_15:47:45.23456 FATAL: could not map anonymous shared memory: Cannot allocate memory
2022-11-04_15:47:45.23477 HINT: This error usually means that PostgreSQL's request for a shared memory segment exceeded available memory, swap space, or huge pages. To reduce the request size (currently 8130248704 bytes), reduce PostgreSQL's shared memory usage, perhaps by reducing shared_buffers or max_connections.
2022-11-04_15:47:45.23498 LOG: database system is shut down
2022-11-04_15:47:45.23856 2022-11-04 15:47:45,234 INFO: postmaster pid=2710
2022-11-04_15:47:45.24032 /var/opt/gitlab/postgresql:5432 - no response
2022-11-04_15:47:46.24283 2022-11-04 15:47:46,242 ERROR: postmaster is not running
2022-11-04_15:47:46.24817 2022-11-04 15:47:46,247 INFO: removing initialize key after failed attempt to bootstrap the cluster
2022-11-04_15:47:46.25553 2022-11-04 15:47:46,255 INFO: renaming data directory to /var/opt/gitlab/postgresql/data_2022-11-04-15-47-46
2022-11-04_15:47:46.26659 2022-11-04 15:47:46,265 INFO: Deregister service postgresql-ha/geo-3k-staging-ref-postgres-3.c.gitlab-qa-geo-986758.internal
2022-11-04_15:47:46.46938 2022-11-04 15:47:46,468 INFO: Deregister service postgresql-ha/geo-3k-staging-ref-postgres-3.c.gitlab-qa-geo-986758.internal
2022-11-04_15:47:46.47049 Traceback (most recent call last):
2022-11-04_15:47:46.47063 File "/opt/gitlab/embedded/bin/patroni", line 8, in <module>
2022-11-04_15:47:46.47073 sys.exit(main())
2022-11-04_15:47:46.47080 File "/opt/gitlab/embedded/lib/python3.9/site-packages/patroni/__init__.py", line 171, in main
2022-11-04_15:47:46.49403 return patroni_main()
2022-11-04_15:47:46.49430 File "/opt/gitlab/embedded/lib/python3.9/site-packages/patroni/__init__.py", line 139, in patroni_main
2022-11-04_15:47:46.49436 abstract_main(Patroni, schema)
2022-11-04_15:47:46.49436 File "/opt/gitlab/embedded/lib/python3.9/site-packages/patroni/daemon.py", line 100, in abstract_main
2022-11-04_15:47:46.49511 controller.run()
2022-11-04_15:47:46.49530 File "/opt/gitlab/embedded/lib/python3.9/site-packages/patroni/__init__.py", line 109, in run
2022-11-04_15:47:46.49540 super(Patroni, self).run()
2022-11-04_15:47:46.49548 File "/opt/gitlab/embedded/lib/python3.9/site-packages/patroni/daemon.py", line 59, in run
2022-11-04_15:47:46.49558 self._run_cycle()
2022-11-04_15:47:46.49562 File "/opt/gitlab/embedded/lib/python3.9/site-packages/patroni/__init__.py", line 112, in _run_cycle
2022-11-04_15:47:46.49569 logger.info(self.ha.run_cycle())
2022-11-04_15:47:46.49577 File "/opt/gitlab/embedded/lib/python3.9/site-packages/patroni/ha.py", line 1469, in run_cycle
2022-11-04_15:47:46.49619 info = self._run_cycle()
2022-11-04_15:47:46.49624 File "/opt/gitlab/embedded/lib/python3.9/site-packages/patroni/ha.py", line 1343, in _run_cycle
2022-11-04_15:47:46.49666 return self.post_bootstrap()
2022-11-04_15:47:46.49675 File "/opt/gitlab/embedded/lib/python3.9/site-packages/patroni/ha.py", line 1236, in post_bootstrap
2022-11-04_15:47:46.49710 self.cancel_initialization()
2022-11-04_15:47:46.49715 File "/opt/gitlab/embedded/lib/python3.9/site-packages/patroni/ha.py", line 1229, in cancel_initialization
2022-11-04_15:47:46.49756 raise PatroniFatalException('Failed to bootstrap cluster')
2022-11-04_15:47:46.49767 patroni.exceptions.PatroniFatalException: 'Failed to bootstrap cluster'
```
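The HINT names shared_buffers and max_connections as the main drivers of that ~8.1 GB shared memory request. As a quick check (nothing GitLab-specific, just standard Postgres), the relevant settings can be inspected on a node where PostgreSQL is up, for example via gitlab-psql on the primary:

```sql
-- Inspect the settings that determine the size of the shared memory segment
-- requested at startup. Run on a node where PostgreSQL is running (e.g. the primary).
SHOW shared_buffers;
SHOW max_connections;
SHOW huge_pages;
```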
Updating Patroni should be a smaller effort that, even if it doesn't fix this, can help us avoid issues in other areas (like pg_rewind-related ones).
Also:
Initial replication will fail due to running out of memory
This may suggest our data pattern is different from the last time we defined the minimum requirements (for example, we are now partitioning the database, and that is not free).
@nwestbury Could the solution here look like defining the "over-provisioning" limits as the new minimum?
Once the above issue has been worked around and a leader is bootstrapped, the following message is printed continuously:
```
2022-11-04_15:32:44.97819 ERROR: recovery is in progress
2022-11-04_15:32:44.97824 HINT: WAL control functions cannot be executed during recovery.
2022-11-04_15:32:44.97825 STATEMENT:
2022-11-04_15:32:44.97825 SELECT slot_name, database, active, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
2022-11-04_15:32:44.97826 FROM pg_replication_slots
2022-11-04_15:32:44.97826
```
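For context, the statement fails because pg_current_wal_lsn() cannot be executed while the server is in recovery, and the secondary's standby leader is always in recovery. Purely as an illustration (this is not the query GitLab itself runs), the same slot information can be read on a standby by falling back to the last replayed LSN:

```sql
-- Illustrative only: slot lag that also works on a standby. On a replica,
-- pg_current_wal_lsn() raises "recovery is in progress", so use the last
-- replayed LSN instead when the server is in recovery.
SELECT slot_name,
       database,
       active,
       pg_wal_lsn_diff(
         CASE WHEN pg_is_in_recovery()
              THEN pg_last_wal_replay_lsn()
              ELSE pg_current_wal_lsn()
         END,
         restart_lsn) AS retained_wal_bytes
FROM pg_replication_slots;
```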
@nwestbury I have since successfully set up a new 3k hybrid primary and 3k hybrid secondary. Do you think this is still an issue? Maybe it's somehow specific to using a 10k hybrid primary?
It could be related to the performance difference between a 3k and a 10k environment. We saw similar issues when trying to get staging-ref working. I'm currently in the process of getting Geo back onto staging-ref, so I will keep an eye out for this and update.
We are hitting this now on the new staging-ref secondary:
```
2023-01-26_14:39:06.15256 2023-01-26 14:39:06,152 INFO: bootstrap_standby_leader in progress
2023-01-26_14:39:06.17028 LOG: starting PostgreSQL 12.12 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit
2023-01-26_14:39:06.17050 LOG: listening on IPv4 address "0.0.0.0", port 5432
2023-01-26_14:39:06.17476 LOG: listening on Unix socket "/var/opt/gitlab/postgresql/.s.PGSQL.5432"
2023-01-26_14:39:06.17996 FATAL: could not map anonymous shared memory: Cannot allocate memory
2023-01-26_14:39:06.18012 HINT: This error usually means that PostgreSQL's request for a shared memory segment exceeded available memory, swap space, or huge pages. To reduce the request size (currently 8132009984 bytes), reduce PostgreSQL's shared memory usage, perhaps by reducing shared_buffers or max_connections.
2023-01-26_14:39:06.18022 LOG: database system is shut down
2023-01-26_14:39:07.05782 2023-01-26 14:39:07,057 ERROR: postmaster is not running
2023-01-26_14:39:07.06211 2023-01-26 14:39:07,061 INFO: removing initialize key after failed attempt to bootstrap the cluster
2023-01-26_14:39:07.06850 2023-01-26 14:39:07,068 INFO: renaming data directory to /var/opt/gitlab/postgresql/data_2023-01-26-14-39-07
2023-01-26_14:39:07.08079 2023-01-26 14:39:07,079 INFO: Deregister service postgresql-ha/staging-ref-3k-geo-postgres-2.c.gitlab-staging-ref.internal
2023-01-26_14:39:07.47297 2023-01-26 14:39:07,471 INFO: Deregister service postgresql-ha/staging-ref-3k-geo-postgres-2.c.gitlab-staging-ref.internal
2023-01-26_14:39:07.47403 Traceback (most recent call last):
2023-01-26_14:39:07.47417 File "/opt/gitlab/embedded/bin/patroni", line 8, in <module>
2023-01-26_14:39:07.47426 sys.exit(main())
2023-01-26_14:39:07.47435 File "/opt/gitlab/embedded/lib/python3.9/site-packages/patroni/__init__.py", line 171, in main
2023-01-26_14:39:07.47550 return patroni_main()
2023-01-26_14:39:07.47560 File "/opt/gitlab/embedded/lib/python3.9/site-packages/patroni/__init__.py", line 139, in patroni_main
2023-01-26_14:39:07.47571 abstract_main(Patroni, schema)
2023-01-26_14:39:07.47573 File "/opt/gitlab/embedded/lib/python3.9/site-packages/patroni/daemon.py", line 100, in abstract_main
2023-01-26_14:39:07.47636 controller.run()
2023-01-26_14:39:07.47647 File "/opt/gitlab/embedded/lib/python3.9/site-packages/patroni/__init__.py", line 109, in run
2023-01-26_14:39:07.47655 super(Patroni, self).run()
2023-01-26_14:39:07.47662 File "/opt/gitlab/embedded/lib/python3.9/site-packages/patroni/daemon.py", line 59, in run
2023-01-26_14:39:07.47671 self._run_cycle()
2023-01-26_14:39:07.47678 File "/opt/gitlab/embedded/lib/python3.9/site-packages/patroni/__init__.py", line 112, in _run_cycle
2023-01-26_14:39:07.47683 logger.info(self.ha.run_cycle())
2023-01-26_14:39:07.47686 File "/opt/gitlab/embedded/lib/python3.9/site-packages/patroni/ha.py", line 1469, in run_cycle
2023-01-26_14:39:07.48480 info = self._run_cycle()
2023-01-26_14:39:07.48497 File "/opt/gitlab/embedded/lib/python3.9/site-packages/patroni/ha.py", line 1343, in _run_cycle
2023-01-26_14:39:07.48532 return self.post_bootstrap()
2023-01-26_14:39:07.48544 File "/opt/gitlab/embedded/lib/python3.9/site-packages/patroni/ha.py", line 1236, in post_bootstrap
2023-01-26_14:39:07.48578 self.cancel_initialization()
2023-01-26_14:39:07.48586 File "/opt/gitlab/embedded/lib/python3.9/site-packages/patroni/ha.py", line 1229, in cancel_initialization
2023-01-26_14:39:07.48620 raise PatroniFatalException('Failed to bootstrap cluster')
2023-01-26_14:39:07.48628 patroni.exceptions.PatroniFatalException: 'Failed to bootstrap cluster'
```
The workaround for now is to over-provision the site and increase the memory available to the Postgres nodes, but we can certainly recreate the issue here.
If there is a way to limit the memory usage and prevent this error, it would be a worthwhile addition to the setup docs.
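One possible lever, sketched below and not yet validated against the reference architectures: lowering shared_buffers on the secondary site's Patroni/PostgreSQL nodes directly shrinks the shared memory segment that the FATAL error complains about (by default Omnibus sizes it from the node's RAM). The value here is illustrative only, not a recommendation:

```ruby
# /etc/gitlab/gitlab.rb on the secondary site's Patroni/PostgreSQL nodes.
# Illustrative value only: a smaller shared_buffers means a smaller shared
# memory segment requested at startup (~8 GB in the logs above).
postgresql['shared_buffers'] = "2GB"
# The HINT also mentions max_connections; on a Patroni cluster check how that
# parameter is managed before changing it.
```

Applying it would be the usual `gitlab-ctl reconfigure` followed by `gitlab-ctl restart patroni` on each node.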
Even without a fix, this does break the setup of the environment. In my opinion, the note about over-provisioning would be more useful to the user in the setup docs than in troubleshooting. Troubleshooting feels a bit late; by then the user will have already run into this issue and had to search for a solution.
Replication is actually looking pretty good on staging-ref at the moment; verification is lagging behind.
The secondary site's standby leader is repeatedly printing the below error in the logs:
```
2022-11-04_15:32:44.97819 ERROR: recovery is in progress
2022-11-04_15:32:44.97824 HINT: WAL control functions cannot be executed during recovery.
2022-11-04_15:32:44.97825 STATEMENT:
2022-11-04_15:32:44.97825 SELECT slot_name, database, active, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
2022-11-04_15:32:44.97826 FROM pg_replication_slots
2022-11-04_15:32:44.97826
```