When setting up Geo, an instance of GitLab goes from "I don't know what Geo node I am" to "Ah ha, I am this secondary node". Geo::CronManager runs periodically. In this case it disables nearly all jobs and enables secondary-only jobs.
Similarly, for a Geo primary, it disables secondary-only jobs and enables everything else.
Similarly, for a "I don't know what Geo node I am, or, there are no Geo nodes" instance, CronManager enables all jobs and disables Geo-only ones.
Note that it is possible for "what node am I?" to change after set up. E.g.:
When promoting a secondary or demoting a primary
When setting up a new Sidekiq node, perhaps gitlab.rb is not 100% correct immediately
??
So CronManager's approach is simple and robust, but clobbers manual changes.
Possible fixes
TBH nothing ideal is coming to mind.
Monkeypatch sidekiq-cron so we can note a manually disabled/enabled job vs the expected configuration
Make CronManager less robust to changes, and make rake gitlab:geo:check warn about deviation of job configuration
Workaround
Disable the CronManager from running, then disable other jobs.
Though this won't be permanent either since it looks like we have additional redundancy-- when Sidekiq is initialized, it ensures this job is running :( (I don't know if Sidekiq initialization occurs during "normal" operation e.g. after RSS killer, or only when the service starts)
This is still an issue. Based on the original mr, I'm not sure there is a need to continually run CronManager, except for the case that the sites role may change. So it may be ok to only run CronManager during Geo node creation, promotion, demotion, and not run CronManager on a regular basis.
For primary nodes, augment the creation/promotion steps in Gitlab::Geo::GeoTasks to run Gitlab::Geo::CronManager
When creating primary node, we run [gitlab-ctl set-geo-primary-node](gitlab-ctl set-geo-primary-node
), which eventually calls gitlab-rake geo:set_primary_node.
For secondary nodes, add/augment a logcursor event to call CronManager when a secondary site is added.
Cons
I'm not crazy about adding more steps to setup Geo. Would definitely need a check added to gitlab:geo:check. This concern is somewhat offset as the steps could be automated as part of existing steps.
If we do add new workers for Geo, they will be more complicated to roll out. Looking at the changelog, this is rare.
I'll assign this a weight of 3, but would accept a higher weight. This could be broken into sub-issues, but I do think it is important to rollout the change in behavior alongside the check.
It might be worth another point just to verify cronjobs are configured properly after Geo-related transitions like failover, setting up Geo, removing Geo, adding a secondary. Also we need to be sure it works for GitLab Helm chart. Do you think we need to verify all of that, or a subset?
This could be broken into sub-issues, but I do think it is important to rollout the change in behavior alongside the check.
I agree, the check is important since it'd be easier to get into a bad state and waste a lot of time troubleshooting.
We could break it down but put the new check as a blocker. Since IIUC the new check would work for the current world even though it isn't yet necessary. WDYT?
It might be worth another point just to verify cronjobs are configured properly after Geo-related transitions like failover, setting up Geo, removing Geo, adding a secondary. Also we need to be sure it works for GitLab Helm chart. Do you think we need to verify all of that, or a subset?
So I think the events we have are
Standalone site becomes primary
Part of the installation process. Happens via geo:set_primary_node rake task
Secondary site becomes primary
Part of failover process. Via sudo gitlab-ctl geo promote
Standalone site becomes secondary
Part of initial installation, recovery after failover, or possibly growing an install to another site
I think a former primary becoming a secondary would fit here.
This might be one spot where CronManager is necessary now. We could add an event tied to adding a secondary node maybe? Log cursor could process the event to ensure the appropriate workers are enabled.
I'm not sure where disabling these jobs would fit into. Perhaps gitlab-ctl command or rake task to disable the jobs?
We could break it down but put the new check as a blocker. Since IIUC the new check would work for the current world even though it isn't yet necessary. WDYT?
I think that makes sense. I'll open a separate issue for the check, and update this issue description for the CronManager change.
Part of initial installation, recovery after failover, or possibly growing an install to another site
I think a former primary becoming a secondary would fit here.
This might be one spot where CronManager is necessary now. We could add an event tied to adding a secondary node maybe? Log cursor could process the event to ensure the appropriate workers are enabled.
I'm not sure where disabling these jobs would fit into. Perhaps gitlab-ctl command or rake task to disable the jobs?
Yeah I'm not sure how these should work either.
A log cursor event meant for a new secondary might possibly have a race condition in which the new secondary doesn't track Geo events yet.
For removing Geo, I believe we only need to worry about primaries because a secondary removing Geo falls under promotion.
A primary would become standalone if all secondary geo_nodes records get removed, which we can detect in Rails.
Also if a primary's geo_nodes record's name changes to no-longer match the geo_node_name/global.geo.nodeName, which we can also detect in Rails.
I suppose we don't need to worry about someone modifying geo_node_name/global.geo.nodeName to no-longer match the DB, or manually updating the DB. That would be not a supported workflow, and they could eventually figure out what happened with rake gitlab:geo:check.
Rails validation prevents updating the primary geo_nodes record's primary field to false, so we are covered there.
Thanks for these details @mkonozo, greatly appreciated. Here's the text-searchable form of the CronManager job name, making it easier for people who want/need to disable that job manually: geo_sidekiq_cron_config_worker
It's also possible to disable it in the Rails console, for those who feel so inclined: