Increased sidekiq queue after isolating the service to new nodes

What happened?

As part of the migration to our Azure ARM environment we enabled three new specialised sidekiq nodes and gradually began switching off all the services on the previous ones in the Classic environment. In the Classic environment the workers ran all GitLab services (unicorn, workhorse, sidekiq, etc.), while in the new environment only sidekiq and sidekiq-cluster were started.

We started the service shutdown task at 10:38 and completed it at 11:37.

Right after we stopped the services on the last node we noticed an increase in the number of queued sidekiq jobs.

Screen_Shot_2017-02-25_at_13.23.50

We also started seeing this error when attempting to accept a merge request: GitLab: Failed to authorize your Git request: internal API unreachable

Suspecting that three nodes weren't enough to sustain all the sidekiq traffic, at 11:51 we restarted the sidekiq and sidekiq-cluster services on three of the Classic workers as a countermeasure. Six minutes later we added a fourth one.

Seeing that the queue still wasn't decreasing, at 13:33 we restarted sidekiq and sidekiq-cluster on all the remaining Classic workers. After this the queue was flushed, and merge requests stopped showing the API error.

But at 14:47 we received another alert about gitlab-shell failing to contact the internal API. After a quick investigation, at 15:09 we found out that our sidekiq configuration assumes the presence of a local internal API server. This wasn't the case on our new sidekiq nodes, and even on the Classic ones we had only restarted the sidekiq services. This also means that background jobs were throwing errors, causing them to remain unprocessed in redis. To the end user this translated into very long delays.
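To illustrate the assumption (the exact path and port below are illustrative, not necessarily our production values): gitlab-shell resolves the internal API relative to the `gitlab_url` in its configuration, which traditionally pointed at the same host.

```yaml
# /etc/gitlab-shell/config.yml (illustrative excerpt)
# gitlab-shell derives the internal API endpoint from this URL.
# On a sidekiq-only node nothing listens on localhost, so every
# internal API call fails with "internal API unreachable".
gitlab_url: "http://localhost:8080"
```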

As a final fix we started the nginx and unicorn services on all sidekiq nodes, and all pending jobs were quickly processed.
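On an omnibus-managed node the fix amounts to flipping per-service toggles. A sketch of what this could look like in /etc/gitlab/gitlab.rb, assuming the nodes use the standard omnibus service flags (the exact set of disabled services here is illustrative):

```ruby
# /etc/gitlab/gitlab.rb (sketch) -- a specialised sidekiq node that
# also runs a local unicorn + nginx so background jobs and
# gitlab-shell can reach the internal API on localhost
sidekiq['enable'] = true
unicorn['enable'] = true
nginx['enable']   = true

# everything unrelated stays off on this node
postgresql['enable'] = false
redis['enable']      = false
```

Running a local unicorn purely to serve internal API calls is a workaround rather than a design goal; the longer-term fix is to make sidekiq's configuration point at a remote API endpoint.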

What went wrong?

Sidekiq had an undocumented hard dependency on a local API server. Historically, workers at GitLab were multi-purpose, meaning that every node had all GitLab services started. This led us to design applications and configurations around certain assumptions. As we transition towards specialised worker nodes, this was our first encounter with one of those hidden assumptions.

What could we have done to avoid this?

We could have tested the isolation of the services in a staging environment. Unfortunately we didn't consider this option because the migration coincided with the GitLab release.

Also, our dashboards didn't show us exactly what was going on, leaving us partially blind and delaying our response.

What are we going to do to prevent this from happening again?

cc/ @gl-infra @DouweM