Make imports more resilient to errors related to deploys
Problem summary
Yesterday, I kicked off a test import of the linux repository from GitHub into the staging-ref GitLab environment.
The import failed partway through. The logs show a number of Redis::ConnectionError and ActiveRecord::ConnectionNotEstablished errors, which correspond with the timeframe when a deployment went out to staging-ref.
There is already a staging-ref issue about data integrity problems with Sidekiq jobs that are processing when staging-ref goes through a with-downtime upgrade: https://gitlab.com/gitlab-org/quality/gitlab-environment-toolkit-configs/staging-ref/-/issues/94+
This seems to be a similar situation: some jobs were enqueued while a deploy went out, failed due to connection errors, and never succeeded. I assume that staging-ref also does with-downtime deploys.
There was also a staging-ref deploy yesterday that put the environment into temporary maintenance mode, but its timeline doesn't line up with the timing of the errors in the logs. Still, it seems suspicious.
This might be an issue specific to staging-ref, but it does seem like our import jobs should be more resilient to this type of interruption. A sketch of what that could look like is below.
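This is a rough illustration only, not our actual import code: a plain Sidekiq worker can be told to retry on exactly these transient errors, with a backoff long enough to outlast a deploy. ExampleImportStageWorker and the retry numbers are hypothetical.

```ruby
# Hypothetical sketch: retry an import stage job when infrastructure is
# briefly unavailable during a deploy, instead of failing it for good.
require 'sidekiq'
require 'redis'
require 'active_record'

class ExampleImportStageWorker # hypothetical worker name
  include Sidekiq::Worker

  sidekiq_options retry: 5

  # Space retries minutes apart so they land after the deploy finishes,
  # rather than immediately hitting the same dead connection.
  sidekiq_retry_in do |count, exception|
    case exception
    when Redis::ConnectionError, ActiveRecord::ConnectionNotEstablished
      60 * (count + 1) # 1 min, 2 min, 3 min, ...
    end # nil for other errors falls back to Sidekiq's default schedule
  end

  def perform(project_id)
    # ... run one stage of the import for the given project ...
  end
end
```

If the jobs do already retry and still never succeeded, the downtime window may simply have outlasted the retry schedule, which is why the length of the backoff matters here.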
Workaround
We already have one workaround available: the "optimistic" timeout strategy. With this strategy, the interrupted jobs would still fail, but those failures wouldn't stop the whole import from completing. We'd still have some missing data, but less of it overall.
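For reference, here is a minimal sketch of starting an import with that strategy through the REST API, assuming the timeout_strategy parameter on POST /import/github; the host, tokens, repository ID, and namespace below are all placeholders.

```ruby
# Hypothetical sketch: kick off a GitHub import with the optimistic
# timeout strategy so individual job failures don't abort the whole import.
require 'net/http'
require 'json'
require 'uri'

uri = URI('https://gitlab.example.com/api/v4/import/github') # placeholder host

request = Net::HTTP::Post.new(uri)
request['PRIVATE-TOKEN'] = ENV.fetch('GITLAB_TOKEN')   # GitLab access token
request['Content-Type'] = 'application/json'
request.body = {
  personal_access_token: ENV.fetch('GITHUB_TOKEN'),    # GitHub access token
  repo_id: 123,                                        # placeholder GitHub repository ID
  target_namespace: 'import-testing',                  # placeholder namespace
  timeout_strategy: 'optimistic'
}.to_json

response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
  http.request(request)
end

puts response.code
puts response.body
```

If I understand the strategies correctly, the stricter behavior is the default, so an import has to opt in to this one.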
Alternative
Alternatively, we could consider staging-ref an unusual environment because it uses with-downtime upgrades. But this seems like a good example of jobs being interrupted, and we should have a better way of recovering from it that does not involve losing data.