This ~"group::geo" bug has at most 25% of the SLO duration remaining and is approaching an SLO breach (~"approaching-SLO"). Please consider taking action before this becomes a ~"missed-SLO" in 14 days (2022-11-23).
1. In the Managed pods section, click the Name of a `Running` pod. For example, the name looks like `gitlab-toolbox-5955db475c-ng2xr`.
2. Click the Kubectl dropdown near the top.
3. Hover over Exec to reveal a submenu.
4. Click toolbox.
5. A Cloud Shell should start up.
6. Edit the pre-filled command `kubectl exec gitlab-toolbox-5955db475c-ng2xr -c toolbox -- ls` to add the `-it` options and execute `bash` instead of `ls`. It should look like `kubectl exec -it gitlab-toolbox-5955db475c-ng2xr -c toolbox -- bash`.
Updated replication status below. Everything is looking significantly better in regard to projects and wikis. We still have a fair few items queued when it comes to other replication types.
I've updated the access to give @brodock Admin access on staging-ref. Only @vsizov, @aakriti.gupta, and @juan-silva don't have Admin; I can't find user accounts for them there, so most likely they just need to sign in at some point and someone can grant them Admin.
I don't think I can give access to the GCP project, but will find out.
Uploads are very behind, so I took a peek at the failures.
    irb(main):001:0> pp Geo::UploadRegistry.group(:last_sync_failure).count
    {"Non-success HTTP response status code 500"=>2,
     "Non-success HTTP response status code 401"=>2129,
     "Sync timed out after 28800"=>14660,
     nil=>250509,
     "Error downloading file: failed to connect: Connection refused - connect(2) for \"staging-ref.gitlab.com\" port 443"=>2,
     "Error downloading file: Server error"=>1,
     "Non-success HTTP response status code 404"=>2}
Since most failures are Non-success HTTP response status code 401, I looked at that and opened #398836.
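In case it helps whoever picks that up, a hedged Rails console sketch for sampling those 401 failures (`last_sync_failure` comes from the query above; the `file_id` and `last_synced_at` columns are assumptions about the registry schema):

```ruby
# Pull a few registry rows that failed with a 401, to see which uploads they
# point at and when they last attempted to sync. Column names are assumptions.
Geo::UploadRegistry
  .where(last_sync_failure: "Non-success HTTP response status code 401")
  .limit(5)
  .pluck(:file_id, :last_synced_at)
```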
Hello @nwestbury, @niskhakova. I discussed this issue with @sranasinghe, and we are wondering whether we would need to clean up the stale data that could be in the staging-ref environment, and whether this process is on track or planned for the next iteration. Could you confirm that? I know that replication is working perfectly fine for some of the records. Hence, I'd like to know if these problems are mostly related to that stale data or if we can start the troubleshooting and investigation process with the database as it is now. Thank you!
Staging-ref will be going through a rebuild. This will remove all the data and give us a clean starting point. Once the rebuild is complete, it should be much easier to track down any issues as they arise.
@jtapiab Mentioning here the workaround(?) we found in the Slack thread. I was able to "Confirm" your email and then "Validate user account" on your user page in the Admin Area.
Regarding the weirdness with user validation, it appears to be caused by `identity_verification_credit_card`, `arkose_labs_signup_challenge`, and other Arkose-related feature flags that have been enabled by the Sec::Anti-abuse team for testing (see comments in the staging-ref channel).
Both sites are showing healthy now. The issue looked to be down to the secondary site's replication being paused. After resuming replication and running `Geo::MetricsUpdateWorker.new.perform` on both sites, they both now show healthy. Replication is paused as part of the upgrade process for staging-ref, so it looks like the resume command isn't being run correctly.
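For reference, the console step mentioned above is just the following, run in `gitlab-rails console` (via the toolbox pod access described earlier) on each site; a minimal sketch:

```ruby
# Recompute and publish the Geo status immediately instead of waiting for
# the every-minute cron schedule.
Geo::MetricsUpdateWorker.new.perform
```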
OK, the main issue here seems to be related to the health check being run. The primary site was showing as unhealthy again, with the check not having been run for 26 minutes. Running `Geo::MetricsUpdateWorker.new.perform` again solves the issue, but I'm guessing the site will be unhealthy again in 10 minutes.
@mkozono / @jtapiab - Do we know what would be preventing this from running on the primary as usual?
If triggering the worker manually works, my first assumption is that another process fails to schedule the worker every minute. Maybe the worker could not obtain an exclusive lease for some reason.
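If we suspect the lease, a quick hedged check from the Rails console (the lease key here is a guess rather than taken from the worker's code):

```ruby
# Should return the holding UUID if a lease is currently taken, or a falsey
# value if not. NOTE: 'geo_metrics_update_worker' is an assumed lease key;
# adjust it to whatever key the worker actually uses.
Gitlab::ExclusiveLease.get_uuid('geo_metrics_update_worker')
```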
I checked some logs for staging-ref in GCP and found some Redis failures that I'm not sure are related, but they could be.
I also searched by the Geo::MetricsUpdateWorker, and I can see the worker being triggered every minute. The last time was one minute ago:
Interestingly, `Gitlab::Metrics.prometheus_metrics_enabled?` is returning `false`, which makes the primary node skip updating the metrics for secondaries. It looks like it's returning `false` because `::Prometheus::Client.configuration.multiprocess_files_dir` is returning `nil`.
Looking at the implementation of `multiprocess_files_dir`, I found that the `prometheus_multiproc_dir` environment variable is not defined.
Maybe we can try setting the `prometheus_multiproc_dir` environment variable on both sites and check whether the behavior persists, so we can rule out that it's related to `Dir.mktmpdir("prometheus-mmap")`.
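For anyone re-checking this later, these are the console checks referenced above (the values shown are what was observed on staging-ref; using `ENV.fetch` with a default avoids a `KeyError` when the variable is unset):

```ruby
Gitlab::Metrics.prometheus_metrics_enabled?
# => false (observed on staging-ref)

::Prometheus::Client.configuration.multiprocess_files_dir
# => nil

# The directory normally comes from this environment variable.
ENV.fetch('prometheus_multiproc_dir', nil)
# => nil
```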
Interestingly, Gitlab::Metrics.prometheus_metrics_enabled? is returning false
Note: We are seeing similar return values for the Prometheus enabled methods in Rails console in https://gitlab.com/gitlab-com/geo-customers/-/issues/125#note_1356923343. In that issue as well as in this one, it seems like the prometheus methods may be a red herring. I think the prometheus metrics do not need to be updated in order for sites on /admin/geo/sites to be shown as "healthy".
The primary site was showing as unhealthy again, with the check not having been run for 26 minutes. Running `Geo::MetricsUpdateWorker.new.perform` again solves the issue
Note: This indicates at least that the exclusive lease for this worker is not taken on the primary site. Though by itself it doesn't rule out that Sidekiq is failing to obtain a lease due to a Redis connection problem.
But, at the moment, the primary site's status appears to be up-to-date and getting regularly updated. So I suppose this resolved itself?
The secondary site's status is currently out-of-date.
The way the secondary site's geo_node_statuses record gets updated is:
1. In the secondary site, the metrics worker is enqueued every minute.
2. In the secondary site, a Sidekiq worker must pick up the job and execute it. (Idea: we could look for errors during this job, or for Sidekiq "done" logs.)
3. The job should build a `GeoNodeStatus` object.
4. The job should make a `POST` request against the primary site at `/api/v4/geo/status`. (Idea: we could look for these requests on the primary site and their HTTP status codes; see also the console sketch below.)
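As a quick complement to the log-based ideas above, a hedged Rails console check of how stale the persisted statuses are (`geo_node_statuses` is named above; the exact staleness query is an assumption about what's useful to look at):

```ruby
# How long ago was each site's status row last written? A row that hasn't
# been touched for many minutes points at a break somewhere in steps 1-4.
GeoNodeStatus.order(updated_at: :desc).limit(5).each do |status|
  minutes = ((Time.current - status.updated_at) / 60).round
  puts "geo_node_id=#{status.geo_node_id}: status updated #{minutes} minutes ago"
end
```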
A quick update: the secondary site's status gets updated after some hours, then becomes unhealthy for up to 7 hours (sometimes less than that). I plan to start digging into this on Friday to test Mike's ideas and dig into the logs. I'm curious about this behavior and why it eventually gets up-to-date.
I think this was mentioned in another issue-- I think staging-ref's secondary Geo site goes "unhealthy" multiple times a day for multiple hours because it is getting upgraded multiple times a day. Also as part of that upgrade, the secondary site's replication gets "paused", which I suspect exacerbates the issue.
From memory, I think part of the problem is that the upgrade process treats the secondary site as a second-class citizen (as a first iteration of implementing staging-ref with Geo) by putting the upgrade of the secondary in a separate job that doesn't block the primary's upgrade process. While the GitLab versions are different, Geo considers the secondary "Unhealthy".
Pausing replication is listed as an optional step within our docs; do we think it would be better to remove this from the staging-ref upgrade?
Currently, the primary and secondary sites are updated in parallel, so they shouldn't be on different versions for long. As @mkozono points out, the failure of one doesn't affect the other, but even if it did, the site would probably still be left in an unhealthy state without manual intervention.
Pausing replication is listed as an optional step within our docs; do we think it would be better to remove this from the staging-ref upgrade?
In the context of staging-ref's frequent upgrades and not using the zero-downtime procedure, I think it may be worth a try to see if it reduces the unhealthy duration. It might not have an effect, though; it's just a hunch.
I think this was mentioned in another issue-- I think staging-ref's secondary Geo site goes "unhealthy" multiple times a day for multiple hours because it is getting upgraded multiple times a day. Also as part of that upgrade, the secondary site's replication gets "paused", which I suspect exacerbates the issue.
I checked the Geo sites' health today, and the secondary has been unhealthy for 4 hours (at this moment):
Checking the logs within the affected timeframe, I can see the following error:
    {
      "jsonPayload": {
        "class": "Geo::NodeStatusRequestService",
        "component": "gitlab",
        "correlation_id": "a9833ed3283235f36e6d365f843e84b2",
        "error": "Net::ReadTimeout",
        "gitlab_host": "geo.staging-ref.gitlab.com",
        "level": "error",
        "message": "Failed to Net::HTTP::Post to primary url: https://staging-ref.gitlab.com/api/v4/geo/status",
        "subcomponent": "geo"
      }
    }

    {
      "textPayload": "CRON JOB: connection to server at \"10.172.0.3\", port 6432 failed: FATAL: server login has been failing, try again later (server_login_retry)\n",
      ...
    }
These errors are repeated after the Sidekiq pod gets restarted for staging-ref-eu. This only happens sometimes since it eventually gets up-to-date.
Could it be related to the upgrades since we don't follow a zero-downtime procedure in staging-ref? In that context, is 5 hours an acceptable amount of time to have the secondary site in an unhealthy state?
IIUC, the secondary is POSTing its status to the primary, and while the primary is supposed to be sending a response, the secondary hits its own 20s read timeout.
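To make the timeout concrete, a hedged way to reproduce it from the secondary's Rails console (the 20s value and the endpoint come from the log above; `Gitlab::HTTP` is GitLab's HTTParty wrapper, and an unauthenticated request should get a quick 401 when the primary is responsive):

```ruby
require 'benchmark'

# Time a POST to the primary's Geo status endpoint. A fast 401 is the "good"
# outcome here; hanging for ~20s and raising a timeout matches the
# Net::ReadTimeout in the log above.
seconds = Benchmark.realtime do
  response = Gitlab::HTTP.post("https://staging-ref.gitlab.com/api/v4/geo/status", timeout: 20)
  puts "HTTP #{response.code}"
rescue StandardError => e
  puts "Request failed: #{e.class}: #{e.message}"
end
puts "Elapsed: #{seconds.round(1)}s"
```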
Could it be related to the upgrades since we don't follow a zero-downtime procedure in staging-ref?
I think so. The primary is reconfiguring and restarting services, and rotating pods, as part of the upgrade. I'm not sure about the details of that.
In that context, is 5 hours an acceptable amount of time to have the secondary site in an unhealthy state?
It would be helpful for us to reduce it, in order to reduce red herrings, replication failures, and unhealthy Geo statuses during testing.
Do the deduplicated logs go on for hours? If so, then it seems likely that the job deduplication key is getting orphaned, which would cause all subsequent jobs with the same args to be deduplicated until the orphaned key expires.
@niskhakova Ideally we would do zero downtime updates, but given it's not available for Cloud Native Hybrid, WDYT about enabling Maintenance Mode while upgrading? Ideally until upgrading is finished, including the secondary and the Geo playbook. But at least until the primary is finished upgrading.
Or are GitLab QA runs maybe synchronized to avoid making writes while the deployment is upgrading?
If not, this could be a source of bad data, which costs Geo time to troubleshoot but is not really a supported workflow.
And the problem with zero-downtime is that it would take hours to upgrade, while the deploy cadence to Staging is sometimes an hour or less, so the environment would end up lagging behind Staging Canary's upgrade cycle.
If Staging Ref doesn't follow a supported workflow due to its deployment cycle and process, could it be fine to acknowledge and disregard known sync issues in that case? Or perhaps we could add a command to run `Geo::MetricsUpdateWorker.new.perform` after each deployment to bypass that?
However, it's worth calling out that if the future plan is to have Geo nodes for GitLab.com, this deployment cadence problem will persist. Is there some functional way to trigger `Geo::MetricsUpdateWorker.new.perform` more often?
Sorry, I wasn't thinking about Geo status updates. Maintenance Mode wouldn't help with that.
My thought with using Maintenance Mode is to avoid one likely source of bad data:
GitLab QA is constantly and rapidly creating, updating, and deleting resources via the API and web UI. Meanwhile, we are constantly performing upgrades, which restarts frontend and backend services. If various services are restarted while serving write requests (writing to one or more datastores -- Postgres, Redis, Gitaly, container registry, or object storage), then we should expect to produce inconsistent data across those data stores.
If Staging Ref doesn't follow a supported workflow due to its deployment cycle and process, could it be fine to acknowledge and disregard known sync issues in that case?
Unfortunately the bad data can manifest in many different ways, so I don't think there is a way to filter it across the board.
Or perhaps we could add a command to run `Geo::MetricsUpdateWorker.new.perform` after each deployment to bypass that?
I took a look at what is going on with deduplication of Geo::MetricsUpdateWorker.
1. The job fails due to an exception (apparently many different kinds).
2. The job did not clear its `idempotency_key`, so it remains for 6 hours by default.
3. All jobs with the same class and args get deduplicated for 6 hours.
We could work around the immediate `Geo::MetricsUpdateWorker` problem by setting its ttl to e.g. 10 minutes.
This short ttl workaround is safe for this worker, but it isn't safe for jobs such as replication jobs.
All jobs use job deduplication unless otherwise specified, so this problem can affect any job that takes longer than the 25s termination period. In this case, it took longer than 25s seemingly because of a PG-related restart.
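For reference, a sketch of what that ttl workaround might look like on the worker class, using GitLab's Sidekiq worker attributes DSL (the 10-minute value, the deduplication strategy, and the surrounding class body are illustrative assumptions, not the actual change):

```ruby
# ee/app/workers/geo/metrics_update_worker.rb (illustrative sketch only)
module Geo
  class MetricsUpdateWorker
    include ApplicationWorker

    idempotent!
    # Let an orphaned deduplication key expire after 10 minutes instead of the
    # default 6 hours, so one crashed run can't suppress the cron job for hours.
    deduplicate :until_executed, ttl: 10.minutes

    def perform
      # existing metrics update logic
    end
  end
end
```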
However, it's worth calling out that if the future plan is to have Geo nodes for GitLab.com, this deployment cadence problem will persist.
Current options:
1. Fast upgrade, downtime while live, bad data (our current choice)
2. Fast upgrade, downtime, consistent data
3. Slow upgrade, zero downtime, consistent data
Option 1, I assume, produces GitLab QA failures which do not provide any value to us. (Is this true?) And real users using the system during upgrades also receive errors and produce inconsistent data.
Option 2 is doable: use Maintenance Mode or a static page, and send a signal to Sidekiq services.
Option 3 is hard to implement and blocks deploys for too long.
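For option 2, a minimal sketch of toggling Maintenance Mode around an upgrade from the Rails console (this mirrors the documented application setting; the message text is just an example):

```ruby
# Before starting the primary's upgrade: block writes with Maintenance Mode.
::Gitlab::CurrentSettings.update!(
  maintenance_mode: true,
  maintenance_mode_message: "Staging-ref is being upgraded. Back shortly."
)

# After the upgrade (or once the Geo playbook finishes): re-enable writes.
::Gitlab::CurrentSettings.update!(maintenance_mode: false)
```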
For now, I agree that 2 is probably the best solution we have, as long as putting staging-ref into Maintenance Mode during upgrades isn't an issue. Would staging-ref need to stay in Maintenance Mode during the secondary site's upgrade or just the primary's?
For option 3, ZDT isn't supported for Cloud Native Hybrid, but we could use ZDT to upgrade the Omnibus components and then just upgrade the cluster at the end. Would that be a potential solution, or could that still potentially cause issues?
As for the time issues with ZDT, we run this on some Geo pipelines already, and it takes about 1h 40m to 1h 50m per site to finish a ZDT upgrade. With this kind of time frame we would very easily end up with multiple deployments overlapping. At the moment with ZDT we upgrade each component sequentially and upgrade one node at a time. We could potentially upgrade each component in parallel whilst still doing one node at a time; this could improve the speed whilst keeping a level of ZDT.
Thanks for the deep dive and for clarifying what the concern is. As mentioned above, option 2 is possible to implement; however, it might not be desirable from an engineering/user perspective.
GitLab.com's deployment cycle varies from an hour to a few hours between new package deployments. On average it takes 1 hour for Staging Ref to finish its deployment pipeline (example: https://ops.gitlab.net/gitlab-org/quality/gitlab-environment-toolkit-configs/staging-ref/-/pipelines/1984485). If we put it in Maintenance Mode during this time, the environment won't be available for an hour, and then, for example, a new package will be deployed and the environment again won't be accessible. This defeats Staging Ref's purpose of being an available sandbox environment.
Perhaps we can explore speeding up Staging Ref's pipeline, for example by using the reconfigure tag from GET, but I think it will still take about 30 minutes in ideal conditions, so the availability window will be small.
GitLab QA is constantly and rapidly creating, updating, and deleting resources via the API and web UI. Meanwhile, we are constantly performing upgrades, which restarts frontend and backend services.
For now, I agree that 2 is probably the best solution we have, as long as putting staging-ref into Maintenance Mode during upgrades isn't an issue. Would staging-ref need to stay in Maintenance Mode during the secondary site's upgrade or just the primary's?
Good idea. I think Maintenance Mode is only needed during the primary's upgrade, since that is where "canonical" writes occur. The secondary's web and API interfaces send write requests to the primary anyway. Though the primary's upgrade still takes 43 minutes.
For option 3, ZDT isn't supported for Cloud Native Hybrid, but we could use ZDT to upgrade the Omnibus components and then just upgrade the cluster at the end. Would that be a potential solution, or could that still potentially cause issues? As for the time issues with ZDT, we run this on some Geo pipelines already, and it takes about 1h 40m to 1h 50m per site to finish a ZDT upgrade. With this kind of time frame we would very easily end up with multiple deployments overlapping.
It sounds like a good idea though I'm also not sure how exactly it should work in practice. And the duration does sound like a problem.
At the moment with ZDT we upgrade each component sequentially and upgrade one node at a time. We could potentially upgrade each component in parallel whilst still doing one node at a time; this could improve the speed whilst keeping a level of ZDT.
How does GitLab.com handle this frequency of upgrades with zero-downtime? It is a bigger env so I would have assumed it would be slower to upgrade. But the deploys go in, and it has really high uptime.
GitLab.com's deployment cycle varies from an hour to a few hours between new package deployments. On average it takes 1 hour for Staging Ref to finish its deployment pipeline (example: https://ops.gitlab.net/gitlab-org/quality/gitlab-environment-toolkit-configs/staging-ref/-/pipelines/1984485). If we put it in Maintenance Mode during this time, the environment won't be available for an hour, and then, for example, a new package will be deployed and the environment again won't be accessible. This defeats Staging Ref's purpose of being an available sandbox environment.
Perhaps we can explore speeding up Staging Ref's pipeline, for example by using the reconfigure tag from GET, but I think it will still take about 30 minutes in ideal conditions, so the availability window will be small.
Thank you for this context! It is difficult to balance all constraints. We already step on the constraint of "Staging-ref can be used for testing by developers" a bit with upgrades resulting in some amount of:
- Unexpected request failures
- Unexpected background job failures
- Inconsistent data
- Orphaned job deduplication keys or exclusive leases
If we don't use Maintenance Mode, maybe it could help to post a banner during deployments: "Staging-ref is currently being upgraded. Occasional errors may be expected.".
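If we go the banner route, a hedged Rails console sketch using broadcast messages (the model name and attributes are assumptions about the current codebase; this could equally be done through the broadcast messages REST API):

```ruby
# Show a banner for roughly the duration of a deployment. BroadcastMessage and
# broadcast_type are assumptions about the current model; adjust as needed.
BroadcastMessage.create!(
  message: "Staging-ref is currently being upgraded. Occasional errors may be expected.",
  broadcast_type: :banner,
  starts_at: Time.zone.now,
  ends_at: 1.hour.from_now
)
```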
GitLab QA is constantly and rapidly creating, updating, and deleting resources via the API and web UI. Meanwhile, we are constantly performing upgrades, which restarts frontend and backend services.
Ah thank you, this is what I was wondering and hoping for. If QA tests are never run during upgrades, and their associated background jobs finish prior to the next upgrade, then the QA tests are not a source of bad data. I was worried about QA tests because the rate of writes is high, which is an easy way to create bad data during upgrade.
The deployment approach is still susceptible to the problem though. If a user pushes during an upgrade, then the push may fail and may generate bad data.
Could it be that the issue is somewhere else?
Yes, this is a broad problem for Geo. There are a lot of possible sources of bad data.
- Prior bugs that have been fixed
- Unfixed bugs
- Transient infrastructure problems, like containers running out of memory and getting killed
Typically, bad or missing data goes unnoticed, except by the occasional user who perhaps finds a workaround ("the push failed and then I had to push again and it seemed to work").
With Geo, data integrity problems are suddenly highly visible. So all GitLab data integrity problems are unfortunately our problem to an extent.
If we decide to live with known sources of bad data in staging-ref, then Geo has some tough choices:
1. Ignore data integrity problems in staging-ref
2. Spend additional time investigating data integrity problems which turn out to have a known root cause
We have historically chosen option 1 about 95% of the time with staging and staging-ref, because it takes a lot of time to investigate these kinds of problems, and old staging had a ton of known sources of invalid Geo failures and bad data (e.g. most Gitaly data was missing).
IMHO, staging-ref works very well for most developers as-is, including for my own testing of Geo changes. But if possible, I would like to try to eliminate as many sources of bad data as possible to maximize the value of staging-ref to Geo.
If we decide to live with known sources of bad data in staging-ref, then Geo has some tough choices:
1. Ignore data integrity problems in staging-ref
2. Spend additional time investigating data integrity problems which turn out to have a known root cause
We have historically chosen option 1 about 95% of the time with staging and staging-ref, because it takes a lot of time to investigate these kinds of problems, and old staging had a ton of known sources of invalid Geo failures and bad data (e.g. most Gitaly data was missing).
IMHO, staging-ref works very well for most developers as-is, including for my own testing of Geo changes. But if possible, I would like to try to eliminate as many sources of bad data as possible to maximize the value of staging-ref to Geo.
Do you know how to narrow the scope of these sources? Could we break this issue into multiple tasks?
We could work around the immediate `Geo::MetricsUpdateWorker` problem by setting its ttl to e.g. 10 minutes.
What do you think about starting with this one? Although the cause of these job failures still needs to be discovered; it's probably because the primary site becomes unreachable while upgrading.
It would be interesting to know why the secondary site stays unhealthy for so long if the upgrade takes up to one hour, for example. Could it be related to these long-running jobs that you mentioned, @mkozono?
One thing I've noticed that won't be helping prevent issues: during the upgrade we rerun the GitLab Geo playbook. Whilst this is required to pick up any changes to the Geo config, the process reruns the Patroni workaround, which deletes the PostgreSQL data dir on the secondary. This is going to restart replication after every update and should probably be skipped. I'll take a look into doing this today.
How does GitLab.com handle this frequency of upgrades with zero-downtime? It is a bigger env so I would have assumed it would be slower to upgrade. But the deploys go in, and it has really high uptime.
If we don't use Maintenance Mode, maybe it could help to post a banner during deployments: "Staging-ref is currently being upgraded. Occasional errors may be expected.".
The deployment approach is still susceptible to the problem though. If a user pushes during an upgrade, then the push may fail and may generate bad data.
Makes sense. It shouldn't be too big though; it only happens if someone does it ad hoc. No QA automation is involved.
Spend additional time investigating data integrity problems which turn out to have a known root cause
IMHO, staging-ref works very well for most developers as-is, including for my own testing of Geo changes. But if possible, I would like to try to eliminate as many sources of bad data as possible to maximize the value of staging-ref to Geo.
I'm in favour of investigating it further if it's possible. I'm not familiar with this area, but perhaps if the data that's causing issues can be identified, we can understand what's causing this? For example, if the issues happen in a QA-related group, that would indicate that the QA pipelines might actually be causing something, even though it shouldn't be happening.
Do you know how to narrow the scope of these sources? Could we break this issue into multiple tasks?
I'm in favour of investigating it further if it's possible. I'm not familiar with this area, but perhaps if the data that's causing issues can be identified, we can understand what's causing this? For example, if the issues happen in a QA-related group, that would indicate that the QA pipelines might actually be causing something, even though it shouldn't be happening.
My thought is, given that we have identified two notable sources of bad data, we should try to mitigate them:
1. Performing with-downtime upgrades while web/API write requests are being served causes errors during requests.
2. Performing with-downtime upgrades while Sidekiq jobs are being processed causes exceptions during jobs.
In both cases, if the request or job has written to one datastore and is in the middle of writing to another datastore when the exception hits, then inconsistent data has been created.
I'll open issues to track these since they are not trivial, and there are multiple ways of mitigating them.
We could work around the immediate `Geo::MetricsUpdateWorker` problem by setting its ttl to e.g. 10 minutes.
What do you think about starting with this one? Although the cause of these job failures still needs to be discovered; it's probably because the primary site becomes unreachable while upgrading.
It would be interesting to know why the secondary site stays unhealthy for so long if the upgrade takes up to one hour, for example. Could it be related to these long-running jobs that you mentioned, @mkozono?
I believe the cause of `Geo::MetricsUpdateWorker` getting deduplicated for 6 hours is "Performing with-downtime upgrades while Sidekiq jobs are being processed causes exceptions during jobs".
I do think we should set `Geo::MetricsUpdateWorker`'s deduplication `ttl` to 10 or 20 minutes. I've opened #414047 (closed).
@jtapiab Thanks for pushing on this issue! Since the staging-ref rebuild reset the data, I think we should probably close this overly broad issue. WDYT?
We have some good follow-up issues already, and we can open new issues for new failures. For example, there is something going on with designs at the moment, probably somehow related to the SSF migration.