`Adding dead ActiveJob::QueueAdapters::SidekiqAdapter::JobWrapper job` due to `Net::ReadTimeout` errors, causing problems with outgoing emails in 15.10.3 (likely due to mail gem 2.8.1)
Summary
In 15.10.2 we did Update mail gem to v2.8.1. Something in the line of that dependency seems to be causing an issue with outgoing emails for some customers. It appears that the Sidekiq job doesn't register the successful sending for some reason, and performs all three retries of the job, resulting in four sent emails total. (Similar behavior was observed in #231353 (closed) a while ago.)
🎫 #404862 (internal)
We recently upgraded our Omnibus GitLab instance to version 15.10.3 and our Sidekiq mailers queue began to fail We believe that we have traced the failure back to a reset command being sent from the GitLab server [to our SMTP relay].
Detailed configuration, Sidekiq log excerpt, and a screenshot of the RSET
command they're referring to in a SMTP connection log available in the ticket.
🎫 #400698 (internal)
After performing an upgrade from 15.10.1 to GitLab version 15.10.3 we have noticed a regression. Upon a pipeline failure Gitlab sends 4 identical emails. We have identified several WARN records in Sidekiq log files that point to a problem in ActionMailer::MailDeliveryJob.
Detailed configuration, full SOS report as well as email headers available in the ticket.
What is the current bug behavior?
Multiple emails are being sent.
What is the expected correct behavior?
Only a single email is being sent.
Relevant logs and/or screenshots
{"severity":"INFO","time":"2023-05-03T09:21:46.063Z","message":"Delivered mail note_1234@REDACTED (5024.5ms)"}
{"severity":"INFO","time":"2023-05-03T09:21:46.064Z","message":"Adding dead ActiveJob::QueueAdapters::SidekiqAdapter::JobWrapper job 9b9e820b8415a20130033e63"}
{"severity":"WARN","time":"2023-05-03T09:21:46.074Z","retry":3,"queue":"mailers","class":"ActionMailer::MailDeliveryJob","args":["[FILTERED]"],"jid":"9b9e820b8415a20130033e63","created_at":"2023-05-03T09:19:58.686Z","correlation_id":"08c06236ca4f0825dda2916e3c2affd5","meta.caller_id":"NewNoteWorker","meta.remote_ip":"REDACTED","meta.feature_category":"team_planning","meta.user":"REDACTED","meta.user_id":REDACTED,"meta.project":"REDACTED","meta.root_namespace":"REDACTED","meta.client_id":"REDACTED","meta.root_caller_id":"Projects::NotesController#create","worker_data_consistency":"always","idempotency_key":"resque:gitlab:duplicate:mailers:59bfe5710f3b205756ebd2dd2cded4a1351ef635c593ffd3bb11f590d7c36f1b","size_limiter":"validated","enqueued_at":"2023-05-03T09:21:40.766Z","error_message":"Net::ReadTimeout with #\u003cSocket:(closed)\u003e","error_class":"Net::ReadTimeout","failed_at":"2023-05-03T09:20:04.339Z","retry_count":2,"retried_at":"2023-05-03T09:21:08.433Z","job_size_bytes":394,"pid":533,"message":"ActionMailer::MailDeliveryJob JID-9b9e820b8415a20130033e63: fail: 5.305325 sec","job_status":"fail","scheduling_latency_s":0.001642,"gitaly_calls":6,"gitaly_duration_s":0.035136,"redis_calls":3,"redis_duration_s":0.002132,"redis_read_bytes":1244,"redis_write_bytes":254,"redis_cache_calls":1,"redis_cache_duration_s":0.000294,"redis_cache_read_bytes":1136,"redis_cache_write_bytes":45,"redis_queues_calls":1,"redis_queues_duration_s":0.001572,"redis_queues_read_bytes":1,"redis_queues_write_bytes":123,"redis_repository_cache_calls":1,"redis_repository_cache_duration_s":0.000266,"redis_repository_cache_read_bytes":107,"redis_repository_cache_write_bytes":86,"db_count":42,"db_write_count":3,"db_cached_count":8,"db_replica_count":0,"db_primary_count":42,"db_main_count":42,"db_main_replica_count":0,"db_replica_cached_count":0,"db_primary_cached_count":8,"db_main_cached_count":8,"db_main_replica_cached_count":0,"db_replica_wal_count":0,"db_primary_wal_count":0,"db_main_wal_count":0,"db_main_replica_wal_count":0,"db_replica_wal_cached_count":0,"db_primary_wal_cached_count":0,"db_main_wal_cached_count":0,"db_main_replica_wal_cached_count":0,"db_replica_duration_s":0.0,"db_primary_duration_s":0.037,"db_main_duration_s":0.037,"db_main_replica_duration_s":0.0,"cpu_s":0.213799,"mem_objects":155258,"mem_bytes":23733176,"mem_mallocs":234932,"mem_total_bytes":29943496,"worker_id":"sidekiq_0","rate_limiting_gates":[],"duration_s":5.305325,"completed_at":"2023-05-03T09:21:46.073Z","load_balancing_strategy":"primary","exception.class":"Net::ReadTimeout","exception.message":"Net::ReadTimeout with #\u003cSocket:(closed)\u003e","exception.backtrace":["lib/gitlab/database/load_balancing/sidekiq_server_middleware.rb:26:in `call'","lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/until_executing.rb:16:in `perform'","lib/gitlab/sidekiq_middleware/duplicate_jobs/duplicate_job.rb:44:in `perform'","lib/gitlab/sidekiq_middleware/duplicate_jobs/server.rb:8:in `call'","lib/gitlab/sidekiq_middleware/worker_context.rb:9:in `wrap_in_optional_context'","lib/gitlab/sidekiq_middleware/worker_context/server.rb:19:in `block in call'","lib/gitlab/application_context.rb:115:in `block in use'","lib/gitlab/application_context.rb:115:in `use'","lib/gitlab/application_context.rb:55:in `with_context'","lib/gitlab/sidekiq_middleware/worker_context/server.rb:17:in `call'","lib/gitlab/sidekiq_status/server_middleware.rb:7:in `call'","lib/gitlab/sidekiq_versioning/middleware.rb:9:in `call'","lib/gitlab/sidekiq_middleware/query_analyzer.rb:7:in `block in call'","lib/gitlab/database/query_analyzer.rb:37:in `within'","lib/gitlab/sidekiq_middleware/query_analyzer.rb:7:in `call'","lib/gitlab/sidekiq_middleware/admin_mode/server.rb:14:in `call'","lib/gitlab/sidekiq_middleware/instrumentation_logger.rb:9:in `call'","lib/gitlab/sidekiq_middleware/batch_loader.rb:7:in `call'","lib/gitlab/sidekiq_middleware/extra_done_log_metadata.rb:7:in `call'","lib/gitlab/sidekiq_middleware/request_store_middleware.rb:10:in `block in call'","lib/gitlab/with_request_store.rb:17:in `enabling_request_store'","lib/gitlab/with_request_store.rb:10:in `with_request_store'","lib/gitlab/sidekiq_middleware/request_store_middleware.rb:9:in `call'","lib/gitlab/sidekiq_middleware/server_metrics.rb:76:in `block in call'","lib/gitlab/sidekiq_middleware/server_metrics.rb:103:in `block in instrument'","lib/gitlab/metrics/background_transaction.rb:33:in `run'","lib/gitlab/sidekiq_middleware/server_metrics.rb:103:in `instrument'","lib/gitlab/sidekiq_middleware/server_metrics.rb:75:in `call'","lib/gitlab/sidekiq_middleware/monitor.rb:10:in `block in call'","lib/gitlab/sidekiq_daemon/monitor.rb:46:in `within_job'","lib/gitlab/sidekiq_middleware/monitor.rb:9:in `call'","lib/gitlab/sidekiq_middleware/size_limiter/server.rb:13:in `call'","lib/gitlab/sidekiq_logging/structured_logger.rb:21:in `call'"],"db_duration_s":0.036123}
Output of checks
Checks are fine, no observable issues besides the errors. One customer is still on 15.10.3, the other upgraded to 15.11 by now, but the problem persists.