internal_ids table gets out of sync with actual deployments entries
Summary
Customer reported they were getting this error repeatedly when trying to start new pipelines on existing projects:
Completed 500 Internal Server Error in 580ms (ActiveRecord: 89.1ms | Elasticsearch: 0.0ms)
ActiveRecord::RecordNotUnique (PG::UniqueViolation: ERROR: duplicate key value violates unique constraint "index_deployments_on_project_id_and_iid"
DETAIL: Key (project_id, iid)=(13312, 772) already exists.
: INSERT INTO "deployments" ("project_id", "environment_id", "ref", "tag", "sha", "user_id", "status", "deployable_id", "deployable_type", "iid", "created_at", "updated_at") VALUES (13312, 1833, 'scm-test-gitlab-branch', 'f', '58f504ee858efb881ec03d45143f227a7553364a', 59, 0, 1042393, 'CommitStatus', 772, '2019-04-18 22:41:36.526733', '2019-04-18 22:41:36.526733') RETURNING "id"):
config/initializers/active_record_locking.rb:11:in `_create_record'
app/models/concerns/deployable.rb:20:in `create_deployment'
lib/gitlab/metrics/instrumentation.rb:159:in `block in create_deployment'
lib/gitlab/metrics/method_call.rb:34:in `measure'
lib/gitlab/metrics/instrumentation.rb:159:in `create_deployment'
lib/gitlab/ci/pipeline/chain/create.rb:11:in `perform!'
lib/gitlab/ci/pipeline/chain/sequence.rb:19:in `block in build!'
lib/gitlab/ci/pipeline/chain/sequence.rb:16:in `each'
lib/gitlab/ci/pipeline/chain/sequence.rb:16:in `build!'
app/services/ci/create_pipeline_service.rb:44:in `execute'
app/controllers/projects/pipelines_controller.rb:57:in `create'
lib/gitlab/i18n.rb:55:in `with_locale'
lib/gitlab/i18n.rb:61:in `with_user_locale'
app/controllers/application_controller.rb:427:in `set_locale'
lib/gitlab/middleware/multipart.rb:101:in `call'
lib/gitlab/request_profiler/middleware.rb:14:in `call'
ee/lib/gitlab/jira/middleware.rb:15:in `call'
lib/gitlab/middleware/go.rb:17:in `call'
lib/gitlab/etag_caching/middleware.rb:11:in `call'
lib/gitlab/middleware/rails_queue_duration.rb:22:in `call'
lib/gitlab/metrics/rack_middleware.rb:15:in `block in call'
lib/gitlab/metrics/transaction.rb:53:in `run'
lib/gitlab/metrics/rack_middleware.rb:15:in `call'
lib/gitlab/middleware/read_only/controller.rb:40:in `call'
lib/gitlab/middleware/read_only.rb:16:in `call'
lib/gitlab/middleware/basic_health_check.rb:25:in `call'
lib/gitlab/request_context.rb:20:in `call'
lib/gitlab/metrics/requests_rack_middleware.rb:27:in `call'
lib/gitlab/middleware/release_env.rb:10:in `call'
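For context on why this surfaces as a unique violation: deployment iids appear to be handed out from the project's row in internal_ids (usage=2 corresponds to deployments) by atomically bumping last_value and using the returned number as the new iid. If last_value falls behind the actual max(iid) in deployments, the next allocation reuses an existing iid and the insert trips the unique index above. Conceptually the allocation does something like the following (a sketch of what we believe happens, not the literal application code):
update internal_ids
set last_value = last_value + 1
where project_id = 13312 and usage = 2
returning last_value;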
Looking at system logs, we were able to see that every time we tried to start a new pipeline for each of 3 different projects, this error showed up, with the same attempted iid each time (with different source commits) on a per-project basis. We compared the last_value in the internal_ids table to the max value in the deployments table for each project, and got this:
gitlabhq_production=> create temporary table max_iids as select project_id, max(iid) from deployments group by 1; select * from max_iids join internal_ids USING (project_id) WHERE usage=2 AND max_iids.max <> last_value;
SELECT 209
project_id | max | id | usage | last_value | namespace_id
------------+------+-----+-------+------------+--------------
5039 | 553 | 179 | 2 | 552 |
13312 | 763 | 660 | 2 | 762 |
11711 | 3015 | 55 | 2 | 3014 |
(3 rows)
We had them bump the last_value in each project by one, to match the actual max, and pipelines started working again for the three projects.
update internal_ids set last_value=553 where project_id=5039 and usage=2;
update internal_ids set last_value=763 where project_id=13312 and usage=2;
update internal_ids set last_value=3015 where project_id=11711 and usage=2;
Now they have come back again today with 8 projects showing the same behavior: 2 are repeat projects from yesterday and 6 are new.
gitlabhq_production=> select * from max_iids join internal_ids USING (project_id) WHERE usage=2 AND max_iids.max > last_value;
project_id | max | id | usage | last_value | namespace_id
------------+------+-----+-------+------------+--------------
73 | 2187 | 68 | 2 | 2186 |
10861 | 783 | 23 | 2 | 782 |
11711 | 3051 | 55 | 2 | 3050 |
12398 | 291 | 70 | 2 | 290 |
12463 | 115 | 154 | 2 | 114 |
13242 | 183 | 183 | 2 | 182 |
13310 | 469 | 350 | 2 | 468 |
13312 | 773 | 660 | 2 | 771 |
(8 rows)
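If we end up unsticking this batch the same way, the per-project updates above generalize to one statement (a sketch only, on the assumption that setting last_value to the current max(iid) for usage=2 is the right correction, as it was yesterday; we would verify the affected rows first):
update internal_ids i
set last_value = m.max_iid
from (select project_id, max(iid) as max_iid from deployments group by 1) m
where m.project_id = i.project_id and i.usage = 2 and i.last_value < m.max_iid;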
They have a standby PG database but have had no failovers. We have PG logs that show the iid errors, but nothing indicating how the table got into this state. They do not have Sidekiq log arguments enabled.
What other data do we need to get to the bottom of this? This is a 7000+ user customer.
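One data point that might help narrow the window: for each affected project, the deployments rows whose iid is already above the current last_value, together with their timestamps, so we can see when the out-of-band rows were written and correlate with their logs. A sketch (assumes created_at reflects insert time):
select d.project_id, d.iid, d.id, d.created_at
from deployments d
join internal_ids i on i.project_id = d.project_id and i.usage = 2
where d.iid > i.last_value
order by d.project_id, d.iid;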