Gitter incident 2019-3-26

Gitter infra docs: https://gitlab.com/gitlab-com/gl-infra/gitter-infrastructure

Cause

See https://gitlab.com/gitlab-com/gl-infra/production/issues/749

Because this issue could be used to DoS Gitter, I have made it private for now

Reports:

Users saw repeated messages, and some messages did not go through.

  • https://gitter.im/angular/angular?at=5c9a171752c7a91455fb4044
  • https://gitter.im/gitterHQ/gitter?at=5c9a3934cf786a56c77f8b33
  • https://twitter.com/joshis_tweets/status/1110530663252480000
  • https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6466#note_154307919

Alerts

  • Start of PagerDuty weirdness, https://gitlab.slack.com/archives/CB2HUAY7L/p1553610781001600

https://gitter.pagerduty.com/incidents/P7CPTXR

mongo-replica-01: loadavg(5min) of 48.9 matches resource limit [loadavg(5min)<2.0]

https://gitter.pagerduty.com/incidents/P0MTS7C

High CPU usage on host:mongo-replica-01

https://gitter.pagerduty.com/incidents/P8BXQW9
https://gitter.pagerduty.com/incidents/POJTB79

webapp-01: Profiling CPU usage
/tmp/tmp.QJlXopa997.deploy-tools-profile
kernel.kptr_restrict = 0
kernel.perf_event_paranoid = 0
profile-service-cpu failed

Profiling memory usage
/var/log/gitter/heap.webapp-01.gitter-api-1.8972.2019-03-26-143522.heapsnapshot
Uploaded new snapshot to https://gitter-diagnostics.s3.amazonaws.com/heap.webapp-01.gitter-api-1.8972.2019-03-26-143522.heapsnapshot.gz?xxx

mongo-replica-01 CPU spike

Spike in mongo-replica-01 CPU, see dashboard: https://app.datadoghq.com/dashboard/qw5-aw8-ah7/cpu-usage-across-all-servers?tile_size=s&page=0&is_auto=false&from_ts=1553439600000&to_ts=1553612400000&live=true

Mongo system load spike

Everything MongoDB dashboard, https://app.datadoghq.com/dashboard/whz-6ej-pd7/everything-mongodb?tile_size=m&page=0&is_auto=false&from_ts=1553443200000&to_ts=1553616000000&live=true

There is a spike in system load on the Mongo servers, but the start of it doesn't line up with the other metrics on that dashboard.

Mongo logs

$ ssh mongo-replica-01.prod.gitter
$ ssh mongo-replica-02.prod.gitter
$ ssh mongo-replica-03.prod.gitter

$ less +G /var/log/mongodb/mongod.log
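Rather than paging through the whole file, the slow entries can be pulled out programmatically. A minimal sketch, assuming the mongod 3.x plain-text log format where each `COMMAND` line ends with its duration in ms (the sample lines below are trimmed stand-ins, not real log data):

```python
import re

# Hypothetical sample lines; payloads trimmed to "{ ... }" for brevity.
SAMPLE_LOG = """\
2019-03-26T11:30:24.990+0000 I COMMAND  [conn255446565] command gitter.troupeusers command: aggregate { ... } protocol:op_query 14042ms
2019-03-26T11:30:25.101+0000 I COMMAND  [conn255446566] command gitter.users command: find { ... } protocol:op_query 12ms
"""

# timestamp, connection id, namespace, trailing duration
LINE_RE = re.compile(r"^(\S+) I COMMAND\s+\[(\S+)\] command (\S+) .* (\d+)ms$")

def slow_commands(log_text, threshold_ms=1000):
    """Yield (timestamp, namespace, duration_ms) for commands at/over threshold."""
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if m and int(m.group(4)) >= threshold_ms:
            yield m.group(1), m.group(3), int(m.group(4))

for ts, ns, ms in slow_commands(SAMPLE_LOG):
    print(ts, ns, f"{ms}ms")
```

Running this over the real `/var/log/mongodb/mongod.log` (e.g. via `ssh ... 'cat /var/log/mongodb/mongod.log' | python3 ...`) is how the counts below could be reproduced.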


# Access mongo
$ mongo mongo-replica-01.prod.gitter
> use gitter

Lots of large gitter.troupeusers aggregate commands

I see a bunch (6,695) of these kinds of log messages from 11:30 (5:30am GMT-6) to 14:34 (8:34am GMT-6) on mongo-replica-01, which corresponds to when the Mongo CPU alerts started popping 🤔. There is no blame or finger pointing here, just trying to understand the logs

  • User 56e79a2b85d51f252ab8d593: Now ghosted deleted user, see GitHub username (private snippet) https://gitlab.com/gitlab-com/gl-infra/production/snippets/1839147
  • Troupe 56e8d92685d51f252ab906dd: https://gitter.im/russianidiot/cached.py
  • Troupe 56e8d6f185d51f252ab90688: https://gitter.im/russianidiot/russianidiot/assert_exists.py
  • Troupe 56e8d96f85d51f252ab906ea: https://gitter.im/russianidiot/classifiers.py
  • Troupe ...
  • Group 57542d96c43b8c601977e29b: https://gitter.im/russianidiot/home (https://github.com/russianidiot)
2019-03-26T11:30:24.990+0000 I COMMAND  [conn255446565] command gitter.troupeusers command: aggregate { aggregate: "troupeusers", pipeline: [ { $match: { userId: ObjectId('56e79a2b85d51f252ab8d593'), troupeId: { $ne: ObjectId('56e8d92685d51f252ab906dd') } } }, { $project: { _id: 0, troupeId: 1 } }, { $lookup: { from: "troupes", localField: "troupeId", foreignField: "_id", as: "troupe" } }, { $unwind: "$troupe" }, { $match: { troupe.groupId: ObjectId('57542d96c43b8c601977e29b') } }, { $limit: 1 }, { $project: { _id: 0, exists: "$troupe.groupId" } } ] } keyUpdates:0 writeConflicts:0 numYields:7192 reslen:96 locks:{ Global: { acquireCount: { r: 14392 } }, Database: { acquireCount: { r: 7196 } }, Collection: { acquireCount: { r: 7196 } } } protocol:op_query 14042ms

...

2019-03-26T14:34:59.196+0000 I COMMAND  [conn255446001] command gitter.troupeusers command: aggregate { aggregate: "troupeusers", pipeline: [ { $match: { userId: ObjectId('56e79a2b85d51f252ab8d593'), troupeId: { $ne: ObjectId('571d96ee659847a7aff459cd') } } }, { $project: { _id: 0, troupeId: 1 } }, { $lookup: { from: "troupes", localField: "troupeId", foreignField: "_id", as: "troupe" } }, { $unwind: "$troupe" }, { $match: { troupe.groupId: ObjectId('57542d96c43b8c601977e29b') } }, { $limit: 1 }, { $project: { _id: 0, exists: "$troupe.groupId" } } ] } keyUpdates:0 writeConflicts:0 numYields:7479 reslen:68 locks:{ Global: { acquireCount: { r: 14964 } }, Database: { acquireCount: { r: 7482 }, acquireWaitCount: { r: 22 }, timeAcquiringMicros: { r: 533680 } }, Collection: { acquireCount: { r: 7482 } } } protocol:op_query 68228ms
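Reading these lines: `numYields:7192` and ~14k read-lock acquisitions suggest the server walked thousands of documents for a single aggregate, and durations grew from 14s to 68s as the replica fell behind. The likely reason it is expensive: the second `$match` (on `troupe.groupId`) depends on the `$lookup` output, so the `$limit: 1` cannot prune the join, and in the worst case every membership row for the user gets joined. A toy in-memory model of that pipeline shape (invented data, not the server's execution engine):

```python
# Toy model of the logged pipeline. Field names mirror the log; data is made up.
# The point: $limit sits *after* the per-document $lookup/$match, so the join
# runs once per membership row until a match is found (or all rows exhausted).

lookups = 0

def lookup_troupe(troupes_by_id, troupe_id):
    """Stand-in for the $lookup stage; counts how often the join runs."""
    global lookups
    lookups += 1
    return troupes_by_id.get(troupe_id)

def run_pipeline(troupeusers, troupes_by_id, user_id, excluded_troupe_id, group_id):
    results = []
    # $match: every membership row for this user except one room
    matched = [tu for tu in troupeusers
               if tu["userId"] == user_id and tu["troupeId"] != excluded_troupe_id]
    for tu in matched:                                  # $lookup + $unwind, per row
        troupe = lookup_troupe(troupes_by_id, tu["troupeId"])
        if troupe and troupe["groupId"] == group_id:    # $match on troupe.groupId
            results.append({"exists": troupe["groupId"]})
            if len(results) == 1:                       # $limit: 1
                break
    return results

# A user in 5,000 rooms where only the *last* room belongs to the target group:
troupes_by_id = {i: {"groupId": "g1" if i == 4999 else "g0"} for i in range(5000)}
troupeusers = [{"userId": "u1", "troupeId": i} for i in range(5000)]
out = run_pipeline(troupeusers, troupes_by_id, "u1", -1, "g1")
print(len(out), lookups)  # one result, but 5000 joins performed
```

For a ghosted/deleted user still attached to thousands of rooms (as above), each such request does work proportional to the room count, which matches the sustained CPU load on mongo-replica-01.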

In-application metrics

https://app.datadoghq.com/dashboard/8mb-66i-q2b/in-application-metrics?tile_size=m&page=0&is_auto=false&from_ts=1553443200000&to_ts=1553616000000&live=true

There is a spike in chat messages sent that corresponds with the end of the CPU/load spike but not the beginning

Edited Aug 03, 2020 by 🤖 GitLab Bot 🤖