Gitter incident 2019-3-26
Gitter infra docs: https://gitlab.com/gitlab-com/gl-infra/gitter-infrastructure
Cause
See https://gitlab.com/gitlab-com/gl-infra/production/issues/749
Because this could be used to DoS Gitter, I have made it private for now.
Reports:
Users seeing repeated messages and messages sometimes not going through.
- https://gitter.im/angular/angular?at=5c9a171752c7a91455fb4044
- https://gitter.im/gitterHQ/gitter?at=5c9a3934cf786a56c77f8b33
- https://twitter.com/joshis_tweets/status/1110530663252480000
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6466#note_154307919
Alerts
- Start of PagerDuty weirdness, https://gitlab.slack.com/archives/CB2HUAY7L/p1553610781001600
https://gitter.pagerduty.com/incidents/P7CPTXR
mongo-replica-01: loadavg(5min) of 48.9 matches resource limit [loadavg(5min)<2.0]
https://gitter.pagerduty.com/incidents/P0MTS7C
High CPU usage on host:mongo-replica-01
https://gitter.pagerduty.com/incidents/P8BXQW9
https://gitter.pagerduty.com/incidents/POJTB79
webapp-01: Profiling CPU usage
/tmp/tmp.QJlXopa997.deploy-tools-profile
kernel.kptr_restrict = 0
kernel.perf_event_paranoid = 0
profile-service-cpu failed
Profiling memory usage
/var/log/gitter/heap.webapp-01.gitter-api-1.8972.2019-03-26-143522.heapsnapshot
Uploaded new snapshot to https://gitter-diagnostics.s3.amazonaws.com/heap.webapp-01.gitter-api-1.8972.2019-03-26-143522.heapsnapshot.gz?xxx
mongo-replica-01 CPU spike
Spike in mongo-replica-1 CPU dashboard, https://app.datadoghq.com/dashboard/qw5-aw8-ah7/cpu-usage-across-all-servers?tile_size=s&page=0&is_auto=false&from_ts=1553439600000&to_ts=1553612400000&live=true
Mongo system load spike
Everything MongoDB dashboard, https://app.datadoghq.com/dashboard/whz-6ej-pd7/everything-mongodb?tile_size=m&page=0&is_auto=false&from_ts=1553443200000&to_ts=1553616000000&live=true
Spike in system load on the Mongo servers, but the start of the spike doesn't line up with the other metrics on that dashboard.
Mongo logs
$ ssh mongo-replica-01.prod.gitter
$ ssh mongo-replica-02.prod.gitter
$ ssh mongo-replica-03.prod.gitter
$ less +G /var/log/mongodb/mongod.log
# Access mongo
$ mongo mongo-replica-01.prod.gitter
> use gitter
Lots of large gitter.troupeusers aggregate commands
I see a bunch (6,695) of these log messages from 11:30 (5:30am GMT-6) to 14:34 (8:34am GMT-6) on mongo-replica-01, which corresponds to when the Mongo CPU alerts started firing.
- User
  - 56e79a2b85d51f252ab8d593: Now ghosted/deleted user, see GitHub username (private snippet) https://gitlab.com/gitlab-com/gl-infra/production/snippets/1839147
- Troupe
  - 56e8d92685d51f252ab906dd: https://gitter.im/russianidiot/cached.py
  - 56e8d6f185d51f252ab90688: https://gitter.im/russianidiot/russianidiot/assert_exists.py
  - 56e8d96f85d51f252ab906ea: https://gitter.im/russianidiot/classifiers.py
  - ...
- Group
  - 57542d96c43b8c601977e29b: https://gitter.im/russianidiot/home (https://github.com/russianidiot)
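The 6,695 count above can be reproduced by scanning mongod.log for these troupeusers aggregate commands. A minimal sketch; the log path and line format come from the entries in this doc, while the regex and the 1000ms threshold are assumptions (mongod logs any operation over its slowms setting, 100ms by default):

```python
import re

# Matches mongod COMMAND log lines for gitter.troupeusers aggregates and
# captures the operation duration (the trailing "NNNNms") at end of line.
SLOW_AGG = re.compile(
    r"I COMMAND .* command gitter\.troupeusers command: aggregate .* (\d+)ms$"
)

def count_slow_aggregates(lines, threshold_ms=1000):
    """Return (count over threshold, all matched durations in ms)."""
    durations = [
        int(m.group(1))
        for line in lines
        if (m := SLOW_AGG.search(line))
    ]
    return sum(1 for d in durations if d >= threshold_ms), durations

# Example against a single line in the format seen in mongod.log
# (pipeline elided):
sample = (
    "2019-03-26T11:30:24.990+0000 I COMMAND  [conn255446565] command "
    'gitter.troupeusers command: aggregate { aggregate: "troupeusers", '
    "pipeline: [ ... ] } protocol:op_query 14042ms"
)
count, durations = count_slow_aggregates([sample])
```

Run it over `/var/log/mongodb/mongod.log` on mongo-replica-01 to get the per-command durations as well as the raw count.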
2019-03-26T11:30:24.990+0000 I COMMAND [conn255446565] command gitter.troupeusers command: aggregate { aggregate: "troupeusers", pipeline: [ { $match: { userId: ObjectId('56e79a2b85d51f252ab8d593'), troupeId: { $ne: ObjectId('56e8d92685d51f252ab906dd') } } }, { $project: { _id: 0, troupeId: 1 } }, { $lookup: { from: "troupes", localField: "troupeId", foreignField: "_id", as: "troupe" } }, { $unwind: "$troupe" }, { $match: { troupe.groupId: ObjectId('57542d96c43b8c601977e29b') } }, { $limit: 1 }, { $project: { _id: 0, exists: "$troupe.groupId" } } ] } keyUpdates:0 writeConflicts:0 numYields:7192 reslen:96 locks:{ Global: { acquireCount: { r: 14392 } }, Database: { acquireCount: { r: 7196 } }, Collection: { acquireCount: { r: 7196 } } } protocol:op_query 14042ms
...
2019-03-26T14:34:59.196+0000 I COMMAND [conn255446001] command gitter.troupeusers command: aggregate { aggregate: "troupeusers", pipeline: [ { $match: { userId: ObjectId('56e79a2b85d51f252ab8d593'), troupeId: { $ne: ObjectId('571d96ee659847a7aff459cd') } } }, { $project: { _id: 0, troupeId: 1 } }, { $lookup: { from: "troupes", localField: "troupeId", foreignField: "_id", as: "troupe" } }, { $unwind: "$troupe" }, { $match: { troupe.groupId: ObjectId('57542d96c43b8c601977e29b') } }, { $limit: 1 }, { $project: { _id: 0, exists: "$troupe.groupId" } } ] } keyUpdates:0 writeConflicts:0 numYields:7479 reslen:68 locks:{ Global: { acquireCount: { r: 14964 } }, Database: { acquireCount: { r: 7482 }, acquireWaitCount: { r: 22 }, timeAcquiringMicros: { r: 533680 } }, Collection: { acquireCount: { r: 7482 } } } protocol:op_query 68228ms
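Pretty-printed, the query in those log entries is a "does this user belong to any *other* room in this group?" check. A sketch of the same stages as a pymongo-style pipeline (plain strings stand in for the ObjectIds, copied from the first log line, to keep it dependency-free). Note the stage order: the groupId filter only runs after the $lookup/$unwind, so every one of the user's memberships pays for a join into troupes before the group filter can discard it, which is consistent with the numYields:7192 and 14-68s durations above:

```python
# IDs copied from the first slow log entry; ObjectId(...) in the real query.
user_id = "56e79a2b85d51f252ab8d593"
troupe_id = "56e8d92685d51f252ab906dd"
group_id = "57542d96c43b8c601977e29b"

pipeline = [
    # All of the user's room memberships except the room being checked.
    {"$match": {"userId": user_id, "troupeId": {"$ne": troupe_id}}},
    {"$project": {"_id": 0, "troupeId": 1}},
    # Join each membership to its room document in troupes...
    {"$lookup": {"from": "troupes", "localField": "troupeId",
                 "foreignField": "_id", "as": "troupe"}},
    {"$unwind": "$troupe"},
    # ...and only then filter by group.
    {"$match": {"troupe.groupId": group_id}},
    {"$limit": 1},
    {"$project": {"_id": 0, "exists": "$troupe.groupId"}},
]
```

A cheaper shape (not what the code did) would be to filter troupes by groupId first, or to match against a precomputed list of the group's room ids, so the per-membership $lookup disappears.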
In-application metrics
There is a spike in chat messages sent that lines up with the end of the CPU/load spike, but not with its start.


