MR discussion diffs persists and reads too much data from Redis

Problem

During gitlab-com/gl-infra/scalability#418 (comment 371299552) investigation I've noticed that Projects::MergeRequestsController#discussions is the top Redis cache reader (taking around 45% in the context of the top 3 actions), reading an average of 20gb per day.

Looking at the last 2 hours for instance, I can see a single project constantly reading 20MB, taking 6 seconds avg for each request:

source

This negatively impacts our Redis cache instances considering we'll read as long as these MR discussions goes.

Proposals

Persist in DB and Redis just what is needed

Currently we persist the discussion diff in the DB, then cache its highlighted version in Redis.

When persisting this diff we take everything from very first line of the diff until the commented line. So imagine that if a MR has 200 comments in the middle of big files, we'll end up with a considerable amount of data persisted.

That happens because it's not simple to rebuild the diff file if it doesn't have the diff header (see how we parse diffs). But, if we rebuild the diff with a custom header, trim every other line (leaving just a few, say, 5 lines in total) like we do with suggestions, then persist, we'll then be able to feed that into the same parser mentioned earlier. As an outcome:

Reduce the amount of data persisted in DB
Reduce the IO with Redis and DB
Improve response time in the process

Hard limit number of discussions

Currently we don't have limits for how much is loaded from Projects::MergeRequestsController#discussions, meaning that one could send an indefinite amount of discussions in MRs put pressure in a considerable amount of services at once.

Status on Close

The original problem of persisting/reading too much data from Redis was mitigated by introducing compression. The long term challenge of being more efficient with the data (and then removing the need for compression) will be addressed in #247497.

Edited Sep 10, 2020 by Rachel Nienaber