Add `AvoidIoRead` Cop that flags memory intensive IO calls (!27590) · Merge requests · GitLab.org / GitLab

What does this MR do?

Closes #209922 (closed)

Reading files or IO streams into memory in their entirety (i.e. until EOF) can have adverse effects on memory consumption and GC behavior, so we should avoid doing this for files exceeding, say, a few hundred KB in size.

This cop is somewhat aggressive in that a lot of calls to e.g. File.read are harmless if the file is small (e.g. a YAML config), so it has a relatively high chance of identifying false positives. Moreover, not all calls to IO.read can be done line-by-line since sometimes you need the entire structure in memory (e.g when making it available as a Ruby hash.)

To mitigate that I already excluded the following files and usages from being linted by this cop:

config files
any tests or test related files
any custom scripts
rake tasks
packages where we commonly read via other types, e.g. Redis
calls to read and readlines that take a path based on Rails.root (we can assume these are not massive files.)

On the other hand, the few calls that are true positives could potentially do significant damage to our resource footprint (as explained in the ticket, it's not just the raw memory that's allocated which is problematic, but also the effect that it has on future GC runs.)

I believe the cop detected two cases which seem legit, one of which we already knew was a problem (the legacy importer) and which is being tackled in &2734)

The other match that caught my eye was also in the import module, in lfs_restorer: https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/import_export/lfs_restorer.rb#L69-79

I'm not sure what this is doing, i.e. whether it's just restoring smallish amounts of metadata or actualy large files, but the fact that it's LFS related might be worth looking at this a bit closer. It also looks like that JSON file is only read for a single attribute, then discarded again.

Alternative approaches to this or using a Cop in general:

Merely document this somewhere for now. Pros: least disruptive. Cons: less direct nudge, devs might not know about it or decide to ignore it.
Only enable it for modules where we know this could get out of hand (such as lib/import). Pros: reduces noise in areas where it's unlikely to be a problem. Cons: if we add a new module where it's desirable to run, we will most likely forget to include it.
Patch File.read[lines] to perform a size check first; if the size is above a given threshold, fail fast with an error. Pros: simple to implement and would let most cases pass. Cons: It's impossible to know upfront when this might make something fall over, esp. when dynamic data is involved such as with imports.

Please let me know what your thoughts are!

Screenshots

Does this MR meet the acceptance criteria?

Conformity

[-] Changelog entry
[-] Documentation (if required)
Code review guidelines
[-] Merge request performance guidelines
Style guides
[-] Database guides
[-] Separation of EE specific content

Edited May 31, 2022 by 🤖 GitLab Bot 🤖

Add `AvoidIoRead` Cop that flags memory intensive IO calls

What does this MR do?

Screenshots

Does this MR meet the acceptance criteria?

Conformity

Merge request reports