Document memory performance implications of `IO.read[lines]`
## Problem to solve
Replaces #209922 (closed)
Several functions on `File` and `IO` read the source in its entirety (i.e. until hitting EOF), which can lead to memory bloat if the file is large. For instance, `File.readlines` returns an array where each element points to a line in the file, so on top of the actual file contents, Ruby needs to maintain the array and object references to each line.

Here's a small program demonstrating the impact of reading a large file via `readlines`:
```ruby
pid = Process.pid
GC.start
File.readlines('gitlabhq_export.tar.gz').each { |l| nil }
pp GC.stat
puts `ps -o rss -p #{pid}`
```
where `gitlabhq_export.tar.gz` is 752MB on disk.
Output:

```
{:count=>48,
 :heap_allocated_pages=>5758,
 :heap_sorted_length=>7311,
 :heap_allocatable_pages=>1553,
 :heap_available_slots=>2346961,
 :heap_live_slots=>2346805,
 :heap_free_slots=>156,
 :heap_final_slots=>0,
 :heap_marked_slots=>2245736,
 :heap_eden_pages=>5758,
 :heap_tomb_pages=>0,
 :total_allocated_pages=>5758,
 :total_freed_pages=>0,
 :total_allocated_objects=>2398550,
 :total_freed_objects=>51745,
 :malloc_increase_bytes=>26861816,
 :malloc_increase_bytes_limit=>33554432,
 :minor_gc_count=>32,
 :major_gc_count=>16,
 :remembered_wb_unprotected_objects=>218,
 :remembered_wb_unprotected_objects_limit=>436,
 :old_objects=>2245494,
 :old_objects_limit=>4070658,
 :oldmalloc_increase_bytes=>60416640,
 :oldmalloc_increase_bytes_limit=>75361551}
 RSS
783904
```
RSS is reported in kB, so memory grew beyond the file size to ~783MB. Moreover, note how `malloc_increase_bytes` jumped to ~27MB. This value specifies how much additional application heap space will be requested from the OS the next time the VM runs out of heap space. It depends on how long the app has been running and on its allocation history, so in an app whose GC has matured over hours or days the jump will be smaller, but that is still a good chunk of memory.
Moreover, we had to instantiate a significant number of objects to represent the lines in the file, as is evident from the ~2.3M object slots recorded in `heap_live_slots`.
We need to make developers aware of this.
## Proposal
We should call out these inefficiencies in our performance guidelines and give recommendations for how to address them.
For instance, it is much more efficient to consume the IO line by line, discarding each line once it has been processed, e.g.:
```ruby
# IO#gets returns nil at EOF (IO#readline would raise EOFError instead)
while line = file.gets
  # process line
end
```
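Note that the example file above is a gzipped tarball, where lines are not meaningful; for binary data the same memory bound can be achieved by reading fixed-size chunks. A sketch, where the 16 KB chunk size and the temp file are arbitrary assumptions:

```ruby
require 'tempfile'

# Hypothetical binary file for the demo.
tmp = Tempfile.new('chunk_demo')
tmp.binmode
tmp.write("\x00" * 50_000)
tmp.close

total = 0
File.open(tmp.path, 'rb') do |f|
  # IO#read(length) returns at most `length` bytes, or nil at EOF,
  # so only one chunk is resident in memory at a time.
  while chunk = f.read(16 * 1024)
    total += chunk.bytesize
  end
end

puts total # => 50000
tmp.unlink
```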