Pre-receive Secret Detection: Run regex operations within subprocess
Context
During the recent spike evaluating regex engines for Pre-receive Secret Detection, Ruby with the RE2 library came out on top of the list, so it is our choice.
Although Ruby's regex performance is acceptable, limitations of the language runtime could lead to pitfalls such as the following:
- Memory: Regex operations generally require ~2x the blob's size in memory. We could run out of memory before garbage collection is even triggered. (see #422574 (comment 1582015771))
- CPU: Ruby's well-known limitation, the Global Interpreter Lock (GIL), allows only one thread to run at a time within a process. As a result, only one request can be processed at a time per process, per CPU, and the other requests stay blocked for longer periods, which in turn could block customers on git push operations.
- Parallelism: Concurrency alone is not effective for a CPU-bound operation. We would rather optimize it via parallelism across multiple CPU cores. However, as noted under CPU above, the GIL prevents threads from executing simultaneously, so threads cannot give us that parallelism.
Proposal
As @stanhu suggested, the problems above could be tackled to a certain extent by running all the regex operations for each file within a child process. We kill the child process once the regex operations running inside it have completed.
Here's the code snippet extracted from a source:
child_pid = fork do
  # regex operations on a file
end
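# Look up the child's process group and signal the whole group (negative ID).
pgid = Process.getpgid(child_pid)
Process.kill('SIGHUP', -pgid)
# Let Ruby reap the child in the background so it does not linger as a zombie.
Process.detach(child_pid)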
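The snippet does not show how the scan results get back to the parent or how the parent knows the child has finished. One possible way to fill that gap is sketched below, assuming a pipe between parent and child and JSON as the wire format. The method name `scan_in_child` and the two patterns are hypothetical and only serve as an illustration, and Ruby's built-in `Regexp` stands in for RE2 to keep the sketch free of gem dependencies.

```ruby
require 'json'

# Hypothetical patterns for illustration only; the real rule set is not part of this issue.
PATTERNS = {
  'aws_access_key_id' => /AKIA[0-9A-Z]{16}/,
  'private_key_header' => /-----BEGIN [A-Z ]*PRIVATE KEY-----/
}.freeze

# Runs every pattern against `blob` inside a forked child and returns
# the names of the patterns that matched.
def scan_in_child(blob, patterns = PATTERNS)
  reader, writer = IO.pipe

  child_pid = fork do
    reader.close
    # All regex work (and the memory it allocates) stays in the child.
    hits = patterns.select { |_name, regex| regex.match?(blob) }.keys
    writer.write(JSON.generate(hits))
    writer.close
    exit!(0) # leave without running at_exit handlers inherited from the parent
  end

  writer.close
  payload = reader.read   # returns at EOF, i.e. once the child closes its end
  reader.close
  Process.wait(child_pid) # reap the child; its memory goes back to the OS
  JSON.parse(payload)
end
```

Calling `scan_in_child("key = AKIAABCDEFGHIJKLMNOP")` would return `["aws_access_key_id"]`. Because the child exits on its own after writing its result, this variant needs no explicit kill; a kill would still be useful as a safety net for a child that runs too long.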
The above approach solves the outlined problems:
- Memory: Since the regex operations run within a child process, killing that process immediately frees the memory it occupied.
- CPU: Since the GIL is scoped to a single process, regex operations on different files/requests would not block each other.
- Parallelism: We opt for multiprocessing rather than multithreading, which resolves this issue.
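To make the parallelism point concrete, here is a minimal sketch that forks one child per file so that the scans can run on separate CPU cores, each child with its own GIL. The method name `scan_blobs_in_parallel` and the token pattern are hypothetical, and Ruby's built-in `Regexp` is used in place of RE2 to keep the example self-contained.

```ruby
# Hypothetical pattern, for illustration only.
TOKEN_PATTERN = /glpat-[0-9A-Za-z_\-]{20}/

# Forks one child per blob and returns { name => true/false },
# where true means the pattern was found in that blob.
def scan_blobs_in_parallel(blobs, pattern = TOKEN_PATTERN)
  # Start all children first so they run at the same time.
  children = blobs.map do |name, content|
    reader, writer = IO.pipe
    pid = fork do
      reader.close
      writer.write(pattern.match?(content) ? '1' : '0')
      writer.close
      exit!(0)
    end
    writer.close # the parent keeps only the read end
    [name, pid, reader]
  end

  # Then collect each result and reap each child.
  children.to_h do |name, pid, reader|
    found = reader.read == '1'
    reader.close
    Process.wait(pid)
    [name, found]
  end
end
```

This sketch starts as many children as there are blobs. A real implementation would likely cap the number of concurrent children, for example at the number of available cores, to avoid oversubscribing the machine.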
NOTE: Some reference links yet to be added.
Edited by Vishwa Bhat