We are adding the ability to use DiffBlobs to the current gem but we want the Secret Detection Service (SDS) to present an API to be able to scan strings regardless of whether they're blob diffs, whole commits, or, in the future, text from other sources.
We need to refactor the DiffBlob logic out of the gem code so the gem can be generalized and the blob diff logic can remain in the rails code.
Questions to answer
At what point do we "cut over" to adding new features/functionality in one place to avoid the parity issue moving forward?
How would that development process look like? Where should this be documented?
No, we don't need to do this anymore. The initial DiffBlobs MR had coupled Gitaly and git's terminologies with the secret detection gem. However, we decided later to keep the secret detection gem independent considering the future input types (job logs, etc) and handle this step in the pre-processing stage (secrets_check.rb) before calling the gem's scan method. So, we can close this issue.
@vbhat161 Is this true? @serenafang's in-flight MR shows changes in the gem files pertaining to using diffblobs which we would want to include in the SDS, unless I'm mistaken.
@eurie You're right. There is a new scan_diffs.rb file that is being introduced, although I'm not convinced about that approach. On the safer side, let's consider there's a possibilityof some dependent changes from DiffBlobs MR that we need to include in the SDS gem. Given this, it makes sense to keep this issue open for now. We can close it in case the mentioned approach is isolated from the scan logic.
As pointed out by @eurie, we still need to account for this change because !162782 (merged) introduces scan_diffs.rb to handle scanning payloads that are diff-based. This was mainly due to concerns (1, 2) over the use of feature flags in the gem (simply put, it's not possible to use FFs in the gem).
I have discussed @vbhat161's comment synchronously and we agreed it would be okay to move forward with that approach so as not to delay Serena's merge request further, but that would also require introducing a new RPC endpoint in the service.
@ahmed.hemdan Thanks for sharing it here! There's a slight misunderstanding. I'm in favor of moving forward with that approach for that merge request since it was long due and handling the separation of parsing logic from scanning logic when we migrate gem files to SDS project. However, I'm not in favor of introducing a new RPC endpoint. The SD service is solely designed to scan the payload but not transform the raw data into a scannable format and then run the scan on it, unless we want to take that direction.
There's a slight misunderstanding. I'm in favor of moving forward with that approach for that merge request since it was long due and handling the separation of parsing logic from scanning logic when we migrate gem files to SDS project. However, I'm not in favor of introducing a new RPC endpoint.
@vbhat161 Thanks for clarifying this. I misunderstood what you said earlier. 😅
The SD service is solely designed to scan the payload but not transform the raw data into a scannable format and then run the scan on it, unless we want to take that direction.
I don’t think we should take that direction, it’s better to keep the service focused solely on scanning payloads.
I agree with you both @vbhat161 and @ahmed.hemdan, that the SDS should focus on scanning in the generic sense and not converting data to a scannable. Thank you for sharing the outcome of your discussion and clarifying.
Just to recap, no changes to the SDS gem are needed for DiffBlobs support. However, we do need to incorporate the logic for parsing raw git-diff data into a scannable payload before invoking the Gem/Service in the Rails monolith.
@eurie Maybe we can modify this issue description to reflect the parsing task in the Rails instead?
How would that development process look like? Where should this be documented?
I'd say we'll inform this information to the team when #488497 (closed) is taken up for development. Any further changes to the SD core logic should be done in the SD service project repository. What do you think?
UPDATE: As mentioned here, we might have to move over DiffBlobs changes too in the SD gem.
Please note that SD Exclusions for Secret Push Protection is also introducing a minor change to how exclusions are handled while scanning (see my draft MR) due to the change in exclusions object format (i.e. it's now a hash rather than an array).
I'm hoping to raise a merge request in SDS to update the SD core logic accordingly as soon as I'm done with !166511 (merged).
I'd say we'll inform this information to the team when #488497 (closed) is taken up for development. Any further changes to the SD core logic should be done in the SD service project repository. What do you think?
@ahmed.hemdan@vbhat161@eurie👋 I'm beginning development on this issue and I was hoping to get your opinions on how we could standardize the payload info in ee/lib/gitlab/checks/secrets_check.rb before sending it to the gem/SDS. I have a draft MR with my attempt at standardizing the info, but I'm open to any other ideas.
Problem:
The format of the response returned by Gitaly::DiffBlobsRequest and ::Gitlab::Checks::ChangedBlobs is quite different. For example, this is what a DiffBlob response looks like:
[#<Gitlab::GitalyClient::DiffBlob:0x0000000388891120 @binary=false, @left_blob_id="e453c0db9822625c2761e43a27a4d25d7499c580", @over_patch_bytes_limit=false, @patch= "@@ -5,5 +5,7 @@\ncontext line 1\ncontext line 2\ncontext line 3\ncontext line 4\ncontext line 5\n+added line 1\n+added line 2\n", @right_blob_id="e1a69d0f17aa4fba82a8f77fb04dd8445e3921c5", @status=:STATUS_END_OF_PATCH>]
and this is what a ChangedBlob response looks like:
[#<Gitlab::Git::Blob:0x000000039ae53038 @binary=nil, @commit_id=nil, @data= "context line 1\ncontext line 2\ncontext line 3\ncontext line 4\ncontext line 5\nadded line 1\nadded line 2\n", @id="e1a69d0f17aa4fba82a8f77fb04dd8445e3921c5", @loaded_all_data=true, @loaded_size=178, @mode=nil, @name=nil, @path=nil, @raw= "context line 1\ncontext line 2\ncontext line 3\ncontext line 4\ncontext line 5\nadded line 1\nadded line 2\n", @size=101>]
Previously we would send the DiffBlobs to gems/gitlab-secret_detection/lib/gitlab/secret_detection/**scan_diffs**.rb for diff parsing and scanning, but this issue aims to refactor DiffBlobs logic out of the gem code. We will remove scan_diffs.rb and scan both DiffBlobs and ChangedBlobs in scan.rb. Since it's not possible to check the feature flag status in gems/gitlab-secret_detection/lib/gitlab/secret_detection/scan.rb, we want to make sure that the format of the payload we are sending to scan.rb is the same, whether we are scanning a DiffBlob (diffs only) or a ChangedBlob (entire file contents).
Possible solution:
Please take a look at my draft MR for my attempt to standardize the payload format. My idea is to create a new method called standardize_payloads. The method returns an array of hashes that looks like this:
[{:id=>"e1a69d0f17aa4fba82a8f77fb04dd8445e3921c5", :data=> [{:line_content=>"added line 1\n", :line_number=>6}, {:line_content=>"added line 2\n", :line_number=>7}] }]
where id is the blob.id or the diff.right_blob_id, and data is the blob.data or the diff.patch lines that start with +. The reason that data contains the keys line_content and line_number is so when use_diff_scan? is true and we parse diffs, we can record only the content of the added lines (lines that start with + in a git diff) and the line number of the newly added lines. If use_diff_scan? is false, then we will process and record each line of the entire file. So if use_diff_scan? is true we would send the hash above, and if use_diff_scan? is false we would send:
[{:id=>"e1a69d0f17aa4fba82a8f77fb04dd8445e3921c5", :data=> [ {:line_content=>"context line 1\n", :line_number=>1}, {:line_content=>"context line 2\n", :line_number=>2}, {:line_content=>"context line 3\n", :line_number=>3}, {:line_content=>"context line 4\n", :line_number=>4}, {:line_content=>"context line 5\n", :line_number=>5}, {:line_content=>"added line 1\n", :line_number=>6}, {:line_content=>"added line 2\n", :line_number=>7} ] }]
This way, gems/gitlab-secret_detection/lib/gitlab/secret_detection/scan.rb#secrets_scan takes an array of standardized payloads, with each payload having id and data properties. gems/gitlab-secret_detection/lib/gitlab/secret_detection/scan.rb#find_secrets will scan each data[:line_content] for secrets, and record the data[:line_number] if there are any findings.
@serenafang Thank you for bringing this up and for sharing your thought process. 🙏🏼
One thing that stood out to me is this:
We will remove scan_diffs.rb and scan both DiffBlobs and ChangedBlobs in scan.rb.
Are we not planning to drop scanning entire files (i.e. ChangedBlobs) in favor of committed changes (i.e. DiffBlobs) anyways? Why would we care about having two separate payload formats if that's the case?
I imagine this may be related to having to keep support for both scanning mechanisms as long as the feature flag is around? In that case, perhaps it will make more sense to prioritize FF removal rather than add more redundant code that will be removed shortly? 🤔
When it comes to the payload format, I like the use of id and data properties as originally suggested by @vbhat161. 💯
However, I have no strong opinion on the idea of breaking up diff_blob.patch into separate lines. While I'm generally leaning more towards passing the data as whole and let scanning engines (SDS or gem) handle it whichever way it deems suitable, looking back to the discussion we had a few weeks ago, we may want to split payload formatting away from scanning logic.
Curious to know what Ethan and Vishwa think as well.
Are we not planning to drop scanning entire files (i.e. ChangedBlobs) in favor of committed changes (i.e. DiffBlobs) anyways?
Even after we remove the FF, we still need to account for the protocol check; http/ssh pushes will use diff-scanning, and WebIDE pushes (and other web actions) will use entire-file-scanning. Until we are able to differentiate between WebIDE user pushes and other web workflows using the action param, we need to keep support for both entire files and committed diffs.
Additionally, if we plan to expand SPP to other places as part of the platform wide SD vision, it could be helpful to establish a standardized payload format for the gem to accept for all future areas that we expand to.
Even after we remove the FF, we still need to account for the protocol check; http/ssh pushes will use diff-scanning, and WebIDE pushes (and other web actions) will use entire-file-scanning. Until we are able to differentiate between WebIDE user pushes and other web workflows using the action param, we need to keep support for both entire files and committed diffs.
Additionally, if we plan to expand SPP to other places as part of the platform wide SD vision, it could be helpful to establish a standardized payload format for the gem to accept for all future areas that we expand to.
Agreed. From my side, I'm not a fan of breaking up payloads into single lines as stated above, but have no compelling argument against it, so don't mind doing this if everyone else on the team is in favour of this format. 🙌🏼
I'm not a fan of breaking up payloads into single lines
I'm definitely open to other approaches, since I agree that breaking up the data/patch into data=>[{:line_content=>"context line 1\n", :line_number=>1}] is a little awkward 😅
But if we don't do that, I'm not sure how gems/gitlab-secret_detection/lib/gitlab/secret_detection/scan.rb would recognize diffs vs the entire file and then handle them correctly. For example, we send the whole diff_blob.patch as data to scan.rb: "@@ -5,5 +5,7 @@\ncontext line 1\ncontext line 2\ncontext line 3\ncontext line 4\ncontext line 5\n+added line 1\n+added line 2\n". Then scan.rb would need a way to determine that the data is a diff without checking the feature flag in order to parse it.
One potential solution could be to check that data starts with @@ in scan.rb, but then we run into the possibility that a file can technically start with @@ but isn't a diff patch, and shouldn't be parsed as such.
Another idea I considered was to replace all diff context lines in a file with a blank line in secrets_check.rb#parse_diffs before sending it to scan.rb. So the example patch from earlier would be changed into "\n\n\n\n\n\nadded line 1\nadded line 2\n", and we would send
[{:id=>"e1a69d0f17aa4fba82a8f77fb04dd8445e3921c5", :data=> ["\n\n\n\n\n\nadded line 1\nadded line 2\n"] }]
to scan.rb without having to break data/patch up into single lines. Then scan.rb would loop through each line of data looking for secrets. A drawback to this approach is that it's possible that a git diff patch has widely spaced changes, like this:
@@ -8,6 +8,7 @@ context line 7 context line 8 context line 9+added line on line 10 context line 11 context line 12 context line 13@@ -998,6 +999,7 @@ context line 996 context line 997 context line 998+added line on line 1000 context line 1001 context line 1002 context line 1003
Using this approach, we would send nearly 1000 empty lines along with 2 lines with content to scan.rb#find_secrets, requiring us to iterate through all of them to find secrets and correctly count the line numbers.
With the original single-line approach, parse_diffs only has to look through a few lines of diff, and efficiently extracts just the added lines:
:data=> [{:line_content=>"added line on line 10\n", :line_number=>10}, {:line_content=>"added line on line 1000\n", :line_number=>1000}]
But if we don't do that, I'm not sure how gems/gitlab-secret_detection/lib/gitlab/secret_detection/scan.rb would recognize diffs vs the entire file and then handle them correctly.
I think we could possibly pass a variable to scan.rb indicating the type of scan expected, e.g. scan_type: :diff_scanning vs. scan_type: :file_scanning. What do you think?
That could work! 🙌 Though sending a scan_type variable would require some extra parsing in scan.rb; if our goal is to keep the gem/SDS focused solely on scanning the payload, rather than transforming the raw data into a scannable format and then running the scan on it, as discussed in this comment, then it might be cleaner to handle the pre-processing in secrets_check.rb instead.
Though sending a scan_type variable would require some extra parsing in scan.rb; if our goal is to keep the gem/SDS focused solely on scanning the payload, rather than transforming the raw data into a scannable format and then running the scan on it, as discussed in this comment, then it might be cleaner to handle the pre-processing in secrets_check.rb instead.
Agreed, if we decide to leave any parsing out of gem/SDS, then pre-processing needs to be done in the push check. 💯
Currently the MR with the single-line approach scans diffs and entire files correctly 🙌 I'll wait for Ethan and Vishwa's thoughts and the team's consensus on whether we want to leave parsing out of the gem/SDS before tackling fixing the specs.
@serenafang Thanks for sharing your thoughts. 🙇 It was super helpful to gather the problem context.
The approach of sending line_content and line_number feels like a hack to address this problem. Please bear in mind that any change we plan to introduce to the scan interface should be roughly compatible with other future data sources, including the ones having large data to scan. It is fair to expect the average blob size to be small for the diff-blobs scenario so the proposed structure would be acceptable, however, it would be super expensive for the client to send large data on a line-by-line basis. Splitting lines and processing on each one isn't a memory-friendly operation unless the string block we're trying to split is small. FWIW we have been blaming the regex operation for taking ~4x memory of input size so far, while part of the blame would go for the logic of splitting the large blob and processing line-by-line.
How about adjusting small changes in the scan interface and some parts from the client side? bear with my thought process:
Because hunk headers in the diff path contain the start and end lines of the diff, what if:
From the server/gem side, we will introduce a new property in the payload structure, say offset, which would act as a baseline number added to the detected finding's line in the result.
From the client side(i.e. secrets_check side), we will take subsequent added lines and form them as one payload where the offset value that payload would be the starting line from the added line.
Let's take an example for the below diff patch:
@@ -2,6 +2,7 @@ context line 8 context line 9+added one line // line no=4 as per hunk header+added second line containing secret context line 11 context line 12+added third line // line no=8 context line 13
Here, we could pass to the scan interface:
payloads: [ { id: "1", data: "added one line\nadded second line containing secret", offset: 4 }, { id: "2", data: "added third line", offset: 8 }]
So the detected finding returned by the gem would be on the line added second line containing secret with line number would be = detected-line in context to offset + offset value i.e. 1 + 4 = 5
This approach could be extended for other data sources, particularly data streaming like logs where we would want to scan log buffers instead of the entire log file at once.
We don't need to stick to the above approach, it was roughly what I had in mind. We could brainstorm other approaches too if the above ones have any loopholes. Let me know your thoughts!
@vbhat161 Ah thanks for explaining, it makes sense that the single-line approach would be super memory-expensive.
I like the offset idea! I've just pushed a change to my draft MR to use that approach. I opted to keep a single id per blob with multiple entries in data. Each entry in data has a body (the grouped lines) and its corresponding offset. This way, we can account for multiple sections of diffs within the same blob without having to duplicate the blob ID.
So for this example diff:
@@ -0,0 +1,2 @@+new line+another added line in the same section@@ -7,1 +8,2 @@ unchanged line here+another new line
We would process it into:
[{ id: "535d297abe2680c5d6491c9fc44e1ea3e4ead1ad", data: [ { body: "new line\nanother added line in the same section\n", offset: 1 }, { body: "another new line\n", offset: 8 } ]}]
If use_diff_scan? is false, then we would send the entire file, which would look like this:
[{ id: "535d297abe2680c5d6491c9fc44e1ea3e4ead1ad", data: [ { body: ""new line\nanother added line in the same section\nexisting line 3\nexisting line 4\nexisting line 5\nexisting line 6\nunchanged line here\nanother new line\nexisting line 9\nexisting line 10\nexisting line 11\nexisting line 12\nexisting line 13"", offset: 1 } ]}]
@serenafang It is okay to send a payload array with multiple elements having the same id value since that value is not used in the scanning logic. The id property is primarily meant for the client to identify which payload block the finding belongs to. So, using your example, the scan request's payload array can have the following structure instead:
[{id:"535d297abe2680c5d6491c9fc44e1ea3e4ead1ad",//sameidvaluedata:"new line\nanother added line in the same section\n",offset:1},{id:"535d297abe2680c5d6491c9fc44e1ea3e4ead1ad",//sameidvaluedata:"another new line\n",offset:8}]
The above request structure should work as expected IMO and also adhere to the existing structure of the scan interface (except for the new offset property of course).
@ahmed.hemdan@eurie What do you think about this updated approach?
I think it makes sense. 💯 I would be also agree that having multiple elements with the same id should generally work without issues as long as we're still creating findings with the correct corresponding id.
@vbhat161@ahmed.hemdan I'll proceed with the offset approach then 👍 If you'd like, please have a look at my draft for my attempt -- I still need to fix up the specs, but the changes to secrets_check and scan are basically finished.
@serenafang I've reviewed your draft MR and added some comments there.
Also, to keep things consistent with monolith gem, I've created an MR that introduces/accepts offset property to calculate the absolute line number in the SDS, as per the final payload structure discussed above.
@serenafang Sure, I will take a look shortly, but as a start, do you think we could move payload formatting into its own separate class as a first step towards refactoring the push check to reduce complexity?
@ahmed.hemdan Good suggestion! Payload formatting could definitely go in its own class to reduce complexity of secrets_check👍 However, I'd like to unblock the SDS work, and I'm worried that splitting out payload formatting + specs could delay this issue and thus the SDS timeline. Would it be okay to proceed with the MR as it is now (after I've addressed your suggestions and fixed the specs, of course) and tackle the class refactor with #481883 (closed)?
@serenafang I don't object to proceeding with the merge request, but I'm also worried we will keep increasing technical debt if we do so, especially since we already have a lot of stuff going on in the push check code.
I will defer the decision here to both @amarpatel and @eurie who as far as I understand are interested in unblocking the related SDS work sooner rather than later, so that they can chime in with their thoughts! 🙏🏼
Now that @eurie has approved !170010 (merged), we've assigned the maintainer review to @ahmed.hemdan since it's a large change and he has the context. In the meantime, Ethan's 4 MRs can go in, I'll likely have to rebase my MR on Friday/Monday, and hopefully Ahmed is able to review during his Monday morning and we get the MR in before the PCL on Wednesday. 🙇
Rollback strategy: In case of unexpected results after merging !170010 (merged), we can minimize the damage by disabling the spp_scan_diffs flag, since the majority of the change is dealing with how the diffs are parsed -- when the flag is off and the entire file is scanned, the refactored parsing and scanning logic is nearly the same as the unrefactored logic. If we are still seeing the unexpected results after disabling spp_scan_diffs, then we can revert !170010 (merged).