This runbook is intended to be used when monitoring the [secret push protection](https://docs.gitlab.com/ee/user/application_security/secret_detection/secret_push_protection/index.html) feature to identify and mitigate any reliability issues or performance regressions that may occur when it is enabled on GitLab.com. The runbook can also be used to understand more about relevant dashboards below and how to improve them:
This runbook is intended to be used when monitoring the [secret push protection](https://docs.gitlab.com/user/application_security/secret_detection/secret_push_protection/#secret-push-protection-workflow) feature to identify and mitigate any reliability issues or performance regressions that may occur while it is enabled on GitLab.com. The runbook can also be used to understand more about relevant dashboards below and how to improve them:
* [Secret Push Protection – Memory and GC Monitoring](https://dashboards.gitlab.net/d/abe91e88-a9a4-4483-97f0-bc170c087cfb/spp-memory-and-gc-monitoring)
### What to monitor?
While the feature, in its [current form](https://docs.gitlab.com/ee/architecture/blueprints/secret_detection/#high-level-architecture), doesn't have any external components and is entirely encapsulated within the application server as a dependency, it does interact with a number of components as can be seen in this [push event sequence diagram](https://docs.gitlab.com/ee/architecture/blueprints/secret_detection/#push-event-detection-flow). Those components are:
Although the feature, in its [current form](../../../../../architecture/design-documents/secret_detection/#high-level-architecture), doesn't have external components and is encapsulated within the application server as [a dependency](https://gitlab.com/gitlab-org/security-products/secret-detection/secret-detection-service/-/blob/main/gitlab-secret_detection.gemspec), it does interact with a number of components, as can be seen in this [push event sequence diagram](../../../../../architecture/design-documents/secret_detection/#push-event-detection-flow). Those components are:
* GitLab Shell (Git over SSH):
  * `git-receive-pack`
@@ -20,9 +21,9 @@ While the feature, in its [current form](https://docs.gitlab.com/ee/architecture
  * `SSHReceivePack`
  * `PostReceivePack`
  * `PreReceiveHook`
  * `ListAllBlobs()` RPC
  * `ListBlobs()` RPC
  * `GetTreeEntries()` RPC
  * `ListAllCommits()` RPC ([or `ListCommits()` RPC](https://gitlab.com/gitlab-org/gitlab/-/blob/a7c19f7ae8ed00f512bf7324879ae87d59bb088c/lib/gitlab/gitaly_client/commit_service.rb#L369-370) when no [quarantine directory](https://gitlab.com/gitlab-org/gitaly/-/blob/master/doc/object_quarantine.md) exists)
  * `FindChangedPaths()` RPC
  * `DiffBlobs()` RPC
* Rails:
  * `/internal/allowed` Endpoint
@@ -30,6 +31,7 @@ Below is a sequence diagram showing the entire workflow whether a `git push` tak
```mermaid
sequenceDiagram
autonumber
actor User
User->>+Workhorse/GitLab Shell: git push
Workhorse/GitLab Shell->>+Gitaly: tcp/ssh
@@ -51,9 +53,16 @@ sequenceDiagram
Gitaly->>+Workhorse/GitLab Shell: outcome of push
Workhorse/GitLab Shell->>+User: outcome of push
end
Rails->>+Gitaly: ListBlobs or ListAllBlobs
Rails->>+Gitaly: ListCommits or ListAllCommits
Note over Gitaly, Rails: depends on quarantine directory existence
Gitaly->>+Rails: grpc
Rails->>+Gitaly: FindChangedPaths
Note over Gitaly, Rails: returns all changed files for new commits
Gitaly->>+Rails: grpc
Rails->>+Rails: Populate PayloadPathsLookupMap with commit sha/file path
Rails->>+Gitaly: DiffBlobs
Note over Gitaly, Rails: returns all diff patches for changed files
gitlab-secret_detection->>+Rails: fail - secret found
Rails->>+Gitaly: GetTreeEntries
Note over Gitaly, Rails: retrieves blobs' file path and commit sha
Gitaly->>+Rails: grpc
Rails->>+Rails: Use PayloadPathsLookupMap to retrieve commit sha/file path
Rails->>+Rails: Format Response
Rails->>+Gitaly: reject - secret detected
end
@@ -76,13 +83,17 @@ sequenceDiagram
Workhorse/GitLab Shell->>+User: outcome of push
```
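The diagram note "depends on quarantine directory existence" amounts to a simple selection rule for the commit-listing RPC. The Python sketch below only illustrates that rule: the function name `select_commit_rpc` and its boolean argument are hypothetical, and the actual logic lives in the Ruby Gitaly client linked in the component list.

```python
def select_commit_rpc(quarantine_directory_exists: bool) -> str:
    """Return the name of the Gitaly RPC used to enumerate new commits.

    Hypothetical helper for illustration; it mirrors the rule stated in
    this runbook: ListAllCommits when a quarantine directory exists,
    ListCommits otherwise.
    """
    if quarantine_directory_exists:
        return "ListAllCommits"
    return "ListCommits"


if __name__ == "__main__":
    # A push that is still being checked normally has a quarantine directory.
    print(select_commit_rpc(True))
    print(select_commit_rpc(False))
```

Knowing which of the two RPCs a given push used helps decide which latency panel to check first.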
And here is a workflow diagram explaining how Secret Push Protection works [without `GetTreeEntries()`](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/210708) starting with GitLab 18.7:
_Note: `PreReceiveHook` is not to be confused with git's [pre-receive hook](https://git-scm.com/docs/githooks#pre-receive). In fact, the former is a [binary wrapper](https://gitlab.com/gitlab-org/gitaly/-/tree/master/cmd/gitaly-hooks) around the actual git hook. Please read more about the [hook setup](https://gitlab.com/gitlab-org/gitaly/-/blob/master/doc/hooks.md#hook-setup) in Gitaly's documentation._
These components are therefore the main elements we are trying to focus on when monitoring the feature.
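To make the role of `PayloadPathsLookupMap` in the diagram more concrete, here is a minimal Python sketch of the idea: record commit sha and file path while processing `FindChangedPaths` results, then reuse that record when formatting findings instead of making another RPC. Everything in it is an assumption made for illustration, including the `ChangedPath` type, keying the map by blob id, and the function names; the real implementation is Ruby code in the application and may be structured differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangedPath:
    """Hypothetical stand-in for one FindChangedPaths result."""
    blob_id: str
    commit_sha: str
    path: str


def build_lookup_map(changed_paths):
    """Map each blob id to the (commit_sha, path) pairs it appears in."""
    lookup = {}
    for changed in changed_paths:
        lookup.setdefault(changed.blob_id, []).append(
            (changed.commit_sha, changed.path)
        )
    return lookup


def format_findings(finding_blob_ids, lookup):
    """Attach commit sha and file path to every blob that has a finding."""
    results = []
    for blob_id in finding_blob_ids:
        # Blobs missing from the map are skipped rather than guessed at.
        for commit_sha, path in lookup.get(blob_id, []):
            results.append(
                {"blob_id": blob_id, "commit_sha": commit_sha, "path": path}
            )
    return results


if __name__ == "__main__":
    changed = [
        ChangedPath("b1", "c1", "config/secrets.yml"),
        ChangedPath("b2", "c1", "README.md"),
    ]
    print(format_findings(["b1"], build_lookup_map(changed)))
```

Because the lookup is an in-memory dictionary access, this step should add little latency compared with the Gitaly calls around it.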
### How we monitor the feature?
As discussed above, the functionality spans a number of components. Therefore, there are three main tools we could use for monitoring the feature:
As discussed above, the functionality spans a number of components. Therefore, there are different tools we use for monitoring the feature:
* Kibana (Logs)
  * [Staging](https://nonprod-log.gitlab.net)
@@ -95,6 +106,24 @@ As discussed above, the functionality spans a number of components. Therefore, a
  * [Logs / All Completed Scans](https://log.gprd.gitlab.net/app/r/s/ax6qa)
* Kibana (Visualizations)
  * [Average Duration of Completed Scans](https://log.gprd.gitlab.net/app/lens#/edit/a0b71153-c3a7-4b76-9cd5-c856dd2ef6e1?_g=(filters:!(),refreshInterval:(pause:!t,value:60000),time:(from:now-7d,to:now)))
  * [Maximum Duration of Completed Scans](https://log.gprd.gitlab.net/app/lens#/edit/01389e92-932a-4d1e-9d59-bc1656026800?_g=(filters:!(),refreshInterval:(pause:!t,value:60000),time:(from:now-7d,to:now)))
  * [Completed Scans Duration in 10 Second Increments](https://log.gprd.gitlab.net/app/lens#/edit/6d230ed7-61f0-4453-a4ca-7a8bf6d57b21?_g=(filters:!(),refreshInterval:(pause:!t,value:60000),time:(from:now-7d,to:now)))
  * [Breakdown of Changed Paths over Time](https://log.gprd.gitlab.net/app/lens#/edit/7f5a1b82-8b77-426f-8fe3-736f40da0b7e?_g=(filters:!(),refreshInterval:(pause:!t,value:60000),time:(from:now-7d,to:now)))
@@ -425,9 +454,9 @@ The section is divided into four sub-sections as follows, with most focus being
* Gitaly / Before `/internal/allowed`:
  * `PreReceiveHook`.
* Gitaly / During `/internal/allowed`:
  * `ListAllBlobs()` RPC
  * `ListBlobs()` RPC
  * `GetTreeEntries()` RPC
  * `ListAllCommits()` RPC (or `ListCommits()` RPC)
  * `FindChangedPaths()` RPC
  * `DiffBlobs()` RPC
##### GitLab Shell <=> Gitaly
@@ -484,46 +513,61 @@ _Panel Information_
###### Gitaly / During `/internal/allowed`
**[ListAllBlobs – Average Latency [All Hosts]](https://dashboards.gitlab.net/d/fdk7i56zibv28d/secret-push-protection-e28093-overview?orgId=1&viewPanel=10)**
**[ListAllCommits – Average Latency [All Hosts]](https://dashboards.gitlab.net/d/fdk7i56zibv28d/secret-push-protection-e28093-overview?orgId=1&viewPanel=10)**
This panel displays the average latency in milliseconds for all calls to the `ListAllCommits` RPC, which is responsible (within the context of the feature) for enumerating all new commits of a repository under a certain size limit (i.e. exactly 1MiB).
**[ListCommits – Average Latency [All Hosts]](https://dashboards.gitlab.net/d/fdk7i56zibv28d/secret-push-protection-e28093-overview?orgId=1&viewPanel=11)**
This panel displays the average latency in milliseconds for all calls to the `ListAllBlobs` RPC, which is responsible (within the context of the feature) for enumerating all blobs of a repository under a certain size limit (i.e. exactly 1MiB). This procedure is usually fast because it is mostly used with the size limit set to 0 for checking file sizes of blobs in a certain git push.
This panel displays the average latency in milliseconds for all calls to the `ListCommits` RPC, which is responsible (within the context of the feature) for enumerating all blobs of a repository under a certain size limit (i.e. exactly 1MiB), similar to `ListAllCommits`, but it also loads up file paths for those blobs. The procedure is often slower than `ListAllCommits`.
**[ListBlobs – Average Latency [All Hosts]](https://dashboards.gitlab.net/d/fdk7i56zibv28d/secret-push-protection-e28093-overview?orgId=1&viewPanel=11)**
**[FindChangedPaths – Average Latency [All Hosts]](https://dashboards.gitlab.net/d/fdk7i56zibv28d/secret-push-protection-e28093-overview?orgId=1&viewPanel=9)**
This panel displays the average latency in milliseconds for all calls to the `ListBlobs` RPC, which is responsible (within the context of the feature) for enumerating all blobs of a repository under a certain size limit (i.e. exactly 1MiB), similar to `ListAllBlobs`, but it also loads up file paths for those blobs. The procedure is often slower than `ListAllBlobs` because it loads up blob contents when enumerating them.
This panel displays the average latency in milliseconds for all calls to the `FindChangedPaths` RPC, which is responsible (within the context of the feature) for retrieving changed paths/files and their metadata (i.e. file path and commit sha) for all new commits that are being scanned in a push.
**[GetTreeEntries – Average Latency [All Hosts]](https://dashboards.gitlab.net/d/fdk7i56zibv28d/secret-push-protection-e28093-overview?orgId=1&viewPanel=9)**
**[DiffBlobs – Average Latency [All Hosts]](https://dashboards.gitlab.net/d/fdk7i56zibv28d/secret-push-protection-e28093-overview?orgId=1&viewPanel=panel-43)**
This panel displays the average latency in milliseconds for all calls to the `GetTreeEntries` RPC, which is responsible (within the context of the feature) for retrieving blob metadata (i.e. file path and commit sha) for all blobs that were scanned and found to include a leaked secret.
This panel displays the average latency in milliseconds for all calls to the `DiffBlobs` RPC, which is responsible (within the context of the feature) for retrieving the actual payload (i.e. the diff or delta) of all changed paths/files in new commits scanned in a push.
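For readers unfamiliar with how an "average latency in milliseconds" figure is derived, the Python sketch below shows the arithmetic on a handful of made-up samples. It is not how the dashboard panels are implemented (they are built from metrics in the monitoring stack); the function name, the sample format, and the values are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def average_latency_ms(samples):
    """Return {rpc_name: average latency in ms}.

    `samples` is an iterable of (rpc_name, latency_in_seconds) pairs.
    This input shape is hypothetical and chosen only for the example.
    """
    by_rpc = defaultdict(list)
    for rpc_name, seconds in samples:
        by_rpc[rpc_name].append(seconds * 1000.0)  # seconds -> milliseconds
    return {rpc_name: mean(values) for rpc_name, values in by_rpc.items()}


if __name__ == "__main__":
    # Made-up sample data, not real measurements.
    samples = [
        ("DiffBlobs", 0.25),
        ("DiffBlobs", 0.75),
        ("FindChangedPaths", 0.5),
    ]
    print(average_latency_ms(samples))
```

An average can hide a small number of very slow calls, so it is worth reading these panels alongside the maximum-duration visualizations listed earlier.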
@@ -655,7 +699,7 @@ If a new component is utilised by the feature, please follow the steps below.
* Explore metrics available for the endpoint or service.
* If no metrics are available, consider [creating them](https://docs.gitlab.com/ee/administration/monitoring/prometheus/) to monitor the performance of the endpoint/service.
* Create a new row for the component in the dashboard you are editing.
* Add as many panels as there are available metrics in the new row. Use your best judgement on what should be added.
* Add available metrics in the new row. Use your best judgement on what should be added.
* Create a merge request updating this runbook with information about the panel. Use panels above for guidance.