Commit afef1912 authored by Ahmed Hemdan 🎡

Update Secret Push Protection monitoring runbook

parent ea33da1f
+67 −23
@@ -4,13 +4,14 @@ title: "Secret Push Protection Monitoring"

### When to use this runbook?

This runbook is intended to be used when monitoring the [secret push protection](https://docs.gitlab.com/ee/user/application_security/secret_detection/secret_push_protection/index.html) feature to identify and mitigate any reliability issues or performance regressions that may occur when it is enabled on Gitlab.com. The runbook can also be used to understand more about relevant dashboards below and how to improve them:
This runbook is intended to be used when monitoring the [secret push protection](https://docs.gitlab.com/user/application_security/secret_detection/secret_push_protection/#secret-push-protection-workflow) feature to identify and mitigate any reliability issues or performance regressions that may occur while it is enabled on GitLab.com. The runbook can also be used to understand more about the relevant dashboards below and how to improve them:

* [Secret Push Protection – Overview Dashboard](https://dashboards.gitlab.net/d/fdk7i56zibv28d/secret-push-protection-e28093-overview?orgId=1)
* [Secret Push Protection – Memory and GC Monitoring](https://dashboards.gitlab.net/d/abe91e88-a9a4-4483-97f0-bc170c087cfb/spp-memory-and-gc-monitoring)

### What to monitor?

While the feature, in its [current form](https://docs.gitlab.com/ee/architecture/blueprints/secret_detection/#high-level-architecture), doesn't have any external components and is entirely encapsulated within the application server as a dependency, it does interact with a number of components as can be seen in this [push event sequence diagram](https://docs.gitlab.com/ee/architecture/blueprints/secret_detection/#push-event-detection-flow). Those components are:
Although the feature, in its [current form](../../../../../architecture/design-documents/secret_detection/#high-level-architecture), doesn't have external components and is encapsulated within the application server as [a dependency](https://gitlab.com/gitlab-org/security-products/secret-detection/secret-detection-service/-/blob/main/gitlab-secret_detection.gemspec), it does interact with a number of components, as can be seen in this [push event sequence diagram](../../../../../architecture/design-documents/secret_detection/#push-event-detection-flow). Those components are:

* GitLab Shell (Git over SSH):
  * `git-receive-pack`
@@ -20,9 +21,9 @@ While the feature, in its [current form](https://docs.gitlab.com/ee/architecture
  * `SSHReceivePack`
  * `PostReceivePack`
  * `PreReceiveHook`
  * `ListAllBlobs()` RPC
  * `ListBlobs()` RPC
  * `GetTreeEntries()` RPC
  * `ListAllCommits()` RPC ([or `ListCommits()` RPC](https://gitlab.com/gitlab-org/gitlab/-/blob/a7c19f7ae8ed00f512bf7324879ae87d59bb088c/lib/gitlab/gitaly_client/commit_service.rb#L369-370) when no [quarantine directory](https://gitlab.com/gitlab-org/gitaly/-/blob/master/doc/object_quarantine.md) exists)
  * `FindChangedPaths()` RPC
  * `DiffBlobs()` RPC
* Rails:
  * `/internal/allowed` Endpoint

@@ -30,6 +31,7 @@ Below is a sequence diagram showing the entire workflow whether a `git push` tak

```mermaid
sequenceDiagram
    autonumber
    actor User
    User->>+Workhorse/GitLab Shell: git push
    Workhorse/GitLab Shell->>+Gitaly: tcp/ssh
@@ -51,9 +53,16 @@ sequenceDiagram
        Gitaly->>+Workhorse/GitLab Shell: outcome of push
        Workhorse/GitLab Shell->>+User: outcome of push
    end
    Rails->>+Gitaly: ListBlobs or ListAllBlobs
    Rails->>+Gitaly: ListCommits or ListAllCommits
    Note over Gitaly, Rails: depends on quarantine directory existence
    Gitaly->>+Rails: grpc
    Rails->>+Gitaly: FindChangedPaths
    Note over Gitaly, Rails: returns all changed files for new commits
    Gitaly->>+Rails: grpc
    Rails->>+Rails: Populate PayloadPathsLookupMap with commit sha/file path
    Rails->>+Gitaly: DiffBlobs
    Note over Gitaly, Rails: returns all diff patches for changed files
    Gitaly->>+Rails: grpc
    Rails->>+gitlab-secret_detection: gitlab-secret_detection::Scan
    alt no secret detected
        gitlab-secret_detection->>+gitlab-secret_detection: scan blob
@@ -66,9 +75,7 @@ sequenceDiagram
    else secret detected
        gitlab-secret_detection->>+gitlab-secret_detection: scan blob
        gitlab-secret_detection->>+Rails: fail - secret found
        Rails->>+Gitaly: GetTreeEntries
        Note over Gitaly, Rails: retrieves blobs' file path and commit sha
        Gitaly->>+Rails: grpc
        Rails->>+Rails: Use PayloadPathsLookupMap to retrieve commit sha/file path
        Rails->>+Rails: Format Response
        Rails->>+Gitaly: reject - secret detected
    end
@@ -76,13 +83,17 @@ sequenceDiagram
    Workhorse/GitLab Shell->>+User: outcome of push
```

And here is a workflow diagram explaining how Secret Push Protection works [without `GetTreeEntries()`](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/210708) starting with GitLab 18.7:

![Secret Push Protection Workflow](/images/handbook/engineering/development/sec/secure/secret-detection/runbooks/spp-new-workflow.png "Secret Push Protection Workflow")
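
The lookup-map step in the workflow above can be sketched roughly as follows. This is an illustrative Python sketch only, not the Rails implementation: `ChangedPath`, `build_lookup_map`, and `locate_findings` are hypothetical names, and keying the map by blob ID is an assumption made for the example. It shows why a second Gitaly round trip is unnecessary once commit sha and file path have been recorded from the `FindChangedPaths` response.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangedPath:
    """Hypothetical stand-in for one entry of a FindChangedPaths response."""
    commit_sha: str
    path: str
    blob_id: str


def build_lookup_map(changed_paths):
    """Map each blob id to every (commit sha, file path) pair that refers to it."""
    lookup = {}
    for changed in changed_paths:
        lookup.setdefault(changed.blob_id, []).append(
            (changed.commit_sha, changed.path)
        )
    return lookup


def locate_findings(finding_blob_ids, lookup):
    """Resolve scanner findings to commit sha/file path using only the map."""
    return {blob_id: lookup.get(blob_id, []) for blob_id in finding_blob_ids}
```

With a map like this, a finding reported against a blob can be formatted with its commit sha and file path directly from memory, which is the role the diagram assigns to `PayloadPathsLookupMap`.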

_Note: `PreReceiveHook` is not to be confused with git's [pre-receive hook](https://git-scm.com/docs/githooks#pre-receive). In fact, the former is a [binary wrapper](https://gitlab.com/gitlab-org/gitaly/-/tree/master/cmd/gitaly-hooks) around the actual git hook. Please read more about the [hook setup](https://gitlab.com/gitlab-org/gitaly/-/blob/master/doc/hooks.md#hook-setup) in Gitaly's documentation._

These components are therefore the main elements we are trying to focus on when monitoring the feature.

### How we monitor the feature?

As discussed above, the functionality spans a number of components. Therefore, are three main tools we could use for monitoring the feature:
As discussed above, the functionality spans a number of components. Therefore, there are different tools we use for monitoring the feature:

* Kibana (Logs)
  * [Staging](https://nonprod-log.gitlab.net)
@@ -95,6 +106,24 @@ As discussed above, the functionality spans a number of components. Therefore, a
    * `pubsub-gitaly-inf-gprd`
    * `pubsub-workhorse-inf-gprd`
    * `pubsub-shell-inf-gprd`
* Kibana (Log Views)
  * `gitlab-org/gitlab`:
    * [Logs / All](https://log.gprd.gitlab.net/app/r/s/AM4yh)
    * [Logs / Blocked Pushes](https://log.gprd.gitlab.net/app/r/s/JjteP)
    * [Logs / Lookup Map](https://log.gprd.gitlab.net/app/r/s/qSr52)
    * [Logs / Completed Scans (>= 55 sec)](https://log.gprd.gitlab.net/app/r/s/CdWx2)
    * [Logs / All Completed Scans](https://log.gprd.gitlab.net/app/r/s/xaEdW)
  * All Projects:
    * [Logs / All](https://log.gprd.gitlab.net/app/r/s/cXSeX)
    * [Logs / Blocked Pushes](https://log.gprd.gitlab.net/app/r/s/c5oYC)
    * [Logs / Lookup Map](https://log.gprd.gitlab.net/app/r/s/knonO)
    * [Logs / Completed Scans (>= 55 sec)](https://log.gprd.gitlab.net/app/r/s/VSpA9)
    * [Logs / All Completed Scans](https://log.gprd.gitlab.net/app/r/s/ax6qa)
* Kibana (Visualizations)
  * [Average Duration of Completed Scans](https://log.gprd.gitlab.net/app/lens#/edit/a0b71153-c3a7-4b76-9cd5-c856dd2ef6e1?_g=(filters:!(),refreshInterval:(pause:!t,value:60000),time:(from:now-7d,to:now)))
  * [Maximum Duration of Completed Scans](https://log.gprd.gitlab.net/app/lens#/edit/01389e92-932a-4d1e-9d59-bc1656026800?_g=(filters:!(),refreshInterval:(pause:!t,value:60000),time:(from:now-7d,to:now)))
  * [Completed Scans Duration in 10 Second Increments](https://log.gprd.gitlab.net/app/lens#/edit/6d230ed7-61f0-4453-a4ca-7a8bf6d57b21?_g=(filters:!(),refreshInterval:(pause:!t,value:60000),time:(from:now-7d,to:now)))
  * [Breakdown of Changed Paths over Time](https://log.gprd.gitlab.net/app/lens#/edit/7f5a1b82-8b77-426f-8fe3-736f40da0b7e?_g=(filters:!(),refreshInterval:(pause:!t,value:60000),time:(from:now-7d,to:now)))
* Prometheus/Grafana (Metrics)
  * [Internal API](https://dashboards.gitlab.net/dashboards/f/internal-api/internal-api)
  * [Gitaly](https://dashboards.gitlab.net/dashboards/f/gitaly/gitaly-service)
@@ -425,9 +454,9 @@ The section is divided into four sub-sections as follows, with most focus being
    * Gitaly / Before `/internal/allowed`:
        * `PreReceiveHook`.
    * Gitaly / During `/internal/allowed`:
        * `ListAllBlobs()` RPC
        * `ListBlobs()` RPC
        * `GetTreeEntries()` RPC
        * `ListAllCommits()` RPC (or `ListCommits()` RPC)
        * `FindChangedPaths()` RPC
        * `DiffBlobs()` RPC

##### GitLab Shell <=> Gitaly

@@ -484,46 +513,61 @@ _Panel Information_

###### Gitaly / During `/internal/allowed`

**[ListAllBlobs – Average Latency [All Hosts]](https://dashboards.gitlab.net/d/fdk7i56zibv28d/secret-push-protection-e28093-overview?orgId=1&viewPanel=10)**
**[ListAllCommits – Average Latency [All Hosts]](https://dashboards.gitlab.net/d/fdk7i56zibv28d/secret-push-protection-e28093-overview?orgId=1&viewPanel=10)**

This panel displays the average latency in milliseconds for all calls to the `ListAllCommits` RPC, which is responsible (within the context of the feature) for enumerating all new commits of a repository under a certain size limit (i.e. exactly 1MiB).

_Panel Information_

* Metric: `gitaly:grpc_server_handling_seconds:avg5m`
* Label Filters:
  * `job` = `gitaly`
  * `grpc_method` = `ListAllCommits`
* Operations:
  * Avg: `1000 * avg`
* Legend:
  * `{{method}}`
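
As a rough illustration of how the panel information above combines into a single query, the Python sketch below assembles a PromQL expression from the metric, the label filters, and the `1000 * avg` operation. The helper name `panel_query` is hypothetical, and the exact expression stored in the Grafana panel may differ; this only assumes the operation wraps the filtered metric in `avg()` and scales seconds to milliseconds.

```python
def panel_query(metric, label_filters, scale=1000):
    """Build a PromQL expression that averages a metric and scales it.

    `label_filters` is a dict of label name to exact-match value; the default
    scale of 1000 converts a value in seconds to milliseconds.
    """
    selector = ", ".join(
        f'{name}="{value}"' for name, value in label_filters.items()
    )
    return f"{scale} * avg({metric}{{{selector}}})"


query = panel_query(
    "gitaly:grpc_server_handling_seconds:avg5m",
    {"job": "gitaly", "grpc_method": "ListAllCommits"},
)
print(query)
```

Swapping the `grpc_method` value for `ListCommits`, `FindChangedPaths`, or `DiffBlobs` gives the equivalent expression for the other panels in this sub-section.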

**[ListCommits – Average Latency [All Hosts]](https://dashboards.gitlab.net/d/fdk7i56zibv28d/secret-push-protection-e28093-overview?orgId=1&viewPanel=11)**

This panel displays the average latency in milliseconds for all calls to the `ListAllBlobs` RPC, which is responsible (within the context of the feature) for enumerating all blobs of a repository under a certain size limit (i.e. exactly 1MiB). This procedure is usually fast because it is mostly used with the size limit set to 0 for checking file sizes of blobs in a certain git push.
This panel displays the average latency in milliseconds for all calls to the `ListCommits` RPC, which is responsible (within the context of the feature) for enumerating all blobs of a repository under a certain size limit (i.e. exactly 1MiB), similar to `ListAllCommits`, but it also loads up file paths for those blobs. The procedure is often slower than `ListAllCommits`.

_Panel Information_

* Metric: `gitaly:grpc_server_handling_seconds:avg5m`
* Label Filters:
  * `job` = `gitaly`
  * `grpc_method` = `ListAllBlobs`
  * `grpc_method` = `ListCommits`
* Operations:
  * Avg: `1000 * avg`
* Legend:
  * `{{method}}`

**[ListBlobs – Average Latency [All Hosts]](https://dashboards.gitlab.net/d/fdk7i56zibv28d/secret-push-protection-e28093-overview?orgId=1&viewPanel=11)**
**[FindChangedPaths – Average Latency [All Hosts]](https://dashboards.gitlab.net/d/fdk7i56zibv28d/secret-push-protection-e28093-overview?orgId=1&viewPanel=9)**

This panel displays the average latency in milliseconds for all calls to the `ListBlobs` RPC, which is responsible (within the context of the feature) for enumerating all blobs of a repository under a certain size limit (i.e. exactly 1MiB), similar to `ListAllBlobs`, but it also loads up file paths for those blobs. The procedure is often slower than `ListAllBlobs` because it loads up blob contents when enumerating them.
This panel displays the average latency in milliseconds for all calls to the `FindChangedPaths` RPC, which is responsible (within the context of the feature) for retrieving changed paths/files and their metadata (i.e. file path and commit sha) for all new commits that are being scanned in a push.

_Panel Information_

* Metric: `gitaly:grpc_server_handling_seconds:avg5m`
* Label Filters:
  * `job` = `gitaly`
  * `grpc_method` = `ListBlobs`
  * `grpc_method` = `FindChangedPaths`
* Operations:
  * Avg: `1000 * avg`
* Legend:
  * `{{method}}`

**[GetTreeEntries – Average Latency [All Hosts]](https://dashboards.gitlab.net/d/fdk7i56zibv28d/secret-push-protection-e28093-overview?orgId=1&viewPanel=9)**
**[DiffBlobs – Average Latency [All Hosts]](https://dashboards.gitlab.net/d/fdk7i56zibv28d/secret-push-protection-e28093-overview?orgId=1&viewPanel=panel-43)**

This panel displays the average latency in milliseconds for all calls to the `GetTreeEntries` RPC, which is responsible (within the context of the feature) for retrieving blob metadata (i.e. file path and commit sha) for all blobs that were scanned and found to include a leaked secret.
This panel displays the average latency in milliseconds for all calls to the `DiffBlobs` RPC, which is responsible (within the context of the feature) for retrieving the actual payload (i.e. the diff or delta) of all changed paths/files in new commits scanned in a push.

_Panel Information_

* Metric: `gitaly:grpc_server_handling_seconds:avg5m`
* Label Filters:
  * `job` = `gitaly`
  * `grpc_method` = `GetTreeEntries`
  * `grpc_method` = `DiffBlobs`
* Operations:
  * Avg: `1000 * avg`
* Legend:
@@ -655,7 +699,7 @@ If a new component is utilised by the feature, please follow the steps below.
* Explore metrics available for the endpoint or service.
* If no metrics are available, consider [creating them](https://docs.gitlab.com/ee/administration/monitoring/prometheus/) to monitor the performance of the endpoint/service.
* Create a new row for the component in the dashboard you are editing.
* Add as many panels as for available metrics in the new row. Use your best judgement on what is should be added.
* Add available metrics in the new row. Use your best judgement on what should be added.
* Create a merge request updating this runbook with information about the panel. Use panels above for guidance.

#### When a component is no longer relevant