# Disk Cache Design

Gitaly uses a disk-based cache to serve some RPC responses efficiently
(at the time of writing, only the `SmartHTTPService.InfoRefUploadPack` RPC). This
cache is intended for large responses that are not suitable for a RAM-based
cache.

## Cache Invalidation

The mechanisms that invalidate the disk cache for a repository depend on special
annotations on the Gitaly gRPC methods. Each method that has scope "repository"
and operation type "mutator" causes the cache for the specified repository to be
invalidated. For more information on the annotation system, see the Gitaly
protobuf definition [contributing guide].

[contributing guide]: https://gitlab.com/gitlab-org/gitaly/tree/4c27a7f71ba1d91edbc9d321919620887d6a30d3/proto#rpc-annotations

## Repository State

For every repository using the disk cache, a special set of files is maintained
to indicate which cached responses are still valid. These files are stored
in a dedicated **state directory** for each repository:

	${STATE_DIR} = ${STORAGE_PATH}/+gitaly/state/${REPO_RELATIVE_PATH}

Before a mutating RPC handler is invoked, a gRPC middleware creates a "lease"
file in the state directory that signifies a mutating operation is in-flight.
These lease files reside at the following path:

	${STATE_DIR}/pending/${RANDOM_FILENAME}

Upon the completion of the mutating RPC, the lease file will be removed and
the "latest" file will be updated with a random value to reflect the new
"state" of the repository.

	${STATE_DIR}/latest
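
The lease lifecycle described above can be sketched in Go as follows. This is a minimal illustration, not Gitaly's actual middleware: the function names (`stateDir`, `startLease`, `endLease`, `randomHex`), the 16-byte random values, and the file permissions are all assumptions made for the example.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
)

// stateDir mirrors ${STORAGE_PATH}/+gitaly/state/${REPO_RELATIVE_PATH}.
func stateDir(storagePath, repoRelativePath string) string {
	return filepath.Join(storagePath, "+gitaly", "state", repoRelativePath)
}

// randomHex returns n random bytes as a hex string (hypothetical helper).
func randomHex(n int) (string, error) {
	buf := make([]byte, n)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf), nil
}

// startLease creates ${STATE_DIR}/pending/${RANDOM_FILENAME} and returns its path.
func startLease(dir string) (string, error) {
	pending := filepath.Join(dir, "pending")
	if err := os.MkdirAll(pending, 0o755); err != nil {
		return "", err
	}
	name, err := randomHex(16)
	if err != nil {
		return "", err
	}
	lease := filepath.Join(pending, name)
	// O_EXCL makes creation fail if the random name already exists.
	f, err := os.OpenFile(lease, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o644)
	if err != nil {
		return "", err
	}
	return lease, f.Close()
}

// endLease removes the lease file and writes a fresh random value to latest.
func endLease(dir, lease string) error {
	if err := os.Remove(lease); err != nil {
		return err
	}
	value, err := randomHex(16)
	if err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(dir, "latest"), []byte(value), 0o644)
}

func main() {
	storage, err := os.MkdirTemp("", "storage")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(storage)

	dir := stateDir(storage, "group/project.git")
	lease, err := startLease(dir)
	if err != nil {
		panic(err)
	}
	// The mutating RPC handler would run here.
	if err := endLease(dir, lease); err != nil {
		panic(err)
	}
	latest, err := os.ReadFile(filepath.Join(dir, "latest"))
	if err != nil {
		panic(err)
	}
	fmt.Printf("latest = %s\n", latest)
}
```

Because each lease has its own random file name, concurrent mutators never contend for a single file, which fits the lockless goal of the design.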

The contents of latest are used along with several other values to form an
aggregate key that addresses a specific request for a specific repository at a
specific repository state:

```
                               ─────┐
      latest         (random value) │
      RPC request    (digest)       │     ┌──────┐
      Gitaly version (string)       ├─────│SHA256│─────▶ Cache key
      RPC Method     (string)       │     └──────┘
      Feature flags  (string)       │
                               ─────┘
```

For example, pushing a new commit to a repository is a mutating operation.
As such, any `git push` regenerates the `latest` file described above and
thus changes the cache key.
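
The aggregate key can be illustrated with a short Go sketch. The five inputs are the ones listed in the diagram, but the struct and field names, the order in which the fields are hashed, and the length-prefixed encoding are assumptions made for this example; the document does not specify how Gitaly serializes them.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// keyInputs holds the values shown in the diagram. Field names are hypothetical.
type keyInputs struct {
	Latest        string // contents of ${STATE_DIR}/latest
	RequestDigest string // digest of the RPC request
	GitalyVersion string
	Method        string // RPC method name
	FeatureFlags  string
}

// cacheKey hashes all inputs with SHA256 and returns the hex digest.
func cacheKey(in keyInputs) string {
	h := sha256.New()
	parts := []string{in.Latest, in.RequestDigest, in.GitalyVersion, in.Method, in.FeatureFlags}
	for _, part := range parts {
		// Length-prefix each field so that ("ab", "c") and ("a", "bc") differ.
		fmt.Fprintf(h, "%d:%s", len(part), part)
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	key := cacheKey(keyInputs{
		Latest:        "example-random-value",
		RequestDigest: "example-request-digest",
		GitalyVersion: "example-version",
		Method:        "InfoRefUploadPack",
		FeatureFlags:  "",
	})
	fmt.Println(key)
}
```

Since `latest` is one of the hashed inputs, rewriting it after a mutating RPC yields a different key, so older cached responses are simply never looked up again.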

## Cache State Machine

The repository state files are used to determine whether the repository is in
a deterministic state (i.e. no mutating RPCs in-flight) and how to find the
valid cached responses for the current repository state. The state machine
diagram follows:

```mermaid
graph TD;
    A[Are there lease files?]-->|Yes|B;
    A-->|No|C;
    B[Are any lease files stale?]-->|Yes|D;
    B-->|No|E;
    C[Does non-stale latest file exist?]-->|Yes|F;
    C-->|No|G;
    D[Remove stale lease files]-->A;
    E[Mutator RPC In-Flight: Cache state indeterministic]
    F[No mutator RPCs In-Flight: Cache state deterministic]
    G[Create/Truncate latest file]-->F

    classDef nonfinal fill:#ccf,stroke-width;
    classDef final fill:#f9f,stroke-dasharray: 5, 5;

    class A,B,C,D,G nonfinal;
    class E,F final;
```
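
The decision flow in the diagram can be approximated in Go as shown below. This is a sketch under stated assumptions rather than Gitaly's implementation: the one-hour `staleAfter` threshold, the use of file modification times to judge staleness, and all function names are hypothetical, and the state directory is assumed to exist already. The example uses `errors.Join`, which requires Go 1.20 or later.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"time"
)

// staleAfter is an assumed staleness threshold, not a value taken from Gitaly.
const staleAfter = time.Hour

func isStale(age time.Duration) bool { return age > staleAfter }

// cacheIsDeterministic follows the diagram: remove stale leases, report false
// while a fresh lease exists, otherwise make sure a non-stale latest exists.
func cacheIsDeterministic(stateDir string) (bool, error) {
	pending := filepath.Join(stateDir, "pending")
	entries, err := os.ReadDir(pending)
	if err != nil && !errors.Is(err, fs.ErrNotExist) {
		return false, err
	}
	inFlight := false
	for _, entry := range entries {
		info, err := entry.Info()
		if errors.Is(err, fs.ErrNotExist) {
			continue // lease removed since the directory was listed
		}
		if err != nil {
			return false, err
		}
		if !isStale(time.Since(info.ModTime())) {
			inFlight = true
			continue
		}
		// Stale lease: remove it and keep scanning.
		err = os.Remove(filepath.Join(pending, entry.Name()))
		if err != nil && !errors.Is(err, fs.ErrNotExist) {
			return false, err
		}
	}
	if inFlight {
		return false, nil // mutator RPC in flight
	}
	latest := filepath.Join(stateDir, "latest")
	info, err := os.Stat(latest)
	if err == nil && !isStale(time.Since(info.ModTime())) {
		return true, nil
	}
	if err != nil && !errors.Is(err, fs.ErrNotExist) {
		return false, err
	}
	// latest is missing or stale: create or truncate it.
	f, err := os.Create(latest)
	if err != nil {
		return false, err
	}
	return true, f.Close()
}

// demo reports the state before, during, and after a simulated lease.
func demo() ([]bool, error) {
	dir, err := os.MkdirTemp("", "state")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	lease := filepath.Join(dir, "pending", "lease")
	if err := os.MkdirAll(filepath.Dir(lease), 0o755); err != nil {
		return nil, err
	}
	before, err1 := cacheIsDeterministic(dir)
	err2 := os.WriteFile(lease, nil, 0o644)
	during, err3 := cacheIsDeterministic(dir)
	err4 := os.Remove(lease)
	after, err5 := cacheIsDeterministic(dir)
	return []bool{before, during, after}, errors.Join(err1, err2, err3, err4, err5)
}

func main() {
	seen, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println(seen) // prints [true false true]
}
```

The single pass over the lease files stands in for the loop from "Remove stale lease files" back to the first check in the diagram: stale leases are removed as they are encountered, and only fresh ones count as in-flight mutators.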

**Note:** There are momentary race conditions where an RPC may become in flight
between the time the lease files are checked and the latest file is inspected,
but this is allowed by the cache design in order to avoid distributed locking.
This means that a stale cached response might be served momentarily, but this
slight delay in fresh responses is a small tradeoff necessary to keep the cache
lockless. The lockless quality is highly desired since Gitaly is often operated
on NFS mounts where file locks are not advisable.

## Cached Responses

When the repository is determined to be in a deterministic state (i.e. no
in-flight mutator RPCs), it is safe to cache responses and retrieve cached
responses. The aggregate key digest is used to form a hexadecimal path to the
cached response in this format:

	${STORAGE_PATH}/+gitaly/cache/${DIGEST:0:2}/${DIGEST:2}

**Note:** The first two characters of the digest are used as a subdirectory so
that the uniform distribution of SHA256 digests spreads the response files
evenly across 256 folders.
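
The path layout above translates into a small helper. The function name `cachePath` and the example storage path and digest are hypothetical; only the directory structure comes from this document. Two hexadecimal characters give 16 × 16 = 256 possible subdirectory names.

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
)

// cachePath mirrors ${STORAGE_PATH}/+gitaly/cache/${DIGEST:0:2}/${DIGEST:2}.
func cachePath(storagePath, digest string) (string, error) {
	if len(digest) < 3 {
		return "", errors.New("digest too short to split into directory and file name")
	}
	return filepath.Join(storagePath, "+gitaly", "cache", digest[:2], digest[2:]), nil
}

func main() {
	// The digest is shortened here for readability; a real SHA256 hex digest
	// has 64 characters.
	p, err := cachePath("/var/opt/storage", "9f86d081884c7d65")
	if err != nil {
		panic(err)
	}
	fmt.Println(p) // on Unix: /var/opt/storage/+gitaly/cache/9f/86d081884c7d65
}
```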

## File Cleanup

Since the disk cache introduces a number of new filesystem constructs, both
state files and cached responses, there needs to be a way to clean up these
files when the normal processes are not adequate.

Gitaly runs background workers that periodically remove stale (>1 hour old)
state files and cached responses. Additionally, Gitaly will remove the cached
responses on program start to guard against any chance that the cache
invalidator was not working in a previous run.
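
A periodic sweep of this kind could look roughly like the Go sketch below. It is an illustration only: the function names, the use of modification time as the age signal, and the demo setup are assumptions, and Gitaly's actual workers may differ. The example uses `errors.Join`, which requires Go 1.20 or later.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"time"
)

// removeStaleFiles deletes regular files under root whose modification time is
// older than maxAge and returns how many were removed.
func removeStaleFiles(root string, maxAge time.Duration) (int, error) {
	removed := 0
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			return nil
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		if time.Since(info.ModTime()) <= maxAge {
			return nil // still fresh, keep it
		}
		if err := os.Remove(path); err != nil {
			return err
		}
		removed++
		return nil
	})
	return removed, err
}

// demo creates one fresh file and one two-hour-old file, then sweeps with a
// one-hour limit.
func demo() (int, error) {
	root, err := os.MkdirTemp("", "cache")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(root)
	fresh := filepath.Join(root, "fresh")
	old := filepath.Join(root, "old")
	twoHoursAgo := time.Now().Add(-2 * time.Hour)
	err = errors.Join(
		os.WriteFile(fresh, nil, 0o644),
		os.WriteFile(old, nil, 0o644),
		os.Chtimes(old, twoHoursAgo, twoHoursAgo),
	)
	if err != nil {
		return 0, err
	}
	return removeStaleFiles(root, time.Hour)
}

func main() {
	removed, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println("removed:", removed) // prints removed: 1
}
```

A background worker would call a function like `removeStaleFiles` on a timer for both the state and cache directories.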

## Considerations

- Note: this feature is available by default in **Omnibus GitLab 12.10.0** and
  above
- The cache will use extra disk space on the Gitaly storage locations. This
  should be actively monitored. [Node exporter] is recommended for tracking
  resource usage.
- There may be initial latency spikes when enabling this feature for large/busy
  GitLab instances until the cache is warmed up. On a busy site like gitlab.com,
  this may last as long as several seconds to a minute.

The following Prometheus queries (adapted from [GitLab's dashboards])
will give you insight into the performance and behavior of the cache:

- [Cache invalidation behavior]
    - `sum(rate(gitaly_cacheinvalidator_optype_total[1m])) by (type)`
    - Shows the Gitaly RPC types (mutator or accessor). The cache benefits from
      Gitaly requests that are more often accessors than mutators.
- [Cache Throughput Bytes]
    - `sum(rate(gitaly_diskcache_bytes_fetched_total[1m]))`
    - `sum(rate(gitaly_diskcache_bytes_stored_total[1m]))`
    - Shows the cache's throughput at the byte level. Ideally, the throughput
      should correlate to the cache invalidation behavior.
- [Cache Effectiveness]
    - `(sum(rate(gitaly_diskcache_requests_total[1m])) - sum(rate(gitaly_diskcache_miss_total[1m]))) / sum(rate(gitaly_diskcache_requests_total[1m]))`
    - Shows how often the cache is invoked for a hit vs a miss. A value close to
      100% is desirable.
- [Cache Errors]
    - `sum(rate(gitaly_diskcache_errors_total[1m])) by (error)`
    - Shows edge case errors experienced by the cache. The following errors can
      be ignored:
        - `ErrMissingLeaseFile`
        - `ErrPendingExists`
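
The arithmetic behind the Cache Effectiveness query can be checked by hand with a small Go helper. The function name and the sample rates are made up for illustration. Returning 0 when there is no traffic is a choice made for this sketch.

```go
package main

import "fmt"

// hitRatio applies (requests - misses) / requests to two per-second rates.
func hitRatio(requests, misses float64) float64 {
	if requests == 0 {
		return 0 // no traffic: report 0 rather than dividing by zero
	}
	return (requests - misses) / requests
}

func main() {
	// 200 requests/s with 50 misses/s means 150 hits/s.
	fmt.Printf("%.0f%%\n", 100*hitRatio(200, 50)) // prints 75%
}
```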

[GitLab's dashboards]: https://dashboards.gitlab.net/d/5Y26KtFWk/gitaly-inforef-upload-pack-caching?orgId=1
[Cache invalidation behavior]: https://dashboards.gitlab.net/d/5Y26KtFWk/gitaly-inforef-upload-pack-caching?orgId=1&fullscreen&panelId=2
[Cache Throughput Bytes]: https://dashboards.gitlab.net/d/5Y26KtFWk/gitaly-inforef-upload-pack-caching?orgId=1&fullscreen&panelId=6
[Cache Effectiveness]: https://dashboards.gitlab.net/d/5Y26KtFWk/gitaly-inforef-upload-pack-caching?orgId=1&fullscreen&panelId=8
[Cache Errors]: https://dashboards.gitlab.net/d/5Y26KtFWk/gitaly-inforef-upload-pack-caching?orgId=1&fullscreen&panelId=12
[Node exporter]: https://docs.gitlab.com/ee/administration/monitoring/prometheus/node_exporter.html
[storage location]: https://docs.gitlab.com/ee/administration/repository_storage_paths.html