Image resizing: collect the data & metrics
Background
To start building an MVC on dynamic image resizing and to make data-backed implementation decisions, we need to understand the numbers.
This issue should focus only on the questions blocking development; everything else (e.g. cost estimates) should go into separate issues.
Feel free to create smaller issues from this one, e.g. to investigate a particular approach.
Known data
Avatars by image type distribution:
1,sln
1,htm
1,html
49,tiff
292,svg
553,bmp
6531,ico
8317,gif
24242,jpeg
223158,jpg
1455709,png
Related issue: #237865 (closed)
Request distribution: amount, statuses, uniqueness
- requests by status with `?width=` (60 minutes) => `200`: 121k, `304`: 273k => 34 r/s for `200`
- data sent (60 minutes) => total: 6GB, average: 47kB
- requests by status for jpg/jpeg/png without `?width=` (60 minutes) => count: 60k, unique: 42k
- 60 minutes for jpg/jpeg/png with `?width=` => `200`: count: 119k, unique: 66k, `304`: count: 271k, unique: 101k
- 24 hours for jpg/jpeg/png with `?width=` => `200`: count: 1.9M, unique: 533k, `304`: count: 4.35M, unique: 417k
- 7 days for jpg/jpeg/png with `?width=` => `200`: count: 10.8M, unique: 1.45M, `304`: count: 20.6M, unique: 744k
Because we consider avatar resizing the MVC, requests with `?width=` are our primary interest.
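The derived figures above (requests per second, average response size, share of unique requests) can be re-checked with a few lines of arithmetic. This is a rough sketch only: the inputs are the rounded counts quoted in this issue, "6GB" is assumed to mean decimal gigabytes, and all variable names are made up for illustration, so the results are approximate.

```python
# Figures quoted above for requests with ?width= over a 60-minute window.
WINDOW_SECONDS = 60 * 60
responses_200 = 121_000
responses_304 = 273_000
bytes_sent_200 = 6 * 10**9  # "6GB"; assumed to be decimal gigabytes


def per_second(count, window_seconds=WINDOW_SECONDS):
    """Average request rate over the window."""
    return count / window_seconds


rate_200 = per_second(responses_200)
avg_body_kb = bytes_sent_200 / responses_200 / 1000
share_200 = responses_200 / (responses_200 + responses_304)

# (count, unique) for jpg/jpeg/png requests with ?width= and status 200.
windows_200 = {
    "60 minutes": (119_000, 66_000),
    "24 hours": (1_900_000, 533_000),
    "7 days": (10_800_000, 1_450_000),
}
unique_share = {
    name: unique / count for name, (count, unique) in windows_200.items()
}

print(f"200 responses per second: {rate_200:.1f}")
print(f"average 200 body size: {avg_body_kb:.1f} kB (from rounded inputs)")
print(f"share of 200 among 200+304: {share_200:.1%}")
for name, share in unique_share.items():
    print(f"unique share of 200 responses over {name}: {share:.1%}")
```

The unique share shrinks as the window grows, which is the number to watch when we later decide whether caching resized images is worth it.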
Kibana:
- Average written bytes / by status / in 1h => https://log.gprd.gitlab.net/goto/0f647c5582b69598e4c326ea24e71974 (please note that the X-axis values are a mix of the `width` values and other param values, e.g. `project_id`)
- Unique vs total image requests / 1h => https://log.gprd.gitlab.net/goto/dae2ca2a6e8e776ae86e8be33ecbfbeb
Used image sizes
GL uses many (8+ for avatars only) image sizes across the UI.
We used Kibana to find those sizes (instead of grepping through the code).
- Used image sizes: https://log.gprd.gitlab.net/goto/3119ad1eb57091c27f2bcc9d63e1ee18
- Validation of sizes where they are used: https://log.gprd.gitlab.net/goto/3983d3f78f711c3903df86e54e9a0607
The dedicated issue to consolidate the image sizes across the UI: #227388
Data stored per uploader
# select uploader, sum(size), count(*) from uploads group by 1 order by 1;
uploader | sum | count
------------------------------------------+---------------+----------
AttachmentUploader | 108 | 1
AvatarUploader | 78432265734 | 1652668
DesignManagement::DesignV432x230Uploader | 76715482 | 2753
FileUploader | 5649412584541 | 12680005
ImportExportUploader | 719753462596 | 9346
NamespaceFileUploader | 2252754362 | 6464
PersonalFileUploader | 5702942092 | 17811
(7 rows)
Time: 563071.403 ms (09:23.071)
- 1 652 668 avatars in total
- ~80 GB total avatar size
- ~48 kB average avatar size
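For reference, the averages can be derived directly from the query output above. The sketch below copies the `sum` and `count` columns as-is and assumes `size` is stored in bytes; the variable names are illustrative only.

```python
# (sum of size, row count) per uploader, copied from the query output above.
# Assumption: `size` is in bytes.
uploads = {
    "AttachmentUploader": (108, 1),
    "AvatarUploader": (78_432_265_734, 1_652_668),
    "DesignManagement::DesignV432x230Uploader": (76_715_482, 2_753),
    "FileUploader": (5_649_412_584_541, 12_680_005),
    "ImportExportUploader": (719_753_462_596, 9_346),
    "NamespaceFileUploader": (2_252_754_362, 6_464),
    "PersonalFileUploader": (5_702_942_092, 17_811),
}

total_bytes = sum(size for size, _ in uploads.values())
average_bytes = {name: size / count for name, (size, count) in uploads.items()}

avatar_total, avatar_count = uploads["AvatarUploader"]
avatar_share = avatar_total / total_bytes

print(f"avatars: {avatar_count} files, {avatar_total / 10**9:.1f} GB")
print(f"average avatar size: {average_bytes['AvatarUploader'] / 1000:.1f} kB")
print(f"avatars as a share of all uploaded bytes: {avatar_share:.2%}")
```

Avatars are numerous but small: they make up only a small fraction of the stored bytes, which supports starting the MVC with them.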
CDN cache hit rate
We use Cloudflare to serve static data to our clients.
That means that some of the image requests would not even hit the app and would be served directly from the CDN.
We should keep in mind that on-prem installs are configured differently (they don't have a CDN), which means they will see a different load.
Infra request issue: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10795 (has more details and example queries).
For a single day (July 8th) worth of logs, that returned:
CacheCacheStatus | count |
---|---|
miss | 3092320 |
revalidated | 3067796 |
hit | 891460 |
unknown | 92410 |
updating | 1265 |
expired | 484 |
stale | 2 |
Understanding the statuses: https://support.cloudflare.com/hc/en-us/articles/200172516
Weekly stats (thanks to Igor!): https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10795#note_376728039
We can compute a total cache hit ratio differently depending on whether we consider `revalidated` a hit or not. `updating` is always considered a hit, as it's served from cache.
- With `revalidated` considered a hit: 50.57%
- With `revalidated` considered a miss: 9.92%
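A minimal sketch of the two ways to compute the ratio, using the single-day counts from the table above (the `miss` count is taken as 3092320). The function name and the choice of denominator (all statuses, including `unknown`) are assumptions. The percentages quoted above may have been computed over a different window or with a different denominator, so this sketch is not expected to reproduce them exactly.

```python
# Single-day status counts copied from the table above.
# Assumption: the `miss` count is 3092320.
status_counts = {
    "miss": 3_092_320,
    "revalidated": 3_067_796,
    "hit": 891_460,
    "unknown": 92_410,
    "updating": 1_265,
    "expired": 484,
    "stale": 2,
}


def cache_hit_ratio(counts, hit_statuses):
    """Share of requests whose cache status is in `hit_statuses`.

    The denominator is every request in `counts` (an assumption).
    """
    total = sum(counts.values())
    hits = sum(counts.get(status, 0) for status in hit_statuses)
    return hits / total


# `updating` always counts as a hit; `revalidated` is the open question.
ratio_revalidated_as_hit = cache_hit_ratio(
    status_counts, {"hit", "updating", "revalidated"}
)
ratio_revalidated_as_miss = cache_hit_ratio(status_counts, {"hit", "updating"})

print(f"revalidated as hit:  {ratio_revalidated_as_hit:.2%}")
print(f"revalidated as miss: {ratio_revalidated_as_miss:.2%}")
```

Either way, the gap between the two numbers shows how much the answer depends on how `revalidated` is classified.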
How we see MVC
- Start avatars-only
- Start dynamic-only, without caching
- Build time-boxed (e.g. 1 day) PoCs to understand if the approach works
- Knowing the number of different sizes and the request for `WebP` in the future, we need something flexible
- We want to test different qualities & resizing methods & buckets & sizes
- We want Rails to control (e.g. via Feature Flags) which resizing method is used, whether it is implemented in WH or somewhere else
- We always need to keep Safe Rollout in mind: partial rollout + ability to switch a resizing method / disable it via Feature Flags
- The static approach may be one of the PoCs, but given the various formats and the unanswered questions, it would be challenging. Still, we may give a PoC a try if we want
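To make the "Rails controls the method, with a partial rollout and a kill switch" idea more concrete, here is a rough sketch of the decision logic. Everything in it is hypothetical: the flag keys (`resizing_method`, `rollout_percentage`), the method names, and the hashing scheme are placeholders for illustration and do not describe how GitLab Feature Flags actually work.

```python
import hashlib

# Hypothetical method names. "original" means: serve the stored file untouched.
KNOWN_METHODS = {"workhorse", "imgproxy"}
FLAG_NAME = "dynamic_image_resizing"  # hypothetical flag name


def rollout_bucket(flag_name, actor_id):
    """Map an actor to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{flag_name}:{actor_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def choose_resizing_method(actor_id, flags):
    """Return the resizing method for this actor, or "original" to skip resizing."""
    method = flags.get("resizing_method", "original")
    percentage = flags.get("rollout_percentage", 0)
    if method not in KNOWN_METHODS:
        # Unknown or disabled method: fall back to the safe default.
        return "original"
    if rollout_bucket(FLAG_NAME, actor_id) < percentage:
        return method
    return "original"


flags = {"resizing_method": "workhorse", "rollout_percentage": 25}
chosen = [choose_resizing_method(user_id, flags) for user_id in range(1000)]
print("resized for", chosen.count("workhorse"), "of 1000 hypothetical users")
```

Setting `rollout_percentage` to 0 or `resizing_method` to an unknown value disables resizing for everyone, and switching the method is a single flag change, which is the Safe Rollout behaviour described above.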
Next steps
- Find how many requests (% and total number) are served directly from the Cloudflare cache => on-prem installs don't have a CDN and will have a different load
- PoC: 48kB avatar (average size) => how many images can be resized per second (brute-force test), with a single-CPU/multi-CPU WH process => would allow us to answer some questions
- PoC with `imgproxy` (https://github.com/imgproxy/imgproxy)
- Check whether we do any image optimization to conserve space (may be effective with `.png`)
- Understand what the high-DPI specifics would be
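The brute-force test could start from a harness along these lines. It is only a sketch: `placeholder_resize` is a stand-in that burns CPU by compressing the payload and does not resize anything, and the payload is random bytes of roughly the average avatar size rather than a real image. A real PoC would plug in the actual resizing call and the benchmark images.

```python
import os
import time
import zlib


def measure_throughput(resize_fn, payload, duration_s=1.0):
    """Call `resize_fn(payload)` in a loop for about `duration_s` seconds.

    Returns completed calls per second, measured on a single process.
    """
    start = time.perf_counter()
    completed = 0
    while time.perf_counter() - start < duration_s:
        resize_fn(payload)
        completed += 1
    elapsed = time.perf_counter() - start
    return completed / elapsed


def placeholder_resize(image_bytes):
    """Stand-in workload (NOT a resize): replace with the real resizing call."""
    return zlib.compress(image_bytes)


# Random bytes of roughly the average avatar size; not a real image.
payload = os.urandom(48_000)

rate = measure_throughput(placeholder_resize, payload, duration_s=0.2)
print(f"{rate:.0f} placeholder operations per second on one process")
```

Running the same loop in several processes at once would give a first multi-CPU number to compare against the ~34 r/s of `200` responses seen in production.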
Additional links
- Recording of the Memory Team Working Hours on this issue: https://youtu.be/2ICU6KLL5W0. I made it private (use the GL-Unfiltered account), as we exposed the logs in the video.
- Images used for benchmarks: images.zip