Image resizing: collect the data & metrics

Background

To start building an MVC on dynamic image resizing and to make data-backed implementation decisions, we need to understand the numbers.

This issue should focus only on the questions that block development; everything else (e.g. cost estimates) should go into separate issues.

Feel free to create smaller issues from this one, e.g. to investigate a particular approach.

Known data

Avatar distribution by image type (count,extension):

1,sln
1,htm
1,html
49,tiff
292,svg
553,bmp
6531,ico
8317,gif
24242,jpeg
223158,jpg
1455709,png
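
As a quick sanity check on the list above, the following minimal sketch computes what share of avatars each format group covers. The counts are copied from the list; the names `AVATAR_COUNTS` and `share` are made up for illustration.

```python
# Avatar counts by file extension, copied from the list above.
AVATAR_COUNTS = {
    "sln": 1, "htm": 1, "html": 1, "tiff": 49, "svg": 292, "bmp": 553,
    "ico": 6531, "gif": 8317, "jpeg": 24242, "jpg": 223158, "png": 1455709,
}


def share(extensions, counts=AVATAR_COUNTS):
    """Return the fraction of avatars whose extension is in `extensions`."""
    total = sum(counts.values())
    return sum(counts[ext] for ext in extensions) / total


if __name__ == "__main__":
    print(f"total avatars: {sum(AVATAR_COUNTS.values())}")
    print(f"png share: {share(['png']):.2%}")
    print(f"jpg/jpeg/png share: {share(['jpg', 'jpeg', 'png']):.2%}")
```

With these counts, png alone is close to 85% of avatars and jpg/jpeg/png together cover more than 99%, which is why the request statistics below focus on those three formats.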

Related issue: #237865 (closed)

Request distribution: amount, statuses, uniqueness

  1. requests by status, with ?width= (60 minutes) => 200: 121k, 304: 273k => ~34 req/s for 200
  2. data sent (60 minutes) => total: 6GB, average: 47kB
  3. requests by status for jpg/jpeg/png without ?width= (60 minutes) => count: 60k, unique: 42k
  4. 60 minutes for jpg/jpeg/png with ?width= => 200: count: 119k, unique: 66k, 304: count: 271k, unique: 101k
  5. 24 hours for jpg/jpeg/png with ?width= => 200: count: 1.9M, unique: 533k, 304: count: 4.35M, unique: 417k
  6. 7 days for jpg/jpeg/png with ?width= => 200: count: 10.8M, unique: 1.45M, 304: count: 20.6M, unique: 744k

Because we consider avatar resizing the MVC, requests with ?width= are our primary interest.
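
To make the numbers above easier to compare across time windows, here is a small sketch that derives the average request rate and the unique-to-total ratio for 200 responses to jpg/jpeg/png requests with ?width=. The figures are the rounded values from the list; `WINDOWS` and `summarize` are hypothetical names, and reading "unique" as "distinct request URLs" is an assumption.

```python
# (label, window length in seconds, 200 count, 200 unique) for jpg/jpeg/png
# requests with ?width=, using the rounded figures from the list above.
WINDOWS = [
    ("60 minutes", 3_600, 119_000, 66_000),
    ("24 hours", 86_400, 1_900_000, 533_000),
    ("7 days", 604_800, 10_800_000, 1_450_000),
]


def summarize(seconds, count, unique):
    """Return (average requests per second, unique/total ratio)."""
    return count / seconds, unique / count


if __name__ == "__main__":
    for label, seconds, count, unique in WINDOWS:
        rate, unique_ratio = summarize(seconds, count, unique)
        print(f"{label}: {rate:.1f} req/s, {unique_ratio:.1%} unique")
```

The unique ratio drops as the window grows (from roughly 55% over one hour to roughly 13% over seven days), which may hint at how much repeated work a cache could save once we move past the dynamic-only MVC.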

Kibana:

  1. Average written bytes / by status / in 1h => https://log.gprd.gitlab.net/goto/0f647c5582b69598e4c326ea24e71974 (please note that the X-axis values are a mix of width values and other param values, e.g. project_id)
  2. Unique vs total image requests / 1h => https://log.gprd.gitlab.net/goto/dae2ca2a6e8e776ae86e8be33ecbfbeb

Used image sizes

GitLab uses many image sizes across the UI (8+ for avatars alone).
We used Kibana to find those sizes instead of grepping through the code.

  1. Used image sizes: https://log.gprd.gitlab.net/goto/3119ad1eb57091c27f2bcc9d63e1ee18
  2. Validation of where those sizes are used: https://log.gprd.gitlab.net/goto/3983d3f78f711c3903df86e54e9a0607

The dedicated issue to consolidate the image sizes across the UI: #227388

Data stored per uploader

# select uploader, sum(size), count(*) from uploads group by 1 order by 1;
                 uploader                 |      sum      |  count   
------------------------------------------+---------------+----------
 AttachmentUploader                       |           108 |        1
 AvatarUploader                           |   78432265734 |  1652668
 DesignManagement::DesignV432x230Uploader |      76715482 |     2753
 FileUploader                             | 5649412584541 | 12680005
 ImportExportUploader                     |  719753462596 |     9346
 NamespaceFileUploader                    |    2252754362 |     6464
 PersonalFileUploader                     |    5702942092 |    17811
(7 rows)
Time: 563071.403 ms (09:23.071)  

1,652,668 avatars in total | ~80 GB total avatar size | ~48 kB average avatar size
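
The summary figures follow directly from the AvatarUploader row of the query output. A minimal sketch of the arithmetic, using decimal units (1 kB = 1000 bytes); the constant names are made up for illustration:

```python
# Values from the AvatarUploader row of the query output above.
AVATAR_TOTAL_BYTES = 78_432_265_734
AVATAR_COUNT = 1_652_668

average_bytes = AVATAR_TOTAL_BYTES / AVATAR_COUNT
total_gb = AVATAR_TOTAL_BYTES / 1e9   # decimal gigabytes
average_kb = average_bytes / 1e3      # decimal kilobytes

if __name__ == "__main__":
    print(f"total: {total_gb:.1f} GB, average: {average_kb:.1f} kB")
```

The exact values come out slightly below the rounded figures (about 78.4 GB and about 47.5 kB), which is consistent with the 47 kB average observed for data sent.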

CDN cache hit rate

We use Cloudflare to serve static data to our clients.
That means that some of the image requests never hit the app and are served directly from the CDN.
We should keep in mind that on-prem installs are configured differently (they don't have a CDN), which means they will see a different load.

Infra request issue with more details and example queries: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10795

For a single day's worth of logs (July 8th), the query returned:

 CacheCacheStatus |  count
------------------+---------
 miss             | 3092320
 revalidated      | 3067796
 hit              |  891460
 unknown          |   92410
 updating         |    1265
 expired          |     484
 stale            |       2

Understanding the statuses: https://support.cloudflare.com/hc/en-us/articles/200172516

Weekly stats (thanks to Igor!): https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10795#note_376728039

We can compute a total cache hit ratio differently depending on whether we consider revalidated a hit or not. updating is always considered a hit, as it's served from cache.

With revalidated considered a hit: 50.57%

With revalidated considered a miss: 9.92%
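
The two ways of counting can be sketched as follows, using the single-day (July 8th) counts from the table above. The miss count is taken as 3,092,320, and treating every status other than hit, updating and (optionally) revalidated as a miss is an assumption of this sketch. `CACHE_STATUS_COUNTS` and `hit_ratio` are hypothetical names.

```python
# Single-day counts per Cloudflare cache status, from the table above.
CACHE_STATUS_COUNTS = {
    "miss": 3_092_320,
    "revalidated": 3_067_796,
    "hit": 891_460,
    "unknown": 92_410,
    "updating": 1_265,
    "expired": 484,
    "stale": 2,
}


def hit_ratio(counts, revalidated_is_hit):
    """Return hits / total, optionally counting 'revalidated' as a hit."""
    hit_statuses = {"hit", "updating"}  # 'updating' always counts as a hit
    if revalidated_is_hit:
        hit_statuses.add("revalidated")
    hits = sum(n for status, n in counts.items() if status in hit_statuses)
    return hits / sum(counts.values())


if __name__ == "__main__":
    print(f"revalidated as hit:  {hit_ratio(CACHE_STATUS_COUNTS, True):.2%}")
    print(f"revalidated as miss: {hit_ratio(CACHE_STATUS_COUNTS, False):.2%}")
```

For this single day the sketch gives roughly 55% and 12%, which differs from the percentages quoted above; those may have been computed over a different window (for example the weekly stats) or with a different treatment of the minor statuses.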

How we see MVC

  1. Start avatars-only
  2. Start dynamic-only, without caching
  3. Build time-boxed (e.g. 1 day) PoCs to understand if the approach works
  4. Knowing the number of different sizes and the request for WebP in the future, we need something flexible
  5. We want to test different qualities & resizing methods & buckets & sizes
  6. We want Rails to control (e.g. via Feature Flags) the resizing method, whether it is implemented in WH (Workhorse) or somewhere else
  7. We always need to keep Safe Rollout in mind: partial rollout + ability to switch a resizing method / disable it via Feature Flags
  8. The static approach may be one of the PoCs, but given the various formats and the unanswered questions, it would be challenging. Still, we may give such a PoC a try if we want

Next steps

  1. Find how many requests (% and total number) are served directly from the Cloudflare cache => on-prem installs don't have a CDN and will see a different load
  2. PoC: 48 kB avatar (average size) => how many images can be resized per second (brute-force test), with a single-CPU and a multi-CPU WH process => would allow us to answer some questions
  3. PoC with imgproxy (https://github.com/imgproxy/imgproxy)
  4. Check whether we do any image optimization to conserve space (may be effective with .png)
  5. Understand what the high-DPI specifics would be
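
For the brute-force PoC in step 2, the measurement loop itself could look roughly like the sketch below. It is written in Python for brevity and does not resize anything: `fake_resize` is a hypothetical stand-in that must be replaced by a call into the resizer under test (Workhorse, imgproxy, or anything else), and the 48 kB payload is dummy data, not a real image.

```python
import time


def measure_throughput(resize_fn, image_bytes, duration_s=1.0):
    """Call resize_fn(image_bytes) repeatedly for about duration_s seconds
    and return the observed number of calls per second."""
    calls = 0
    start = time.perf_counter()
    deadline = start + duration_s
    while time.perf_counter() < deadline:
        resize_fn(image_bytes)
        calls += 1
    elapsed = time.perf_counter() - start
    return calls / elapsed


def fake_resize(image_bytes):
    """Hypothetical placeholder for a real resizer: reverses the bytes
    so that each call does some work proportional to the payload size."""
    return image_bytes[::-1]


if __name__ == "__main__":
    payload = bytes(48_000)  # dummy 48 kB payload, roughly the average avatar size
    print(f"{measure_throughput(fake_resize, payload):.0f} calls/s")
```

This covers only the single-process case. For the multi-CPU comparison, one option would be to run several copies of the loop in parallel and sum the rates, but that depends on how the resizer under test is deployed.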

Additional links

  • Recording of the Memory Team Working Hours on the issue: https://youtu.be/2ICU6KLL5W0. I made it private (use the GL-Unfiltered account), as the logs are exposed in the video.
  • Images used for benchmarks: images.zip
Edited by Aleksei Lipniagov