Image resizing: collect the data & metrics

Background

To start building an MVC on dynamic image resizing and to make data-backed implementation decisions, we need to understand the numbers.

This issue should focus only on the questions that block development; everything else (e.g. cost estimates) should go into separate issues.

Feel free to create smaller issues from this one, e.g. to investigate a particular approach.

Known data

Avatar distribution by image type (count,extension):

1,sln
1,htm
1,html
49,tiff
292,svg
553,bmp
6531,ico
8317,gif
24242,jpeg
223158,jpg
1455709,png
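
As a quick check of how much of the avatar set is jpg/jpeg/png, the share can be computed from the counts above. This is a minimal sketch; the dictionary and the `share` helper are illustrative names, not part of any existing tooling.

```python
# Avatar counts by file extension, copied from the list above.
avatar_counts = {
    "sln": 1, "htm": 1, "html": 1, "tiff": 49, "svg": 292, "bmp": 553,
    "ico": 6531, "gif": 8317, "jpeg": 24242, "jpg": 223158, "png": 1455709,
}


def share(extensions, counts):
    """Return the fraction of all avatars whose extension is in `extensions`."""
    total = sum(counts.values())
    return sum(counts[ext] for ext in extensions) / total


raster_share = share(["jpg", "jpeg", "png"], avatar_counts)
print(f"jpg/jpeg/png share of avatars: {raster_share:.2%}")
```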

Related issue: #237865 (closed)

Request distribution: amount, statuses, uniqueness

requests by status - with ?width= (60 minutes) => 200: 121k, 304: 273k => 34r/s for 200
data sent (60 minutes) => total: 6GB, average: 47kB
requests by status for jpg/jpeg/png without ?width= (60 minutes) => count: 60k, unique: 42k
60 minutes for jpg/jpeg/png with ?width= => 200: count: 119k, unique: 66k, 304: count: 271k, unique: 101k
24 hours for jpg/jpeg/png with ?width= => 200: count: 1.9M, unique: 533k, 304: count: 4.35M, unique: 417k
7 days for jpg/jpeg/png with ?width= => 200: count: 10.8M, unique: 1.45M, 304: count: 20.6M, unique: 744k

Because we consider avatar resizing the MVC, requests with ?width= are our primary interest.
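
To compare the three time windows, the counts for jpg/jpeg/png requests with ?width= can be turned into rates and ratios. This is a rough sketch using the rounded figures listed above; `WINDOWS` and `summarize` are illustrative names.

```python
# Rounded figures for jpg/jpeg/png requests with ?width=, from the list above.
# label: (window in seconds, 200 count, 200 unique, 304 count, 304 unique)
WINDOWS = {
    "60 minutes": (3_600, 119_000, 66_000, 271_000, 101_000),
    "24 hours": (86_400, 1_900_000, 533_000, 4_350_000, 417_000),
    "7 days": (604_800, 10_800_000, 1_450_000, 20_600_000, 744_000),
}


def summarize(seconds, count_200, unique_200, count_304, unique_304):
    """Derive per-second rate and uniqueness ratios for one time window."""
    return {
        "rps_200": count_200 / seconds,               # full responses per second
        "unique_ratio_200": unique_200 / count_200,   # distinct images among 200s
        "unique_ratio_304": unique_304 / count_304,   # distinct images among 304s
        "share_304": count_304 / (count_200 + count_304),
    }


for label, numbers in WINDOWS.items():
    stats = summarize(*numbers)
    print(
        f"{label}: {stats['rps_200']:.1f} r/s for 200, "
        f"unique among 200: {stats['unique_ratio_200']:.1%}, "
        f"304 share: {stats['share_304']:.1%}"
    )
```

The unique ratio among 200 responses shrinks as the window grows, which would be consistent with the same images being requested repeatedly over longer periods.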

Kibana:

Average written bytes / by status / in 1h => https://log.gprd.gitlab.net/goto/0f647c5582b69598e4c326ea24e71974 (please note that the X-axis values are a mix of width values and other param values, e.g. project_id)
Unique vs total image requests / 1h => https://log.gprd.gitlab.net/goto/dae2ca2a6e8e776ae86e8be33ecbfbeb

Used image sizes

GL uses many (8+ for avatars only) image sizes across the UI.
We used Kibana to find those sizes (instead of grepping through the code).

Used image sizes: https://log.gprd.gitlab.net/goto/3119ad1eb57091c27f2bcc9d63e1ee18
Validation of where the sizes are used: https://log.gprd.gitlab.net/goto/3983d3f78f711c3903df86e54e9a0607

The dedicated issue to consolidate the image sizes across the UI: #227388

Data stored per uploader

# select uploader, sum(size), count(*) from uploads group by 1 order by 1;
                 uploader                 |      sum      |  count   
------------------------------------------+---------------+----------
 AttachmentUploader                       |           108 |        1
 AvatarUploader                           |   78432265734 |  1652668
 DesignManagement::DesignV432x230Uploader |      76715482 |     2753
 FileUploader                             | 5649412584541 | 12680005
 ImportExportUploader                     |  719753462596 |     9346
 NamespaceFileUploader                    |    2252754362 |     6464
 PersonalFileUploader                     |    5702942092 |    17811
(7 rows)
Time: 563071.403 ms (09:23.071)

1,652,668 avatars in total | ~80 GB total avatar size | ~48 kB average avatar size
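
The summary figures follow from the AvatarUploader row of the query output. A minimal sketch of the arithmetic (variable names are illustrative):

```python
# Values from the AvatarUploader row of the query output above.
avatar_bytes_total = 78_432_265_734
avatar_count = 1_652_668

average_bytes = avatar_bytes_total / avatar_count

print(f"total: {avatar_bytes_total / 1e9:.1f} GB ({avatar_bytes_total / 2**30:.1f} GiB)")
print(f"average: {average_bytes / 1e3:.1f} kB per avatar")
```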

CDN cache hit rate

We use Cloudflare to serve static data to our clients.
That means that some of the image requests would not even hit the app and would be served directly from the CDN.
We should keep in mind that on-prem installs are configured differently (they don't have a CDN), which means they will see a different load.

Infra request issue (has more details and example queries): https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10795

For a single day (July 8th) worth of logs, that returned:

CacheCacheStatus	count
miss	3092320
revalidated	3067796
hit	891460
unknown	92410
updating	1265
expired	484
stale	2

Understanding the statuses: https://support.cloudflare.com/hc/en-us/articles/200172516

Weekly stats (thanks to Igor!): https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10795#note_376728039

We can compute a total cache hit ratio differently depending on whether we consider revalidated a hit or not. updating is always considered a hit, as it's served from cache.

With revalidated considered a hit: 50.57%

With revalidated considered a miss: 9.92%
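
The two ways of counting can be sketched against the single-day (July 8th) table, reading the miss count as 3,092,320. The results will differ from the percentages quoted above if those come from a different window, such as the weekly stats. Treating expired, stale and unknown as non-hits, and keeping unknown in the denominator, are assumptions of this sketch.

```python
# Single-day (July 8th) counts per cache status, from the table above.
cache_status_counts = {
    "miss": 3_092_320,
    "revalidated": 3_067_796,
    "hit": 891_460,
    "unknown": 92_410,
    "updating": 1_265,
    "expired": 484,
    "stale": 2,
}


def hit_ratio(counts, hit_statuses):
    """Fraction of all requests whose cache status is in `hit_statuses`."""
    total = sum(counts.values())
    hits = sum(counts.get(status, 0) for status in hit_statuses)
    return hits / total


# "updating" is always counted as a hit; "revalidated" is the open question.
ratio_with_revalidated = hit_ratio(cache_status_counts, {"hit", "updating", "revalidated"})
ratio_without_revalidated = hit_ratio(cache_status_counts, {"hit", "updating"})

print(f"revalidated as hit:  {ratio_with_revalidated:.2%}")
print(f"revalidated as miss: {ratio_without_revalidated:.2%}")
```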

How we see MVC

Start avatars-only
Start dynamic-only, without caching
Build time-boxed (e.g. 1 day) PoCs to understand if the approach works
Knowing the number of different sizes and the request for WebP in the future, we need something flexible
We want to test different qualities & resizing methods & buckets & sizes
We want Rails to control (e.g. via Feature Flags) which resizing method is used, whether it is implemented in WH or somewhere else
We need to always keep Safe Rollout in mind: partial rollout + the ability to switch the resizing method or disable it via Feature Flags
The static approach may be one of the PoCs, but given the various formats and the unanswered questions, it would be challenging. Still, we may give that PoC a try if we want
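
The rollout requirements above (partial rollout, switching the method, a kill switch) can be illustrated with a small sketch. Everything here is hypothetical: the flag names, the `choose_resizer` function and the method names are made up for illustration and do not reflect how GitLab Feature Flags or WH are actually implemented.

```python
import hashlib


def rollout_bucket(actor_id: str) -> int:
    """Map an actor (e.g. a user or project id) to a stable bucket in 0..99."""
    digest = hashlib.sha256(actor_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def choose_resizer(actor_id: str, flags: dict):
    """Return the resizing method name for this actor, or None to serve the original.

    `flags` is a hypothetical flag payload with three keys:
      enabled    - global kill switch
      percentage - share of actors (0..100) that get resized images
      method     - name of the resizing method to use
    """
    if not flags.get("enabled", False):
        return None
    if rollout_bucket(actor_id) >= flags.get("percentage", 0):
        return None
    return flags.get("method")


# Hypothetical flag state: 25% rollout of a method called "method_a".
flags = {"enabled": True, "percentage": 25, "method": "method_a"}
for actor in ("user:1", "user:2", "user:3"):
    print(actor, "->", choose_resizer(actor, flags))
```

Hashing the actor id keeps each actor in the same bucket between requests, so raising the percentage only adds actors and never flips existing ones back and forth.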

Next steps

Find how many requests (% and total number) are served directly from the Cloudflare cache => on-prem installs don't have a CDN and will see a different load
PoC: 48 kB avatar (average size) => how many images can be resized per second (brute-force test), with a single-CPU and a multi-CPU WH process => would allow us to answer some questions
PoC with imgproxy (https://github.com/imgproxy/imgproxy)
Check whether we already optimize images to save space (may be effective for .png)
Understand the high-DPI specifics
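
For the brute-force test, the measurement loop itself is simple; a minimal single-process sketch follows. The `fake_resize` function is only a placeholder workload that does no real image processing, and a real PoC would swap in the actual resizer under test. All names here are illustrative.

```python
import time


def measure_throughput(resize, payload, duration_s=1.0):
    """Call resize(payload) repeatedly for about duration_s seconds.

    Returns the observed number of calls per second.
    """
    calls = 0
    start = time.perf_counter()
    deadline = start + duration_s
    while time.perf_counter() < deadline:
        resize(payload)
        calls += 1
    elapsed = time.perf_counter() - start
    return calls / elapsed


def fake_resize(data: bytes) -> bytes:
    """Placeholder workload: NOT a real resize, it just drops every other byte."""
    return data[::2]


# 48 kB of zero bytes stands in for an average-sized avatar.
payload = bytes(48_000)
rate = measure_throughput(fake_resize, payload, duration_s=0.2)
print(f"{rate:.0f} calls/s (placeholder workload, single process)")
```

A multi-CPU run could start the same loop in several worker processes and sum the rates; that part is left out of this sketch.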

Additional links

Recording of the Memory Team Working Hours on this issue: https://youtu.be/2ICU6KLL5W0. I made it private (use the GL-Unfiltered account), as we exposed the logs in the video.
Images used for benchmarks: images.zip

Edited Aug 17, 2020 by Aleksei Lipniagov