scd-driven keyframes

As we are all well aware, the “scd=1” parameter has been a nothingburger since March 2022: it does little except print the warning “SVT-AV1 has an integrated mode decision mechanism to handle scene changes and will not insert a key frame at scene changes”.

It is also said (“Scene Change Detection”) that SVT-AV1 is sufficiently flexible and thus, when it detects a scene change, the encoder automatically relies less on temporally neighboring frames, adapting to scene changes without affecting the GOP structure (frequency of key frames).

So, there are two arguments in favour of not placing keyframes at scene changes:

  1. Keyframes are more costly (in terms of added bytes) than regular frames (even when that regular frame is affected by a scene change and thus it relies less on temporally neighboring frames). A scene change is also not likely to correspond with a boundary between GOPs, and thus such keyframe's intrinsic cost is usually increased by the cost of the disruption caused in the otherwise regular GOP structure.

  2. Typical GOP sizes are large in video files intended for home use (for example, “GOP Size Selection” suggests a range of 5 to 10 seconds). Scene changes can be more frequent than that in dynamic videos, and thus, if a key frame is created for every scene change, the overall frequency of key frames increases and their average cost per minute (in terms of added bitrate) increases proportionally.

Let me offer the following four counter-arguments in favour of scd-driven keyframes:

  1. Keyframes are more costly than regular frames. Scene change frames, while cheaper than keyframes, rely less on temporally neighboring frames and are thus also more costly than the other (predictable) frames. These two costs have partly the same causes. (Keyframes cannot be used to predict the previous frames that reside in another closed GOP, but a scene change also makes the temporally adjacent frames less predictable. Irregular keyframes disrupt the regular GOP structure, but a scene change is also likely to appear in the middle of a GOP and to make that GOP's structure less useful across the scene boundary.) Being so similar, these costs are not additive: promoting a scene change frame (which is already fairly costly) to a keyframe looks like a less costly action (in terms of total cost) than making a keyframe out of a regular frame (which would otherwise be predictable and cheap). One may argue that this change in total cost is relatively minuscule and is bound to be outweighed by the increased frequency of keyframes. (After all, before March 2022 enabling the “scd=1” parameter really hurt BD-Rate by about 5%.) However, greater care could be taken not to increase the frequency of keyframes: I imagine that instead of “creating a new keyframe on every scene change” SVT-AV1 could still “create only the necessary keyframes, but create them slightly earlier or slightly later whenever it detects a not too distant scene change”. That frequency would then not rise, because the value of the “keyint” parameter would still be honoured (though only as an average distance) and the total number of keyframes would stay the same.

  2. Keyframes are intra-coded and thus they differ from the adjacent frames. When a keyframe appears at a scene change, its different nature is hidden by that change. When a keyframe appears in the middle of a short or dynamic scene, its different nature does not leave a long-lasting impression and is lost in the kaleidoscope of scene changes. However, when a keyframe appears in the middle of a long scene that is itself not very dynamic, some viewers are able to perceive the different nature of that keyframe. Sometimes the difference is small, almost unnoticeable. High CRF values and lower fidelity make keyframes more apparent. Most often such a keyframe is seen as a sudden rise in quality: all the mistakes of motion-compensated prediction, all the insufficiencies of residual coding are suddenly corrected. By that contrast it becomes painfully clear how bad things had been before. (See the libsvtav1-encoded video quote https://take-me-to.space/Y7xuvYv.webm before and after its keyframes at 1:20.143 and at 2:40.279, for example.) Sometimes the effect is the opposite: when a scene change happens soon after the keyframe, SVT-AV1 is smart enough to decide that the keyframe is useless (in terms of predicting the other frames of its GOP) and to make that keyframe significantly bit-starved in favour of the next frames, and thus the keyframe's quality is visibly worse than the quality of the preceding frames. (See the same example https://take-me-to.space/Y7xuvYv.webm and compare the quality of its keyframe at 1:00.109 with the quality of the previous frame.)
Both of these visual effects (the sudden quality gain and the sudden quality loss) are somewhat detrimental to the psychovisual impression of such videos, and both can be easily avoided by not creating any keyframes in the middle of scenes (it is usually possible to create the keyframes earlier or later, at scene changes, unless some particularly long scene is significantly longer than the value of the “keyint” parameter).

  3. FFmpeg has recently become able to remux AV1 keyframes to AVIF files. Such remuxing does not incur the usual quality loss (generation loss) of saving frames as lossy images or the file size bloat of saving frames as lossless images. Because it does not recompress anything, such remuxing is also much faster than the usual saving of frames. However, only keyframes can be remuxed to static AVIF files; an animated AVIF can contain a remuxed video quote, but it also has to start from an existing AV1 keyframe. Hence the results of such remuxing are more meaningful (plot-wise) when the keyframes themselves are placed at scene changes instead of at seemingly random moments of the scenes (albeit sampled at very regular “keyint” intervals). This becomes more obvious when a keyframe is too close to the beginning of the next scene and thus the keyframe's quality is too low (such as the keyframe in https://take-me-to.space/Y7xuvYv.webm at 1:00.109 in the previous example). In that case the quality of a static AVIF remux is degraded (such as the remux https://take-me-to.space/SqW2hMo.avif of the keyframe I just mentioned), and an animated AVIF remux does not have a meaningful start because it starts with a very short flash of a frame (or of a few frames) not really related to the next scene.

  4. Some video players (such as MPC-HC) have hotkeys or buttons for the commands “Jump Forward (keyframe)” and “Jump Backward (keyframe)”. If the keyframes are placed at regular intervals, then these commands merely provide faster navigation through the video. These commands can also have an implied plot-related meaning (“skip this scene by jumping to the beginning of the next scene” and “replay this scene from the very beginning”), but only if the keyframes of the video are placed at scene changes.

This lengthy text is a feature request: please consider re-implementing scd-driven keyframes, this time without changing the average frequency of the keyframes.
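To make the request more concrete, here is a minimal sketch of the placement rule imagined in counter-argument 1. It is only an illustration under my own assumptions, not SVT-AV1 code: the function `place_keyframes`, its parameters and the `max_shift` tolerance are hypothetical names, timestamps are in seconds, and the scene change list is assumed to come from some detector. Every keyframe keeps its nominal slot on the regular “keyint” grid and is merely moved to the nearest scene change within `max_shift` of that slot, so the number of keyframes is the same as with regular placement.

```python
def place_keyframes(duration, keyint, scene_changes, max_shift):
    """Return keyframe timestamps (seconds) for a clip of the given duration.

    Hypothetical sketch: one keyframe per nominal keyint slot, snapped to the
    nearest detected scene change if one lies within max_shift of the slot.
    """
    # Keeping max_shift below keyint / 2 guarantees that keyframes stay ordered.
    if not 0 <= max_shift < keyint / 2:
        raise ValueError("max_shift must be in [0, keyint / 2)")

    keyframes = [0.0]  # the first frame is always a keyframe
    slot = 1
    while slot * keyint < duration:
        nominal = slot * keyint
        # Scene changes close enough to this slot to be worth snapping to.
        nearby = [t for t in scene_changes if abs(t - nominal) <= max_shift]
        if nearby:
            chosen = min(nearby, key=lambda t: abs(t - nominal))
        else:
            chosen = nominal  # no scene change nearby: keep the regular position
        keyframes.append(chosen)
        slot += 1
    return keyframes


if __name__ == "__main__":
    # keyint=20s, scene changes at 3.0, 18.5, 31.0 and 44.0 seconds
    print(place_keyframes(60.0, 20.0, [3.0, 18.5, 31.0, 44.0], 5.0))
    # -> [0.0, 18.5, 44.0]   (regular placement would give [0.0, 20.0, 40.0])
```

Because the slots themselves never drift, the distance between two consecutive keyframes in this sketch is at most keyint + 2 * max_shift, and “keyint” is still honoured as the average distance. A real encoder would of course have to make this decision inside its lookahead window rather than over a complete list of scene changes.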

Postscriptum: the example of an anime quote (hyperlinked above) is the final part of the prominent “inescapable back fat spiral” scene from “Ramen Daisuki Koizumi-san” (episode 9, 16:33.95 to 20:40.073). The FFmpeg settings for that video quote were -crf 60 -b:v 0 -c:v libsvtav1 -svtav1-params keyint=20s:scd=1:lookahead=120:hierarchical-levels=5:tune=0:lp=1 -preset 0 -flags +cgop.