Forcing the 16-bit pipeline has an unexpected performance impact in the LD configuration
What version / commit were you testing with? (git describe can produce this info.)
Commit 6a6a33fd on the main branch
What steps will reproduce the problem?
aomenc /cephfs/group/sng-science/sequences_aom/av2ctcfinal/BlueSky_360p25.y4m --passes=1 --lag-in-frames=0 --auto-alt-ref=1 --min-gf-interval=16 --max-gf-interval=16 --gf-min-pyr-height=4 --gf-max-pyr-height=4 --skip=0 --limit=130 --use-fixed-qp-offsets=1 --deltaq-mode=0 --kf-min-dist=9999 --subgop-config-str=ld --enable-tpl-model=0 --end-usage=q --test-decode=fatal --obu --qp=185 --enable-keyframe-filtering=0 -o Bin_cpu-used0use-16bit-internal_T3LB_A4_S01_qp185_s0_f130.bin --bit-depth=8 --psnr --tile-columns=0 --threads=1 --row-mt=1 --test-decode=fatal --cpu-used=0 --use-16bit-internal 1> Enc_cpu-used0use-16bit-internal_T3LB_A4_S01_qp185_s0_f130.log 2>&1
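For reference, a minimal A/B sketch of the comparison being made (the repository path ./avm, the CMake build steps, and the PSNR grep pattern are assumptions for illustration, not taken from the original report): build the two revisions, run the identical encode command with each binary, and diff the reported PSNR summary lines.

#!/bin/bash
# Encode the same clip at the two revisions under comparison and diff the PSNR summaries.
# Assumptions: the research repo is checked out in ./avm, cmake is available, and SEQ
# points at the CTC sequence used in the report.
set -e

SEQ=/cephfs/group/sng-science/sequences_aom/av2ctcfinal/BlueSky_360p25.y4m
FLAGS=(--passes=1 --lag-in-frames=0 --auto-alt-ref=1 --min-gf-interval=16
       --max-gf-interval=16 --gf-min-pyr-height=4 --gf-max-pyr-height=4 --skip=0
       --limit=130 --use-fixed-qp-offsets=1 --deltaq-mode=0 --kf-min-dist=9999
       --subgop-config-str=ld --enable-tpl-model=0 --end-usage=q --obu --qp=185
       --enable-keyframe-filtering=0 --bit-depth=8 --psnr --tile-columns=0
       --threads=1 --row-mt=1 --test-decode=fatal --cpu-used=0 --use-16bit-internal)

for rev in 6a6a33fd research-v2-rc1; do
  git -C avm checkout "$rev"
  cmake -S avm -B "build_$rev" -DCMAKE_BUILD_TYPE=Release
  cmake --build "build_$rev" -j"$(nproc)"
  "./build_$rev/aomenc" "$SEQ" "${FLAGS[@]}" -o "out_$rev.bin" > "enc_$rev.log" 2>&1
done

# Compare the per-run summary lines (the PSNR grep is a guess at the log format).
diff <(grep -i psnr enc_6a6a33fd.log) <(grep -i psnr enc_research-v2-rc1.log)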
What is the expected output?
The same performance results as with the tagged version research-v2-rc1.
What do you see instead?
The results differ from those of research-v2-rc1.
Please use labels and text to provide additional information.