Remove copy relocation and optimize locally defined symbol access

https://sourceware.org/pipermail/binutils/2021-April/116150.html has explained that we do not need a marker.

clang always emits local access for protected definitions, even on i386/x86-64
protected data+copy relocations never work on non-x86. (glibc has support for arm/aarch64 but binutils doesn't support it)
multiple glibc maintainers have expressed that protected data+copy relocation is fragile even on x86.
gold/ld.lld have never supported protected data+copy relocations, even for x86 (https://sourceware.org/bugzilla/show_bug.cgi?id=19823)

GCC x86 can use local access for protected symbols today without breaking anything.

All accesses to protected definitions are local access.

In executable, all accesses to defined symbols are local access.

They already work.

All global function pointers, whose function bodies aren't locally defined, must use GOT.

It can be done without a marker. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100593 Some architectures (e.g. i386/ppc32) require a GOT register because they don't support PC-relative instructions. These are legacy architectures. We can leave them unchanged.

Other architectures can default to "use GOT to take the address of an external default visibility function" in -fno-pic mode. We can add an option -fdirect-access-external-function for rare users who want the original -fno-pic behavior.

This has zero cost for most pieces of software. For example, the built clang is byte identical if I make the above change. https://lists.llvm.org/pipermail/llvm-dev/2021-June/150910.html

All read/write accesses to symbols, which aren't locally defined, must use GOT.

clang -fno-pic -fno-direct-access-external-data does this today. The GCC feature request is https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98112

There may be some little cost because taking the address of an external default visibility global variable is more frequent, though I don't think it can be a bottleneck of anything.

We can add ld warnings when R_*_COPY is present. Users can add an ld option to suppress the warning. No marker is needed. Once the ld warning is widespread, we can add a glibc ld.so warning for R_*_COPY.

Branches to undefined symbols may use PLT.

The 2018 R_X86_64_PLT32 scheme for call/jmp foo has already done this.

Personally I think we can just do R_386_PLT32 for i386 as well but you mentioned that i386 is legacy and should remain unchanged. For other folks, you can find a summary on https://maskray.me/blog/2021-01-09-copy-relocations-canonical-plt-entries-and-protected#non-default-visibility-ifunc-and-r_386_pc32

HAVE_LD_PIE_COPYRELOC should be fixed as soon as possible. The patch has sat there for a while: https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570139.html

I hope folks can focus on functions/canonical PLT entries as the first step because it will give immediate performance boost. Once canonical PLT entries are eliminated, we can safely build software with ld -Bsymbolic-non-weak-functions (drop many R_*JUMP_SLOT and some function address caused R_*_GLOB_DAT; convert some absolute relocations to R_*_RELATIVE).

Both gold and GNU ld patches are available: https://sourceware.org/pipermail/binutils/2021-May/116748.html

You can find more information on https://maskray.me/blog/2021-05-16-elf-interposition-and-bsymbolic

Here is a testcase, bug.tar.xz:

[hjl@gnu-cfl-2 copyreloc-3]$ make
gcc    -c -o lib.o lib.S
gcc  -shared -o libbar.so lib.o
gcc -fPIC -o x1 x.c libbar.so -Wl,-rpath,.
gcc -O2 -o x2 x.c libbar.so -Wl,-rpath,.
./x1
./x2
make: *** [Makefile:6: all] Aborted (core dumped)
[hjl@gnu-cfl-2 copyreloc-3]$

I'd like to see:

No R_X86_64_COPY relocations in executables, even when they are built without -fPIC or -fPIE.
ld and ld.so should work together to detect the issue caused by R_X86_64_COPY at compile-time and/or run-time.

R_X86_64_COPY removal won't happen overnight. We need ways to detect and mitigate the potential R_X86_64_COPY related issues before R_X86_64_COPY is completely removed.

./x2 demonstrates an issue related to the GCC x86-64 HAVE_LD_PIE_COPYRELOC. Other architectures don't have the problem.

milestones:

1. eliminate copy relocations for x86-64 -fpie https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570139.html
2. eliminate canonical PLT entries for -fno-pic (all architectures) https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100593
3. eliminate copy relocations for -fno-pic (all architectures; if GOT indirection cost is a concern, opt out the legacy architectures (i386/ppc32))

We can do 1 and 2 immediately.

After we do 2, we can let ld default to warn for canonical PLT entries (st_shndx==0,st_value!=0). When the ld warning has been there for a while, let ld.so warn for canonical PLT entries.
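The symbol-table pattern named above (st_shndx==0, st_value!=0) is simple enough to sketch. The following is only an illustration of the check, not code from ld or ld.so: the function names are made up for this example, and it assumes a Linux-style <elf.h> that provides Elf64_Sym and SHN_UNDEF.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>

/* Hypothetical helper: returns nonzero if a dynamic symbol matches the
 * pattern quoted above, i.e. it is undefined (st_shndx == SHN_UNDEF, which
 * is 0) yet still carries a nonzero st_value. */
static int is_canonical_plt_entry(const Elf64_Sym *sym)
{
    assert(sym != NULL);
    return sym->st_shndx == SHN_UNDEF && sym->st_value != 0;
}

/* Hypothetical helper: counts matching entries in a .dynsym-like array.
 * A diagnostic could be emitted whenever the result is nonzero. */
static size_t count_canonical_plt_entries(const Elf64_Sym *syms, size_t n)
{
    assert(syms != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += is_canonical_plt_entry(&syms[i]) ? 1 : 0;
    return count;
}
```

An ordinary undefined function symbol has st_value == 0, so it is not counted; only entries whose address has been pinned in the executable match.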

Distribution-wide default ld -Bsymbolic-non-weak-functions is safe after 2.

Copy relocations are a bit subtle because some badly written assembly files may have problems. Some users may prefer performance despite copy relocations on architectures without x86-64 GOTPCRELX/ppc64 TOC optimization.

After we do 3, we can let ld default to warn for R_*_COPY. When the ld warning has been there for a while, let ld.so warn for R_*_COPY.
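As a rough sketch of what such a check amounts to on x86-64, the function below scans a relocation array for R_X86_64_COPY. It is an assumption-laden illustration, not the ld or glibc implementation: the function name is invented here, only the x86-64 relocation type is handled (the text says R_*_COPY for all architectures), and a Linux-style <elf.h> is assumed.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>

/* Hypothetical helper: counts R_X86_64_COPY entries in a .rela.dyn-like
 * array. A warning could be issued when the count is nonzero. */
static size_t count_copy_relocations(const Elf64_Rela *relocs, size_t n)
{
    assert(relocs != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        /* The low 32 bits of r_info hold the relocation type. */
        if (ELF64_R_TYPE(relocs[i].r_info) == R_X86_64_COPY)
            count++;
    }
    return count;
}
```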

Note that many action items can be parallelized.

I don't think compiler/assembler need any marker. Many assembly files are written with good -fPIC/-fPIE in mind. They should not need a marker like .note.GNU-stack

A concrete list of action items:

eliminate copy relocations for x86-64 -fpie
- GCC x86-64: default to GOT indirection for external data symbols in -fpie mode. Patch: https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570139.html
eliminate canonical PLT entries for -fno-pic (all architectures)
- implement https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100593. Optionally add -fdirect-access-external-function for users who want the original behavior.
- make ld default to warn for canonical PLT entries (st_shndx==0, st_value!=0). On i386 this depends on R_386_PLT32 (see below)
- make glibc ld.so warn for canonical PLT entries
eliminate copy relocations for -fno-pic (all architectures; if GOT indirection cost is a concern, opt out the legacy architectures (i386/ppc32))
- implement -fno-direct-access-external-data PR98112
- make -fno-pic default to -fno-direct-access-external-data for most architectures. Some users may prefer performance despite copy relocations without x86-64 GOTPCRELX/ppc64 TOC optimization. They can opt out.
- make ld default to warn for R_*_COPY
- make glibc ld.so warn for R_*_COPY
GCC: treat STV_PROTECTED similar to STV_HIDDEN
- GCC aarch64/arm/x86/...: allow direct access relocations on protected symbols in -fpic mode.
GNU ld: treat STV_PROTECTED similar to STV_DEFAULT in -Bsymbolic mode
- GNU ld aarch64/x86: allow direct access relocations on protected data symbols in -shared mode.
- GNU ld x86: disallow copy relocations on protected data symbols. (I think canonical PLT entries on protected symbols have been disallowed.)

After elimination of canonical PLT entries, we can safely enable distribution-wide default ld -Bsymbolic-non-weak-functions. This will improve performance for lots of software, especially for short-lived processes where relocation symbol lookup takes a significant portion of the run time.

x2 is a non-PIE executable which has nothing to do with HAVE_LD_PIE_COPYRELOC. A marker provides a way to identify issues with R_X86_64_COPY at link-time as well as run-time. We have used the marker for CET enabling successfully.

gcc -fno-pic -O2 -o x2 x.c libbar.so -Wl,-rpath,. has problems on all architectures
gcc -fpie -O2 -o x2 x.c libbar.so -Wl,-rpath,. has a problem only on x86-64. It is related to HAVE_LD_PIE_COPYRELOC

Many distributions configure GCC with --enable-default-pie.

CET has size cost and performance cost. Both SHSTK and IBT have interaction with many applications (stack manipulation, setjmp, JIT, etc). It is good to use an opt-in strategy.

Eliminating canonical PLT entries for -fno-pic has zero cost for most software (taking the address is rare, even rarer after SROA/indirect-to-direct call optimization/inlining/etc), e.g. a bootstrapped clang is byte identical.

For copy relocation elimination, many groups who prefer not to handle GNU_PROPERTY want to benefit from it as well.

Many assembly files are PIC aware. They should not add new markers to enable optimizations.

OK, ultimately I think I'd prefer to see these things fixed instead of keeping the current state as-is. If you want GNU_PROPERTY, I think it is fine as long as it is optional. For example, I can imagine that *BSD/Fuchsia/perhaps other ELF OSes just want to get rid of copy relocations/canonical PLT entries but don't want to deal with GNU_PROPERTY.

Not all GCC binaries are built with --enable-default-pie. Even if they are, they still support -no-pie. R_X86_64_COPY removal should be done for PIE and non-PIE.
R_X86_64_COPY removal on Linux will be done piece by piece. At link-time and run-time, we need to know which .o/.so are R_X86_64_COPY free. We need to track it for both assembly sources as well as high level language sources.

Once (a) HAVE_LD_PIE_COPYRELOC is fixed and (b) x86-64 -fno-pic defaults to -fno-direct-access-external-data, pure C/C++ software will be free of R_X86_64_COPY. (Note: -fpic/-fpie default to -fno-direct-access-external-data)

What remains is a small number of pieces of software with bad assembly (I think most have good assembly). The ld warning/error (think of the binutils configure option --enable-textrel-warning=warning) can expose them. Once an ld with the warning/error is widespread, glibc ld.so can start to warn as well.

At run-time, R_X86_64_COPY on symbol foo in executable is a problem only when the shared library, which defines foo, doesn't expect R_X86_64_COPY. Before R_X86_64_COPY is completely removed, a marker on such shared libraries will help ld.so issue an error only when necessary. Otherwise R_X86_64_COPY removal on Linux may be too difficult to happen.

We also need to make sure that a simple rebuild of a shared object with an updated toolchain does not break ABI because the object is no longer compatible with R_X86_64_COPY relocations in a main program.

[As I mentioned on https://lists.llvm.org/pipermail/llvm-dev/2021-June/150933.html , I care more about the function case and less about the variable case (copy relocations).] Regarding the function case: does this proposal interact with -fno-semantic-interposition? https://gcc.gnu.org/pipermail/gcc-patches/2021-June/572018.html

Regarding the variable case: How does the proposal change shared object behaviors? I cannot find any which can make executable R_X86_64_COPY incompatible. If folks feel that a GNU PROPERTY is useful for copy relocations, I may not object, but I think it is good to make the function case and STV_PROTECTED fixes separate.

If a rebuilt shared lib that formerly used normal ELF lookup for data object 'foo' now uses local lookup for 'foo', it will break any existing executable that uses COPY for 'foo'. Formerly the version in the executable prevailed and was used by both it and the shared lib. If the shared lib is then updated to use local lookup, it uses the copy in the shared lib, while the executable still uses its own copy, so you have two copies and trouble.

So, that cannot be done by default, it's an ABI change.
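The two-copies failure described above can be mimicked with a toy model in plain C. Nothing here touches a real dynamic loader: struct image, enum lib_access and both functions are invented for this sketch, and the "binding" step is only a stand-in for what ld.so does when it processes a copy relocation.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of one loaded component (executable or shared lib). */
struct image {
    int  storage;  /* this component's own storage for 'foo' */
    int *foo;      /* address this component's code uses to reach 'foo' */
};

enum lib_access {
    LIB_VIA_GOT,   /* normal ELF lookup: the executable's definition prevails */
    LIB_LOCAL      /* local lookup: the library always uses its own storage */
};

/* Models startup when the executable has a COPY relocation for 'foo'. */
static void bind_with_copy_reloc(struct image *exe, struct image *lib,
                                 enum lib_access mode)
{
    assert(exe != NULL && lib != NULL);
    exe->storage = lib->storage;   /* copy the initial value into the exe */
    exe->foo = &exe->storage;      /* the exe accesses its copy directly */
    lib->foo = (mode == LIB_VIA_GOT) ? &exe->storage : &lib->storage;
}

/* The library writes 'foo'; returns the value the executable then reads. */
static int exe_sees_after_lib_write(struct image *exe, struct image *lib,
                                    int value)
{
    *lib->foo = value;
    return *exe->foo;
}
```

With LIB_VIA_GOT both components share one object, so a write in the library is visible in the executable. With LIB_LOCAL the library updates its own storage and the executable keeps reading the stale copy, which is the ABI break being discussed.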

Remove copy relocation, add canonical function address and optimize locally defined symbol access:

All accesses to protected definitions are local access. In executable, all accesses to defined symbols are local access. All global function pointers, whose function bodies aren't locally defined, must use GOT. All read/write accesses to symbols, which aren't locally defined, must use GOT. Branches to undefined symbols may use PLT.

So, I agree with all these items, I think. I would even go so far as to say that these are the intended ELF rules and any deviation from that is actually a bug or at least a quality of implementation issue.

These should be enforced by

Compiler: Add a compiler option, -fsingle-global-definition

But I don't really see why you would need this? The compiler needs to know if the compilation model is for an executable or a shared library, but otherwise every item of your list can be inferred by the compiler right now in such a way that -fsingle-global-definition wouldn't make a difference. (e.g. if it sees a reference without definition, and currently compiles as shared lib it can't infer that the definition will be in this component, with or without -fsingle-global-definition).

I think your wish for markers in assembler and linker is over-engineering things, but if you want to put in the work for that ...

Because this is an ABI-breaking change, it shouldn't be enabled by default. Here is the proposal: Unique-2021.pdf

I would argue the 2016/GCC5 changes of emitting copy-relocs for protected symbols were the ABI change and this proposal now only fixes that bug.

Are you at least agreeing that only the items related to protected symbols are a change?

Also, in the linked presentation I still don't see reasons for a compiler flag. Can you spell out the specific changes that the flag would enable in the compiler, and, for each of these changes also say why you think that that shouldn't simply be done always, even without the flag?

For instance, I will argue that always emitting GOT access for global undefined data should be done even without the flag, i.e. we can rely on linker relaxation for performance?

Here are the updated slides: Unique-2021.pdf

Initial findings while compiling Qt 6 with patch https://codereview.qt-project.org/c/qt/qtbase/+/355956. The patch does:

enable -fsingle-global-definition for everything
modify the Q_DECL_EXPORT macro to be __attribute__((visibility("protected"))) instead of "default"
introduce Q_DECL_EXPORT_OVERRIDABLE that remains "default" for the handful of symbols that must be overridable (change incomplete)

The patch also stops forcing -fPIC on everyone, so applications are now free to compile with -fPIE or nothing.
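For readers unfamiliar with the visibility attribute, here is a minimal sketch of the macro split described above. It is not the Qt patch: the DEMO_* macros and all symbol names are invented for illustration, and it assumes GCC or Clang on an ELF target.

```c
#include <assert.h>

/* Hypothetical stand-ins for the two export macros described above. */
#define DEMO_EXPORT             __attribute__((visibility("protected")))
#define DEMO_EXPORT_OVERRIDABLE __attribute__((visibility("default")))

/* Exported, but references from inside the defining component always
 * bind to this definition. */
DEMO_EXPORT int demo_instance_count = 0;

/* Exported and still interposable by another component. */
DEMO_EXPORT_OVERRIDABLE int demo_hook(int value) { return value; }

DEMO_EXPORT int demo_register(int value)
{
    assert(value >= 0);
    demo_instance_count++;      /* protected data */
    return demo_hook(value);    /* default-visibility function */
}
```

Built into a shared library, the DEMO_EXPORT symbols would be expected to show up as PROTECTED in the symbol table and demo_hook as DEFAULT.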

As part of the testing, I built:

one static library (libBootstrap.a)
one shared library (libQt6Core.so.6)
one executable linked to the static library
one executable linked to the shared library

The static parts are unaffected from the regular build. They did get -fsingle-global-definition, but the symbols appear to be unchanged.

The shared library did get PROTECTED symbols. For example:

 5033: 0000000000686aa8      8 OBJECT  GLOBAL PROTECTED     31 QCoreApplication::self@@Qt_6
 5894: 00000000001fc250    373 FUNC    GLOBAL PROTECTED     13 QTimer::start(int)@@Qt_6
 6927: 00000000001fb4b0     47 FUNC    GLOBAL PROTECTED     13 QTimer::timeout(QTimer::QPrivateSignal)@@Qt_6

The test application linked to the shared library was:

#include <QCoreApplication>
#include <QTimer>

int main(int argc, char **argv)
{
    QCoreApplication a(argc, argv);
    QTimer t;
    QObject::connect(&t, &QTimer::timeout, QCoreApplication::instance(), []() {
        QCoreApplication::instance()->quit();
    });
    t.start(250);
    return a.exec();
}

This test will exercise:

data access
function pointer access
calling functions

Compiled as:

g++ -c -pipe -pipe -march=skylake -O3 -g1 -Wall -Wextra -fsingle-global-definition -D_REENTRANT -DQT_NO_DEBUG -DQT_CORE_LIB -I. -I. -I/home/tjmaciei/obj/qt/qt6-release/qtbase/include -I/home/tjmaciei/obj/qt/qt6-release/qtbase/include/QtCore -I. -I/home/tjmaciei/obj/qt/qt6-release/qtbase/mkspecs/linux-g++-optimised -o main.o main.cpp
g++ -pipe -march=skylake -Wl,-O1 -Wl,--as-needed -Wl,-z,relro -Wl,--enable-new-dtags -fuse-linker-plugin -Wl,-O1 -Wl,-rpath,/home/tjmaciei/obj/qt/qt6-release/qtbase/lib -o protected-visibility main.o   /home/tjmaciei/obj/qt/qt6-release/qtbase/lib/libQt6Core.t.so -pthread -lpthread

Note: no -fPIC or -fPIE.

eu-readelf reports these symbols in the executable:

    1: 0000000000000000      0 FUNC    GLOBAL DEFAULT    UNDEF QTimer::start(int)@Qt_6 (2)
    2: 0000000000000000      0 OBJECT  GLOBAL DEFAULT    UNDEF QCoreApplication::self@Qt_6 (2)
    5: 0000000000000000      0 FUNC    GLOBAL DEFAULT    UNDEF QTimer::timeout(QTimer::QPrivateSignal)@Qt_6 (2)

Disassembly where those symbols are used:

  40114d:       mov    0x2e7c(%rip),%rax        # 403fd0 <QCoreApplication::self@Qt_6>
  401154:       mov    $0x18,%edi
  401159:       mov    (%rax),%r12
  40115c:       mov    0x2e75(%rip),%rax        # 403fd8 <QTimer::timeout(QTimer::QPrivateSignal)@Qt_6>
  401163:       movq   $0x0,-0x18(%rbp)
  40116b:       mov    %rax,-0x20(%rbp)
[...]
  4011b7:       mov    $0xfa,%esi
  4011bc:       lea    -0x30(%rbp),%rdi
  4011c0:       call   401030 <QTimer::start(int)@plt>

The function call is unaffected and went through the PLT. The two data references are, as expected, indirect via the GOT. Compare that to a build without the -fsingle-global-definition:

  40116d:       mov    $0x18,%edi
  401172:       movq   $0x401060,-0x20(%rbp)
  40117a:       movq   $0x0,-0x18(%rbp)
  401182:       mov    0x2f0f(%rip),%r12        # 404098 <QCoreApplication::self@@Qt_6>

The first, absolute address (0x401060) is undecorated in the output, but refers to the PLT slot for QTimer::timeout(QTimer::QPrivateSignal). The QCoreApplication::self symbol appears unmodified, but it was subject to copy relocation: it is accessed directly and its absolute address points into the .bss section, not the .got.

The .o files show a change in relocation too: with -fsingle-global-definition, both are R_X86_64_REX_GOTPCRELX, whereas without they are R_X86_64_32S and R_X86_64_PC32.

Conclusion: this is working really well and is doing exactly what I wanted it to.
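The same three kinds of access can be reproduced without Qt. The sketch below is a plain-C analogue that uses libc symbols (environ and puts) in place of the Qt ones; the wrapper function names are made up for this example. How each access is compiled depends on the compiler flags in use, so inspect the resulting relocations rather than assuming a particular outcome.

```c
#include <assert.h>
#include <stdio.h>

extern char **environ;   /* data object defined in libc, not in this file */

typedef int (*puts_fn)(const char *);

/* 1. Data access: reads an externally defined variable. */
static char **read_external_data(void)
{
    return environ;
}

/* 2. Function pointer access: takes the address of an external function. */
static puts_fn take_function_address(void)
{
    return puts;
}

/* 3. Function call: a plain call, which may go through the PLT. */
static int call_external_function(const char *msg)
{
    assert(msg != NULL);
    return puts(msg);
}
```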

Qt can use clang -fno-pic -fno-direct-access-external-data with clang>=12.0.0. The GCC feature request is https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98112.

I think -fno-direct-access-external-data is clearer and more specific. It isn't really clear to a casual user what -fsingle-global-definition could mean. Does "global" mean STB_GLOBAL or the combination of all components (exe and dso)?

When taking an external function address in -fno-pic code, I suggest -fno-direct-access-external-function (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100593). Actually, for many arches I suggest that we just use GOT by default, with no need for a toggle.

For x86-64 -fpie, we should apply https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570139.html

I can add an equivalent Clang change. Which options should I use to enable the equivalent of H.J.'s proposed change?

I can also confirm that the accesses to those same symbols inside the libQt6Core.so.6 library are direct. Examples:

  1fc725:       lea    -0x127c(%rip),%rdx        # 1fb4b0 <QTimer::timeout(QTimer::QPrivateSignal)>
  1fc72f:       cmp    %rdx,(%rax)

  109fa3:       mov    0x57cafe(%rip),%r12        # 686aa8 <QCoreApplication::self>
  109faa:       test   %r12,%r12

$ ~/src/owntools/relinfo/relinfo.pl lib/libQt6Core.t.so
lib/libQt6Core.t.so: 5794 relocations, 5489 relative (94%), 303 PLT entries, 1 for local syms (0%), 0 users

Besides the common linker flags, the library was linked with:

-Wl,--version-script
-Wl,--enable-new-dtags
-Wl,--dynamic-list

The dynamic list script is responsible for the single PLT jump slot for a local symbol, which is intended.

mentioned in issue Linux-ABI#1

I created a new issue to add a new GNU property: Linux-ABI#1. Let's move discussion there.

closed

mentioned in commit altaway/qtbase@19b7f854
