Skip to content

LDB: make string comparisons consistent and transitive

By default, LDB thinks in ASCII only, but for Samba it thinks in a version of UTF-8. This has been achieved via ldb_set_utf8_fns() which registers a casefold function with the LDB object; ldb_init() adds an ASCII one, which Samba always overwrites with wrap_casefold() from lib/ldb-samba. SSSD stays with the ASCII default.

The casefold function is then used by ldb_comparison_fold() to compare strings ignoring case (and "insignificant whitespace", mostly another story). This is where the trouble starts. The UTF-8 casefold function will fail if it meets bad UTF-8, in which case ldb_comparison_fold() has been falling back to memcmp(). This means that anyone who can set an LDB string can wreck the sort order for that attribute. In short, this FIXME is what we fix:

 	b1 = ldb_casefold(ldb, mem_ctx, s1, n1);
 	b2 = ldb_casefold(ldb, mem_ctx, s2, n2);
 
 	if (!b1 || !b2) {
 		/*
 		 * One of the strings was not UTF8, so we have no
 		 * options but to do a binary compare.
 		 *
 		 * FIXME: this can be non-transitive.
 		 *
 		 * consider {
 		 *           CA 8A  "ʊ"
 		 *           C6 B1  "Ʊ"
 		 *           C8 FE  invalid utf-8
 		 *          }
 		 *
 		 * The byte "0xfe" is always invalid in utf-8, so the
 		 * comparisons against that string end up coming this way,
 		 * while the "Ʊ" vs "ʊ" comparison goes via the ldb_casefold
 		 * branch. Then:
 		 *
 		 *  "ʊ" == "Ʊ"     by casefold.
 		 *  "ʊ" > {c8 fe}  by byte comparison.
 		 *  "Ʊ" < {c8 fe}  by byte comparison.
 		 *
 		 * In many cases there are no invalid encodings between the
 		 * upper and lower case letters, but the string as a whole
 		 * might also compare differently due to the space-eating in
 		 * the other branch.
 		 */

The core fix is kind of simple. You don't need to casefold the strings -- you can just compare a character at a time, case-insensitively. If a bad sequence occurs, you only need to switch strategies from that point, not chucking it all away with if (!b1 || !b2). What we choose to do is sort invalid strings toward the end.

A complication is that ldb_comparison_fold() tried to be agnostic about the encoding it was dealing with, but we can't do that any more. So we need to set a comparison function as well as a casefold function in ldb_set_utf8_fns() (as seems to have been the original plan, until 76036d37). To avoid trouble, we add a new ldb_set_utf8_funtions() and deprecate the old one. Nobody outside Samba uses it.

Because of the ABI change, we are not backporting this, so there is no bug number, but this is definitely related to https://bugzilla.samba.org/show_bug.cgi?id=15625.

Checklist

  • Commits have Signed-off-by: with name/author being identical to the commit author
  • (optional) This MR is just one part towards a larger feature.
  • (optional, if backport required) Bugzilla bug filed and BUG: tag added
  • Test suite updated with functionality tests
  • Test suite updated with negative tests
  • Documentation updated
  • CI timeout is 3h or higher (see Settings/CICD/General pipelines/ Timeout)

Reviewer's checklist:

  • There is a test suite reasonably covering new functionality or modifications
  • Function naming, parameters, return values, types, etc., are consistent and according to README.Coding.md
  • This feature/change has adequate documentation added
  • No obvious mistakes in the code

Merge request reports