Revert "MySQL: Multibyte Collation Support"

This reverts commit c8322907 from !15464 (merged)

Based on this comment: https://github.com/wireshark/wireshark/commit/c8322907459d37860042dc5ce7ac65c681692627#commitcomment-141627548

Collation IDs in MySQL range from 1 to 323:

sql> SELECT MIN(ID),MAX(ID) FROM information_schema.collations;
+---------+---------+
| MIN(ID) | MAX(ID) |
+---------+---------+
|       1 |     323 |
+---------+---------+
1 row in set (0.0022 sec)

sql> SELECT VERSION();
+-----------+
| VERSION() |
+-----------+
| 8.4.0     |
+-----------+
1 row in set (0.0004 sec)

This means that the collation ID needs 2 bytes.
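The byte-width claim is simple arithmetic and can be checked with a short sketch. The helper name `bytes_needed` is made up for illustration; the value 323 is taken from the query output above.

```python
def bytes_needed(value: int) -> int:
    """Return the minimum number of whole bytes needed to store an unsigned integer."""
    return max(1, (value.bit_length() + 7) // 8)


# Highest collation ID reported by the query above (MySQL 8.4.0).
MAX_COLLATION_ID = 323

print(bytes_needed(255))               # largest value that still fits in one byte
print(bytes_needed(MAX_COLLATION_ID))  # 323 > 255, so a second byte is required
```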

Another complication here is that there are character sets and collations. A single character set usually has multiple collations, but only one of them is the default. In many places in code and documentation the terms character set and collation are unfortunately mixed up.

sql> SELECT MIN(ID),MAX(ID) FROM information_schema.collations WHERE IS_DEFAULT='YES';
+---------+---------+
| MIN(ID) | MAX(ID) |
+---------+---------+
|       1 |     255 |
+---------+---------+
1 row in set (0.0021 sec)

All default collations have IDs that are <=255.

When connecting to a server the client sends a collation as part of the "HandshakeResponse41" (MySQL) / "Login Request" (Wireshark).

This is documented here.

Note that this is source code documentation including a protocol description. However this isn't an official protocol specification.

MySQL Connector/Python doesn't send the collation as 1 byte but as 2 bytes, as can be seen in network traces.

The comment mentioned in the beginning correctly states that MySQL Server only reads one byte and not two.

Note that the low byte is always in the same location, whether a client sends a 1-byte collation or a 2-byte collation. This is because the collation is immediately followed by a filler, which should always be set to all 0x00.

So basically this is what the official MySQL Connector/Python is sending:

0x2F 0x01 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
<- coll -><-------------------------------------------------------- filler 22 x 0x00 --------------------------------->

This is what MySQL Server reads:

0x2F 0x01 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
<col><-------------------------------------------------------- filler 23 x 0x00 -------------------------------------->

So Conn/Py sends 0x012F (utf8mb4_ja_0900_as_cs) and MySQL Server reads 0x2F (latin1_bin).

For collations <=255 the high (second) byte is 0x00, so what Conn/Py sends matches what MySQL Server reads.
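The mismatch can be reproduced in a few lines of Python. This only illustrates the byte layout; it is not code from the server or the connector. The function names are made up, and the single-byte read models the server behavior described in the commit comment.

```python
import struct


def encode_two_byte(collation_id: int) -> bytes:
    """Encode the ID as a 2-byte little-endian integer, as seen in the Conn/Py traces."""
    return struct.pack("<H", collation_id)


def server_read(field: bytes) -> int:
    """Model of a reader that only looks at the first byte of the field."""
    return field[0]


# 255 is the largest ID that survives a one-byte read; 303 is 0x012F.
for cid in (47, 255, 303):
    wire = encode_two_byte(cid)
    print(f"sent {cid:#06x} as bytes {wire.hex()} -> one-byte read gives {server_read(wire):#04x}")
```

IDs up to 255 round-trip unchanged; 303 (0x012F) is read back as 0x2F.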

Note that it is even safe for the server to read two bytes, matching what Conn/Py sends. For clients that only send 1 byte, the byte after it is the first filler byte, which would be 0x00.
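A minimal sketch of why a two-byte read stays compatible with both kinds of clients, assuming the layout from the diagrams above (24 bytes in total for collation plus filler, little-endian byte order). All helper names are hypothetical.

```python
import struct

FIELD_LEN = 24  # 1-byte collation + 23-byte filler in the documented layout


def one_byte_client(collation_id: int) -> bytes:
    """Documented layout: 1-byte collation followed by 23 zero bytes of filler."""
    return bytes([collation_id]) + bytes(FIELD_LEN - 1)


def two_byte_client(collation_id: int) -> bytes:
    """Layout seen from Conn/Py: 2-byte little-endian collation, then 22 zero bytes."""
    return struct.pack("<H", collation_id) + bytes(FIELD_LEN - 2)


def read_two_bytes(field: bytes) -> int:
    """Hypothetical lenient reader: take the first two bytes as a little-endian ID."""
    return struct.unpack_from("<H", field)[0]


# For IDs <= 255 both clients put identical bytes on the wire.
assert one_byte_client(47) == two_byte_client(47)

print(read_two_bytes(one_byte_client(47)))   # 1-byte client, two-byte read
print(read_two_bytes(two_byte_client(303)))  # 2-byte client, two-byte read
```

This holds only as long as 1-byte clients really zero the filler, which is what the documentation asks for.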

The questions here are:

  • Is MySQL Connector/Python correct (and the protocol description wrong/outdated, and the server as well)?
  • Or is MySQL Connector/Python wrong?
  • Do we want Wireshark to follow the documented behavior, or the behavior that's seen in the wild from an official connector? Note that 775c3be8 is also there to allow a possible protocol violation that's seen in the wild.

My personal opinions here:

  • As this is only seen in a single connector and not in the server, it is more likely that the connector is wrong. This could change later if the server and/or docs get updated.
  • We could wait for Oracle MySQL to clarify the behavior, but that will probably take too long.
  • I'm ok with reverting this or with keeping this in. If we revert this we should probably have some alert if the filler isn't all 0x00. In addition to that we should add the updated collation back after this is done.
  • I think we should try to follow the documentation where possible and follow the actual behavior of officially released MySQL clients and servers where needed.
  • As both MySQL and MariaDB follow the protocol but have slight differences/extensions/etc we can't be too strict.
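The filler alert suggested above could look roughly like this. It is a Python sketch of the check only; in the dissector it would presumably be raised as an expert info item in C. The function name and message text are assumptions, not existing Wireshark code.

```python
from typing import Optional


def filler_warning(field: bytes) -> Optional[str]:
    """Return a warning if the 23-byte filler after the 1-byte collation is not all zeros.

    `field` is the 24-byte region starting at the collation byte.
    """
    if len(field) != 24:
        raise ValueError("expected 1-byte collation + 23-byte filler")
    filler = field[1:]
    for offset, value in enumerate(filler):
        if value:
            return f"non-zero filler byte {value:#04x} at filler offset {offset}"
    return None


# Documented layout: latin1_bin (0x2F), filler all zeros -> no warning.
print(filler_warning(bytes([0x2F]) + bytes(23)))

# Conn/Py layout for 0x012F: the high byte lands in the first filler byte.
print(filler_warning(bytes([0x2F, 0x01]) + bytes(22)))
```

A non-zero byte at filler offset 0 is the case discussed here: it is consistent with a client sending a 2-byte collation.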

Edited by Daniël van Eeden
