Skip to content

Avoid crash by ctdb RELEASE_IP if a connection to a non-public address disconnects first (bug #15523)

Stefan Metzmacher requested to merge samba-team/devel/samba:metze-ctdb into master

https://bugzilla.samba.org/show_bug.cgi?id=15523

I don't know how the problem happens in real setups, but customer logs show, the use-after-free happened there.

From reading the code I was able to trigger it like this: using smbclient //cluster/share -Uuser

All smbclient commands got this in addition: --option="libsmb:client_guid=6112f7d3-9528-4a2a-8861-0ca129aae6c4" It means all connections will be merged to same smbd process in order to allow multichannel to that process.

The cluster looks like this:

# ctdb status
Number of nodes:3
pnn:0 172.31.99.166    OK (THIS NODE)
pnn:1 172.31.99.167    OK
pnn:2 172.31.99.168    OK
Generation:1237225803
Size:3
hash:0 lmaster:0
hash:1 lmaster:1
hash:2 lmaster:2
Recovery mode:NORMAL (0)
Leader:0
# ctdb ip
Public IPs on node 0
172.31.91.20 2
172.31.91.21 1
172.31.91.22 0
172.31.92.20 2
172.31.92.21 1
172.31.92.22 0

I reproduced the crash like this:

1. (In an extra terminal): smbclient //172.31.99.166/shm -Uroot%test
2. (In an extra terminal): smbclient //172.31.91.22/shm -Uroot%test

Now we have two connections to the same smbd:

# netstat -tpnW | grep '172.31.99.166'
tcp 0 0 172.31.91.166:445 172.31.91.1:42470 ESTABLISHED 1634797/smbd: clien
# netstat -tpnW | grep '172.31.91.22'
tcp 0 0 172.31.91.22:445  172.31.91.1:55462 ESTABLISHED 1634797/smbd: clien

But ctdb don't see the tcp connection for the connection to the public ip:

# ctdb gettickles 172.31.91.22
Connections for IP: 172.31.91.22
Num connections: 0

This is a minor problem and the reason behind need for the extra CTDB_CONTROL_TCP_CLIENT_PASSED

The problem happens like this:

  • the parent smbd accepted the 2nd connection 172.31.91.22:445 172.31.91.1:55462 and forked a child smbd (PID=1634798) (as always)
  • smbd (PID=1634798) uses CTDB_CONTROL_TCP_CLIENT in order to register the connection with ctdb.
  • ctdb gets the CTDB_CONTROL_TCP_CLIENT message and adds the connection to client->tcp_list and forwards it as CTDB_CONTROL_TCP_ADD to all nodes (including itself)
  • ctdb gets the CTDB_CONTROL_TCP_ADD message and adds the connection to vnn->tcp_array
  • smbd (PID=1634798) processed the SMB2 Negotiate request and noticed the client guid was already registered by smbd with PID=1634797.
  • smbd (PID=1634798) sends a MSG_SMBXSRV_CONNECTION_PASS message to PID=1634797 with the socket fd and the content of the SMB2 Negotiate request
  • smbd (PID=1634797) received the MSG_SMBXSRV_CONNECTION_PASS message and also uses CTDB_CONTROL_TCP_CLIENT in order to register the connection with ctdb
  • ctdb gets the CTDB_CONTROL_TCP_CLIENT message and adds the connection to client->tcp_list and broadcasts it as CTDB_CONTROL_TCP_ADD to all nodes (including itself)
  • ctdb gets the CTDB_CONTROL_TCP_ADD message and detects that the connection is already in vnn->tcp_array and the ignores the 2nd registration
  • smbd (PID=1634797) processes the SMB2 Negotiate request again and sends a MSG_SMBXSRV_CONNECTION_PASSED message to PID=1634798
  • smbd (PID=1634798) exits itself, which means the connection to ctdb is also closed
  • ctdb detects that PID=1634798 closed the connection and calls ctdb_takeover_client_destructor_hook() it calls ctdb_remove_connection() for the tcp connection and removes it from vnn->tcp_array too, which means "ctdb gettickles" no longer sees the connection, which is still valid for smbd (PID=1634797)

Now back to reproducing the crash problem:

3. The smbclient from step 1. is stopped with "exit"
   This means the release_ip callback registered for it
   by ctdbd_register_ips() is still registered, but
   its private_data "state" is gone.
4. "ctdb moveip 172.31.91.22 1" moves the ip to
   another node and ctdb sends a message to CTDB_SRVID_RELEASE_IP
   in order to let smbd (PID=1634797) disconnect/exit

The problem is that ctdbd_msg_call_back() calls all registered callback in the order they called register_with_ctdbd(), which means the release_ip function with the stale 'state' is called first and hits the use-after-free.

Now we have an explicit ctdbd_unregister_ips() in order to unregister the callback with the stale state and also inform ctdb via CTDB_CONTROL_TCP_CLIENT_DISCONNECTED that the tcp connection is gone now.

In the case of multichannel passing from the temporary smbd, we need to use ctdbd_passed_ips() and CTDB_CONTROL_TCP_CLIENT_PASSED in order to remove the connection from client->tcp_list, but keep it alive in vnn->tcp_array.

I hope that makes it clear now

Checklist

  • Commits have Signed-off-by: with name/author being identical to the commit author
  • (optional) This MR is just one part towards a larger feature.
  • (optional, if backport required) Bugzilla bug filed and BUG: tag added
  • Test suite updated with functionality tests
  • Test suite updated with negative tests
  • Documentation updated
  • CI timeout is 3h or higher (see Settings/CICD/General pipelines/ Timeout)

Reviewer's checklist:

  • There is a test suite reasonably covering new functionality or modifications
  • Function naming, parameters, return values, types, etc., are consistent and according to README.Coding.md
  • This feature/change has adequate documentation added
  • No obvious mistakes in the code
Edited by Stefan Metzmacher

Merge request reports