mirror: the block-job-cancel command can put QEMU into an endless error loop
- QEMU version: upstream version, commit 715167a3
Description of problem
If the destination VM crashes (or the network goes down) right before the block device mirroring job is completed with block-job-cancel, QEMU can get stuck in an endless error loop.
Steps to reproduce
- Run both QEMU VMs: source + target.
- On the target side, prepare an NBD server for the block device mirroring process using QMP commands similar to the ones below:
{"execute": "nbd-server-start", "arguments": { "addr": { "data": { "host": "::", "port": "49153" }, "type": "inet" } } }
{ "execute": "nbd-server-add", "arguments": { "device": "drive_main01", "writable": true } }
- On the source side, prepare the VM for migration and start the drive-mirror job:
{"execute":"migrate-set-capabilities","arguments":{"capabilities":[{"capability":"pause-before-switchover","state":true}]}}
{ "execute": "drive-mirror", "arguments": { "device": "drive_main01", "mode": "existing", "job-id": "job0", "target": "nbd:127.0.0.1:49153:exportname=drive_main01", "sync": "top", "on-source-error": "stop", "on-target-error": "stop", "format": "raw", "speed": 0 } }
- On the source side, wait for the BLOCK_JOB_READY event:
{"timestamp": {"seconds": 1625586327, "microseconds": 833805}, "event": "BLOCK_JOB_READY", "data": {"device": "job0", "len": 21474836480, "offset": 21474836480, "speed": 0, "type": "mirror"}}
- Start migration on the source side:
{ "execute": "migrate", "arguments": { "uri": "tcp:127.0.0.1:8091" } }
- Wait for the pre-switchover state of the migration:
{ "execute": "query-migrate" }
{"return": {"expected-downtime": 300, "status": "pre-switchover", "setup-time": 3, "total-time": 11343, "ram": {"total": 8725020672, "postcopy-requests": 0, "dirty-sync-count": 2, "multifd-bytes": 0, "pages-per-second": 39550, "page-size": 4096, "remaining": 2871296, "mbps": 1073.7734399999999, "transferred": 963647065, "duplicate": 1899491, "dirty-pages-rate": 84, "skipped": 0, "normal-bytes": 944705536, "normal": 230641}}}
- Kill the target QEMU to reproduce the issue.
- Cancel the job on the source side:
{ "execute": "block-job-cancel", "arguments": { "device": "job0" } }
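For anyone scripting the reproduction, the source-side steps above could be sketched roughly as follows. This is only an illustration: the helper names (`qmp`, `source_side_commands`, `is_job_ready`, `in_pre_switchover`) are hypothetical and not part of QEMU, the default values are copied from the commands in this report, and the transport (writing these lines to the QMP socket and reading the replies) is left out on purpose.

```python
import json


def qmp(execute, arguments=None):
    """Serialize one QMP command to a JSON line (hypothetical helper)."""
    cmd = {"execute": execute}
    if arguments is not None:
        cmd["arguments"] = arguments
    return json.dumps(cmd)


def source_side_commands(job_id="job0", device="drive_main01",
                         target="nbd:127.0.0.1:49153:exportname=drive_main01",
                         uri="tcp:127.0.0.1:8091"):
    """Return the source-side QMP commands from this report, in order."""
    return [
        qmp("migrate-set-capabilities", {"capabilities": [
            {"capability": "pause-before-switchover", "state": True}]}),
        qmp("drive-mirror", {
            "device": device, "mode": "existing", "job-id": job_id,
            "target": target, "sync": "top",
            "on-source-error": "stop", "on-target-error": "stop",
            "format": "raw", "speed": 0}),
        # Wait for BLOCK_JOB_READY before sending the next command.
        qmp("migrate", {"uri": uri}),
        # Poll query-migrate until pre-switchover, kill the target,
        # then send the cancel that triggers the error loop.
        qmp("block-job-cancel", {"device": job_id}),
    ]


def is_job_ready(line, job_id="job0"):
    """True if a QMP output line is the BLOCK_JOB_READY event for job_id."""
    msg = json.loads(line)
    data = msg.get("data")
    return (msg.get("event") == "BLOCK_JOB_READY"
            and isinstance(data, dict)
            and data.get("device") == job_id)


def in_pre_switchover(line):
    """True if a query-migrate reply line reports the pre-switchover state."""
    ret = json.loads(line).get("return")
    return isinstance(ret, dict) and ret.get("status") == "pre-switchover"
```

The two predicates are meant to be applied to each line read back from the QMP socket, so the script only moves on to the next command once the expected event or state has been seen.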
The source QEMU then enters an endless error loop:
...
{"timestamp": {"seconds": 1625586487, "microseconds": 413847}, "event": "BLOCK_JOB_ERROR", "data": {"device": "job0", "operation": "write", "action": "stop"}}
{"timestamp": {"seconds": 1625586487, "microseconds": 413865}, "event": "BLOCK_JOB_ERROR", "data": {"device": "job0", "operation": "write", "action": "stop"}}
{"timestamp": {"seconds": 1625586487, "microseconds": 413885}, "event": "BLOCK_JOB_ERROR", "data": {"device": "job0", "operation": "write", "action": "stop"}}
...
At this point the source QEMU can be stopped only with SIGKILL.
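A test harness could flag this condition automatically by watching the QMP event stream for a burst of BLOCK_JOB_ERROR events on the same job. The sketch below is an assumption-laden illustration, not QEMU code: `detect_error_storm`, the threshold of 3 events and the 1-second window are all hypothetical choices made for the example.

```python
import json
from collections import deque


def event_time(msg):
    """Convert a QMP event timestamp to seconds as a float."""
    ts = msg["timestamp"]
    return ts["seconds"] + ts["microseconds"] / 1_000_000


def detect_error_storm(lines, job_id="job0", threshold=3, window=1.0):
    """Return True once `threshold` BLOCK_JOB_ERROR events for `job_id`
    arrive within `window` seconds (both values are assumed defaults)."""
    recent = deque(maxlen=threshold)  # timestamps of the latest matches
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            msg = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as "..." elisions
        if not isinstance(msg, dict) or msg.get("event") != "BLOCK_JOB_ERROR":
            continue
        data = msg.get("data")
        if not isinstance(data, dict) or data.get("device") != job_id:
            continue
        recent.append(event_time(msg))
        if len(recent) == threshold and recent[-1] - recent[0] <= window:
            return True
    return False
```

Fed the three events shown above, which are spaced tens of microseconds apart, this returns True; a single isolated write error on the job would not trip it.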