Potential data corruption when creating / removing a qcow2 external snapshot (overlay) of a running VM
<!--See https://libvirt.org/bugs.html#how-to-file-high-quality-bug-reports-->
## Software environment
- Operating system: Debian Trixie
- Architecture: amd64
- Kernel version: 6.12 (6.12.43+deb13-amd64 in the most recent case)
- libvirt version: 11.3.0-3+deb13u1
- qemu version: 10.0.7
- Hypervisor and version: KVM
## Description of problem
Please pardon me if this question should have been addressed to the Debian maintainers or the KVM project; I'd appreciate any suggestions on where to seek more help.
I've now experienced at least 4 in-guest data corruptions on 3 different servers (server-grade amd64 hardware with both Intel Xeon and AMD EPYC CPUs, ECC memory). The issues started soon after the upgrade from Debian Bookworm to Debian Trixie, both on the virtualization servers and within the KVM guests.
The corruptions affected 2 VMs: one running MySQL 8.4, the other PostgreSQL 17. In both cases it was the database server that complained about corrupted data files, not the ext4 file system or the kernel.
The affected VMs use ext4 inside the guest; the virtual disk images are qcow2 files stored on an ext4 file system on a local NVMe drive installed in the physical server. The virtual drives are \~1-1.4TB in size.
All instances of corruption coincided in time with the VM-level backups that are run on the physical server for those VMs:
* _virsh snapshot-create-as --domain VMNAME --name "kvmsnap-VMNAME" --no-metadata --atomic --disk-only vdX,external_
* rsync to transfer the vdX.qcow2 to another server
* _virsh blockcommit VMNAME vdX --active --pivot --verbose_
* deleting the snapshot files from directories where the qcow2 files are kept for VMs
This backup procedure is done while the VM is running.
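For reference, the four steps above can be sketched as a small script. This is only an illustration of the sequence, not the exact script in use: the `IMAGE`, `OVERLAY` and `DEST` values are placeholders, the `rsync --sparse` option is an assumption, and the overlay name assumes libvirt derives it from the image name and the snapshot name when no `file=` is given. `DRY_RUN` defaults to 1, so the commands are printed rather than executed.

```shell
#!/bin/sh
# Sketch of the backup cycle; all values below are placeholders/assumptions.
VM="${VM:-VMNAME}"
DISK="${DISK:-vdX}"
IMAGE="${IMAGE:-/var/lib/libvirt/images/VMNAME-vdX.qcow2}"  # assumed path
# Assumed overlay name: image name with its extension replaced by the snapshot name.
OVERLAY="${OVERLAY:-${IMAGE%.*}.kvmsnap-$VM}"
DEST="${DEST:-backuphost:/backups/}"                        # assumed rsync target
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

backup_cycle() {
    # 1. Redirect guest writes to a new external overlay; the base image stops changing.
    run virsh snapshot-create-as --domain "$VM" --name "kvmsnap-$VM" \
        --no-metadata --atomic --disk-only "$DISK,external" || return 1
    # 2. Copy the now-static base image to the backup server.
    run rsync --sparse "$IMAGE" "$DEST" || return 1
    # 3. Merge the overlay back into the base image and pivot the VM onto it.
    run virsh blockcommit "$VM" "$DISK" --active --pivot --verbose || return 1
    # 4. Remove the overlay file, which is no longer in use after the pivot.
    run rm -f -- "$OVERLAY"
}

backup_cycle
```

Running it with `DRY_RUN=0` would execute the commands against the real domain; each step aborts the cycle if the previous one fails, so the overlay is never deleted after a failed blockcommit.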
This corruption is not something that I can reproduce easily; I have many more VMs in the same environment where I have not observed any corruption. Both affected database servers are under quite heavy I/O load, for both reads and writes; perhaps this increases the probability of running into the problem.
I've already reviewed recent issues in this project and searched elsewhere for descriptions of similar problems, but did not find anything.
After the most recent corruption I upgraded the Debian libvirt-related packages to version 11.3.0-3+deb13u2, which has not yet been released to stable. Based on https://tracker.debian.org/news/1702918/accepted-libvirt-1130-3deb13u2-source-into-proposed-updates/ I doubt it will help, but I did not have any better ideas.
If corruption happens again, I intend to migrate the file system within the VM, at least temporarily, from ext4 to btrfs. Since btrfs checksums data blocks, this might give me an error message in the guest kernel indicating that I'm dealing with storage-level corruption rather than subtle corruption of the guest's memory.
Thanks a lot in advance for any comments or suggestions on how to troubleshoot this!
<!--Attach XML configs, logs, stack traces, etc. Compress the files if necessary-->
<!--See https://libvirt.org/kbase/debuglogs.html on how to configure logging-->