Bug Bounty: NVIDIA Reset Bug

By Dmitry Trifonov
August 6, 2025
Tags: qemu, libvirt, gpu, kvm, virtualization

For RTX 5090 and RTX PRO 6000

Hey everyone — we’re building a next-gen GPU cloud for AI developers at CloudRift, and we’ve run into a frustrating issue that’s proven nearly impossible to debug. We’re turning to the community for help.

On some of our nodes with RTX 5090 and RTX PRO 6000 GPUs, the cards occasionally become completely unresponsive — usually after a few days of VM usage or at seemingly random times during startup/shutdown. Once it happens, the GPU can’t be reassigned. The only way out is a complete node reboot.

We’ve ruled out most of the usual suspects: IOMMU quirks, kernel versions, driver bindings, and libvirt misconfigurations. Our H100s, B200s, and older RTX 4090s are solid, but these newer RTX cards are giving us serious trouble.

If you’re familiar with VFIO, QEMU/KVM, NVIDIA driver internals, or PCIe reset edge cases — we’d love your help.

[Image: Dmitry and the team “debugging”]

General Information

We’ve tested several machines with RTX 5090 and RTX PRO 6000 GPUs on AMD EPYC Rome and Milan platforms. All exhibit similar issues: the GPU gets stuck, and VM creation fails with the following error:

libvirt:  error : internal error: Unknown PCI header type '127' for device '0000:26:00.0'

The relevant errors from dmesg:

[572205.636684] pcieport 0000:40:01.1: broken device, retraining non-functional downstream link at 2.5GT/s
[572206.637508] pcieport 0000:40:01.1: retraining failed
[572207.639663] pcieport 0000:40:01.1: Data Link Layer Link Active not set in 1000 msec
[572208.876663] vfio-pci 0000:26:00.0: not ready 1023ms after FLR; waiting
[572209.964657] vfio-pci 0000:26:00.0: not ready 2047ms after FLR; waiting
[572212.076705] vfio-pci 0000:26:00.0: not ready 4095ms after FLR; waiting
[572216.364619] vfio-pci 0000:26:00.0: not ready 8191ms after FLR; waiting
[572225.068466] vfio-pci 0000:26:00.0: not ready 16383ms after FLR; waiting
[572241.964374] vfio-pci 0000:26:00.0: not ready 32767ms after FLR; waiting
[572275.244028] vfio-pci 0000:26:00.0: not ready 65535ms after FLR; giving up
[572302.229867] watchdog: BUG: soft lockup - CPU#246 stuck for 26s! [worker:1274725]

These likely indicate:

  • A QEMU VM using VFIO failed to reset the PCI device during shutdown or reassignment.
  • The GPU is not responding to the Function Level Reset (FLR).
  • The CPU became stuck while attempting to release the device, resulting in a kernel soft lockup.
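
The "giving up" line is the clearest marker of the stuck state, so it can be checked mechanically. The sketch below assumes the kernel message is worded exactly as in the dmesg excerpt above (other kernel versions may phrase it differently). The function name flr_gave_up is illustrative and not part of any existing tool.

```shell
#!/bin/sh
# Print the PCI address of every device whose post-FLR wait ended in
# "giving up". Reads kernel log text on stdin.
flr_gave_up() {
    # Expected message shape, taken from the log above:
    #   [ts] vfio-pci 0000:26:00.0: not ready 65535ms after FLR; giving up
    sed -n 's/.* \([0-9a-f]\{4\}:[0-9a-f]\{2\}:[0-9a-f]\{2\}\.[0-7]\): not ready [0-9]*ms after FLR; giving up.*/\1/p' |
        sort -u
}

# Example (on a host): dmesg | flr_gave_up
```

Lines that end in "waiting" are ignored on purpose: they only show that the kernel is still polling the device, not that the reset has failed.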

Occasionally, we also see a failure to change the power state:

vfio-pci: Unable to change power state from D0 to D3hot, device inaccessible

PCI Status

lspci still recognizes the device, and the vfio-pci driver is still bound.

$ lspci -k | grep -A 2 -i nvidia
...
--
26:00.0 VGA compatible controller: NVIDIA Corporation Device 2b85 (rev a1)
Subsystem: Gigabyte Technology Co., Ltd Device 416f
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau
26:00.1 Audio device: NVIDIA Corporation Device 22e8 (rev a1)
Subsystem: NVIDIA Corporation Device 0000
Kernel driver in use: vfio-pci
Kernel modules: snd_hda_intel
--
...

However, the verbose output shows the same unknown header type that libvirt reported:

$ lspci -vvv -s 0000:26:00.0
26:00.0 VGA compatible controller: NVIDIA Corporation Device 2b85 (rev a1) (prog-if 00 [VGA controller])
Subsystem: Gigabyte Technology Co., Ltd Device 416f
!!! Unknown header type 7f
Interrupt: pin ? routed to IRQ 715
NUMA node: 0
IOMMU group: 51
Region 0: Memory at b8000000 (32-bit, non-prefetchable) [size=64M]
Region 1: Memory at 30040000000 (64-bit, prefetchable) [size=256M]
Region 3: Memory at 30052000000 (64-bit, prefetchable) [size=32M]
Region 5: I/O ports at 2000 [size=128]
Expansion ROM at bc000000 [disabled] [size=512K]
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau

The sysfs directory for the device is still fully populated:

$ ls /sys/bus/pci/devices/0000:26:00.0/
aer_dev_correctable class d3cold_allowed iommu max_link_speed power resource resource3_wc sriov_offset subsystem_device
aer_dev_fatal config device iommu_group max_link_width power_state resource0 resource5 sriov_stride subsystem_vendor
aer_dev_nonfatal consistent_dma_mask_bits dma_mask_bits irq modalias remove resource1 revision sriov_totalvfs uevent
ari_enabled consumer:pci:0000:26:00.1 driver link msi_bus rescan resource1_resize rom sriov_vf_device vendor
boot_vga current_link_speed driver_override local_cpulist msi_irqs reset resource1_wc sriov_drivers_autoprobe sriov_vf_total_msix vfio-dev
broken_parity_status current_link_width enable local_cpus numa_node reset_method resource3 sriov_numvfs subsystem

Some sysfs attributes, such as the device and vendor IDs, still return sensible values. Others, like the PCIe link speed and width, are invalid.

$ cat /sys/bus/pci/devices/0000:26:00.0/device
0x2b85
$ cat /sys/bus/pci/devices/0000:26:00.0/vendor
0x10de
$ cat /sys/bus/pci/devices/0000:26:00.0/current_link_speed
Unknown
$ cat /sys/bus/pci/devices/0000:26:00.0/current_link_width
63
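
The width of 63 is not arbitrary: the negotiated link width is a 6-bit field in the PCIe Link Status register, so 63 is what it yields when every bit reads as 1. A minimal sketch of a check for these two symptoms follows. It assumes the attribute values look exactly like the ones above, and the function name link_looks_dead is made up for illustration.

```shell
#!/bin/sh
# Return 0 (true) if the link attributes under a sysfs device directory show
# the all-ones readings seen above, 1 otherwise (including when the
# attributes cannot be read at all).
# $1: device directory, e.g. /sys/bus/pci/devices/0000:26:00.0
link_looks_dead() {
    speed=$(cat "$1/current_link_speed" 2>/dev/null) || return 1
    width=$(cat "$1/current_link_width" 2>/dev/null) || return 1
    # 63 = 0x3f: all six bits of the negotiated-width field are set.
    [ "$speed" = "Unknown" ] || [ "$width" = "63" ]
}

# Example (on a host):
#   link_looks_dead /sys/bus/pci/devices/0000:26:00.0 && echo "link looks dead"
```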

The device config space is junk; it reads back as 0xff bytes:

$ sudo hexdump -C /sys/bus/pci/devices/0000:26:00.0/config
00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
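
All-ones is what a config read returns when the device does not answer, and it also explains the header type in the errors above. The header-type byte sits at offset 0x0e; bit 7 is the multi-function flag and bits 6:0 are the layout, so 0xff decodes to layout 0x7f, which is the 127 that libvirt prints. The sketch below checks for the all-0xff pattern and reproduces that arithmetic. The function names are illustrative, and checking only the first 64 bytes (the standard header) is our assumption about what is enough to flag the condition.

```shell
#!/bin/sh
# Return 0 (true) if the first 64 bytes of a PCI config space file all
# read 0xff, 1 otherwise (including when the file is missing or empty).
# $1: config file, e.g. /sys/bus/pci/devices/0000:26:00.0/config
config_is_all_ff() {
    # One hex byte per line; -v stops od from collapsing repeated lines.
    tokens=$(od -An -v -tx1 -N64 "$1" 2>/dev/null | tr -s ' ' '\n' | grep -v '^$')
    [ -n "$tokens" ] || return 1
    # Succeed only if no byte differs from "ff".
    ! printf '%s\n' "$tokens" | grep -qv '^ff$'
}

# Print the header layout (bits 6:0) for a raw header-type byte.
# $1: byte value, e.g. 0xff
header_layout() {
    printf '%d\n' $(( $1 & 0x7f ))
}

# Example (on a host):
#   config_is_all_ff /sys/bus/pci/devices/0000:26:00.0/config && header_layout 0xff
```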

sysfs reports the device in the D0 (fully on) power state:

$ cat /sys/bus/pci/devices/0000:26:00.0/power_state
D0

No AER errors:

$ cat /sys/bus/pci/devices/0000:26:00.0/aer_dev_nonfatal
Undefined 0
DLP 0
SDES 0
TLP 0
FCP 0
CmpltTO 0
CmpltAbrt 0
UnxCmplt 0
RxOF 0
MalfTLP 0
ECRC 0
UnsupReq 0
ACSViol 0
UncorrIntErr 0
BlockedTLP 0
AtomicOpBlocked 0
TLPBlockedErr 0
PoisonTLPBlocked 0
TOTAL_ERR_NONFATAL 0

$ cat /sys/bus/pci/devices/0000:26:00.0/aer_dev_fatal
Undefined 0
DLP 0
SDES 0
TLP 0
FCP 0
CmpltTO 0
CmpltAbrt 0
UnxCmplt 0
RxOF 0
MalfTLP 0
ECRC 0
UnsupReq 0
ACSViol 0
UncorrIntErr 0
BlockedTLP 0
AtomicOpBlocked 0
TLPBlockedErr 0
PoisonTLPBlocked 0
TOTAL_ERR_FATAL 0
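
For checking many devices at once, the totals are the only lines that matter. The sketch below prints one total per AER statistics file. It assumes each file ends with a TOTAL_ERR_* line, as the two files above do. The function name aer_totals is illustrative.

```shell
#!/bin/sh
# Print "<kind> <total>" for each AER statistics file found under a sysfs
# device directory. Files that are absent or unreadable are skipped.
# $1: device directory, e.g. /sys/bus/pci/devices/0000:26:00.0
aer_totals() {
    for kind in correctable nonfatal fatal; do
        f="$1/aer_dev_$kind"
        [ -r "$f" ] || continue
        printf '%s %s\n' "$kind" "$(awk '/^TOTAL_ERR_/ { print $2 }' "$f")"
    done
}

# Example (on a host): aer_totals /sys/bus/pci/devices/0000:26:00.0
```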

Attempting an FLR produces the following message in dmesg:

vfio-pci 0000:26:00.0: timed out waiting for pending transaction; performing function level reset anyway
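
For anyone trying to reproduce this, one way to request a function reset by hand is through the reset and reset_method attributes that appear in the sysfs listing above. This is a sketch of that approach, not necessarily the exact procedure used on our nodes. The function name request_reset is illustrative.

```shell
#!/bin/sh
# Ask the kernel to reset a PCI function through sysfs.
# $1: device directory, e.g. /sys/bus/pci/devices/0000:26:00.0
request_reset() {
    if [ ! -w "$1/reset" ]; then
        echo "no writable reset attribute in $1" >&2
        return 1
    fi
    # reset_method lists the reset mechanisms the kernel may use.
    if [ -r "$1/reset_method" ]; then
        echo "reset methods: $(cat "$1/reset_method")"
    fi
    echo 1 > "$1/reset"
}

# Example (on a host, as root):
#   request_reset /sys/bus/pci/devices/0000:26:00.0; dmesg | tail
```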

Domain XML and Libvirt Logs

Here is the typical domain XML that we’re using to allocate a VM with libvirt.

<domain type="kvm" xmlns:qemu="http://libvirt.org/schemas/domain/qemu/1.0">
<cpu mode="host-passthrough" check="none" migratable="on"/>
<name>noble-server-cloudimg-amd64-cuda-docker-570-1754505304</name>
<uuid>462da31a-3c07-4d46-b030-965836864b04</uuid>
<vcpu>15</vcpu>
<clock offset="localtime"/>
<memory unit="KiB">104857600</memory>
<currentMemory unit="KiB">104857600</currentMemory>
<features>
<acpi/>
<apic/>
</features>
<metadata>
<libosinfo:libosinfo xmlns:libosinfo="http://libosinfo.org/xmlns/libvirt/domain/1.0">
<libosinfo:os id="http://ubuntu.com/ubuntu/24.04"/>
</libosinfo:libosinfo>
<cloudrift:cloudrift xmlns:cloudrift="https://cloudrift.ai/schemas/domain/qemu/1.0">
<disks>
<device>disk</device>
<driver>qcow2</driver>
<source>/media/cloudrift/qemu/images/user/noble-server-cloudimg-amd64-cuda-docker-570-1754505304/noble-server-cloudimg-amd64-cuda-docker-570-1754505304.qcow2</source>
<target>vda</target>
<bus>virtio</bus>
<readonly>false</readonly>
<auto_remove>true</auto_remove>
</disks>
<disks>
<device>cdrom</device>
<driver>raw</driver>
<source>/media/cloudrift/qemu/images/user/noble-server-cloudimg-amd64-cuda-docker-570-1754505304/cloud-init.iso</source>
<target>sda</target>
<bus>sata</bus>
<readonly>true</readonly>
<auto_remove>true</auto_remove>
</disks>
<user_directory>/media/cloudrift/qemu/images/user/noble-server-cloudimg-amd64-cuda-docker-570-1754505304</user_directory>
</cloudrift:cloudrift>
</metadata>
<os>
<type arch="x86_64" machine="q35" pci-hole64-size="1024G">hvm</type>
<boot dev="cdrom"/>
<boot dev="hd"/>
</os>
<devices>
<emulator>/usr/bin/qemu-system-x86_64</emulator>
<controller type="pci" model="pcie-root" index="0"/>
<disk type="file" device="disk">
<driver name="qemu" type="qcow2"/>
<source file="/media/cloudrift/qemu/images/user/noble-server-cloudimg-amd64-cuda-docker-570-1754505304/noble-server-cloudimg-amd64-cuda-docker-570-1754505304.qcow2"/>
<target dev="vda" bus="virtio"/>
</disk>
<disk type="file" device="cdrom">
<driver name="qemu" type="raw"/>
<source file="/media/cloudrift/qemu/images/user/noble-server-cloudimg-amd64-cuda-docker-570-1754505304/cloud-init.iso"/>
<target dev="sda" bus="sata"/>
<readonly/>
</disk>
<controller type="usb" model="qemu-xhci"/>
<input type="tablet" bus="usb"/>
<input type="keyboard" bus="usb"/>
<serial type="pty">
<source path="/dev/pts/0"/>
<target type="isa-serial" port="0">
<model name="isa-serial"/>
</target>
<alias name="serial0"/>
</serial>
<console type="pty" tty="/dev/pts/0">
<source path="/dev/pts/0"/>
<target type="serial" port="0"/>
<alias name="serial0"/>
</console>
<graphics type="vnc" port="5900" autoport="yes" listen="127.0.0.1"/>
<interface type="direct">
<mac address="22:a1:ac:58:18:5e"/>
<source dev="enp194s0f0:" mode="bridge"/>
<target dev="macvtap0"/>
<model type="virtio"/>
<alias name="net0"/>
</interface>
<controller type="pci" model="pcie-root-port" index="1" id="pcie.1">
</controller>
<hostdev mode="subsystem" type="pci" managed="no" multifunction="on">
<driver name="vfio"/>
<source>
<address domain="0x0000" bus="0x26" slot="0x00" function="0x00"/>
</source>
<address type="pci" domain="0x0000" bus="0x00" slot="0x10" function="0" controller="1"/>
<alias name="hostdev_0_0"/>
</hostdev>
<hostdev mode="subsystem" type="pci" managed="no">
<driver name="vfio"/>
<source>
<address domain="0x0000" bus="0x26" slot="0x00" function="0x01"/>
</source>
<address type="pci" domain="0x0000" bus="0x00" slot="0x10" function="1" controller="1"/>
<alias name="hostdev_0_1"/>
</hostdev>
</devices>
<seclabel type="none"/>
</domain>

The corresponding libvirt log:

2025-07-25 17:56:05.352+0000: starting up libvirt version: 10.0.0, package: 10.0.0-2ubuntu8.7 (Ubuntu), qemu version: 8.2.2Debian 1:8.2.2+ds-0ubuntu1.7, kernel: 6.11.0-28-generic, hostname: nr-rtp-ai112.maas
LC_ALL=C \
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/snap/bin \
USER=root \
HOME=/var/lib/libvirt/qemu/domain-40-noble-server-cloudim \
XDG_DATA_HOME=/var/lib/libvirt/qemu/domain-40-noble-server-cloudim/.local/share \
XDG_CACHE_HOME=/var/lib/libvirt/qemu/domain-40-noble-server-cloudim/.cache \
XDG_CONFIG_HOME=/var/lib/libvirt/qemu/domain-40-noble-server-cloudim/.config \
/usr/bin/qemu-system-x86_64 \
-name guest=noble-server-cloudimg-amd64-cuda-docker-570-1753466164,debug-threads=on \
-S \
-object '{"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/lib/libvirt/qemu/domain-40-noble-server-cloudim/master-key.aes"}' \
-machine pc-q35-6.2,usb=off,dump-guest-core=off,memory-backend=pc.ram,acpi=on \
-accel kvm \
-cpu host,migratable=on \
-m size=104857600k \
-object '{"qom-type":"memory-backend-file","id":"pc.ram","mem-path":"/dev/hugepages/libvirt/qemu/40-noble-server-cloudim","x-use-canonical-path-for-ramblock-id":false,"prealloc":true,"size":107374182400}' \
-overcommit mem-lock=on \
-smp 15,sockets=15,cores=1,threads=1 \
-uuid 6d677ba1-01bb-4326-a688-545f2269d0c0 \
-no-user-config \
-nodefaults \
-chardev socket,id=charmonitor,fd=30,server=on,wait=off \
-mon chardev=charmonitor,id=monitor,mode=control \
-rtc base=localtime \
-no-shutdown \
-boot strict=on \
-device '{"driver":"pcie-root-port","port":16,"chassis":1,"id":"pci.1","bus":"pcie.0","multifunction":true,"addr":"0x2"}' \
-device '{"driver":"pcie-root-port","port":17,"chassis":2,"id":"pci.2","bus":"pcie.0","addr":"0x2.0x1"}' \
-device '{"driver":"pcie-root-port","port":18,"chassis":3,"id":"pci.3","bus":"pcie.0","addr":"0x2.0x2"}' \
-device '{"driver":"pcie-root-port","port":19,"chassis":4,"id":"pci.4","bus":"pcie.0","addr":"0x2.0x3"}' \
-device '{"driver":"pcie-root-port","port":20,"chassis":5,"id":"pci.5","bus":"pcie.0","addr":"0x2.0x4"}' \
-device '{"driver":"qemu-xhci","id":"usb","bus":"pci.2","addr":"0x0"}' \
-blockdev '{"driver":"file","filename":"/root/.local/share/cloudrift/qemu/images/base/noble-server-cloudimg-amd64-cuda-docker-570.img","node-name":"libvirt-3-storage","auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-3-format","read-only":true,"driver":"qcow2","file":"libvirt-3-storage","backing":null}' \
-blockdev '{"driver":"file","filename":"/media/cloudrift/qemu/images/user/noble-server-cloudimg-amd64-cuda-docker-570-1753466164/noble-server-cloudimg-amd64-cuda-docker-570-1753466164.qcow2","node-name":"libvirt-2-storage","auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-2-format","read-only":false,"driver":"qcow2","file":"libvirt-2-storage","backing":"libvirt-3-format"}' \
-device '{"driver":"virtio-blk-pci","bus":"pci.3","addr":"0x0","drive":"libvirt-2-format","id":"virtio-disk0","bootindex":2}' \
-blockdev '{"driver":"file","filename":"/media/cloudrift/qemu/images/user/noble-server-cloudimg-amd64-cuda-docker-570-1753466164/cloud-init.iso","node-name":"libvirt-1-storage","auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-1-format","read-only":true,"driver":"raw","file":"libvirt-1-storage"}' \
-device '{"driver":"ide-cd","bus":"ide.0","drive":"libvirt-1-format","id":"sata0-0-0","bootindex":1}' \
-netdev '{"type":"tap","fd":"31","vhost":true,"vhostfd":"34","id":"hostnet0"}' \
-device '{"driver":"virtio-net-pci","netdev":"hostnet0","id":"net0","mac":"5e:4f:b8:a7:56:98","bus":"pci.1","addr":"0x0"}' \
-chardev pty,id=charserial0 \
-device '{"driver":"isa-serial","chardev":"charserial0","id":"serial0","index":0}' \
-device '{"driver":"usb-tablet","id":"input0","bus":"usb.0","port":"1"}' \
-device '{"driver":"usb-kbd","id":"input1","bus":"usb.0","port":"2"}' \
-audiodev '{"id":"audio1","driver":"none"}' \
-vnc 127.0.0.1:2,audiodev=audio1 \
-device '{"driver":"cirrus-vga","id":"video0","bus":"pcie.0","addr":"0x1"}' \
-global ICH9-LPC.noreboot=off \
-watchdog-action reset \
-device '{"driver":"vfio-pci","host":"0000:e1:00.0","id":"hostdev0","bus":"pci.4","multifunction":true,"addr":"0x0"}' \
-device '{"driver":"vfio-pci","host":"0000:e1:00.1","id":"hostdev1","bus":"pci.4","addr":"0x0.0x1"}' \
-device '{"driver":"virtio-balloon-pci","id":"balloon0","bus":"pci.5","addr":"0x0"}' \
-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
-msg timestamp=on
2025-07-25 17:56:05.352+0000: Domain id=40 is tainted: high-privileges
char device redirected to /dev/pts/4 (label charserial0)
2025-07-25T17:56:48.729749Z qemu-system-x86_64: terminating on signal 15 from pid 3511 (/usr/sbin/libvirtd)
2025-07-25 17:56:50.931+0000: shutting down, reason=destroyed

What Have We Tried?

  1. An FLR, as mentioned above, has no effect; the device never responds.
  2. A PCI rescan doesn’t help, whether we rescan the whole bus or just the upstream bridge. A rescan might also affect other virtual machines running on the host.
  3. We tried to bind the nvidia driver to get the GPU out of the limbo state, but it fails to attach. The device goes to D3Cold, and we see “Unable to change power state from D3Cold to D0, device is inaccessible” in dmesg.
  4. We created VMs with and without 1G hugepages. Both configurations fail sporadically.
  5. We bind the GPUs to vfio-pci at boot and don’t change the driver at runtime. We set the managed="no" attribute in the domain XML to prevent libvirt from re-binding drivers.
  6. We haven’t tried vendor-reset. Our understanding is that it is only compatible with AMD GPUs.
  7. We haven’t tried modifying the VBIOS. It is a risky change, and our software runs in a variety of data centers across different providers, so reflashing every GPU is a tall order. We are happy to consider it, though, if anyone has evidence that it might help.
  8. We haven’t tried duct-taping the SMBus pins on the GPU. We haven’t found evidence that this would help in this scenario, but we are happy to reconsider. Like the VBIOS change, it is a tall order to ask providers to perform this hack on their entire GPU fleet.
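
For reference, the remove-and-rescan attempt from item 2 can be sketched through the standard sysfs remove and rescan attributes. This is an illustration of the general technique, not a record of the exact commands we ran. The function name and the optional second argument (there so the function can be pointed at a test directory) are assumptions made for the example.

```shell
#!/bin/sh
# Detach one PCI function from the kernel, then rescan the bus.
# $1: PCI address, e.g. 0000:26:00.0
# $2: sysfs PCI root (optional, defaults to /sys/bus/pci)
remove_and_rescan() {
    root=${2:-/sys/bus/pci}
    dev="$root/devices/$1"
    if [ ! -w "$dev/remove" ]; then
        echo "cannot remove $1: $dev/remove is not writable" >&2
        return 1
    fi
    echo 1 > "$dev/remove"   # the kernel drops the device
    echo 1 > "$root/rescan"  # re-enumerate all PCI buses
}

# Example (on a host, as root): remove_and_rescan 0000:26:00.0
```

The GPU’s audio function (0000:26:00.1 in the listings above) is a separate PCI function and would need the same treatment.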

Relevant Information Online

How to Participate

Prize: $1000 USD for a working mitigation or fix. If no fix can be found, we’ll send the money to anyone who helps us understand the root cause, reproduces the issue, or assists in other ways.

Career Opportunity: We’re actively hiring a systems engineer and would be glad to interview anyone who helps us to fix the issue or provides valuable guidance. Check out the position here.

Goal: Prevent or recover from the unrecoverable GPU state post-VM shutdown/init (FLR failure, “unknown header type 7f”, etc.).

Join us: Head to our #bug-bounty Discord channel for full system logs and more information from the development team.

Contact: bug-bounty@cloudrift.ai

Help us make GPU virtualization more robust for the AI community — and claim your bounty in the process.