Per-kernel speedup vs PyTorch eager

FP32 on RTX 5090. Ratio = eager_us / backend_us, with both times in microseconds, so higher is faster. Eager is the baseline at 1.0 (dashed line). Sorted by deplodock ratio in descending order: wins (ratio > 1) at top, losses (ratio < 1) at bottom.
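
As a rough illustration of how the ratio and the sort order could be computed, here is a minimal sketch. The kernel names and timings are made up for the example and are not measurements from this chart. The `speedup` helper and the dictionary layout are also assumptions, not part of any benchmark harness.

```python
def speedup(eager_us: float, backend_us: float) -> float:
    """Return eager_us / backend_us; values above 1.0 mean the backend is faster."""
    if backend_us <= 0:
        raise ValueError("backend_us must be positive")
    return eager_us / backend_us


# Hypothetical per-kernel timings in microseconds (illustrative only).
timings = {
    "kernel_a": {"eager": 120.0, "deplodock": 60.0},
    "kernel_b": {"eager": 80.0, "deplodock": 100.0},
    "kernel_c": {"eager": 50.0, "deplodock": 50.0},
}

# Ratio per kernel, with eager as the 1.0 baseline.
ratios = {
    name: speedup(t["eager"], t["deplodock"]) for name, t in timings.items()
}

# Sort descending so wins come first and losses last.
ranked = sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)

for name, ratio in ranked:
    if ratio > 1.0:
        label = "win"
    elif ratio < 1.0:
        label = "loss"
    else:
        label = "tie"
    print(f"{name}: {ratio:.2f}x ({label})")
```

A ratio of exactly 1.0 sits on the dashed baseline: the backend matches eager for that kernel.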