Extremely Slow Write in VM KVM/QEMU #62
Comments
Does your VM sit on top of RAID storage, or networked storage, or anything else that might make flushing data to stable storage incur a significant latency? VDO is pretty aggressive (probably more than necessary) about flushing any write-back caches and waiting for stuff to make it to stable storage. XFS, on the other hand, tries very hard to minimize flushes, sometimes doing them only every 30 seconds or so. If you’re dealing with storage that doesn’t support flushes, the issue is probably somewhere else.

Some things to look at first within the VM would be:

- /sys/block/<device>/queue/write_cache -- this virtual file may say “write back” if caching is enabled, or “write through” if we don’t need to worry about flushing writes.
- sar -d 1 -- this command (or the iostat equivalent) should show us, second by second, whether there’s I/O activity happening from VDO to storage, and how fast, or whether the issue is somewhere else.

Another thing to check is whether the VM is consuming a lot of resources in the host environment, triggered by VDO usage patterns -- for example, lots of cycles processing interrupts. Are you able to see if qemu is using lots of CPU cycles, or if the host environment reports excessive numbers of interrupts? Our performance testing has generally been on real, fast hardware (targeting server farm environments) rather than VMs, so it’s possible there are issues we haven’t seen.
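For a quick check inside the guest, something like this works (the device name vda is just a placeholder for whatever device backs the VDO volume on your system):

```sh
# Does the virtual disk advertise a volatile write cache?
cat /sys/block/vda/queue/write_cache

# Watch per-device I/O rates, once per second (either tool works)
sar -d 1
iostat -xd 1
```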
nillin42:
So VDO’s backing storage isn’t a raw disk device directly, it’s a file in ext4, by way of qemu disk emulation, correct? File system overhead might be an issue, especially if “writethrough” translates to “synchronous I/O”. (I don’t know if it does or not.) Two things I would suggest trying:
Either of these, or the combination, might improve things, depending on where the slowdown is coming from.
/proc/interrupts has counters per interrupt type and per CPU; rapidly increasing numbers may indicate a lot of interrupts.
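A simple way to see whether the counters are climbing quickly is to snapshot the file twice and compare, or just use watch:

```sh
# Two snapshots of the per-CPU interrupt counters, two seconds apart
cat /proc/interrupts > /tmp/irq.before
sleep 2
cat /proc/interrupts > /tmp/irq.after
diff /tmp/irq.before /tmp/irq.after

# Or highlight changes interactively
watch -n1 -d cat /proc/interrupts
```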
So I changed the qcow2 file to direct access to a partition, also without an LVM layer in between: disk partition --> virtio/SCSI in the VM --> LVM/VDO --> XFS.
tigerblue77 wrote:
It’s worth experimenting. From what I’ve seen with basic write-back caching, the results could go either way. With a RAID 6 configuration, my guess would be the cache may be useful for collecting full or nearly-full stripes together for writing (simplifying checksum calculation), whereas VDO’s write pattern can be a bit more random -- we shuffle things a bit to try to write consecutive blocks together, but close-but-not-consecutive writes aren’t reordered, and VDO doesn’t know anything about the RAID stripe configuration.
No, I wouldn’t say that. I just meant that it’s one area where VDO’s performance (or EXT4/XFS on VDO) is likely to look poor compared to file systems on raw storage, because we do so much flushing of buffers. XFS gets away with fairly infrequent flushes (e.g., a couple every 30 seconds), yet XFS-on-VDO will send many flushes to the underlying storage, perhaps several per second. Without having dug into it deeply yet, I would blithely assert :-) that in the XFS-on-VDO case, we probably shouldn’t need to issue many flushes when we haven’t received a flush from XFS (or whatever’s above us, in the general case). Though it may take some significant work on the code to get it to send fewer flushes, safely. It may be the case (I haven’t checked) that EXT4 might send more flushes to VDO than XFS, but if it’s not a high rate I doubt it makes much difference. All that said, I haven’t actually done performance tests of different filesystems atop VDO to see how they fare. If you do, I’d be interested in the results. The tests I've done generally involve writing directly to the VDO device.
This (neighborhood of 200MB/s, with a fair bit of variability moment to moment) looks more like what I might expect from VDO, depending on the hardware. Though sending zero blocks is a bit of a cheat, as we don’t store zero blocks except as a special mapping from the logical address. OTOH, if it was a newly created device, the first writes have to populate the block-address-mapping tree, even if you’re writing zero blocks, and we’ve noticed a bit of a performance hit in the tree initialization until it’s fully allocated, at least for the logical-address region your test cares about. I’ve gotten over 1GB/s write throughput (of nonzero data) in higher-end server configurations, but it took some tweaking of thread counts, CPU affinities, and other stuff.
So we got about 40% of the raw disk throughput with VDO? Not too bad to start with. If you want to tune it, I’d look at whether any of VDO’s threads are using a lot of CPU time. For the thread types with thread-count parameters, aside from the “bio” threads, if the CPU usage is over 50%, bump up the thread count to distribute the load better; if it’s below… oh, maybe 20%… try dropping it by one, to reduce the thread switching needed.
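One way to see which of VDO’s kernel threads are busy (I’d expect the kvdo module’s worker threads to show up with a “kvdo” prefix in their names, but verify what they’re called on your system):

```sh
# Per-thread CPU usage, busiest first, filtered to the VDO worker threads
ps -eLo pid,tid,pcpu,comm --sort=-pcpu | grep -i kvdo | head -20

# Or watch threads interactively
top -H
```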
If your system has a NUMA configuration, there’s a lot of tweaking of CPU affinities to look at too. Actually, it might help in a non-NUMA configuration with many cores, but I haven’t explored that. Cache contention between CPU modules is generally a bigger issue than between cores in the same CPU. (And in tools like “top” it looks like busy threads, because they stay on CPU while waiting to load data.) Some XFS tweaks may help, too -- since XFS will frequently rewrite its logs and they’re unlikely to usefully deduplicate, you could add a non-VDO volume in your RAID volume group, alongside VDO, to use as an external XFS log.
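A sketch of the external-log idea, assuming an existing volume group named “vg” that also holds the VDO volume (all names and sizes here are made up):

```sh
# Small, non-VDO logical volume to hold the XFS log
lvcreate -n xfslog -L 2G vg

# Build the file system on the VDO volume with an external log device...
mkfs.xfs -l logdev=/dev/vg/xfslog /dev/vg/vdo_lv

# ...and remember to pass the log device at mount time as well
mount -o logdev=/dev/vg/xfslog /dev/vg/vdo_lv /mnt/test
```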
nillin42 wrote:
I’m sorry to hear you’re having such problems setting it up. Writing 20 GB in 10 minutes is about 34 MB/s. That’s not great, but it’s a 10-fold increase over your initial report. I assume when you say “dedup is disabled” you mean XFS deduplication? Or did you somehow get a configuration with VDO’s deduplication turned off? I ran a test in a VM running Rocky 8.7 (host environment: ThinkPad T580 laptop running Fedora 35, disk images stored in ext4), created using “vagrant” and this configuration:
I don’t know what sorts of bugs you encountered setting it up, but this part was pretty straightforward for me. A reboot of the VM was needed because the update pulled in a new kernel. After creating an LVM VDO volume in the second disk, creating an XFS file system on it (no extra command-line options), and creating a 17 GiB test file (tar image of /etc and /usr, some of it compressible and some not, replicated 10 times with 3 extra bytes in between so as to shift the contents relative to block boundaries and not create identical runs of blocks), I tried copying it into the XFS file system:
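For anyone wanting to set up something similar, a rough sketch of that kind of LVM-VDO-plus-XFS stack (these are not the exact commands or sizes from my test; the volume group, names, and sizes are just placeholders):

```sh
# VDO pool plus a logical volume on the second disk
vgcreate vdo_vg /dev/vdb
lvcreate --type vdo -n vdo_lv -L 18G -V 40G vdo_vg/vdo_pool

# XFS with default options, mounted for the copy test
mkfs.xfs /dev/vdo_vg/vdo_lv
mount /dev/vdo_vg/vdo_lv /mnt/vdo

# Time the copy plus a sync so the data actually reaches the device
time sh -c 'cp /var/tmp/testfile.tar /mnt/vdo/ && sync'
```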
So, about 6 minutes to copy and sync. That’s about 46 MiB/s. Now, VDO has an unfortunate extra speed penalty when it’s first created, when the block-address-mapping tree hasn’t been allocated yet, as I mentioned in my reply to tigerblue77. The first time an address in a range is written to (including a zero block but not including discards), one or more blocks in the tree may need to be allocated and written. Currently, VDO serializes every step of this, and it slows those first writes down a bit. So I tried another test: Without tearing down the VDO device, which now has a bunch of the tree blocks allocated, I unmounted the file system, ran “blkdiscard” on the VDO volume, created a new file system and mounted it. I also created a new test file which shifted the contents of the previous test file by one byte, to prevent trivial deduplication against blocks that had been stored earlier in my test. And I copied the file again:
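A similarly hypothetical sketch of that retest sequence, reusing the placeholder names from above:

```sh
umount /mnt/vdo

# Discard all data on the VDO volume; the block map tree blocks stay allocated
blkdiscard /dev/vdo_vg/vdo_lv

# Fresh file system on the same device, then another timed copy
mkfs.xfs -f /dev/vdo_vg/vdo_lv
mount /dev/vdo_vg/vdo_lv /mnt/vdo
time sh -c 'cp /var/tmp/testfile2.tar /mnt/vdo/ && sync'
```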
More than a minute faster, at almost 57 MiB/s. Not stellar, but for good performance, as I indicated earlier, I’d want to remove the virtualization layer and file system layer underlying it.
I don’t know about dd vs cp, off the top of my head, but copying from /dev/urandom could trip another performance issue. When you read from /dev/urandom, that process cranks through the PRNG code in the kernel to generate your data. I tried a dd writing directly to the VDO volume (no XFS, but not a freshly created VDO volume, so block map tree already allocated) in my Rocky VM. Doing a dd from /dev/urandom to the VDO volume took just over 30s. Doing a dd from /dev/urandom to a file in the root file system took nearly 30s, and the dd process showed nearly 100% CPU utilization. (For any kernel hackers in the audience, “perf” says nearly all of the CPU cycles are going to _raw_spin_unlock_irqrestore.) Doing a dd from that file to the VDO volume took just under 10s. A copy and sync of a random file to XFS-in-VDO took around the same. For comparison, doing a dd from /dev/urandom to a file in the host environment instead of the VM, but on the same hardware, takes under 3.5s. Also, when you write to VDO, we make a copy of the data block in order to do compression and deduplication work after the initial acknowledgement of the write operation, and currently that copy is performed in the calling thread. So if the dd thread is both reading from urandom and writing directly to VDO, it incurs a lot of extra CPU overhead that, with a little care, can be distributed instead. For these reasons, if you want to test with randomly generated data, I suggest writing that data to a file first, especially if you’re using a virtualization environment. With fio’s user-mode PRNG or in a non-virtualized environment, it’s less of an issue, but worth keeping in mind.
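In practice that means generating the random data up front and then timing only the write into VDO, roughly like this (paths and sizes are arbitrary):

```sh
# Generate the random data once; this step is slow and CPU-bound on its own
dd if=/dev/urandom of=/var/tmp/random.dat bs=1M count=2000

# Then time only the copy into the VDO-backed file system
time sh -c 'cp /var/tmp/random.dat /mnt/vdo/ && sync'
```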
Once again I’d wonder about the block map tree allocation issue. XFS allocation patterns might also play into it but I haven’t investigated. I tried this test (still in my VM) and it kept completing in just a few seconds. So I increased the size 10-fold, to 2500M. The first run took almost 70s, but the next one took under 20s. I deleted the file and tried again: 59s, then 11s. With repeated tests, or writing to a second file, the numbers vary, but the general pattern seems to hold.
If this test also used /dev/urandom, or if fio is generating random data, then no; it should’ve tried, but it would’ve failed because random bytes aren’t compressible. (And VDO doesn’t store the compressed version unless it compresses by a little better than 2:1.) Unless you’re making multiple copies of the same random file, deduplication would’ve also consumed CPU cycles and wound up saving no space. So every block seen would be new and would require a full block to store. It’s not really a good demo of VDO’s capabilities, though of course we should be able to handle it with okay throughput. For our internal testing, we use a modified version of fio that can be told what sort of duplication pattern to generate in the synthesized data, and we generally specify how compressible the test data should be.
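Stock fio also has knobs for synthesizing data with a chosen compressibility and duplication rate, which makes for a more representative test than pure random data; for example (the parameters here are arbitrary):

```sh
# Write 2 GiB of ~50%-compressible data with ~30% duplicate blocks
fio --name=vdotest --directory=/mnt/vdo --size=2g --bs=4k --rw=write \
    --ioengine=libaio --iodepth=16 --direct=1 --end_fsync=1 \
    --refill_buffers --buffer_compress_percentage=50 --dedupe_percentage=30
```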
That sounds amiss. VDO is actually a moderate consumer of CPU cycles, for hashing, index scanning, and compressing, whether deduplication and compression are successful or not. Much of the CPU usage is distributed across a bunch of kernel threads, but overall usage adds up. In my test above, the 2 CPUs given to the test VM ran at a little over 50% utilization. If you’re not seeing much CPU utilization at all, most likely that means it’s not getting data very fast (e.g., dd spending too much time reading from urandom), or it’s not able to write and flush data to the backing storage very fast and so the whole pipeline backs up (in which case, compressible or duplicated data may get you better performance).
Our primary target is server type environments with lots of storage and lots of processing power, running directly on real hardware -- file servers, virtualization servers, stuff like that. But it should work on smaller systems too, unless they’re really underpowered. On a really small system, there might not be enough duplication of data for VDO to be worth the overhead. (We really don’t focus much effort on compression-only setups, but we’ve discussed it.) Most of our team’s lab test machines are either VMs or server systems with lots of RAM and CPU, so a test on “normal” hardware (whatever that means) instead of a server is a little tricky for me to set up.
Oh, I misspoke above... in the tests I did a couple of years ago, it was btrfs that only flushes its data to disk every 30 seconds, not XFS.
Thanks for setting up a test environment and posting the results. My goal was to test VDO with compression but without deduplication enabled, and instead do deduplication offline via XFS, because I think VDO’s deduplication is similarly resource-intensive to that of ZFS. My target use case is a virtualization host in an HCI setup. I’m not sure, but it looks like you used VDO with deduplication active. In that case the results are not so bad, if your underlying disk performs at around 100 MB/s.
Hi Developer,
I wanted to test the performance of VDO in a virtual machine. My test environment was Debian 11 with KVM/QEMU. I created a VM (Q35, Virtio SCSI, 20 GB disk) with Rocky 8.5 and 9.1 as the guest OS, created a VDO volume with compression, and put an XFS file system on top. Copying files from a second vdisk (on another hard disk) to VDO/XFS gave write speeds of 3 MB/s. Without VDO I got nearly 100 MB/s. There are no CPU or RAM bottlenecks. Is there a known problem with VDO in virtual machines? There is no hint in the Red Hat documentation.
regards