
[vioscsi] Fix SendSRB regression and refactor for optimum performance [viostor] Backport SendSRB improvements #1135

Draft · wants to merge 5 commits into master

Conversation

benyamin-codez
Contributor

Fixes regression in SendSRB of [vioscsi]

Related issues:
#756
#623
#907
...
[likely others]

Regression was introduced in #684 between 7dc052d and fdf56dd.
Specifically due to commits f1338bb and fdf56dd (considered together).

PR includes:

  1. Send SRB fix
  2. Refactoring for optimum performance
  3. WPP instrumentation improvements
  4. SendSRB improvements backported into [viostor]
  5. Minor cleanup

The WPP improvements include:

  1. Macro optimisation
  2. Exit fn() insurance
  3. Prettify trace output
  4. New macros for FORCEDINLINE functions

Look out for 1x TODO + 2x FIXME... RFC please.

Tested on:

  • QEMU emulator version 9.0.2 (pve-qemu-kvm_9.0.2-2) ONLY
  • Linux 6.8.12-1-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-1 (2024-08-05T16:17Z) x86_64 GNU/Linux
  • Hypervisor: Intel-based ONLY
  • Guest: WindowsServer2019 (Win10 build) ONLY
  • Directory backing ONLY
  • Proxmox "VirtIO SCSI Single" with iothreads (multiple HBAs) was used mostly
  • Limited testing done on single HBA with multiple LUNs (iothread=0)
  • AIO = threads ONLY - UNTESTED on io_uring or native AIO (suspect this will fix issues here)
  • Local storage backings (raw and qcow) ONLY, untested on networked backings

YMMV with other platforms and AIO implementations (but I suspect they will be fine).

Performance is significantly higher, even when compared to version 100.85.104.20800 in virtio-win package 0.1.208.
I suspect this is mainly because this fix unlocks other improvements made when the regression was introduced.

Occasional throughput of 12+ GB/s was achieved in guest (64KiB blocks, 32 outstanding queues and 16 threads).

Sequential reads performed at just below the backing storage max throughput at 7GB/s (1MiB blocks, 8 outstanding queues and 1 thread). Random reads came in at 1.1GB/s, 270K IOPS at 1850 µs latency (4KiB blocks, 32 outstanding queues and 16 threads). Write performance was approximately 93% of read performance.

I tested on a variety of local storage media, but the above numbers are for a 7.3GB/s Read / 6GB/s Write 1M IOPS R/W NVMe SSD. So throughput was as expected, but IOPS were only ~27%... It will be interesting to see what numbers come up on a decently sized NVMe RAID0/10 JBOD...! Certainly beware of running on old crusty spindles..! 8^d
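For anyone wanting to reproduce, a hypothetical diskspd invocation for the 4KiB random-read figure above might look like the following (the flag choices are an assumption, not the exact command used, and the test file path is a placeholder):

diskspd.exe -b4K -o32 -t16 -r -w0 -d60 -Sh -L -c2G C:\test\testfile.dat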

It is also worth mentioning that the ETW overhead when running an active trace with all flags set is about 8%.

Freedom for Windows guests once held captive...! 8^D

cc: @vrozenfe @JonKohler @sb-ntnx @frwbr @foxmox

Related external issues (at least confounded by this regression):

https://bugzilla.proxmox.com/show_bug.cgi?id=4295
https://bugzilla.kernel.org/show_bug.cgi?id=199727
...
[likely others]

@vrozenfe
Collaborator

@benyamin-codez

Thanks a lot.

Would it be possible to split this PR into two separate ones - WPP and SRB related?

Can you please share the qemu command line and the performance numbers?

If possible it will be nice to compare IOs/throughput numbers in relation to the number of queues, queue depth, IO size, CPU usage and IO latency for the old and new code. (However, if there is any problem with that I will try to test it by myself).

Best,
Vadim.

@benyamin-codez
Contributor Author

@vrozenfe

Happy to help Vadim.

Can do the split. Gives me a chance to fix ProcessQueue() too (https://github.com/virtio-win/kvm-guest-drivers-windows/pull/1135/files#diff-b8f1732536a5020c949209736b628a0af83022fbbded42c0eaef7bc3149dd6c3L1475-R1493)...

I can probably run up some comparisons again too.
I'll use diskspd and post the results applicable to each split with the new PRs. How does that sound...?

I've used several different qemu command lines for various scenarios including with blkdebug to simulate failures.
Would you like me to post the one coincident with the performance stats?

It's likely to be all using threads with a HBA per disk. That ok? I can share my stat collector script if that helps.

Cheers,
Ben

@JonKohler
Contributor

A few passing comments. First, fantastic work, I jumped out of bed seeing this this morning, very exciting stuff.

Some housekeeping, small stuff:

  • Let's split this up as Vadim said. The smaller and more narrow, the merrier. As you've seen, driver code is complex, it lives forever, and regressions/nuances can come up weeks/months/years later.
  • On each commit, can we do a bit of long-hand on the commit messages themselves? Code is code of course, but it's good to get a bit of explanation of the intentions, observations, impacts, etc.
  • On each commit, include a signed-off-by (i.e. git commit -s)
  • Avoid merge commits (IMHO), each PR should be just regular ole commits only.

Past that, RE sharing scripts/cmdlines/data - yes please! The more the merrier, and it will only help the overall community here get better and stronger over time.

RE performance - very exciting stuff, this gets me going. When talking about it, it's great to talk about test X, with IO pattern ABC, with both before and after. Maybe even throw it in a simple table in markdown in the PR? Bonus points for exact incantations (like diskspd.exe -blah -here -stuff) so that others can reproduce on other configurations.

Contributor

@JonKohler JonKohler left a comment

Thanks again, Ben - did a first pass


if (!Srb) {
Contributor

Just curious, do we ever hit this scenario where we are coming into SendSRB with a null Srb? Meaning, are there any callers that actually can do this?

Contributor Author

Seemed like a sanity check to me. It lives on, so maybe someone saw one in the past. We can add a WPP macro here to check and use a reserved flag and just look for that in the trace. It would take pretty extreme performance testing, and more to the point - fairly thorough fail condition testing. Maybe raise a new issue?

Collaborator

Right, it is a kind of sanity check. Static driver verifier was failing without this check.

}

EXIT_FN_SRB();
return;
Contributor

do we need this extra indent?

Contributor Author

No, nor the indent for the brace at L75.


if (!Srb) {
EXIT_FN_SRB();
Contributor

EXIT_FN_SRB() is an alias for debugging, right? If so, can we align/indent appropriately here? Makes it easier to read.

Contributor Author

The question here I think is: does it stay or does it go...? Another call slows the driver down.
Then again, this is technically a macro, so normally it wouldn't be indented, methinks.
@vrozenfe can probably answer this.

Contributor Author

It's probably worth mentioning that we could add some conditional blocks like #ifndef TRACE_VQLOCKS or #ifndef TRACE_SRBS to wrap these. Then indenting might be less of an issue.
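One possible shape (an assumption, not code from this PR): gate the macro definition itself on a dedicated trace flag, so call sites cost nothing and need no special indentation when the flag is off:

#if defined(TRACE_SRBS)
#define EXIT_FN_SRB() \
    RhelDbgPrint(TRACE_LEVEL_VERBOSE, " <-- %s, Srb = 0x%p\n", __FUNCTION__, Srb)
#else
#define EXIT_FN_SRB()
#endif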

Contributor Author

The other option here, if everyone is ok with using it, would be to use goto label_at_end_of_function; and then drop label_at_end_of_function: just before the WPP macro call. I'm fine with that. It's really what goto in C is supposed to be used for, and it cannot be used to jump out of the function like in other langs. I see it in [virtio] in some places. Many projects frown on using goto though, so...
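A minimal sketch of that single-exit pattern (the function name and body are placeholders, not code from this PR):

VOID
SomeVioScsiRoutine(IN PVOID DeviceExtension, IN PSRB_TYPE Srb)
{
    ENTER_FN();
    if (!Srb) {
        goto exit_fn;          /* early exits all funnel to the one label */
    }
    /* ... main body ... */
exit_fn:
    EXIT_FN_SRB();             /* single trace point, no extra indentation */
}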

Contributor Author

... goto is also faster - which I'm guessing we all like...

Contributor Author

@benyamin-codez benyamin-codez Aug 21, 2024

Alternatively, we can just drop an appropriately indented RhelDbgPrint(); and leave it at that.
If we use say RhelDbgPrint(TRACE_LEVEL_RESERVED8, " After KickPrep Notify State : %d \n", notify); that will also work with WPP provided TRACE_LEVEL_RESERVED8 is defined in WPP_DEFINE_CONTROL_GUID(). You can then just toggle that flag to enable or disable the trace.

Any of these work for me. What I think is most important is that there is working tracing...

RhelDbgPrint(TRACE_LEVEL_FATAL, " Could not put an SRB into a VQ, so complete it with SRB_STATUS_BUSY. QueueNumber = %d, SRB = 0x%p, Lun = %d, TimeOut = %d.\n", QueueNumber, srbExt->Srb, SRB_LUN(Srb), Srb->TimeOutValue);
RhelDbgPrint(TRACE_LEVEL_WARNING, " Could not put an SRB into a VQ, so complete it with SRB_STATUS_BUSY. QueueNumber = %d, SRB = 0x%p, Lun = %d, TimeOut = %d.\n", QueueNumber, srbExt->Srb, SRB_LUN(Srb), Srb->TimeOutValue);
}
VioScsiLockManager(DeviceExtension, MessageID, &LockHandle, FALSE, VIOSCSI_VQLOCKOP_UNLOCK);
Contributor

So is the bug here that we're simply unlocking the queue too soon, such that when there is the error path here, things can get jumbled up and blow up?

Contributor Author

I believe so, but probably something underlying in the virtqueue implementation too.

frwbr

I think this change doesn't just touch the error path but also the happy path (res >= 0):

  • Without this change (on 54442f0), a possible sequence of events for the happy path is
    VioScsiVQLock -> virtqueue_add_buf -> VioScsiVQUnlock -> virtqueue_kick_prepare -> virtqueue_notify
    
    so virtqueue_kick_prepare and virtqueue_notify happen outside the lock.
  • With this change, the sequence of events is
    VioScsiVQLock -> virtqueue_add_buf -> virtqueue_kick_prepare -> VioScsiVQUnlock ->  virtqueue_notify
    
    So with this change, virtqueue_kick_prepare happens inside the lock, virtqueue_notify happens outside the lock.

I'm no VirtIO expert so I can't really say whether virtqueue_kick_prepare is supposed to happen inside or outside the lock. But FWIW, to me it looks like the Linux virtio scsi driver also kicks inside the lock and notifies outside the lock:

https://github.com/torvalds/linux/blob/b311c1b497e51a628aa89e7cb954481e5f9dced2/drivers/scsi/virtio_scsi.c#L484

Also, it looks like until and including 7dc052d, the virtqueue_kick_prepare used to happen inside the lock:

notify = virtqueue_kick_prepare(adaptExt->vq[QueueNumber]) ? TRUE : notify;

and happens outside the lock starting with f1338bb (this commit is also mentioned by @benyamin-codez)

if (virtqueue_kick_prepare(adaptExt->vq[QueueNumber])) {

Could the early unlock be the cause for the vioscsi "Reset to device \Device\RaidPortN" + IO lockups?

Contributor Author

@benyamin-codez benyamin-codez Aug 21, 2024

Could the early unlock be the cause for the vioscsi "Reset to device \Device\RaidPortN" + IO lockups?

Yes, under high load conditions, this could certainly be a trigger.

It's probably worth pointing out that virtqueue_kick_prepare() and virtqueue_kick() are two different things. An actual kick includes the virtqueue_notify(), which executes the notification callback on the VQ.

Here is the latter from VirtIOPCICommon.c:

void virtqueue_notify(struct virtqueue *vq)
{
    vq->notification_cb(vq);
}

void virtqueue_kick(struct virtqueue *vq)
{
    if (virtqueue_kick_prepare(vq)) {
        virtqueue_notify(vq);
    }
}

However, virtqueue_kick_prepare() - which in our case calls virtqueue_kick_prepare_split() - prepares the buffer and returns TRUE if the device needs to be notified. Here is the return comparator for reference:

if (_vq->vdev->event_suppression_enabled) {
    return wrap_around || (bool)vring_need_event(vring_avail_event(&vq->vring), new, old);
} else {
    return !(vq->vring.used->flags & VIRTQ_USED_F_NO_NOTIFY);
}

...and:

/* The Host uses this in used->flags to advise the Guest: don't kick me when
 * you add a buffer.  It's unreliable, so it's simply an optimization.  Guest
 * will still kick if it's out of buffers.  */
#define VIRTQ_USED_F_NO_NOTIFY	1

It also bears mentioning that even when a notification callback is required, it is completely up to the device as to what it will do. It may well just ignore the kicks...
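Condensing the above, the ordering this PR moves SendSRB() towards looks roughly like this (a sketch of the sequence only, not the literal diff):

VioScsiVQLock(DeviceExtension, MessageID, &LockHandle, FALSE);
res = virtqueue_add_buf(adaptExt->vq[QueueNumber], &srbExt->sg[0],
                        srbExt->out, srbExt->in, &srbExt->vbr, va, pa);
if (res >= 0) {
    /* decide *inside* the lock whether the device needs to be notified */
    notify = virtqueue_kick_prepare(adaptExt->vq[QueueNumber]);
    InsertTailList(&element->srb_list, &srbExt->list_entry);
    element->srb_cnt++;
}
VioScsiVQUnlock(DeviceExtension, MessageID, &LockHandle, FALSE);
if (notify) {
    /* the notification itself stays *outside* the lock */
    virtqueue_notify(adaptExt->vq[QueueNumber]);
}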

RhelDbgPrint(TRACE_LEVEL_WARNING, " Could not put an SRB into a VQ, so complete it with SRB_STATUS_BUSY. QueueNumber = %d, SRB = 0x%p, Lun = %d, TimeOut = %d.\n", QueueNumber, srbExt->Srb, SRB_LUN(Srb), Srb->TimeOutValue);
}
VioScsiLockManager(DeviceExtension, MessageID, &LockHandle, FALSE, VIOSCSI_VQLOCKOP_UNLOCK);
if (notify){
Contributor

Why would we not want to kick the underlying backend in the failure case?

In the QUEUE_FULL branch above, notify will be false, yea? In which case, we won't send a kick. I'd imagine we would want to kick unilaterally, just in case it needs to be hit with a hammer?

Contributor Author

Yeah, I think there was a notify = TRUE; at the foot of the QUEUE_FULL branch previously. I didn't notice any change with or without it in testing. I can have another go around though, but it could be in the virtqueue implementation (which I'm presently looking at).

Contributor Author

@benyamin-codez benyamin-codez Aug 21, 2024

So it appears that res will only ever (famous last words) be a 0 (SUCCESS) or a 28 (ENOSPC: No space left on device), i.e. SCSISTAT_QUEUE_FULL. The test should really be if (res == 0) {}...

As no data was added to a buffer, does it really need any further action?

You might observe in my reply to @frwbr above that a call to virtqueue_kick() just calls virtqueue_kick_prepare() and then virtqueue_notify(vq), so the notify won't happen unless the kick prepare returns TRUE.

We could call a virtqueue_notify(vq), but the device won't find anything there.

My thoughts were that it would just consume compute resources and time.
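A condensed sketch of that comparator change (illustrative only; variable names follow the existing SendSRB code):

res = virtqueue_add_buf(adaptExt->vq[QueueNumber], &srbExt->sg[0],
                        srbExt->out, srbExt->in, &srbExt->vbr, va, pa);
if (res == 0) {
    /* buffer queued: work out whether a notification is needed */
    notify = virtqueue_kick_prepare(adaptExt->vq[QueueNumber]);
} else {
    /* res == 28 (ENOSPC): the virtqueue is full, so complete the SRB
     * with SRB_STATUS_BUSY / SCSISTAT_QUEUE_FULL and skip the notify */
}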

if (res >= 0) {
notify = virtqueue_kick_prepare(adaptExt->vq[QueueNumber]);
Contributor

To be clear, in this case, we're only //potentially// kicking the underlying device if we're successful in enqueueing work.

This function also has a membarrier, which we are now doing under the lock, vs outside the lock in the previous code.

I certainly love it when stuff is faster, but it isn't immediately clearer to me why this code would be more performant than the existing code. Is the idea that we're changing the amount of kicks happening for the better? Would love to know your observations here

Contributor Author

This was the breaking change following the v208 package.
This is where I'm now focusing my efforts per update in #756...
The underlying issue does certainly seem to be about managing virtqueue notifications...

@@ -185,6 +191,7 @@ DeviceReset(

ENTER_FN();
if (adaptExt->dump_mode) {
EXIT_FN();
Contributor

ditto on my comments about indentation if we can get away with it

Contributor Author

Yeah, let's wait to see what Vadim says.

@@ -384,46 +393,48 @@ ENTER_FN();
SP_INTERNAL_ADAPTER_ERROR,
__LINE__);
RhelDbgPrint(TRACE_LEVEL_FATAL, " CANNOT READ PCI CONFIGURATION SPACE %d\n", pci_cfg_len);
EXIT_FN();
Contributor

this whole part should be a separate PR. Not that it's bad, just seems orthogonal?

Contributor Author

The issue was that in the trace I could see entry into the functions, but when leaving early, it indicated a potential problem, because the corresponding exit trace was missing.

Contributor Author

Jon, if you were actually talking about the section below, it was because it was in orphaned braces.

@benyamin-codez
Contributor Author

@JonKohler

Thanks for the contribution and review Jon.

I'll take the housekeeping on board and I'm happy to share my stat collector scripts and results too.

Give me a couple of days - I have some other priorities to deal with...

@frwbr frwbr left a comment

Thanks a lot @benyamin-codez for your efforts and for submitting this PR!

If I understand correctly this PR combines a possible fix for the vioscsi "Reset to device \Device\RaidPortN was issued" messages + IO lockups with refactoring to improve performance and WPP instrumentation (it was already suggested to split up the PR in smaller ones). Personally I'm most interested in the "Reset to device" fix (though performance improvements are of course also nice). So I tested the reproducer on current master and with this PR applied:

  • On 54442f07 the reproducer produces the vioscsi warnings pretty reliably (>=9 out of 10 runs)
  • With this PR applied on top (benyamin-codez@976c325) the reproducer has not produced vioscsi warnings so far!

So this looks promising, but I'd like to understand the why -- I've also added a comment inline.

Thanks again!


@benyamin-codez
Contributor Author

@frwbr

Thanks for dropping your post above, Friedrich, and for taking the time to build and test it out.

For those interested, check out my inline reply.

@benyamin-codez
Contributor Author

@vrozenfe @JonKohler @sb-ntnx @frwbr @foxmox

Please bear in mind that despite fixing the issue mentioned in #756 the performance is not reliable.

Given I've got limited time over the next few days, and I'm not aware of the timing for the next release, would you prefer I spend the time I do have on submitting the new SendSRB fix PR for review, or hunt down the issue affecting performance first? Personally, I'm happy to continue working on it here until it's ready for prime time. One reason for this position is that I don't think it is as stable as the 100.85.104.20800 version. The extra performance the fix permits has affected stability and reliability.

Some pictures might help explain what I mean... Each of the following shows the end point of multiple runs of my stat collector. There should be 6 distinct peaks in I/O for each test run; the same for CPU... All are using aio=threads with iothread enabled. You will observe that it is very difficult to get stats that don't require caveats and interpretation.

This one is running just the SRB fix (this one doesn't show an example of CPU utilisation - my bad 8^d ):

[Screenshot: SrbFix_only]

This one is version 100.85.104.20800:

[Screenshot: v100 85 104 20800]

This one was a test running the code in this draft PR, but with the optimisations for ProcessQueue() removed:

[Screenshot: ProcessQ_without_inline]

This one is the same, but is running two HBAs with iothread enabled:

[Screenshot: ProcessQ_without_inline_2_disk_mq]

This last one might appear more reliable, but that is because it is only operating at 9 to 17% of the throughput of the others, which are all single HBA with iothread enabled, and you should notice it eventually becomes unreliable too. So this is why I am focusing on this issue at the moment.

Also, re the stat collector scripts, can someone suggest a suitable place in the tree, e.g. /vioscsi/stat-collector or /Tools/stat-collector or similar? I'll then do a PR for them too.

Let me know...

@benyamin-codez
Contributor Author

Fail on Disk Stress (LOGO). Not surprised. 8^(

Comment on lines 197 to -211

element = &adaptExt->processing_srbs[QueueNumber];
VioStorVQLock(DeviceExtension, MessageId, &LockHandle, FALSE);
if (virtqueue_add_buf(vq,
&srbExt->sg[0],
srbExt->out, srbExt->in,
&srbExt->vbr, va, pa) >= 0) {
res = virtqueue_add_buf(vq,
&srbExt->sg[0],
srbExt->out, srbExt->in,
&srbExt->vbr, va, pa);

if (res >= 0) {
notify = virtqueue_kick_prepare(vq);
element = &adaptExt->processing_srbs[QueueNumber];
InsertTailList(&element->srb_list, &srbExt->vbr.list_entry);
element->srb_cnt++;
VioStorVQUnlock(DeviceExtension, MessageId, &LockHandle, FALSE);
#ifdef DBG
InterlockedIncrement((LONG volatile*)&adaptExt->inqueue_cnt);
#endif
result = TRUE;
}
Contributor Author

Moving the element = &adapt->processing_srbs[QueueNumber]; to after the virtqueue_kick_prepare() might be an issue here... I will refactor and retest...

Contributor Author

This is the same for [vioscsi]... I will check there first.

Contributor Author

@benyamin-codez benyamin-codez Aug 22, 2024

Already done in my present [vioscsi] WIP... Will look at [viostor] once [vioscsi] is stable.

    //
    // Moved outside conditional so (res != VQ_ADD_BUFFER_SUCCESS) can access it for RemoveTailList(&element->srb_list)
    element = &adaptExt->processing_srbs[index];
    //
    // res, i.e. virtqueue_add_buf() will only
    // return 0 (SUCCESS) or 28 (ENOSPC), i.e. No space left on device.
    // Therefore setting comparator from >= to ==
    // Also defined VQ_ADD_BUFFER_SUCCESS (as 0) in helper.h
    if (res == VQ_ADD_BUFFER_SUCCESS) {
        //
        // Kicker returns like this:
        // 
        //  if (_vq->vdev->event_suppression_enabled) {
        //      return wrap_around || (bool)vring_need_event(vring_avail_event(&vq->vring), new, old);
        //  } else {
        //      return !(vq->vring.used->flags & VIRTQ_USED_F_NO_NOTIFY);
        //  }
        //
        /* The Host uses this in used->flags to advise the Guest: don't kick me when
         * you add a buffer.  It's unreliable, so it's simply an optimization.  Guest
         * will still kick if it's out of buffers.
         * 
         #define VIRTQ_USED_F_NO_NOTIFY	1
         */
        // So ....
        //
        notify = virtqueue_kick_prepare(adaptExt->vq[QueueNumber]);
        RhelDbgPrint(TRACE_LEVEL_RESERVED8, " After KickPrep Notify State  : %d \n", notify);
        //
        // Moved outside conditional so else can access
        //element = &adaptExt->processing_srbs[index];
        InsertTailList(&element->srb_list, &srbExt->list_entry);
        element->srb_cnt++;
    } else {
        //
        // So to reiterate:
        // This only gets hit if virtqueue_add_buf() returns 28, i.e. the buffer is full.
        //
        // Put the DbgPrint above rest to ensure we get problem data. Also split the long line.
        RhelDbgPrint(TRACE_LEVEL_FATAL, " Could not put an SRB into a VQ, so complete it with SRB_STATUS_BUSY. QueueNumber = %d, SRB = 0x%p, Lun = %d, TimeOut = %d.\n"\
        , QueueNumber, srbExt->Srb, SRB_LUN(Srb), Srb->TimeOutValue);
        //
        // Adding equivalent for InsertTailList() as the 
        // working version 100.85.104.20800 driver
        // had ExInterlockedInsertHeadList() here
        // ---> NOT required
        //RemoveTailList(&element->srb_list);
        ScsiStatus = SCSISTAT_QUEUE_FULL;
        SRB_SET_SRB_STATUS(Srb, SRB_STATUS_BUSY);
        SRB_SET_SCSI_STATUS(Srb, ScsiStatus);
        StorPortBusy(DeviceExtension, 10);
        CompleteRequest(DeviceExtension, Srb);
        //
        // Don't think these are necessary as the buffer is full.
        //notify = TRUE;
        //virtqueue_notify(adaptExt->vq[QueueNumber]);
        //
        // ...but do we need this...? CompleteRequest() takes 
        // the SRB and finalises it, so best to wipe ours methinks...
        // 
        srbExt = NULL;
    }

@frwbr

frwbr commented Aug 22, 2024

FWIW, I have run my reproducer against a vioscsi build with this PR applied on top of 54442f0 a few more times, and still haven't seen the vioscsi "Reset to device \Device\RaidPortN was issued" messages + IO lockups from #756 so far. I have also applied only the hunk I suspect to be responsible (diff to 54442f0 is here) and have not seen the issue either.

Given I've got limited time over the next few days, and I'm not aware of the timing for the next release, would you prefer I spend the time I do have on submitting the new SendSRB fix PR for review, or hunt down the issue affecting performance first?

IMHO, if this PR indeed contains a fix for the IO lockup issues from #756 (this would be the "SendSRB fix" you mention?), it would be amazing to get the fix in soon. So, if time permits, it would be great if you could open an isolated PR for the SendSRB fix. If the fix is indeed in this hunk, I'd expect the PR to be relatively small and thus easier to review than a combination of multiple things. Also, I'd hope the fix wouldn't have a huge performance impact (but this is just hope speaking, I haven't actually run any performance tests myself).

Though the other improvements to performance and tracing from this PR also sound promising, in comparison to the IO lockups they seem to be of lower priority. But this is just my opinion, YMMV :)

One reason for this position is that I don't think it is as stable as the 100.85.104.20800 version. The extra performance the fix permits has affected stability and reliability.

Can you elaborate what you mean by "stability and reliability" here? Do you mean the performance is not as uniform over time (more "bursty")?

Some pictures might help explain what I mean... Each of the following shows the end point of multiple runs of my stat collector. There should be 6 distinct peaks in I/O for each test run; the same for CPU... All are using aio=threads with iothread enabled. You will observe that it is very difficult to get stats that don't require caveats and interpretation.

Sorry, I'm not sure what I should be looking at in the screenshots and how to compare the different results. Could you explain in more detail what setup was tested in which scenario (maybe with the code git revision and QEMU command line), and share the diskspd outputs?

Fail on Disk Stress (LOGO). Not surprised. 8^(

Which version of the code fails, and in which sense does it fail?

Again, many thanks for you work!

@JonKohler
Contributor

Agreed - stability and lockup style bugs must come before optimization type work. If we can isolate that, let's peel that to a separate very skinny commit/PR and get that cooking first

@benyamin-codez
Contributor Author

benyamin-codez commented Aug 22, 2024

@JonKohler @frwbr, I agree with you both. I think we are all on the same page.

I've got some answers to your questions Friedrich. I'm pretty sure I'm on a path now to getting it sorted. I've got to step away for a couple of hours, but I'll post here as soon as I can.

In the meantime, a picture to show what has to be compared. I'm sure you will agree that makes stats pretty useless unless you do relatively short stints over a very long period.

[Screenshot: highlight_inconsist_v100 85 104 20800]

@frwbr

frwbr commented Aug 23, 2024

@JonKohler @frwbr, I agree with you both. I think we are all on the same page.

I've got some answers to your questions Friedrich. I'm pretty sure I'm on a path now to getting it sorted. I've got to step away for a couple of hours, but I'll post here as soon as I can.

Great! I myself will be away from computers for ~2 weeks now, but will be happy to review/test when I'm back. Again, thanks for your work!

@benyamin-codez
Contributor Author

benyamin-codez commented Aug 24, 2024

For those watching:

I've spent a couple more hours tracing this through, and it looks like we have at least three issues.

First, the issue of notifications. That seems to be solved for the most part with fixes already in this PR.

Second, there are some not insignificant issues with spinlock management. This is what I am focused on at the moment.

If it wasn't obvious to anyone, we only issue StorPortReleaseSpinLock() for non-InterruptLock type spinlocks:

VOID
//FORCEINLINE
VioScsiVQUnlock(
    IN PVOID DeviceExtension,
    IN ULONG MessageID,
    IN PSTOR_LOCK_HANDLE LockHandle,
    IN BOOLEAN isr
    )
{
    ENTER_FN();
    if (!isr) {
        StorPortReleaseSpinLock(DeviceExtension, LockHandle);
    }
    EXIT_FN();
}

That's an issue, but interestingly, if we fix that, we can reliably reproduce kvm: virtio: zero sized buffers are not allowed when using DpcLock type locks in SendSRB(). This appears to be due to multiple asynchronous DPCs or higher level interrupts being issued without the bounds of sufficient queuing (I have seen both in traces). Using, or should I say reverting to, InterruptLock type spinlocks resolves this issue - likely because other types of spin-locks cannot override these. There is also another potential source in SendSRB() that might produce the same error coincident to any such spin-lock contention:

if (!Srb)
    return;

I should mention that using MSI-X based spinlocks doesn't appear to cause any errors - noting I haven't properly stress tested them yet - but graphical representations of the I/O they produce look problematic and erratic. They also didn't immediately solve the irregularities in performance. We use InterruptSynchronizePerMessage, so the concurrent use of InterruptLock type spinlocks with StorPortAcquireSpinLock, or using StorPortSynchronizeAccess per, e.g. SendTMF(), are potential problems. Checking the impacts of different vector allocations is on my TODO list.
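For readers less familiar with StorPort locking, a rough illustration of the three acquisition paths being weighed here (a hedged sketch of the storport.h APIs; the adaptExt->dpc member name is an assumption):

STOR_LOCK_HANDLE LockHandle = { 0 };
ULONG OldIrql;

/* DpcLock: LockContext must point at the STOR_DPC being serialised against */
StorPortAcquireSpinLock(DeviceExtension, DpcLock, &adaptExt->dpc[MessageID], &LockHandle);
/* ...critical section... */
StorPortReleaseSpinLock(DeviceExtension, &LockHandle);

/* InterruptLock: no context, serialises against the adapter interrupt */
StorPortAcquireSpinLock(DeviceExtension, InterruptLock, NULL, &LockHandle);
/* ...critical section... */
StorPortReleaseSpinLock(DeviceExtension, &LockHandle);

/* Per-message (MSI-X) spinlock: a separate API keyed by MessageID */
StorPortAcquireMSISpinLock(DeviceExtension, MessageID, &OldIrql);
/* ...critical section... */
StorPortReleaseMSISpinLock(DeviceExtension, MessageID, OldIrql);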

The third issue looks to be related to the use of virtqueue assignment. It looks like there are several issues here which I am slowly unpacking... It appears in some places it is possible that a virtqueue might be replaced mid-process upon entering a new function. This could be another cause of kvm: virtio: zero sized buffers are not allowed.

The work continues in a limited fashion for the next few days...

@benyamin-codez
Contributor Author

For those watching:

I think I'm almost done and hope to provide a report here in the next 12 to 24 hours.

I'll then split up the commits into new PRs.

The performance reliability issue turned out to be a bit of a red herring. Proper preparation of the diskspd test files with a 2GiB size and random data contents using diskspd.exe -w100 -Zr -d30 <testfile> resolved that for the most part. A pic showing consistent I/O patterns:

[Screenshot: smooth_sailing]

@benyamin-codez
Contributor Author

A somewhat belated progress report:

So I've had trouble reproducing faults with diskspd.exe test files containing random data. The faults do occur with zeroed test files, which I find quite interesting...

It suggests there might be one or more race conditions at play when dealing with zeroed files / data. This could also potentially be a source for the kvm: virtio: zero sized buffers are not allowed errors, but note that here I am only speculating. In any case, we might be seeing errors during file allocation operations e.g. as zeroed-out extents are added, or where extents are verified to be zeroed. IIRC, this does occur both in SQL and ESE databases where errors have been reported.

In relation to my fixes: apart from the initial notification fix in SendSRB(), I have the fix for VioScsiVQUnlock() and some further improvements for DPC processing.

It's noteworthy that AFAICT, we really only do DPC over MSI-X now. It would seem that a call to use an InterruptLock type spinlock would now be a most atypical operation. Although I built a spinlock manager function capable of managing all types of spinlocks, including MSI/MSI-X spinlocks, the reality is that DpcLock type spinlocks are likely the only ones we would now explicitly call (StorPort automatically holds InterruptLock and StartIoLock spinlocks while executing some functions).

I also have a large amount of refactoring, much of which is perhaps of little benefit, but might be of some use. Some of the useful stuff is probably my tracing and instrumentation improvements. I will probably drop a branch on my fork with all my edits if anyone is interested. At the moment I am rebasing this work against master to produce smaller, targeted commits for new PRs. I'm pretty sure I also fixed a couple of orphaned sections from previous squashings, which this work will find (provided I didn't create the orphans from my own refactoring).

I did check virtqueue assignments and also MSI-X CPU affinity, and found them to be working correctly in a wide variety of operational scenarios. In my refactored solution, I did implement StorPortInitializePerfOpts() earlier in VioScsiHwInitialize(), i.e. first, but implementing the necessary instrumentation against master showed that master worked anyway with it the other way around.

I'm just making one last revisit to PhysicalBreaks and VIRTIO_MAX_SG and related relationships, but my environment includes some disks with .../queue/max_hw_sectors_kb:32767 which I presume results in scsi_config.max_sectors = 0xFFFF (65,535) via 32,767*1,024/512=0xFFFE ==> + 1 sector = 0xFFFF. For me I also get .../queue/max_segments:168 for these devices which was unexpected. I also have .../queue/max_hw_sectors_kb:128 for some other devices, but a mix of .../queue/max_segments:60 (SATA3 HDDs) and .../queue/max_segments:33 (NVMe SSDs) for these. I suspect QEMU is reporting the highest maximum of all attached devices + 1, i.e. scsi_config.max_sectors = 0xFFFF (65,535) and seg_max appears to simply be MAX_CPU - VIRTIO_SCSI_REQUEST_QUEUE_0... ¯\_(ツ)_/¯

To that end, can someone please confirm from a trace what they get during GetScsiConfig() (it follows right after PCI capability processing during driver initialisation, so near the top of the trace) for seg_max and max_sectors..?
The relevant results of the following host-side commands would also likely be of some assistance:
grep "" /sys/block/*/queue/max_segments
grep "" /sys/block/*/queue/max_hw_sectors_kb

@JonKohler and @frwbr (and anyone else so inclined), are you able to help with this?

@vrozenfe, is there any reason we don't enable the following?
STOR_PERF_DPC_REDIRECTION_CURRENT_CPU
STOR_PERF_OPTIMIZE_FOR_COMPLETION_DURING_STARTIO
I found these worked well provided VIRTIO_SCSI_CONTROL_QUEUE != MessageId=0...
Also, do we still need ConfigInfo->Dma32BitAddresses = TRUE for IA-32 support...?

Hope to wrap this up in short order... 8^d
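For context, a rough sketch of how these flags are usually negotiated via StorPortInitializePerfOpts() - query first, then set. This is an assumption about the shape of such code, not the code in this PR:

PERF_CONFIGURATION_DATA perfData = { 0 };
ULONG status;

perfData.Version = STOR_PERF_VERSION;
perfData.Size    = sizeof(PERF_CONFIGURATION_DATA);

/* Ask StorPort which performance optimisations it supports... */
status = StorPortInitializePerfOpts(DeviceExtension, TRUE, &perfData);
if (status == STOR_STATUS_SUCCESS) {
    /* ...then keep only the flags we want from the supported set.
     * Note STOR_PERF_DPC_REDIRECTION_CURRENT_CPU also requires
     * STOR_PERF_DPC_REDIRECTION to be set. */
    perfData.Flags &= (STOR_PERF_DPC_REDIRECTION |
                       STOR_PERF_DPC_REDIRECTION_CURRENT_CPU |
                       STOR_PERF_OPTIMIZE_FOR_COMPLETION_DURING_STARTIO);
    status = StorPortInitializePerfOpts(DeviceExtension, FALSE, &perfData);
}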

@vrozenfe
Collaborator

vrozenfe commented Sep 9, 2024

@benyamin-codez

Technically, enabling STOR_PERF_DPC_REDIRECTION_CURRENT_CPU and STOR_PERF_OPTIMIZE_FOR_COMPLETION_DURING_STARTIO works quite well for storport miniport drivers and can improve performance a bit more. I tested it a bit on the viostor (virtio-blk) driver a long time ago but didn't add the relevant code to the virtio-scsi driver. In any case, to make the STOR_PERF_OPTIMIZE_FOR_COMPLETION_DURING_STARTIO code work we need to have some per-cpu list to keep SRBs, extend the StartIo routine a bit, and make sure that we complete those SRBs in the per-cpu queues in case of bus/lun reset.

Regarding Dma32BitAddresses. On a real hardware HBA it should be a must to let the DMA engine work properly. I don't know if enabling or disabling this bit will have any implication for qemu virtio, never tried to turn it off, but I know for sure that some HW vendors run this code on top of FPGA-implemented virtio adaptors and this is the reason why we will keep it turned on :)

@frwbr

frwbr commented Sep 16, 2024

Hi @benyamin-codez!

Unfortunately I know very little about Windows internals, so can't comment on your observations w.r.t. the different types of spin locks.

So I've had trouble reproducing faults with diskspd.exe test files containing random data. The faults do occur with zeroed test files, which I find quite interesting...

This is quite interesting. I haven't run the reproducer with all-zero test files so far, only with random files. Do I understand correctly these faults are not fixed by the notifications patch?

To that end, can someone please confirm from a trace what they get during GetScsiConfig() (it follows right after PCI capability processing during driver initialisation, so near the top of the trace) for seg_max and max_sectors..? The relevant results of the following host-side commands would also likely be of some assistance: grep "" /sys/block/*/queue/max_segments grep "" /sys/block/*/queue/max_hw_sectors_kb

@JonKohler and @frwbr (and anyone else so inclined), are you able to help with this?

Here is the trace:

 --> VioScsiFindAdapter.
 --> VioScsiWmiInitialize.
 <-- VioScsiWmiInitialize.
 --> InitHW.
MessageControl.TableSize = 6
MessageControl.FunctionMask = 0
MessageControl.MSIXEnable = 1
 MessageTable = 1
 PBATable = 2049
 CapabilityID = 9, Next CapOffset = 84
 CapabilityID = 9, Next CapOffset = 70
 CapabilityID = 9, Next CapOffset = 60
 CapabilityID = 9, Next CapOffset = 50
 CapabilityID = 9, Next CapOffset = 40
 msix_enabled = 1
 <-- InitHW.
 --> GetScsiConfig.
 seg_max 254
 num_queues 4
 max_sectors 65535
 cmd_per_lun 128
 event_info_size 16
 sense_size 96
 cdb_size 32
 max_channel 0
 max_target 255
 max_lun 16383
 <-- GetScsiConfig.
 --> SetGuestFeatures.
 <-- SetGuestFeatures.
StorPortRegistryRead returned 0x0, Len = 0
 NumberOfPhysicalBreaks 513
 MaximumTransferLength 2097152
StorPortRegistryRead returned 0x0, Len = 0
StorPortRegistryRead returned 0x0, Len = 0
 Queues 4 CPUs 4
 breaks_number = 201  queue_depth = 100
 StorPortGetUncachedExtension uncachedExtensionVa = FFFF9C89A0F49000 allocation size = 84096
 Page-aligned area at FFFF9C89A0F49000, size = 49152
 Pool area at FFFF9C89A0F55000, size = 30848
 pmsg_affinity = 0000000000000000
 pmsg_affinity = FFFF9C89A0EEB510 Status = 0
 <-- VioScsiFindAdapter.

The VM has only one disk attached via SCSI, it is a raw disk and on the host side it is an LVM Logical Volume (/dev/dm-13) on a Volume Group on /dev/nvme4n1. See here for the {max_hw_sectors_kb,max_segments}.

Regarding the max_sectors=65535 -- could this simply be the default value set by QEMU here?

Hope this helps. Happy to provide more data if needed.

@benyamin-codez
Contributor Author

benyamin-codez commented Sep 16, 2024

Thanks @frwbr !
That's very helpful.

So I've had trouble reproducing faults with diskspd.exe test files containing random data. The faults do occur with zeroed test files, which I find quite interesting...

This is quite interesting. I haven't run the reproducer with all-zero test files so far, only with random files. Do I understand correctly these faults are not fixed by the notifications patch?

The patch works. When using an unpatched version, I found it more difficult to produce the fault with random data, so perhaps YMMV. Maybe try with test files prepared with diskspd.exe -w100 -Zr -d30 <testfile>. I used this with a 2GiB test file. If your file is larger you might need a longer run than 30s.

Regarding the max_sectors=65535 -- could this simply be the default value set by QEMU here?

It is, and seg_max is right above it (via seg_max_adjust), which you will see chooses between the legacy behaviour of 126 (128 - 2) and the more recent 254 (256 - 2), per QEMU virtio_scsi_get_config. The change notes point out that this relates to queue depth and thus the number of request queues, i.e. num_queues = num_cpus, and so MAX_CPU - VIRTIO_SCSI_REQUEST_QUEUE_0 (256 - 2).

This is mostly performance related, as setting the correct NOPB (NumberOfPhysicalBreaks) is important to performance, and correcting the shortcomings of this is done via custom QEMU mods or using the PhysicalBreaks registry setting. This is because NOPB should always be seg_max + 1 (EDIT: not seg_max - 1 as originally written), and MaximumTransferLength should be seg_max * PAGE_SIZE (EDIT: or NOPB * PAGE_SIZE).

The following table should be insightful:

sectors          size = Bytes    /2^10 = KiB   /2^10 = MiB    : KiB      /page = segments
65536 (0x10000h) *512 = 33554432 /1024 = 32768 /1024 = 32     : 33554432 /4096 = 8192 *** OUTSIDE BOUNDS ***
65528 (0x0FFF8h) *512 = 33550336 /1024 = 32764 /1024 = 31.996 : 33550336 /4096 = 8191 qemu hint (scsi_config.max_sectors) limit (last segment before 0xFFFF)
49152 (0x0C000h) *512 = 25165824 /1024 = 24576 /1024 = 24     : 25165824 /4096 = 6144
32768 (0x08000h) *512 = 16777216 /1024 = 16384 /1024 = 16     : 16777216 /4096 = 4096
16384 (0x04000h) *512 =  8388608 /1024 =  8192 /1024 = 8      :  8388608 /4096 = 2048
 8192 (0x02000h) *512 =  4194304 /1024 =  4096 /1024 = 4      :  4194304 /4096 = 1024
 4096 (0x01000h) *512 =  2097152 /1024 =  2048 /1024 = 2      :  2097152 /4096 =  512 *** MAX_PHYS_SEGMENTS LIMIT ***
 2048 (0x00800h) *512 =  1048576 /1024 =  1024 /1024 = 1      :  1048576 /4096 =  256
 2032 (0x007F0h) *512 =  1040384 /1024 =  1016                :  1040384 /4096 =  254 qemu scsi_config.seg_max limit (max_cpu - 2)
 1344 (0x00540h) *512 =   688128 /1024 =   672                :   688128 /4096 =  168
 1024 (0x00400h) *512 =   524288 /1024 =   512                :   524288 /4096 =  128  
  512 (0x00200h) *512 =   262144 /1024 =   256                :   262144 /4096 =   64
  480 (0x001E0h) *512 =   245760 /1024 =   240                :   245760 /4096 =   60
  448 (0x001C0h) *512 =   229376 /1024 =   224                :   229376 /4096 =   56                                                                    
  264 (0x00108h) *512 =   135168 /1024 =   132                :   135168 /4096 =   33
  256 (0x00100h) *512 =   131072 /1024 =   128                :   131072 /4096 =   32

In your NVMe scenario, you should have aligned values of seg_max = 32, NOPB = 33 (EDIT: not 31 or 32) and MaximumTransferLength = 128KiB.
EDIT: This cannot be seg_max = 33 or your MaximumTransferLength will be 132KiB and exceed your max_hw_sectors_kb of 128KiB.

So I've worked on this a bit and I have a new solution, in part it looks like this:

EDIT:
The NOPB calculations in this are effectively off-by-one (-1).
They should be one page higher, i.e. NOPB = seg_max+1 and not NOPB = seg_max or even NOPB = seg_max-1...
I will repost below once corrected.

ULONG              nopb_candidate[3] = { 0 };
ULONG              max_segments_b4_alignment;
//...
if (!adaptExt->dump_mode) {
    /* Allow user to override max_physical_breaks via reg key
        * [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters\Device]
        * "PhysicalBreaks"={dword value here}
        * 
        * *** This should be VIRTIO_MAX_SG - 1, approximated by the maximum number of memory pages (typ. 4KiB each) - 1 ***
        */
    if (VioScsiReadRegistryParameter(DeviceExtension, REGISTRY_MAX_PH_BREAKS, FIELD_OFFSET(ADAPTER_EXTENSION, max_physical_breaks))) {
        /* We +1 to convert to segments from NOPB */
        adaptExt->max_segments = adaptExt->max_physical_breaks + 1;
        
        #if !defined(RUN_UNCHECKED)
        RhelDbgPrint(TRACE_LEVEL_VERBOSE, " max_physical_breaks candidate was specified in the registry : %lu | max_segments : %lu \n", 
                        adaptExt->max_physical_breaks, adaptExt->max_segments);
        #endif
    } else {
        /* Grab the VirtIO reported maximum SEGMENTS value from the HBA and put it somewhere mutable */
        adaptExt->max_segments = adaptExt->scsi_config.seg_max;
        #if !defined(RUN_UNCHECKED)
        RhelDbgPrint(TRACE_LEVEL_VERBOSE, " max_physical_breaks candidate was NOT specified in the registry. We will attempt to derive the value...\n");
        #endif
    }

    /* Use our maximum SEGMENTS value OR use PHYS_SEGMENTS... */
    nopb_candidate[1] = (adaptExt->indirect) ? (adaptExt->max_segments - 1) : (PHYS_SEGMENTS - 1);
    #if !defined(RUN_UNCHECKED)
    RhelDbgPrint(TRACE_LEVEL_VERBOSE, " max_physical_breaks candidate derived from MAX SEGMENTS : %lu \n", nopb_candidate[1]);
    #endif

    /* Grab the VirtIO reported maximum SECTORS value from the HBA to start with */
    nopb_candidate[2] = (adaptExt->scsi_config.max_sectors * SECTOR_SIZE / PAGE_SIZE) - 1;
    #if !defined(RUN_UNCHECKED)
    RhelDbgPrint(TRACE_LEVEL_VERBOSE, " max_physical_breaks candidate derived from MAX SECTORS (QEMU/KVM hint) : %lu \n", nopb_candidate[2]);
    #endif
    
    /* Start with a comparison of equality */
    if (nopb_candidate[1] == nopb_candidate[2]) {
        nopb_candidate[0] = nopb_candidate[1];
        #if !defined(RUN_UNCHECKED)
        RhelDbgPrint(TRACE_LEVEL_VERBOSE, " nopb_candidate[0] : init - the candidates were the same value : %lu \n", nopb_candidate[0]);
        #endif
    } else if (nopb_candidate[2] > 0 && nopb_candidate[2] < (MAX_PHYS_SEGMENTS - 1)) {
        nopb_candidate[0] = nopb_candidate[2];
        #if !defined(RUN_UNCHECKED)
        RhelDbgPrint(TRACE_LEVEL_VERBOSE, " nopb_candidate[0] : init - the QEMU/KVM hint method (scsi_config.max_sectors) was used to select the candidate : %lu \n", nopb_candidate[0]);
        #endif
    } else {
        /* Take the smallest candidate */
        nopb_candidate[0] = min((nopb_candidate[1]), (nopb_candidate[2]));
        #if !defined(RUN_UNCHECKED)
        RhelDbgPrint(TRACE_LEVEL_VERBOSE, " nopb_candidate[0] : init - the smallest candidate was selected : %lu \n", nopb_candidate[0]);
        #endif
    }

    /* Check the value is within SG list bounds */
    nopb_candidate[0] = min(max(SCSI_MINIMUM_PHYSICAL_BREAKS, nopb_candidate[0]), (VIRTIO_MAX_SG - 1));
    #if !defined(RUN_UNCHECKED)
    RhelDbgPrint(TRACE_LEVEL_VERBOSE, " nopb_candidate[0] : within SG list bounds : %lu\n", nopb_candidate[0]);
    #endif

    /* Check the value is within physical bounds */
    nopb_candidate[0] = min(max(SCSI_MINIMUM_PHYSICAL_BREAKS, nopb_candidate[0]), (MAX_PHYS_SEGMENTS - 1));
    #if !defined(RUN_UNCHECKED)
    RhelDbgPrint(TRACE_LEVEL_VERBOSE, " nopb_candidate[0] : within physical bounds : %lu\n", nopb_candidate[0]);
    #endif        

    /* Update max_segments for all cases */
    adaptExt->max_segments = nopb_candidate[0] + 1;
    max_segments_b4_alignment = adaptExt->max_segments;

    /* Do byte alignment (using integer division) if necessary */
    if (max_segments_b4_alignment > (PAGE_SIZE / SECTOR_SIZE)) {
        adaptExt->max_physical_breaks = (((max_segments_b4_alignment / (PAGE_SIZE / SECTOR_SIZE)) * (PAGE_SIZE / SECTOR_SIZE)) - 1);
        if (max_segments_b4_alignment != (adaptExt->max_physical_breaks + 1)) {
            adaptExt->max_segments = adaptExt->max_physical_breaks + 1;
        }
        #if !defined(RUN_UNCHECKED)
        RhelDbgPrint(TRACE_LEVEL_VERBOSE, 
                        " Sector byte alignment : SECTOR_SIZE = %lu Bytes, PAGE_SIZE = %lu KiB, max_segments : original = %lu, aligned = %lu, max_physical_breaks : original = %lu, aligned = %lu \n", 
                        SECTOR_SIZE, (PAGE_SIZE / 1024), max_segments_b4_alignment, adaptExt->max_segments, nopb_candidate[0], adaptExt->max_physical_breaks);
        #endif
    }
}
ConfigInfo->NumberOfPhysicalBreaks = adaptExt->max_physical_breaks;
/* MaximumTransferLength should be calculated from segments not breaks... */
ConfigInfo->MaximumTransferLength  = adaptExt->max_segments * PAGE_SIZE;

#if !defined(RUN_UNCHECKED)
RhelDbgPrint(TRACE_LEVEL_INFORMATION, " NumberOfSegments : %lu | NumberOfPhysicalBreaks : %lu | MaximumTransferLength : %lu Bytes (%lu KiB) \n", 
                (ConfigInfo->NumberOfPhysicalBreaks + 1), 
                ConfigInfo->NumberOfPhysicalBreaks, 
                ConfigInfo->MaximumTransferLength, 
                (ConfigInfo->MaximumTransferLength / 1024));
#endif

This is only part of the solution, as there is a cascading off-by-one error in VioScsiFindAdapter that required fixing, but this solution will automatically set seg_max, NOPB and MaximumTransferLength to best performance settings based on the PhysicalBreaks registry entry.

The problem becomes that even if we enable per-HBA registry settings by modding VioScsiReadRegistryParameter to use a per-HBA scope, all HBAs will still look in the same registry key (Parameters\Device-1). This becomes a performance issue when you have a mix of physical disks with different requirements, e.g. look at my dev box:

The .../queue/max_segments:168, .../queue/max_hw_sectors_kb:32767 devices are SATA3 HDDs
The .../queue/max_segments:60, .../queue/max_hw_sectors_kb:128 devices are MegaRAID LUNs
The .../queue/max_segments:33, .../queue/max_hw_sectors_kb:128 devices are NVMe SSDs

I can only set NOPB for the lowest max_hw_sectors. However, the MegaRAID LUNs perform best with seg_max = 60, NOPB = 59 and MaximumTransferLength = 240KiB, and so on... Note: To be clear, on execution this particular example is changed to align to seg_max = 56, NOPB = 55 and MaximumTransferLength = 224KiB.
EDIT: This alignment is unnecessary when NOPB is set correctly.

I am presently working on this but it might be a fundamental architectural barrier. I have a few suspects, like they all share the same hba_id and port, but it's early days...

@vrozenfe, can you provide a hint as to how to get each HBA instance to act separately, with its own memory allocation and NOPB settings...?

I also have fixes for memory allocation, including for max_queues to cater for CPU hotplug. I added a registry setting to enable this so the default is to retain the existing behaviour (only allocate for present CPUs). Otherwise, it's just the previously mentioned notification fix in SendSRB, some performance tuning, some DPC fixes, lots of DPC refactoring and tons of instrumentation.

@vrozenfe
Collaborator

FYI there is a RH Jira issue reported internally https://issues.redhat.com/browse/RHEL-56722 and which I'm trying to fix soon.

@benyamin-codez
Contributor Author

benyamin-codez commented Sep 17, 2024

@vrozenfe

Per my comments in my last post above. Could the fact that using a per HBA scope in VioScsiReadRegistryParameter (StorPortRegistryRead) requires a reg key of Parameters\Device-1 be a hint for what is perhaps going on here?

Normally these would be Parameters\Device0, Parameters\Device1, and so on... We seem to have -1 instead...
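For reference, the scope used by StorPortRegistryRead() is selected by its Global argument; a hedged sketch of the per-HBA variant follows (not code from this PR), and it is this per-adapter path that produces the Parameters\Device-1 key name observed above:

ULONG   breaks = 0;
ULONG   len    = sizeof(breaks);
BOOLEAN found  = StorPortRegistryRead(DeviceExtension,
                                      (PUCHAR)"PhysicalBreaks",
                                      0,                  /* 0 = per-adapter Parameters\Device<N>, 1 = global Parameters\Device */
                                      MINIPORT_REG_DWORD,
                                      (PUCHAR)&breaks,
                                      &len);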

Thanks for the hints in your last post here. My observations:

In any case to make STOR_PERF_OPTIMIZE_FOR_COMPLETION_DURING_STARTIO code working we need to have some per-cpu list to keep SRBs, extend StartIo routine a bit, and make sure that we complete those SRBs in the per-cpu queues in case of bus/lun reset.

From what I could tell, this seems to be already taken care of by STOR_PERF_DPC_REDIRECTION_CURRENT_CPU and CompletePendingRequests (which I renamed to CompletePendingRequestsOnReset). It appeared to work well in my tests...

Regarding Dma32BitAddresses. On a real hardware HBA it should be a must to let the DMA engine work properly. I don't know if enabling or disabling this bit will have any implication for qemu virtio, never tried to turn it off, but I know for sure that some HW vendors run this code on top of FPGA-implemented virtio adaptors and this is the reason why we will keep it turned on :)

I found enabling this manually (StorPort does it automagically) resulted in a loss of performance.

I came up with this:

/* NOTE: When unset we get -5k IOPS +30us latency (best case)...! */
ConfigInfo->Master                       = TRUE; // +7k IOPS -50us latency
ConfigInfo->ScatterGather                = TRUE; // +12k IOPS -75us latency

/* Allow user to force use of Dma32BitAddresses via reg key
* [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters\Device]
* "UseDma32BitAddresses"={any dword value here - the value is ignored}
* 
* WARNING: Manually setting this increases latency and reduces IOPS
* 
* NOTE: StorPort normally sets this to TRUE anyway.
*       So let StorPort do it for maximum performance.
*       Only provided in the event StorPort does not enable the feature and it is required.
*/
if (VioScsiReadRegistryParameter(DeviceExtension, REGISTRY_USE_DMA32BITADDRESSES, FIELD_OFFSET(ADAPTER_EXTENSION, use_Dma32BitAddresses))) {
    #if !defined(RUN_UNCHECKED)
    RhelDbgPrint(TRACE_LEVEL_VERBOSE, " REGISTRY_USE_DMA32BITADDRESSES was FOUND in the registry. We will set ConfigInfo->Dma32BitAddresses to TRUE. \n");
    #endif
    ConfigInfo->Dma32BitAddresses = TRUE; // -15k IOPS +100us latency (worst case)
} else {
    #if !defined(RUN_UNCHECKED)
    RhelDbgPrint(TRACE_LEVEL_VERBOSE, " REGISTRY_USE_DMA32BITADDRESSES was NOT FOUND in the registry. We will let StorPort manage the Dma32BitAddresses setting. \n");
    #endif
}

/*
    * WARNING: Do not set this.
    *          All of these options increase latency and reduce IOPS:
    * 
ConfigInfo->DmaWidth                     = 0; //<-- This should be zero as initialised by StorPort
ConfigInfo->DmaWidth                     = Width32Bits; //<-- This should be zero as initialised by StorPort
ConfigInfo->DmaWidth                     = Width64Bits; //<-- This should be zero as initialised by StorPort
*/

The #if !defined(RUN_UNCHECKED) conditionals strip all tracing... As should be expected, running without tracing produces slightly better performance than even the driver included in the v208 package.

I also implemented some tracing with #if !defined(RUN_UNCHECKED) || defined(RUN_MIN_CHECKED), mainly for non-verbose tracing in initialisation routines. This tracing has no observable effect on performance.
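
To illustrate the gating, a trivial sketch of the two conditionals follows (the trace strings are made up):

/* Verbose per-I/O tracing: compiled out entirely in RUN_UNCHECKED builds. */
#if !defined(RUN_UNCHECKED)
RhelDbgPrint(TRACE_LEVEL_VERBOSE, " Detailed per-request trace... \n");
#endif

/* Minimal initialisation tracing: retained when RUN_MIN_CHECKED is defined, even in otherwise unchecked builds. */
#if !defined(RUN_UNCHECKED) || defined(RUN_MIN_CHECKED)
RhelDbgPrint(TRACE_LEVEL_INFORMATION, " Adapter initialisation summary... \n");
#endif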

I'm not sure what this would mean for packaging, but the performance gains of the above are not negligible. I've seen improvements with 4KiB random I/O of up to 50k IOPS and latency reduced by up to 250us.

In any case, my plan is to sort out the per HBA stuff (one way or the other), then I'll drop the whole lot in a branch of my fork to split into new branches and make PRs from. In the interim, I hope to split the notification fix in SendSRB and some perhaps consequential DPC fixes into two separate PRs - probably by week's end. Would that work for you?

@vrozenfe
Copy link
Collaborator

@benyamin-codez
If I understand your question about "get each HBA instance to act separately" correctly, then adjusting the transfer length according to the block limits reported by the underlying scsi device for the scsi pass-through case, as mentioned in https://issues.redhat.com/browse/RHEL-56722, can probably fix some of the problems.

Best regards,
Vadim.

@benyamin-codez
Copy link
Contributor Author

benyamin-codez commented Sep 17, 2024

@vrozenfe

FYI there is a RH Jira issue reported internally https://issues.redhat.com/browse/RHEL-56722 and which I'm trying to fix soon.

There's definitely an off-by-one cascading error in VioScsiFindAdapter.
The problem is likely because NOPB is set to 64 not 63...

EDIT: The problem is likely because NOPB is not set correctly.
Or by default set to 254 which is not aligned...
EDIT: Alignment is not an issue when NOPB is set correctly.

To not rely on the PhysicalBreaks registry entry will require QEMU changes.
Methinks once vioscsi is corrected, a PR for QEMU would be the next logical step.

By the way, I found the Parameters\Device-1 by issuing a StorPortRegistryWrite(). I then spent several hours trying to get REG_SZ working to drop HBA location information to map to Device Manager, so the correct PhysicalBreaks registry entries could be made more easily. I tried ASCII, ANSI and Unicode. None of them worked. REG_NONE and REG_MULTI_SZ worked unreliably with some garbage thrown in from conversions. In the end I went back to REG_DWORD and just dropped adaptExt->slot_number in it after InitHW() had returned.
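
For anyone wanting to reproduce the probe, a minimal sketch of the REG_DWORD write described above; the value name "SlotNumber" is hypothetical and the error handling is simplified:

/* Hedged sketch: dropping the HBA slot number into the per-adapter registry scope
 * via StorPortRegistryWrite(). Writing with Global = 0 is what surfaced the
 * Parameters\Device-1 key mentioned above.
 */
ULONG slot = adaptExt->slot_number;

if (!StorPortRegistryWrite(DeviceExtension,
                           (PUCHAR)"SlotNumber",
                           0,                    /* per-adapter (Device<N>) scope, not Global */
                           MINIPORT_REG_DWORD,
                           (PUCHAR)&slot,
                           sizeof(ULONG))) {
    #if !defined(RUN_UNCHECKED)
    RhelDbgPrint(TRACE_LEVEL_WARNING, " StorPortRegistryWrite() failed. \n");
    #endif
}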

@benyamin-codez
Copy link
Contributor Author

@vrozenfe

@benyamin-codez If I understand your question about "get each HBA instance to act separately" correctly, then adjusting the transfer length according to the block limits reported by the underlying scsi device for the scsi pass-through case, as mentioned in https://issues.redhat.com/browse/RHEL-56722, can probably fix some of the problems.

Apologies for the cross posting... 8^d

That's not really what I meant.

Per my last posts:

...how to get each HBA instance to act separately, with its own memory allocation and NOPB settings...?

Could the fact that using a per-HBA scope in VioScsiReadRegistryParameter (StorPortRegistryRead) requires a reg key of Parameters\Device-1 be a hint as to what is going on here?

Normally these would be Parameters\Device0, Parameters\Device1, and so on... We seem to have -1 instead...
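
For reference, here is a minimal sketch of the kind of per-HBA scoped read in question (the buffer handling is simplified and the existing PhysicalBreaks key is used purely as an example):

/* Hedged sketch of a per-adapter registry read. With Global = 0, StorPort resolves the path
 * under Parameters\Device<N>; the question above is why <N> appears to be -1 here.
 */
ULONG len = sizeof(ULONG);
PUCHAR buf = StorPortAllocateRegistryBuffer(DeviceExtension, &len);

if (buf != NULL) {
    if (StorPortRegistryRead(DeviceExtension,
                             (PUCHAR)"PhysicalBreaks",
                             0,                     /* per-adapter (Device<N>) scope, not Global */
                             MINIPORT_REG_DWORD,
                             buf,
                             &len) && (len > 0)) {
        ULONG value = *((PULONG)buf);
        #if !defined(RUN_UNCHECKED)
        RhelDbgPrint(TRACE_LEVEL_VERBOSE, " Per-HBA PhysicalBreaks value : %lu \n", value);
        #endif
    }
    StorPortFreeRegistryBuffer(DeviceExtension, buf);
}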

@benyamin-codez
Copy link
Contributor Author

benyamin-codez commented Sep 17, 2024

I thought it might help if I posted a preview of the tracing highlighting the NOPB section.

I should note this is new behaviour from my current WIP build.

Here is the first page:

nvme_log_first_page

Here is a page showing the relevant NOPB section:

nvme_log_nopb_32-31

Here is the next page:

nvme_log_next_page

Here is a page showing what happens without setting the PhysicalBreaks registry setting, i.e. the default NOPB of 247 (starting at 253):

nvme_log_nopb_default_253-247

Here is another showing NOPB set to 255 via PhysicalBreaks:

nvme_log_nopb_255

Here is another showing NOPB set to 511 via PhysicalBreaks:

nvme_log_nopb_511

Here are some quick CrystalDiskMark stats:

20240917_cdm_nvme_compare

I'll try to drop my WIP build in the next few hours, then some split PRs to follow.

EDIT: These NOPB values should all be two pages more. I will correct, retest and report below.

@benyamin-codez
Copy link
Contributor Author

benyamin-codez commented Sep 17, 2024

Here is the v208 driver for comparison.
Subsequent versions have the same behaviour.

cdm_checked_p_nopb_various_v208

EDIT:
The NOPB figures to the right of the >> are in hexadecimal, so 513 = 0x201 and 33 = 0x21, i.e. (correctly) off-by-one.
The queue depth here is 256 (0x100), i.e. MAX_CPU, which is the same in most versions.

Had to do a few fixes for single CPU configuration and found another DPC fix in my travels...
Will try to get my WIP build posted shortly...
8^d

@benyamin-codez
Copy link
Contributor Author

I dropped my WIP build if anyone wants to have a look:
benyamin-codez@a095c2d

Splits to follow...

@benyamin-codez
Copy link
Contributor Author

benyamin-codez commented Sep 18, 2024

@vrozenfe

If possible it will be nice to compare IOs/throughput numbers in relation to the number of queues, queue depth, IO size, CPU usage and IO latency for the old and new code. (However, if there is any problem with that I will try to test it by myself).

wrt my stat collector scripts (which should provide all you asked for above), where would be the best place to drop them in the tree?
Under vioscsi\stat-collector or similar, or maybe under ..\Tools\stat-collector...?

I also have some blkdebug tooling to perform block I/O error injection, per this and this (which the former refers to). Would you like me to drop that somewhere (else) too? It's pretty ugly, but I can clean it up.

@benyamin-codez
Copy link
Contributor Author

@vrozenfe

Considering further where a good spot in the tree might be, perhaps one of the following would work better:
vioscsi\tests
..\Tools\vioblk\tests

The script can then be called stat-collector.bat or similar, with room for future scripts like stat-aggregator.pl, etc.

This might also be a good location for any blkdebug or similar kit too.

@benyamin-codez benyamin-codez deleted the SendSRB-fix branch September 19, 2024 00:45
@benyamin-codez benyamin-codez restored the SendSRB-fix branch September 19, 2024 00:51
@benyamin-codez
Copy link
Contributor Author

@vrozenfe

My bad on the rename. Maybe kill those checks.... 8^d

@benyamin-codez
Copy link
Contributor Author

benyamin-codez commented Sep 19, 2024

A PR for the SendSRB notification fix has been raised: #1150

@benyamin-codez
Copy link
Contributor Author

A PR for two DPC fixes has been raised: #1152

@benyamin-codez
Copy link
Contributor Author

A PR for minor SendSRB refactoring has been raised: #1153

@benyamin-codez
Copy link
Contributor Author

benyamin-codez commented Sep 20, 2024

For anyone concerned that performance has diminished from earlier tests published in other places, please be aware that the results above were on a shared disk with other simultaneous I/O occurring. What is important is the comparison with the v208 (100.85.104.20800) driver.

Here is a comparison on a dedicated NVMe disk:

cdm_other_bus_segs_32_compare

It is also important to realise that my NOPB references are likely in error. I will explain this in the next post here, but essentially the NOPB is ignored by StorPort. When MaximumTransferLength is specified, StorPort relies on it instead, and I calculate it from max_segs / max_segments. In the v208 version of the driver this is SP_UNINITIALIZED_VALUE and will be calculated by StorPort to be the same value as I use (at least in this instance), based on the NOPB v208 reports to StorPort.

wrt unchecked vs. checked build comparisons, the v208 version of the driver has very little tracing during I/O operations. My checked build has about 50% more tracing at the start and much, much more during any I/O.

@benyamin-codez
Copy link
Contributor Author

benyamin-codez commented Sep 25, 2024

Regarding NumberOfPhysicalBreaks (NOPB), after checking in the debugger and tracing Srb->DataTransferLength under various I/O conditions I was able to confirm I was (incorrectly) off-by-one. This has been corrected in my latest build, which I am presently performance testing. I am using the v208 build and master + PRs #1150 and #1152 as comparisons. I can confirm the latter is NOT (incorrectly) off-by-one (using same techniques).

It's noteworthy that both the SCSI Port and StorPort documentation are incorrect wrt NOPB:

SCSI Port:

Specifies the maximum number of breaks between address ranges that a data buffer can have if the HBA supports scatter/gather. In other words, the number of scatter/gather lists minus one. By default, the value of this member is SP_UNINITIALIZED_VALUE, which indicates the HBA can support an unlimited number of physical discontinuities. If the port driver sets a value for this member, the miniport driver can adjust the value lower but no higher. If this member is SP_UNINITIALIZED_VALUE, the miniport driver must reset this member according to the HBA's scatter/gather capacity, with zero representing no scatter/gather support.

When in fact SCSI Port needs:
NOPB = MaximumSGList = ((MAX_BLOCK_SIZE)/PAGE_SIZE) + 1 and not NOPB = MaximumSGList - 1

StorPort:

Maximum number of physical pages the storage adapter can manage in a single transfer (in other words, the extent of its scatter/gather support). By default, the value of this member is 0x11. The miniport driver must reset this member according to the storage adapter's capability.

When in fact StorPort needs:
NOPB = max_pages_in_single_xfer + 1 and not NOPB = max_pages_in_single_xfer.

This thread is also telling: microsoft/Windows-driver-samples#6 (comment)

So whilst the documentation says one thing, the off-by-one behaviour required by StorPort is to maintain backwards compatibility.

Some more links (plus how this relates to MaximumTransferLength):
https://github.com/microsoft/Windows-driver-samples/blob/d65f6bb4041fa74f969e78ce2cfc230a71e824b4/storage/miniports/storahci/src/entrypts.c#L621
https://github.com/microsoft/Windows-driver-samples/blob/d65f6bb4041fa74f969e78ce2cfc230a71e824b4/storage/class/classpnp/src/xferpkt.c#L65-L73
https://github.com/microsoft/Windows-driver-samples/blob/d65f6bb4041fa74f969e78ce2cfc230a71e824b4/storage/class/classpnp/src/class.c#L3438-L3451
https://github.com/microsoft/Windows-driver-samples/blob/d65f6bb4041fa74f969e78ce2cfc230a71e824b4/storage/class/classpnp/src/retry.c#L735-L749

In my refactoring of this part of VioScsiFindAdapter(), I started by moving from "breaks" to "segments", which here is analogous to "pages". I think there is much value to this for many reasons, not least of which is the above legacy behaviour. I also moved preference from sectors to segments, which also holds value when adaptExt->scsi_config.max_sectors reports more than 0x1000 (4096), which is equivalent to 512 segments (MAX_PHYS_SEGMENTS). This is important because QEMU sets this value to 0xFFFF (65,535). I'm unsure as to the rationale for this, but note this is off-by-one for a common ../queue/max_segment_size value.
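
To make the arithmetic concrete, here is a small stand-alone sketch of the sectors-to-segments conversion mentioned above (SECTOR_SIZE and PAGE_SIZE are the usual 512 and 4096 byte assumptions):

#include <stdio.h>

/* Illustration only: how a QEMU-reported max_sectors translates into segments (pages),
 * before any clamping to PHYS_SEGMENTS_LIMIT takes place.
 */
int main(void)
{
    const unsigned long SECTOR_SIZE = 512;
    const unsigned long PAGE_SIZE = 4096;

    unsigned long max_sectors = 0xFFFF;                               /* as reported by QEMU */
    unsigned long segments = (max_sectors * SECTOR_SIZE) / PAGE_SIZE; /* 8191 */
    printf("max_sectors=0x%lX -> %lu segments\n", max_sectors, segments);

    max_sectors = 0x1000;                                             /* 4096 sectors = 2MiB */
    segments = (max_sectors * SECTOR_SIZE) / PAGE_SIZE;               /* 512 */
    printf("max_sectors=0x%lX -> %lu segments\n", max_sectors, segments);

    return 0;
}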

Whilst under the mistaken belief that NOPB should be what the documentation says, I worked around this by using ConfigInfo->AlignmentMask = FILE_BYTE_ALIGNMENT and performing a factor 8 remapping on the number of segments. Whilst I have reverted to using FILE_LONG_ALIGNMENT, I have retained the factor 8 remapping as an optional feature as it can improve performance in some scenarios.

I've made the following changes:

adaptExt->max_physical_breaks is now adaptExt->max_segments
MAX_PHYS_SEGMENTS is now PHYS_SEGMENTS_LIMIT

#define VIRTIO_MAX_SG                    (PHYS_SEGMENTS_LIMIT+1)
#define REGISTRY_MAX_PH_SEGMENTS         "MaxPhysicalSegments"
#define REGISTRY_FACTOR8_REMAP           "PerformFactor8Remap"
#define VIRTIO_SCSI_QUEUE_LAST           (MAX_CPU - VIRTIO_SCSI_REQUEST_QUEUE_0)

..and my WIP now uses:

if (!adaptExt->dump_mode) {
    /* Allow user to override max_segments via reg key
     * [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters\Device]
     * "MaxPhysicalSegments"={dword value here}
     * OR the legacy value:
     * "PhysicalBreaks"={dword value here}
     * 
     * ATTENTION: This should be the maximum number of memory pages (typ. 4KiB each) in a transfer
     *            Equivalent to any of the following:
     *            NumberOfPhysicalBreaks - 1 (NOPB includes known off-by-one error)
     *            VIRTIO_MAX_SG - 1
     *            MaximumSGList - 1 (SCSI Port legacy value)
     */
    if ((VioScsiReadRegistryParameter(DeviceExtension, REGISTRY_MAX_PH_BREAKS, FIELD_OFFSET(ADAPTER_EXTENSION, max_segments))) ||
        (VioScsiReadRegistryParameter(DeviceExtension, REGISTRY_MAX_PH_SEGMENTS, FIELD_OFFSET(ADAPTER_EXTENSION, max_segments)))) {            
        /* Grab the maximum SEGMENTS value from the registry */
        #if !defined(RUN_UNCHECKED)
        RhelDbgPrint(TRACE_LEVEL_VERBOSE, " max_segments candidate was specified in the registry : %lu \n", adaptExt->max_segments);
        #endif
    } else {
        /* Grab the VirtIO reported maximum SEGMENTS value from the HBA and put it somewhere mutable */
        adaptExt->max_segments = adaptExt->scsi_config.seg_max;
        #if !defined(RUN_UNCHECKED)
        RhelDbgPrint(TRACE_LEVEL_VERBOSE, " max_segments candidate was NOT specified in the registry. We will attempt to derive the value...\n");
        #endif
    }

    /* Use our maximum SEGMENTS value OR use PHYS_SEGMENTS... */
    if (adaptExt->indirect) {
        max_segs_candidate[1] = adaptExt->max_segments;
        #if !defined(RUN_UNCHECKED)
        RhelDbgPrint(TRACE_LEVEL_VERBOSE, " max_segments candidate derived from MAX SEGMENTS (as reported by QEMU/KVM) : %lu \n", max_segs_candidate[1]);
        #endif
    } else {
        max_segs_candidate[1] = PHYS_SEGMENTS;
        #if !defined(RUN_UNCHECKED)
        RhelDbgPrint(TRACE_LEVEL_VERBOSE, " max_segments candidate derived from PHYS_SEGMENTS : %lu \n", max_segs_candidate[1]);
        #endif
    }

    /* Grab the VirtIO reported maximum SECTORS value from the HBA to start with */
    max_segs_candidate[2] = (adaptExt->scsi_config.max_sectors * SECTOR_SIZE) / PAGE_SIZE;
    #if !defined(RUN_UNCHECKED)
    RhelDbgPrint(TRACE_LEVEL_VERBOSE, " max_segments candidate derived from MAX SECTORS (QEMU/KVM hint per VirtIO standard) : %lu \n", max_segs_candidate[2]);
    #endif
    
    /* Choose the best candidate... */
    if (max_segs_candidate[1] == max_segs_candidate[2]) {
        /* Start with a comparison of equality */
        max_segs_candidate[0] = max_segs_candidate[1];
        #if !defined(RUN_UNCHECKED)
        RhelDbgPrint(TRACE_LEVEL_VERBOSE, " max_segs_candidate[0] : init - the candidates were the same value : %lu \n", max_segs_candidate[0]);
        #endif
    } else if ((max_segs_candidate[2] > 0) && (max_segs_candidate[2] < PHYS_SEGMENTS_LIMIT)) {
        /* Use the value derived from the QEMU/KVM hint if it is below the PHYS_SEGMENTS_LIMIT */
        max_segs_candidate[0] = max_segs_candidate[2];
        #if !defined(RUN_UNCHECKED)
        RhelDbgPrint(TRACE_LEVEL_VERBOSE, " max_segs_candidate[0] : init - the QEMU/KVM hint method (scsi_config.max_sectors) was used to select the candidate : %lu \n", max_segs_candidate[0]);
        #endif
    } else {
        /* Take the smallest candidate */
        max_segs_candidate[0] = min((max_segs_candidate[1]), (max_segs_candidate[2]));
        #if !defined(RUN_UNCHECKED)
        RhelDbgPrint(TRACE_LEVEL_VERBOSE, " max_segs_candidate[0] : init - the smallest candidate was selected : %lu \n", max_segs_candidate[0]);
        #endif
    }

    /* Check the value is within SG list bounds */
    max_segs_candidate[0] = min(max(SCSI_MINIMUM_PHYSICAL_BREAKS, max_segs_candidate[0]), (VIRTIO_MAX_SG - 1));
    #if !defined(RUN_UNCHECKED)
    RhelDbgPrint(TRACE_LEVEL_VERBOSE, " max_segs_candidate[0] : within SG list bounds : %lu\n", max_segs_candidate[0]);
    #endif

    /* Check the value is within physical bounds */
    max_segs_candidate[0] = min(max(SCSI_MINIMUM_PHYSICAL_BREAKS, max_segs_candidate[0]), PHYS_SEGMENTS_LIMIT);
    #if !defined(RUN_UNCHECKED)
    RhelDbgPrint(TRACE_LEVEL_VERBOSE, " max_segs_candidate[0] : within physical bounds : %lu\n", max_segs_candidate[0]);
    #endif        

    /* Update max_segments for all cases */
    adaptExt->max_segments = max_segs_candidate[0];
    max_segments_b4_alignment = adaptExt->max_segments;

    /* Factor 8 Remapping - may increase performance in certain settings
     * 
     * Allow user to specify factor8_remap via reg key
     * [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters\Device]
     * "PerformFactor8Remap"={any dword value here - the value is ignored}
     * 
     * NOTE: This is generally only required when the NumberOfPhysicalBreaks (NOPB) value 
     *       reported to StorPort is incorrect (off-by-one) and the number of segments is 
     *       not a product of 8 (where max_segments modulo 8 would not be equal to zero).
     */       
    if (VioScsiReadRegistryParameter(DeviceExtension, REGISTRY_FACTOR8_REMAP, FIELD_OFFSET(ADAPTER_EXTENSION, factor8_remap))) {
        #if !defined(RUN_UNCHECKED)
        RhelDbgPrint(TRACE_LEVEL_VERBOSE, " REGISTRY_FACTOR8_REMAP was FOUND in the registry. We will remap the max_segments value. \n");
        #endif
        if (max_segments_b4_alignment > 8) {
            adaptExt->max_segments = ((max_segments_b4_alignment / 8) * 8);
        } else {
            // We should never hit this because adaptExt->max_segments >= SCSI_MINIMUM_PHYSICAL_BREAKS (16)
            adaptExt->max_segments = 8;
        }
        #if !defined(RUN_UNCHECKED)
        if (max_segments_b4_alignment != (adaptExt->max_segments)) {
            RhelDbgPrint(TRACE_LEVEL_INFORMATION, " The max_segments value was remapped using a factor of 8... \n");
            RhelDbgPrint(TRACE_LEVEL_VERBOSE, 
                            " Factor 8 Remapping : max_segments : original = %lu, aligned = %lu \n", max_segments_b4_alignment, adaptExt->max_segments);
        } else {
            RhelDbgPrint(TRACE_LEVEL_INFORMATION, " The max_segments value did not require factor 8 remapping. \n");
        }
        #endif
    } else {
        #if !defined(RUN_UNCHECKED)
        RhelDbgPrint(TRACE_LEVEL_VERBOSE, " REGISTRY_FACTOR8_REMAP was NOT FOUND in the registry. We will not remap the max_segments value. \n");
        #endif
    }
}
/* Here we enforce legacy off-by-one NumberOfPhysicalBreaks (NOPB) behaviour for StorPort.
 * This behaviour was retained in StorPort to maintain backwards compatibility.
 * This is analogous to the legacy MaximumSGList parameter in the SCSI Port driver.
 * Where:
 * MaximumSGList = ((MAX_BLOCK_SIZE)/PAGE_SIZE) + 1
 * The default x86/x64 values being:
 * MaximumSGList = (64KiB/4KiB) + 1 = 16 + 1 = 17 (0x11)
 * The MAX_BLOCK_SIZE limit is no longer 64KiB, but 2048KiB (2MiB):
 * NOPB or MaximumSGList = (2048KiB/4KiB) + 1 = 512 + 1 = 513 (0x201)
 * 
 * ATTENTION: The MS NOPB documentation for both the SCSI Port and StorPort drivers is incorrect.
 * 
 * As max_segments = MAX_BLOCK_SIZE/PAGE_SIZE we use:
 */
ConfigInfo->NumberOfPhysicalBreaks = adaptExt->max_segments + 1;

/* Here we use the efficient single step calculation for MaximumTransferLength
 * 
 * The alternative would be:
 * ConfigInfo->MaximumTransferLength = adaptExt->max_segments;
 * ConfigInfo->MaximumTransferLength <<= PAGE_SHIFT;
 * ...where #define PAGE_SHIFT 12L
 * 
 */
ConfigInfo->MaximumTransferLength = adaptExt->max_segments * PAGE_SIZE;

Now returning to working on Parameters\Device(d) registry reads...

@vrozenfe
Copy link
Collaborator

@vrozenfe

Per my comments in my last post above. Could the fact that using a per-HBA scope in VioScsiReadRegistryParameter (StorPortRegistryRead) requires a reg key of Parameters\Device-1 be a hint as to what is going on here?

Normally these would be Parameters\Device0, Parameters\Device1, and so on... We seem to have -1 instead...

Thanks for the hints in your last post here. My observations:

In any case, to make the STOR_PERF_OPTIMIZE_FOR_COMPLETION_DURING_STARTIO code work we need to have some per-CPU list to keep SRBs, extend the StartIo routine a bit, and make sure that we complete those SRBs in the per-CPU queues in case of a bus/lun reset.

From what I could tell, this seems to be already taken care of by STOR_PERF_DPC_REDIRECTION_CURRENT_CPU and CompletePendingRequests (which I renamed to CompletePendingRequestsOnReset). It appeared to work well in my tests...

Regarding Dma32BitAddresses. On a real hardware HBA it should be a must to let the DMA engine work properly. I don't know if enabling or disabling this bit will have any implication for qemu virtio, I never tried to turn it off, but I know for sure that some HW vendors run this code on top of FPGA-implemented virtio adaptors and this is the reason why we will keep it turned on :)

I found enabling this manually (StorPort does it automagically) resulted in a loss of performance.

I came up with this:

/* NOTE: When unset we get -5k IOPS +30us latency (best case)...! */
ConfigInfo->Master                       = TRUE; // +7k IOPS -50us latency
ConfigInfo->ScatterGather                = TRUE; // +12k IOPS -75us latency

/* Allow user to force use of Dma32BitAddresses via reg key
* [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters\Device]
* "UseDma32BitAddresses"={any dword value here - the value is ignored}
* 
* WARNING: Manually setting this increases latency and reduces IOPS
* 
* NOTE: StorPort normally sets this to TRUE anyway.
*       So let StorPort do it for maximum performance.
*       Only provided in the event StorPort does not enable the feature and it is required.
*/
if (VioScsiReadRegistryParameter(DeviceExtension, REGISTRY_USE_DMA32BITADDRESSES, FIELD_OFFSET(ADAPTER_EXTENSION, use_Dma32BitAddresses))) {
    #if !defined(RUN_UNCHECKED)
    RhelDbgPrint(TRACE_LEVEL_VERBOSE, " REGISTRY_USE_DMA32BITADDRESSES was FOUND in the registry. We will set ConfigInfo->Dma32BitAddresses to TRUE. \n");
    #endif
    ConfigInfo->Dma32BitAddresses = TRUE; // -15k IOPS +100us latency (worst case)
} else {
    #if !defined(RUN_UNCHECKED)
    RhelDbgPrint(TRACE_LEVEL_VERBOSE, " REGISTRY_USE_DMA32BITADDRESSES was NOT FOUND in the registry. We will let StorPort manage the Dma32BitAddresses setting. \n");
    #endif
}

/*
    * WARNING: Do not set this.
    *          All of these options increase latency and reduce IOPS:
    * 
ConfigInfo->DmaWidth                     = 0; //<-- This should be zero as initialised by StorPort
ConfigInfo->DmaWidth                     = Width32Bits; //<-- This should be zero as initialised by StorPort
ConfigInfo->DmaWidth                     = Width64Bits; //<-- This should be zero as initialised by StorPort
*/

@benyamin-codez
By any chance, can you try enabling v3 DMA

ConfigInfo->FeatureSupport |= STOR_ADAPTER_DMA_V3_PREFERRED;

and check if it makes any difference?

Thanks,
Vadim.

@vrozenfe
Copy link
Collaborator

@vrozenfe

Considering further where a good spot in the tree might be, perhaps one of the following would work better: vioscsi\tests ..\Tools\vioblk\tests

The script can then be called stat-collector.bat or similar, with room for future scripts like stat-aggregator.pl, etc.

Sounds good.
Thank you,
Vadim.

This might also be a good location for any blkdebug or similar kit too.

@benyamin-codez
Copy link
Contributor Author

@vrozenfe

Thanks for getting back to me regarding a place in the tree for scripts to live.
Per my query above, please confirm whether vioscsi\tests or Tools\vioblk\tests (or an alternative) would be best suited.

I see both @kostyanf14 and @YanVugenfirer are CODEOWNERS for the Tools directory. Perhaps they have a view?

@benyamin-codez
Copy link
Contributor Author

@vrozenfe

@benyamin-codez
By any chance, can you try enabling v3 DMA

ConfigInfo->FeatureSupport |= STOR_ADAPTER_DMA_V3_PREFERRED;

and check if it makes any difference?

Thanks,
Vadim.

Thanks for having a look Vadim. Can you please confirm what it is that you want me to check...?

I can say that setting STOR_ADAPTER_DMA_V3_PREFERRED:

  1. Has no effect on whether Parameters\Device(d) works. I still get Parameters\Device-1, and if I issue StorPortGetSystemPortNumber() it returns STOR_STATUS_INVALID_DEVICE_STATE. This happens anywhere in VioScsiFindAdapter(), VioScsiHwInitialize() and even VioScsiStartIo(), so I'm still hunting this down; &
  2. In the debugger, for ConfigInfo, I see FeatureSupport = 0x8 (which is the DMA v3 preference), Dma32BitAddresses = 0x1 (unchanged from init) and DmaWidth = 0x0 (Width8Bits and unchanged from init).

Did you want me to check whether the performance impact of setting Dma32BitAddresses = TRUE still exists with STOR_ADAPTER_DMA_V3_PREFERRED enabled? Or was it also DmaWidth, or even Master and ScatterGather as well?
I previously checked all of these because the StorPort doco states they should be left to StorPort to initialise.

Also thought it worthwhile to share what I now have for this part. Given the default is to use Dma32BitAddresses, I moved from forcing the use of Dma32BitAddresses to ONLY using Dma32BitAddresses, which also means Dma64BitAddresses = 0 should be set. That being said, I'm happy to confirm the performance impact of setting STOR_ADAPTER_DMA_V3_PREFERRED, just let me know precisely what it is that you would like me to test.

    /* NOTE: When unset we get -5k IOPS +30us latency (best case)...! */
    ConfigInfo->Master        = TRUE; // +7k IOPS -50us latency
    ConfigInfo->ScatterGather = TRUE; // +12k IOPS -75us latency

    /* Allow user to restrict driver to Dma32BitAddresses ONLY via reg key
    * [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters\Device]
    * "OnlyDma32BitAddresses"={any dword value here - the value is ignored}
    * 
    * ATTENTION: StorPort normally sets ConfigInfo->Dma32BitAddresses to TRUE.
    *            Recommended to let StorPort manage this setting for maximum performance.
    * 
    * WARNING:   Manually setting ConfigInfo->Dma32BitAddresses has a negative impact on 
    *            performance, increasing latency and reduces IOPS (up to -15k IOPS +100us latency)
    *            Setting this reg key disables Dma64BitAddresses.
    */
    if (VioScsiReadRegistryParameter(DeviceExtension, REGISTRY_ONLY_DMA32BITADDRESSES, FIELD_OFFSET(ADAPTER_EXTENSION, OnlyDma32BitAddresses))) {
        #if !defined(RUN_UNCHECKED)
        RhelDbgPrint(TRACE_LEVEL_VERBOSE, " REGISTRY_ONLY_DMA32BITADDRESSES was FOUND in the registry. We will set ConfigInfo->Dma32BitAddresses to TRUE. \n");
        RhelDbgPrint(TRACE_LEVEL_VERBOSE, " REGISTRY_ONLY_DMA32BITADDRESSES was FOUND in the registry. We will set ConfigInfo->Dma64BitAddresses to 0. \n");
        #endif
        ConfigInfo->Dma32BitAddresses = TRUE;
        ConfigInfo->Dma64BitAddresses = 0;
    } else {
        #if !defined(RUN_UNCHECKED)
        RhelDbgPrint(TRACE_LEVEL_VERBOSE, " REGISTRY_ONLY_DMA32BITADDRESSES was NOT FOUND in the registry. We will let StorPort manage the Dma32BitAddresses setting. \n");
        #endif
    }

    /* WARNING: Setting ConfigInfo->DmaWidth increases latency and reduces IOPS.
     *          StorPort sets this to Width8Bits (0x0) but ignores it when ConfigInfo->Master = TRUE
     * 
     * ConfigInfo->DmaWidth = Width32Bits; //<-- This should be zero (Width8Bits) as initialised by StorPort
    */

    #if !defined(RUN_UNCHECKED)
    RhelDbgPrint(TRACE_LEVEL_VERBOSE, " StorPort : ConfigInfo->Master : %s \n", (ConfigInfo->Master) ? "ON" : "OFF");
    RhelDbgPrint(TRACE_LEVEL_VERBOSE, " StorPort : ConfigInfo->ScatterGather : %s \n", (ConfigInfo->ScatterGather) ? "ON" : "OFF");
    RhelDbgPrint(TRACE_LEVEL_VERBOSE, " StorPort : ConfigInfo->Dma32BitAddresses : %s \n", (ConfigInfo->Dma32BitAddresses) ? "ON" : "OFF");
    RhelDbgPrint(TRACE_LEVEL_VERBOSE, " StorPort : ConfigInfo->DmaWidth : %lu \n", ConfigInfo->DmaWidth);
    #endif

#if defined(NTDDI_WIN10_VB) && (NTDDI_VERSION >= NTDDI_WIN10_VB)
    ConfigInfo->DmaAddressWidth = 64; // <-- Still yet to test on Win11 target...
    #if !defined(RUN_UNCHECKED)
    RhelDbgPrint(TRACE_LEVEL_VERBOSE, " ConfigInfo->DmaAddressWidth : %u \n", ConfigInfo->DmaAddressWidth);
    #endif
#else
    #if !defined(RUN_UNCHECKED)
    RhelDbgPrint(TRACE_LEVEL_VERBOSE, " ConfigInfo->DmaAddressWidth is NOT supported in this version of the driver. \n");
    #endif
#endif

    if (ConfigInfo->Dma64BitAddresses == SCSI_DMA64_SYSTEM_SUPPORTED) {
        ConfigInfo->Dma64BitAddresses = SCSI_DMA64_MINIPORT_FULL64BIT_SUPPORTED;
    }

@vrozenfe
Copy link
Collaborator

vrozenfe commented Oct 3, 2024

Did you want me to check whether the performance impact of setting Dma32BitAddresses = TRUE still exists with STOR_ADAPTER_DMA_V3_PREFERRED enabled?

Yes. Ideally we need to check whether toggling Dma32BitAddresses has an effect on both x86 and x64 platforms (and probably even on arm64) with and without STOR_ADAPTER_DMA_V3_PREFERRED enabled. But it is going to be quite a massive test case, so I will try to run it by myself.

All the best,
Vadim.

@YanVugenfirer
Copy link
Collaborator

@vrozenfe @JonKohler @sb-ntnx @frwbr @foxmox

....

Also, re the stat collector scripts, can someone suggest a suitable place in the tree, e.g. /vioscsi/stat-collector or /Tools/stat-collector or similar? I'll then do a PR for them too.

Let me know...

I think similar to https://github.com/virtio-win/kvm-guest-drivers-windows/tree/master/NetKVM/DebugTools - put the scripts under DebugTools in /vioscsi/

@benyamin-codez
Copy link
Contributor Author

Did you want me to check whether the performance impact of setting Dma32BitAddresses = TRUE still exists with STOR_ADAPTER_DMA_V3_PREFERRED enabled?

Yes. Ideally we need to check whether toggling Dma32BitAddresses has an effect on both x86 and x64 platforms (and probably even on arm64) with and without STOR_ADAPTER_DMA_V3_PREFERRED enabled. But it is going to be quite a massive test case, so I will try to run it by myself.

All the best, Vadim.

@vrozenfe

Thanks for the clarification, Vadim. That will indeed be quite the undertaking...

As for x64, I ran a few more tests on my Win10 target but with a couple more variables as follows:

Test was 4KiB random, 32 queues, 16 threads on 4KiB segments with 2GB test file (random fill).
Result is average latency in microseconds following a minimum of 15 tests (5*3) per configuration.

PERF = STOR_PERF_DPC_REDIRECTION_CURRENT_CPU + STOR_PERF_OPTIMIZE_FOR_COMPLETION_DURING_STARTIO
Dma32 = Dma32BitAddresses
Dma64 = Dma64BitAddresses
FULL64BIT = SCSI_DMA64_MINIPORT_FULL64BIT_SUPPORTED
Bo3 = Best of Three, i.e. the best result for this combination of IOTHREAD and PERF.
V3 = STOR_ADAPTER_DMA_V3_PREFERRED
An asterisk denotes the lowest value.
A dash denotes a worthy mention.

IOTHREAD | PERF | Dma32    | Dma64     | Bo3 | V3_ON | V3_OFF

YES      | YES  | FORCED   | FULL64BIT |     | 1868  | 1853*
YES      | YES  | STORPORT | FULL64BIT |  *  | 1844* | 1854
YES      | YES  | FORCED   | 0         |     | 1915* | 1919

NO       | YES  | FORCED   | FULL64BIT |  -  | 2022  | 1978*
NO       | YES  | STORPORT | FULL64BIT |  *  | 1977* | 2014
NO       | YES  | FORCED   | 0         |     | 2046  | 1998*

YES      | NO   | FORCED   | FULL64BIT |  -  | 1844  | 1843*
YES      | NO   | STORPORT | FULL64BIT |  *  | 1864  | 1830*
YES      | NO   | FORCED   | 0         |     | 1884* | 1887

NO       | NO   | FORCED   | FULL64BIT |     | 1987* | 1997
NO       | NO   | STORPORT | FULL64BIT |  *  | 1967* | 1984
NO       | NO   | FORCED   | 0         |     | 2067  | 2026*

Despite the effect of STOR_ADAPTER_DMA_V3_PREFERRED not being wholly consistent (and the differences are somewhat interesting), it would appear that letting StorPort manage the Dma32BitAddresses setting (which it sets to TRUE anyway) is still the best catch-all option. I note the latency values are higher than expected due to the additional tracing.

Best regards,
Ben

@lixianming19951001
Copy link

@benyamin-codez
Hi benyamin-codez,

I hope you're doing well! I wanted to ask if this PR can theoretically fix issue 907. Additionally, is there a plan for when this PR will be merged into master?

Thank you for your help!

Best regards,
Li

@benyamin-codez
Copy link
Contributor Author

@benyamin-codez Hi benyamin-codez,

I hope you're doing well! I wanted to ask if this PR can theoretically fix issue 907. Additionally, is there a plan for when this PR will be merged into master?

Thank you for your help!

Best regards, Li

@lixianming19951001
Hi Li,
Yes, I believe #907 can benefit from fixes mentioned here, and more specifically those mentioned in PRs #1150 and #1162.
To that end, please see PR #1174.
I will be closing this PR without merging as the fixes have since been merged in other PRs.
Best regards,
Ben

@benyamin-codez
Copy link
Contributor Author

A PR to reduce spinlock management complexity has been raised: #1175

@benyamin-codez
Copy link
Contributor Author

A PR to improve tracing capability has been raised: #1176
