
🐛 fix FPU's float-to-signed-integer corner case #943

Draft
wants to merge 6 commits into main

Conversation

stnolting
Owner

Fixing #942

stnolting added 2 commits July 6, 2024 05:55
fix "random" operand generator
@stnolting added the "bug" (Something isn't working as expected) and "HW" (Hardware-related) labels on Jul 6, 2024
@stnolting linked an issue on Jul 6, 2024 that may be closed by this pull request
@stnolting requested a review from mikaelsky on July 6, 2024 03:59
@mikaelsky
Collaborator

@stnolting if you are okay with being patient for a day I can check the corner case tomorrow (US West Coast time). 99% it's okay, there is just something nagging about why I added the extra 2 conditions :)
The good news is it's pretty fast to run a check, relatively speaking. It takes ~2 hours or so to run the full Zfinx compliance test suite with RVVI ISA compares, just to ensure that we aren't missing/breaking another corner case :)

@stnolting changed the title from "🐛 fix FPU's float-to-signed-integer croner case" to "🐛 fix FPU's float-to-signed-integer corner case" on Jul 6, 2024
@mikaelsky
Collaborator

@stnolting okay, so challenge number 1. When I try to replicate the bug locally I get the following failures. These look like corner-case rounding failures somewhere, and not necessarily in the hardware, btw.

xcelium> run
VIRTUAL_UART: Start Float to signed integer
VIRTUAL_UART: 63: opa = 0x3f000000, opb = 0x00000000 : ref[SW] = 0x00000001 vs. res[HW] = 0x00000000 [FAILED]
VIRTUAL_UART: 191: opa = 0xbf000000, opb = 0x00000000 : ref[SW] = 0xffffffff vs. res[HW] = 0x00000000 [FAILED]
VIRTUAL_UART: Float to signed integer finished!

0x3F and 0xBF -> +/- 2^-64 respectively. So the question is why the expectation is +/-1 as a 32-bit int vs 0. Without knowing the rounding mode it will be a bit tricky. For almost any rounding mode this should end up as +/-0 unless the integer used is a long/64-bit, as the "1" will be shifted so far away from the 0.5 LSB that it shouldn't result in an error. So unless there is a rule in IEEE that it has to be +/-1 unless the float is exactly 0 (which seems odd), there might be an issue here.
It would be good to check the SW float library used - I assume Berkeley softfloat?

I'll spend some time getting the compare stuff to go on my end for this test case. We should also diff our FPU versions, as mine seems to be 300-400 lines shorter than yours. We definitely aren't getting the same results. Maybe I missed a commit/messed up a commit?

@stnolting
Owner Author

Without knowing the rounding mode it will be a bit tricky.

The default rounding mode (at least for the hardware) is "round to nearest, ties to even".
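For illustration, a minimal host-side sketch (plain C99 <fenv.h>/<math.h>, independent of the NEORV32 test code) of what ties-to-even implies for a 0.5 tie:

#include <fenv.h>
#include <math.h>
#include <stdio.h>

/* Host-side sketch: under FE_TONEAREST (ties to even), a 0.5 tie rounds to
 * the even neighbour, i.e. towards 0 here, not away from it. */
int main(void)
{
  fesetround(FE_TONEAREST);
  printf("rintf( 0.5f) = %f\n", rintf(0.5f));  /* expected:  0.0 */
  printf("rintf(-0.5f) = %f\n", rintf(-0.5f)); /* expected: -0.0 */
  printf("rintf( 1.5f) = %f\n", rintf(1.5f));  /* expected:  2.0 (tie -> even) */
  return 0;
}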

I'll spend some time getting the compare stuff to go on my end for this test case. We should also diff our FPU versions, as mine seems to be 300-400 lines shorter than yours. We definitely aren't getting the same results. Maybe I missed a commit/messed up a commit?

That's great! Thank you very much! 👍

I'm on the latest version of the main branch (so no FPU hot-fix) and this is what I get when running a section of @Quma78's code:

Start Float to signed integer
63: opa = 0x3f000000, opb = 0x00000000 : ref[SW] = 0x00000000 vs. res[HW] = 0x00000000 [ok]
191: opa = 0xbf000000, opb = 0x00000000 : ref[SW] = 0x00000000 vs. res[HW] = 0x00000000 [ok]
208: opa = 0xd0000000, opb = 0x00000000 : ref[SW] = 0x80000000 vs. res[HW] = 0x00000000 [FAILED]
Float to signed integer finished!

With the fix from this branch the last test case (208) also passes.

@mikaelsky
Collaborator

Interesting, let me check my quick hack test setup. It's a bit worrisome that the softfloat libraries we are using disagree about what the result should be in the test case, whereas the HW agrees across test cases. This might hint at a different problem.

I'll get the latest branch on my local machine and do some diffs today to see what's what. It could be something as simple as the rounding mode being different between the two cases. I agree the "default mode" should be round to nearest even (normal rounding), but because of RISC-V ISA "fun" we need to look at the GCC assembly to figure out what it actually did, as it either sets the mode directly in the opcode, changes the default rounding mode, or uses the out-of-reset rounding mode. Did I mention I hate RISC-V float rounding modes ;)

Anywho, I believe the reason for the exception I added was to deal with a specific rounding mode that likely isn't tested in the simple test you have set up. Hence why I'm a tad cautious :)

A note here though: 0xD0 (or 208) is -1.0 * 2^(160-128), i.e. -1.0 * 2^32, which is the sign-bit corner case. Which is indeed what that if statement is trying to catch. The exception handling (my shoddy memory here) might be there to deal with the case where we are doing a convert to unsigned int. Thinking we should expand the test case to cover both float-to-signed and float-to-unsigned int, and trigger all the rounding modes as well, to ensure we aren't missing something.
I did check the compliance tests and I believe the 0xD0 exponent is indeed in there, so curious as to why it's suddenly popping up.
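A rough sketch of what such a sweep could look like on the software side (standard C99; a hypothetical helper, and it assumes the C library actually honours fesetround() for lrintf(), which is part of what is being questioned here):

#include <fenv.h>
#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Sweep one operand through the four C99 rounding modes and print the
 * signed/unsigned software reference conversions for comparison. */
void sweep_rounding_modes(float opa)
{
  const int   modes[] = { FE_TONEAREST, FE_TOWARDZERO, FE_DOWNWARD, FE_UPWARD };
  const char *names[] = { "RNE", "RTZ", "RDN", "RUP" };

  for (unsigned m = 0; m < 4; m++) {
    fesetround(modes[m]);
    int32_t  s = (int32_t)lrintf(opa);
    uint32_t u = (uint32_t)lrintf(opa);
    printf("%s: signed = 0x%08x, unsigned = 0x%08x\n", names[m], (unsigned)s, (unsigned)u);
  }
}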

@mikaelsky
Collaborator

After setting up my new Threadripper Pro home workstation with Rocky 9 and building everything from scratch - fun :) - I can now run the test natively and can see the errors from 0xd0 - 0xFF.

As a side effect I now have a Rocky 9/RHEL 9 build of the RISC-V toolchain. If it makes sense I can upload it to the pre-built section.

I looked at the intrinsic file to see if it was an issue with rint as it returns a float. Tried rintf and lrintf with no change in behavior.

I checked that the rounding mode is set the same (round-to-nearest in both GCC and HW; the default/reset value for both).
In the intrinsics, RM is set to 0, which forces round-to-nearest. The alternative is setting it to DYN and relying on the fcsr.
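For context, the rm field encodings from the RISC-V spec, written out as a small C enum (a summary from memory of the spec table, worth double-checking against the F/Zfinx chapter):

/* RISC-V rounding-mode encodings (instruction rm field / fcsr.frm bits).
 * rm = 0 hard-codes round-to-nearest-even in the opcode; rm = 7 (DYN)
 * defers to whatever frm currently holds. */
typedef enum {
  RM_RNE = 0, /* round to nearest, ties to even          */
  RM_RTZ = 1, /* round towards zero                      */
  RM_RDN = 2, /* round down (towards -infinity)          */
  RM_RUP = 3, /* round up (towards +infinity)            */
  RM_RMM = 4, /* round to nearest, ties to max magnitude */
  RM_DYN = 7  /* dynamic: use fcsr.frm                   */
} riscv_rm_t;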

I then changed the conversion to use the default C cast operation:

int32_t fp_test;
...
fp_test = (int32_t)res_hw.float_value;
...
err_cnt += verify_result(i, opa.binary_value, 0, (uint32_t)fp_test, res_hw.binary_value);

This results in a different failure mode:

#3: FCVT.W.S (float to signed integer)...
200: opa = 0xc8000000, opb = 0x00000000 : ref[SW] = 0x80000000 vs. res[HW] = 0xfffe0000 [FAILED]
201: opa = 0xc9000000, opb = 0x00000000 : ref[SW] = 0x80000000 vs. res[HW] = 0xfff80000 [FAILED]
202: opa = 0xca000000, opb = 0x00000000 : ref[SW] = 0x80000000 vs. res[HW] = 0xffe00000 [FAILED]
203: opa = 0xcb000000, opb = 0x00000000 : ref[SW] = 0x80000000 vs. res[HW] = 0xff800000 [FAILED]
204: opa = 0xcc000000, opb = 0x00000000 : ref[SW] = 0x80000000 vs. res[HW] = 0xfe000000 [FAILED]
205: opa = 0xcd000000, opb = 0x00000000 : ref[SW] = 0x80000000 vs. res[HW] = 0xf8000000 [FAILED]
206: opa = 0xce000000, opb = 0x00000000 : ref[SW] = 0x80000000 vs. res[HW] = 0xe0000000 [FAILED]
207: opa = 0xcf000000, opb = 0x00000000 : ref[SW] = 0x00000000 vs. res[HW] = 0x80000000 [FAILED]

Note: for this failure mode, as soon as we hit the 0xD0 exponent we get a correct, matching conversion. The hint here is that something is off somewhere for sure. I'm not 100% convinced it's the hardware yet, as I can make it pass by using standard C float casting.
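Since interpreting the table depends on what those bit patterns encode, a small host-side decode helper may help (plain C, not part of the test program; note that casting an out-of-range float to int32_t is undefined behaviour in ISO C, which alone could explain SW/HW disagreement here):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Reinterpret a raw 32-bit pattern as a float (e.g. 0xc8000000 -> -131072.0). */
static float bits_to_float(uint32_t bits)
{
  float f;
  memcpy(&f, &bits, sizeof(f)); /* type-pun without aliasing violations */
  return f;
}

int main(void)
{
  const uint32_t samples[] = { 0xc8000000u, 0xcf000000u, 0xd0000000u };
  for (unsigned i = 0; i < 3; i++) {
    printf("0x%08x = %g\n", samples[i], (double)bits_to_float(samples[i]));
  }
  return 0;
}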

Anywho, next up is to see why my work version doesn't fail. The work version uses the Berkeley SoftFloat library for float emulation vs math.h. The Berkeley SoftFloat library is the same one that float.h uses to fill in the gaps in the Zfinx support, e.g. float divide and doubles.

@mikaelsky
Collaborator

After a bit more research on math.h we find that the rounding mode of float to int conversions is not round to nearest:
https://en.cppreference.com/w/c/numeric/fenv/FE_round

(screenshot of the linked cppreference rounding-mode page)

Not sure how to fully interpret this, but it seems like we should be setting the rounding mode to round towards zero if we are using rint for float conversions? Or does it only affect casting? It at least explains why the cast-operator experiment differs from rint.

For rint, reading the documentation:
https://en.cppreference.com/w/c/numeric/math/rint

(screenshot of the linked rint documentation)

From this it seems rint uses the "current rounding mode".
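A quick host-side illustration of the two behaviours side by side (plain C; the integer cast always truncates towards zero, while rintf follows whatever fesetround() selected):

#include <fenv.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
  float x = 2.5f;

  fesetround(FE_TONEAREST);
  printf("(int)x = %d, rintf(x) = %f\n", (int)x, rintf(x)); /* 2 and 2.0 (tie -> even) */

  fesetround(FE_UPWARD);
  printf("(int)x = %d, rintf(x) = %f\n", (int)x, rintf(x)); /* still 2, but now 3.0 */

  return 0;
}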

Now reading the manual further we get:
(screenshot of the return-value section of the documentation)

As we are converting outside the range of an integer, the return value would be "an implementation-defined value".
I'm guessing this is 0x8000_0000 and not 0x0000_0000 as our current hardware returns.
We should update the test to check whether FE_INVALID is raised and whether GCC's math.h returns the value expected by the RISC-V F/Zfinx extension.
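A sketch of what such a check could look like with standard <fenv.h> (assuming the newlib build for the target actually maps these macros onto the fflags bits, which would need to be confirmed):

#include <fenv.h>
#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Convert, then test whether the invalid-operation flag was raised
 * (expected for NaN or out-of-range inputs). */
int32_t convert_and_check(float x)
{
  feclearexcept(FE_ALL_EXCEPT);
  long res = lrintf(x);
  if (fetestexcept(FE_INVALID)) {
    printf("FE_INVALID raised for input %g\n", (double)x);
  }
  return (int32_t)res;
}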

I did a small dive into rint to see what's going on there... it's mostly straight-up assembly, yeah :| There is a double rint(double x) C example where the return value, if the exponent is >51 (remember, double), is just x straight up.
https://www.netlib.org/fdlibm/s_rint.c

I'll dive a bit more. Early indications are that we might not actually be looking at a "bug" but a "feature" in math.h. I still need to run the Imperas setup (which uses softfloat) against the bug to confirm whether it is a bug bug :)

@stnolting
Owner Author

As a side effect I now have a Rocky 9/RHEL 9 build of the RISC-V toolchain. If it makes sense I can upload it to the pre-built section.

I am not sure if I want to continue with my pre-built toolchains. The X-pack project provides excellent toolchains - so why reinvent the wheel? ;)

In the intrinsics, RM is set to 0, which forces round-to-nearest. The alternative is setting it to DYN and relying on the fcsr.

Good point. I think we should adjust the intrinsics and use the floating point CSR for configuration. This is so much more flexible.

After a bit more research on math.h [...]

Holy cricket! Thanks for all your work!
Seems like we are entering floating point hell here... 🙈

We should update the test to check whether FE_INVALID is raised and whether GCC's math.h returns the value expected by the RISC-V F/Zfinx extension.

👍

I'll dive a bit more. Early indications are that we might not actually be looking at a "bug" but a "feature" in math.h. I still need to run the Imperas setup (which uses softfloat) against the bug to confirm whether it is a bug bug :)

Thanks again! ❤️

@mikaelsky
Collaborator

I am not sure if I want to continue with my pre-built toolchains. The X-pack project provides excellent toolchains - so why reinvent the wheel? ;)

And here I thought reinventing the wheel was what engineering was all about ;) Makes sense. I forwarded the x-pack link to my firmware SDK person; it might be that we just pivot to that vs. an internal build.

After a bit more research on math.h [...]

Holy cricket! Thanks for all your work! Seems like we are entering floating point hell here... 🙈

Welcome to FPU :) I knew what I was signing up for, so not too surprised but still learning. I've yet to pull in favors from my friends who are building SHARC and Tensilica DSPs, but I might at some point :)
I will continue to do some more deep diving into the topic. Right now I'm suspecting that the RISC-V math.h library was copy-pasted from the x86 library and that might be the root cause. I'm still not closing the hardware-bug avenue; I will need to re-read the RISC-V spec and the IEEE spec once again.
This is a good bug find though. Even if it turns out not to be a HW bug, the fact that we are deep diving this much means we learned a lot on the debugging journey :)

@stnolting
Owner Author

Welcome to FPU :) I knew what I was signing up for, so not too surprised but still learning.

😅

This is a good bug find though. Even if it turns out not to be a HW bug, the fact that we are deep diving this much means we learned a lot on the debugging journey :)

I can only agree.

Btw, I have modified the Zfinx intrinsics to use the dynamic rounding mode (actual rounding mode defined by fcsr). This makes it easier to play around with all the different rounding modes.
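With the intrinsics set to DYN, the active mode can then be selected at run time by writing the standard frm CSR; a small sketch (standard RISC-V CSR names, requires Zicsr; the NEORV32 runtime may already offer an equivalent helper):

#include <stdint.h>

/* Write the dynamic rounding mode (fcsr.frm, bits [7:5] of fcsr):
 * 0 = RNE, 1 = RTZ, 2 = RDN, 3 = RUP, 4 = RMM. */
static inline void set_frm(uint32_t rm)
{
  __asm__ volatile ("csrw frm, %0" : : "r" (rm & 7u));
}

/* Read back the currently active rounding mode. */
static inline uint32_t get_frm(void)
{
  uint32_t rm;
  __asm__ volatile ("csrr %0, frm" : "=r" (rm));
  return rm;
}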

stnolting added 2 commits July 8, 2024 21:32
rounding mode defined by fcsr "rm" bits
@mikaelsky
Collaborator

So I finally got around to running the test case locally with RVVI comparison against OVPsim from Imperas.

A few details about my setup: I'm likely calling gcc with the Zfinx extension enabled. This probably means that rint gets replaced with Zfinx instructions for everything but corner-case handling.

xcelium> run
UVM_INFO @ 3564.00 ns: reporter [riscv_tb] Sending reset to reference model
UVM_INFO @ 3564.00 ns: reporter [riscv_tb] Calling rvviTracer.riscv_init
VIRTUAL_UART: Start Float to signed integer
VIRTUAL_UART: 63: opa = 0x3f000000, opb = 0x00000000 : ref[SW] = 0x00000001 vs. res[HW] = 0x00000000 [FAILED]
VIRTUAL_UART: 191: opa = 0xbf000000, opb = 0x00000000 : ref[SW] = 0xffffffff vs. res[HW] = 0x00000000 [FAILED]
VIRTUAL_UART: Float to signed integer finished! 

Notice the 2 failures reported seem to be corner case related. But the Imperas compare reports:

Info (IDV) ---------------------------------------------------
Info (IDV) ImperasDV VERIFICATION REPORT
Info (IDV)   Instruction retires   : 44,625
Info (IDV)   Traps                 : 0
Info (IDV)   Interrupt events      : 0
Info (IDV)   Ending cycle count    : 205,741
Info (IDV)                               Sets / Compares
Info (IDV)     PC                  :   44,625 / 44,625
Info (IDV)     Instruction         :   44,625 / 44,625
Info (IDV)     GPR                 :   33,362 / 33,329
Info (IDV)     CSR                 :      531 / 524
Info (IDV)     FPR                 :        0 / 0 (disabled)
Info (IDV)     VR                  :        0 / 0 (disabled)
Info (IDV)  
Info (IDV)   Total compares        : 123,103
Info (IDV)   Mismatches            : 0
Info (IDV) ---------------------------------------------------

ImperasDVasync finished: Tue Jul  9 07:58:01 2024

From the report you can see the number of compares and that the mismatch count is 0. So the Imperas reference implementation "ovpsim" (https://github.com/riscv-admin/riscv-ovpsim), which uses the Berkeley SoftFloat reference library (the same as Sail and Spike), reports no failures.
Note: the sets vs. compares counts can seem off; that is caused by some instructions updating more than 1 GPR and CSR for a given instruction. FPR is disabled as we don't have the F extension. VR is disabled as we don't have the vector extension.

Next step is to dig into why we see the failures vs the math.h library from RISCV. Also I should probably try a run where I'm not enabling the zfinx extension in gcc.

@stnolting
Owner Author

A few details about my setup: I'm likely calling gcc with the Zfinx extension enabled. This probably means that rint gets replaced with Zfinx instructions for everything but corner-case handling.

But doesn't that mean you are comparing Zfinx (wrapped in some math library) against Zfinx (the intrinsics)? 🤔

Which hardware are you using? The "default" one (aka from the main branch) or the "fix" from this PR?

@mikaelsky
Collaborator

But doesn't that mean you are comparing Zfinx (wrapped in some math library) against Zfinx (the intrinsics)? 🤔

With gcc compiling with Zfinx enabled (HW float, if you will), the compiler will likely utilize the fcvt.w.s instruction and not the emulated function in math.h. Basically, the "SW" vs "HW" compare would always match, as it's calling the same RISC-V instruction under the hood. The fact that math.h likely has an exception handler is why we see a trigger that causes the 2 failures.
Remember, rint is a C function that inherently uses float types. So gcc will of course use any float hardware it can find :)
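One way to take math.h out of the equation entirely would be to issue the instruction directly; a rough inline-assembly sketch (assumes Zfinx, so the source operand lives in an integer register, and a toolchain built with the extension enabled):

#include <stdint.h>

/* Hypothetical direct FCVT.W.S with the rounding mode fixed in the opcode
 * (rtz here); 'bits' holds the raw IEEE-754 single-precision encoding. */
static inline int32_t fcvt_w_s_rtz(uint32_t bits)
{
  int32_t res;
  __asm__ volatile ("fcvt.w.s %0, %1, rtz" : "=r" (res) : "r" (bits));
  return res;
}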

Which hardware are you using? The "default" one (aka from the main branch) or the "fix" from this PR?
I'm using our internal out-of-date branch. But it has the current FPU, not the fixed FPU from this PR.
The exercise here is: we run the test with the non-fixed FPU and, instead of comparing against the math.h implementation, we compare against an actual RISC-V reference ISA model. The RISC-V reference model is supposed to handle the FPU correctly per the RISC-V spec.
The reasoning here is: if we behave correctly per the RISC-V spec, then it's likely that the error we are seeing in the rint function is an error in the math.h "implementation-specific" results - which I suspect are taken from x86 - and not a bug, as we are adhering to the intent of the RISC-V specification.

The next step is to deep dive into the RISC-V F/Zfinx extension spec and the IEEE 754 float spec to ensure that we are indeed behaving correctly and that this is rooted in a mistake in math.h based on yet another ancient x86 float "feature".
If the spec says rint from math.h is right, then that is a bug in the RISC-V reference ISA. Which would also be bad.

for now...
@mikaelsky
Collaborator

So these are the RISC-V spec notes on float-to-signed-integer conversion:
(screenshot of the RISC-V spec table for floating-point-to-integer conversions)

From this, for FCVT.W.S, if the number is out-of-range for a negative input - which I believe is the case here - then the resulting integer should be -2^31, i.e. 0x8000_0000.
Which seems to be happening in my local FPU; my compares don't fail for this corner.

So the next step is for me to do a diff. The errors I have seem to be rounding-related vs. the SW implementation. I need to dig some more, but it might be a mistake in the C library, as I'm comparing against the Berkeley reference implementation.
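For reference, the spec table boils down to roughly this software model of FCVT.W.S (a sketch of the spec semantics for out-of-range and NaN inputs, not the project's test code; the NV flag is not modelled):

#include <fenv.h>
#include <math.h>
#include <stdint.h>

/* Out-of-range positive / +inf / NaN -> 0x7fffffff,
 * out-of-range negative / -inf       -> 0x80000000,
 * otherwise: round per the selected mode and convert. */
int32_t fcvt_w_s_model(float x, int c99_rounding_mode)
{
  fesetround(c99_rounding_mode);
  if (isnan(x))               return INT32_MAX;
  double r = rint((double)x); /* every float is exactly representable as double */
  if (r > (double)INT32_MAX)  return INT32_MAX;
  if (r < (double)INT32_MIN)  return INT32_MIN;
  return (int32_t)r;
}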

@stnolting did you touch the FPU after my checkin? no blame :) just trying to understand why my implementation is something like 400 lines shorter :)

@stnolting
Owner Author

From this, for FCVT.W.S, if the number is out-of-range for a negative input - which I believe is the case here - then the resulting integer should be -2^31, i.e. 0x8000_0000.

But this is not happening when using the default FPU from the main branch, right?! 🤔

did you touch the FPU after my checkin? no blame :) just trying to understand why my implementation is something like 400 lines shorter :)

Wait, your local version is shorter? It should be the other way around, I think. I just trimmed some trailing spaces and reworked the header. No real RTL code edits from my side.

@stnolting added the "stale" (No updates for a long time) label on Aug 29, 2024
@mikaelsky
Collaborator

@stnolting I apologize for the very long delay here. I got stuck in some hairy support case combined with a 2-week vacation back to the homeland in Europe :)

So I did a diff between the FPU at the tip of trunk and my local one. Besides a number of stylistic differences, the one that stands out is:
(screenshot of the diff showing the mantissa constant)

Where I use hex for the mantissa encoding and you have changed it to a binary value... well, that, and I sim with Xcelium and not GHDL... but that shouldn't matter, I would think.

The rest of the diffs are stylistic and license-text shortening, except for these sections, which are tied to the CSR addressing.

(screenshots of the diffs in the CSR-addressing sections)

@stnolting
Owner Author

I apologize for the very long delay here. I got stuck in some hairy support case combined with a 2-week vacation back to the homeland in Europe :)

Oh, no worries! The "stale" label was just a reminder for me and by no means a subtle trigger. 😅

Where I use hex for the mantissa encoding and you have changed it to a binary value... well, that, and I sim with Xcelium and not GHDL... but that shouldn't matter, I would think.

I have no idea why I have changed that... Anyway, it should not matter at all.

The rest of the diffs are stylistic and license-text shortening, except for these sections, which are tied to the CSR addressing.

I think I just moved some of the "coarse" CSR addressing logic out of the FPU. There should be no functional difference.

@mikaelsky
Collaborator

I hadn't even noticed the stale flag :) This bug has been on my mind for a while now. It's especially frustrating that I cannot recreate it in our ASIC sim environment.

I agree the addressing doesn't matter :) it was more a "this is the only major difference I see". Basically, the diff tells me the 2 FPUs are essentially the same.

As for the hex vs. non-hex encoding: the only difference I could see would be (and note, I could have counted wrong) that the hex version has 24 "0"s and the non-hex version only has 23 "0"s. I'm not sure I remember the width of the mantissa at this stage; I thought it was 24 bits as we haven't post-shifted anything yet.
Granted, this should have resulted in a compile error, as the left- and right-hand sides would have different widths. This isn't Verilog, where different left- and right-hand-side widths are allowed.

Now, my base core is still way old. I think my last sync point was sometime in February, beyond bug-fixes. As this is an FPU-specific issue, this shouldn't matter.
The good news is we haven't found issues in quite a while :) I do need to update the RVVI setup as we've added full support for IRQs and started "punishing" the core with randomized IRQs. This, btw, is super hard to make work, but the Imperas ISS has some fancy tricks up its sleeve to allow this to happen.

Next steps will be to try and recreate the environment with GHDL vs. Xcelium and do a trace dump of the FPU to compare the internal state in the two cases. If all else fails it could be a tool issue. Super unlikely, though.

@stnolting
Owner Author

I agree the addressing doesn't matter :) it was more a "this is the only major difference I see". Basically, the diff tells me the 2 FPUs are essentially the same.

So you are comparing your local FPU with the one from the main branch, right?

As for the hex vs non-hex encoding.

Please note that the "fix" from #943 no longer contains this mantissa = 0 check (as part of the supposed "bug fix" triggered by #942).

Now, my base core is still way old. I think my last sync point was sometime in February, beyond bug-fixes. As this is an FPU-specific issue, this shouldn't matter.

👍

The good news is we haven't found issues in quite a while :)

That sounds great! So we can discard #943, as this seems to be "just" a toolchain/library issue? 🤔

I do need to update the RVVI setup as we've added full support for IRQs and started "punishing" the core with randomized IRQs. This, btw, is super hard to make work, but the Imperas ISS has some fancy tricks up its sleeve to allow this to happen.

I'm very curious about the findings from this! I've tried feeding the core with permanent interrupt requests before (some of it is still available in the default processor check program), but I've never taken it to the extreme.
