Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sse4.2: added the implementation for mm_cmpestra #295

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

masterchef2209
Copy link
Member

@masterchef2209 masterchef2209 commented May 16, 2020

sse4.2: added the implementation for mm_cmpestra

I have created this PR from a new branch as old branch's commit history became very messy.
The implementation is same as old branch.
I am trying to add tests in such a way that each block(case) of code is tested for 2 different values of imm8 and also for both size=8 and size=16 variants.
So for cmpestra there would be 14 cases as written in a comment in the tests file.
I have named the test cases functions to indicate which block of code they would be surely testing.
I have only implemented 4/14 for now, I will implement the rest after getting feedback.
I can also merge different cases in one big test function but I think it looks cleaner this way, if anything needs to be changed please inform me.
Thank You
PS: I am using generator code :)

@mr-c
Copy link
Collaborator

mr-c commented May 16, 2020

[7/94] gcc-10 -Itest/x86/40b6604@@simde-tests-x86-emul@sta -Itest/x86 -I../test/x86 -I. -I.. -fdiagnostics-color=always -pipe -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -std=c99 -g -march=native -DSIMDE_ENABLE_NATIVE_ALIASES -DSIMDE_NATIVE_ALIASES_TESTING -fPIC -fopenmp-simd -DSIMDE_ENABLE_OPENMP -Wno-psabi -DSIMDE_NO_NATIVE -MD -MQ 'test/x86/40b6604@@simde-tests-x86-emul@sta/sse4.2.c.o' -MF 'test/x86/40b6604@@simde-tests-x86-emul@sta/sse4.2.c.o.d' -o 'test/x86/40b6604@@simde-tests-x86-emul@sta/sse4.2.c.o' -c ../test/x86/sse4.2.c
../test/x86/sse4.2.c: In function ‘test_simde_mm_cmpestra_equalany_8’:
../test/x86/sse4.2.c:150:11: warning: implicit declaration of function ‘_mm_cmpestra_8_’; did you mean ‘_mm_cmpestra’? [-Wimplicit-function-declaration]
  150 |       r = _mm_cmpestra_8_(test_vec[i].a, test_vec[i].la, test_vec[i].b, test_vec[i].lb, 122);
      |           ^~~~~~~~~~~~~~~
      |           _mm_cmpestra
../test/x86/sse4.2.c: In function ‘test_simde_mm_cmpestra_equalany_16’:
../test/x86/sse4.2.c:231:11: warning: implicit declaration of function ‘_mm_cmpestra_16_’; did you mean ‘_mm_cmpestra’? [-Wimplicit-function-declaration]
  231 |       r = _mm_cmpestra_16_(test_vec[i].a, test_vec[i].la, test_vec[i].b, test_vec[i].lb, 123);
      |           ^~~~~~~~~~~~~~~~

https://github.com/nemequ/simde/pull/295/checks?check_run_id=680612596#step:7:10

@mr-c
Copy link
Collaborator

mr-c commented May 16, 2020

With regards to https://github.com/nemequ/simde/pull/295/checks?check_run_id=680612576#step:6:15 , was that supposed to be a negative test? @nemequ do you want negative (should fail to compile) tests?

@nemequ
Copy link
Member

nemequ commented May 16, 2020

With regards to https://github.com/nemequ/simde/pull/295/checks?check_run_id=680612576#step:6:15 , was that supposed to be a negative test? @nemequ do you want negative (should fail to compile) tests?

It would be nice, but it would require some work to get it such tests to work, I don't know that it's worth the effort. I can't remember if µnit supports negative tests or not, but for this type of thing that wouldn't be enough anyways; you'd need support in the build system.

I think this was just a bug that CI caught.

Copy link
Member

@nemequ nemequ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, I think you're on the right track here.

Just make sure you don't throw away your test case generator code accidentally! You might want to push it to a branch just in case.

Sincerely,
Someone who has accidentally trashed code on many, many occasions

simde/x86/sse4.2.h Show resolved Hide resolved
@nemequ
Copy link
Member

nemequ commented May 16, 2020

sse4.2: added the implementation for mm_cmpestra

I have created this PR from a new branch as old branch's commit history became very messy.
The implementation is same as old branch.

FWIW, you can force push to the branch to update the PR. The PR on GitHub will look a bit messy, but I really don't care about that. The history in the git repo will be nice and clean, and that's all I care about :)

If you don't know how to do that just let me know, I can walk you through it.

@masterchef2209
Copy link
Member Author

Looks good, I think you're on the right track here.

Just make sure you don't throw away your test case generator code accidentally! You might want to push it to a branch just in case.

Sincerely,
Someone who has accidentally trashed code on many, many occasions

Yes, I should push it in a branch, just in case.

@masterchef2209
Copy link
Member Author

sse4.2: added the implementation for mm_cmpestra
I have created this PR from a new branch as old branch's commit history became very messy.
The implementation is same as old branch.

FWIW, you can force push to the branch to update the PR. The PR on GitHub will look a bit messy, but I really don't care about that. The history in the git repo will be nice and clean, and that's all I care about :)

If you don't know how to do that just let me know, I can walk you through it.

I will remember that, thanks. I have been using force push for a while now.

@masterchef2209
Copy link
Member Author

masterchef2209 commented May 16, 2020

I think I did some mistake while generating test or there is some bug in my code, so some test functions are failing, I will try to fix it now.

simde/x86/sse4.2.h Outdated Show resolved Hide resolved
@masterchef2209 masterchef2209 force-pushed the hidayat/may16_1 branch 2 times, most recently from 1805269 to 60e10dc Compare May 17, 2020 08:18
@masterchef2209
Copy link
Member Author

I think everything is fine now, couple of jobs are failing in travis ci, but I think it is not related to PR, I shall quickly implement all the tests.
This is the generater code I am using with some changes for each test : Link

nemequ pushed a commit that referenced this pull request May 17, 2020
@nemequ
Copy link
Member

nemequ commented May 17, 2020

The clang -Wsign-conversion diagnostic on the OpenMP pragma is interesting. I've filed LLVM bug #45959. To work around it in SIMDe we can use something like:

diff --git a/simde/simde-common.h b/simde/simde-common.h
index 1f40c3a..7247011 100644
--- a/simde/simde-common.h
+++ b/simde/simde-common.h
@@ -690,6 +690,7 @@ typedef SIMDE_FLOAT64_TYPE simde_float64;
 #    if defined(SIMDE_ARCH_AARCH64)
 #      define SIMDE_BUG_CLANG_45541
 #    endif
+#    define SIMDE_BUG_CLANG_45959
 #  endif
 #  if defined(HEDLEY_EMSCRIPTEN_VERSION)
 #    define SIMDE_BUG_EMSCRIPTEN_MISSING_IMPL /* Placeholder for (as yet) unfiled issues. */
diff --git a/simde/x86/sse4.2.h b/simde/x86/sse4.2.h
index 6299cd5..e3ceb7c 100644
--- a/simde/x86/sse4.2.h
+++ b/simde/x86/sse4.2.h
@@ -167,11 +167,16 @@ simde_mm_cmpestra_8_(simde__m128i a, int la, simde__m128i b, int lb, const int i
       int_res_1 = 0xff;
       for(int i = 0 ; i < upper_bound ; i++){
         int k = i;
+        HEDLEY_DIAGNOSTIC_PUSH
+        #if defined(SIMDE_BUG_CLANG_45959)
+          #pragma clang diagnostic ignored "-Wsign-conversion"
+        #endif
         SIMDE_VECTORIZE_REDUCTION(&:int_res_1)
         for(int j = 0 ; j < (upper_bound-i) ; j++){
           int_res_1 &=  (((bool_res_.i8[k] >> j) & 1 ) << i) ;
           k += 1;
         }
+        HEDLEY_DIAGNOSTIC_POP
       }
       break;
   }
@@ -272,11 +277,16 @@ simde_mm_cmpestra_16_(simde__m128i a, int la, simde__m128i b, int lb, const int
       int_res_1 = 0xffff;
       for(int i = 0 ; i < upper_bound ; i++){
         int k = i;
+        HEDLEY_DIAGNOSTIC_PUSH
+        #if defined(SIMDE_BUG_CLANG_45959)
+          #pragma clang diagnostic ignored "-Wsign-conversion"
+        #endif
         SIMDE_VECTORIZE_REDUCTION(&:int_res_1)
         for(int j = 0 ; j < (upper_bound-i) ; j++){
           int_res_1 &= (((bool_res_.i16[k] >> j) & 1) << i) ;
           k += 1;
         }
+        HEDLEY_DIAGNOSTIC_POP
       }
       break;
   }

I'm more concerned with the other errors on the native cases. I can reproduce them locally with CFLAGS='-sse4.2' CXXFLAGS='-sse4.2' cmake .. && make -j && ./run-tests /x86/sse4_2/mm_cmpestra. It looks like the native versions generate a different result for some tests.

Also, you should probably tweak your test case generation code to occasionally generate a 0 in the inputs. Hitting the end of the string is a common case, there should be tests for it.

@masterchef2209
Copy link
Member Author

masterchef2209 commented May 17, 2020

Also, you should probably tweak your test case generation code to occasionally generate a 0 in the inputs. Hitting the end of the string is a common case, there should be tests for it.

I will include those cases.

@masterchef2209
Copy link
Member Author

It looks like the native versions generate a different result for some tests.

I am concerned with this, what can be the reason for this? Is there anything wrong with the implementation. I feel like the code does what is given in the IIG, but is there something that I am missing?

@nemequ
Copy link
Member

nemequ commented May 17, 2020

This is why we have tests, and why it's so important that they be exhaustive. I didn't see anything wrong with the code on my previous reviews, either.

I am concerned with this, what can be the reason for this? Is there anything wrong with the implementation.

If the native tests don't match the emul tests, there is either a problem with the test cases or the implementation. SIMDe should always match what the instructions actually do, which may not be exactly what the IIG shows.

In the past, the IIG has been a somewhat dubious source. My understanding is that the latest version of the IIG is actually verified against Intel's implementation or internal specifications, so it should be accurate. The only real exception is when an 8-bit immediate arugment is outside the expected range.

I spent a bit of time investigating this. The first thing I did was figure out exactly which test was failing. For /x86/sse4_2/mm_cmpestra_equalordered_8, it's the 7th one. Decoding the flags, I get SIMDE_SIDD_SBYTE_OPS | SIMDE_SIDD_MASKED_NEGATIVE_POLARITY | SIMDE_SIDD_LEAST_SIGNIFICANT. So now I know which cases to take a closer look at.

Byte vs. word seemed okay. I verified that it was being dispatched to the correct function; you can just add a quick printf in the code, which is what I did. If you define SIMDE_NO_INLINE SIMDe functions won't be inlined and you should be able to set a break point in you debugger, but I usually don't resort to that for something as simple as this.

So, next was to look at the polarity. Taking a closer look, the problem jumped right out. The code looks like:

    if (polarity & 1) {
      if ((polarity >> 1) & 1) {

So, first we check if the least significant bit is set. If it is, we then right shift by one bit and see if the least significant bit is set (or, to put it another way, see if the second least siginificant bit is set).

But that's not right. Those two bits are used for the _OPS flags. We need the the 5th and 6th bits (4th and 5th if we're counting from 0). If we change the code to

    if (polarity) {
      if (polarity & SIMDE_SIDD_MASKED_POSITIVE_POLARITY) {

It works. We've already masked off the irrelevant bits when you set polarity at the beginning of the function, so all we need to do for the first check is see if polarity is non-zero. For the second test, we don't need to bother with a shift, or hard-code values. Per the IIG, we're only trying to see if imm8[5] is set. imm8[5] maps to _SIDD_MASKED_POSITIVE_POLARITY, so just test polarity & SIMDE_SIDD_MASKED_POSITIVE_POLARITY.

Once you've fixed this (in both the 8 and 16 bit versions), you'll need to regenerate the test cases. I strongly suggest that you always make sure you're generating test cases using the native implementations, and make sure that you enable the relevant ISA extension during development (in this case, add -msse4.2 to your CFLAGS and CXXFLAGS).

@masterchef2209
Copy link
Member Author

It works. We've already masked off the irrelevant bits when you set polarity at the beginning of the function, so all we need to do for the first check is see if polarity is non-zero. For the second test, we don't need to bother with a shift, or hard-code values. Per the IIG, we're only trying to see if imm8[5] is set. imm8[5] maps to _SIDD_MASKED_POSITIVE_POLARITY, so just test polarity & SIMDE_SIDD_MASKED_POSITIVE_POLARITY.

It seems so obvious now I see it, I should have been more rigorous in finding the bug. As I had already found one bug and couldn't spot more I was biased into thinking that the cause of error is something too complex. Clearly, I was wrong. Thank You for finding the bug.

Once you've fixed this (in both the 8 and 16 bit versions), you'll need to regenerate the test cases. I strongly suggest that you always make sure you're generating test cases using the native implementations, and make sure that you enable the relevant ISA extension during development (in this case, add -msse4.2 to your CFLAGS and CXXFLAGS).

I will do that and make sure to use output of native implementations for test cases.

@nemequ
Copy link
Member

nemequ commented May 18, 2020

These things often seem obvious in hindsight; I've spent hours on bugs simpler than this. Often you just need a fresh set of eyes; another person is great, but usually just starting fresh the next day or even just stepping away for a few hours is enough.

@masterchef2209
Copy link
Member Author

masterchef2209 commented May 19, 2020

Also, you should probably tweak your test case generation code to occasionally generate a 0 in the inputs. Hitting the end of the string is a common case, there should be tests for it.

I will include those cases.

I am confused about this again, isn't hitting the end of the string in mm_cmpestra is decided by la and lb, why we need to have 0 in the input?

@nemequ
Copy link
Member

nemequ commented May 20, 2020

I am confused about this again, isn't hitting the end of the string in mm_cmpestra is decided by la and lb, why we need to have 0 in the input?

You're right, sorry. I was thinking there would be a case in the code for handling a null terminating string regardless of la/lb, but there isn't.

So I guess it's just a matter of fixing the polarity thing, then this can be merged.

@masterchef2209
Copy link
Member Author

masterchef2209 commented May 20, 2020

I have spotted more errors, I realised that setting a bit of a number is different from assigning it. I will complete this with all tests soon. I have also made a mistake in limit of for loops, I am actually surprised by the number of errors I have found on looking carefully. I hope it is right after this.

break;
case SIMDE_SIDD_CMP_EQUAL_EACH:
for(int i = 0 ; i <= upper_bound ; i++){
//SIMDE_VECTORIZE_REDUCTION(|:int_res_1)
Copy link
Member Author

@masterchef2209 masterchef2209 May 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nemequ what should I do here, if I have to use 2 operators inside this loop?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't have two reductions on the same variable.

That said, I don't see why you need this; int_res_1 is initialized to 0 (all bits unset), so you shouldn't need to unset the bits here. You only need to handle the case where the bits should be set but aren't.

@nemequ
Copy link
Member

nemequ commented May 20, 2020

I have spotted more errors, I realised that setting a bit of a number is different from assigning it. I will complete this with all tests soon. I have also made a mistake in limit of for loops, I am actually surprised by the number of errors I have found on looking carefully. I hope it is right after this.

Regarding the ranges, this would be a good opportunity to improve the test cases to cover these cases. Maybe after you set a and b to random values you could do something like (untested, but hopefully you get the idea):

int matching_chars = munit_rand_int_range(0, sizeof(a.i8) / sizeof(a.i8[0]));
for (int j = 0 ; j < matching_chars ; j++) {
  b.i8[j] = a.i8[j];
}

That way we have tests where the vectors match, either partially or fully. If you want to increase the number of elements test_vec to increase coverage feel free; 8 is just an arbitrary choice that is sufficient for most functions… it's much better to have too many tests than too few.

simde/simde-common.h Outdated Show resolved Hide resolved
@masterchef2209 masterchef2209 force-pushed the hidayat/may16_1 branch 2 times, most recently from b46c1c3 to 7ebfef7 Compare May 21, 2020 19:24
@masterchef2209
Copy link
Member Author

masterchef2209 commented May 21, 2020

I have a few doubts here. I have added a single test which is giving 0 as output on native but 1 on emul. But I am not able to figure out why. Most probably this part is the problem. I have made some changes in switch part too to match the IIG.

case SIMDE_SIDD_CMP_RANGES:
      for(int i = 0 ; i <= upper_bound ; i++){
        SIMDE_VECTORIZE_REDUCTION(|:int_res_1)
        for(int j = 0 ; j <= upper_bound ; j++){
          int_res_1 |= ((((bool_res_.i8[i] >> j) & 1) & ((bool_res_.i8[i] >> (j + 1)) & 1)) << i);
          j += 2;
        }
      }

Apart from this I feel there are some things which logically don't make sense to me from the intrinisics guide, for example how should we handle a_invalid and b_invalid. If we go by IIG it will be like what I have done in 8-bit version, but what I have done in 16-bit makes more sense.`

@nemequ
Copy link
Member

nemequ commented May 22, 2020

I'm not seeing the error either. I even threw together a quick implementation from scratch, and I'm getting the same results, so either the IIG is wrong or we're both misreading it in the same way. I'll play around with this a bit more tomorrow and see if I can narrow down the issue.

In the meantime, you could work on the CRC functions instead. At least those are well-understood functions; they basically implement CRC-32C. Or, of course, the AVX2 functions should all be pretty straightforward.

@mr-c
Copy link
Collaborator

mr-c commented Sep 1, 2020

@masterchef2209 Can you refresh this PR ? Thanks!

@masterchef2209
Copy link
Member Author

I have added the commit of this branch on top of current master.
Thank You

test/x86/sse4.2.c Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants