Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allowedUnmatched not working as expected. #1854

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

hartsock
Copy link

We have an unusual bug in our buildfarm deploy. Some product-teams send us "OSFamily" and "Linux" and this does NOT match "OSFamily" and "Linux" in the queue configuration used by servers or workers.

Examples:

  queues:
    - name: "*"
      allowUnmatched: true
    - name: "default"
      allowUnmatched: true
      properties:
        - name: "OSFamily"
          value: "*"
    - name: "linux"
      allowUnmatched: true
      properties:
        - name: "min-cores"
          value: "*"
        - name: "max-cores"
          value: "*"
        - name: "OSFamily"
          value: "Linux"

Even with all three queues defined, the client is seeing

(INVALID) The `Platform` of the `Command` was invalid.: properties are not valid for queue eligibility: [name: "OSFamily"
value: "Linux"
].

Conjectures:

  • Is a hidden char in the "Linux" string is making it into an upstream OSS project's actions from the Bazel side of the conversation?
  • Is a charset difference between the Bazel and BuildFarm service present and causing a mismatch?
  • Is the character * used on the service/worker side configuration from a different alphabet than the one in the JDK?

Either way, rearranging the order of the checks produces the desired effect and the client no longer receives the error that "Linux" does not equal "Linux" ... still ... it would be nice to specify "Linux" and have it match "Linux"

Is this as intended? Should allowUnmatched: true refuse to match when the properties don't match?

We have an unusual bug in our buildfarm deploy. Some product-teams send us "OSFamily" and "Linux" and this does NOT match "OSFamily" and "Linux" in the configuration used by servers or workers. 

Changing to use "*" as the OSFamily doesn't work either. We change the logic to avoid checking the properties and the situation resolves.

Is this as intended? Should `allowUnmatched: true` refuse to match when the properties don't match?
@werkt
Copy link
Member

werkt commented Sep 20, 2024

I'll admit I'm not following a couple of things:
Minor: The invalid message string you pasted is missing some copy at the end, based on https://github.com/bazelbuild/bazel-buildfarm/blob/f3f0cfdf875f2c86e9b35e7e1972fb5326e9283c/src/main/java/build/buildfarm/instance/shard/ServerInstance.java#L1807. If you truncated it to be more brief, that's understandable.
Major: The error message above, which is the only instance of that copy, occurs on the server, not the worker, but your change only affects worker activity. I don't see how it could possibly cause the error to not occur based on where that error is generated.

The Command's definition is conveniently stored in a blob in this case which must be deserialized by the stock protobuf parser - if you can retrieve the Command's bytes (bf-cat File will retrieve this for the digest), parse those bytes into a Command message, and compare it to a config-parsed queue definition, we should be able to route out any differences due to encodings.

Happy to do that comparison myself if you want to present a pathological copy of your Command blob.

@hartsock
Copy link
Author

@werkt
I'd rather not do this change, it is odd to me that when I put this patch on the worker the server stops rejecting the command.

I did note that the CI/CD we use to deploy the change might have a timing bug in it where the JVM is not fully stopped when we "wipe" the Redis DB. If I read things right there is a point where a redis client defines the queues in the Redis DB and what I maybe what I saw here ... with the server rejecting the prequeue condition ... is just a bug on my end where I didn't wipe the Redis DB properly?

We do see a second issue if I can get the server to accept the job (this makes more sense with this patch) ... this is building the Bazel BuildFarm code-base itself. (I fixed this issue before working with my customer again.)

[222 / 618] JavaToolchainCompileClasses external/bazel_tools/tools/jdk/platformclasspath_classes; 2s remote ... (4 actions, 2 running)
[222 / 618] Compiling absl/base/log_severity.cc [for tool]; 1s remote ... (4 actions, 3 running)
[222 / 618] Compiling absl/base/log_severity.cc [for tool]; 2s remote ... (4 actions running)
...
[222 / 618] Compiling absl/base/log_severity.cc [for tool]; 6940s remote ... (4 actions running)
[222 / 618] Compiling absl/base/log_severity.cc [for tool]; 7000s remote ... (4 actions running)
[222 / 618] Compiling absl/base/log_severity.cc [for tool]; 7060s remote ... (4 actions running)
[222 / 618] Compiling absl/base/log_severity.cc [for tool]; 7120s remote ... (4 actions running)
Bazel caught terminate signal; shutting down.
ERROR: build interrupted
INFO: Elapsed time: 7[194](https://${INTERNAL_GITLAB_REPO}/${REDACTED_PATH}/-/jobs/49535804#L194).461s, Critical Path: 7168.77s
INFO: 226 processes: 1 remote cache hit, 225 internal.
FAILED: Build did NOT complete successfully
real	119m54.721s

I did truncate the error for brevity. Here's what I have from my customer. I'll see if they won't help me get the command blobs. I did note this only happened on SOME of their branches and not others. Specifically IIRC this was a Spring Boot command.

Full text of the client side error (sensitive information redacted):

ERROR: ${REDACTED_PATH}/.cache/bazel/_bazel_bd007141/57dc564e719dcccb93518ce11889b331/external/maven/BUILD:6074:8: Executing genrule @@maven//:commons_beanutils_commons_beanutils_1_9_4_extension failed: (Exit 34): Remote Execution Failure:
Failed Precondition: Action 9a84d06b5d0fb365a8a17b364b9cd93785e69dad9a34656bfc9fe85462608e29/166 is invalid: properties are not valid for queue eligibility: [name: "OSFamily"
value: "Linux"
].  If you think your queue should still accept these poperties without them being specified in queue configuration, consider configuring the queue with `allow_unmatched: True`.
  Precondition Failure:
    (INVALID) The `Platform` of the `Command` was invalid.: properties are not valid for queue eligibility: [name: "OSFamily"
value: "Linux"

We spent a week in our test env iterating through different queue configurations trying to understand why we would get some combinations of behaviors:

  1. the server rejects the action ... which I could get past with the "*" queue
  2. the worker refuses to dequeue the command ... which I couldn't get past with the "*" queue

What I'm not sure why it happened is occasionally I would see the queue * work at the server but fail at the worker. The error snippet I produced here was found during testing with all 3 of the example queue definitions on the server and the worker.

Ultimately, we were puzzled by "allowUnmatched" not allowing the unmatched command.

The fix I propose just stops comparisons if "allowUnmatched" is set ... on the worker. Oddly, I didn't have to do anything on the server and I'm not clear where that code would have been as I couldn't find the equivalent.

I'm pressed for time so I went with a "belt-and-suspenders" approach on this quarter's deploy:

  1. I defined 3 queues (which doesn't make sense to use b/c allowUnmatched we'd expect would let the unmatched platform past us)
  2. I applied this patch. I saw the 7,120s compile time on absl/base/log_severity.cc in the Bazel BuildFarm self-hosted build

Process:

  1. fix the self-hosted build's timeout in our CI/CD first
  2. deploy the unpatched version in a shared test environment with the customer (v2.9.0 unpatched) with the 3 queues defined ... they report the server-side rejection still happens
  3. deploy the patched jars from our CI/CD (full stop and wipe of the BuildFarm here) and re-test with the customer ... they report the issue cleared

Either way, I'd love to understand this behavior better. And, if the patch isn't needed and I just need to tell my customer to fix a remote action that would be a good outcome.

@hartsock
Copy link
Author

hartsock commented Sep 20, 2024

@werkt

Digging around in where you pointed me:

Here:

https://github.com/bazelbuild/bazel-buildfarm/blob/main/src/main/java/build/buildfarm/common/redis/ProvisionedRedisQueue.java#L184

... and here ...

https://github.com/bazelbuild/bazel-buildfarm/blob/main/src/main/java/build/buildfarm/common/redis/ProvisionedRedisQueue.java#L184

The customers only properties we could find were "OSFamily" and "Linux" (we checked for typos but not charset differences) ... probably means we get to line 184 ...

return requirements.isEmpty();

I did try a queue definition of:

    - name: "linux"
      allowUnmatched: true
      properties:
        - name: "min-cores"
          value: "*"
        - name: "max-cores"
          value: "*"
        - name: "OSFamily"
          value: "*"

... which I would have expected to pass ...

https://github.com/bazelbuild/bazel-buildfarm/blob/main/src/main/java/build/buildfarm/common/redis/ProvisionedRedisQueue.java#L184

    // fully wildcarded queues are always eligible
    if (isFullyWildcard) {
      return true;
    }

But ... that's not the behavior I saw.

But, I don't understand:

return requirements.isEmpty();

Which ... if I'm reading this right for ...

Failed Precondition: Action 9a84d06b5d0fb365a8a17b364b9cd93785e69dad9a34656bfc9fe85462608e29/166 is invalid: properties are not valid for queue eligibility: [name: "OSFamily"
value: "Linux"
].  If you think your queue should still accept these poperties without them being specified in queue configuration, consider configuring the queue with `allow_unmatched: True`.
  Precondition Failure:
    (INVALID) The `Platform` of the `Command` was invalid.: properties are not valid for queue eligibility: [name: "OSFamily"
value: "Linux"

... would ... always return "false" on that particular action.

Is that intended on a queue with allowUnmatched: true ? ... or maybe I don't understand what I'm reading?

Sorry for the book of a reply, I've spent 3 weeks on this problem.

@werkt
Copy link
Member

werkt commented Sep 21, 2024

isFullyWildcard is only true when any of the queue item's properties.name is "*". This short-circuits everything but the queue selector property to make a queue eligible.
requirements is only populated with explicit, non-wildcarded values. By iterating over each input property and removing pairs from them, we expect that at the end of isEligible, an empty requirements list means everything the queue requires explicitly has been satisfied.
allowUserUnmatched is a switch on non-isFullyWildcard queues to disqualify property sets which contain keys that are not wildcarded and do not match entries in the requirements. This permits some non-ordinality and passthrough of the queue definitions for true and false, respectively.

The last there is complicated, but I've tested to be effective and valid.

I've explored the java String class and this implementation thoroughly, and I can't find anything out of order on the server side at least - I think your theory of 'invisible characters' is the only plausible one for why this doesn't match.

Here we would get a little more information if you applied the following patch, at least per observed queue on the server:

diff --git a/src/main/java/build/buildfarm/instance/shard/ServerInstance.java b/src/main/java/build/buildfarm/instance/shard/ServerInstance.java
index 6671d629..e3e2901d 100644
--- a/src/main/java/build/buildfarm/instance/shard/ServerInstance.java
+++ b/src/main/java/build/buildfarm/instance/shard/ServerInstance.java
@@ -1798,6 +1798,14 @@ public class ServerInstance extends NodeInstance {
     boolean validForOperationQueue =
         backplane.propertiesEligibleForQueue(platform.getPropertiesList());
     if (!validForOperationQueue) {
+      String explanation = "impossible";
+      try {
+        backplane.dispatchOperation(platform.getPropertiesList());
+      } catch (IOException e) {
+        explanation = "unexpected: " + e.getMessage();
+      } catch (RuntimeException e) {
+        explanation = e.getMessage();
+      }
       preconditionFailure
           .addViolationsBuilder()
           .setType(VIOLATION_TYPE_INVALID)
@@ -1807,8 +1815,8 @@ public class ServerInstance extends NodeInstance {
                   "properties are not valid for queue eligibility: %s.  If you think your queue"
                       + " should still accept these poperties without them being specified in queue"
                       + " configuration, consider configuring the queue with `allow_unmatched:"
-                      + " True`",
-                  platform.getPropertiesList()));
+                      + " True`\nExplanation:\n%s",
+                  platform.getPropertiesList(), explanation));
     }
 
     for (Property property : platform.getPropertiesList()) {

This is an abuse of the worker-side interface to the backplane to make it throw an exception that we expect to contain the explanation of queue ineligibility.

Some notes on your digging above: You pasted the same source of <code link> ... and here ... <same link>, not sure what else you could have been referring to code-wise.

Also, your idea about the flush is interesting, but unlikely: redis doesn't need to have queues 'initialized' per se - they have no state except a landing location in the database. Every queue that could receive that message would go through a server, and that has only an up/down state for the entire thing - either it's serving and will apply the config filters, or it is not, and nothing will be queued.

What your spurious behavior does point to is the possibility of multiple servers with non-homogenous queue definitions - some would pass this through, some would reject it.

Since you have ample time in the 7,120s compile time case, I'd suggest using bf-cat with Operations toolInvocationId=<bazel's invocation id> and Operations status=dispatched to inspect the operation to see if it is bouncing in and out of the dispatched state, as though workers are rejecting its match (this is completely silent, and does not check against requeue attempt limits). bf-client - https://github.com/werkt/bf-client - is also useful for observing and inspecting the operations and associated objects without issuing multiple commands.

The fastest way to a solution here is a copy of the offending Command's blob. It will likely save you vastly more time than has already been spent on this to see the issue in the flesh.

@werkt
Copy link
Member

werkt commented Sep 23, 2024

The above patch also requires a catch for InterruptedException in a similar vein, and rather unlikely unless shutting down.

@hartsock
Copy link
Author

hartsock commented Sep 23, 2024

THANK YOU!

Some notes on your digging above: You pasted the same source of ... and here ... , not sure what else you could have been referring to code-wise.

...

Sorry. I was probably swapping between tabs too fast and missed that I did that.

I meant to say:

here:
https://github.com/bazelbuild/bazel-buildfarm/blob/main/src/main/java/build/buildfarm/common/redis/ProvisionedRedisQueue.java#L172

(line 172 which is the "isFullyWildcard" logic)

... and here ...
https://github.com/bazelbuild/bazel-buildfarm/blob/main/src/main/java/build/buildfarm/common/redis/ProvisionedRedisQueue.java#L184

(line 184 which is the logic "requirements.isEmpty")

Your explanation here:

isFullyWildcard is only true when any of the queue item's properties.name is "*". This short-circuits everything but the queue selector property to make a queue eligible.

Completely clarifies the issue!

It explains why adding:

  queues:
    - name: "*"
      allowUnmatched: true

... ended up working ... it is something I added in desperation at the end of my dev/test cycle. Specifically, to get to the "infinite stall" issue the customer was seeing.

These explanations also help clarify the intentions of the design with me:

requirements is only populated with explicit, non-wildcarded values. By iterating over each input property and removing pairs from them, we expect that at the end of isEligible, an empty requirements list means everything the queue requires explicitly has been satisfied.

allowUserUnmatched is a switch on non-isFullyWildcard queues to disqualify property sets which contain keys that are not wildcarded and do not match entries in the requirements. This permits some non-ordinality and passthrough of the queue definitions for true and false, respectively.

... this explains a lot to me ...

I'll have a shortage of time for the next month or two to push further.

... BUT ...

I'll see if this particular customer is game to chase this. I'll try to apply the patch and work with my customer to see if we can get this to happen. They mentioned the issue was happening in some upstream open-source package of theirs and I'll try to track that down with them to see if anyone else in the community has bumped into this or if its just us.

Setting a reminder to come back to this. (10/2 reminder set)

EDIT:

What is still mysterious

  • I added all 3 queues to the server/managers and the workers BEFORE attempting the patch and had the customer explicitly test the two configurations independently (I did NOT want to write a patch)
  • The customer's test only passed after I applied the suggested patch above

... I expect this must be some kind of hidden char issue that only shows up on the worker for some reason and only this one action (other customers did NOT report this problem)

Apologies for the length of this response, I did not take the time to make it shorter.

@werkt
Copy link
Member

werkt commented Sep 28, 2024

Do you have any clue what locale the client is using? I could try something exotic locally.

@werkt
Copy link
Member

werkt commented Oct 1, 2024

I've discovered a related but definitely-not-the-same-issue with allowUnmatched on worker queues. The action in this issue is being rejected at the server level (if I followed everything correctly here) so this information won't help specifically, but there's another issue with the flag on the worker level:
Workers with multiple queues in the config will not select a queue to inform it regardless of either allowUnmatched: true in the queue or allowUnmatched: true in dequeueMatchSettings unless the properties specified by the queue are satisfied by the provisions.
This entire system is annoyingly complicated, and I think is going to require a rework at some point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants