allowedUnmatched not working as expected. #1854
We have an unusual bug in our buildfarm deploy. Some product teams send us "OSFamily" and "Linux", and this does NOT match "OSFamily" and "Linux" in the configuration used by servers or workers. Changing to use "*" as the OSFamily doesn't work either. When we change the logic to avoid checking the properties, the situation resolves. Is this as intended? Should `allowUnmatched: true` refuse to match when the properties don't match?
I'll admit I'm not following a couple of things: The Command's definition is conveniently stored in a blob in this case, which must be deserialized by the stock protobuf parser. If you can retrieve the Command's bytes (bf-cat File will retrieve this for the digest), parse those bytes into a Command message, and compare it to a config-parsed queue definition, we should be able to root out any differences due to encodings. Happy to do that comparison myself if you want to present a pathological copy of your Command blob.
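For the encoding comparison suggested above, a minimal sketch of how two property values could be compared character by character. This is not Buildfarm code: `PropertyInspector`, `codePoints`, and `firstDifference` are hypothetical helper names, and the Cyrillic look-alike in `main` is an invented example of what an "invisible" difference might look like, not data from this issue.

```java
final class PropertyInspector {
  /** Renders each code point as U+XXXX so look-alike or invisible characters stand out. */
  static String codePoints(String value) {
    StringBuilder out = new StringBuilder();
    value.codePoints().forEach(cp -> {
      if (out.length() > 0) {
        out.append(' ');
      }
      out.append(String.format("U+%04X", cp));
    });
    return out.toString();
  }

  /** Returns the char index of the first difference, or -1 if the strings are equal. */
  static int firstDifference(String a, String b) {
    int shared = Math.min(a.length(), b.length());
    for (int i = 0; i < shared; i++) {
      if (a.charAt(i) != b.charAt(i)) {
        return i;
      }
    }
    return a.length() == b.length() ? -1 : shared;
  }

  public static void main(String[] args) {
    String fromConfig = "Linux";
    // Hypothetical value: Cyrillic 'х' (U+0445) in place of Latin 'x' (U+0078).
    String fromCommand = "Linu\u0445";
    System.out.println(codePoints(fromConfig));
    System.out.println(codePoints(fromCommand));
    System.out.println("first difference at index " + firstDifference(fromConfig, fromCommand));
  }
}
```

Running the same helpers over the property values parsed from the Command blob and over the values parsed from the queue configuration would show whether the two "Linux" strings are really identical.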
@werkt I did note that the CI/CD we use to deploy the change might have a timing bug in it, where the JVM is not fully stopped when we "wipe" the Redis DB. If I read things right, there is a point where a Redis client defines the queues in the Redis DB, so maybe what I saw here (the server rejecting the prequeue condition) is just a bug on my end where I didn't wipe the Redis DB properly? We do see a second issue if I can get the server to accept the job (this makes more sense with this patch) ... this is building the Bazel BuildFarm code-base itself. (I fixed this issue before working with my customer again.)
I did truncate the error for brevity. Here's what I have from my customer. I'll see if they won't help me get the command blobs. I did note this only happened on SOME of their branches and not others. Specifically IIRC this was a Spring Boot command. Full text of the client side error (sensitive information redacted):
We spent a week in our test env iterating through different queue configurations trying to understand why we would get some combinations of behaviors:
What I'm not sure about is why, occasionally, I would see the queue ... Ultimately, we were puzzled by "allowUnmatched" not allowing the unmatched command. The fix I propose just stops comparisons if "allowUnmatched" is set ... on the worker. Oddly, I didn't have to do anything on the server, and I'm not clear where that code would have been, as I couldn't find the equivalent. I'm pressed for time, so I went with a "belt-and-suspenders" approach on this quarter's deploy:
Process:
Either way, I'd love to understand this behavior better. And if the patch isn't needed and I just need to tell my customer to fix a remote action, that would be a good outcome.
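To make the proposal concrete, here is a rough sketch of "stop comparing when allowUnmatched is set". It is a simplified model, not Buildfarm's actual classes: `QueueSketch`, its fields, and `isEligible` are hypothetical names, properties are modelled as a plain map (real platform properties may repeat a name), and the "*" wildcard handling is an assumption based on how it is used in this thread.

```java
import java.util.Map;

final class QueueSketch {
  private final Map<String, String> required; // property name -> required value, "*" = any value
  private final boolean allowUnmatched;

  QueueSketch(Map<String, String> required, boolean allowUnmatched) {
    this.required = required;
    this.allowUnmatched = allowUnmatched;
  }

  /** Hypothetical eligibility check with the short-circuit described in this thread. */
  boolean isEligible(Map<String, String> commandProperties) {
    if (allowUnmatched) {
      return true; // proposed behavior: skip every comparison
    }
    for (Map.Entry<String, String> property : commandProperties.entrySet()) {
      String expected = required.get(property.getKey());
      if (expected == null) {
        return false; // the queue does not know this property name
      }
      if (!expected.equals("*") && !expected.equals(property.getValue())) {
        return false; // the name is known but the value differs
      }
    }
    return true;
  }

  public static void main(String[] args) {
    Map<String, String> command = Map.of("OSFamily", "Linux");
    System.out.println(new QueueSketch(Map.of("OSFamily", "Linux"), false).isEligible(command));
    System.out.println(new QueueSketch(Map.of("OSFamily", "Windows"), false).isEligible(command));
    System.out.println(new QueueSketch(Map.of("OSFamily", "Windows"), true).isEligible(command));
  }
}
```

Whether skipping the comparison entirely is the intended meaning of the flag, as opposed to only tolerating property names the queue does not mention, is the open question here.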
Digging around in where you pointed me: Here: ... and here ... The customer's only properties we could find were "OSFamily" and "Linux" (we checked for typos but not charset differences) ... probably means we get to line 184 ...
I did try a queue definition of:
... which I would have expected to pass ...
But ... that's not the behavior I saw. But, I don't understand:
Which ... if I'm reading this right for ...
... would ... always return "false" on that particular action. Is that intended on a queue with ...? Sorry for the book of a reply, I've spent 3 weeks on this problem.
The last one there is complicated, but I've tested it to be effective and valid. I've explored the Java String class and this implementation thoroughly, and I can't find anything out of order on the server side at least. I think your theory of 'invisible characters' is the only plausible one for why this doesn't match. We would get a little more information here if you applied the following patch, at least per observed queue on the server:
diff --git a/src/main/java/build/buildfarm/instance/shard/ServerInstance.java b/src/main/java/build/buildfarm/instance/shard/ServerInstance.java
index 6671d629..e3e2901d 100644
--- a/src/main/java/build/buildfarm/instance/shard/ServerInstance.java
+++ b/src/main/java/build/buildfarm/instance/shard/ServerInstance.java
@@ -1798,6 +1798,14 @@ public class ServerInstance extends NodeInstance {
boolean validForOperationQueue =
backplane.propertiesEligibleForQueue(platform.getPropertiesList());
if (!validForOperationQueue) {
+ String explanation = "impossible";
+ try {
+ backplane.dispatchOperation(platform.getPropertiesList());
+ } catch (IOException e) {
+ explanation = "unexpected: " + e.getMessage();
+ } catch (RuntimeException e) {
+ explanation = e.getMessage();
+ }
preconditionFailure
.addViolationsBuilder()
.setType(VIOLATION_TYPE_INVALID)
@@ -1807,8 +1815,8 @@ public class ServerInstance extends NodeInstance {
"properties are not valid for queue eligibility: %s. If you think your queue"
+ " should still accept these poperties without them being specified in queue"
+ " configuration, consider configuring the queue with `allow_unmatched:"
- + " True`",
- platform.getPropertiesList()));
+ + " True`\nExplanation:\n%s",
+ platform.getPropertiesList(), explanation));
}
for (Property property : platform.getPropertiesList()) {
This is an abuse of the worker-side interface to the backplane to make it throw an exception that we expect to contain the explanation of queue ineligibility.
Some notes on your digging above: You pasted the same source of <code link> ... and here ... <same link>, so I'm not sure what else you could have been referring to code-wise. Also, your idea about the flush is interesting, but unlikely: Redis doesn't need to have queues 'initialized' per se, as they have no state except a landing location in the database. Every queue that could receive that message would go through a server, and that has only an up/down state for the entire thing: either it's serving and will apply the config filters, or it is not, and nothing will be queued. What your spurious behavior does point to is the possibility of multiple servers with non-homogeneous queue definitions, where some would pass this through and some would reject it.
Since you have ample time in the 7,120s compile time case, I'd suggest using bf-cat with ... The fastest way to a solution here is a copy of the offending Command's blob. Seeing the issue in the flesh will likely save you vastly more time than has already been spent on this.
The above patch also requires a catch for InterruptedException in a similar vein, though that is rather unlikely unless shutting down.
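A sketch of what the explanation-capturing block could look like with the extra catch. It is self-contained and does not call Buildfarm: `ExplainSketch`, `Probe`, and `explain` are hypothetical names standing in for the backplane call in the patch, and the "interrupted" message is an assumption about what one might want to report.

```java
import java.io.IOException;

final class ExplainSketch {
  /** Hypothetical stand-in for a backplane call that throws to say why it failed. */
  interface Probe {
    void run() throws IOException, InterruptedException;
  }

  /** Runs the probe and turns whatever it throws into a human-readable explanation. */
  static String explain(Probe probe) {
    String explanation = "impossible"; // same default as the patch: only kept if nothing throws
    try {
      probe.run();
    } catch (IOException e) {
      explanation = "unexpected: " + e.getMessage();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // keep the interrupt visible to callers
      explanation = "interrupted";
    } catch (RuntimeException e) {
      explanation = e.getMessage();
    }
    return explanation;
  }

  public static void main(String[] args) {
    System.out.println(explain(() -> {
      throw new IllegalStateException("no eligible queue for properties");
    }));
  }
}
```

Restoring the interrupt flag rather than swallowing it keeps shutdown handling intact for whatever code runs after the precondition failure is built.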
THANK YOU!
Do you have any clue what ...
I've discovered a related but definitely-not-the-same issue with allowUnmatched on worker queues. The action in this issue is being rejected at the server level (if I followed everything correctly here), so this information won't help specifically, but there's another issue with the flag on the worker level:
We have an unusual bug in our buildfarm deploy. Some product teams send us "OSFamily" and "Linux", and this does NOT match "OSFamily" and "Linux" in the queue configuration used by servers or workers.
Examples:
Even with all three queues defined, the client is seeing
Conjectures:
*
used on the service/worker side configuration from a different alphabet than the one in the JDK?
Either way, rearranging the order of the checks produces the desired effect, and the client no longer receives the error that "Linux" does not equal "Linux" ... still ... it would be nice to specify "Linux" and have it match "Linux".
Is this as intended? Should `allowUnmatched: true` refuse to match when the properties don't match?