1787-Support_multiple_hash_functions #1791
base: main
Conversation
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
@@ -71,20 +74,21 @@ public DigestFunction.Value getDigestFunction() {
    return DigestFunction.Value.UNKNOWN;
  }

  // Correct me if I'm wrong but this function doesn't make any sense?
- I am referring to the old function here
@@ -80,6 +80,20 @@ java_library(
    ],
)

java_library(
    name = "digestUtil",
    srcs = ["DigestUtil.java"],
is this also included in :common above?
Yeah it is, but if I add common to src/main/java/build/buildfarm/common/resources/BUILD
we have a circular dependency problem. So I created a separate instance for it.
@@ -247,6 +243,13 @@ public SettableFuture<Long> load(Digest digest) {
  private final transient Entry header = new SentinelEntry();
  private volatile long unreferencedEntryCount = 0;

  private static final int MAX_ENTRIES = 10000;
  private static final Cache<String, DigestFunction.Value> digest_to_function_map =
Why do we need to introduce this cache?
Without it, this change would actually require quite a huge refactoring: changing many chained functions, the corresponding interfaces, and as a result all other implementations of those. This seems like the cleanest solution to me.
Let's first target supporting multiple hashing algorithms for write requests, and update the PR summary accordingly.
@@ -101,6 +102,14 @@ Write getWrite(
      Compressor.Value compression, Digest digest, UUID uuid, RequestMetadata requestMetadata)
      throws EntryLimitException;

  @Nullable
  default Write getWrite(
There is no need to create a new overloaded method; you can add the additional parameter to the current method and update all existing implementations.
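A minimal sketch of the suggested direction, threading the digest function through the existing method instead of overloading it (the parameter name and position are assumptions, not the final API):

  @Nullable
  Write getWrite(
      Compressor.Value compression,
      Digest digest,
      DigestFunction.Value digestFunction, // new parameter, replacing the overload
      UUID uuid,
      RequestMetadata requestMetadata)
      throws EntryLimitException;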
@@ -981,6 +987,8 @@ public synchronized FeedbackOutputStream getOutput(
      throws IOException {
    // caller will be the exclusive owner of this write stream. all other requests
misplaced comment
Not my comment, but I can remove it 🤷‍♂️
I mean the comment got misplaced because of this code in the middle:
// Map identifier to digest function
digest_to_function_map.put(key.getDigest().getHash(), key.getDigestFunction());
@@ -981,6 +987,8 @@ public synchronized FeedbackOutputStream getOutput(
      throws IOException {
    // caller will be the exclusive owner of this write stream. all other requests
    // will block until it is returned via a close.
    // Map identifier to digest function
    digest_to_function_map.put(key.getDigest().getHash(), key.getDigestFunction());
This seems like a workaround to store the hashing algorithm for digests. It also introduces extra overhead to maintain a cache.
  private static final int MAX_ENTRIES = 10000;
  private static final Cache<String, DigestFunction.Value> digest_to_function_map =
      Caffeine.newBuilder()
          .expireAfterWrite(2, TimeUnit.MINUTES)
What happens if there are more than 10k write requests in 2 minutes?
Doesn't matter. This is basically an LRU cache. As long as there aren't 10k requests (or 2 minutes) between the calls to getOutput and putOrReferenceGuarded, it works. But we can also increase the size if you think that is not sufficient.
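For reference, a self-contained sketch of the cache pattern under discussion; the names mirror the diff, while maximumSize(MAX_ENTRIES) is an assumption inferred from the MAX_ENTRIES constant:

  import build.bazel.remote.execution.v2.DigestFunction;
  import com.github.benmanes.caffeine.cache.Cache;
  import com.github.benmanes.caffeine.cache.Caffeine;
  import java.util.concurrent.TimeUnit;

  class DigestFunctionMapSketch {
    private static final int MAX_ENTRIES = 10000;

    // Entries are evicted after 2 minutes or once MAX_ENTRIES is exceeded, so a
    // mapping only needs to survive the window between getOutput and
    // putOrReferenceGuarded.
    private static final Cache<String, DigestFunction.Value> digest_to_function_map =
        Caffeine.newBuilder()
            .expireAfterWrite(2, TimeUnit.MINUTES)
            .maximumSize(MAX_ENTRIES)
            .build();

    static void remember(String hash, DigestFunction.Value function) {
      digest_to_function_map.put(hash, function);
    }

    static DigestFunction.Value recall(String hash) {
      // Returns null if the entry was already evicted, which is the failure
      // mode the review is concerned about.
      return digest_to_function_map.getIfPresent(hash);
    }
  }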
You may increase it, but doing so will add extra overhead. Using an additional data structure (accessible by multiple threads) to pass a value from one method to another may seem like a shortcut to avoid further code changes. However, it's better to make the necessary changes rather than introduce a new data structure.
Fair, but then it requires quite a large code change, and even more so if we don't want to create a new overloaded method like you mentioned, but instead change the existing interface methods as well as all implementations of those.
Yes, there will be small changes at multiple places in the code. Let's try to do it the proper way so it can be upstreamed.
  } else {
    // Infer the digest function based on the hash format
    // Should go here when it's not BLAKE3
    String hash = maybeDigestFunction;
    builder.setDigestFunction(DigestUtil.HashFunction.forHash(hash).getDigestFunction());
DigestFunction is always included in the request, so this additional handling might not be needed.
There are several other hashing algorithms with the same hash length as BLAKE3. It might be better to remove the HashFunction.forHash method entirely and always rely on the request.
I don't see it always included in the request. For BLAKE3 it is; for SHA it isn't.
I assume if we want to change that, we would need to change Bazel OSS first.
I thought that the hash function is not part of the request only for SHA256, as SHA256 is the default hashing algorithm for Bazel. However, upon checking the Bazel code, I found that it classifies hashing algorithms into two groups: old and new. For the old ones, where the hashing algorithm can be identified using the hash length, the hash function is not part of the request. For the new ones, the hash function is included in the request. The hash detection logic should complement this Bazel method.
https://github.com/bazelbuild/bazel/blob/1dd5c20513b6ff479bf796fe2541bd492c661c94/src/main/java/com/google/devtools/build/lib/remote/ByteStreamUploader.java#L183
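For context, the upload resource names differ between the two groups. A hedged sketch of the dispatch this implies; the helper below is illustrative, not the PR's code:

  // Old-style: uploads/{uuid}/blobs/{hash}/{size}                       (SHA256, SHA1, MD5)
  // New-style: uploads/{uuid}/blobs/{digest_function}/{hash}/{size}     (e.g. blake3)
  static DigestFunction.Value digestFunctionFromSegment(String maybeDigestFunction) {
    for (DigestFunction.Value function : DigestFunction.Value.values()) {
      if (function.name().equalsIgnoreCase(maybeDigestFunction)) {
        return function; // explicitly named in the resource path
      }
    }
    // Not a known function name: the segment is already the hash, so infer by length.
    return DigestUtil.HashFunction.forHash(maybeDigestFunction).getDigestFunction();
  }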
It is complementing that now, no? For the new ones we extract it if it's part of the request; otherwise we determine it by length with this function I added:

  public static HashFunction forHash(String hexDigest) {
    int hashLength = hexDigest.length();
    return switch (hashLength) {
      case 64 -> // 32 bytes * 2 characters per byte
          HashFunction.SHA256;
      case 40 -> // 20 bytes * 2 characters per byte
          HashFunction.SHA1;
      case 32 -> // 16 bytes * 2 characters per byte
          HashFunction.MD5;
      default -> throw new IllegalArgumentException("hash type unrecognized: " + hexDigest);
    };
  }
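Usage then looks like this (the digest value is a hypothetical stand-in):

  String sha256Hex = "a".repeat(64); // stand-in for a real 64-char hex digest
  assert DigestUtil.HashFunction.forHash(sha256Hex) == DigestUtil.HashFunction.SHA256;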
Agreed, this complements it. Let's update the comment to something like: if the hash function is not part of the request, then use the hash length to determine the hashing algorithm. "BLAKE3" is too specific.
@@ -1906,7 +1898,7 @@ public boolean pollOperation(String operationName, ExecutionStage.Value stage) {

  protected CacheCapabilities getCacheCapabilities() {
    return CacheCapabilities.newBuilder()
-       .addDigestFunctions(digestUtil.getDigestFunction())
+       .addAllDigestFunctions(DigestUtil.getSupportedDigestFunctions())
The current change mostly focuses on supporting multiple hashing algorithms for write requests, but this change will signal to clients that Buildfarm supports multiple hashing algorithms for all APIs, including the ExecutionService.Execute API.
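A sketch of the advertisement in question; the exact set returned by getSupportedDigestFunctions is an assumption, not taken from the PR:

  // Hypothetical body for DigestUtil.getSupportedDigestFunctions();
  // ImmutableList is Guava, and the real set is whatever the PR decides on.
  public static List<DigestFunction.Value> getSupportedDigestFunctions() {
    return ImmutableList.of(
        DigestFunction.Value.SHA256,
        DigestFunction.Value.SHA1,
        DigestFunction.Value.MD5,
        DigestFunction.Value.BLAKE3);
  }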
Then how would we make the change just for writes? If GetCapabilities doesn't say it accepts SHA and BLAKE, the client will cancel the request if it uses a different algorithm.
This could be an internal patch until support for multiple hash functions is enabled for all APIs.
Please make sure these checks pass.
https://buildkite.com/bazel/buildfarm-farmer/builds/8542
  // It is always calling the same isValidHexDigest function with the same parameters.
  // So it will always either return SHA256, or throw an exception.
  // With the new method we still can't distinguish between SHA256 and BLAKE3, but at least it's better...
This doesn't seem correct. Yes, it calls the same function, but on a different instance of DigestFunction, so for each call the value of hash.bits will be different.
You may leave this code as it is, as it doesn't change the code behavior.
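A minimal sketch of the check being described, assuming DigestUtil wraps a Guava HashFunction whose bits() varies per instance:

  // hash is e.g. Hashing.sha256(), Hashing.sha1(), or Hashing.md5(), so
  // hash.bits() is 256, 160, or 128 depending on the DigestUtil instance.
  boolean isValidHexDigest(String hexDigest) {
    // 4 bits per hex character, so the expected length is bits() / 4.
    return hexDigest != null && hexDigest.length() * 4 == hash.bits();
  }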
Feature to support multiple hash functions, as described here.
There are some uncertain parts that I have flagged in comments for now. Once discussed, I will remove those and adjust the unit tests before a potential merge.