Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement predefined field constraints #178

Merged
merged 8 commits into from
Sep 26, 2024

Conversation

jchadwick-buf
Copy link
Member

@jchadwick-buf jchadwick-buf commented Sep 12, 2024

  • Adds the ability to specify an ExtensionRegistry and a TypeRegistry for resolving protobuf messages. Ordinarily, only a TypeRegistry would necessarily be needed. However, we need to be able to resolve extensions defined in file descriptor sets we don't control, which means we need to be able to reparse to and from the user's descriptors in the worst case: to the user's descriptors to get the extended rule message (whose message type descriptors may have a different hashcode and thus may not resolve using just an ExtensionRegistry alone) and back from the user's descriptors in order to parse the priv/shared field.
  • Refactors some of the code around reparsing and extensions in general:
    • Reparsing options for protovalidate built-ins will always use a static extension registry.
  • Adds the rule variable.
    • Some refactoring is done around the individual rule compilation, since the code was getting a bit unwieldy.
  • Updates the conformance runner to generate an ExtensionRegistry and a TypeRegistry. This enables the conformance runner to pass both the old conformance test suite and the new one, regardless of whether the proto descriptors match up.

TODO:

  • Update to new version of protovalidate protos when they are merged.

This will depend on bufbuild/protovalidate#246.

@jchadwick-buf jchadwick-buf marked this pull request as draft September 12, 2024 14:14
@jchadwick-buf jchadwick-buf marked this pull request as draft September 12, 2024 14:14
@jchadwick-buf jchadwick-buf marked this pull request as ready for review September 19, 2024 17:51
@jchadwick-buf
Copy link
Member Author

We still need to synchronize this to the latest version of protovalidate, but that is held up for now. This can be reviewed/merged earlier if desired; it should pass all relevant conformance tests, except for the boolean rules fixes, because those would require the internal set of proto descriptors to be up-to-date. I'm going to push for reviews now so we can get this closed up as soon as possible.

@jchadwick-buf
Copy link
Member Author

OK, it is now synchronized to Protovalidate v0.8.1 and should be ready to go.

@jchadwick-buf jchadwick-buf removed the request for review from pkwarren September 24, 2024 14:15
@jchadwick-buf
Copy link
Member Author

I decided to go over this PR today and made some adjustments.

Updated to add an allowUnknownFields option and have unknown fields behavior match other runtimes closer. Added a test that tests predefined fields with dynamic descriptor sets. Made a slight adjustment to TypeRegistry behavior so that it's possible to fall back to the built-in descriptors when appropriate (e.g. when using an ExtensionRegistry with generated code.)

@jchadwick-buf jchadwick-buf requested a review from jhump September 24, 2024 17:19
Copy link
Member

@jhump jhump left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is some fairly complicated stuff, but I think I now understand it all and it seems mostly good.

I think there are some unnecessary uses of extensionRegistry in a few places, and likely some things we could to make it less error-prone (like maybe the ability to auto-create a correct type registry?).

@@ -70,4 +73,29 @@ static Map<String, Descriptors.FileDescriptor> parseFileDescriptors(
}
return fileDescriptorMap;
}

static TypeRegistry createTypeRegistry(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this function and the one below it want to be recursive, in case there are ever messages or extensions nested inside other messages.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The one below it probably does, but createTypeRegistry doesn't: TypeRegistry.add does its own recursion.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I have addressed this for the latter function (as mentioned above it is not needed for TypeRegistry.)

*
* @return if allowing unknown constraint fields is enabled
*/
public boolean isAllowUnknownFieldsEnabled() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a kind of awkward name. Maybe isAllowingUnknownFields? Or maybe even just allowUnknownFields?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with isAllowingUnknownFields. Personally I prefer allowUnknownFields but aside from the fact that this conflicts with the variable name I think it is just going against Java convention to have accessors named that way. Y'know, when in Rome...

* @param typeRegistry the type registry to use
* @return this builder
*/
public Builder setTypeRegistry(TypeRegistry typeRegistry) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really clear on why we need this. Is it for performance? Because when we have a descriptor with options, we also have the full transitive dependency graph for that proto and could use that to build these registries. I guess if we built them on demand, we might end up allocating/building numerous registries and re-processing (re-adding across multiple registry instances) the types from common imports. But we only ever need to do this once per FileDescriptor, so is a one-time cost.

Maybe the behavior of null/unset registries might be to do the above, so if you only have to set them as a performance optimization? And we could even provide helpers, similar to what you have in the conformance package, that take a set of file descriptors and crawl them to produce the type and extension registries. That way people don't have to really compute them on their own -- can use the helper to compute them once up-front.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might actually be possible to do this but it's sort of complicated since we're jumping back and forth between user protos and our own. But actually, that's not my main concern. I have really two problems:

  1. TypeRegistry is immutable. Once it is constructed, it is done. Therefore, incrementally building a TypeRegistry is always inefficient. We can, of course, build our own TypeRegistry which is essentially just a Map, but that smells. And it gets worse, because,
  2. The behavior of conflicting file descriptors will depend on the order in which you try to validate messages. To me this is just too confusing.

Bear in mind that this situation (unknown extensions) should realistically only be a problem for users that are dealing with dynamic file descriptor sets. In many cases users that are dealing with dynamic descriptor sets might have to deal with multiple different dynamic descriptor sets, especially if they're doing some kind of multi-tenant service. By exposing the TypeRegistry we make it explicit: The user controls what resolves to what Descriptor, and you need a new Validator for a new TypeRegistry.

I can definitely see a case for us providing better helper functions. However, the helper functions for this would really not be terribly protovalidate-specific and so I opted to keep the helpers in our private code in the conformance test suite, which users could conceivably copy if they wanted our behavior. Users who are dealing with dynamic protocol buffers likely need to build an ExtensionRegistry and/or a TypeRegistry anyhow, and probably pass it to other libraries or use it in their own code.

Of course I am open to trying to improve this situation but I opted to do things the dumb way, because it's already pretty confusing to begin with, and adding more magic on top of it seems like it is just not the greatest user experience.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good point that we could just provide our own type since we're not actually providing the TypeRegistry to any API that requires it.

The behavior of conflicting file descriptors will depend on the order in which you try to validate messages. To me this is just too confusing.

If we do it right, there is no such thing as conflicting descriptors. A single descriptor is not valid if it has conflicts within its own transitive dependency graph. So we really want a tree of maps -- each node has a map of the messages in that file, and then it has tree links to its imports. So finding a type could require O(n) time (where n is number of files in transitive graph) instead of O(1), but it wouldn't be subject to conflicts.

Sadly, we can't really do this with extension types since ExtensionRegistry is a concrete type that is required by APIs we're using.

Note that I wasn't suggesting incrementally building a single shared registry but rather to build a new registry per file descriptor. I understand that could be a perf issue, but since it's only done once per file per cache (and there will likely be a single cache for most programs plus fixed number of files with generated message types) it is likely to be perfectly acceptable.

Copy link
Member Author

@jchadwick-buf jchadwick-buf Sep 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, I just realized that our own tests illuminate a case where this won't work.

Consider the following:

    ExtensionRegistry registry = ExtensionRegistry.newInstance();
    registry.add(isIdent);
    TypeRegistry typeRegistry =
        TypeRegistry.newBuilder().add(isIdent.getDescriptor().getContainingType()).build();
    Config config = Config.newBuilder().setTypeRegistry(typeRegistry).setExtensionRegistry(registry).build();

This obviously works just fine because the TypeRegistry and the ExtensionRegistry are synced with each-other. However... if we get rid of the TypeRegistry, this test can never work!

This is because this test does not actually go across two different Descriptor types like one might assume. It goes across three, each with their own distinct hashCode()s.

  • The first one is build.buf.validate.StringRules.
  • The second one is com.example.imports.buf.validate.StringRules. We have a second copy of the gencode to specifically test this case (dynamic proto of separate gencode.)
  • The third one is a dynamically-resolved com.validate.imports.buf.validate.StringRules - crucially, extensions aren't resolved in nested types. This means e.g. buf.validate.predefined fields are not resolved inside of field options, which results in a different hashCode() everywhere!

Our code already intends to handle this case (dynamic message parsed from potentially distinct file descriptor; we test it for other cases that don't involve user-defined extensions). This puts us in a tricky position, because we really want to reparse the extensions using the extension types, not the types that happen to be attached to the descriptor of a message, which may be dynamic or have a different gencode than whoever registered the extensions.

The only other alternative I can think of to explicitly having the user register a TypeRegistry is if we enumerate their ExtensionRegistry to find the types instead, but it seems like it just keeps getting worse the harder you try:

  • There's no great way to enumerate all extensions in an ExtensionRegistry so what you actually have to do is enumerate the extensions of each possible message that you might want to query extensions for.
  • There's no guarantee that there's only one Descriptor for a given full name stored in the ExtensionRegistry - it will happy return a list with extensions that have many distinct Descriptor types. We would probably need to throw an exception in this case since there's not much reasonable to do with it.

At this point I'm torn. The test case is indeed a bit convoluted, but it's convoluted on purpose. With the TypeRegistry it works for fairly obvious reasons. Without it, it doesn't work for reasons that I found very unintuitive.

What would be nice is if we could just have the ExtensionRegistry key on the full name of the descriptor instead of the descriptor identity, but after looking around for too long I've determined there's nothing we can really do on that front other than things that I'm pretty sure we shouldn't do.

this.evaluatorBuilder =
new EvaluatorBuilder(
env,
config.isDisableLazy(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just pass the whole config through, since it seems to need every field?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, that's a good idea. I just made the same change for ConstraintCache as well since there's really no obvious downside to doing this.

Comment on lines 176 to 178
if (descriptorMap.containsKey(constraintFieldDesc)) {
return descriptorMap.get(constraintFieldDesc);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just use get and then return if the value is non-null, instead of querying the map 2x.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, good point. (I didn't actually change this line, it just got relocated, but there's no sense in keeping it like this, since it's not relying on null values in the map or anything like that.)

@@ -165,11 +172,10 @@ private void buildMessage(Descriptor desc, MessageEvaluator msgEval)
try {
DynamicMessage defaultInstance =
DynamicMessage.newBuilder(desc)
.mergeFrom(new byte[0], EXTENSION_REGISTRY)
.mergeFrom(new byte[0], extensionRegistry)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty sure this line does nothing. Remove?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. Removed.

Comment on lines +298 to +299
Descriptors.Descriptor expectedConstraintMessageDescriptor =
typeRegistry.find(expectedConstraintDescriptor.getMessageType().getFullName());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the purpose of this. Why would we need touse the descriptor in a provided type registry instead of the descriptor we already have?

This appears to be the only use of typeRegistry, so I'm wondering if it's even needed.

Aside: oneofFieldDescriptor is an awful name. I was so confused reading this, thinking it might be the descriptor for OneofFieldConstraints. But, IIUC, it really means the descriptor for a field inside the type oneof on FieldConstraints, right? Might be nice to rename it... typedConstraintFieldDescriptor?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I re-read the PR description and think I understand now: we have to use the provided typeRegistry in case its version of the message descriptor has a different hash code (so it's queryable from a map inside extensionRegistry I think).

That is really subtle and I think that explanation deserves to be in the code here, not just in a PR description/commit message.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A comment explaining it has been added. I agree with you that it definitely can't just live in the PR comment although I put most of the effort into explaining what's going on here over in the setTypeRegistry documentation.

Comment on lines 215 to 216
build.buf.validate.PredefinedConstraints.parseFrom(
((MessageLite) extensionValue).toByteString(), extensionRegistry);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PredefinedConstraints does not have any extensions (or nested extendable message fields), so I think the extensionReigstry here is unnecessary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, removed extensionRegistry here.

this.env = Objects.requireNonNull(env, "env");
this.constraintCache = Objects.requireNonNull(constraintCache, "constraintCache");
this.cache = new HashMap<>(previousCache);
this.extensionRegistry = Objects.requireNonNull(extensionRegistry, "extensionRegistry");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This field appears to be unnecessary. It seems to only be used by calls that parse zero bytes (so never any extensions to actually parse). I assume we do that in order to build a partial message in the off-chance that any message descriptor we have has required fields (in which case the simple case of creating a new empty message would throw).

Note: if I am right and we can remove this, we can also remove from the enclosing type, which only has an extensionRegistry to pass to this (AFAICT).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you're actually right. I realized this at some point and never looped back to fix it. (Frankly, I'm rather impressed you picked this one up during a code review. Very nice attention to detail.)

Copy link
Member Author

@jchadwick-buf jchadwick-buf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think I have addressed all of the feedback now. Most of the things I have trivially resolved.

The main sticking point for me is how to handle the TypeRegistry bits. It would be nice for congruent behavior with other runtimes if we could avoid needing the TypeRegistry part but I think that it's not trivially possible to avoid it.

@@ -70,4 +73,29 @@ static Map<String, Descriptors.FileDescriptor> parseFileDescriptors(
}
return fileDescriptorMap;
}

static TypeRegistry createTypeRegistry(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I have addressed this for the latter function (as mentioned above it is not needed for TypeRegistry.)

Comment on lines +298 to +299
Descriptors.Descriptor expectedConstraintMessageDescriptor =
typeRegistry.find(expectedConstraintDescriptor.getMessageType().getFullName());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A comment explaining it has been added. I agree with you that it definitely can't just live in the PR comment although I put most of the effort into explaining what's going on here over in the setTypeRegistry documentation.

/** Config is the configuration for a Validator. */
public final class Config {
private static final TypeRegistry DEFAULT_TYPE_REGISTRY = TypeRegistry.getEmptyTypeRegistry();
private static final ExtensionRegistry DEFAULT_EXTENSION_REGISTRY =
ExtensionRegistry.newInstance();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

*
* @return if allowing unknown constraint fields is enabled
*/
public boolean isAllowUnknownFieldsEnabled() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with isAllowingUnknownFields. Personally I prefer allowUnknownFields but aside from the fact that this conflicts with the variable name I think it is just going against Java convention to have accessors named that way. Y'know, when in Rome...

this.evaluatorBuilder =
new EvaluatorBuilder(
env,
config.isDisableLazy(),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, that's a good idea. I just made the same change for ConstraintCache as well since there's really no obvious downside to doing this.

Comment on lines 176 to 178
if (descriptorMap.containsKey(constraintFieldDesc)) {
return descriptorMap.get(constraintFieldDesc);
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, good point. (I didn't actually change this line, it just got relocated, but there's no sense in keeping it like this, since it's not relying on null values in the map or anything like that.)

build.buf.validate.PredefinedConstraints constraints = getFieldConstraints(constraintFieldDesc);
if (constraints == null) return null;
List<Expression> expressions = Expression.fromConstraints(constraints.getCelList());
List<CelRule> celRules = new ArrayList<>();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea, done.

Comment on lines 215 to 216
build.buf.validate.PredefinedConstraints.parseFrom(
((MessageLite) extensionValue).toByteString(), extensionRegistry);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, removed extensionRegistry here.

this.env = Objects.requireNonNull(env, "env");
this.constraintCache = Objects.requireNonNull(constraintCache, "constraintCache");
this.cache = new HashMap<>(previousCache);
this.extensionRegistry = Objects.requireNonNull(extensionRegistry, "extensionRegistry");
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you're actually right. I realized this at some point and never looped back to fix it. (Frankly, I'm rather impressed you picked this one up during a code review. Very nice attention to detail.)

@@ -165,11 +172,10 @@ private void buildMessage(Descriptor desc, MessageEvaluator msgEval)
try {
DynamicMessage defaultInstance =
DynamicMessage.newBuilder(desc)
.mergeFrom(new byte[0], EXTENSION_REGISTRY)
.mergeFrom(new byte[0], extensionRegistry)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. Removed.

Copy link
Member

@jhump jhump left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a shame about needing users to provide type and extension registries for any custom extensions. But the Protobuf runtime's APIs here for Java really make this hard to do otherwise. @jchadwick-buf and I discussed a couple of alternatives that could make it so users don't have to supply anything extra, but they are all much more complicated to implement and test and possibly brittle/risk in some ways, so it seems like maybe we could merge as-is and noodle on it more later.

So this LGTM. Hopefully 🤞 we can find a solution to not need user-provided registries before a v1.0. But if not, c'est la via.

@jchadwick-buf
Copy link
Member Author

Yeah, I kept wrestling with it a bit more because it really is a shame that we have to expose this complexity, but at the very least we can now say that we did an extraordinary amount of diligence before just accepting it, so if we missed anything obvious it wasn't for lack of trying.

Thanks for the very thorough review. This wound up being a challenging patchset to write and it was great to have a very in-depth review, I'm pretty confident in the shape this is in now at least.

@jchadwick-buf jchadwick-buf merged commit 7345e69 into main Sep 26, 2024
4 checks passed
@jchadwick-buf jchadwick-buf deleted the jchadwick/shared-field-rules branch September 26, 2024 17:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants