Proposal: Validating WebHooks for Create/Update/Delete of Instances & Bindings #572
Comments
What would happen in case of mass deletions, like deletion of a CF space?
Hey @jboyd01. You mentioned yesterday that if a broker says "yes, you can delete this instance" but the delete then still fails, service-catalog is planning on implementing something to provide an adequate UX in this scenario. This seems fairly likely, since the validate request is really just a "best guess" by the broker. What happens today in this case, and how are you planning to handle this scenario? If a deprovision request fails in CF, then the user has a few options:
@edbartsch wrote:
Yes, but that's already the case today without the precheck, just on the Delete operation itself. If the delete is rejected, the namespace sits in a "terminating" state until the instances and bindings are removed. I don't believe we'd see any different behavior here: the delete instance request comes in, the platform executes the precheck validation and is told by the Broker that it will fail, and the platform then tells CF or Kubernetes that the delete was rejected. With Kubernetes, when you delete a namespace, the namespace is set to "terminating" and will stay that way until the instances and bindings are successfully deleted.
I think you are suggesting that if you have two service instances, A and B, and A can't be deleted until B is, the mass deletion should result in retrying the deletion of A once B is removed; do I have this correct? I'll have to experiment with this in Kubernetes. You raised a good angle to consider here, thanks! I'll add this scenario as we move forward.
@mattmcneeney: This is a recognized pain point. We have a GitHub issue that reports the problem and the maintainers agree we need to address it; there has been discussion along with some potential ideas thrown around, but we do not have an agreed-upon solution yet. Whatever actions we take around this will likely be multipronged, each addressing a specific scenario. Your three options look to be similar to our choices in Kubernetes. I'd say option 2 is likely the first action users will take (review the error returned by the broker). Option 3 is presently a manual process for our users involving editing the resource and removing the Service Catalog "finalizer", which allows deletion of the Service Catalog object without doing the deletion at the Broker. We've got a PR to add this "nuclear" or "abandon" command to make this easier for the user, while stressing to the user that it may result in Broker resources they will continue to be charged for. The prechecks alleviate the pain for the class of issues that result in denial from ACL or business logic validation. I agree this isn't a magic bullet, but it addresses a sizable segment of the issues.
I thought svc-cat's vote ended the discussion of alternative solutions - am I mistaken?
But in Kube we don't allow them to do anything about it (like downgrade the plan, per Matt's suggestion), and that's a big difference.
I'd like some data behind this assertion, because I would claim the opposite. From an OSBAPI perspective, a new operation like this can cause more harm than good. There is no clear definition of what the broker is supposed to do, or what kind of guarantee the response carries (each broker/instance will have its own list of checks it can do without actually performing the ultimate operation), so it's ambiguous what the Platform can assert from a "yes" (or "no") response. Any response from a check type of operation might change right after the check is completed, which means the Platform can't actually make any definitive decision based on it; the Platform must always be prepared to deal with the real operation saying "no". A "yes" from the check is virtually meaningless in the face of this reality. The only way for the Platform to know for sure about the results of the real operation is to actually attempt it. Anything else is just a guess, and therefore the Platform taking any actions based on it is questionable at best, and perhaps dangerous at worst if money or serious business logic is at stake (e.g. being given the wrong answer and being forced to prematurely delete a resource because of it). From an OSBAPI spec perspective, this new operation would provide no additional value over just calling the real operation, but would open the door for a "guess" type of operation, which is not appropriate for a formal API spec. Again, this is just from a straight OSBAPI spec perspective.
@jboyd01 A thought about the rationale for the change. I think the underlying driver for this proposal boils down to the platform not having the ability to change the state of an object back to its original version (pre-update / pre-delete) if the operation fails on the broker. To solve this issue, the proposal adds a pre-check before the platform does the operation.
As a side note: if we did add an op like this, we'd need to support an async version of it, since the processing behind the check could be complicated and time consuming; this is pretty much why we added async versions to most of the OSBAPI ops. This means that K8s would need to support an async call to the op, which I'm not sure is possible given the synchronous nature of how K8s is expected to deal with webhooks and the timeout constraints on the original request coming into the K8s apiserver.
I think this adds complexity without providing a 100% solution. I know that seems like a high bar, but if we already have to have a recovery case, and would still have to have one afterwards, then I would choose the solution with fewer complications plus the recovery case. I think we need to ask ourselves: "Does this result in enough fewer problems that it is worth the extra complication of implementing, running, and maintaining?" I've tried to write up the cases so we can see everything at once, starting with the existing cases.
New cases are 1-3; 4 and 5 are the original cases.
@duglin, in response to #572 (comment)
This was in specific reference to the scenario where, despite the validating webhook passing, the broker encounters some error and we end up with a failed delete. How do we deal with these? We've briefly discussed (and put on hold) elevating the issue to an admin "queue", effectively moving the issue out of the user's domain and making it something the admin has to deal with. There may be other options we could pursue here. I agree, the svc-cat vote ended discussions about changing the way Kubernetes delete works.
+1 to @MHBauer's comments about how, no matter what, Kube will need to deal with what he calls "enter recovery". Platforms using a pre-check op will not avoid this need, so adding something to our spec which doesn't actually solve the problem (but adds complexity) doesn't make a lot of sense to me.
If we view this proposal as a precheck done by the platform before doing the actual operation, it does not make much sense; I fully agree. But we could adjust the proposal in the following way: we could go for exposing the check as an explicit, non-destructive operation that users and user interfaces invoke themselves, rather than something the platform runs implicitly.
This would mean, for example for CF, that new commands / cloud controller APIs would be introduced for such checks.
@edbartsch given that we have the ability for a broker to return the schema of the parameters, I would think that would cover most of the trivial checks that might be needed, which means they can be done by the platform itself without talking to the broker. And for cases where a simple schema check isn't sufficient, it seems like the number of times people will want to ask "what if?" about an operation and not actually attempt it is pretty small. Plus, add to that the fact that it's (at best) a guess, and I just don't see the value in adding the complexity when I doubt anyone will use it for anything except as a toy operation. But ignoring all of that, since it's pretty subjective: regardless of whether or not we add this operation, the Platform still needs to deal with the situation of the real operation failing, and allow the end user to continue to use the resource in question, at least for 4xx errors. Given that the main driving force behind this proposal is specifically to try not to deal with that situation in the proper way, it shifts from something I think isn't very useful to something that I consider harmful to the spec's adoption and the Platform's users.
@edbartsch From a CF perspective, I think it is very unlikely that we'd add commands like those. We could add logic to the Cloud Controller to automatically get an indication of whether or not the real request would succeed or fail, but I don't believe we would get any benefit from writing that additional logic.
@mattmcneeney The checking REST calls are mainly interesting for humans and user interfaces, as they introduce a non-destructive way to check the validity of operations to be executed. The JSON schema definition for parameters covers only a small subset, so there is still room for delegating complex checks to service brokers (e.g. "Are certain combinations of parameters applicable to the current service instance?"). But for every problem there are usually multiple options for solving it. #570 solves the requirements of this proposal in a somewhat different way (by allowing brokers to say NO during operation execution). Therefore, I agree with the others that this proposal (572) is going in the wrong direction and should be rejected, because pull request #570 solves the requirements in a different way.
This isn't the right course of action for Kube; validating admission webhooks are meant to give users of Kubernetes (and people who build software using Kubernetes extension mechanisms) a chance to approve or reject an operation on a Kubernetes resource before that change is accepted. Calling the actual operation in a webhook isn't a fit for the state reconciliation concept, in that Kubernetes controllers reconcile the accepted declarative state from the user. The problem this is intended to solve is allowing brokers to reject a request that should not be accepted into the system. Today, there is no mechanism we can use in Kubernetes to validate that a particular change to a ServiceInstance resource, for example, is actually allowed. This proposal is meant to provide such a mechanism, allowing the Kubernetes catalog to reject changes to service-catalog resources that would place them in a state the broker is known to reject. For example, if an update to a ServiceInstance would result in an invalid combination of plan and parameters, the broker would have a chance to reject that update before it is accepted into a Kubernetes API. Kubernetes APIs have the characteristic that they represent the user's desired state for the system. By accepting a particular state of a resource into Kubernetes, the system has told the user that their desired state is acceptable. Therefore, you can think of the proposed change as allowing brokers to participate in API validation and provide feedback to the user before the system has accepted an invalid state.
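For readers less familiar with the mechanism: a validating admission webhook answers the apiserver with an AdmissionReview verdict before the change is persisted. Below is a minimal sketch, assuming the admission.k8s.io/v1 contract; the `broker_precheck` helper in the trailing comment is a hypothetical stand-in for the OSB validation call discussed here, not anything defined by this proposal.

```python
def admission_response(review: dict, allowed: bool, reason: str = "") -> dict:
    """Build the AdmissionReview reply for a validating webhook.

    Echo the request uid and say allowed/denied; a denial is surfaced to the
    user as an API error before the ServiceInstance/ServiceBinding change is
    persisted by the Kubernetes apiserver.
    """
    response = {"uid": review["request"]["uid"], "allowed": allowed}
    if not allowed:
        response["status"] = {"code": 403, "message": reason}
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }

# Hypothetical usage inside the webhook handler:
#   review = <parsed AdmissionReview request body>
#   ok, why = broker_precheck(review["request"]["object"])   # assumed helper
#   reply = admission_response(review, allowed=ok, reason=why)
```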
IMO the proposed URL pattern could be shaped differently. Another alternative option is to add a "dry run" mode to normal create/update/delete OSB requests instead. This is aligned with a Kubernetes feature (kubernetes/kubernetes#11488) that is also supported on the backend, but hasn't been fully implemented yet. This could be dangerous in case an OSB broker ignores the dry-run flag and performs the real operation anyway.
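If the dry-run alternative were pursued, the request might look roughly like the sketch below. This is only an illustration: the `dry_run` flag name, broker URL, GUIDs, and credentials are assumptions, since no such flag exists in OSB today.

```python
import requests

# Purely illustrative: the same update request a platform sends today,
# with a hypothetical dry-run flag asking the broker to validate only.
resp = requests.patch(
    "https://broker.example.com/v2/service_instances/6a9c1f0a",
    params={"dry_run": "true"},  # assumed flag name, not in the spec
    json={"service_id": "service-guid-here", "plan_id": "bigger-plan-guid"},
    headers={"X-Broker-API-Version": "2.13"},
    auth=("platform-user", "platform-password"),
)

# Danger noted above: a broker that ignores the unknown flag would
# perform the real update instead of just validating it.
print(resp.status_code, resp.text)
```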
Per today's call, we will put this one on hold until @jboyd01 indicates it's ready for us to resume our discussions.
Purpose
Provide a mechanism for brokers to register callbacks (webhooks) that can be used for validation prior to the platform attempting to Create, Update or Delete (CUD) a Service Instance or Binding.
There will likely be a lot of discussion on the actual implementation details; initially this proposal will just focus on surfacing the issue and proposing the use of pre-action validation, so Brokers have an opportunity to indicate to the platform whether an action will be accepted for processing. Once the SIG has discussed it and given general agreement, we'll drill into a detailed design.
This feature allows a broker to register webhooks for precheck validation for Instances and Bindings. That is, if indicated by the broker, the Platform will invoke a validating webhook just prior to invoking the actual call to create, update or delete an Instance or Binding. The webhook will be invoked with the same parameters and payload as the actual create/update/delete operation, but this operation is a dry run for Broker validation only. The broker may accept the request and respond with a 200 OK status, or may return an error matching the same set of possible return codes as documented for the actual operation. At this time, the validation must be synchronous.
Platforms are not required to execute precheck validations; if a platform does not, it is expected that the CUD operation itself will fail in the same manner the validation would have.
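To make the flow concrete, here is a minimal, hypothetical sketch of what a broker-side precheck handler could look like. The use of Flask, the route shape, and the `caller_may_delete` policy helper are all illustrative assumptions, not part of the proposal.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def caller_may_delete(instance_id: str, headers) -> bool:
    # Placeholder for the broker's real, non-destructive checks:
    # ACLs, business policy, plan constraints, etc.
    return True

# Assumed path shape; the proposal's exact validate URL pattern is still TBD.
@app.route("/v2/service_instances/validate/<instance_id>", methods=["DELETE"])
def validate_deprovision(instance_id):
    if not caller_may_delete(instance_id, request.headers):
        # Same error shape and status codes as the real deprovision would use.
        return jsonify({"description": "policy forbids deleting this instance"}), 403
    # 200 OK means "we expect the real delete to be accepted"; it is still
    # only a best guess, and the real operation can fail later regardless.
    return jsonify({}), 200
```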
Rationale
Some platforms (Kubernetes being one) create platform metadata that represents the Service Instance and Service Binding. In many cases these objects are created or updated prior to invoking the Broker endpoints that create, update or delete the Broker resource. If the Broker encounters an error or rejects the request, the platform usually leaves the platform object in an error or modified state that no longer reflects the Broker's resource and requires the user or administrator to clean it up or roll it back. Sometimes the platform and user cannot recover and are unable to roll back the object to its prior state. The addition of the precheck allows the platform to fail the user's request early, before any platform metadata objects are modified that would have otherwise required cleanup.
Scenario 1 (Instance Update)
User A's update operation will fail because of some ACL or other business policy check. Prior to validation webhooks, the platform's metadata object is updated with user A's changes but the update operation is marked as failed. If user B then attempts a different update by editing the current object, he'll be working on top of user A's change as well. If user B doesn't realize this and commits the change, the broker will see the full set of updates and will assume they are all being requested by user B; if user B has the privileges, an unintended update may be executed. The precheck adds the ability to verify user A's access and other business checks prior to making any changes to the platform metadata object. With prechecks in place, the platform metadata object would never be updated if user A's precheck for the update operation failed.
Scenario 2 (Instance Deletion)
Today, without validating webhooks, when a user deletes an instance the platform metadata object is irreversibly marked as deleted: even if the Broker rejects the delete, the platform object was marked for deletion and this cannot be undone. Eventually the Platform resource must be deleted and, if necessary, re-created; this entails someone with the required privileges actually deleting the resource with the Platform command, waiting for the Broker to also successfully delete its resources, and then recreating the Instance and any associated Bindings. With a validating precheck, the platform checks early whether the broker is going to reject the operation; if so, the platform can abort the operation and present an appropriate error message to the end user. In that case, with the validation returning the error, nothing has been changed on the platform side and no cleanup is necessary.
High Level Design
Brokers indicate support for precheck validation in their `/v2/catalog` response. Two new validation endpoints are introduced: `/v2/service_instances/validate:instance_id` and `/v2/service_instances/:instance_id/service_bindings/validate:binding_id`. The validation requests are formed exactly like the original target operation, except the platform must not specify `accepts_incomplete=true`. Otherwise the request has identical request headers, parameters, body, and invocation method. I.e., to validate deleting a Binding, the platform will execute an HTTP DELETE against `/v2/service_instances/:instance_id/service_bindings/validate:binding_id`.
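As a worked example of that last sentence (a sketch only; the broker URL, GUIDs, credentials, and API version header value are illustrative assumptions), validating a Binding delete from the platform side might look like this:

```python
import requests

# Same verb, headers, and query parameters as the real unbind DELETE,
# but against the proposed validate path and without accepts_incomplete=true.
resp = requests.delete(
    "https://broker.example.com/v2/service_instances/"
    "instance-guid-here/service_bindings/validate:binding-guid-here",
    params={"service_id": "service-guid-here", "plan_id": "plan-guid-here"},
    headers={"X-Broker-API-Version": "2.13"},
    auth=("platform-user", "platform-password"),
)

if resp.status_code == 200:
    # Broker expects the real unbind to be accepted; proceed with it.
    pass
else:
    # Abort before any platform metadata is touched and surface the error.
    print(resp.status_code, resp.text)
```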
TBD/Needs investigation: