Add Extrapolation to Bernstein Polynomial Flow #36
Comments
Hello @MArpogaus, thanks for the feature request and PR! I am curious about the issues you had with gradients; normally the `torch.where` statement should prevent them. A solution could be to cast the out-of-bounds inputs to 0.5. Also, how did you handle the inverse transformation with the extrapolation? I guess you would also need to check whether the value lies within the bounds.
I had a similar problem in my TF implementation, and casting to 0.5 helped. Is it preferred to use autograd over analytical Jacobians? Additionally, in the sigmoidal case, we discovered numerical issues, since the derivative of the sigmoid, sigma(x) * (1 - sigma(x)), converges towards zero for x -> +/- Inf. Hence, we decided to add a workaround for that case.
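A minimal sketch, not from the thread, of the underflow described above: for large inputs the sigmoid saturates to exactly 1.0 in floating point, so its derivative sigma * (1 - sigma) becomes exactly zero and any log-derivative term built from it diverges.

```python
import torch

x = torch.tensor([0.0, 10.0, 40.0], requires_grad=True)
y = torch.sigmoid(x)
y.sum().backward()

print(y)                  # the last entry saturates to exactly 1.0
print(x.grad)             # sigma * (1 - sigma): the last entry underflows to exactly 0.0
print(torch.log(x.grad))  # log-derivative terms become -inf
```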
Good point! I did not think of the inverse yet, but of course, the bisection is limited to the bounds.
An analytical Jacobian is fine (if it is exact), but if its output does not match autograd, something is likely wrong. Can you provide a small example demonstrating the incorrect behavior?
I see, but is it problematic in practice? Neural network layers expect standardized features (zero mean, unit variance), so the features are never too large. Anyway, if you have a working implementation with the extrapolation, we can drop the sigmoid mapping.
This actually solved the problem; my binary mask was wrong in the first place. I have now added a fix.
I have added an additional option to get a smooth transition into the extrapolation by enforcing the second-order derivative to be zero on the bounds (b2869d4). @oduerr what do you think? Do we need such an option?

I was also thinking about adding another optional argument to specify the codomain of the polynomial. First minimal working example:

```python
import torch

M = 10      # number of unconstrained parameters (assumed here)
eps = 1e-3  # small positive margin, stands in for bpoly.eps

low = -3
high = 33
positive_fn = torch.exp

theta = torch.rand(size=(M,))  # creates a random parameter vector
theta_low = torch.tensor([low], dtype=theta.dtype)    # - theta[:1].exp() optionally allows flexible bounds
theta_high = torch.tensor([high], dtype=theta.dtype)  # + theta[-1:].exp()

diff = torch.nn.functional.softmax(theta, dim=-1)  # use theta[1:-1] if flexible_bounds=True
diff *= (theta_high - theta_low) - 2 * eps

constrained_theta = torch.cat(
    [torch.cumsum(torch.cat([theta_low, diff], dim=-1), dim=-1), theta_high],
    dim=-1,
)
```

This is especially useful when chaining several transforms or using a base distribution with limited support.
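A quick check, using the names from the snippet above: the constrained coefficients are strictly increasing and hit the requested codomain exactly at both ends.

```python
assert torch.all(constrained_theta[1:] > constrained_theta[:-1])  # strictly increasing
assert constrained_theta[0] == low     # lower bound reached exactly
assert constrained_theta[-1] == high   # upper bound reached exactly
```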
Nice! I would actually enable this by default if it is not too expensive.
Why not! However, instead of specifying the co-domain, it is probably better to fix the co-domain to the domain. Note that implementing the …
OK, I enabled it by default in a recent commit.
I can add it like this for now, but I certainly need asymmetric bounds for my own models.
Thanks for the hint. I'll make it conditional, depending on a new argument.
This is not possible, as the transform is not initialized when entering the BPF constructor. @francois-rozet What is your opinion?
Just pushed a first draft: 13ead7d
What I had in mind is to make the Bernstein transform such that it always (not as an option) maps the bounded interval onto itself. This is what the rational quadratic spline transform does: it maps [-B, B] to [-B, B] and extrapolates linearly outside. In addition, the derivatives at the bounds are 1, so that the extrapolation is the identity. Note that if you need another (asymmetric) domain or co-domain, you can always combine the transformation with a (monotonic) affine transformation. What do you think?
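To spell out why unit derivatives at the bounds give identity tails (my notation, not from the thread): if $f(\pm B) = \pm B$ and $f'(\pm B) = 1$, then for $x > B$ the linear extrapolation reduces to

$$
f(x) = f(B) + f'(B)\,(x - B) = B + (x - B) = x,
$$

and symmetrically $f(x) = x$ for $x < -B$, so the extrapolated transform is continuous and has derivative 1 (zero log-det contribution) outside the bounds.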
OK, that is basically what I implemented, but I did not yet enable it by default. What is your opinion on chaining flexible transformations? Currently, the …

If we want, we could also force the derivative to 1 on the bounds to ensure we get the identity function when extrapolating. It could look something like this:

```python
import math

import torch
from torch import Tensor


def _constrain_theta(unconstrained_theta: Tensor, bound: float) -> Tensor:
    r"""Processes the unconstrained output of the hyper-network to be increasing."""
    if bound:
        theta_min = -bound * torch.ones_like(unconstrained_theta[..., :1])

        def fn(x):
            return torch.cat(
                (
                    torch.ones_like(unconstrained_theta[..., :2]),  # f'(0) = 1, f''(0) = 0
                    torch.nn.functional.softmax(x, dim=-1) * (2 * bound - 4),
                    torch.ones_like(unconstrained_theta[..., :2]),  # f'(1) = 1, f''(1) = 0
                ),
                dim=-1,
            )
    else:
        shift = math.log(2.0) * unconstrained_theta.shape[-1] / 2
        theta_min = unconstrained_theta[..., :1] - shift
        unconstrained_theta = unconstrained_theta[..., 1:]
        fn = torch.nn.functional.softplus

    widths = fn(unconstrained_theta)
    widths = torch.cat((theta_min, widths), dim=-1)
    theta = torch.cumsum(widths, dim=-1)
    return theta
```

However, I would personally prefer to keep an option for a simple "ordered" constraint function in the library, as we had in the beginning, for less restrictive setups.
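A quick, hypothetical check of the `_constrain_theta` sketch above, with an assumed 16-dimensional hyper-network output and `bound=5.0`: the first and last coefficients land on -bound and approximately +bound, since the four boundary widths of 1 plus the softmax widths sum to 2 * bound.

```python
unconstrained = torch.randn(4, 16)  # assumed batch of hyper-network outputs
theta = _constrain_theta(unconstrained, bound=5.0)

print(theta.shape)    # torch.Size([4, 21]): theta_min + 4 boundary widths + 16 softmax widths
print(theta[..., 0])  # exactly -5.0 for every batch element
print(theta[..., -1]) # approximately +5.0: total width is 4 + (2 * bound - 4) = 2 * bound
```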
In multivariate flows, it is generally necessary to chain several multi-variate (think autoregressive or coupling) transformations, even if the univariate transformations are very expressive. And even if a single multi-variate transformation is used, you should ensure that its co-domain covers the support of the base distribution.
You mean the one currently in the lib, or one with extrapolation + smooth bounds? Anyway, instead of adding more and more options to the class, you can also sub-class it (e.g. with a bounded variant).
Sorry for the late reply. We (@MArpogaus and I) just had a discussion and are a bit puzzled by your point that chaining several multivariate transformations is generally necessary:
We thought that if the 1-D transformation function is flexible enough, utilizing AR flows allows us to express even complex distributions due to the chain rule of probability, see also [1]. While we still believe this is true in theory, we observed much better performance on the UCI benchmarks when we did chaining. Do you have an intuition why one gets better performance when chaining, or do you even know a paper? @MArpogaus will implement two classes as you suggested soon (in the following week): one unbounded and one bounded in which the coefficients are scaled. Sorry for the delay.

[1] G. Papamakarios, E. Nalisnick, D. J. Rezende, S. Mohamed, and B. Lakshminarayanan, “Normalizing Flows for Probabilistic Modeling and Inference,” Journal of Machine Learning Research, vol. 22, no. 57, pp. 1–64, 2021.
Hello @oduerr, no problem, everyone is busy 😄
You are right, a single auto-regressive transformation should be enough if the uni-variate transformation is a universal (monotonic) function approximator. However, the hyper-network (MADE/MaskedMLP) conditioning the transformation must also be a universal function approximator. In practice, the capacity of the hyper-network is finite, and the function it should approximate might be complex depending on the data, the uni-variate function (and its parametrization), and the auto-regressive order. The latter, in particular, can have a huge impact: some orders can lead to very simple functions, while others can lead to almost unlearnable functions. Stacking several multi-variate transformations with different (sometimes randomized) orders helps to alleviate this issue. For example, …
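A hypothetical stand-in example, mine and not from the thread, of how the autoregressive order can change the difficulty of the conditionals the hyper-network must produce:

```python
import torch

# Toy 2-D data: x1 is standard normal and x2 is a noisy copy of x1 squared.
x1 = torch.randn(10_000)
x2 = x1 ** 2 + 0.05 * torch.randn(10_000)

# Order (x1, x2): the hyper-network only has to map x1 to the parameters of a
# simple, unimodal conditional p(x2 | x1) centred near x1 ** 2.
# Order (x2, x1): the conditional p(x1 | x2) is bimodal (modes near +/- sqrt(x2)),
# so both the univariate transform and the hyper-network have to work much harder.
```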
Hello @francois-rozet, thanks for your patience. And many thanks for the intuitive example that makes it very clear!
Hello everybody, I have finally found the time to continue working on this. There are now two alternative versions of the Bernstein transform: the unbounded `BernsteinTransform` and a bounded variant in which the coefficients are scaled to the bounds.

What are your opinions? I would like to discuss two remaining questions: …
This looks really nice! I am curious to see how it compares to the other flows, especially NSF.
Unless you think it is necessary, I am fine with (and actually prefer) only proposing BPF with one of the two variants.
I don't think so. Users can construct their own flow with

```python
Flow(
    transform=ElementWiseTransform(
        features=1,
        ...,
        univariate=BernsteinTransform,
        shape=[(16,)],
    ),
    base=...,
)
```
Hey @francois-rozet, thanks for the fast response! I am fine with it too.
Hello @marcel, thanks a lot for the changes; they look fine. The idea of fixing the coefficients for a bounded transformation is quite clever!
Some early results (figures: dataset, samples, evolution through training, density). @francois-rozet do you have some results from other flows that you can share?
Hello @MArpogaus, here are the results; you can check the run at https://wandb.ai/francois-rozet/zuko-benchmark-2d/runs/5xllo4bz.
Closing as #37 was merged 🥳
Description
As discussed in #32, we have now implemented linear extrapolation outside the bounds of the Bernstein polynomial. The feature becomes active if `linear=True`. Here is a simple Python script to visualize the resulting effect:
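The original script is not reproduced here; the following is a rough stand-in sketch, with all helper names my own rather than zuko's API, of what linear extrapolation outside the bounds of a monotone Bernstein polynomial looks like.

```python
import math

import matplotlib.pyplot as plt
import torch

low, high = -3.0, 3.0
theta = torch.cumsum(torch.rand(9), dim=0) + low  # increasing coefficients, starting near low

def bernstein(x, theta, low, high):
    """Evaluates a Bernstein polynomial with coefficients theta on [low, high]."""
    M = theta.numel() - 1  # polynomial degree
    t = ((x - low) / (high - low)).clamp(0.0, 1.0)
    k = torch.arange(M + 1)
    binom = torch.tensor([math.comb(M, int(i)) for i in k], dtype=x.dtype)
    basis = binom * t[:, None] ** k * (1.0 - t[:, None]) ** (M - k)
    return basis @ theta

def bernstein_prime(x, theta, low, high):
    """Analytical first derivative: a degree M-1 Bernstein polynomial."""
    M = theta.numel() - 1
    dtheta = (theta[1:] - theta[:-1]) * M / (high - low)
    return bernstein(x, dtheta, low, high)

def extrapolated(x, theta, low, high):
    """Polynomial inside [low, high], first-order (linear) continuation outside."""
    inside = bernstein(x, theta, low, high)
    f_lo, f_hi = bernstein(torch.tensor([low, high]), theta, low, high)
    d_lo, d_hi = bernstein_prime(torch.tensor([low, high]), theta, low, high)
    out = torch.where(x < low, f_lo + d_lo * (x - low), inside)
    out = torch.where(x > high, f_hi + d_hi * (x - high), out)
    return out

x = torch.linspace(-6.0, 6.0, 500)
plt.plot(x, extrapolated(x, theta, low, high), label="with linear extrapolation")
plt.plot(x, bernstein(x, theta, low, high), ls=":", label="clamped polynomial")
plt.axvline(low, ls="--", c="gray")
plt.axvline(high, ls="--", c="gray")
plt.legend()
plt.show()
```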
This makes the `BPF` implementation more robust to data lying outside the domain of the Bernstein polynomial, without the need for the non-linear sigmoid function. @oduerr Do you have anything else to add?
Implementation
The implementation can be found in the bpf_extrapolation branch of my fork.
My changes specifically include a `torch.where` statement to handle inputs outside the bounds, and changes to improve numerical stability in the sigmoidal case.