RFC: add support for LU factorization in the linalg extension #627
Comments
Thanks for asking @ogrisel. I had a look through the initial issues which considered the various linalg functions. I think it's an uphill battle for now, since it would require adding LU to the libraries that do not yet have it (notably NumPy). It seems to me that LU decomposition is important enough that it's worth working on, so we could figure out what the optimal API for it would be, and then look at adding it to the spec.
The signatures to consider are SciPy's `scipy.linalg.lu(a, permute_l=False, overwrite_a=False, check_finite=True)` and PyTorch's `torch.linalg.lu(A, *, pivot=True)`.

PyTorch defaults to partial pivoting, and the `pivot` keyword allows turning pivoting off. An LU decomposition with full pivoting is also a thing mathematically, but it does not seem to be implemented anywhere. JAX also has jax.lax.linalg.lu, which only does partial pivoting. So it seems like partial pivoting is the common denominator.

dask.array.linalg.lu has no keywords at all, and no info in the docstring about what is implemented. From the tests it's clear that it matches the SciPy default (`permute_l=False`, i.e. returning `p, l, u`).
For PyTorch, the non-default flavor is only implemented on GPU:

```python
>>> A = torch.randn(3, 2)
>>> P, L, U = torch.linalg.lu(A)              # works
>>> P, L, U = torch.linalg.lu(A, pivot=False)
Traceback (most recent call last):
  Cell In[6], line 2
    P, L, U = torch.linalg.lu(A, pivot=False)
RuntimeError: linalg.lu_factor: LU without pivoting is not implemented on the CPU
```

Its docstring also notes:

> The LU decomposition without pivoting may not exist if any of the principal minors of A is singular.

tl;dr maybe the best way to go is to only implement partial pivoting?
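To make that caveat concrete, a tiny illustration (plain NumPy/SciPy, independent of PyTorch): a well-conditioned matrix whose leading principal minor is singular, so an LU decomposition without pivoting does not exist, while the partially pivoted one is trivial:

```python
import numpy as np
from scipy import linalg

A = np.array([[0.0, 1.0],
              [1.0, 0.0]])   # leading 1x1 minor is 0 -> no LU without pivoting

P, L, U = linalg.lu(A)       # partial pivoting just swaps the two rows
print(P)                     # [[0. 1.]
                             #  [1. 0.]]
np.testing.assert_allclose(P @ L @ U, A)
```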
Maybe we can start with a function with no arguments that always returns P, L, U (that is, only implement SciPy's default behaviour). On the other hand, I think it would be good to have an option to do the non-default variant as well.

Also note the disclaimer in PyTorch's doc quoted above; such a disclaimer should probably be mentioned in the Array API spec.
Note that `scipy.linalg.lu` is a thin wrapper around a compiled routine:

```python
flu, = get_flinalg_funcs(('lu',), (a1,))
p, l, u, info = flu(a1, permute_l=permute_l, overwrite_a=overwrite_a)
```
Only the default (partial pivoting) algorithm, which is implemented in all libraries and for all devices, is added here. gh-627 has details on the no-pivoting case, but it's not universally supported, and the only reason to add it would be that it's more performant in some cases where users know it will be numerically stable. Such an addition can be done in the future, but it seems like a potentially large amount of work for implementers for limited gain. Closes gh-627
@ogrisel I opened gh-630 for the default (partial pivoting) case that seems supportable by all libraries.
Perhaps. The alternative of having an empty permutation returned (as PyTorch does for `pivot=False`) doesn't seem great either. Given that this use case seems more niche, it's not supported by Dask or by PyTorch on CPU, and you don't need it right now in scikit-learn, waiting for a stronger need seems like the way to go here.
We do use the `permute_l=True` case in scikit-learn.
It would be easy to provide a fallback implementation that uses an extra temporary allocation plus a matmul for libraries that do not natively support SciPy's `permute_l=True`. But it's not clear if PyTorch's `pivot=False` is equivalent to SciPy's `permute_l=True` or is doing something different.
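A minimal sketch of such a fallback, using SciPy's current functions for concreteness (the helper name is made up): compute the default pivoted factorization, then fold the permutation into `L` with one extra temporary and matmul.

```python
import numpy as np
from scipy import linalg

def lu_permute_l_fallback(A):
    """Emulate lu(A, permute_l=True) on top of the default P, L, U output."""
    P, L, U = linalg.lu(A)   # A == P @ L @ U
    return P @ L, U          # A == (P @ L) @ U

A = np.random.default_rng(0).standard_normal((20, 10))
PL, U = lu_permute_l_fallback(A)
np.testing.assert_allclose(PL @ U, A, atol=1e-12)
```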
Oops yes, I didn't read well.
It looks like it's doing the exact same thing:

```python
>>> import torch
>>> import numpy as np
>>> from scipy import linalg
>>> t = torch.rand(20, 10, device='cuda')  # must be 'cuda', because `pivot=False` is not supported on CPU
>>> Pt, Lt, Ut = torch.linalg.lu(t.to('cuda'), pivot=False)
>>> Lt.shape
torch.Size([20, 10])
>>> Ut.shape
torch.Size([10, 10])
>>> Pt
tensor([], device='cuda:0')
>>> Ln, Un = linalg.lu(t.to('cpu').numpy(), permute_l=True)
>>> Ln.shape
(20, 10)
>>> Un.shape
(10, 10)
>>> np.testing.assert_allclose(Lt.to('cpu'), Ln)
...
AssertionError:
Not equal to tolerance rtol=1e-07, atol=0
Mismatched elements: 190 / 200 (95%)
Max absolute difference: 56.958973
Max relative difference: 181.88818
>>> (Lt @ Ut - t).max()
tensor(2.1011e-06, device='cuda:0')
>>> (Ln @ Un - t.to('cpu').numpy()).max()
1.1920929e-07
```

So both give valid decompositions with the same shapes for `L` and `U`, but not the same matrices.

One other thing to point out: the non-pivoting case does result in a decomposition with larger errors for PyTorch:

```python
>>> Pt, Lt, Ut = torch.linalg.lu(t)
>>> (Pt @ Lt @ Ut - t).abs().max()
tensor(1.4901e-07, device='cuda:0')
>>> Pt, Lt, Ut = torch.linalg.lu(t, pivot=False)
>>> (Lt @ Ut - t).abs().max()
tensor(2.1011e-06, device='cuda:0')
>>> Pn, Ln, Un = linalg.lu(x)   # x: the same matrix as a NumPy array on CPU
>>> np.abs((Pn @ Ln @ Un - x)).max()
1.1920929e-07
>>> PLn, Un = linalg.lu(x, permute_l=True)
>>> np.abs((PLn @ Un - x)).max()
1.1920929e-07
```

SciPy's pivoting decomposition seems to be a lot faster for small arrays, has similar performance for large arrays when they are square or wide, and is several times slower when they're tall.

Given that scikit-learn does use `permute_l=True`, that use case is worth keeping in mind.
For completeness, some PyTorch GPU timings:

```python
>>> t = torch.rand(1000, 1000, device='cuda')  # square tensor
>>> %timeit torch.linalg.lu(t)
7.11 ms ± 382 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit torch.linalg.lu(t, pivot=False)
1.28 ms ± 2.47 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> t = torch.rand(1000, 50, device='cuda')  # tall tensor
>>> %timeit torch.linalg.lu(t)
2.15 ms ± 54.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit torch.linalg.lu(t, pivot=False)
343 µs ± 6.58 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> t = torch.rand(50, 1000, device='cuda')  # wide tensor
>>> %timeit torch.linalg.lu(t)
1.21 ms ± 2.23 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>>> %timeit torch.linalg.lu(t, pivot=False)
488 µs ± 800 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```

The performance gain is even larger there. So we have a default algorithm that's slower but stable, and an opt-in to a significantly faster algorithm.
@lezcano @nikitaved it'd be nice to get your opinion on this one. I can't find a PyTorch feature request for LU decomposition with `pivot=False` on CPU.
@rgommers, if there is something for the GPU which is missing for the CPU in PyTorch, that is most likely because GPU performance/functionality has much higher value, especially considering the usage in DL applications. However, sometimes PyTorch is too strict and follows suit, like with SVD, IIRC: PyTorch produces the full SVD by default, even though it is not required in most cases, and that hurts GPU performance overall. I, personally, would not mind having the no-pivoting path on CPU as well.
Partial pivoting is often a good compromise between stability and speed when compared to, say, full pivoting or other more general decompositions. For this reason, I don't think there are many practical cases where you would want an LU decomposition without pivoting.
I would not say few; there are plenty of matrices that satisfy such conditions, like full-rank matrices with dominant diagonals. And in most cases diagonals are pre-conditioned anyway. Besides, there are cases where a single decomposition of a non-symmetric matrix leads to solving linear systems whose right-hand sides come in a stream, like in NNs that solve a system during training; then (or meanwhile) the corresponding weight matrix could be decomposed to enable efficient inference.
The diagonal case you mentioned, I don't think is correct, but it's true that there is at least one family of matrices that I know of for which an LU decomposition without pivoting exists, which is column diagonally-dominant matrices. At any rate, in most cases you do want partial pivoting for stability reasons, so yeah. This could be implemented in PyTorch, so it shouldn't stop the array API from standardising it; I just think it shouldn't be the default. Also, in terms of standardisation, I prefer the API we have in PyTorch for these functions.
Agree. Fixed and changed to diagonal dominance, which is what I meant in the first place :) But on the topic of flexibility and performance, my opinion is that having control over pivoting is worth it. And I agree that having partial pivoting as the default is the right call.
I am not sure, and I cannot quickly check since I do not have access to a CUDA GPU at the moment; I am only going by the docstring of torch.linalg.lu. EDIT: the Fortran code for the SciPy LU implementation is here:
Also, in the case of randomized_range_finder, used internally by scikit-learn's randomized_svd, the LU factorization with `permute_l=True` is only used as a cheap way to re-normalize the projected matrix between power iterations.
If you have a tall matrix, what you often want to do is compute the PLU of the transposed matrix and use it to solve the given system. Even more, you very rarely want the explicit `L` and `U` factors at all; you usually just want to solve with the factorization.
Also, thank you for digging this up: #627 (comment). It looks to me that we don't want to include the no-pivoting variant in the standard.
@lezcano I agree, although I am not sure which matrix you are referring to.
Cc @ilayn who is currently making changes to `scipy.linalg.lu`.
@oleksandr-pavlyk I was referring to the packed output of `lu_factor` (the combined LU matrix plus the pivots returned by BLAS).

For these functions, a minimal set of functions from which all the others can be implemented is given by `lu_factor` and `lu_solve`. With those two, you can implement the other variants: SciPy's `lu` (with or without `permute_l`), solving against one or many right-hand sides, and so on.
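For illustration, here is a rough sketch of that reconstruction using SciPy's existing functions as stand-ins for the proposed standard ones (`lu_from_factor` is a made-up helper name):

```python
import numpy as np
from scipy import linalg

def lu_from_factor(A, permute_l=False):
    """Rebuild scipy.linalg.lu-style outputs from lu_factor's packed output."""
    lu, piv = linalg.lu_factor(A)              # packed L\U + LAPACK-style row swaps
    L = np.tril(lu, k=-1) + np.eye(*lu.shape)  # L has an implicit unit diagonal
    U = np.triu(lu)
    P = np.eye(A.shape[0])                     # apply the swaps to build P with A == P @ L @ U
    for i, j in enumerate(piv):
        P[[i, j]] = P[[j, i]]
    P = P.T
    return (P @ L, U) if permute_l else (P, L, U)

A = np.random.default_rng(0).standard_normal((5, 5))
P, L, U = lu_from_factor(A)
np.testing.assert_allclose(P @ L @ U, A, atol=1e-12)
PL, U = lu_from_factor(A, permute_l=True)
np.testing.assert_allclose(PL @ U, A, atol=1e-12)
```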
I suspect that this would cause quite a bit of overhead compared to the current API but maybe it's negligible for large enough data.
It's quite possible that the LU re-normalization trick was written this way (with `permute_l=True`) out of convenience rather than because the permutation itself is needed downstream.
Note that the matrix being re-normalized is often highly rank-deficient. That said, the matrix is also randomized, although the degree of randomization varies. Omitting all pivoting for a rank-deficient matrix whose randomization may be less than perfect sounds potentially numerically unstable to me. But allocating space for a full permutation matrix (rather than just for the permutation defining it) sounds unnecessarily wasteful to me, yup. Maintaining the pivoting, yet without having to form all entries of the full permutation matrix, would make sense, perhaps?
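To put rough numbers on the "wasteful" part, a back-of-the-envelope sketch (not tied to any particular library):

```python
import numpy as np

n = 1000
P_dense = np.eye(n)               # full permutation matrix: n*n float64
p_idx = np.random.permutation(n)  # index representation: n integers
print(P_dense.nbytes // 1_000_000, "MB")  # ~8 MB
print(p_idx.nbytes // 1_000, "kB")        # ~8 kB
```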
I think some of the confusion here is coming from how the pivoted factorization is represented. With that out of the way, my point is that, when doing an LU decomposition with partial pivoting, the `L` factor always has a unit diagonal, so it can be recovered from the packed `lu_factor` output:

```python
X = linalg.lu_factor(A)[0]
L = np.tril(X, k=-1)
np.fill_diagonal(L, 1)   # fill_diagonal works in place
```

If you want to keep the information about the rank of the input matrix, you should keep the `U` factor. But again, all this is deviating from the main topic of conversation. The point here is that all these variants can be implemented efficiently on top of `lu_factor` and `lu_solve`.
One way to cleanly implement this would be to standardize a Linear Operator API as part of the Array API spec, but I guess this is out of scope for now (since it is kind of related to sparse arrays). For reference, there are several implementations of linear operators in the ecosystem, such as scipy.sparse.linalg.LinearOperator.
That's because we don't want LU success to be data-dependent. The LU decomposition is not guaranteed to exist without pivoting. Partial-pivoting failures are only possible by carefully designing pathological arrays, so they virtually don't exist in practice. Full pivoting puts LU into the "very expensive" bucket, and you'd probably be better off with QR anyway. Hence partial pivoting more or less guarantees output; no pivoting makes it a hopeful decomposition.

I don't know what the scope of this array API is, but decomposition and solving are completely different operations; Julia, for example, keeps that separation with its factorization objects. Like I mentioned, we have an explicit remark about this in the docs already, but if you still find it confusing I'd like to fix it right away if you have an alternative wording.
I experimented in a branch of scikit-learn with using the new SciPy API from the development branch. In conclusion, in my opinion the future standard `lu` should return the permutation as an index array rather than as a permutation matrix.

Here is a summary of the typical snippet following the API I would expect in the standard (if we ignore any API backward-compatibility concerns):

```python
X = xp.asarray(np.random.randn(42, 43))
P_indices, L, U = xp.linalg.lu(X)
assert xp.linalg.norm(X - L[P_indices] @ U) < 1e-12
```

This means that SciPy could implement this with its new `p_indices=True` option (modulo the backward-compatibility concerns with its current defaults). Not sure how to deal with backward compat for PyTorch either.
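For reference, the same snippet is already expressible today with SciPy's new `p_indices` keyword discussed above, assuming the convention used in the SciPy docs (`A == L[p] @ U`):

```python
import numpy as np
from scipy import linalg

X = np.random.default_rng(0).standard_normal((42, 43))
p, L, U = linalg.lu(X, p_indices=True)   # p: 0-based row indices instead of a matrix
np.testing.assert_allclose(L[p] @ U, X, atol=1e-10)
```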
It is true that the existing APIs differ between libraries. I would propose then to just have one factorization function, `lu_factor` (with a `p_indices` keyword), together with `lu_solve` for solving with the factorization. Uses of this API are illustrated further down.
Since you folks are working with insane amounts of data, and tools I can't even begin to understand for handling that data, just a few remarks about the speed of LU (or any other linalg routine) in SciPy. One of those remarks: the permutation is best kept as an index array, since fancy indexing lets you apply it without copying. That is to say, instead of

```python
Bp = B[p, :]
for r in range(n):
    foo = Bp[r, :] @ bla
```

you can do

```python
for r in range(n):
    foo = B[p[r], :] @ bla
```

without any data motion on `B`.
While I have the audience: if you have a better name, please suggest it. We considered quite a few other (ugly) candidates and went with `p_indices`.
I think I find `p_indices` fine.
@lezcano I am not sure I understand your proposal in #627 (comment), in particular "pivots returned by BLAS". Could you please illustrate how your API proposal would feel from an end-user perspective?
My proposal accounts for having a `p_indices` keyword:

```python
# p_indices=False
assert xp.linalg.lu_factor(X) == scipy.linalg.lu_factor(X)

# p_indices=True
A, perm = xp.linalg.lu_factor(X, p_indices=True)
assert A == scipy.linalg.lu_factor(X)[0]
assert perm == scipy.linalg.lu(X, matrix_p=True)[2]
```
@lezcano thanks, but alone this is not very user-friendly. I suppose users need extra steps to retrieve the L and U components from A using `xp.tril` / `xp.triu`. So the full snippet would be:

```python
# p_indices=False
A, P = xp.linalg.lu_factor(X)
L = xp.tril(A, k=-1) + xp.eye(A.shape[0], A.shape[1])  # L has a unit diagonal
U = xp.triu(A)
assert xp.linalg.norm(X - P @ L @ U) < 1e-12
```

or:

```python
# p_indices=True
A, p_indices = xp.linalg.lu_factor(X, p_indices=True)
L = xp.tril(A, k=-1) + xp.eye(A.shape[0], A.shape[1])
U = xp.triu(A)
assert xp.linalg.norm(X - L[p_indices] @ U) < 1e-12
```

I find the original suggestion in #627 (comment) way more user-friendly. Maybe we could offer both, that is, still expose the lower-level `lu_factor` as well. Note that if there is a concern about backward compatibility with existing libraries and their default parameters, it's possible to use a different function name. Also: what is the stance of the Array API standardization committee on "soft" polymorphic functions, where the number of arrays in the result tuple does not depend on the value of a function argument but their dtype and dimensionality do?
The full snippet would look like:

```python
# p_indices=False: solve a linear system
A, P = xp.linalg.lu_factor(X)
sol = xp.linalg.lu_solve(A, P, Y)   # proposed companion function
assert xp.linalg.norm(X @ sol - Y) < 1e-12

# p_indices=True: reconstruct the full PLU decomposition
A, p_indices = xp.linalg.lu_factor(X, p_indices=True)
L = xp.tril(A, k=-1) + xp.eye(A.shape[0], A.shape[1])  # L has a unit diagonal
U = xp.triu(A)
assert xp.linalg.norm(X - L[p_indices] @ U) < 1e-12
```

The idea behind this API is based on three principles, elaborated below, but here is where I may be wrong and it may be helpful for other members of the standardisation committee to chip in:
As discussed above, the proposed `lu_factor` (plus `lu_solve`) is enough to express all the other variants. I think the conciseness point here is important, as in PyTorch we used to have several partially overlapping LU functions, and that was confusing.
The suggestion in that comment does not account for the most common case, which is using the LU decomposition to solve a linear system. When it comes to user-friendliness, I believe that the array API is not in the business of offering plenty of auxiliary functions, but of having enough expressibility so that common constructs can be expressed efficiently with it. Auxiliary functions could potentially go into a library that builds on top of the array API. For example, using this explicit construction in your experiments in ogrisel/scikit-learn#16 would save you from instantiating the full permutation matrix. When it comes to efficiency, note that using the NumPy API (that is, without fusion of operations) it's not possible to do better, efficiency-wise, than the described `lu_factor` + `lu_solve` approach.
The proposed API is backward compatible with other libraries, as its defaults are the same as those of SciPy and PyTorch.
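To make the efficiency point concrete, here is the factor-once / solve-many pattern with SciPy's existing `lu_factor`/`lu_solve`, used here purely as stand-ins for the proposed standard functions:

```python
import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 200))

lu_piv = linalg.lu_factor(A)          # O(n^3) factorization, done once
for _ in range(5):                    # right-hand sides arriving one at a time
    b = rng.standard_normal(200)
    x = linalg.lu_solve(lu_piv, b)    # O(n^2) per solve, reusing the factorization
    assert np.allclose(A @ x, b)
```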
Thanks for the explicit summary.
I am not sure what you mean by shape. I would expect the following:

```python
>>> A, p_indices = xp.linalg.lu_factor(X, p_indices=True)
>>> p_indices.dtype.kind
'i'
>>> p_indices.ndim
1
```

while:

```python
>>> A, P = xp.linalg.lu_factor(X, p_indices=False)
>>> P.dtype.kind
'f'
>>> P.ndim
2
```

This is what I meant by "soft polymorphic outputs".
When `p_indices=False`, the second output is the raw LAPACK pivot representation, which is also a 1-D integer array (a sequence of row swaps), not a dense 2-D permutation matrix. If you want to recover the full 2D matrix `P`, you can construct it from the permutation indices, where `P` is just the identity with its rows reordered accordingly.
I now realize that I misunderstood how the LAPACK-style pivots returned by `lu_factor` work. I find exposing that low-level (1-based in LAPACK) pivot info quite confusing anyway. I have the feeling this should not be part of the standard API, or at least not with the default options.
I can fix the docs if something is broken. What kind of issue do you have with it? And about exposing it, yes, that's what I meant in the last part of #627 (comment) about the "mistake". Note that they are pivot "swaps" applied to the rows one after another (row i was interchanged with row `piv[i]`), not a final permutation.
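For illustration, a short sketch of what those swaps mean, using SciPy's `lu_factor` output (whose pivots are 0-based): applying the swaps in order to an index vector yields the row permutation.

```python
import numpy as np
from scipy import linalg

def swaps_to_perm(piv):
    """Turn sequential row swaps (row i was interchanged with row piv[i])
    into a permutation index array perm with A[perm] == L @ U."""
    perm = np.arange(len(piv))
    for i, j in enumerate(piv):
        perm[i], perm[j] = perm[j], perm[i]
    return perm

A = np.random.default_rng(0).standard_normal((6, 6))
lu, piv = linalg.lu_factor(A)
L = np.tril(lu, k=-1) + np.eye(6)   # unit lower-triangular factor
U = np.triu(lu)
perm = swaps_to_perm(piv)
np.testing.assert_allclose(A[perm], L @ U, atol=1e-12)
```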
Which API would you guys then propose that can be implemented with the current tools (BLAS/cuBLAS/cuSolver/MAGMA) and allows for solving linear systems fast, possibly with a common left-hand side?
My vote goes to a single high-level `solve`. If you have an array and one or more right-hand sides, you call `solve` and it figures out the rest (that is the structure-discovery part). If you need the explicit factorization, then you can still call the lower-level functions. Otherwise we will repeat the same Fortran API for yet another two decades.

That's pretty much what I wanted to do in scipy/scipy#12824, but then I lost my way due to other things; it tries to address the discovery part for `solve`. The idea is that the linalg APIs have moved on and it is quite inconvenient to fight with the LAPACK API just to get a single operation done: working with internal arrays, converting things from the internal LAPACK representation to usable Scientific Python ways, etc. We are now trying to reduce our Fortran77 codebase in SciPy, which is taking quite some time, and after that I am hoping to get back to this again.

Anyway, that's a rough sketch, but that's kind of my wish to have in the end.
I thought the example in the docstring was broken (I thought I had copy-pasted the snippets and they failed), but that's actually not the case. Still, I think it can be helpful to be more explicit: scipy/scipy#18594. For some reason I thought I had read somewhere that the pivot indices were 1-based.
I find the proposal in #627 (comment) interesting. Would it be possible to implement it on top of the existing building blocks available in NumPy/SciPy, PyTorch and co? If it can be done (e.g. in a proof-of-concept repo), it may help move the spec discussion forward to have it backed by working code.
It seems that many libraries that are candidates to implement the Array API namespace already implement the LU factorization (with variations in API and with the notable exception of numpy).
However, LU is not part of the list of linear algebra operations in the current state of the spec.
Are there any plans to consider it for inclusion?