Gradients for Flux etc. -- WIP #59

Closed
mcabbott wants to merge 4 commits into Jutho:master from mcabbott:gradients

Conversation

@mcabbott commented Jan 9, 2019

I believe I found the bug in my gradient calculation today, and so I tidied it up a little for this PR. (See also #47.) The code specific to Flux is loaded via Requires, while the gradients themselves are in another file.

  • This does not attempt to handle gradients of the scalars α, β, only the arrays. I would like α::TrackedReal to be an error, which I think will require doing slightly less promotion, but I don't yet see precisely where.
  • Everything to do with the cache could use a close look. Since trace∇A calls contract!, it might be helpful if trace! were handed some syms by the macro, as contract! is; they could be ignored elsewhere.
  • I copied & pasted the formulae for Zygote, and the example commented out in the tests does actually run for me (once I have IRTools#master), but others give errors.
  • It should be easy to support Knet similarly, again with the same gradients.
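
For readers unfamiliar with the setup, here is a minimal, hypothetical sketch of the kind of Requires-based hook described above, written for a binary contraction reshaped to a matrix product. The module and function names (FluxHookSketch, contract) and the exact signatures are illustrative assumptions, not the PR's actual code; only Requires, Flux's UUID, and the Tracker API (track, data, @grad), which was Flux's AD at the time, are real.

```julia
module FluxHookSketch  # hypothetical module name, for illustration only

using Requires

# Toy "contraction": the general index contraction, reshaped to a matrix product.
contract(A::AbstractMatrix, B::AbstractMatrix) = A * B

function __init__()
    # Flux's registered UUID; this block only runs once Flux is loaded.
    @require Flux="587475ba-b771-5e3f-ad9e-33799f191a9c" begin
        import .Flux.Tracker: TrackedMatrix, track, data, @grad

        # Record the call on Tracker's tape instead of evaluating eagerly.
        contract(A::TrackedMatrix, B::TrackedMatrix) = track(contract, A, B)

        # Reverse rule: for C = A*B, the adjoints are ΔA = ΔC*B' and ΔB = A'*ΔC.
        @grad contract(A, B) = data(A) * data(B), Δ -> (Δ * data(B)', data(A)' * Δ)
    end
end

end # module
```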

@coveralls commented Jan 9, 2019

Coverage Status

Coverage decreased (-1.7%) to 73.717% when pulling 51e59f6 on mcabbott:gradients into 0388ba2 on Jutho:master.

@codecov bot commented Jan 9, 2019

Codecov Report

Merging #59 into master will decrease coverage by 1.73%.
The diff coverage is 53.17%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #59      +/-   ##
==========================================
- Coverage   77.15%   75.41%   -1.74%     
==========================================
  Files          13       16       +3     
  Lines        1383     1509     +126     
==========================================
+ Hits         1067     1138      +71     
- Misses        316      371      +55
Impacted Files                       Coverage           Δ
src/gradients/zygote.jl              0% <0%>            (ø)
src/TensorOperations.jl              92.98% <100%>      (+0.25%) ⬆️
src/gradients/backwards.jl           59.61% <59.61%>    (ø)
src/gradients/flux.jl                72.34% <72.34%>    (ø)
src/implementation/stridedarray.jl   72.51% <0%>        (+1.89%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1d32e9e...ce18832.

@codecov bot commented Jan 9, 2019

Codecov Report

Merging #59 into master will decrease coverage by 1.49%.
The diff coverage is 54.47%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master      #59     +/-   ##
=========================================
- Coverage   75.52%   74.02%   -1.5%     
=========================================
  Files          13       16      +3     
  Lines        1344     1467    +123     
=========================================
+ Hits         1015     1086     +71     
- Misses        329      381     +52
Impacted Files                       Coverage           Δ
src/gradients/zygote.jl              0% <0%>            (ø)
src/TensorOperations.jl              77.77% <100%>      (+2.77%) ⬆️
src/gradients/backwards.jl           59.61% <59.61%>    (ø)
src/gradients/flux.jl                72.34% <72.34%>    (ø)
src/implementation/stridedarray.jl   72.51% <0%>        (+1.89%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0388ba2...9f1b50b.

@Jutho (Owner) commented Jan 9, 2019

While this looks like great work you are doing, I would like to inform/warn you that I don't have a lot of time to review this in the next few weeks, especially since I am also pretty new to automatic differentiation.

I have a keen interest in this topic, and I am open to supporting it in TensorOperations.jl, but I would have hoped that some of the more specific compatibility code could live in the corresponding packages, not here. Having it here also means a much greater responsibility on my part to keep it up and running, and might make future innovations that I would like to implement a lot harder.

For example, I am not convinced that the way I currently lower the @tensor macro into an explicit expression is the way to go. Maybe an actual computation graph, similar to how broadcast works, could be maintained more easily. It would also allow for certain optimizations, e.g. when you contract a tensor network, you can recycle many of the intermediates when taking the derivative with respect to a certain tensor (see https://arxiv.org/pdf/1310.8023.pdf).

@mcabbott (Author) commented
I dropped the ball here, sorry... but at least you know there is no rush! Maybe RFC would have been a better title for this. It was basically a challenge to see if I could put this together, and now it works, more or less. I learned quite a bit, including from your code.

My suggestion in the PR that it live here was, I guess, because it has to depend on lots of internal details of your functions, not just what's “public”. But I hear you about the maintenance load, and I don't mean to impose on you. It also necessarily depends on many things in Flux etc., which look pretty different from how they did a year ago; it's a bit awkward that it must straddle both, and maybe there will eventually be a nicer way.

Indeed there may be smarter ways of going about this whole thing. I haven't thought very hard about it; it could be that gradients should be defined at either a higher or a lower level than these functions. This was just what seemed within reach: five minutes on a piece of paper for these three basic operations (sketched below).
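
For concreteness, the paper-and-pencil rules in question look roughly like this in the matrix-shaped special case (a sketch only; the ∇ names are illustrative, not the PR's functions):

```julia
using LinearAlgebra

# Reverse rules for the three primitives, in their matrix-shaped special cases:
#   contract:  C = A*B    ⇒  ΔA = ΔC*B',  ΔB = A'*ΔC
#   trace:     y = tr(A)  ⇒  ΔA = Δy*I
#   add:       C = α*A'   ⇒  ΔA = α*ΔC'   (with ' standing in for the permutation)
∇contract(ΔC, A, B) = (ΔC * B', A' * ΔC)
∇trace(Δy, A)       = Δy * Matrix(I, size(A)...)
∇add(ΔC, α)         = α * ΔC'
```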

For now perhaps I leave it here & we can see? If it's useful to anyone perhaps they could chime in too.

@GiggleLiu (Contributor) commented Feb 28, 2019

@mcabbott Thanks for your PR; it solved my problem, and so far everything works fine!
I will provide more “stress tests” for this PR in the near future.

@GiggleLiu mentioned this pull request Mar 26, 2019
@mcabbott (Author) commented
This code is now at https://github.com/mcabbott/TensorTrack.jl instead, and a higher-level approach to the same problem is at https://github.com/mcabbott/TensorGrad.jl.

@mcabbott closed this Aug 10, 2019
@mcabbott deleted the gradients branch Aug 10, 2019
@Jutho (Owner) commented Aug 10, 2019

I know that reference and have wanted to implement this in TensorOperations.jl for a long time already. In a sense, I consider that reference a first step towards AD for tensor networks 'avant la lettre'. In fact, I started working on an implementation last week and have thought a bit about the interface. I was thinking of something like

@tensor X[a,b,c](A,B) = A[a,d,e]*B[b,d,f] * C[c,e,f]

which would create a callable object X of a type that could be named, say, TensorNetwork. Upon calling, the tensors A and B are passed as arguments, but C would be determined in the scope where the macro was called and stored in the object's data structure using named tuples, in such a way that you can still evaluate the same contraction with a different C once, by saying X(A, B; C = ...), or permanently change C via X.C = ....

This would work perfectly with the gradients from your TensorGrad.jl, so I was wondering if you were interested in joining forces on this, @mcabbott? I have yet to figure out how to store and recycle intermediate objects, but that should not be too hard.
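
A very rough sketch of what such a callable object might look like (everything here is hypothetical: the type name TensorNetwork is only the suggestion above, and the field layout is a guess):

```julia
# Hypothetical sketch of the proposed interface; nothing here exists in
# TensorOperations.jl. The open arguments A and B are supplied per call,
# while C is captured from the macro's scope into a NamedTuple.
struct TensorNetwork{F,NT<:NamedTuple}
    contraction::F   # closure that performs the actual contraction
    captured::NT     # captured tensors, e.g. (C = ...,)
end

function (X::TensorNetwork)(A, B; kwargs...)
    # Keyword arguments override captured tensors for this call only.
    captured = merge(X.captured, values(kwargs))
    return X.contraction(A, B, captured)
end

# X(A, B)            # uses the captured C
# X(A, B; C = C2)    # same contraction, different C, for this call only
```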

@MacKenzieHnC commented
What's the status on this? This would save me a lot of hell if it worked. I'm happy to devote any amount of time to getting this working.

@mcabbott (Author) commented
The gradient definitions from TensorGrad.jl (linked above) are presently bolted onto Tullio.jl. It's perhaps a slightly awkward combination but useful for now.

BTW, the design there is that there's a callable struct Eval to which the Zygote gradient is attached. This contains a forward and a reverse function, but it's not returned; it's simply called once, behind the usual syntax. Gradients are calculated for all arrays unless you say otherwise with a keyword argument, like @tullio A[a,d,e] * B[b,d,f] * C[c,e,f] nograd=(A,C).
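
For anyone arriving here later, a hedged usage sketch of the above (assuming Tullio.jl and Zygote.jl; the array sizes and the helper f are arbitrary choices for this example):

```julia
using Tullio, Zygote

A, B, C = rand(2,3,4), rand(5,3,6), rand(7,4,6)

# Contract over the repeated indices d, e, f; A and C are excluded from
# differentiation via the nograd keyword, so only B's gradient is computed.
f(A, B, C) = sum(@tullio D[a,b,c] := A[a,d,e] * B[b,d,f] * C[c,e,f] nograd=(A,C))

gB = Zygote.gradient(B -> f(A, B, C), B)[1]   # array of size(B)
```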

@maartenvd (Collaborator) commented
@mcabbott is there a reason you abandoned defining gradients for contract! and instead went with an @Grad macro?

@GiggleLiu (Contributor) commented Nov 28, 2021

UPDATES:

@Jutho, it is probably much easier to define the backward rules on binary tensor contraction operations. Because the n-ary tensor operations call the binary contraction function, the contraction order in the backward pass can be determined automatically by the AD engine. The backward pass then always has constant overhead relative to the forward pass.

FYI: it is much easier to support AD now, because the AD community has switched to ChainRules; you just add a very small dependency, ChainRulesCore. (However, we still do not have a reliable AD engine in Julia 😞.)
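
As a minimal illustration of that suggestion (a sketch only; matcontract is a made-up stand-in for a binary contraction reshaped to a matrix product, not a TensorOperations function):

```julia
using ChainRulesCore

# Toy binary contraction: contract the second index of A with the first of B.
matcontract(A::AbstractMatrix, B::AbstractMatrix) = A * B

function ChainRulesCore.rrule(::typeof(matcontract), A, B)
    C = matcontract(A, B)
    # Pullback: for C = A*B, the adjoints are ΔA = ΔC*B' and ΔB = A'*ΔC.
    function matcontract_pullback(ΔC)
        ΔC = unthunk(ΔC)
        return NoTangent(), ΔC * B', A' * ΔC
    end
    return C, matcontract_pullback
end
```

Any ChainRules-aware engine (e.g. Zygote) then picks such a rule up without further glue code.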
