RFC: add Functors-aware structural gradient #129

Merged · 4 commits · Aug 21, 2022
Changes from 1 commit
18 changes: 11 additions & 7 deletions Project.toml
@@ -1,33 +1,37 @@
name = "Tracker"
uuid = "9f7883ad-71c0-57eb-9f7f-b5c9e6d3789c"
version = "0.2.20"
version = "0.2.21"

[deps]
Adapt = "79e6a3ab-5dfb-504d-930d-738a2a938a0e"
DiffRules = "b552c78f-8df3-52c6-915a-8e097449b14b"
ForwardDiff = "f6369f11-7733-5829-9624-2563aa707210"
Functors = "d9f16b24-f501-4c13-a1f2-28368ffc5196"
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
LogExpFunctions = "2ab3a3ac-af41-5b50-aa03-7779005ae688"
MacroTools = "1914dd2f-81c6-5fcd-8719-6d5c9610ff09"
NNlib = "872c559c-99b0-510c-b3b7-b6c96a88d5cd"
NaNMath = "77ba4419-2d1f-58cd-9bb1-8ffee604a2e3"
Optimisers = "3bd65402-5787-11e9-1adc-39752487f4e2"
Printf = "de0858da-6303-5e67-8744-51eddeeeb8d7"
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
Requires = "ae029012-a4dd-5104-9daa-d747884805df"
SpecialFunctions = "276daf66-3868-5448-9aa4-cd146d93841b"
Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"

[compat]
Adapt = "1, 2, 3"
Adapt = "3"
DiffRules = "1.4"
Functors = "0.3.0"
Member Author:

BTW [email protected] requires Metalhead#master right now. With that, the example from https://fluxml.ai/Optimisers.jl/dev/#Usage-with-Flux.jl runs, and has half the TTFG (time to first gradient) of Zygote:

julia> let 
       Random.seed!(1)
       model = Metalhead.ResNet(18) |> gpu  # define a model to train
       image = rand(Float32, 224, 224, 3, 1) |> gpu;  # dummy data
       @show sum(model(image));  # dummy loss function

       rule = Optimisers.Adam()  # use the Adam optimiser with its default settings
       state = Optimisers.setup(rule, model);  # initialise this optimiser's momentum etc.

       @time _, (∇model, _) = Tracker.withgradient(model, image) do m, x  # calculate the gradients
         sum(m(x))
       end;

       state, model = Optimisers.update(state, model, ∇model);
       @show sum(model(image));
       Base.summarysize(∇model)
       end
sum(model(image)) = 1.2527118f0
 19.638126 seconds (39.40 M allocations: 3.444 GiB, 44.46% gc time, 87.70% compilation time)
sum(model(image)) = -4792.643f0
46767520

For comparison, Zygote gives:

sum(model(image)) = 1.2527118f0
 47.450042 seconds (73.94 M allocations: 5.419 GiB, 36.40% gc time, 93.23% compilation time)
sum(model(image)) = -19.776657f0
46765720

But something is wrong, as the final loss differs.

Member Author:

Looking on the bright side, I guess with this it would be fairly easy to add checks to Flux's tests, comparing what Zygote thinks about each layer to what Tracker thinks. Any that disagree are cause for concern.
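A rough sketch of what such a cross-check could look like (illustrative only — the layer, loss, and data here are placeholders, not anything from Flux's actual test suite):

```julia
using Flux, Tracker, Zygote, Test

layer = Dense(2 => 3, relu)                 # any layer under test
x = rand(Float32, 2, 4)
loss(m) = sum(abs2, m(x))

g_zygote  = Zygote.gradient(loss, layer)[1]            # NamedTuple-shaped gradient
g_tracker = Tracker.withgradient(loss, layer).grad[1]  # same shape, plain arrays

# The two ADs should agree on every numeric leaf, up to floating-point noise.
@test g_zygote.weight ≈ g_tracker.weight
@test g_zygote.bias ≈ g_tracker.bias
```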

Member:

Yes, and I can already see that being helpful for Metalhead since we see the occasional odd gradient anomaly.

Member Author (@mcabbott), Sep 1, 2022:

Diffractor, with JuliaDiff/Diffractor.jl#89, gives:

sum(model(image)) = 1.2527118f0
 15.313321 seconds (35.07 M allocations: 2.822 GiB, 1.97% gc time, 98.18% compilation time)
sum(model(image)) = -19.776482f0
19384064

ForwardDiff = "0.10"
LogExpFunctions = "0.3"
MacroTools = "0.5"
NNlib = "0.7.18, 0.8" # 0.7.18 is the last version which supports Julia 1.3
NaNMath = "0.3, 1"
Requires = "0.5, 1.0"
SpecialFunctions = "0.10, 1, 2"
julia = "1.3"
NNlib = "0.8"
NaNMath = "1"
Optimisers = "0.2.9"
Requires = "1.0"
SpecialFunctions = "1, 2"
julia = "1.6"

[extras]
PDMats = "90014a1f-27ba-587c-ab20-58faa44d9150"
6 changes: 3 additions & 3 deletions src/Tracker.jl
@@ -13,7 +13,7 @@ import Printf
import Base: ==

export TrackedArray, TrackedVector, TrackedMatrix, Params, gradient,
jacobian, hessian, param, back!
jacobian, hessian, param, back!, withgradient

tracker(x) = nothing

@@ -70,10 +70,10 @@ end

include("idset.jl")
include("params.jl")
include("back.jl")
include("numeric.jl")
include("lib/real.jl")
include("lib/array.jl")
include("back.jl")
include("numeric.jl")
include("forward.jl")
@init @require PDMats="90014a1f-27ba-587c-ab20-58faa44d9150" include("lib/pdmats.jl")

57 changes: 57 additions & 0 deletions src/back.jl
@@ -178,3 +178,60 @@ function jacobian(f, x::AbstractVector)
end

hessian(f, x) = jacobian(x -> gradient(f, x, nest=true)[1], x)

using Functors: fmap, fmapstructure
using Optimisers: _trainable, isnumeric

"""
withgradient(f, xs...)

This computes the value `f(xs...)` and the gradient with respect to `xs`.
However, it differs from `gradient` in several other respects:
* It will recurse into `xs` using `fmap`, and thus like Zygote's "explicit mode" it
returns a tree-like gradient matching the shape of a Flux model.
* Only objects satisfying `Optimisers.isnumeric` are regarded as parameters,
thus in particular integers are ignored.
* Returns plain arrays, not tracked.

# Examples
```
julia> nt = (vec = [1.0, 2.0], mat = [4.0;;], fun = sin);

julia> withgradient(nt, 2) do x, p
sum(abs2, x.vec) ^ p
end
(val = 25.0, grad = ((vec = [20.0, 40.0], mat = [0.0;;], fun = nothing), nothing))

julia> using Flux

julia> model = Chain(Dense(2 => 1, tanh), Dense(1 => 1, bias=false));

julia> withgradient(model, rand(Float32, 2)) do m, x
sum(abs2, m(x))
end
(val = 0.035716165f0, grad = ((layers = ((weight = Float32[-0.4241869 -0.16741231], bias = Float32[-0.5529184], σ = nothing), (weight = Float32[-0.04804218;;], bias = nothing, σ = nothing)),), Float32[0.12706584, -0.08858479]))
```
"""
function withgradient(f, xs...)
pxs = fmap(param, xs; exclude = isnumeric) # would ideally apply params only to trainable
Member:

Some variation of trainable_walk from FluxML/Optimisers.jl#35 (comment) could work here.
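For reference, a very rough sketch of a trainable-only variant, written as a hand-rolled recursion rather than via `fmap`'s `walk` keyword (so `param_trainable` and its details are illustrative, not the linked `trainable_walk`):

```julia
using Functors: functor, isleaf
using Optimisers: _trainable, isnumeric
using Tracker: param

function param_trainable(x)
    isnumeric(x) && return param(x)   # trainable numeric leaf: track it
    isleaf(x) && return x             # any other leaf is left untouched
    children, rebuild = functor(x)
    keep = _trainable(x)              # non-trainable children come back as `nothing`
    rebuild(map((c, k) -> k === nothing ? c : param_trainable(c), children, keep))
end
```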

Member Author (@mcabbott), Aug 21, 2022:

Thanks, I get lost every time I try to remember how walks work... but that looks like the right one.

I guess another option would be not to depend on Optimisers at all, just Functors. Although not tracking non-trainable arrays in Flux probably increases the chances of this just working.

I'm not keen on Requires here; it seems like a hassle for one tiny package. Tracker already depends on 19 others: https://juliahub.com/ui/Packages/Tracker/cI3wW/0.2.20?page=1
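For comparison, a minimal sketch of the Functors-only option (no Optimisers dependency at all); the predicate name here is made up, and `AbstractFloat` is just one possible choice of what counts as a parameter:

```julia
using Functors: fmap
using Tracker: param

# Decide what to track by dispatch alone, instead of Optimisers.isnumeric.
is_float_array(x) = x isa AbstractArray{<:AbstractFloat}

nt = (vec = rand(2), n = 3, fun = sin)
pxs = fmap(param, nt; exclude = is_float_array)
# float arrays become TrackedArrays; integers, functions, etc. pass through unchanged
```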

Member:

We still need a way to avoid tracking non-trainable arrays for gradients while still recursing over the arrays that should be moved on/off the GPU in Flux.

I suppose the dep issue isn't a major one in practice, but it may raise a couple of eyebrows. If import times stay mostly the same, though, no objections here.
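Concretely (assuming Flux's `BatchNorm` as the motivating case: its running statistics are functor children, so `gpu`/`fmap` reach them, but they are not trainable):

```julia
using Flux, Functors

bn = BatchNorm(4)

Flux.trainable(bn)           # (β = ..., γ = ...): only the affine parameters
keys(Functors.children(bn))  # also includes :μ and :σ², which gpu/fmap must still visit
```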

Member Author:

Depending on trainable does mean raising the floor to Julia 1.6. I think that's fine: anyone on 1.3 has surely accepted freezing all downstream packages by now, and we aren't planning to backport bugfixes.

Zygote, of course, tries to compute gradients for non-trainable arrays too, but it would be nice not to do so.

l = f(pxs...)
losscheck(l)
l isa TrackedReal || return (val = l, grad = nothing)
@interrupts back!(l)
(val = data(l), grad = rec_grad(pxs))
end

# Easier to write the recursion to extract the gradients without using fmap:
rec_grad(x::TrackedArray) = grad(x)
rec_grad(x::TrackedReal) = grad(x)
rec_grad(x::AbstractArray{<:Number}) = nothing
rec_grad(x::Number) = nothing

rec_grad(x::Union{Tuple,NamedTuple,AbstractArray}) = map(rec_grad, x)
rec_grad(::Tuple{}) = nothing
rec_grad(::NamedTuple{(), Tuple{}}) = nothing
function rec_grad(x::T) where {T}
F = fieldnames(T)
isempty(F) && return nothing
map(f -> rec_grad(getfield(x, f)), NamedTuple{F}(F))
end
7 changes: 7 additions & 0 deletions test/runtests.jl
@@ -17,4 +17,11 @@ using Tracker: jacobian
@test J ≈ A.data
end

@testset "withgradient" begin
nt = (vec = [1.0, 2.0], mat = [4.0;;], fun = sin);
@test withgradient((x, p) -> sum(abs2, x.vec) ^ p, nt, 2) == (val = 25.0, grad = ((vec = [20.0, 40.0], mat = [0.0;;], fun = nothing), nothing))

@test withgradient(x -> sum(x.v), (v = [1, 2], w = [3.0])) == (val = 3, grad = nothing)
end

end # overall @testset