From 4331dc8240509a1bf57fb1f15968d62d2d1b1690 Mon Sep 17 00:00:00 2001
From: Zhanibek
Date: Wed, 13 Mar 2024 14:06:07 +0900
Subject: [PATCH 1/2] freezingdocs

---
 docs/src/tutorials/misc-model-tweaking.md | 131 ++++++++++++++++++++++
 1 file changed, 131 insertions(+)
 create mode 100644 docs/src/tutorials/misc-model-tweaking.md

diff --git a/docs/src/tutorials/misc-model-tweaking.md b/docs/src/tutorials/misc-model-tweaking.md
new file mode 100644
index 0000000000..9280d2ef28
--- /dev/null
+++ b/docs/src/tutorials/misc-model-tweaking.md
@@ -0,0 +1,131 @@
+# Choosing differentiable/GPU parts of the model
+!!! note
+    This tutorial covers a few loosely connected ways of customizing your
+    models even further. It assumes familiarity with
+    [`Flux.@layer`](@ref), [`Flux.@functor`](@ref), [`freeze!`](@ref
+    Flux.freeze!) and other Flux basics.
+
+Flux provides several ways of freezing parameters, excluding them from
+backpropagation entirely, or marking custom struct fields so that they are not
+moved to the GPU ([Functors.@functor](@ref)) and hence not trained. The following
+subsections should make it clear which one suits your needs the best.
+
+## On-the-fly freezing per model instance
+Perhaps you'd like to freeze some of the weights of the model (even at
+mid-training), and Flux accomplishes this through [`freeze!`](@ref Flux.freeze!) and `thaw!`.
+
+```julia
+m = Chain(
+      Dense(784 => 64, relu), # freeze this one
+      Dense(64 => 32, relu),
+      Dense(32 => 10)
+    )
+opt_state = Flux.setup(Momentum(), m);
+
+# Freeze some layers right away
+Flux.freeze!(opt_state.layers[1])
+
+for data in train_set
+  input, label = data
+
+  # Some params could be frozen during the training:
+  Flux.freeze!(opt_state.layers[2])
+
+  grads = Flux.gradient(m) do m
+      result = m(input)
+      loss(result, label)
+  end
+  Flux.update!(opt_state, m, grads[1])
+
+  # Optionally unfreeze the params later
+  Flux.thaw!(opt_state.layers[1])
+end
+```
+
+## Static freezing per model definition
+Sometimes some parts of the model ([`Flux.@layer`](@ref)) needn't be trained at all but these params
+still need to reside on the GPU (these params are still needed in the forward
+and/or backward pass).
+```julia
+struct MaskedLayer{T}
+  chain::Chain
+  mask::T
+end
+Flux.@layer MaskedLayer trainable=(chain,)
+# mask field will not be updated in the training loop
+
+function (m::MaskedLayer)(x)
+  # mask field will still move to the gpu for efficient operations:
+  return m.chain(x) + x + m.mask
+end
+
+model = MaskedLayer(...) # this model will not have the `mask` field trained
+```
+Note how this method permanently sets some model fields to be excluded from
+training when the model is defined, with no way to change that on the fly.
+
+## Excluding from model definition
+Sometimes some parameters aren't just "not trainable" but they shouldn't even
+transfer to the GPU (or be part of the functor). All scalar fields are like this
+by default, so things like learning rate multipliers are not trainable nor
+transferred to the GPU by default.
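+
+For instance, here is a minimal sketch of that default behaviour (the
+`ScaledDense` struct and its `lr_multiplier` field are purely illustrative,
+not something Flux provides):
+
+```julia
+using Flux
+
+struct ScaledDense
+  layer::Dense
+  lr_multiplier::Float32   # plain scalar field, only used for illustration
+end
+Flux.@layer ScaledDense
+
+sd = ScaledDense(Dense(3 => 3), 0.1f0)
+
+# Only the arrays inside `layer` get optimiser state; the scalar
+# `lr_multiplier` is skipped, and `gpu(sd)` would likewise move only the arrays.
+opt_state = Flux.setup(Adam(), sd)
+```
+
+When a whole non-scalar field should be excluded as well, spell this out
+explicitly with [`Flux.@functor`](@ref), as in the following example: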
+```julia
+struct CustomLayer{T, F}
+  chain::Chain
+  activation_results::Vector{F}
+  lr_multiplier::Float32
+end
+Flux.@functor CustomLayer (chain, ) # Explicitly leaving out `activation_results`
+
+function (m::CustomLayer)(x)
+  result = m.chain(x) + x
+
+  # `activation_results` are not part of the GPU loop, hence we could do
+  # things like `push!`
+  push!(m.activation_results, mean(result))
+  return result
+end
+```
+See more about this in [`Flux.@functor`](@ref)
+
+
+## Freezing Layer Parameters (deprecated)
+
+When it is desired to not include all the model parameters (for e.g. transfer learning), we can simply not pass in those layers into our call to `params`.
+
+!!! compat "Flux ≤ 0.14"
+    The mechanism described here is for Flux's old "implicit" training style.
+    When upgrading for Flux 0.15, it should be replaced by [`freeze!`](@ref Flux.freeze!) and `thaw!`.
+
+Consider a simple multi-layer perceptron model where we want to avoid optimising the first two `Dense` layers. We can obtain
+this using the slicing features `Chain` provides:
+
+```julia
+m = Chain(
+      Dense(784 => 64, relu),
+      Dense(64 => 32, relu),
+      Dense(32 => 10)
+    );
+
+ps = Flux.params(m[3:end])
+```
+
+The `Zygote.Params` object `ps` now holds a reference to only the parameters of the layers passed to it.
+
+During training, the gradients will only be computed for (and applied to) the last `Dense` layer, therefore only that would have its parameters changed.
+
+`Flux.params` also takes multiple inputs to make it easy to collect parameters from heterogenous models with a single call. A simple demonstration would be if we wanted to omit optimising the second `Dense` layer in the previous example. It would look something like this:
+
+```julia
+Flux.params(m[1], m[3:end])
+```
+
+Sometimes, a more fine-tuned control is needed.
+We can freeze a specific parameter of a specific layer which already entered a `Params` object `ps`,
+by simply deleting it from `ps`:
+
+```julia
+ps = Flux.params(m)
+delete!(ps, m[2].bias)
+```

From 693604a3f06fa1f5cbb3c903c4e1efe1b8d8aa38 Mon Sep 17 00:00:00 2001
From: Zhanibek
Date: Fri, 28 Jun 2024 16:34:13 +0900
Subject: [PATCH 2/2] update freezing docs

---
 docs/make.jl                                  |  1 +
 .../misc-model-tweaking.md                    | 52 ++++++++++---------
 2 files changed, 29 insertions(+), 24 deletions(-)
 rename docs/src/{tutorials => reference}/misc-model-tweaking.md (72%)

diff --git a/docs/make.jl b/docs/make.jl
index 6c7b483caa..9fc32e7632 100644
--- a/docs/make.jl
+++ b/docs/make.jl
@@ -40,6 +40,7 @@ makedocs(
             "OneHotArrays.jl" => "reference/data/onehot.md",
             "Low-level Operations -- NNlib.jl" => "reference/models/nnlib.md",
             "Nested Structures -- Functors.jl" => "reference/models/functors.md",
+            "Advanced" => "reference/misc-model-tweaking.md"
         ],
         "Tutorials" => [
            # These walk you through various tasks. It's fine if they overlap quite a lot.
diff --git a/docs/src/tutorials/misc-model-tweaking.md b/docs/src/reference/misc-model-tweaking.md
similarity index 72%
rename from docs/src/tutorials/misc-model-tweaking.md
rename to docs/src/reference/misc-model-tweaking.md
index 9280d2ef28..ed1bd72793 100644
--- a/docs/src/tutorials/misc-model-tweaking.md
+++ b/docs/src/reference/misc-model-tweaking.md
@@ -12,7 +12,8 @@ subsections should make it clear which one suits your needs the best.
 
 ## On-the-fly freezing per model instance
 Perhaps you'd like to freeze some of the weights of the model (even at
-mid-training), and Flux accomplishes this through [`freeze!`](@ref Flux.freeze!) and `thaw!`.
+mid-training), and Flux accomplishes this through [`freeze!`](@ref Flux.freeze!)
+mid-training), and Flux accomplishes this through [`freeze!`](@ref Flux.freeze!) +and `thaw!`. ```julia m = Chain( @@ -45,12 +46,13 @@ end ## Static freezing per model definition Sometimes some parts of the model ([`Flux.@layer`](@ref)) needn't to be trained at all but these params still need to reside on the GPU (these params are still needed in the forward -and/or backward pass). +and/or backward pass). This is somewhat similar to `pytorch` [register_buffer](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.register_buffer). + + + ```julia -struct MaskedLayer{T} - chain::Chain - mask::T -end +struct MaskedLayer chain; mask; end + Flux.@layer MyLayer trainable=(chain,) # mask field will not be updated in the training loop @@ -70,11 +72,7 @@ transfer to the GPU (or be part of the functor). All scalar fields are like this by default, so things like learning rate multipliers are not trainable nor transferred to the GPU by default. ```julia -struct CustomLayer{T, F} - chain::Chain - activation_results::Vector{F} - lr_multiplier::Float32 -end +struct CustomLayer chain; activation_results; end Flux.@functor CustomLayer (chain, ) # Explicitly leaving out `activation_results` function (m::CustomLayer)(x) @@ -90,15 +88,16 @@ See more about this in [`Flux.@functor`](@ref) ## Freezing Layer Parameters (deprecated) +When it is desired to not include all the model parameters (for e.g. transfer +learning), we can simply not pass in those layers into our call to `params`. -When it is desired to not include all the model parameters (for e.g. transfer learning), we can simply not pass in those layers into our call to `params`. - -!!! compat "Flux ≤ 0.14" - The mechanism described here is for Flux's old "implicit" training style. - When upgrading for Flux 0.15, it should be replaced by [`freeze!`](@ref Flux.freeze!) and `thaw!`. +!!! compat "Flux ≤ 0.14" The mechanism described here is for Flux's old + "implicit" training style. When upgrading for Flux 0.15, it should be + replaced by [`freeze!`](@ref Flux.freeze!) and `thaw!`. -Consider a simple multi-layer perceptron model where we want to avoid optimising the first two `Dense` layers. We can obtain -this using the slicing features `Chain` provides: +Consider a simple multi-layer perceptron model where we want to avoid optimising +the first two `Dense` layers. We can obtain this using the slicing features +`Chain` provides: ```julia m = Chain( @@ -110,19 +109,24 @@ m = Chain( ps = Flux.params(m[3:end]) ``` -The `Zygote.Params` object `ps` now holds a reference to only the parameters of the layers passed to it. +The `Zygote.Params` object `ps` now holds a reference to only the parameters of +the layers passed to it. -During training, the gradients will only be computed for (and applied to) the last `Dense` layer, therefore only that would have its parameters changed. +During training, the gradients will only be computed for (and applied to) the +last `Dense` layer, therefore only that would have its parameters changed. -`Flux.params` also takes multiple inputs to make it easy to collect parameters from heterogenous models with a single call. A simple demonstration would be if we wanted to omit optimising the second `Dense` layer in the previous example. It would look something like this: +`Flux.params` also takes multiple inputs to make it easy to collect parameters +from heterogenous models with a single call. A simple demonstration would be if +we wanted to omit optimising the second `Dense` layer in the previous example. 
+It would look something like this:
 
 ```julia
 Flux.params(m[1], m[3:end])
 ```
 
-Sometimes, a more fine-tuned control is needed.
-We can freeze a specific parameter of a specific layer which already entered a `Params` object `ps`,
-by simply deleting it from `ps`:
+Sometimes, more fine-grained control is needed. We can freeze a specific
+parameter of a specific layer which has already entered a `Params` object `ps`, by
+simply deleting it from `ps`:
 
 ```julia
 ps = Flux.params(m)