2024-01-15
decided to use this log again for daily stuff.
looked into value expansion version again, from DDP type solvers in continuous
time. https://dcsl.gatech.edu/papers/acc15e.pdf here is a nice derivation for
the HJI equation.
also debating: if we go with this value expansion option, we will have to do
more computation per trajectory; the state is enlarged by an (nx, nx) matrix
(symmetric so maybe half that). is it worth it? can we effectively explore the
state space without too much unnecessary data?
thinking a lot about the "fundamental limits" of this purely backward-in-time
way of solving the problem. inevitably the bottleneck will be travelling along
long, turnpike-type trajectories, when the fast dynamics, which are unstable in
backward time, constantly push us away from it. Will we actually end up
creating much more data than what would be needed? Would it be a smart(ish)
idea to throw away *some* of the trajectories to limit dataset size?
also, intuitively, it seems like higher dimensional problems are even harder
from this perspective. If we want to re-start HJB characteristics away from the
terminal set, we need accurate knowledge of λ(x) and V(x) at that point. This
in turn we can only do when the data density in that region is already high, so
that either some nearest neighbor interpolation or taylor extrapolation is
accurate enough. But does this even work in high dimensions? Basic "curse of
dimensionality" intuition says that as dimension increases, this type of
nearest neighbor search becomes an increasingly bad idea: The dataset size
needed to satisfy some fixed data density is exponential in dimensionality.
(another interesting question here would be to challenge the assumption that we
need accurate V and λ to continue a HJB characteristic curve. If λ is slightly
wrong, can we prove something about the suboptimality of the "stitched"
characteristic curve w.r.t. the actual optimal trajectory?)
In the same vein, my initial assumption kind of has been: Better to "stitch
together" optimal trajectories by re-initialising the characteristic ODE (a),
than to re-calculate the whole final part of the trajectory (b). Is this true
though? To do (a), we have to interpolate the costate and value info, which in
higher dimensions might be problematic. (and even if the sets from which most
backward trajectories start locally resemble a lower-dimensional subspace,
interpolation requires lots of care to not go "outside" of that subspace too
much. This is also something where I think quadratic value expansions can help
a lot: the hessian of the value gives us a nice distance metric, which maybe
encapsulates this "lower dimensional turnpike subspace" thing which hopefully
exists)
all of this makes me ask the question: would a DDP type solver be a better
overall fit? the benefit of our approach (hopefully) is that we only generate
(locally) optimal trajectories, wasting no time on iteratively adjusting
suboptimal ones. However, that advantage can be nullified if in the process of
exploring the state space with optimal trajectories, we generate much more of
them than what we would need for training an NN. Maybe we could have made a
similarly good dataset with far fewer DDP-generated optimal trajectories: each
one is a bit more expensive to calculate, but we would only generate the amount
actually needed, which is probably a lot less than with the current idea.
this makes me heavily question the purpose of the project. if we just take a
DDP solver, why not one of the 100s that already exist? If we do our
characteristics type thing we can already guess the conclusion*, so why do it?
* something like:
we attempt to solve control problems to global optimality by evaluating optimal
trajectories backwards. end up with no global optimality guarantee, best case
some probabilistic asymptotic approximate thing. It works great for
"non-multiscale", low-ish dimensional problems, for others we end up generating
huge datasets for not much tangible benefit, all while algo complexity is
basically O(dataset size^2).
ways to address this:
- be aware that it probably won't be state of the art, have fun exploring
algo details and heuristics to speed it up and stuff.
- connect with previously existing methods (ddp) and work from there to some
sort of global optimality guarantee
- do a 180 and start a somewhat different project? (PINN? active exploration?)
also still kind of debating thoughts from last week: should we go all in on
making it efficient with the right data structures? store trajectories in a
tree or DAG like thing based on "flow" of information? use some black magic to
speed up nearest neighbor search, like k-d tree or locality sensitive hashing?
because pretty much no matter how we go about it, we will not only create a
sizable data set of trajectories, but also have to do many, many
nearest-neighbor-type searches over it.
this was a lot of maybe incoherent rambling. sorry to my future self and
whoever reads this. maybe the final paper will be more readable \o/
2024-01-16
looked through literature. found out that our gradient enabled neural network
is already known under one more name: Sobolev training (sobolev spaces are
function spaces endowed with a norm that looks like a p-norm of a vector of
function norms of all partial derivatives up to order k). specifically there is
also an interesting phd thesis:
https://gepettoweb.laas.fr/articles/amit_icra_22.html. They basically use a DDP
type solver coupled with global value function approximation, to find infinite
horizon optimal control over a large domain by iteratively using the learned
value function as terminal cost.
Should we try to expand this to some sort of provable global optimality?
Because I don't think they do that there. They claim "So, when used as a
proxy for terminal cost functional, ∂PVPv [name of their algo] tends to drive
the locally optimal solver toward the globally optimal solution.", but give no
proof or explanation.
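for my own reference, a minimal sketch of what such a sobolev-style loss could
look like in our setting (fit V and its gradient = costate at sampled states).
value_fn(params, x) -> scalar and the weighting w_grad are placeholders of
mine, not what they actually use:

import jax
import jax.numpy as jnp

def sobolev_loss(params, value_fn, xs, v_targets, lam_targets, w_grad=1.0):
    # predicted values and value gradients (costates) at all sampled states
    v_pred = jax.vmap(value_fn, in_axes=(None, 0))(params, xs)
    lam_pred = jax.vmap(jax.grad(value_fn, argnums=1), in_axes=(None, 0))(params, xs)
    value_term = jnp.mean((v_pred - v_targets) ** 2)
    grad_term = jnp.mean(jnp.sum((lam_pred - lam_targets) ** 2, axis=-1))
    return value_term + w_grad * grad_term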
possible decent contribution would also be to do the infinite horizon in one
single trajectory with some sort of time reparameterisation. reduces the number
of approximations and weak links, and I think with adaptive solver this can
actually be quite efficient. In fact, I think inf-horizon DDP itself would already
be a cool contribution, even without the "hopefully global" optimality.
other new interesting papers in semanticscholar folders:
- ddp-ish value expansion
- local -> global optimality
- sobolev learning
- V approx, transfer learning, etc.
small tidbits i thought about:
- in continuous time DDP apparently the hessian Vxx is always positive
definite, in contrast to the more often used time-discretised version. Does
this hold even when we have a really sloppy numerical solver? would be cool if
yes
- NN learning: only consider data points with v(x) <= v, and sweep up v during
training? this way we might explicitly identify the points when different
local solutions start conflicting, and maybe do something about it.
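quick sketch of that last sublevel-set idea (v_max swept upward during
training; names are placeholders):

import jax.numpy as jnp

def masked_value_loss(v_pred, v_target, v_max):
    # only points currently inside the sublevel set {x : v(x) <= v_max} count
    mask = v_target <= v_max
    n = jnp.maximum(jnp.sum(mask), 1)
    return jnp.sum(jnp.where(mask, (v_pred - v_target) ** 2, 0.0)) / n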
2024-01-17
had meeting. talked about all the concerns from last few days and got
reassuring answer that yes, these are hard and, in practical terms, unsolved
problems. advice: just explore around for another week and slowly try to come
up with a sensible research question.
mentioned frustration that "everything" has already been done, no "niche" left.
answer: "own" niche can also be very small, or a contribution to extend
existing work can also be quite small to be significant.
current options:
- explore (probably limited) possibilities with pure PMP backward integration.
also sometimes labelled geodesic flow, geodesic spray, etc. probably only
simple systems, low dim, not multiscale.
- explore backwards PMP integration + quadratic value expansion. the appeal is
obviously still no iterative optimisation over trajectories/inputs. but
scalability concern due to instability and huge dataset remains -- can we
still optimise the hell out of it? with smart proposal strategies or fancy
data structures?
- go for a DDP like solver method, similar to PhD thesis of Amit Parag, and try
to come up with tricks to "encourage" global optimality. Lots of ideas around
BNNs with asymmetric loss, Sobolev learning, active learning, etc. currently
this does seem the most interesting to me.
- actually already infinite-horizon, continuous-time DDP could be a cool
contribution itself...?
- sth else?
different, mostly-orthogonal directions to expand the problem setting beyond
inf-horizon, continuous time:
- some sort of approximate *global* optimality. probably though there just
isn't an actual shortcut to this. give up?
- input constrained (in general dims -> QP solver finds u*)
- state constrained (in the sense of safety limits)
- state constrained (in the sense that x is on a manifold)
- robust (with adversary, HJI equation)
- parameterised family of dynamics
other than that:
- devoured more literature about continuous-time DDP-ish solvers, especially
interesting ones to do w/ HJI/robust stuff.
- found a nice paper comparing different BNN methods to the "true" posterior
obtained with Hamiltonian Monte Carlo, put it in a new BNN folder.
- found a couple more papers about what I think is essentially the same as
pontryagin backward integration like in SA, one example was planning min-time
flight trajectories on the actual globe with known wind field and simplified
single integrator dynamics. that is actually a cool application, where the
multiscale limitations do not come into play.
Random assorted thoughts, about making DDP work in inf-horizon:
I am doubtful as to whether the time-value rescaling, or some different time
rescaling, is needed. Probably the adaptive solver will be just as efficient
without it [citation needed]. What is more interesting then is how do we choose
the time horizon. Simple first idea: just choose a large-ish time horizon and
define a terminal set Xf (eg. LQR value sublevel set). Then, optimise
trajectories using DDP. If the trajectory ends inside of Xf, good. If not,
increase the time horizon. In continuous time and w/ adaptive solvers, [i hope
that] a "too" long time horizon will not be much of an issue, especially if
most of the time is spent within Xf, where the ODE solver can choose large
steps.
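just to make that loop concrete, a throwaway sketch (ddp_solve and lqr_value
are hypothetical placeholders for the finite-horizon solver and the LQR value
function defining Xf):

def solve_to_Xf(x0, T=5.0, sublevel=0.1, grow=2.0, max_tries=6):
    for _ in range(max_tries):
        traj = ddp_solve(x0, T)                   # finite-horizon DDP solve
        if lqr_value(traj.xs[-1]) <= sublevel:    # ended inside Xf?
            return traj
        T = grow * T                              # otherwise lengthen horizon
    raise RuntimeError("did not reach Xf even with the longest horizon")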
And about distributions/ill conditioning/timescale separations:
also, many of the problems we see (ill conditioning, bad distributions) arise
from basically the fact that systems are decomposable across time scales.
Traditionally, this makes engineering solutions easier, however if we aim to
solve the whole problem at once it becomes much more difficult. Can we somehow
use these time-scale decomposition heuristics to maybe improve/inform our
sampling process, while still ultimately finding the actual solution to the
full problem? not so sure how exactly to go about this.
about possible approximate global optimality:
this seems admittedly hard and convoluted in the framework DDP + Sobolev BNN
value function. One possibility: keep all trajectories in a data set. For each
proposal, also find nearest neighbor trajectories, and for each one of them (*)
try to do a homotopy method, moving its initial condition to the proposal while
keeping local optimality w/ DDP.
*: To speed this up we can first do some k-means clustering of the trajectories
in terms of lambda(x0), or in terms of some sampled points along the
trajectory. Probably we get away with at most producing k=2 clusters since
(probably?) the set of points where n+1 different locally optimal trajectories
are equally good has one dimension less than the same set for n different
trajectories. So we might choose the minimum nontrivial number 2...
Other possibility: somehow use a sobolev BNN with multimodal posterior. But
this will be worth nothing if we don't have the data. you know, I am starting
to think that global optimality is just actually not worth worrying about, and
instead we will have to keep relying on engineering intuition about the specific
system to decide if we are happy with the local solution at hand.
2024-01-18
thinking about ways to do function approximation which maybe works well given
the structured nonsmoothness encountered in value functions.
What would happen with a standard (B)NN (and enough data of suitable
distribution covering several distinct local solutions) is that we just kind of
smoothly interpolate between different local solutions, making the precise
"decision boundary" rather muddy. What we would prefer is that there is a
separate NN for each adjacent region, and that the output of the whole model is
the minimum of the individual models. That way we benefit from smooth
extrapolation (which over small enough distances should work quite well with
appropriate sobolev loss) without interference due to neighboring, distinct
solutions.
How could we achieve this behaviour, even approximately? We cannot just blindly
have N values passed to some min(.) type last layer, because of this:
due to the global structure of the value function, there exists a curve γ(s)
with γ(0) arbitrarily close to the nonsmooth decision boundary from one side,
and γ(1) from the other side, but with the property that V is smooth at all
γ(s) for 0 <= s <= 1. (intuitively, this curve can be constructed by following
one local trajectory forward to almost the goal/equilibrium, then smoothly
connect to the other solution, which we follow in reverse time until we reach
the opposite side of the decision boundary). Therefore, if we apply a naive
min(.) operator, we cannot exactly represent this because there has to be
another switching surface switching between the left and right regions,
somewhere along the curve, where V is actually smooth.
Therefore my next instinct is to approximate a min(.) operator with some
smoothed version thereof, and add an extra parameter which changes the
smoothness of it somehow. e.g. with some softmax thing. These are the
requirements:
- the soft minimum selection should be a close approximation of the actual
minimum close to the decision boundary.
- the soft minimum selection should "fade out", i.e. change its selection only
very smoothly, when we are far from the decision boundary.
- there should be no deterioration in performance if we choose 10 different
"classes" but only 1 or 2 are necessary.
- in general it should not completely break NN fitting.
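a tiny sketch of the kind of soft min combination I mean (the model producing
the k candidate values and κ(x) is hypothetical, this is only the last layer):

import jax
import jax.numpy as jnp

def soft_min(values, kappa):
    # softmax-weighted combination: exact minimum as kappa -> 0, increasingly
    # smooth blend (eventually a plain average) as kappa grows.
    weights = jax.nn.softmax(-values / kappa)
    return jnp.dot(weights, values)

then κ(x) would hopefully learn to be small near decision boundaries (hard
selection) and large far away from them (smooth fade-out), which I think is
roughly what the requirements above ask for.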
in other news:
tried to understand some of the results on "PPO attains global optimality",
went over my head still. their convergence rate depends on the discount factor
though with a singularity at 1 so I doubt it is in any way applicable to the
(IMO much more interesting) undiscounted case.
can we make ddp state constrained in an easy-ish way? I think I wrote this on
some note yesterday. basic idea: during the backward pass have two different
value functions: one from the standard unconstrained problem and one for the
constraint violation \int max(0, g(x(t))) dt. Then, during the forward pass, if
the constraint violation is nonzero, apply inputs minimising its local
expansion, basically replacing V with V_g, otherwise apply inputs minimising
the usual local value function. Does this work? If so, also in continuous time?
Are there problems due to nonsmoothness on constrained arcs?
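to pin down the forward-pass rule I mean, a hypothetical sketch (local_u_star
and the two local expansions are placeholders):

def constrained_forward_input(x, V_expansion, Vg_expansion, violation):
    # if the running constraint violation is nonzero, minimise the local
    # expansion of the violation value V_g; otherwise the usual local V.
    if violation > 0.0:
        return local_u_star(Vg_expansion, x)
    return local_u_star(V_expansion, x)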
is this basically a cheap, not-thought-through version of an augmented
lagrangian type method? still want to grok those papers from Lutter et al. If
working that would probably properly get rid of the nonsmoothness problems.
2024-01-22
Yet another flavor of the main idea. For some time now I've been thinking that
while timescale separation introduces mostly problems for "full problem"
optimal control solvers (ill conditioning, uneven trajectory distributions,
stiff-ish ODEs), in the classical engineering setting it is mostly a blessing
because we can often design controllers separately over timescales (sacrificing
optimality but making it much simpler). From this angle I thought it would be
cool if we can somehow use this timescale separation intuition to our benefit,
i.e. to make learning control laws more efficient, while still solving the
"full" non-decoupled problem.
Wrote a page in the overleaf idea dump about it ("Slow(er) Manifold Following").
The idea is this: Instead of the RRT-style framework I thought about for some
time, try to find a backward optimal trajectory on the slow manifold (i.e.
where fast dynamics don't move much). This can probably be done with local
quadratic value expansion (= DDP backward pass) and by somehow identifying the
slow manifold or its local linearisation. Then once our trajectory inevitably
strays away from the slow manifold we can go forward in time along it until it
is ε-close to the slow manifold, then re-seed the trajectory "exactly" on the
manifold. The crucial change then is that we can immediately re-seed the same
trajectory, instead of doing nearest neighbor searches all the time. Then we
can save only the stitched together trajectory which actually stays (almost) on
the slow manifold and only do NN searches among this smaller dataset. Basically
we then reduce our sampling problem to a lower-dimensional one over the slow
manifold, where we first find "all" relevant causal dependencies and only after
that branch out in the fast subspace (maybe this will not even be necessary due
to already generated data as a byproduct). More thoughts in dump. Maybe this
would be cool still...
Other flavor: Modify the Hamiltonian dynamics of the PMP in such a way that the
slow manifold is stable in backward time, instead of unstable. However this
probably requires knowing the stable manifold in advance, which we don't in
general. Very hard to ensure that the trajectories of the modified dynamics are
somehow a good approximation of the unchanged dynamics with PERFECT
initialisation. This is probably a dead end tbh.
OTOH the full-blown inf-horizon DDP <-> BNN active learning loop would
probably be even more scalable and data efficient. But admittedly less flashy
and esoteric, and harder to pinpoint where exactly my "new" contribution will
be.
Both require quadratic value expansions so maybe I should start by working this
out properly in continuous time...
So the current selection, to summarise, is this:
a) Develop "inf-horizon" continouos time DDP, implement in jax. Connect with
active learning & Sobolev NN to efficiently find new solutions over whole
interesting region. Maybe some sort of "approximate" global optimality.
b) Try to take the "purely backwards" approach further, possibly like the notes
just above
at the moment I am leaning towards a) which is probably the more generally
useful and scalable option. Main components of that:
- inf-horizon, continuous time DDP
- theory (basically figure out those couple papers)
- implementation :)
- Sobolev Bayesian NN
- Which flavour of "bayesianisation"?
- Sobolev how exactly? With hessian approximated by random Hvp?
- Find some nice version that half works with nondifferentiable points?
(see notes from Fr.)
- Intermingling of the two
- How to propose new samples?
- How to handle trajectory optimisation? Fixed iterations? Until converged?
- Data structure considerations? Hopefully not
- Reweight data, e.g. by 1/density, for training?
- Bias somehow so we are more likely to encounter "surprising" trajectories
which are better than the currently known solution?
- Other stuff
- Which examples, costs, engineering tradeoff type goals?
this gon be fun
basic research questions to justify these goals (is this the valid way of doing
science? I am but a lowly engineer after all) :
- How can we achieve maximum sample efficiency? (by choosing correct NN and
sampling strategy)
- Are we able to meaningfully speed up long-horizon trajectory optimisation by
going to continuous time?
- To what degree can we automate the process and make it "hyperparameter-free"?
(probably not that much tbh, I feel like optimal control is in the end just
as much black magic as RL, perhaps with two or three less black first steps.)
other small stuff:
- had an idea over the weekend that H(x, λ)=0 describes something useful and
could help in interpolating the right costate (in RRT idea). convinced myself
though that this is nothing. for all other terminal value functions, we also
have optimal trajectories with H=0. So this does not give much information.
- thought about other ways of modifying the backwards PMP system to make
trajectories more well behaved. I think continuously modifying it, while
always adjusting the V/Vx info using the local expansion, is a bad idea though,
can't think of a way to ensure low approximation error.
2024-01-23
some more thoughts on how to represent the "multiple branched" value function
with an NN. got not much further than last time though. Current idea: have an
NN model which outputs k "possible" value functions Vi(x), and a smoothness
parameter κ(x), to finally give it to a "smooth min" type function to form the
final estimate. Scoured literature for descriptions of the "set of x where
different equally good local solutions are globally optimal", found lots of
names, made a section in idea dump.
Coded up a small demo of trajectory splitting with linear V extrapolation (from
costate), the rest constant, for really tiny "jumps". Looks okay, though I am
unsure how to verify it more precisely. Anyway I am thinking this is not really
the way to go.
Also had thoughts about trying to implicitly represent (= learn) the lifted
manifold, i.e. the set of (x, λ) described by our geodesic spray. However I
think this does not fundamentally solve anything, the "complexity explosion"
with longer time horizon is still the same, finding all local value functions
given some x is still an nx-dimensional rootfinding problem, and we make no
progress towards efficient global optimality with pruning of suboptimal local
solutions. Really I think I have to ditch "proper" global optimality soon :(
However, could we somehow "aid" the whole thing with engineering intuition? I
was just thinking always that the 2D quad has one obviously "nonconvex"
decision, which way to rotate to get to the upright equilibrium of the
attitude subsystem (or perhaps how many times to rotate if we consider
outlandish states). Can we somehow encourage the local solver to "check" both
these cases? Maybe we can make something that obviously cannot in general find
global optima (we can always make pathological OCPs with arbitrary number of
almost equally good local optima: Think about a tiny UAV flying through some
sort of grid from a far distance). Maybe we can then adapt the NN to have
separate models depending on some simple condition, like: Is Vx saying we
should turn left or right? Or based on literally the number of turns from the
equilibrium. Although this could once more be some half assed solution that
only works in weird special cases...
2024-01-24
found another paper about continuous time DDP, which seems to have a nice
derivation of the propagation of the value hessian. todo read this more
carefully, and attempt to recreate?
https://dl.acm.org/doi/pdf/10.1145/3592454
also got convinced by bhavya that making some custom NN model suited to the
particular type of nondifferentiability encountered in value functions is
probably not the way to go. questionable results and not scalable to higher
dims (due to general lack of global solution).
2024-01-25
been reading that paper for some time. still don't understand all the
notational quirks exactly.
have gained new intuition about the riccati eq. this is actually really cool:
instead of finding the rapidly diverging backwards optimal trajectories adjacent
to our trajectory, we find a quadratic/linear parameterisation (= quadratic
taylor expansion of V(x)) of all of them. the good thing is that even if the
dynamics are really rapidly unstable, the associated taylor expansion stays
approximately constant, enabling us to "skip" all the local fast dynamics, and
skip ahead to finding out how to stabilise them locally.
From this viewpoint it is almost nonsensical to propagate Vxx information with
every trajectory in a purely backward PMP integration thing.
viable research goal:
learned *robust* optimal control a la https://arxiv.org/pdf/2107.04507.pdf? and
instead of a gradient GP we use a Sobolev NN for function approximation,
enabling us to use the second derivative. Going from there to optimise active
learning <-> TO interaction. Maybe heuristically "address" global optimality.
for robust control there are ugly technicalities with the terminal set, because
with an adversary we cannot stabilise all the way to the origin, because the
adversary has no input cost and always chooses the maximally destabilising
disturbance, whereas we do have an input cost and at some point stop caring
about small setpoint errors if we have to incur large input cost to fix it. A
very basic example of this is in Pachter & Weintraub.
~~~ other random idea ~~~
In https://gepettoweb.laas.fr/articles/amit_icra_22.html, they use a fixed time
horizon for prediction and use a learned value function as terminal cost. thus
the region where controls are accurately approximated is expanded gradually,
and they find infinite-horizon controls in the end.
I have been put off by this idea for some time because stitching trajectories
together gives us another source of approximation error (though I have the
intuition that this error is somehow very small), and also it gives us an
excuse to implement continuous time DDP, ofc with the hope that adaptive
solvers can "skip" the long boring "turnpike" parts of the trajectory quickly
and give effectively inf horizon optimal controls with feasible effort.
However, what if we do the following. We have a BNN with accurate uncertainty
quantification. To find the initial guess for TO, do a forward sim with
predicted u* from BNN. Evaluating uncertainty in u is no extra work so we might
as well do it, and stop the simulation once we are in a region where
uncertainty is below some small threshold. Then we do the TO with the BNN
(mean?) value function as terminal cost, over a much shorter horizon, hopefully
only concerning the fast part of the dynamics.
This *should* work intuitively right? If there is a "slow" manifold we expect
to have relatively many trajectories there because they all stabilise to that
region. Therefore we can hope there is a large region with low uncertainty
after some active learning iterations, and that this region is quickly reached
in the initial closed loop sim.
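rough sketch of that warm-start procedure (bnn.policy(x) -> (u_mean, u_std),
the dynamics f(x, u), dt and the tolerance are all placeholders of mine):

import numpy as np

def warm_start_rollout(x0, bnn, f, dt=0.01, u_std_tol=1e-2, max_steps=10_000):
    xs, us = [x0], []
    x = x0
    for _ in range(max_steps):
        u_mean, u_std = bnn.policy(x)
        if np.max(u_std) < u_std_tol:
            break                              # confident region reached, stop
        x = x + dt * np.asarray(f(x, u_mean))  # closed-loop Euler step
        xs.append(x)
        us.append(u_mean)
    # the rollout becomes the TO initial guess; the BNN (mean) value at xs[-1]
    # then serves as terminal cost for the much shorter trajectory optimisation.
    return np.array(xs), np.array(us)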
Are there concerns with the terminal state changing during TO, possibly to a
"worse" region? Intuitively, we should expect that it only changes to better
regions. Is this true? Try a proof:
Because the initial guess is a closed loop sim, it is a suboptimal trajectory.
Join it with the rest (from T to inf), suppose the rest is actually optimal
(reasonable since uncertainty low there). Then we know the cost-to-go from x0
will be lower after TO than before (bc. we've gone from suboptimal to optimal).
Does this tell us anything about the terminal state? Somehow I have the feeling
it doesn't if we have a time horizon in physical time, without warping...
We can also just check at the end whether the BNN at the new terminal state is
still certain enough, and if not, ditch the trajectory, or re-do with longer
horizon in the next iteration.
current list of very relevant other publications:
DDP treatments in continuous time:
Hutter et al.
https://dl.acm.org/doi/pdf/10.1145/3592454 (w/ general parameters for diff)
https://arxiv.org/pdf/2101.06067.pdf (w/ state constraints)
Sun et al, w/ terminal constraints
https://www.semanticscholar.org/paper/b00a864b457992c5f87a4e0e24382b33c26512f4
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10156422 (w/ control constraints)
+ HJI/robustness:
https://www.semanticscholar.org/paper/5ae21807233c95c5485da721a96cb9ff06a659de
https://arxiv.org/pdf/2107.04507.pdf
Trajectory Opt <-> Learning:
https://arxiv.org/pdf/2107.04507.pdf
https://gepettoweb.laas.fr/articles/amit_icra_22.html
CACTO: https://arxiv.org/pdf/2211.06625.pdf
CACTO-SL: https://arxiv.org/pdf/2312.10666.pdf (Sobolev!!!)
Initial Sobolev learning paper: https://arxiv.org/abs/1706.04859
+ something about BNNs/active learning but everyone knows how these work
...need to be precise about research question, specifically, what are we doing
...better than these?
2024-01-26
read the different continuous time ddp papers for a bit longer. i think I gained
new intuition.
backwards propagation of V, Vx, Vxx is actually similar to directly evaluating
the characteristic trajectories backwards, but "modified" so they follow the
forward dynamics. Imagine the PMP characteristic is a particle following some
vector field (which it is), and imagine the Vxx term which we carry with it as
a local representation of the value level set, i.e. the "front" of possible
adjacent trajectories. We can imagine replacing our point by an adjacent point
on the front, and if close enough, we should still get an accurate optimal
trajectory.
The DDP backward pass consists of essentially the same: propagating a particle
backwards in time together with the "front" of nearby possible other points.
The difference is the particle doesn't follow the vector field itself but
travels on the trajectory obtained in the forward pass, like on a rail. Then
there have to be some adjusting terms to account for the "sideways" movement of
our point. Because we implicitly (using Vxx) keep track of all nearby
trajectories, if the optimal trajectory is nearby (read: close enough so the
system is essentially linear in the region between), we have already found it
during the backward pass and just need the forward pass to "retrieve" it (and
to check whether it is actually optimal). If we are further away it is more
complicated, proving that we have a descent direction we can use for line
search is above my paygrade.
Now, the implementation is still not done, I still have to grok some notation
from the papers. Certainly this is the next task.
Also, installed ppopt (explicit multiparametric QP solver) and ran the test
example (control allocation on octacopter). It finishes in like 5 seconds and
gives a solution object with 500ish critical regions. But evaluating the
solution at some point does not work, it thinks the point is not in any
critical region. Want to find out why, also, am unsure whether using such a
poorly documented tool is worth the hassle. Maybe I just stick with 1D or 2D
input sets...
2024-01-30
copypasting from last time: the overarching project goals as of now.
a) Develop "inf-horizon" continouos time DDP, implement in jax. Connect with
active learning & Sobolev NN to efficiently find new solutions over whole
interesting region. Maybe some sort of "approximate" global optimality.
at the moment I am leaning towards a) which is probably the more generally
useful and scalable option. Main components of that:
- inf-horizon, continuous time DDP
- theory (basically figure out those couple papers)
- implementation :)
- Sobolev Bayesian NN
- Which flavour of "bayesianisation"?
- Sobolev how exactly? With hessian approximated by random Hvp?
- Find some nice version that half works with nondifferentiable points?
(see notes from Fr.)
- Intermingling of the two
- IMO this is the actual interesting part.
- How to propose new samples?
- How to handle trajectory optimisation? Fixed iterations? Until converged?
- Data structure considerations? Hopefully not
- Reweight data, e.g. by 1/density, for training?
- Bias somehow so we are more likely to encounter "surprising" trajectories
which are better than the currently known solution?
- some sort of heuristic global optimality thing? maybe assuming we know about
the number or structure of the different "modes"?
- Other stuff
- Which examples, costs, engineering tradeoff type goals?
been trying to grok the two main continuous time DDP papers again today. Tried
to develop my own intuition, connect w/ characteristics approach, and see
whether deriving the backward propagation from that angle gives us the same,
and maybe lets me understand the papers. Got a bit further on the theory front
but no implementation that is working yet.
2024-01-31
meeting. discussion about whether or not it is worth it to implement own,
continuous time optimiser. they say: probably not, probably they are right. i
can still do this if I get the rest working. then I can focus on the more
interesting part: how to do active learning sensibly, how to handle
different local solutions, how to make it infinite horizon.
Problem is, in discrete time we get discretisation artefacts, we get no clearly
defined time invariant value function (or do we?) Also the inner optimisation
is now concerned with minimising the cost over an actual time interval, making
everything nonlinear and nonconvex and ugly. and yet more, we don't get to
profit from adaptive step sizes or anything like that. I really don't feel like
"throwing away" all the nice continuous time things. Is this a sunk cost
fallacy?
Stuff to look up:
- Offline RL (learning optimal policy from fixed, suboptimal data)
- average-reward MDP (undiscounted/inf-horizon problems)
goal for next time: formulate overarching goal, high level problem description
in cohesive and clear form
found yet another paper from Farshidian, Buchli, etc. about continuous time
constrained trajectory optimisation: https://arxiv.org/pdf/1701.08051v2.pdf
They state pretty clearly the advantages over NLP-based approaches:
"
In general, NLP-based planning algorithms require the discretization of the
infinite dimensional, continuous optimization problem to a finite dimension
NLP. This discretization is often carried out using heuristics, which can
result in numerically poor or practically infeasible solutions. Our algorithm,
by contrast, is a continuous-time method which uses variable step-size ODE
solvers in its forward and backward passes. Given the desired accuracy, it can
automatically discretize the problem using the error control mechanism of the
variable step-size ODE solver. Informally speaking, this allows the solver to
indirectly control the distance between the "nodes" in the feedforward and
feedback trajectory. In practice, this decreases the runtime of an iteration,
since the number of calculations decreases.
"
2024-02-01
finished own derivation of how the value taylor expansion evolves in the
backward pass. just couldn't help it. coded it up in skeleton form, still a few
things missing, I almost fear trying it out.
2024-02-02
continued hacking a bit at the implementation. been thinking about how we best
access the RHS used to make the forward sim in the backward pass.
mathematically we can just evaluate the time derivative of the solution
trajectory. however, it is a polynomial interpolation, not the real thing.
Compared the two with a plot and we indeed see rather large errors (more than
10% of the state values themselves). So, there seems to be no shortcut to
evaluating the RHS again, given that the ODE solver chooses different time
points anyways. And, come to think of it, we need the state derivative of the
RHS too which will dominate computational cost in all cases.
maybe using the interpolation is even better? the interpolation's time
derivative tells us where our actual approximate "solution" trajectory goes,
while the RHS evaluated again tells us where it "should have gone", loosely
speaking. will have to try it out. using the derivative is certainly easier
though. but even if we do that we need the previous input anyway, and to get
that reliably, the previous costate, and in turn to get that, either the full
data from the last backward pass, or some interpolation of the costate seen
during the forward pass.
ideas to start working next monday (high priority):
- formulate overarching goals neatly, maybe in own latex document?
- finish the backward pass implementation, debug, sanitycheck, work towards DDP
- if it completely fails, resort to trying to understand other papers
- think about active learning part and local/global heuristics.
- maybe start using some finished DDP implementation for other experiments?
low priority:
- implement inner convex optimisation in general case
(brute force active set = region-free explicit mpQP?)
- come up with example dynamical systems illustrating different up/downsides of
our approach
- think abt extensions (2 player differential game? parameterised?) in fact
just the combination of those two (basically unifying Hutter et al. with
Sun&Kleiber would be a great novelty but also considerable coding work)
2024-02-05
...vs what I actually did: re-do the calculation of \dot S along
characteristic curves. The thing from last time turned out to be not
symmetric, therefore it must be wrong (still unsure where exactly I went wrong
though). The second try is akin to the DOC derivation: write down a solution
of the characteristics plus some small variation as (x, \lambda) + (\delta x,
\delta \lambda). Then by subtracting the non-delta part and doing a first
order taylor approximation we get "differential dynamics", a time varying
linear system for the state variable (\delta x, \delta \lambda). like DOC we
then parameterise \delta\lambda as a linear function of \delta x, S \delta x. After juggling around
some terms the "differential dynamics" also tell us how this matrix S evolves,
we get something depending on different second partial derivatives of H^\star
and also S itself, looking very reasonable. Let's try it out in jax.
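for reference, the equation this kind of derivation gives (at least when I redo
it quickly, so take the signs with a grain of salt) is the matrix riccati eq.
\dot S = -(H^\star_{xx} + H^\star_{x\lambda} S + S H^\star_{\lambda x} + S H^\star_{\lambda\lambda} S),
whose RHS in jax would be something like this, with H_star(x, lam) the
pre-optimised hamiltonian:

import jax

def S_dot(H_star, x, lam, S):
    # second partials of the pre-optimised hamiltonian at (x, lam)
    Hxx = jax.hessian(H_star, argnums=0)(x, lam)
    Hll = jax.hessian(H_star, argnums=1)(x, lam)
    Hlx = jax.jacobian(jax.grad(H_star, argnums=1), argnums=0)(x, lam)
    # from the differential dynamics with the parameterisation dlam = S dx
    return -(Hxx + Hlx.T @ S + S @ Hlx + S @ Hll @ S)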
2024-02-06
was a bit sick in the morning, feared i had covid, slept in. was better by 10.
did some more philosophising about the differences between the DOC derivation
(LQ-isation of entire OCP about current iterate + solution of linear BVP) and
the weird thing I did (follow characteristic curves first and foremost, but
adjust the taylor expansion to account for the different direction in which
the forward trajectory points.) I may have found the central difference,
namely that in our version we are taylor expanding the pre-optimised
hamiltonian about the characteristic field given by current x and lambda, and
with it the u* map which is piecewise linear. In DOC this taylor expansion is
formed about the iterate trajectory. But also they have no input constraints
and no piecewise linear u*, just a linear one. Maybe read Sun&Kleiber more
precisely again to see how they handled it?
then, continued the implementation (of our, almost-characteristics-following
version) and got to the first run where numbers come out on the other side
juhuuu ;) On the first try the value and costate match the forward trajectory
(which was evaluated in backward time and so is already optimal). sanity check
passed! The entries of the S matrix look kind of sus to be honest but then
again it is a trajectory which initially has the uav spinning quickly so that
is kind of to be expected. more sanity checking follows tomorrow, goodnight :)
2024-02-07
implemented the forward pass! it seems to work as expected!!!! except for some
small questions, such as: why does S = V_xx sometimes turn
non-positive-definite?
2024-02-07
had big enlightenment this morning by feeling like the DOC derivation is the
correct one after all. by properly approximating the problem about the current
iterate with quadratic cost and linear constraints, then solving the resulting
linear subproblem, they basically build a standard newton method, with easy
proofs of convergence, line search etc. Went over the DOC style derivation
again, I think I have it correct now, first implementation looks alright
(symmetric \dot S and very close to other version). Not sure if they are both
actually the same. Maybe the only difference is really which linearisation for
the u* map we are using.
spent some time thinking about line search. in contrast to finite-dim
optimisation, we cannot just add arbitrarily many iterates on top of each
other, but standard line search basically requires this. by letting x+ = x +
alpha * (descent direction) we are making x+ a linear combination of the first
iterate and all descent directions.
so we need to be somewhat smart about this. brute force solution:
- attempt full newton step, if some sensible descent condition holds, accept.
- if not, repeat backward pass, with an additional "regularization" term
penalising ||x - x_iterate||. this gives still a LQ subproblem which we can
solve exactly. Do the forward pass again, check descent condition. If OK,
return; otherwise repeat the step with a larger "closeness" penalty.
Repeating the backward pass is expensive though. Other papers rely on just
applying a scaled down version of \delta x and \delta u. But with input
constraints this will be a bit complicated. Would much rather have a 'scaled
down' value function. But to do that properly we'd need to keep track of the
local value function that generated the last forward pass. And if that was
already made by combining the previous backward passes we need both of those
too.
easy heuristic which maybe works: Do only 1 backward pass. If descent condition
fails, then do another forward pass, but with local value function:
V(t, x) = v(xbar(t)) + λ(xbar(t)).T (x - xbar(t)) + α/2 (x - xbar(t)).T S(t, xbar(t)) (x - xbar(t))
for some α > 1 which we may increase to increase the quadratic value term. this
corresponds to using a larger linear feedback gain which surely will not work
for significant α.
There must be something akin to "step in the descent direction but only a
fraction" that does not have all these issues...? how?
2024-02-07
did some further attempts at replicating DOC's S and s updates but to no avail.
mostly looked at pretty plots and figured out how to record webcam timelapses
from mjpeg streams.
points to start next week (this is essentially copy pasted from last week...)
- (still) formulate overarching goals neatly, maybe in own latex document?
- further debug & sanitycheck the backward pass, work towards DDP
- either decide on DOC or my implementation, or do both separately and compare
(but stop switching every half day)
- think about active learning part and local/global heuristics.
- maybe start using some finished DDP implementation for other experiments?
2024-02-12
re-did the derivation from the characteristics angle without the error I made
previously (in confusing a total and partial derivative of the hamiltonian).
Got the same formula I have from the DOC style derivation for \dot S. looks
good.
did a couple sweeps (move initial state a bit, do a fixed amount of DDP
iterations) to obtain a family of (hopefully) optimal trajectories. Next up,
further sanity checking: Are the trajectories actually optimal? Find some way
to see how close to a solution we are.
But all in all this looks great. Was also happy about how quick it seems to be
after jit. it does 32 iterations, for the flatquad model in like 0.1 seconds :)
2024-02-13
continued playing around with ddp. made a cleaner implementation in functional
jax style with some extra abstraction. did some initial state sweeps. looks
pretty nice :) i think for the moment this should be workable.
2024-02-14
continued the implementation & some tests. made plots and brought meshcat
visualiser back from the dead. looks all pretty nice. also replaced the
initialisation with backward shooting by a simpler one using an LQR-controlled
solution near the goal state.
2024-02-15
trying some more "difficult" states to sweep to. quickly the ODE integrator
steps explode. first remedy i am trying now: set maxsteps to something
outlandish like 10000, and set step controller dtmin to T/maxsteps.
theoretically this should work! but maybe at the expense of accuracy. there is
also the norm argument which might be useful especially in the backward pass:
norm: A function PyTree -> Scalar used in the error control. Precisely, step
sizes are chosen so that norm(error / (atol + rtol * y)) is approximately one.
in the forward pass as long as the system is about unity scale the default,
probably the 2-norm, should be good enough.
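the kind of settings I mean, roughly (rhs, T, y0 are stand-ins, and I'm going
from memory of the diffrax argument names):

import diffrax

max_steps = 10_000
controller = diffrax.PIDController(
    rtol=1e-5, atol=1e-7,
    dtmin=T / max_steps,       # cap how small the steps may get
    # norm=my_norm,            # optional custom PyTree -> scalar error norm
)
sol = diffrax.diffeqsolve(
    diffrax.ODETerm(rhs),      # rhs(t, y, args)
    diffrax.Tsit5(),
    t0=0.0, t1=T, dt0=1e-2, y0=y0,
    stepsize_controller=controller,
    max_steps=max_steps,
)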
alright the problem is not the ODE integrator probably but some other error.
when sweeping from initial state 0 to Phi=pi/2, initially all goes well for low
angles, but suddenly, the angular rate goes either way up or way down. this
smells somehow like wrong treatment of input constraints to me. I'll plot the
input time series next.
they look quite decent, and already have significant constrained parts before
it goes wrong, so probably we are alright there. the backward pass in iteration
816 (first one that goes wrong) blows up to like 1e7 which the previous ones
don't do...
lowered solver tolerance on backward pass, and the particular instance of
blowing up is removed.
finding a new problem where NaNs creep up in sol.evaluate somehow. or wait, no,
it is the same old problem where backward pass blows up.
made plotting code for the backward pass in the jax-ish, functional
implementation. did not help to definitively find the issue. sometimes S explodes, causing
crazy unstable forward pass with basically random bang-bang input, messing up
all subsequent runs.
2024-02-16
meeting notes. talked at length about "grand plan".
suggestions for next time (in addition to what I planned to do anyways):
- think about how to compare continuous time DDP to alternatives
(e.g. discretised DDP with Euler/RK4, or adaptive collocation+NLP)
and see what it can do better, and what worse
- think about active learning problem/formulation.
do we do uncertainty sampling in V or V_x? Bhavya has some points in
favour of V, mostly that the problem is closer to the standard
formulation, V is continuous, and there is an easy rule which V is the
best among different local solutions (the lowest). for V_x or even u
it is a bit hairy in contrast.
- make some sort of collection/visualisation of related works.
2024-02-19
trying to investigate the effects of evaluating d/dt solution(t) instead of
RHS(solution(t)). For the first, very easy (pretty much in linear domain)
example we already see ripples in the forward pass solution that do not appear
in the initial guess (made with LQR value function). Possibly the problems I'm
seeing are due to this.
turns out, there are ALWAYS ripples at least in the omega state which, while
not hugely messing up the state info, are not very small and certainly render
the derivative useless.
first tried to store the costate in the forwardpass solution using the fn argument
of SaveAt, however that does not make it to the interpolation, so is useless.
https://github.com/patrick-kidger/diffrax/issues/301
so to do it properly we kind of need to evaluate the interpolated previous backwardpass
during the current backwardpass to know the direction of the previous forward solution.
but here is the kicker: to do that, we need to know at what point the taylor
expansion (from the previous backward solution) was done. To know this we have
to know the forward solution even before that!!!
so, will this just fundamentally not work unless we are willing to evaluate a
growing number of ODE solutions summed together? This is kind of the same
issue as line search: To do it properly we have to have an iterate which is a
weighted sum of all past iterates. This is trivial in the finite dimensional
case. But in our case the adaptive discretization messes everything up. Is
this really a dead end? Would be very sad.
--> this "kicker" is actually not true. we only need for a particular backward
pass the forward solution immediately before that, and the forward/backward
solutions that produced it.
To make it nicer i reordered forward and backward pass, making current
iteration again only depend on last iteration. was a whole day of refactoring.
Now it works (apparently) very similarly to before but we eliminated one
possible cause of uncontrolled numerical error buildup. Need to look more