2024-01-15
decided to use this log again for daily stuff.
looked into value expansion version again, from DDP type solvers in continuous
time. https://dcsl.gatech.edu/papers/acc15e.pdf here is a nice derivation for
the HJI equation.
also debating: if we go with this value expansion option, we will have to do
more computation per trajectory; the state is enlarged by an (nx, nx) matrix
(symmetric so maybe half that). is it worth it? can we effectively explore the
state space without too much unnecessary data?
thinking a lot about the "fundamental limits" of this purely backward-in-time
way of solving the problem. inevitably the bottleneck will be travelling along
long, turnpike-type trajectories, when the fast dynamics, which are unstable in
backward time, constantly push us away from it. Will we actually end up
creating much more data than what would be needed? Would it be a smart(ish)
idea to throw away *some* of the trajectories to limit dataset size?
also, intuitively, it seems like higher dimensional problems are even harder
from this perspective. If we want to re-start HJB characteristics away from the
terminal set, we need accurate knowledge of λ(x) and V(x) at that point. This
in turn we can only do when the data density in that region is already high, so
that either some nearest neighbor interpolation or taylor extrapolation is
accurate enough. But does this even work in high dimensions? Basic "curse of
dimensionality" intuition says that as dimension increases, this type of
nearest neighbor search becomes an increasingly bad idea: The dataset size
needed to satisfy some fixed data density is exponential in dimensionality.
(another interesting question here would be to challenge the assumption that we
need accurate V and λ to continue a HJB characteristic curve. If λ is slightly
wrong, can we prove something about the suboptimality of the "stitched"
characteristic curve w.r.t. the actual optimal trajectory?)
In the same vein, my initial assumption kind of has been: Better to "stitch
together" optimal trajectories by re-initialising the characteristic ODE (a),
than to re-calculate the whole final part of the trajectory (b). Is this true
though? To do (a), we have to interpolate the costate and value info, which in
higher dimensions might be problematic. (and even if the sets from which most
backward trajectories start locally resemble a lower-dimensional subspace,
interpolation requires lots of care to not go "outside" of that subspace too
much. This is also something where I think quadratic value expansions can help
a lot: the hessian of the value gives us a nice distance metric, which maybe
encapsulates this "lower dimensional turnpike subspace" thing which hopefully
exists)
all of this makes me ask the question: would a DDP type solver be a better
overall fit? the benefit of our approach (hopefully) is that we only generate
(locally) optimal trajectories, wasting no time on iteratively adjusting
suboptimal ones. However, that advantage can be nullified if in the process of
exploring the state space with optimal trajectories, we generate much more of
them than what we would need for training an NN. Maybe we could have made a
similarly good dataset with far fewer DDP-generated optimal trajectories: each
one is a bit more expensive to calculate, but we would only generate the amount
actually needed, which is probably a lot less than with the current idea.
this makes me heavily question the purpose of the project. if we just take a
DDP solver, why not one of the 100s that already exist? If we do our
characteristics type thing we can already guess the conclusion*, so why do it?
* something like:
we attempt to solve control problems to global optimality by evaluating optimal
trajectories backwards. end up with no global optimality guarantee, best case
some probabilistic asymptotic approximate thing. It works great for
"non-multiscale", low-ish dimensional problems, for others we end up generating
huge datasets for not much tangible benefit, all while algo complexity is
basically O(dataset size^2).
ways to address this:
- be aware that it probably won't be state of the art, have fun exploring
algo details and heuristics to speed it up and stuff.
- connect with previously existing methods (ddp) and work from there to some
sort of global optimality guarantee
- do a 180 and start a somewhat different project? (PINN? active exploration?)
also still kind of debating thoughts from last week: should we go all in on
making it efficient with the right data structures? store trajectories in a
tree or DAG like thing based on "flow" of information? use some black magic to
speed up nearest neighbor search, like k-d tree or locality sensitive hashing?
because pretty much no matter how we go about it, we will not only create a
sizable data set of trajectories, but also have to do many, many
nearest-neighbor-type searches over it.
this was a lot of maybe incoherent rambling. sorry to my future self and
whoever reads this. maybe the final paper will be more readable \o/
2024-01-16
looked through literature. found out that our gradient enabled neural network
is already known under one more name: Sobolev training (sobolev spaces are
function spaces endowed with a norm that looks like a p-norm of a vector of
function norms of all partial derivatives up to order k). specifically there is
also an interesting phd thesis:
https://gepettoweb.laas.fr/articles/amit_icra_22.html. They basically use a DDP
type solver coupled with global value function approximation, to find infinite
horizon optimal control over a large domain by iteratively using the learned
value function as terminal cost.
Should we try to expand this to some sort of provable global optimality?
Because I don't think they do that there. They claim "So, when used as a
proxy for terminal cost functional, ∂PVPv [name of their algo] tends to drive
the locally optimal solver toward the globally optimal solution.", but give no
proof or explanation.
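for my own reference, a minimal sketch of what such a sobolev-style loss could
look like in our setting (fit V and its gradient = costate at sampled states).
value_fn(params, x) -> scalar and the weighting w_grad are placeholders of
mine, not what they actually use:

import jax
import jax.numpy as jnp

def sobolev_loss(params, value_fn, xs, v_targets, lam_targets, w_grad=1.0):
    # predicted values and value gradients (costates) at all sampled states
    v_pred = jax.vmap(value_fn, in_axes=(None, 0))(params, xs)
    lam_pred = jax.vmap(jax.grad(value_fn, argnums=1), in_axes=(None, 0))(params, xs)
    value_term = jnp.mean((v_pred - v_targets) ** 2)
    grad_term = jnp.mean(jnp.sum((lam_pred - lam_targets) ** 2, axis=-1))
    return value_term + w_grad * grad_term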
possible decent contribution would also be to do the infinite horizon in one
single trajectory with some sort of time reparameterisation. reduces the number
of approximations and weak links, and I think with adaptive solver this can
actually be quite efficient. In fact, I think inf-horizon DDP itself would already
be a cool contribution, even without the "hopefully global" optimality.
other new interesting papers in semanticscholar folders:
- ddp-ish value expansion
- local -> global optimality
- sobolev learning
- V approx, transfer learning, etc.
small tidbits i thought about:
- in continuous time DDP apparently the hessian Vxx is always positive
definite, in contrast to the more often used time-discretised version. Does
this hold even when we have a really sloppy numerical solver? would be cool if
yes
- NN learning: only consider data points with v(x) <= v, and sweep up v during
training? this way we might explicitly identify the points when different
local solutions start conflicting, and maybe do something about it.
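quick sketch of that last sublevel-set idea (v_max swept upward during
training; names are placeholders):

import jax.numpy as jnp

def masked_value_loss(v_pred, v_target, v_max):
    # only points currently inside the sublevel set {x : v(x) <= v_max} count
    mask = v_target <= v_max
    n = jnp.maximum(jnp.sum(mask), 1)
    return jnp.sum(jnp.where(mask, (v_pred - v_target) ** 2, 0.0)) / n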
2024-01-17
had meeting. talked about all the concerns from last few days and got
reassuring answer that yes, these are hard and, in practical terms, unsolved
problems. advice: just explore around for another week and slowly try to come
up with a sensible research question.
mentioned frustration that "everything" has already been done, no "niche" left.
answer: "own" niche can also be very small, or a contribution to extend
existing work can also be quite small to be significant.
current options:
- explore (probably limited) possibilities with pure PMP backward integration.
also sometimes labelled geodesic flow, geodesic spray, etc. probably only
simple systems, low dim, not multiscale.
- explore backwards PMP integration + quadratic value expansion. the appeal is
obviously still no iterative optimisation over trajectories/inputs. but
scalability concern due to instability and huge dataset remains -- can we
still optimise the hell out of it? with smart proposal strategies or fancy
data structures?
- go for a DDP like solver method, similar to PhD thesis of Amit Parag, and try
to come up with tricks to "encourage" global optimality. Lots of ideas around
BNNs with asymmetric loss, Sobolev learning, active learning, etc. currently
this does seem the most interesting to me.
- actually already infinite-horizon, continuous-time DDP could be a cool
contribution itself...?
- sth else?
different, mostly-orthogonal directions to expand the problem setting beyond
inf-horizon, continuous time:
- some sort of approximate *global* optimality. probably though there just
isn't an actual shortcut to this. give up?
- input constrained (in general dims -> QP solver finds u*)
- state constrained (in the sense of safety limits)
- state constrained (in the sense that x is on a manifold)
- robust (with adversary, HJI equation)
- parameterised family of dynamics
other than that:
- devoured more literature about continuous-time DDP-ish solvers, especially
interesting ones to do w/ HJI/robust stuff.
- found a nice paper comparing different BNN methods to the "true" posterior
obtained with Hamiltonian Monte Carlo, put it in a new BNN folder.
- found a couple more papers about what I think is essentially the same as
pontryagin backward integration like in SA, one example was planning min-time
flight trajectories on the actual globe with known wind field and simplified
single integrator dynamics. that is actually a cool application, where the
multiscale limitations do not come into play.
Random assorted thoughts, about making DDP work in inf-horizon:
I am doubtful as to whether the time-value rescaling, or some different time
rescaling, is needed. Probably the adaptive solver will be just as efficient
without it [citation needed]. What is more interesting then is how do we choose
the time horizon. Simple first idea: just choose a large-ish time horizon and
define a terminal set Xf (eg. LQR value sublevel set). Then, optimise
trajectories using DDP. If the trajectory ends inside of Xf, good. If not,
increase the time horizon. In continuous time and w/ adaptive solvers, [i hope
that] a "too" long time horizon will not be much of an issue, especially if
most of the time is spent within Xf, where the ODE solver can choose large
steps.
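just to make that loop concrete, a throwaway sketch (ddp_solve and lqr_value
are hypothetical placeholders for the finite-horizon solver and the LQR value
function defining Xf):

def solve_to_Xf(x0, T=5.0, sublevel=0.1, grow=2.0, max_tries=6):
    for _ in range(max_tries):
        traj = ddp_solve(x0, T)                   # finite-horizon DDP solve
        if lqr_value(traj.xs[-1]) <= sublevel:    # ended inside Xf?
            return traj
        T = grow * T                              # otherwise lengthen horizon
    raise RuntimeError("did not reach Xf even with the longest horizon")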
And about distributions/ill conditioning/timescale separations:
also, many of the problems we see (ill conditioning, bad distributions) arise
from basically the fact that systems are decomposable across time scales.
Traditionally, this makes engineering solutions easier, however if we aim to
solve the whole problem at once it becomes much more difficult. Can we somehow
use these time-scale decomposition heuristics to maybe improve/inform our
sampling process, while still ultimately finding the actual solution to the
full problem? not so sure how exactly to go about this.
about possible approximate global optimality:
this seems admittedly hard and convoluted in the framework DDP + Sobolev BNN
value function. One possibility: keep all trajectories in a data set. For each
proposal, also find nearest neighbor trajectories, and for each one of them (*)
try to do a homotopy method, moving its initial condition to the proposal while
keeping local optimality w/ DDP.
*: To speed this up we can first do some k-means clustering of the trajectories
in terms of lambda(x0), or in terms of some sampled points along the
trajectory. Probably we get away with at most producing k=2 clusters since
(probably?) the set of points where n+1 different locally optimal trajectories
are equally good has one dimension less than the same set for n different
trajectories. So we might choose the minimum nontrivial number 2...
Other possibility: somehow use a sobolev BNN with multimodal posterior. But
this will be worth nothing if we don't have the data. you know, I am starting
to think that global optimality is just actually not worth worrying about, and
instead we will have to keep relying on engineering intuition about the specific
system to decide if we are happy with the local solution at hand.
2024-01-18
thinking about ways to do function approximation which maybe works well given
the structured nonsmoothness encountered in value functions.
What would happen with a standard (B)NN (and enough data of suitable
distribution covering several distinct local solutions) is that we just kind of
smoothly interpolate between different local solutions, making the precise
"decision boundary" rather muddy. What we would prefer is that there is a
separate NN for each adjacent region, and that the output of the whole model is
the minimum of the individual models. That way we benefit from smooth
extrapolation (which over small enough distances should work quite well with
appropriate sobolev loss) without interference due to neighboring, distinct
solutions.
How could we achieve this behaviour, even approximately? We cannot just blindly
have N values passed to some min(.) type last layer, because of this:
due to the global structure of the value function, there exists a curve γ(s)
with γ(0) arbitrarily close to the nonsmooth decision boundary from one side,
and γ(1) from the other side, but with the property that V is smooth at all
γ(s) for 0 <= s <= 1. (intuitively, this curve can be constructed by following
one local trajectory forward to almost the goal/equilibrium, then smoothly
connect to the other solution, which we follow in reverse time until we reach
the opposite side of the decision boundary). Therefore, if we apply a naive
min(.) operator, we cannot exactly represent this because there has to be
another switching surface switching between the left and right regions,
somewhere along the curve, where V is actually smooth.
Therefore my next instinct is to approximate a min(.) operator with some
smoothed version thereof, and add an extra parameter which changes the
smoothness of it somehow. e.g. with some softmax thing. These are the
requirements:
- the soft minimum selection should be a close approximation of the actual
minimum close to the decision boundary.
- the soft minimum selection should "fade out", i.e. change its selection only
very smoothly, when we are far from the decision boundary.
- there should be no deterioration in performance if we choose 10 different
"classes" but only 1 or 2 are necessary.
- in general it should not completely break NN fitting.
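a tiny sketch of the kind of soft min combination I mean (the model producing
the k candidate values and κ(x) is hypothetical, this is only the last layer):

import jax
import jax.numpy as jnp

def soft_min(values, kappa):
    # softmax-weighted combination: exact minimum as kappa -> 0, increasingly
    # smooth blend (eventually a plain average) as kappa grows.
    weights = jax.nn.softmax(-values / kappa)
    return jnp.dot(weights, values)

then κ(x) would hopefully learn to be small near decision boundaries (hard
selection) and large far away from them (smooth fade-out), which I think is
roughly what the requirements above ask for.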
in other news:
tried to understand some of the results on "PPO attains global optimality",
went over my head still. their convergence rate depends on the discount factor
though with a singularity at 1 so I doubt it is in any way applicable to the
(IMO much more interesting) undiscounted case.
can we make ddp state constrained in an easy-ish way? I think I wrote this on
some note yesterday. basic idea: during the backward pass have two different
value functions: one from the standard unconstrained problem and one for the
constraint violation \int max(0, g(x(t))) dt. Then, during the forward pass, if
the constraint violation is nonzero, apply inputs minimising its local
expansion, basically replacing V with V_g, otherwise apply inputs minimising
the usual local value function. Does this work? If so, also in continuous time?
Are there problems due to nonsmoothness on constrained arcs?
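to pin down the forward-pass rule I mean, a hypothetical sketch (local_u_star
and the two local expansions are placeholders):

def constrained_forward_input(x, V_expansion, Vg_expansion, violation):
    # if the running constraint violation is nonzero, minimise the local
    # expansion of the violation value V_g; otherwise the usual local V.
    if violation > 0.0:
        return local_u_star(Vg_expansion, x)
    return local_u_star(V_expansion, x)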
is this basically a cheap, not-thought-through version of an augmented
lagrangian type method? still want to grok those papers from Lutter et al. If
working that would probably properly get rid of the nonsmoothness problems.
2024-01-22
Yet another flavor of the main idea. For some time now I've been thinking that
while timescale separation introduces mostly problems for "full problem"
optimal control solvers (ill conditioning, uneven trajectory distributions,
stiff-ish ODEs), in the classical engineering setting it is mostly a blessing
because we can often design controllers separately over timescales (sacrificing
optimality but making it much simpler). From this angle I thought it would be
cool if we can somehow use this timescale separation intuition to our benefit,
i.e. to make learning control laws more efficient, while still solving the
"full" non-decoupled problem.
Wrote a page in the overleaf idea dump about it ("Slow(er) Manifold Following").
The idea is this: Instead of the RRT-style framework I thought about for some
time, try to find a backward optimal trajectory on the slow manifold (i.e.
where fast dynamics don't move much). This can probably be done with local
quadratic value expansion (= DDP backward pass) and by somehow identifying the
slow manifold or its local linearisation. Then once our trajectory inevitably
strays away from the slow manifold we can go forward in time along it until it
is ε-close to the slow manifold, then re-seed the trajectory "exactly" on the
manifold. The crucial change then is that we can immediately re-seed the same
trajectory, instead of doing nearest neighbor searches all the time. Then we
can save only the stitched together trajectory which actually stays (almost) on
the slow manifold and only do NN searches among this smaller dataset. Basically
we then reduce our sampling problem to a lower-dimensional one over the slow
manifold, where we first find "all" relevant causal dependencies and only after
that branch out in the fast subspace (maybe this will not even be necessary due
to already generated data as a byproduct). More thoughts in dump. Maybe this
would be cool still...
Other flavor: Modify the Hamiltonian dynamics of the PMP in such a way that the
slow manifold is stable in backward time, instead of unstable. However this
probably requires knowing the stable manifold in advance, which we don't in
general. Very hard to ensure that the trajectories of the modified dynamics are
somehow a good approximation of the unchanged dynamics with PERFECT
initialisation. This is probably a dead end tbh.
OTOH the full-blown inf-horizon DDP <-> BNN active learning loop would
probably be even more scalable and data efficient. But admittedly less flashy
and esoteric, and harder to pinpoint where exactly my "new" contribution will
be.
Both require quadratic value expansions so maybe I should start by working this
out properly in continuous time...
So the current selection, to summarise, is this:
a) Develop "inf-horizon" continouos time DDP, implement in jax. Connect with
active learning & Sobolev NN to efficiently find new solutions over whole
interesting region. Maybe some sort of "approximate" global optimality.
b) Try to take the "purely backwards" approach further, possibly like the notes
just above
at the moment I am leaning towards a) which is probably the more generally
useful and scalable option. Main components of that:
- inf-horizon, continuous time DDP
- theory (basically figure out those couple papers)
- implementation :)
- Sobolev Bayesian NN
- Which flavour of "bayesianisation"?
- Sobolev how exactly? With hessian approximated by random Hvp?
- Find some nice version that half works with nondifferentiable points?
(see notes from Fr.)
- Intermingling of the two
- How to propose new samples?
- How to handle trajectory optimisation? Fixed iterations? Until converged?
- Data structure considerations? Hopefully not
- Reweight data, e.g. by 1/density, for training?
- Bias somehow so we are more likely to encounter "surprising" trajectories
which are better than the currently known solution?
- Other stuff
- Which examples, costs, engineering tradeoff type goals?
this gon be fun
basic research questions to justify these goals (is this the valid way of doing
science? I am but a lowly engineer after all) :
- How can we achieve maximum sample efficiency? (by choosing correct NN and
sampling strategy)
- Are we able to meaningfully speed up long-horizon trajectory optimisation by
going to continuous time?
- To what degree can we automate the process and make it "hyperparameter-free"?
(probably not that much tbh, I feel like optimal control is in the end just
as much black magic as RL, perhaps with two or three less black first steps.)
other small stuff:
- had an idea over the weekend that H(x, λ)=0 describes something useful and
could help in interpolating the right costate (in RRT idea). convinced myself
though that this is nothing. for all other terminal value functions, we also
have optimal trajectories with H=0. So this does not give much information.
- thought about other ways of modifying the backwards PMP system to make
trajectories more well behaved. I think continuously modifying it, while
always adjusting the V/Vx info using the local expansion, is a bad idea though,
can't think of a way to ensure low approximation error.
2024-01-23
some more thoughts on how to represent the "multiple branched" value function
with an NN. got not much further than last time though. Current idea: have an
NN model which outputs k "possible" value functions Vi(x), and a smoothness
parameter κ(x), to finally give it to a "smooth min" type function to form the
final estimate. Scoured literature for descriptions of the "set of x where
different equally good local solutions are globally optimal", found lots of
names, made a section in idea dump.
Coded up a small demo of trajectory splitting with linear V extrapolation (from
costate), the rest constant, for really tiny "jumps". Looks okay, though I am
unsure how to verify it more precisely. Anyway I am thinking this is not really
the way to go.
Also had thoughts about trying to implicitly represent (= learn) the lifted
manifold, i.e. the set of (x, λ) described by our geodesic spray. However I
think this does not fundamentally solve anything, the "complexity explosion"
with longer time horizon is still the same, finding all local value functions
given some x is still an nx-dimensional rootfinding problem, and we make no
progress towards efficient global optimality with pruning of suboptimal local
solutions. Really I think I have to ditch "proper" global optimality soon :(
However, could we somehow "aid" the whole thing with engineering intuition? I
was just thinking always that the 2D quad has one obviously "nonconvex"
decision, which way to rotate to get to the upright equilibrium of the
attitude subsystem (or perhaps how many times to rotate if we consider
outlandish states). Can we somehow encourage the local solver to "check" both
these cases? Maybe we can make something that obviously cannot in general find
global optima (we can always make pathological OCPs with arbitrary number of
almost equally good local optima: Think about a tiny UAV flying through some
sort of grid from a far distance). Maybe we can then adapt the NN to have
separate models depending on some simple condition, like: Is Vx saying we
should turn left or right? Or based on literally the number of turns from the
equilibrium. Although this could once more be some half assed solution that
only works in weird special cases...
2024-01-24
found another paper about continuous time DDP, which seems to have a nice
derivation of the propagation of the value hessian. todo read this more
carefully, and attempt to recreate?
https://dl.acm.org/doi/pdf/10.1145/3592454
also got convinced by bhavya that making some custom NN model suited to the
particular type of nondifferentiability encountered in value functions is
probably not the way to go. questionable results and not scalable to higher
dims (due to general lack of global solution).
2024-01-25
been reading that paper for some time. still don't understand all the
notational quirks exactly.
have gained new intuition about the riccati eq. this is actually really cool:
instead of finding the rapidly diverging backwards optimal trajectories adjacent
to our trajectory, we find a quadratic/linear parameterisation (= quadratic
taylor expansion of V(x)) of all of them. the good thing is that even if the
dynamics are really rapidly unstable, the associated taylor expansion stays
approximately constant, enabling us to "skip" all the local fast dynamics, and
skip ahead to finding out how to stabilise them locally.
From this viewpoint it is almost nonsensical to propagate Vxx information with
every trajectory in a purely backward PMP integration thing.
viable research goal:
learned *robust* optimal control a la https://arxiv.org/pdf/2107.04507.pdf? and
instead of a gradient GP we use a Sobolev NN for function approximation,
enabling us to use the second derivative. Going from there to optimise active
learning <-> TO interaction. Maybe heuristically "address" global optimality.
for robust control there are ugly technicalities with the terminal set, because
with an adversary we cannot stabilise all the way to the origin, because the
adversary has no input cost and always chooses the maximally destabilising
disturbance, whereas we do have an input cost and at some point stop caring
about small setpoint errors if we have to incur large input cost to fix it. A
very basic example of this is in Pachter & Weintraub.
~~~ other random idea ~~~
In https://gepettoweb.laas.fr/articles/amit_icra_22.html, they use a fixed time
horizon for prediction and use a learned value function as terminal cost. thus
the region where controls are accurately approximated is expanded gradually,
and they find infinite-horizon controls in the end.
I have been put off by this idea for some time because stitching trajectories
together gives us another source of approximation error (though I have the
intuition that this error is somehow very small), and also it gives us an
excuse to implement continuous time DDP, ofc with the hope that adaptive
solvers can "skip" the long boring "turnpike" parts of the trajectory quickly
and give effectively inf horizon optimal controls with feasible effort.
However, what if we do the following. We have a BNN with accurate uncertainty
quantification. To find the initial guess for TO, do a forward sim with
predicted u* from BNN. Evaluating uncertainty in u is no extra work so we might
as well do it, and stop the simulation once we are in a region where
uncertainty is below some small threshold. Then we do the TO with the BNN
(mean?) value function as terminal cost, over a much shorter horizon, hopefully
only concerning the fast part of the dynamics.
This *should* work intuitively right? If there is a "slow" manifold we expect
to have relatively many trajectories there because they all stabilise to that
region. Therefore we can hope there is a large region with low uncertainty
after some active learning iterations, and that this region is quickly reached
in the initial closed loop sim.
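rough sketch of that warm-start procedure (bnn.policy(x) -> (u_mean, u_std),
the dynamics f(x, u), dt and the tolerance are all placeholders of mine):

import numpy as np

def warm_start_rollout(x0, bnn, f, dt=0.01, u_std_tol=1e-2, max_steps=10_000):
    xs, us = [x0], []
    x = x0
    for _ in range(max_steps):
        u_mean, u_std = bnn.policy(x)
        if np.max(u_std) < u_std_tol:
            break                              # confident region reached, stop
        x = x + dt * np.asarray(f(x, u_mean))  # closed-loop Euler step
        xs.append(x)
        us.append(u_mean)
    # the rollout becomes the TO initial guess; the BNN (mean) value at xs[-1]
    # then serves as terminal cost for the much shorter trajectory optimisation.
    return np.array(xs), np.array(us)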
Are there concerns with the terminal state changing during TO, possibly to a
"worse" region? Intuitively, we should expect that it only changes to better
regions. Is this true? Try a proof:
Because the initial guess is a closed loop sim, it is a suboptimal trajectory.
Join it with the rest (from T to inf), suppose the rest is actually optimal
(reasonable since uncertainty low there). Then we know the cost-to-go from x0
will be lower after TO than before (bc. we've gone from suboptimal to optimal).
Does this tell us anything about the terminal state? Somehow I have the feeling
it doesn't if we have a time horizon in physical time, without warping...
We can also just check at the end whether the BNN at the new terminal state is
still certain enough, and if not, ditch the trajectory, or re-do with longer
horizon in the next iteration.
current list of very relevant other publications:
DDP treatments in continuous time:
Hutter et al.
https://dl.acm.org/doi/pdf/10.1145/3592454 (w/ general parameters for diff)
https://arxiv.org/pdf/2101.06067.pdf (w/ state constraints)
Sun et al, w/ terminal constraints
https://www.semanticscholar.org/paper/b00a864b457992c5f87a4e0e24382b33c26512f4
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10156422 (w/ control constraints)
+ HJI/robustness:
https://www.semanticscholar.org/paper/5ae21807233c95c5485da721a96cb9ff06a659de
https://arxiv.org/pdf/2107.04507.pdf
Trajectory Opt <-> Learning:
https://arxiv.org/pdf/2107.04507.pdf
https://gepettoweb.laas.fr/articles/amit_icra_22.html
CACTO: https://arxiv.org/pdf/2211.06625.pdf
CACTO-SL: https://arxiv.org/pdf/2312.10666.pdf (Sobolev!!!)
Initial Sobolev learning paper: https://arxiv.org/abs/1706.04859
+ something about BNNs/active learning but everyone knows how these work
...need to be precise about research question, specifically, what are we doing
...better than these?
2024-01-26
read the different continuous time ddp papers for a bit longer. i think I gained
new intuition.
backwards propagation of V, Vx, Vxx is actually similar to directly evaluating
the characteristic trajectories backwards, but "modified" so they follow the
forward dynamics. Imagine the PMP characteristic is a particle following some
vector field (which it is), and imagine the Vxx term which we carry with it as
a local representation of the value level set, i.e. the "front" of possible
adjacent trajectories. We can imagine replacing our point by an adjacent point
on the front, and if close enough, we should still get an accurate optimal
trajectory.
The DDP backward pass consists of essentially the same: propagating a particle
backwards in time together with the "front" of nearby possible other points.
The difference is the particle doesn't follow the vector field itself but
travels on the trajectory obtained in the forward pass, like on a rail. Then
there have to be some adjusting terms to account for the "sideways" movement of
our point. Because we implicitly (using Vxx) keep track of all nearby
trajectories, if the optimal trajectory is nearby (read: close enough so the
system is essentially linear in the region between), we have already found it
during the backward pass and just need the forward pass to "retrieve" it (and
to check whether it is actually optimal). If we are further away it is more
complicated, proving that we have a descent direction we can use for line
search is above my paygrade.
Now, the implementation is still not done, I still have to grok some notation
from the papers. Certainly this is the next task.
Also, installed ppopt (explicit multiparametric QP solver) and ran the test
example (control allocation on octacopter). It finishes in like 5 seconds and
gives a solution object with 500ish critical regions. But evaluating the
solution at some point does not work, it thinks the point is not in any
critical region. Want to find out why, also, am unsure whether using such a
poorly documented tool is worth the hassle. Maybe I just stick with 1D or 2D
input sets...
2024-01-30
copypasting from last time: the overarching project goals as of now.
a) Develop "inf-horizon" continouos time DDP, implement in jax. Connect with
active learning & Sobolev NN to efficiently find new solutions over whole
interesting region. Maybe some sort of "approximate" global optimality.
at the moment I am leaning towards a) which is probably the more generally
useful and scalable option. Main components of that:
- inf-horizon, continuous time DDP
- theory (basically figure out those couple papers)
- implementation :)
- Sobolev Bayesian NN
- Which flavour of "bayesianisation"?
- Sobolev how exactly? With hessian approximated by random Hvp?
- Find some nice version that half works with nondifferentiable points?
(see notes from Fr.)
- Intermingling of the two
- IMO this is the actual interesting part.
- How to propose new samples?
- How to handle trajectory optimisation? Fixed iterations? Until converged?
- Data structure considerations? Hopefully not
- Reweight data, e.g. by 1/density, for training?
- Bias somehow so we are more likely to encounter "surprising" trajectories
which are better than the currently known solution?
- some sort of heuristic global optimality thing? maybe assuming we know about
the number or structure of the different "modes"?
- Other stuff
- Which examples, costs, engineering tradeoff type goals?
been trying to grok the two main continuous time DDP papers again today. Tried
to develop my own intuition, connect w/ characteristics approach, and see
whether deriving the backward propagation from that angle gives us the same,
and maybe lets me understand the papers. Got a bit further on the theory front
but no implementation that is working yet.
2024-01-31
meeting. discussion about whether or not it is worth it to implement own,
continuous time optimiser. they say: probably not, probably they are right. i
can still do this if I get the rest working. then I can focus on the more
interesting part: how to do active learning sensibly, how to handle
different local solutions, how to make it infinite horizon.
Problem is, in discrete time we get discretisation artefacts, we get no clearly
defined time invariant value function (or do we?) Also the inner optimisation
is now concerned with minimising the cost over an actual time interval, making
everything nonlinear and nonconvex and ugly. and yet more, we don't get to
profit from adaptive step sizes or anything like that. I really don't feel like
"throwing away" all the nice continuous time things. Is this a sunk cost
fallacy?
Stuff to look up:
- Offline RL (learning optimal policy from fixed, suboptimal data)
- average-reward MDP (undiscounted/inf-horizon problems)
goal for next time: formulate overarching goal, high level problem description
in cohesive and clear form
found yet another paper from Farshidian, Buchli, etc. about continuous time
constrained trajectory optimisation: https://arxiv.org/pdf/1701.08051v2.pdf
They state pretty clearly the advantages over NLP-based approaches:
"
In general, NLP-based planning algorithms require the discretization of the
infinite dimensional, continuous optimization problem to a finite dimension
NLP. This discretization is often carried out using heuristics, which can
result in numerically poor or practically infeasible solutions. Our algorithm,
by contrast, is a continuous-time method which uses variable step-size ODE
solvers in its forward and backward passes. Given the desired accuracy, it can
automatically discretize the problem using the error control mechanism of the
variable step-size ODE solver. Informally speaking, this allows the solver to
indirectly control the distance between the "nodes" in the feedforward and
feedback trajectory. In practice, this decreases the runtime of an iteration,
since the number of calculations decreases.
"
2024-02-01
finished own derivation of how the value taylor expansion evolves in the
backward pass. just couldn't help it. coded it up in skeleton form, still a few
things missing, I almost fear trying it out.
2024-02-02
continued hacking a bit at the implementation. been thinking about how we best
access the RHS used to make the forward sim in the backward pass.
mathematically we can just evaluate the time derivative of the solution
trajectory. however, it is a polynomial interpolation, not the real thing.
Compared the two with a plot and we indeed see rather large errors (more than
10% of the state values themselves). So, there seems to be no shortcut to
evaluating the RHS again, given that the ODE solver chooses different time
points anyways. And, come to think of it, we need the state derivative of the
RHS too which will dominate computational cost in all cases.
maybe using the interpolation is even better? the interpolation's time
derivative tells us where our actual approximate "solution" trajectory goes,
while the RHS evaluated again tells us where it "should have gone", loosely
speaking. will have to try it out. using the derivative is certainly easier
though. but even if we do that we need the previous input anyway, and to get
that reliably, the previous costate, and in turn to get that, either the full
data from the last backward pass, or some interpolation of the costate seen
during the forward pass.
ideas to start working next monday (high priority):
- formulate overarching goals neatly, maybe in own latex document?
- finish the backward pass implementation, debug, sanitycheck, work towards DDP
- if it completely fails, resort to trying to understand other papers
- think about active learning part and local/global heuristics.
- maybe start using some finished DDP implementation for other experiments?
low priority:
- implement inner convex optimisation in general case
(brute force active set = region-free explicit mpQP?)
- come up with example dynamical systems illustrating different up/downsides of
our approach
- think abt extensions (2 player differential game? parameterised?) in fact
just the combination of those two (basically unifying Hutter et al. with
Sun&Kleiber would be a great novelty but also considerable coding work)
2024-02-05
...vs what I actually did: re-do the calculation of \dot S along
characteristic curves. The thing from last time turned out to be not
symmetric, therefore it must be wrong (still unsure where exactly I went wrong
though). The second try is akin to the DOC derivation: write down a solution
of the characteristics plus some small variation as (x, \lambda) + (\delta x,
\delta \lambda). Then by subtracting the non-delta part and doing a first
order taylor approximation we get "differential dynamics", a time varying
linear system for the state variable (\delta x, \delta \lambda). like DOC we
then parameterise \delta\lambda as a linear function of \delta x, S \delta x. After juggling around
some terms the "differential dynamics" also tell us how this matrix S evolves,
we get something depending on different second partial derivatives of H^\star
and also S itself, looking very reasonable. Let's try it out in jax.
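for reference, the equation this kind of derivation gives (at least when I redo
it quickly, so take the signs with a grain of salt) is the matrix riccati eq.
\dot S = -(H^\star_{xx} + H^\star_{x\lambda} S + S H^\star_{\lambda x} + S H^\star_{\lambda\lambda} S),
whose RHS in jax would be something like this, with H_star(x, lam) the
pre-optimised hamiltonian:

import jax

def S_dot(H_star, x, lam, S):
    # second partials of the pre-optimised hamiltonian at (x, lam)
    Hxx = jax.hessian(H_star, argnums=0)(x, lam)
    Hll = jax.hessian(H_star, argnums=1)(x, lam)
    Hlx = jax.jacobian(jax.grad(H_star, argnums=1), argnums=0)(x, lam)
    # from the differential dynamics with the parameterisation dlam = S dx
    return -(Hxx + Hlx.T @ S + S @ Hlx + S @ Hll @ S)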
2024-02-06
was a bit sick in the morning, feared i had covid, slept in. was better by 10.
did some more philosophising about the differences between the DOC derivation
(LQ-isation of entire OCP about current iterate + solution of linear BVP) and
the weird thing I did (follow characteristic curves first and foremost, but
adjust the taylor expansion to account for the different direction in which
the forward trajectory points.) I may have found the central difference,
namely that in our version we are taylor expanding the pre-optimised
hamiltonian about the characteristic field given by current x and lambda, and
with it the u* map which is piecewise linear. In DOC this taylor expansion is
formed about the iterate trajectory. But also they have no input constraints
and no piecewise linear u*, just a linear one. Maybe read Sun&Kleiber more
precisely again to see how they handled it?
then, continued the implementation (of our, almost-characteristics-following
version) and got to the first run where numbers come out on the other side
juhuuu ;) On the first try the value and costate match the forward trajectory
(which was evaluated in backward time and so is already optimal). sanity check
passed! The entries of the S matrix look kind of sus to be honest but then
again it is a trajectory which initially has the uav spinning quickly so that
is kind of to be expected. more sanity checking follows tomorrow, goodnight :)
2024-02-07
implemented the forward pass! it seems to work as expected!!!! except for some
small questions, such as: why does S = V_xx sometimes turn
non-positive-definite?
2024-02-07
had big enlightenment this morning by feeling like the DOC derivation is the
correct one after all. by properly approximating the problem about the current
iterate with quadratic cost and linear constraints, then solving the resulting
linear subproblem, they basically build a standard newton method, with easy
proofs of convergence, line search etc. Went over the DOC style derivation
again, I think I have it correct now, first implementation looks alright
(symmetric \dot S and very close to other version). Not sure if they are both
actually the same. Maybe the only difference is really which linearisation for
the u* map we are using.
spent some time thinking about line search. in contrast to finite-dim
optimisation, we cannot just add arbitrarily many iterates on top of each
other, but standard line search basically requires this. by letting x+ = x +
alpha * (descent direction) we are making x+ a linear combination of the first
iterate and all descent directions.
so we need to be somewhat smart about this. brute force solution:
- attempt full newton step, if some sensible descent condition holds, accept.
- if not, repeat backward pass, with an additional "regularization" term
penalising ||x - x_iterate||. this gives still a LQ subproblem which we can
solve exactly. Do the forward pass again, check descent condition. If OK,
return; otherwise repeat the step with a larger "closeness" penalty.
Repeating the backward pass is expensive though. Other papers rely on just
applying a scaled down version of \delta x and \delta u. But with input
constraints this will be a bit complicated. Would much rather have a 'scaled
down' value function. But to do that properly we'd need to keep track of the
local value function that generated the last forward pass. And if that was
already made by combining the previous backward passes we need both of those
too.
easy heuristic which maybe works: Do only 1 backward pass. If descent condition
fails, then do another forward pass, but with local value function:
V(t, x) = v(xbar(t)) + λ(xbar(t)).T (x - xbar(t)) + α/2 (x - xbar(t)).T S(t, xbar(t)) (x - xbar(t))
for some α > 1 which we may increase to increase the quadratic value term. this
corresponds to using a larger linear feedback gain which surely will not work
for significant α.
There must be something akin to "step in the descent direction but only a
fraction" that does not have all these issues...? how?
2024-02-07
did some further attempts at replicating DOC's S and s updates but to no avail.
mostly looked at pretty plots and figured out how to record webcam timelapses
from mjpeg streams.
points to start next week (this is essentially copy pasted from last week...)
- (still) formulate overarching goals neatly, maybe in own latex document?
- further debug & sanitycheck the backward pass, work towards DDP
- either decide on DOC or my implementation, or do both separately and compare
(but stop switching every half day)
- think about active learning part and local/global heuristics.
- maybe start using some finished DDP implementation for other experiments?
2024-02-12
re-did the derivation from the characteristics angle without the error I made
previously (in confusing a total and partial derivative of the hamiltonian).
Got the same formula I have from the DOC style derivation for \dot S. looks
good.
did a couple sweeps (move initial state a bit, do a fixed amount of DDP
iterations) to obtain a family of (hopefully) optimal trajectories. Next up,
further sanity checking: Are the trajectories actually optimal? Find some way
to see how close to a solution we are.
But all in all this looks great. Was also happy about how quick it seems to be
after jit. it does 32 iterations, for the flatquad model in like 0.1 seconds :)
2024-02-13
continued playing around with ddp. made a cleaner implementation in functional
jax style with some extra abstraction. did some initial state sweeps. looks
pretty nice :) i think for the moment this should be workable.
2024-02-14
continued the implementation & some tests. made plots and brought meshcat
visualiser back from the dead. looks all pretty nice. also replaced the
initialisation with backward shooting by a simpler one using an LQR-controlled
solution near the goal state.
2024-02-15
trying some more "difficult" states to sweep to. quickly the ODE integrator
steps explode. first remedy i am trying now: set maxsteps to something
outlandish like 10000, and set step controller dtmin to T/maxsteps.
theoretically this should work! but maybe at the expense of accuracy. there is
also the norm argument which might be useful especially in the backward pass:
norm: A function PyTree -> Scalar used in the error control. Precisely, step
sizes are chosen so that norm(error / (atol + rtol * y)) is approximately one.
in the forward pass as long as the system is about unity scale the default,
probably the 2-norm, should be good enough.
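the kind of settings I mean, roughly (rhs, T, y0 are stand-ins, and I'm going
from memory of the diffrax argument names):

import diffrax

max_steps = 10_000
controller = diffrax.PIDController(
    rtol=1e-5, atol=1e-7,
    dtmin=T / max_steps,       # cap how small the steps may get
    # norm=my_norm,            # optional custom PyTree -> scalar error norm
)
sol = diffrax.diffeqsolve(
    diffrax.ODETerm(rhs),      # rhs(t, y, args)
    diffrax.Tsit5(),
    t0=0.0, t1=T, dt0=1e-2, y0=y0,
    stepsize_controller=controller,
    max_steps=max_steps,
)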
alright the problem is not the ODE integrator probably but some other error.
when sweeping from initial state 0 to Phi=pi/2, initially all goes well for low
angles, but suddenly, the angular rate goes either way up or way down. this
smells somehow like wrong treatment of input constraints to me. I'll plot the
input time series next.
they look quite decent, and already have significant constrained parts before
it goes wrong, so probably we are alright there. the backward pass in iteration
816 (first one that goes wrong) blows up to like 1e7 which the previous ones
don't do...
lowered solver tolerance on backward pass, and the particular instance of
blowing up is removed.
finding a new problem where NaNs creep up in sol.evaluate somehow. or wait, no,
it is the same old problem where backward pass blows up.
made plotting code for the backward pass in the jax-ish, functional
implementation. did not help to definitively find the issue. sometimes S explodes, causing
crazy unstable forward pass with basically random bang-bang input, messing up
all subsequent runs.
2024-02-16
meeting notes. talked at length about "grand plan".
suggestions for next time (in addition to what I planned to do anyways):
- think about how to compare continuous time DDP to alternatives
(e.g. discretised DDP with Euler/RK4, or adaptive collocation+NLP)
and see what it can do better, and what worse
- think about active learning problem/formulation.
do we do uncertainty sampling in V or V_x? Bhavya has some points in
favour of V, mostly that the problem is closer to the standard
formulation, V is continuous, and there is an easy rule which V is the
best among different local solutions (the lowest). for V_x or even u
it is a bit hairy in contrast.
- make some sort of collection/visualisation of related works.
2024-02-19
trying to investigate the effects of evaluating d/dt solution(t) instead of
RHS(solution(t)). For the first, very easy (pretty much in linear domain)
example we already see ripples in the forward pass solution that do not appear
in the initial guess (made with LQR value function). Possibly the problems I'm
seeing are due to this.
turns out, there are ALWAYS ripples at least in the omega state which, while
not hugely messing up the state info, are not very small and certainly render
the derivative useless.
first tried to store the costate in the forwardpass solution using the fn argument
of SaveAt, however that does not make it to the interpolation, so is useless.
https://github.com/patrick-kidger/diffrax/issues/301
so to do it properly we kind of need to evaluate the interpolated previous backwardpass
during the current backwardpass to know the direction of the previous forward solution.
but here is the kicker: to do that, we need to know at what point the taylor
expansion (from the previous backward solution) was done. To know this we have
to know the forward solution even before that!!!
so, will this just fundamentally not work unless we are willing to evaluate a
growing number of ODE solutions summed together? This is kind of the same
issue as line search: To do it properly we have to have an iterate which is a
weighted sum of all past iterates. This is trivial in the finite dimensional
case. But in our case the adaptive discretization messes everything up. Is
this really a dead end? Would be very sad.
--> this "kicker" is actually not true. we only need for a particular backward
pass the forward solution immediately before that, and the forward/backward
solutions that produced it.
To make it nicer i reordered forward and backward pass, making current
iteration again only depend on last iteration. was a whole day of refactoring.
Now it works (apparently) very similarly to before but we eliminated one
possible cause of uncontrolled numerical error buildup. Need to look more