CLV Management / Saving mode #19

Draft
wants to merge 48 commits into master

Conversation

@pierrebarbera (Collaborator) commented Dec 15, 2020

This PR introduces a new optional feature that allows the number of concurrently allocated CLV buffers to be set as low as log_2( #leaves ) + 2, compared to the usual one per inner node of the tree (or three per inner node in the case of epa-ng).
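
For a rough sense of scale (illustrative numbers, not taken from the PR): an unrooted binary tree with 1000 tips has 998 inner nodes, so the default scheme allocates 998 CLV buffers (and epa-ng roughly three times that), whereas the memory saver can be configured to hold as few as ceil(log2(1000)) + 2 = 12 CLV buffers concurrently.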

The main implementation idea is to actively manage this limited set of CLV slots: the usual clv_index is translated into an internal slot_index, and when there aren't enough free slots available the manager decides which slot should be overwritten.
This behaviour can be customized through callback functions, collectively called the replacement strategy.

If/when you have time, please give some thought to what would still need to change for you to be able to use this in your code. This PR is primarily about feedback, though if we feel we can merge it safely, that would of course be great.

New structures

Like the repeats mode, the clv_manager is intended to be a feature with a self-contained struct that holds everything it needs:

typedef struct pll_clv_manager
{
  /**
   * Some upfront terminology:
   * - A slot is a buffer for one CLV that is held in memory
   * - clv_index works as always, though now it functions as an
   *     "addressable" CLV index
   * - a clv index that is "slotted" means its CLV resides in memory
   * - a clv index that is "pinned" means its CLV resides in memory and may not
   *     be overwritten
   */

  size_t slottable_size; // max number of CLVs to hold in partition
  size_t addressable_begin; // first clv_index that is addressable
  size_t addressable_end; // one past last clv index that is addressable
  unsigned int * clvid_of_slot;
    // <slottable_size> entries, translates from slot_id to clv_index of node
    //  whose CLV is currently slotted here
    //  special value: PLL_CLV_SLOT_UNUSED if this slot isn't in use
  unsigned int * slot_of_clvid;
    // the reverse: indexed by clv_index, returns slot_id of a node
    // special value: PLL_CLV_CLV_UNSLOTTED if the node's clv isn't slotted 
    // currently
  bool * is_pinned;
    // tells if a given clv_index is marked as pinned
  size_t num_pinned;
  pll_uint_stack_t * unused_slots;
    // holds slot_id of slots that are not yet used
  pll_clv_manager_replace_cb strat_replace;
    // replacement strategy: replace function
  pll_clv_manager_update_cb strat_update_slot;
    // replacement strategy: update a slot with a clv_id function
  pll_clv_manager_dealloc_cb strat_data_dealloc;
    // replacement strategy: dealloc the custom data
  void* repl_strat_data;
    // void pointer to whatever data your replacement strategy might need
} pll_clv_manager_t;

Notes:

  • addressable_begin and ..._end are my way of dealing with the tipchar mode and its implications for what constitutes a valid true clv_index
  • addressable_end equals tips + clv_buffers as specified during pll_partition_create (more on that later)
  • in most cases, unused_slots is a kind of unnecessary optimization: usually you wouldn't explicitly mark slots as unused, so it only saves time in the very beginning, making getting an unused slot O(1) instead of O(#slots). It also complicates the codebase, as I implemented a stack for it. However, I think it should be left in, as it may come in handy for non-standard use-cases and replacement strategies.

The callbacks can be set to custom functions to implement different behaviour for choosing which unpinned slot should be overwritten next when there are no unused slots available. By default they are set to the minimum recomputation cost (MRC) strategy, which aims to overwrite first those CLVs that are cheapest to recompute if they are needed again.
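
To make the bookkeeping concrete, here is a simplified sketch (not the library's actual code) of what a writing access conceptually does with the manager's fields. The stack helpers and the replace-callback invocation are placeholders; only the field names and the special values are taken from the struct above.

/* simplified conceptual sketch, not the actual library code */
static unsigned int get_slot_for_writing(pll_clv_manager_t * mgr,
                                         unsigned int clv_index)
{
  unsigned int slot = mgr->slot_of_clvid[clv_index];

  if (slot != PLL_CLV_CLV_UNSLOTTED)
    return slot;                          /* already resident in memory */

  if (!stack_empty(mgr->unused_slots))    /* placeholder stack helpers */
    slot = stack_pop(mgr->unused_slots);  /* O(1) thanks to the stack  */
  else
    slot = mgr->strat_replace(mgr);       /* ask the replacement strategy for
                                             an unpinned victim slot
                                             (assumed callback signature) */

  /* evict the previous occupant of the slot, if any */
  unsigned int old_clv = mgr->clvid_of_slot[slot];
  if (old_clv != PLL_CLV_SLOT_UNUSED)
    mgr->slot_of_clvid[old_clv] = PLL_CLV_CLV_UNSLOTTED;

  /* slot the requested CLV */
  mgr->clvid_of_slot[slot]      = clv_index;
  mgr->slot_of_clvid[clv_index] = slot;
  return slot;
}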

Lifetime management

The manager struct is deallocated with the partition (during pll_partition_destroy). The CLV manager dealloc function tries to call the replacement-strategy deallocation callback to deallocate any custom data used there.

Changes to existing code

I tried to keep these as limited, and as "zero overhead" as possible.

clv access

The biggest change is that, with the memsaver enabled, accesses of this kind are wrong:

partition->clv[ node->clv_index ];

Instead, access to the clv buffer should now be done through these functions:

  • pll_get_clv_reading provides const/read-only access to a clv, returning NULL if the clv_index is not slotted. There is also pll_clv_is_slotted, which can be used to check whether a clv is slotted, so that the developer can take action if it is not.
  • pll_get_clv_writing gives mutable access to a clv index, meaning that if the clv is not slotted it will give that clv a slot, either by assigning an unused slot or by replacing some slot that isn't pinned and is ready for reuse (see the sketch below).
  • there is no explicit read+write access function yet. Also, these functions are prime candidates for where thread-safety would come in.
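
As a sketch of the intended access pattern (the exact signatures are not spelled out in this description; the assumption here is that both functions take the partition and a clv_index):

/* sketch of the intended access pattern; exact signatures are assumed */
const double * clv = pll_get_clv_reading(partition, node->clv_index);
if (!clv)
{
  /* the CLV is not slotted: request a writable slot (which may evict an
     unpinned CLV) and recompute the partial into it */
  double * writable = pll_get_clv_writing(partition, node->clv_index);
  /* ... recompute the CLV into `writable` ... */
}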

pinning

In the existing code the access functions come into play, for example, in the calls to the partials functions. There we also encounter the second major (but small) change: we have to explicitly pin the parent clv of the current to-be-updated partial, then do the normal update, followed by unpinning the two now no longer needed child clvs:

first pin

then unpin after
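
Conceptually, the per-operation flow inside the partials update then looks like the sketch below; the pin/unpin function names are illustrative placeholders, not necessarily the actual API of this PR:

/* conceptual per-operation flow inside the partials update;
   pin/unpin names are illustrative placeholders */
pin(partition, op->parent_clv_index);     /* the result CLV must stay resident */

/* ... the usual partials computation for this operation ... */

unpin(partition, op->child1_clv_index);   /* the two child CLVs are no longer  */
unpin(partition, op->child2_clv_index);   /* needed and their slots may be reused */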

changes to clv allocation

Like with the repeats mode, the allocation of the clv buffer is dependent on information we only get during the later initialization, namely how many slots we want. Consequently, the clv buffer is allocated after the partition creation. As this is shared between repeats and memory saver, I refactored this part into its own function to reduce redundancy.

other minor changes and additions

  • as we use C99, I changed pll_bool_t to use an actual bool. From what I saw it was only used once.
  • changed pll_utree_every to take a callback that can actually change the nodes it visits (I think there was no difference to pll_utree_every_const?). There should at least be some function allowing changes to the nodes to make it a useful apply-like function, and since there is already a const vs. non-const split, this seemed like an obvious change.
  • related to this, I added a pll_utree_foreach function that takes callbacks, and data for those callbacks, both for deciding whether we should keep traversing and for doing things to/from nodes (see the sketch after this list). I found this super useful; my only regret is the lack of const correctness, which is mostly due to the existing const-correctness problems in libpll.
  • I changed the test dir Makefile to be able to build in parallel (make -j) and made it explicitly use this libpll vs. possibly using some system wide version.
  • added memory manager to all tests where it makes sense (non-hardcoded input trees)
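
To illustrate how such a foreach might be used (the exact signature of pll_utree_foreach is not given in this description, so the one assumed below is hypothetical):

/* hypothetical usage sketch; the real signature of pll_utree_foreach may differ */
static int keep_descending(pll_unode_t * node, void * data)
{
  (void)node; (void)data;
  return 1;                       /* decide whether to keep traversing */
}

static int count_node(pll_unode_t * node, void * data)
{
  (void)node;
  ++(*(size_t *)data);            /* "do something" with each visited node */
  return PLL_SUCCESS;
}

/* somewhere in user code: */
size_t num_nodes = 0;
pll_utree_foreach(tree, keep_descending, NULL, count_node, &num_nodes);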

Usage

// first major change: we need information about the sizes of each subtree, both for the traversal
// later and for the default replacement strategy
auto subtree_sizes = pll_utree_get_subtree_sizes(tree);

auto attributes = simd_autodetect();
// enable the new mode
attributes |= PLL_ATTRIB_LIMIT_MEMORY;
// create the partition (note how nothing changes in this call)
auto partition = pll_partition_create(
    tips, // number of tips
    inner_nodes, // number of extra CLV buffers (one per inner in this case)
    ... );
// the minimum number of concurrently held CLVs (slots) for this tree
const size_t concurrent_clvs = ceil(log2(tree->tip_count)) + 2;
pll_clv_manager_init(
    partition,
    concurrent_clvs, // !!! number of SLOTS we want !!!
    NULL, // slot replacement callback (NULL = default = MRC)
    NULL, // slot update callback
    NULL  // strategy data deallocation function
);
// since we are using the MRC strategy, we need to also initialize it
pll_clv_manager_MRC_strategy_init(
    partition->clv_man, // the CLV manager struct
    tree, // the utree
    subtree_sizes // a per-node_index array of cost to recompute. Here we use the size of the subtree
                         // starting at that node_index, toward the virtual root set in the tree
);

...

// later, when creating the operations for update_partials, we have the biggest change to normal code flow:
// we need to traverse the tree in a largest-subtree-first traversal
pll_utree_traverse_lsf(tree,
                      subtree_sizes,
                      PLL_TREE_TRAVERSE_POSTORDER,
                      cb_full_traversal,
                      travbuffer,
                      &traversal_size);

...

// create the operations array, update partials, etc
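
For reference, the elided steps then follow the usual libpll flow, roughly like the sketch below (buffer names and sizes are assumed; the calls shown are the standard libpll entry points):

// sketch of the usual remaining steps; buffer names/sizes are assumed
pll_utree_create_operations(travbuffer,
                            traversal_size,
                            branch_lengths,
                            matrix_indices,
                            operations,
                            &matrix_count,
                            &ops_count);

pll_update_prob_matrices(partition,
                         params_indices,
                         matrix_indices,
                         branch_lengths,
                         matrix_count);

// with the memory saver active, pll_update_partials internally goes through
// the slotting/pinning machinery described above
pll_update_partials(partition, operations, ops_count);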

Notes:

  • pll_utree_traverse_lsf functions like the normal utree_traverse, except it takes in a utree instead of a unode. This is one of the things I'm totally open to change; it was more of a question of old vs. new style (and having the information about the tree being binary or not).
  • pll_utree_get_subtree_sizes internally uses a new function pll_utree_foreach which I guess has a lot of overlap with existing functions. We might want to consolidate those (coraxlib todo?)

Completeness TODOs

Some things I haven't yet been able to flesh out or implement, but that I would consider necessary for the manager to be "complete":

  • integration with repeats (this would have great synergy with the need to recompute)
  • making the manager thread safe
  • for now I've disallowed using the memory saver without the tipchar mode, but I think at least the groundwork is there (I actually have to go back and check why I disallowed that)

Other open questions

  • Everything is currently implemented on a per-partition basis, though there may be some overhead to be saved by consolidating some of it when using multiple partitions.
  • The "minimum recomputation cost" default replacement strategy may be complicating the default use-case, due to having to supply a cost array. However, this cost array can just equal the subtree_sizes array, which is needed anyway to traverse the tree largest-subtree-first. Still, a simpler default may be desirable.

@BenoitMorel (Collaborator)

Here is a quick feedback before leaving for vacations ;-)

  • I think it's great that you managed to reduce as much as possible the changes in the existing code and to put most of the new stuff in one structure.
  • I appreciate the inline descriptions of the data members and the detailed description on how to adapt existing code + the example (maybe it would be worth adding it to the code samples?)
  • The PR provides better encapsulation, which is always good.
  • I think it's fine to have this on a per-partition basis. Or do you think the overhead of repeating the operations on all partitions will be expensive?
  • It would be really great to support repeats. But I guess it's also a bit more challenging. Maybe we can have a look together around mid January to do a proper encapsulation of both "modes".

I can't look at it in details now, but it looks great to me at a first glance ;-)

@amkozlov (Collaborator)

Looks very nice! After reading the description and without looking at the code yet, I have a couple of comments regarding potential raxml-ng integration:

  • combination with site repeats would be really important, because they not only save memory but also improve performance, so right now it's a bit unclear how often the pure CLV recomputation mode will "win"
  • thread safety issues can be avoided by replicating CLV manager for every thread (and partition), this of course means extra overhead but probably negligible in most cases (still, might be a reason for making "unnecessary" datastructs like unused_slots optional)
  • obviously, tree topology and thus subtree_sizes will change after every move, so there should be a way to reset it after CLV manager creation (maybe add subtree_sizes_valid flag to keep track?)
  • and since subtree_sizes is always needed for LSF traversal, maybe it should be encapsulated in CLV manager? (would probably simplify the interface a bit)

@pierrebarbera (Collaborator, Author)

Here is a quick feedback before leaving for vacations ;-)

Thanks for taking the time, and for the kind words! :)

  • I appreciate the inline descriptions of the data members and the detailed description on how to adapt existing code + the example (maybe it would be worth adding it to the code samples?)

Good point, I will add a basic example. What's already there are the tests, but they can be hard to read.

  • I think it's fine to have this on a per-partition basis. Or do you think the overhead of repeating the operations on all partitions will be expensive?

I don't think it will be, honestly. A lot of the operations have to be per-clv_buffer anyway, and the overhead regarding the subtree sizes, for example, only needs to be incurred once per tree.

  • It would be really great to support repeats. But I guess it's also a bit more challenging. Maybe we can have a look together around mid January to do a proper encapsulation of both "modes".

I agree, I would really like to have it in. In epa-ng it currently induces quite some overhead, especially when having to recompute all possible branches, so anything that can speed that up would be great.

@pierrebarbera (Collaborator, Author)

  • combination with site repeats would be really important, because they not only save memory but also improve performance, so right now it's a bit unclear how often the pure CLV recomputation mode will "win"

Agreed, it's high on the list.

  • thread safety issues can be avoided by replicating CLV manager for every thread (and partition), this of course means extra overhead but probably negligible in most cases (still, might be a reason for making "unnecessary" datastructs like unused_slots optional)

That's a good quick way around it, but we would probably want to experiment to see whether introducing a mutex for every slot would be worth it. Either way it should be optional, to minimize unneeded overhead for differing parallelization schemes (or no parallelization at all).

  • obviously, tree topology and thus subtree_sizes will change after every move, so there should be a way to reset it after CLV manager creation (maybe add subtree_sizes_valid flag to keep track?)
  • and since subtree_sizes is always needed for LSF traversal, maybe it should be encapsulated in CLV manager? (would probably simplify the interface a bit)

This is a good point; I had it a bit too separate in my mind: it's needed for the MRC strategy, but there it's just one possible form of "cost". More importantly, since you absolutely need it in the lsf-traversal, I guess there's no reason not to have it. This way the MRC_strategy_init may even become totally unnecessary in the full-default case, if we force the user to always supply a {u|r}tree. The one slightly annoying thing is that the lsf-traversal function would have to retain the pointer to the subtree sizes array, just in case someone wants to use it without the memsaver. Small price to pay, I'd say.

@computations (Collaborator)

I have a bunch of bikeshedding comments that I will leave off. I think this is very well done, but I do have some technical questions:

  • Does the stack structure need the bool empty? I think you can change the logic such that you don't need this.
  • Do you need recompacting functions? Or maybe other functions which report some statistics for the state of the managed clvs?
  • Does locality matter/how does this impact the locality of libpll? I know in the past that I have done some experiments that showed even in the worst case, the impact wasn't that bad, but I think that it is worth checking again.
  • I think that this shares a lot of challenges and design choices with a garbage collector. No question there, just pointing out that we might be able to learn something from them.

@pierrebarbera (Collaborator, Author)

I have a bunch of bikeshedding comments that I will leave off. I think this is very well done, but I do have some technical questions:

  • Does the stack structure need the bool empty? I think you can change the logic such that you don't need this.

True, and also safer.

  • Do you need recompacting functions? Or maybe other functions which report some statistics for the state of the managed clvs?

I don't understand what you mean by recompacting functions, elaborate please? As for statistics: I haven't needed anything like that yet, any specific suggestions? Like, number of pinned slots, total recomputation cost?

  • Does locality matter/how does this impact the locality of libpll? I know in the past that I have done some experiments that showed even in the worst case, the impact wasn't that bad, but I think that it is worth checking again.

Interesting point, perhaps some more advanced replacement strategy could prefer to put child and parent clvs in slots that are close to each other in terms of actual buffer address, since we are abstracting here already anyway. With normal allocation in the partition, probably the memory is somewhat fragmented, as we allocate one CLV at a time, right? As opposed to one large malloc.

  • I think that this shares a lot of challenges and design choices with a garbage collector. No question there, just pointing out that we might be able to learn something from them.

In a sense, yes, though the goal isn't to deallocate the buffers of the slots, but rather to have the program never exceed some amount of memory. By the way, in the epa-ng codebase I have a bunch of functions to gauge the memory footprint of a partition etc.; maybe these could be useful in the library at some point, implemented in a more "plugged in" way (hand in hand with a refactor of pll_partition_create).

Again, thanks for the comments!

@computations (Collaborator)

I don't understand what you mean by recompacting functions, elaborate please? As for statistics: I haven't needed anything like that yet, any specific suggestions? Like, number of pinned slots, total recomputation cost?

I mean some function that just swaps the allocations such that we obtain a "better"[1] layout. To be honest, I was kinda just thinking in writing there, and that thought was the proto-thought to the locality question. Basically, the motivation is that if you find that locality matters, then we might want to occasionally reallocate in a way that improves the locality.

With normal allocation in the partition, probably the memory is somewhat fragmented, as we allocate one CLV at a time, right? As opposed to one large malloc.

This is half true, because the CLV is actually as long as the alignment (except in site repeats), so the length usually saves us from any locality problems. But I think EPA-NG often has really short alignments, so maybe this is a bigger factor? I did the experiment with 1000 sites, and didn't try different numbers. I think that if you have time this should be looked into.

[1]: better is left as an exercise to the reader
