Soroban performance notes & CAPs #1460
Replies: 5 comments 16 replies
-
re: https://github.com/stellar/stellar-protocol/blob/master/core/cap-0054.md#xdr-changes -- I think we need a placeholder to define the default values for the new settings?
-
Regarding fees for inter-transaction caching, one option could be that transactions declare higher fees, but we divide the charge among all transactions in the set and refund the excess fees.
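To make the split-and-refund idea concrete, here's a toy sketch; the function name, fee units, and rounding rule are all hypothetical, not Soroban's actual fee logic:

```rust
// Toy sketch of "declare high, split, refund": each tx declares a fee bid
// large enough to cover the shared (caching) cost on its own; the actual
// charge divides the real cost across all txs in the set and refunds the
// rest. All numbers and names here are made up for illustration.
fn split_and_refund(declared_fees: &[u64], shared_cost: u64) -> Vec<(u64, u64)> {
    let n = declared_fees.len() as u64;
    // Round up so the set as a whole always covers the shared cost.
    let per_tx_charge = (shared_cost + n - 1) / n;
    declared_fees
        .iter()
        .map(|&declared| {
            let charged = per_tx_charge.min(declared); // never exceed the bid
            (charged, declared - charged) // (charged, refunded)
        })
        .collect()
}

fn main() {
    // Three txs each declared 100 for a shared cost of 90:
    // each is charged ceil(90 / 3) = 30 and refunded 70.
    let result = split_and_refund(&[100, 100, 100], 90);
    assert_eq!(result, vec![(30, 70), (30, 70), (30, 70)]);
}
```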
-
I don't think this statement regarding the current (worst-case) cost model is accurate. It's a relatively minor detail which doesn't affect the approaches in this proposal or take away from its message; I just wanted to point it out:
We are defining the worst case (during calibration) by passing in
-
How about storing the "refined" cost model in a separate ledger entry, say
-
Soroban performance notes & CAPs
This note explains a bit of where we are and where we're likely to be going in the near term with Soroban's performance, and points to a few CAPs aiming at near-term improvements.
Summary
There is a lot of low-hanging fruit, and it's not in the most obvious places you might expect. The opportunities are things we knew about all through the development of Soroban and simply deferred in the schedule in order to get the product out the door, and we're going to be working through these items over the next several protocol releases.
I'll discuss each of these issues below, but briefly the list is:
- improving VM instantiation: tightening its cost model, then doing less instantiation work (linking only used host functions, caching translated modules, and lazily translating functions)
- parallelizing execution
Improving VM instantiation
If you look at a real-time profile of Soroban's current execution timing you will see something like this:
What this shows is that the construction of a Wasm VM (Vm::new) is regularly taking more time -- often as much as 4 or 5x more time -- than the time we spend actually running code on that VM (Vm::invoke_function_raw). And a lot of that time is actually parsing modules, which we currently redo for every invocation (even multiple sub-invocations within a single transaction: we re-parse each Wasm module every time it's called).
While this seems bad, it's not even the worst part. The worst part is that we don't charge real-time costs when calculating which transactions to admit. This is because we have to charge an identical cost on each node, to remain in strict consensus about the success or failure of any given transaction and the costs incurred, so we use a cost model for the work being done: an over-approximation based on worst-case estimates of the work we're about to do. And the cost model for VM instantiation is especially coarse in its worst-case estimate. If you look at the ratio of cost models, the picture is even more grim.
What this shows is the cost model is charging even more -- often as much as 6 or 7x as much -- for VM instantiation as all the rest of the work it charges for. And cost models control how many transactions we admit and how much we charge in fees. So addressing the over-estimate of the VM instantiation cost model is actually job #1.
Tightening the cost model of VM instantiation
The current cost model for VM instantiation is a linear function of the number of bytes of Wasm bytecode provided as input to the VM instantiation process (which includes parsing, validating and instantiating the Wasm module). This cost model operates in ignorance of the actual content of the Wasm module -- since it hasn't parsed it yet -- so it assumes the worst: that every byte in the input defines a new Wasm function. This is actually a bit worse than the worst case; in practice Wasm needs a few bytes to define a new function, but it's in the ballpark of the worst case.
However, once we've parsed the module once, we know how many functions (and tables, imports, exports, instructions, etc.) the module defines. We can save that information to the ledger (as a set of numbers) alongside the bytes that make up the module, so that the next time we parse the module we can run a "refined" cost model that takes those numbers as input rather than the byte-size of the module, giving a much tighter consensus estimate of the true cost of instantiating the module.
So this is what we're going to do (and aim to get into protocol 21): add such numeric fields to the ledger entry that stores the contract code, fill them in on contract upload, and reuse them in future instantiations. This should immediately improve the number of transactions we can admit to a given ledger, as well as lowering the fee for each, while not actually doing any less work. Just modeling the work we do more precisely.
Along the way, we'll be splitting the cost model in two -- essentially duplicating all instantiation-related cost types -- to separately charge for "parsing and validating" and "the rest of instantiation", because these are fairly distinct activities in wasmi and there are opportunities for separate improvement to each. If we make an improvement to either, we want to be able to immediately reflect the improvement in an improved cost model, so we wind up with quite a few new cost types at the end of this refinement.
Once we have a refined model, we can also start reducing the work done by some of the terms in the refined model, and they can be accurately reflected in the model due to its refined structure.
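To illustrate the shape of the refinement, here's a toy sketch comparing the byte-length worst-case model with a counts-based refined model. All coefficients, field names, and the 10 KiB scenario are made up for illustration; the real models are the calibrated cost types in Soroban's settings:

```rust
// Hypothetical worst-case model: priced as if nearly every byte of input
// could define a new Wasm function, since nothing has been parsed yet.
fn worst_case_cost(wasm_len: u64) -> u64 {
    const CONST_TERM: u64 = 1_000;
    const PER_BYTE: u64 = 50; // made-up coefficient
    CONST_TERM + PER_BYTE * wasm_len
}

/// Counts recorded in the ledger entry at contract-upload time.
struct ModuleCounts {
    functions: u64,
    imports: u64,
    exports: u64,
    instructions: u64,
}

// Hypothetical refined model: a separately calibrated coefficient per
// kind of module part, applied to the recorded counts.
fn refined_cost(c: &ModuleCounts) -> u64 {
    const CONST_TERM: u64 = 1_000;
    CONST_TERM + 400 * c.functions + 300 * c.imports + 200 * c.exports + 4 * c.instructions
}

fn main() {
    // A 10 KiB module with plausible counts prices far below the
    // byte-length worst case, without doing any less real work.
    let counts = ModuleCounts { functions: 20, imports: 5, exports: 3, instructions: 3_000 };
    assert!(refined_cost(&counts) < worst_case_cost(10 * 1024));
}
```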
Doing less overall work on VM instantiation
There are a few possible ways to just do less overall work while instantiating a module. We're going to try to do all three eventually, but due to scheduling it's likely these will happen in a few stages.
Only linking used host functions
This is the simplest change and it only takes a few lines of code once we have a refined cost model: it turns out wasmi's "linker" is fairly expensive to add host functions to, and we currently add all host functions to every linker, whereas if we are selective about linking only those host functions that get used by a module we can reduce this cost.
Again, this is only an improvement users can see if the cost model is refined first to separately account for imports as an input, but once the refinement is done this improvement is trivial to include (again aiming for protocol 21).
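A sketch of the idea, with a stand-in `Linker` type and made-up host-function names (the real change is in how Soroban populates wasmi's linker, where per-function registration is the expensive step):

```rust
use std::collections::HashSet;

// Stand-in for wasmi's linker: `define` represents the costly
// per-host-function registration step.
struct Linker {
    defined: Vec<&'static str>,
}

impl Linker {
    fn new() -> Self {
        Linker { defined: Vec::new() }
    }
    fn define(&mut self, name: &'static str) {
        self.defined.push(name);
    }
}

// Illustrative host-function names, not the actual host interface.
const ALL_HOST_FNS: &[&str] = &["map_new", "vec_push", "obj_cmp", "call", "log"];

/// Register only the host functions the module actually imports,
/// instead of registering all of them for every instantiation.
fn link_selectively(module_imports: &HashSet<&str>) -> Linker {
    let mut linker = Linker::new();
    for &f in ALL_HOST_FNS {
        if module_imports.contains(f) {
            linker.define(f);
        }
    }
    linker
}

fn main() {
    let imports: HashSet<&str> = ["vec_push", "log"].into_iter().collect();
    let linker = link_selectively(&imports);
    // Only 2 of the 5 host functions were registered.
    assert_eq!(linker.defined, vec!["vec_push", "log"]);
}
```

This is only chargeable accurately once the refined model counts imports as a separate input, which is why the refinement has to land first.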
Caching translated Wasm modules
The simplest thing to do is just cache modules (in memory) that we've already parsed and validated, so we don't parse and validate them a second time. There are 3 levels we can do this at, each of which offers greater cost savings:
1. Within a single transaction, so repeated sub-invocations of the same contract don't re-parse its module.
2. Across the transactions in a single transaction set, sharing the cache among them.
3. Across ledgers, keeping a long-lived cache of parsed modules.
So far we're aiming to do #1 in the near term -- along with the cost model refinement and linking improvement, aiming for protocol 21 -- because it's fairly easy to do, and the fee implications are straightforward. We also expect that #2 will bring large benefit for an acceptable level of additional fee and limit complexity, so we're aiming to at least explore doing it too in the slightly-longer term (perhaps protocol 22 or 23?), though it'll take a bit more time to work through the details and adapt the transaction queue to them. It's less clear that there'll be a big additional win from pursuing #3, but it's possible even further down the line.
There is one surprising wrinkle in fees even for case #1 brought about by constraints in the VM implementation, which is that we will wind up charging for parsing and validating all contracts in the footprint of the transaction immediately on transaction initialization, rather than slightly later, when the actual calls and VM instantiations occur. There are a few theoretical types of transaction that might wind up paying a higher fee as a result of this -- for example if you made a transaction that in simulation made multiple contract calls but in actual execution decided against making one of those calls -- but in practice we believe this is a fairly contrived case and the vast majority of transactions will see lower fees.
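The in-memory caching idea can be sketched as a cache keyed by contract-code hash; the `Module` type and cache shape here are stand-ins for illustration, not Soroban's actual data structures:

```rust
use std::collections::HashMap;
use std::rc::Rc;

struct Module {} // stand-in for a parsed-and-validated wasmi module

#[derive(Default)]
struct ModuleCache {
    by_hash: HashMap<[u8; 32], Rc<Module>>,
    parses: usize, // counts real parses, for illustration
}

impl ModuleCache {
    /// Return a cached module, or parse-and-validate on first sight.
    /// On a hit there is no re-parse, hence no second parsing charge.
    fn get_or_parse(&mut self, hash: [u8; 32], wasm: &[u8]) -> Rc<Module> {
        if let Some(m) = self.by_hash.get(&hash) {
            return Rc::clone(m);
        }
        self.parses += 1;
        let m = Rc::new(parse_and_validate(wasm));
        self.by_hash.insert(hash, Rc::clone(&m));
        m
    }
}

fn parse_and_validate(_wasm: &[u8]) -> Module {
    Module {} // stand-in for the expensive wasmi translation
}

fn main() {
    let mut cache = ModuleCache::default();
    let hash = [7u8; 32];
    cache.get_or_parse(hash, b"\0asm");
    cache.get_or_parse(hash, b"\0asm"); // second call hits the cache
    assert_eq!(cache.parses, 1);
}
```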
Lazily translating Wasm functions
Currently when we parse, validate and instantiate a module, wasmi translates all of the instructions in the module into an internal representation it uses for execution. It does this regardless of which functions the user is intending to call in a given transaction. If a transaction only calls 1, 2, or even a dozen functions in a contract that has 100 functions, that's quite a lot of wasted work.
Wasmi 0.32 (currently in beta) introduces an all-new register VM, which will run quite a bit faster in general, but it also supports a new "lazy translation" mode. In this mode the structure of the Wasm module is parsed but the translation of individual instructions is deferred until each executes, at which point the cost for translation is charged using the internal wasmi gas metering system.
Adopting this approach will complicate the inter-transaction caching story further, and require careful coordination with wasmi's internal gas metering to ensure a reasonable level of cost-model accuracy, but we anticipate that if we can make it work it will be a significant win on almost all transactions. We're unsure of the timeline right now: it probably won't make protocol 21, but it will likely move near the front of the queue shortly thereafter.
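Conceptually, lazy translation looks something like the sketch below: function bodies are translated on first call, and the translation cost is charged through the gas meter at that point. This mimics the idea only; the types, the per-byte rate, and the charging hook are invented here, not wasmi 0.32's actual API:

```rust
use std::collections::HashMap;

// Conceptual lazy-translation module: bodies stay untranslated until
// first call, when translation happens and its cost is metered as gas.
struct LazyModule {
    raw_bodies: HashMap<String, Vec<u8>>, // untranslated Wasm bodies
    translated: HashMap<String, Vec<u8>>, // stand-in for translated IR
    gas_charged: u64,
}

impl LazyModule {
    fn call(&mut self, name: &str) -> Result<(), String> {
        if !self.translated.contains_key(name) {
            let raw = self
                .raw_bodies
                .get(name)
                .ok_or_else(|| format!("no such function: {name}"))?;
            // Charge translation via the gas meter on first call
            // (3 gas per body byte is a made-up rate).
            self.gas_charged += 3 * raw.len() as u64;
            self.translated.insert(name.to_string(), raw.clone());
        }
        // ...execute the translated body here
        Ok(())
    }
}

fn main() {
    let mut m = LazyModule {
        raw_bodies: HashMap::from([
            ("hot".to_string(), vec![0; 10]),
            ("cold".to_string(), vec![0; 1_000]),
        ]),
        translated: HashMap::new(),
        gas_charged: 0,
    };
    m.call("hot").unwrap();
    m.call("hot").unwrap(); // already translated: no further translation gas
    // Only the called function was translated; "cold"'s 1000 bytes cost nothing.
    assert_eq!(m.gas_charged, 30);
}
```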
Parallelizing execution
I expect it's a surprise that parallelization is this far down the list, but it's a much larger undertaking and there are a bunch of easier and more immediate wins to pursue first.
That said, Soroban was designed to support parallelization, and the approach we've been anticipating and are still currently intending to pursue should be relatively straightforward.
The key to our current intended approach is the static footprints accompanying each transaction. This will enable the transaction queueing logic to decide statically (before executing each transaction) on a schedule for a given transaction set that partitions groups of transactions from one another, such that each partition executes in complete isolation, touching unrelated groups of ledger entries. Since the partitions are independent we can be assured that they can execute in parallel with no runtime concurrency control mechanism.
Such an approach has the additional advantage that we can bound the total execution time of any given partition, and so ensure the schedule nominated by the transaction queue fits within the normal (5 second) ledger-close latency target, at least on any validator with as many true cores as the parallel-execution model used to form the schedule.
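The footprint-based partitioning described above can be sketched with a plain union-find over declared ledger keys; this ignores cost balancing and everything else a real scheduler must handle, and the key/type names are illustrative only:

```rust
use std::collections::HashMap;

// Plain union-find (disjoint-set) with path compression.
struct Dsu {
    parent: Vec<usize>,
}

impl Dsu {
    fn new(n: usize) -> Self {
        Dsu { parent: (0..n).collect() }
    }
    fn find(&mut self, x: usize) -> usize {
        if self.parent[x] != x {
            let root = self.find(self.parent[x]);
            self.parent[x] = root;
        }
        self.parent[x]
    }
    fn union(&mut self, a: usize, b: usize) {
        let (ra, rb) = (self.find(a), self.find(b));
        if ra != rb {
            self.parent[ra] = rb;
        }
    }
}

/// Each tx's footprint is the set of ledger keys it may touch. Txs whose
/// footprints share any key are unioned; the resulting partitions touch
/// disjoint key sets, so they can run in parallel with no runtime locking.
fn partition(footprints: &[Vec<&str>]) -> Vec<Vec<usize>> {
    let mut dsu = Dsu::new(footprints.len());
    let mut owner: HashMap<&str, usize> = HashMap::new();
    for (tx, keys) in footprints.iter().enumerate() {
        for &k in keys {
            match owner.get(k) {
                Some(&other) => dsu.union(tx, other),
                None => {
                    owner.insert(k, tx);
                }
            }
        }
    }
    let mut groups: HashMap<usize, Vec<usize>> = HashMap::new();
    for tx in 0..footprints.len() {
        let root = dsu.find(tx);
        groups.entry(root).or_default().push(tx);
    }
    groups.into_values().collect()
}

fn main() {
    // Txs 0 and 2 overlap on key "B"; tx 1 is independent:
    // two partitions, so two threads could run them.
    let parts = partition(&[vec!["A", "B"], vec!["C"], vec!["B", "D"]]);
    assert_eq!(parts.len(), 2);
}
```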
The exact partitioning function remains to be decided and will require a careful balance between simplicity, speed, incremental execution and integration with fee calculations, but we're currently looking at developing a variant of the Strife algorithm to meet these needs.
It is also possible that, either initially or in subsequent iterations of the parallel execution work, we may adopt a structure that retains the deterministic static scheduling property but adds some runtime concurrency control mechanism, such as in-memory multi-versioning of ledger entries. Doing so would enable potentially greater degrees of parallelism at the cost of higher coordination overheads and greater implementation complexity (and risk of bugs). A leading candidate in this case would be something similar to the BOHM algorithm, but that would be significantly more involved and it's not clear at this point that it's worth biting off that complexity up front rather than leaving it to future iterations. It's also not clear that it's possible to bound the worst-case execution of a given schedule under such an algorithm, given the dynamic concurrency control. This question requires more study if we decide to get into it.