Soroban performance notes & CAPs #1460
Replies: 5 comments 16 replies
-
re: https://github.com/stellar/stellar-protocol/blob/master/core/cap-0054.md#xdr-changes -- I think we need a placeholder to define the default values for the new settings?
-
Regarding fees for inter-transaction caching, one option could be that transactions declare higher fees, but we divide the charge among all transactions in the set and refund the excess fees.
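To make the split-and-refund idea concrete, here's a toy sketch; the function name, fee units, and rounding rule are all hypothetical, not Soroban's actual fee logic:

```rust
// Toy sketch of "declare high, split, refund": each tx declares a fee bid
// large enough to cover the shared (caching) cost on its own; the actual
// charge divides the real cost across all txs in the set and refunds the
// rest. All numbers and names here are made up for illustration.
fn split_and_refund(declared_fees: &[u64], shared_cost: u64) -> Vec<(u64, u64)> {
    let n = declared_fees.len() as u64;
    // Round up so the set as a whole always covers the shared cost.
    let per_tx_charge = (shared_cost + n - 1) / n;
    declared_fees
        .iter()
        .map(|&declared| {
            let charged = per_tx_charge.min(declared); // never exceed the bid
            (charged, declared - charged) // (charged, refunded)
        })
        .collect()
}

fn main() {
    // Three txs each declared 100 for a shared cost of 90:
    // each is charged ceil(90 / 3) = 30 and refunded 70.
    let result = split_and_refund(&[100, 100, 100], 90);
    assert_eq!(result, vec![(30, 70), (30, 70), (30, 70)]);
}
```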
-
I don't think this statement regarding the current (worst-case) cost model is accurate. It's a relatively minor detail which doesn't affect the approaches in this proposal or take away from its message; I just wanted to point it out:
We are defining the worst case (during calibration) by passing in
-
How about storing the "refined" cost model in a separate ledger entry, say
-
Soroban performance notes & CAPs
This note explains a bit of where we are and where we're likely to be going in the near term with Soroban's performance, and points to a few CAPs aiming at near-term improvements.
Summary
There is a lot of low-hanging fruit, and it's not in the most obvious places you might expect. The opportunities are things we knew about all through the development of Soroban and simply deferred in the schedule in order to get the product out the door, and we're going to be working through these items over the next several protocol releases.
I'll discuss each of these issues below, but briefly the list is:
- improving VM instantiation: tightening its cost model, then doing less instantiation work (linking only used host functions, caching translated modules, and lazily translating functions)
- parallelizing execution
Improving VM instantiation
If you look at a real-time profile of Soroban's current execution timing you will see something like this:
What this shows is that the construction of a Wasm VM (Vm::new) is regularly taking more time -- often as much as 4 or 5x more time -- than the time we spend actually running code on that VM (Vm::invoke_function_raw). And a lot of that time is actually parsing modules, which we currently redo for every invocation (even multiple sub-invocations within a single transaction: we re-parse each Wasm module every time it's called).
While this seems bad, it's not even the worst part. The worst part is that we don't charge real-time costs when calculating which transactions to admit. This is because we have to charge an identical cost on each node, to remain in strict consensus about the success or failure of any given transaction and the costs incurred, so we use a cost model for the work being done: an over-approximation based on worst-case estimates of the work we're about to do. And the cost model for VM instantiation is especially coarse in its worst-case estimate. If you look at the ratio of cost models, the picture is even more grim.
What this shows is the cost model is charging even more -- often as much as 6 or 7x as much -- for VM instantiation as all the rest of the work it charges for. And cost models control how many transactions we admit and how much we charge in fees. So addressing the over-estimate of the VM instantiation cost model is actually job #1.
Tightening the cost model of VM instantiation
The current cost model for VM instantiation is a linear function of the number of bytes of Wasm bytecode provided as input to the VM instantiation process (which includes parsing, validating and instantiating the Wasm module). This cost model operates in ignorance of the actual content of the Wasm module -- since it hasn't parsed it yet -- so it assumes the worst: that every byte in the input defines a new Wasm function. This is actually a bit worse than the worst case; in practice Wasm needs a few bytes to define a new function, but it's in the ballpark of the worst case.
However, once we've parsed the module once, we know how many functions (and tables, imports, exports, instructions, etc.) the module defines. We can save that information to the ledger (as a set of numbers) alongside the bytes that make up the module, so that the next time we parse the module we can run a "refined" cost model that takes those numbers as input rather than the byte-size of the module, giving a much tighter consensus estimate of the true cost of instantiating the module.
So this is what we're going to do (and aim to get into protocol 21): add such numeric fields to the ledger entry that stores the contract code, fill them in on contract upload, and reuse them in future instantiations. This should immediately improve the number of transactions we can admit to a given ledger, as well as lowering the fee for each, while not actually doing any less work. Just modeling the work we do more precisely.
Along the way, we'll be splitting the cost model in two -- essentially duplicating all instantiation-related cost types -- to separately charge for "parsing and validating" and "the rest of instantiation", because these are fairly distinct activities in wasmi and there are opportunities for separate improvement to each. If we make an improvement to either, we want to be able to immediately reflect the improvement in an improved cost model, so we wind up with quite a few new cost types at the end of this refinement.
Once we have a refined model, we can also start reducing the work done by some of the terms in the refined model, and they can be accurately reflected in the model due to its refined structure.
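To illustrate the shape of the refinement, here's a toy sketch comparing the byte-length worst-case model with a counts-based refined model. All coefficients, field names, and the 10 KiB scenario are made up for illustration; the real models are the calibrated cost types in Soroban's settings:

```rust
// Hypothetical worst-case model: priced as if nearly every byte of input
// could define a new Wasm function, since nothing has been parsed yet.
fn worst_case_cost(wasm_len: u64) -> u64 {
    const CONST_TERM: u64 = 1_000;
    const PER_BYTE: u64 = 50; // made-up coefficient
    CONST_TERM + PER_BYTE * wasm_len
}

/// Counts recorded in the ledger entry at contract-upload time.
struct ModuleCounts {
    functions: u64,
    imports: u64,
    exports: u64,
    instructions: u64,
}

// Hypothetical refined model: a separately calibrated coefficient per
// kind of module part, applied to the recorded counts.
fn refined_cost(c: &ModuleCounts) -> u64 {
    const CONST_TERM: u64 = 1_000;
    CONST_TERM + 400 * c.functions + 300 * c.imports + 200 * c.exports + 4 * c.instructions
}

fn main() {
    // A 10 KiB module with plausible counts prices far below the
    // byte-length worst case, without doing any less real work.
    let counts = ModuleCounts { functions: 20, imports: 5, exports: 3, instructions: 3_000 };
    assert!(refined_cost(&counts) < worst_case_cost(10 * 1024));
}
```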
Doing less overall work on VM instantiation
There are a few possible ways to just do less overall work while instantiating a module. We're going to try to do all three eventually, but due to scheduling it's likely these will happen in a few stages.
Only linking used host functions
This is the simplest change and it only takes a few lines of code once we have a refined cost model: it turns out wasmi's "linker" is fairly expensive to add host functions to, and we currently add all host functions to every linker, whereas if we are selective about linking only those host functions that get used by a module we can reduce this cost.
Again, this is only an improvement users can see if the cost model is refined first to separately account for imports as an input, but once the refinement is done this improvement is trivial to include (again aiming for protocol 21).
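A sketch of the idea, with a stand-in `Linker` type and made-up host-function names (the real change is in how Soroban populates wasmi's linker, where per-function registration is the expensive step):

```rust
use std::collections::HashSet;

// Stand-in for wasmi's linker: `define` represents the costly
// per-host-function registration step.
struct Linker {
    defined: Vec<&'static str>,
}

impl Linker {
    fn new() -> Self {
        Linker { defined: Vec::new() }
    }
    fn define(&mut self, name: &'static str) {
        self.defined.push(name);
    }
}

// Illustrative host-function names, not the actual host interface.
const ALL_HOST_FNS: &[&str] = &["map_new", "vec_push", "obj_cmp", "call", "log"];

/// Register only the host functions the module actually imports,
/// instead of registering all of them for every instantiation.
fn link_selectively(module_imports: &HashSet<&str>) -> Linker {
    let mut linker = Linker::new();
    for &f in ALL_HOST_FNS {
        if module_imports.contains(f) {
            linker.define(f);
        }
    }
    linker
}

fn main() {
    let imports: HashSet<&str> = ["vec_push", "log"].into_iter().collect();
    let linker = link_selectively(&imports);
    // Only 2 of the 5 host functions were registered.
    assert_eq!(linker.defined, vec!["vec_push", "log"]);
}
```

This is only chargeable accurately once the refined model counts imports as a separate input, which is why the refinement has to land first.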
Caching translated Wasm modules
The simplest thing to do is just cache modules (in memory) that we've already parsed and validated, so we don't parse and validate them a second time. There are 3 levels we can do this at, each of which offers greater cost savings:
1. Within a single transaction, so repeated sub-invocations of the same contract don't re-parse its module.
2. Across the transactions in a single transaction set, sharing the cache among them.
3. Across ledgers, keeping a long-lived cache of parsed modules.
So far we're aiming to do #1 in the near term -- along with the cost model refinement and linking improvement, aiming for protocol 21 -- because it's fairly easy to do, and the fee implications are straightforward. We also expect that #2 will bring large benefit for an acceptable level of additional fee and limit complexity, so we're aiming to at least explore doing it too in the slightly-longer term (perhaps protocol 22 or 23?), though it'll take a bit more time to work through the details and adapt the transaction queue to them. It's less clear that there'll be a big additional win from pursuing #3, but it's possible even further down the line.
There is one surprising wrinkle in fees even for case #1 brought about by constraints in the VM implementation, which is that we will wind up charging for parsing and validating all contracts in the footprint of the transaction immediately on transaction initialization, rather than slightly later, when the actual calls and VM instantiations occur. There are a few theoretical types of transaction that might wind up paying a higher fee as a result of this -- for example if you made a transaction that in simulation made multiple contract calls but in actual execution decided against making one of those calls -- but in practice we believe this is a fairly contrived case and the vast majority of transactions will see lower fees.
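The in-memory caching idea can be sketched as a cache keyed by contract-code hash; the `Module` type and cache shape here are stand-ins for illustration, not Soroban's actual data structures:

```rust
use std::collections::HashMap;
use std::rc::Rc;

struct Module {} // stand-in for a parsed-and-validated wasmi module

#[derive(Default)]
struct ModuleCache {
    by_hash: HashMap<[u8; 32], Rc<Module>>,
    parses: usize, // counts real parses, for illustration
}

impl ModuleCache {
    /// Return a cached module, or parse-and-validate on first sight.
    /// On a hit there is no re-parse, hence no second parsing charge.
    fn get_or_parse(&mut self, hash: [u8; 32], wasm: &[u8]) -> Rc<Module> {
        if let Some(m) = self.by_hash.get(&hash) {
            return Rc::clone(m);
        }
        self.parses += 1;
        let m = Rc::new(parse_and_validate(wasm));
        self.by_hash.insert(hash, Rc::clone(&m));
        m
    }
}

fn parse_and_validate(_wasm: &[u8]) -> Module {
    Module {} // stand-in for the expensive wasmi translation
}

fn main() {
    let mut cache = ModuleCache::default();
    let hash = [7u8; 32];
    cache.get_or_parse(hash, b"\0asm");
    cache.get_or_parse(hash, b"\0asm"); // second call hits the cache
    assert_eq!(cache.parses, 1);
}
```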
Lazily translating Wasm functions
Currently when we parse, validate and instantiate a module, wasmi translates all of the instructions in the module into an internal representation it uses for execution. It does this regardless of which functions the user is intending to call in a given transaction. If a transaction only calls 1, 2, or even a dozen functions in a contract that has 100 functions, that's quite a lot of wasted work.
Wasmi 0.32 (currently in beta) introduces an all-new register VM, which will run quite a bit faster in general, but it also supports a new "lazy translation" mode. In this mode the structure of the Wasm module is parsed but the translation of individual instructions is deferred until each executes, at which point the cost for translation is charged using the internal wasmi gas metering system.
Adopting this approach will complicate the inter-transaction caching story further, and require careful coordination with wasmi's internal gas metering to ensure a reasonable level of cost-model accuracy, but we anticipate that if we can make it work it will be a significant win on almost all transactions. We're unsure of the timeline right now: it probably won't make protocol 21, but it will likely move near the front of the queue shortly thereafter.
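Conceptually, lazy translation looks something like the sketch below: function bodies are translated on first call, and the translation cost is charged through the gas meter at that point. This mimics the idea only; the types, the per-byte rate, and the charging hook are invented here, not wasmi 0.32's actual API:

```rust
use std::collections::HashMap;

// Conceptual lazy-translation module: bodies stay untranslated until
// first call, when translation happens and its cost is metered as gas.
struct LazyModule {
    raw_bodies: HashMap<String, Vec<u8>>, // untranslated Wasm bodies
    translated: HashMap<String, Vec<u8>>, // stand-in for translated IR
    gas_charged: u64,
}

impl LazyModule {
    fn call(&mut self, name: &str) -> Result<(), String> {
        if !self.translated.contains_key(name) {
            let raw = self
                .raw_bodies
                .get(name)
                .ok_or_else(|| format!("no such function: {name}"))?;
            // Charge translation via the gas meter on first call
            // (3 gas per body byte is a made-up rate).
            self.gas_charged += 3 * raw.len() as u64;
            self.translated.insert(name.to_string(), raw.clone());
        }
        // ...execute the translated body here
        Ok(())
    }
}

fn main() {
    let mut m = LazyModule {
        raw_bodies: HashMap::from([
            ("hot".to_string(), vec![0; 10]),
            ("cold".to_string(), vec![0; 1_000]),
        ]),
        translated: HashMap::new(),
        gas_charged: 0,
    };
    m.call("hot").unwrap();
    m.call("hot").unwrap(); // already translated: no further translation gas
    // Only the called function was translated; "cold"'s 1000 bytes cost nothing.
    assert_eq!(m.gas_charged, 30);
}
```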
Parallelizing execution
I expect it's a surprise that parallelization is this far down the list, but it's a much larger undertaking and there are a bunch of easier and more immediate wins to pursue first.
That said, Soroban was designed to support parallelization, and the approach we've been anticipating and are still currently intending to pursue should be relatively straightforward.
The key to our current intended approach is the static footprints accompanying each transaction. This will enable the transaction queueing logic to decide statically (before executing each transaction) on a schedule for a given transaction set that partitions groups of transactions from one another, such that each partition executes in complete isolation, touching unrelated groups of ledger entries. Since the partitions are independent we can be assured that they can execute in parallel with no runtime concurrency control mechanism.
Such an approach has the additional advantage that we can bound the total execution time of any given partition, and so ensure the schedule nominated by the transaction queue fits within the normal (5 second) ledger-close latency target, at least on any validator with as many true cores as the parallel-execution model used to form the schedule.
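The footprint-based partitioning described above can be sketched with a plain union-find over declared ledger keys; this ignores cost balancing and everything else a real scheduler must handle, and the key/type names are illustrative only:

```rust
use std::collections::HashMap;

// Plain union-find (disjoint-set) with path compression.
struct Dsu {
    parent: Vec<usize>,
}

impl Dsu {
    fn new(n: usize) -> Self {
        Dsu { parent: (0..n).collect() }
    }
    fn find(&mut self, x: usize) -> usize {
        if self.parent[x] != x {
            let root = self.find(self.parent[x]);
            self.parent[x] = root;
        }
        self.parent[x]
    }
    fn union(&mut self, a: usize, b: usize) {
        let (ra, rb) = (self.find(a), self.find(b));
        if ra != rb {
            self.parent[ra] = rb;
        }
    }
}

/// Each tx's footprint is the set of ledger keys it may touch. Txs whose
/// footprints share any key are unioned; the resulting partitions touch
/// disjoint key sets, so they can run in parallel with no runtime locking.
fn partition(footprints: &[Vec<&str>]) -> Vec<Vec<usize>> {
    let mut dsu = Dsu::new(footprints.len());
    let mut owner: HashMap<&str, usize> = HashMap::new();
    for (tx, keys) in footprints.iter().enumerate() {
        for &k in keys {
            match owner.get(k) {
                Some(&other) => dsu.union(tx, other),
                None => {
                    owner.insert(k, tx);
                }
            }
        }
    }
    let mut groups: HashMap<usize, Vec<usize>> = HashMap::new();
    for tx in 0..footprints.len() {
        let root = dsu.find(tx);
        groups.entry(root).or_default().push(tx);
    }
    groups.into_values().collect()
}

fn main() {
    // Txs 0 and 2 overlap on key "B"; tx 1 is independent:
    // two partitions, so two threads could run them.
    let parts = partition(&[vec!["A", "B"], vec!["C"], vec!["B", "D"]]);
    assert_eq!(parts.len(), 2);
}
```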
The exact partitioning function remains to be decided and will require a careful balance between simplicity, speed, incremental execution and integration with fee calculations, but we're currently looking at developing a variant of the Strife algorithm to meet these needs.
It is also possible that, either initially or in subsequent iterations of the parallel execution work, we may adopt a structure that retains the deterministic static scheduling property but adds some runtime concurrency control mechanism, such as in-memory multi-versioning of ledger entries. Doing so would enable potentially greater degrees of parallelism at the cost of higher coordination overheads and greater implementation complexity (and risk of bugs). A leading candidate in this case would be something similar to the BOHM algorithm, but that would be significantly more involved and it's not clear at this point that it's worth biting off that complexity up front rather than leaving it to future iterations. It's also not clear that it's possible to bound the worst-case execution of a given schedule under such an algorithm, given the dynamic concurrency control. This question requires more study if we decide to get into it.